#scan suddenly returns an empty array

I am creating a scraper for articles from www.dev.to that should read the title, author and body of each article. I am using #scan to strip the whitespace and other characters that follow the author name. At first I assumed the author name would always consist of a first name and a last name, then realized that some authors list only one name. After I changed the regex accordingly, the method stopped working and #scan now returns an empty array. How can I fix this?

  def scrape_post(path)
    url = "https://dev.to/#{path}"
    html_content = URI.open(url).read # needs require 'open-uri'; Kernel#open no longer accepts URLs in Ruby 3.0+
    doc = Nokogiri::HTML(html_content)
    doc.search('.article-wrapper').each do |element|
      title = element.search('.crayons-article__header__meta').search('h1').text.strip
      author_raw = element.search('.crayons-article__subheader').text.strip
      author = author_raw.scan(/\A\w+(\s|\w)\w+/).first
      body = doc.at_css('div#article-body').text.strip
      @post = Post.new(id: @next_id, path: path, title: title, author: author, body: body, read: false)
    end
    @post
  end

Example of input data:

path = rahxuls/preventing-copying-text-in-a-webpage-4acg

Expected output:

title = "Preventing copying text in a webpage 😁"

author_raw = "Rahul\n              \n\n              \n                  Nov  6\n\n\n                ・2 min read"

author = "Rahul"


Solution 1:[1]

From the String#scan docs:

If the pattern contains no groups, each individual result consists of the matched string, $&. If the pattern contains groups, each individual result is itself an array containing one entry per group.
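The documented behaviour is easy to check in isolation. The snippet below is a minimal sketch on a made-up sample string (not taken from the scraper) that runs the same kind of pattern once without groups and once with groups:

```ruby
# Hypothetical sample string, chosen only for illustration.
sample = "one two"

# No groups: each result is the matched string itself.
no_groups = sample.scan(/\w+/)

# Two capturing groups: each result is an array with one entry per group.
with_groups = sample.scan(/(\w)(\w+)/)

p no_groups    # => ["one", "two"]
p with_groups  # => [["o", "ne"], ["t", "wo"]]
```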

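By adding the parentheses in the middle of your regex, you created a capturing group, so scan returns whatever that group captures instead of the whole match. In the example you gave, that is "u": the first \w+ backtracks to "Rah", the group takes "u", and the final \w+ takes "l".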

"Rahul\n \n\n \n Nov 6\n\n\n ・2 min read".scan(/\A\w+(\s|\w)\w+/) #=> [["u"]]

The group can be marked as non-capturing to get back the behaviour of your old implementation:

"Rahul\n \n\n \n Nov 6\n\n\n ・2 min read".scan(/\A\w+(?:\s|\w)\w+/) #=> ["Rahul"]
#                                                       ^

Or you can put a named capture group around what you actually want to extract:

"Rahul\n \n\n \n Nov 6\n\n\n ・2 min read".match(/\A(?<name>\w+(\s|\w)\w+)/)[:name] #=> "Rahul"
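Since the goal in the question is to accept either one or two names, a slightly different pattern may be easier to reason about. The following is only a sketch, not the approach from the answer above: it swaps scan for String#[] with a regex, makes the second word an optional non-capturing group, and assumes names are separated by a single space. The helper name extract_author and the two-word name "Jane Doe" are made up for illustration.

```ruby
# Hypothetical helper; the name extract_author is not part of the original scraper.
def extract_author(author_raw)
  # \A\w+        first word at the very start of the string
  # (?: \w+)?    optional second word after one space (non-capturing)
  # String#[] with a regex returns the whole match, or nil if nothing matches.
  author_raw[/\A\w+(?: \w+)?/]
end

one_word  = extract_author("Rahul\n              \n\n              \n                  Nov  6\n\n\n                ・2 min read")
two_words = extract_author("Jane Doe\n              \n\n                  Nov  6") # made-up two-word name

p one_word   # => "Rahul"
p two_words  # => "Jane Doe"
```

Note that \w only matches ASCII letters, digits and underscores by default, so names containing other characters would need a wider character class.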

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Siim Liiser