#scan suddenly returns an empty array
I am creating a scraper for articles from www.dev.to that should read the title, author and body of each article. I am using #scan to strip the whitespace and other characters that follow the author name. At first I assumed the author name would always consist of a first name and a last name, then realized some authors only list one name. Since I changed the regex accordingly, the method has stopped working and #scan returns an empty array. How can I fix this?
def scrape_post(path)
  url = "https://dev.to/#{path}"
  html_content = open(url).read
  doc = Nokogiri::HTML(html_content)
  doc.search('.article-wrapper').each do |element|
    title = element.search('.crayons-article__header__meta').search('h1').text.strip
    author_raw = element.search('.crayons-article__subheader').text.strip
    author = author_raw.scan(/\A\w+(\s|\w)\w+/).first
    body = doc.at_css('div#article-body').text.strip
    @post = Post.new(id: @next_id, path: path, title: title, author: author, body: body, read: false)
  end
  @post
end
Example of input data:
path = rahxuls/preventing-copying-text-in-a-webpage-4acg
Expected output:
title = "Preventing copying text in a webpage 😁"
author_raw = "Rahul\n \n\n \n Nov 6\n\n\n ・2 min read"
author = "Rahul"
Solution 1:[1]
From the scan docs:
If the pattern contains no groups, each individual result consists of the matched string, $&. If the pattern contains groups, each individual result is itself an array containing one entry per group.
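As a small illustration of the rule quoted above, here is a sketch using a made-up input string ("John Smith" is only an example, not data from the question). It shows how the shape of the result changes as soon as a group appears in the pattern.

```ruby
# Illustrative input only; not taken from the question.
text = "John Smith"

# No groups: each result is the whole matched string.
no_groups = text.scan(/\w+/)
p no_groups  # ["John", "Smith"]

# One capturing group: each result is an array holding that group's capture.
one_group = text.scan(/(\w)\w+/)
p one_group  # [["J"], ["S"]]

# Two capturing groups: each result array has one entry per group.
two_groups = text.scan(/(\w)(\w+)/)
p two_groups # [["J", "ohn"], ["S", "mith"]]
```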
By adding the parentheses to the middle of your regex, you created a capturing group, so scan now returns what that group captures instead of the whole match. In the example you gave, that is "u": the first \w+ backtracks to "Rah" so that the group can match "u" and the final \w+ can match "l".
"Rahul\n \n\n \n Nov 6\n\n\n ・2 min read".scan(/\A\w+(\s|\w)\w+/) #=> [["u"]]
The group can be marked as non-capturing with ?: so that scan returns the whole match again, as in your old implementation:
"Rahul\n \n\n \n Nov 6\n\n\n ・2 min read".scan(/\A\w+(?:\s|\w)\w+/) #=> ["Rahul"]
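If the goal is to accept author names of one, two, or more words, a non-capturing group with a repetition may be simpler. The following is only a sketch: extract_author is a hypothetical helper name, and it assumes that the words of a name are separated by single spaces and that the name is followed by a newline, as in the example input.

```ruby
# Hypothetical helper, not part of the original scraper.
def extract_author(author_raw)
  # \A anchors at the start of the string; (?: \w+)* accepts any number of
  # further space-separated words without creating a capture group.
  # String#[] with a regex returns the matched text, or nil if nothing matches.
  author_raw[/\A\w+(?: \w+)*/]
end

p extract_author("Rahul\n \n\n \n Nov 6\n\n\n ・2 min read") # "Rahul"
p extract_author("Jane Doe\n \n Nov 6")                      # "Jane Doe"
```

Note that \w only matches ASCII letters, digits and underscores in Ruby, so names containing accented characters or hyphens would be cut short by this pattern.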
Or you can wrap the part you actually want to extract in a named capture group and read it with match:
"Rahul\n \n\n \n Nov 6\n\n\n ・2 min read".match(/\A(?<name>\w+(\s|\w)\w+)/)[:name] #=> "Rahul"
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | Siim Liiser |