Extract news article content from stored .html pages

I am reading text from html files and doing some analysis. These .html files are news articles.

Code:

 import nltk
 from unidecode import unidecode

 html = open(filepath, 'r').read()
 raw = nltk.clean_html(html)  # strip HTML tags (note: removed in NLTK 3.x)
 raw = unidecode(raw)         # transliterate Unicode characters to ASCII

Now I just want the article content and not the rest of the text like advertisements, headings etc. How can I do so relatively accurately in python?

I know some tools like Jsoup (a Java API) and boilerpipe, but I want to do this in Python. I could find some techniques using bs4, but they're limited to one type of page, and I have news pages from numerous sources. Also, there is a dearth of sample code.

I am looking for something exactly like this http://www.psl.cs.columbia.edu/wp-content/uploads/2011/03/3463-WWWJ.pdf in python.

EDIT: To better illustrate what I need, please write sample code to extract the content of the following link: http://www.nytimes.com/2015/05/19/health/study-finds-dense-breast-tissue-isnt-always-a-high-cancer-risk.html?src=me&ref=general



Solution 1:[1]

There are libraries for this in Python too :)

Since you mentioned Java, there's a Python wrapper for boilerpipe that allows you to directly use it inside a python script: https://github.com/misja/python-boilerpipe
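As a minimal sketch of the wrapper's documented usage (note it needs a Java runtime plus jpype under the hood; the URL here is a placeholder):

from boilerpipe.extract import Extractor

# ArticleExtractor is boilerpipe's extractor tuned for news article bodies
extractor = Extractor(extractor='ArticleExtractor',
                      url='http://www.example.com/some-article')
text = extractor.getText()  # extracted main content as plain text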

If you want to use pure Python libraries, there are two options:

https://github.com/buriy/python-readability

and

https://github.com/grangier/python-goose
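For the first option, python-readability, here is a minimal sketch based on the Document API from its README (the file name is an assumption, standing in for one of your saved pages):

from readability import Document

with open('article.html') as f:  # hypothetical stored .html news page
    html = f.read()

doc = Document(html)
print(doc.short_title())  # cleaned-up article title
print(doc.summary())      # main content as cleaned HTML; strip the
                          # remaining tags if you want plain text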

Of the two, I prefer Goose; however, be aware that recent versions of it sometimes fail to extract text for some reason (my recommendation is to use version 1.0.22 for now).

EDIT: here's some sample code using Goose:

from goose import Goose
from requests import get

# fetch the page, then hand the raw HTML to Goose for content extraction
response = get('http://www.nytimes.com/2015/05/19/health/study-finds-dense-breast-tissue-isnt-always-a-high-cancer-risk.html?src=me&ref=general')
extractor = Goose()
article = extractor.extract(raw_html=response.content)
text = article.cleaned_text  # the boilerplate-free article body

Solution 2:[2]

Newspaper is becoming increasingly popular. I've only used it superficially, but it looks good. It's Python 3 only.

The quickstart only shows loading from a URL, but you can load from an HTML string with:

import newspaper

# load the HTML into a string from the saved file
with open(filepath) as f:
    html = f.read()

article = newspaper.Article('')  # a string is required as the `url` argument, but it is not used
article.set_html(html)
article.parse()  # must be called before accessing the extracted fields
print(article.text)

Solution 3:[3]

Try something like this by visiting the page directly:

##Import modules
from bs4 import BeautifulSoup
import urllib2  # Python 2; on Python 3 use urllib.request instead


##Grab the page
url = "http://www.example.com"
req = urllib2.Request(url)
page = urllib2.urlopen(req)
content = page.read()
page.close()

##Prepare
soup = BeautifulSoup(content, "html.parser")

##Parse (a table, for example)
for table in soup.find_all("table", {"class": "myClass"}):
    # ...do something with each matching table...
    print(table.get_text())

If you want to load a file instead, just replace the part where you grab the page with code that reads the file. Find out more here: http://www.crummy.com/software/BeautifulSoup/bs4/doc/
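For instance, a minimal sketch of the file-based variant (the file name is an assumption):

from bs4 import BeautifulSoup

with open("article.html", "r") as f:  # hypothetical stored news page
    soup = BeautifulSoup(f.read(), "html.parser")

for table in soup.find_all("table", {"class": "myClass"}):
    print(table.get_text())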

Solution 4:[4]

There are many ways to organize HTML scraping in Python. As other answers note, the number-one tool is BeautifulSoup, but there are others.


There is no universal way of finding the content of an article. HTML5 has an article tag hinting at the main text, and it may be possible to tune scraping for pages from specific publishing systems, but there is no general way to accurately guess the text's location. (Theoretically, a machine could deduce page structure by looking at several articles that are structurally identical but differ in content, but that is probably out of scope here.)
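As a rough illustration of the article-tag heuristic (a sketch only, with inline sample HTML; many sites nest navigation and ads inside article too, so treat this as a fallback, not a general solution):

from bs4 import BeautifulSoup

html = ("<html><body><nav>Menu | Ads</nav>"
        "<article><h1>Headline</h1><p>Story text.</p></article></body></html>")

soup = BeautifulSoup(html, "html.parser")
node = soup.find("article")  # HTML5 semantic tag used by many modern news sites
text = node.get_text("\n", strip=True) if node else soup.get_text("\n", strip=True)
print(text)  # Headline / Story text.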

Also, the question "Web scraping with Python" may be relevant.

Pyquery example for NYT:

from pyquery import PyQuery as pq

url = 'http://www.nytimes.com/2015/05/19/health/study-finds-dense-breast-tissue-isnt-always-a-high-cancer-risk.html?src=me&ref=general'
d = pq(url=url)                    # pyquery fetches and parses the page itself
text = d('.story-content').text()  # '.story-content' is specific to nytimes.com markup

Solution 5:[5]

You can use htmllib or HTMLParser to parse your HTML file:

from HTMLParser import HTMLParser

# create a subclass and override the handler methods
class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print "Encountered a start tag:", tag
    def handle_endtag(self, tag):
        print "Encountered an end tag :", tag
    def handle_data(self, data):
        print "Encountered some data  :", data

# instantiate the parser and feed it some HTML
parser = MyHTMLParser()
parser.feed('<html><head><title>Test</title></head>'
            '<body><h1>Parse me!</h1></body></html>')

This sample code is taken from the HTMLParser documentation page.
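Building on that, here is a minimal sketch of collecting only the text nodes (this uses the Python 3 spelling of the same stdlib module; skipping script/style contents is my addition, and it is still far cruder than the dedicated extractors above):

from html.parser import HTMLParser

class TextCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0  # depth inside <script>/<style> elements

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        # keep only non-empty text found outside script/style
        if not self._skip and data.strip():
            self.parts.append(data.strip())

parser = TextCollector()
parser.feed("<html><body><script>var x = 1;</script>"
            "<p>Parse me!</p></body></html>")
print(" ".join(parser.parts))  # -> Parse me!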

Solution 6:[6]

I can highly recommend using Trafilatura. Super easy to implement and it's fast!

import trafilatura

url = 'https://www.example.com'  # fetch_url needs a full URL, including the scheme
downloaded = trafilatura.fetch_url(url)
article_content = trafilatura.extract(downloaded)  # boilerplate-free main text

Which gives:

'This domain is for use in illustrative examples in documents. You may use this\ndomain in literature without prior coordination or asking for permission.\nMore information...'

You can also give it the HTML directly, like this:

trafilatura_text = trafilatura.extract(html, include_comments=False)

If you're interested in more fields, such as the author or the publication date, you can use bare_extraction:

import trafilatura

url = 'https://www.example.com'
downloaded = trafilatura.fetch_url(url)
trafilatura.bare_extraction(downloaded, include_links=True)

Which will give you:

{'title': 'Example Domain',
 'author': None,
 'url': None,
 'hostname': None,
 'description': None,
 'sitename': None,
 'date': None,
 'categories': [],
 'tags': [],
 'fingerprint': None,
 'id': None,
 'license': None,
 'body': None,
 'comments': '',
 'commentsbody': None,
 'raw_text': None,
 'text': 'This domain is for use in illustrative examples in documents. You may use this\ndomain in literature without prior coordination or asking for permission.\nMore information...'}

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution    Source
Solution 1
Solution 2
Solution 3  datasci
Solution 4  Community
Solution 5
Solution 6  Muriel