HTMLParser.HTMLParser().unescape() doesn't work
I would like to convert HTML entities back to their human-readable form, e.g. '&pound;' to '£', '&deg;' to '°', etc.
I've read several posts regarding this question:
Converting html source content into readable format with Python 2.x
Decode HTML entities in Python string?
Convert XML/HTML Entities into Unicode String in Python
Based on them, I chose to use the undocumented method unescape(), but it doesn't work for me.
My code sample is like:
import HTMLParser
htmlParser = HTMLParser.HTMLParser()
decoded = htmlParser.unescape('&copy; 2013')
print decoded
When I run this Python script, the output is still:
&copy; 2013
instead of
© 2013
I'm using Python 2.x on Windows 7 in a Cygwin console. I googled and didn't find any similar problems. Could anyone help me with this?
Solution 1:[1]
Apparently HTMLParser.unescape was a bit more primitive before Python 2.6.
Python 2.5:
>>> import HTMLParser
>>> HTMLParser.HTMLParser().unescape('&copy;')
'&copy;'
Python 2.6/2.7:
>>> import HTMLParser
>>> HTMLParser.HTMLParser().unescape('&copy;')
u'\xa9'
UPDATE: Python 3.4+:
>>> import html
>>> html.unescape('&copy;')
'©'
See the 2.5 implementation vs the 2.6 implementation / 2.7 implementation
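If the same script has to run on both 2.6/2.7 and 3.4+, one option is to pick whichever unescape is available at import time. This is only a minimal sketch: the name decode_entities is made up for illustration, and Python 2.5 and 3.0-3.3 are not covered.

```python
try:
    # Python 3.4+: documented module-level function
    from html import unescape as _unescape
except ImportError:
    # Python 2.6/2.7: undocumented method on a parser instance
    from HTMLParser import HTMLParser
    _unescape = HTMLParser().unescape


def decode_entities(text):
    """Hypothetical helper: replace HTML character references in text."""
    return _unescape(text)


decoded = decode_entities('&copy; 2013')
# The copyright sign is U+00A9, so the result compares equal to u'\xa9 2013'
assert decoded == u'\xa9 2013'
```

Note that on Python 2 the result is a unicode object, so printing it to a console whose encoding cannot represent the character may still fail.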
Solution 2:[2]
Starting in Python 3.9, calling HTMLParser().unescape(<str>) raises AttributeError: 'HTMLParser' object has no attribute 'unescape'.
You can update it to:
import html
html.unescape(<str>)
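For reference, here is a quick check (Python 3.4+ only) that html.unescape handles named, decimal and hexadecimal references alike. The sample strings are my own, not from the question.

```python
import html

# Each key is an escaped form, each value the character it should decode to
samples = {
    '&pound;': '\u00a3',  # named reference
    '&#176;': '\u00b0',   # decimal numeric reference (degree sign)
    '&#xa9;': '\u00a9',   # hexadecimal numeric reference (copyright sign)
}
for escaped, expected in samples.items():
    assert html.unescape(escaped) == expected

# Round trip: html.escape covers &, <, > and quotes; unescape reverses it
original = '<a href="x">'
assert html.unescape(html.escape(original)) == original
```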
Solution 3:[3]
This site lists some solutions, here's one of them:
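from xml.sax.saxutils import escape, unescape
# Map each literal character to its HTML entity
html_escape_table = {
    '"': "&quot;",
    "'": "&apos;",
    "©": "&copy;",
    # etc...
}
# Invert the table: entity -> character
html_unescape_table = {v: k for k, v in html_escape_table.items()}
def html_unescape(text):
    # saxutils.unescape already handles &amp;, &lt; and &gt; on its own
    return unescape(text, html_unescape_table)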
Not the prettiest thing though, since you would have to list each escaped symbol manually.
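One way around listing every symbol by hand might be the standard library's own entity table. Below is a rough sketch for Python 3, where the table lives in html.entities (on Python 2 the module is called htmlentitydefs). The function name unescape_named and the regex are my own assumptions, and it only handles named entities, not numeric ones like &#169;.

```python
import re
from html.entities import name2codepoint  # maps 'copy' -> 169, 'pound' -> 163, ...

# Matches things shaped like a named entity, e.g. &copy; or &frac12;
_NAMED_ENTITY = re.compile(r'&([A-Za-z][A-Za-z0-9]*);')


def unescape_named(text):
    """Hypothetical helper: decode named HTML entities, leave the rest alone."""
    def replace(match):
        name = match.group(1)
        if name in name2codepoint:
            return chr(name2codepoint[name])
        # Unknown name: keep the original text unchanged
        return match.group(0)
    return _NAMED_ENTITY.sub(replace, text)


assert unescape_named('&copy; 2013') == '\u00a9 2013'
assert unescape_named('&bogus; stays') == '&bogus; stays'
```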
EDIT:
How about this?
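import htmllib  # Python 2 only; removed in Python 3
def unescape(s):
    p = htmllib.HTMLParser(None)
    p.save_bgn()  # start buffering character data
    p.feed(s)
    return p.save_end()  # return the buffered, decoded text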
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | |
Solution 2 | andorov |
Solution 3 | |