Why am I getting "UnicodeEncodeError: 'charmap' codec can't encode character '\u25b2' in position 84811: character maps to <undefined>" error?

I'm getting a "UnicodeEncodeError: 'charmap' codec can't encode character '\u200b' in position 756: character maps to <undefined>" error while running this code:

from bs4 import BeautifulSoup
import requests
r = requests.get('https://stackoverflow.com').text
soup = BeautifulSoup(r, 'lxml')
print(soup.prettify())

and the output is:

Traceback (most recent call last):
  File "c:\Users\Asus\Documents\Hello World\Web Scraping\st.py", line 5, in <module>
    print(soup.prettify())
  File "C:\Users\Asus\AppData\Local\Programs\Python\Python38\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u200b' in position 756: character maps to <undefined>

I'm using Python 3.8.1 and UTF-8 in VS Code. How do I solve this?



Solution 1:[1]

There are hints in the full error message... I will keep only what seems most important here:

Traceback ...
  File "...\cp1252.py", ...
UnicodeEncodeError: 'charmap' codec can't encode character '\u200b' ...

The error is caused by the print call. Somewhere in your text there is a ZERO WIDTH SPACE character (Unicode U+200B), and when you print to a Windows console, the string is internally encoded to the Windows console code page (cp1252 here). The ZERO WIDTH SPACE is not representable in that code page. By the way, the default console is not really Unicode-friendly on Windows.
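
A quick way to see this for yourself (a minimal sketch, assuming a default Windows console using cp1252) is to check the encoding print will use and then try to encode the offending character with it:

import sys

# On a default Windows console this typically reports 'cp1252'.
print(sys.stdout.encoding)

# cp1252 has no mapping for ZERO WIDTH SPACE, so this raises the same
# UnicodeEncodeError: 'charmap' codec can't encode character '\u200b' ...
'\u200b'.encode('cp1252')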

There is little you can do about the Windows console itself. I would advise you to try one of these workarounds:

  • do not print to the console but write to a (UTF-8) file instead. You will then be able to read it with a UTF-8 enabled text editor such as Notepad++ (see the sketch after this list)

  • manually encode anything before printing it, with errors='ignore' or errors='replace'. That way, any offending characters will be ignored or replaced and no error will arise

      print(soup.prettify().encode('cp1252', errors='ignore'))
    
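Here is a minimal sketch of the first workaround. The file name output.html is just an example, and the fetch/parse lines mirror the code from the question:

from bs4 import BeautifulSoup
import requests

r = requests.get('https://stackoverflow.com').text
soup = BeautifulSoup(r, 'lxml')

# Write to a UTF-8 encoded file instead of printing to the cp1252 console;
# the result can then be opened with a UTF-8 aware editor such as Notepad++.
with open('output.html', 'w', encoding='utf-8') as f:
    f.write(soup.prettify())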

Solution 2:[2]

You can explore a little on your own... but for Python 2.7 what I usually do is use this to clean my text:

text = text.encode('utf-8').decode('ascii', 'ignore')

The Python 3 equivalent of this is simply:

text = str(text)

For your case, try this:

r = requests.get('https://stackoverflow.com').text.encode('utf8').decode('ascii', 'ignore')

Otherwise, the normal way would be:

r = requests.get('https://stackoverflow.com')
soup = BeautifulSoup(r.content, 'lxml')
print(soup)

(I don't think this should give any error.)
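
Putting that suggestion together as one Python 3 sketch (this simply drops every non-ASCII character, including the U+200B that triggers the error, before parsing and printing):

from bs4 import BeautifulSoup
import requests

# Encode to UTF-8 bytes, then decode with the ASCII codec and errors='ignore';
# bytes outside the ASCII range (such as the ZERO WIDTH SPACE bytes) are dropped.
r = requests.get('https://stackoverflow.com').text.encode('utf8').decode('ascii', 'ignore')
soup = BeautifulSoup(r, 'lxml')
print(soup.prettify())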

Solution 3:[3]

For Python 3, you can use:

text = str(text.encode('utf-8'))

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution 1: Serge Ballesta
Solution 2:
Solution 3: Suraj Rao