How to fix UTF-8 decoded as ISO-8859-1 in Redshift
I assumed a dataset was ISO-8859-1 encoded when it was actually UTF-8. I wrote a Python script that decoded the data as ISO-8859-1 (using pandas with the wrong encoding) and loaded it into a Redshift table. The garbling happened during that decode, not while writing to the table, so the mangled characters were stored verbatim.
The original data source is no longer available, but the table now contains a lot of mangled characters.
E.g. 'Hello Günter' -> 'Hello GĂŒnter'
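For reference, here is a minimal Python sketch of how this kind of mojibake arises (an assumed reconstruction, not the original script); the exact garbled characters depend on which single-byte codec was actually applied:

```python
# Assumed reconstruction of the mistake, not the original script.
original = "Hello Günter"

# The source data was really UTF-8 ...
utf8_bytes = original.encode("utf-8")        # b'Hello G\xc3\xbcnter'

# ... but was decoded with a single-byte codec, so each UTF-8 byte became
# its own character and the garbled string was then written to Redshift.
garbled = utf8_bytes.decode("iso-8859-1")
print(garbled)                               # two mangled characters where 'ü' was

# Reading a file with the wrong encoding in pandas has the same effect, e.g.:
# df = pd.read_csv("export.csv", encoding="iso-8859-1")  # should have been "utf-8"
```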
What is the best way to resolve this? Right now I can only think of collecting a complete list of mangled characters and their correct counterparts, but maybe there is an approach I have not thought of. So my questions:
1. Was any information lost when the wrong decoding happened?
2. Is there a way to fix such a decoding issue directly in Redshift?
3. Is there an existing, complete list of these character mappings, so I do not have to build it myself? I have searched but could not find one.
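On the first question, a quick check (an illustrative sketch, not code from the original pipeline) suggests the mistake is reversible: ISO-8859-1 assigns a character to every byte value 0x00-0xFF, so encoding the garbled text back with ISO-8859-1 restores the original UTF-8 bytes.

```python
# Illustrative round-trip check (sketch): decoding UTF-8 as ISO-8859-1 is
# lossless, because every byte value maps to some ISO-8859-1 character,
# so the original bytes can be recovered and re-decoded correctly.
original = "Hello Günter"
garbled = original.encode("utf-8").decode("iso-8859-1")
restored = garbled.encode("iso-8859-1").decode("utf-8")
assert restored == original
```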
Thank you
EDIT: I pulled part of the table and found out that I have to do the following:
"Ð\x97амÑ\x83ж вÑ\x8bÑ\x85оди".encode('iso-8859-1').decode('utf8')
The table has billions of rows; would it be possible to do that directly in Redshift?
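For scale reference, the same round-trip applied from Python with pandas could look roughly like the sketch below (the DataFrame and column names are placeholders, not the real schema); values that cannot complete the round-trip are left unchanged, and pure-ASCII rows pass through as-is. Whether the equivalent transformation can be expressed inside Redshift is the open question above.

```python
import pandas as pd

def fix_mojibake(text):
    """Reverse a wrong ISO-8859-1 decode of UTF-8 data; keep other values as-is."""
    if not isinstance(text, str):
        return text
    try:
        return text.encode("iso-8859-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Value is either already clean or was garbled by a different codec.
        return text

# Placeholder data standing in for rows pulled from the affected table.
df = pd.DataFrame({"name": ["Hello GÃ¼nter", "Plain ASCII stays the same"]})
df["name"] = df["name"].map(fix_mojibake)
print(df)
```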
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow