Downloading Kaggle Dataset with Chinese folder names converts names to something useless

What the folder names look like AFTER downloading:

What the folder names look like on Kaggle:

The folder names in the dataset on Kaggle are almost entirely in Chinese, but after downloading they turn into unreadable characters. The Kaggle page mentions that the encoding is gb2312, but that didn't help: I now have 30 GB of data whose folder names are useless as labels for a CNN.

Below is what I tried, based on another Stack Overflow page. I can't find any other question describing a problem even remotely similar to this.

>>> data = 'µ£▒'
>>> data.decode('utf8')
Traceback (most recent call last):
  File "<pyshell#1>", line 1, in <module>
    data.decode('utf8')
AttributeError: 'str' object has no attribute 'decode'. Did you mean: 'encode'?


>>> data.decode('utf8').encode('latin1').decode('gb2312')
Traceback (most recent call last):
  File "<pyshell#2>", line 1, in <module>
    data.decode('utf8').encode('latin1').decode('gb2312')
AttributeError: 'str' object has no attribute 'decode'. Did you mean: 'encode'?
>>> data = 'µ£▒'
>>> data.decode('utf8').encode('latin1').decode('gb2312')
Traceback (most recent call last):
  File "<pyshell#4>", line 1, in <module>
    data.decode('utf8').encode('latin1').decode('gb2312')
AttributeError: 'str' object has no attribute 'decode'. Did you mean: 'encode'?
>>> 'µ£▒'.encode('latin1').decode('gb2312')
Traceback (most recent call last):
  File "<pyshell#5>", line 1, in <module>
    'µ£▒'.encode('latin1').decode('gb2312')
UnicodeEncodeError: 'latin-1' codec can't encode character '\u2592' in position 2: ordinal not in range(256)
>>> 'µ£▒'.encode('utf8').decode('gb2312')
Traceback (most recent call last):
  File "<pyshell#6>", line 1, in <module>
    'µ£▒'.encode('utf8').decode('gb2312')
UnicodeDecodeError: 'gb2312' codec can't decode byte 0xe2 in position 4: illegal multibyte sequence
>>> 'µ£▒'.decode('gb2312')
Traceback (most recent call last):
  File "<pyshell#7>", line 1, in <module>
    'µ£▒'.decode('gb2312')
AttributeError: 'str' object has no attribute 'decode'. Did you mean: 'encode'?
>>> 'µ£▒'.encode('gb2312').decode('utf8')
Traceback (most recent call last):
  File "<pyshell#8>", line 1, in <module>
    'µ£▒'.encode('gb2312').decode('utf8')
UnicodeEncodeError: 'gb2312' codec can't encode character '\xb5' in position 0: illegal multibyte sequence
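
One possible reading of the failures above, offered as an assumption rather than a confirmed diagnosis: the characters 'µ£▒' are how the bytes E6 9C B1 are displayed in code page 437, and those three bytes are valid UTF-8 for a single Chinese character. That would fit an extraction step that decoded UTF-8 names as cp437 (Python's zipfile module, for example, falls back to cp437 when an archive entry does not carry the UTF-8 flag), and it would explain why every gb2312 attempt fails. The sketch below reverses that one step. The helper names `repair_name` and `rename_tree` are hypothetical, and the cp437/UTF-8 pairing is a guess based only on the sample string in this question.

```python
import os


def repair_name(garbled: str, wrong: str = "cp437", right: str = "utf-8") -> str:
    """Undo one round of mojibake (assumed: UTF-8 bytes shown as cp437).

    Returns the input unchanged if it does not round-trip cleanly,
    so names that were never garbled are left alone.
    """
    try:
        return garbled.encode(wrong).decode(right)
    except (UnicodeEncodeError, UnicodeDecodeError):
        return garbled


def rename_tree(root: str) -> None:
    """Rename every garbled file and folder under root in place."""
    # Walk bottom-up so entries are renamed before their parent folder is.
    for parent, dirs, files in os.walk(root, topdown=False):
        for name in dirs + files:
            fixed = repair_name(name)
            if fixed != name:
                os.rename(os.path.join(parent, name),
                          os.path.join(parent, fixed))
```

With the sample from the question, `repair_name('µ£▒')` returns `'朱'`. Before calling `rename_tree` on the full 30 GB, it is probably worth trying it on a copy of one small subfolder, since a rename cannot be undone if the encoding guess is wrong for some names.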

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
