Python FTP returns a corrupted file name: じし->し ゙し (how can I properly decode/encode a string?)

In my open-source application I use rented hosting with FTP access. The application needs to read the list of files in a directory and parse it. However, some of the file names come back wrong. How can I recover the correct names, or make the FTP server return them correctly?

import ftplib

ftp_domain = "japcards.ru"
ftp_login = "u1670424_jap_db"
ftp_pass = "Jap2DbPass"

if __name__ == '__main__':
    ftp = ftplib.FTP(ftp_domain)
    ftp.encoding = 'utf-8'
    ftp.login(ftp_login, ftp_pass)
    ftp.cwd("audio/jp")
    ftpList = ftp.nlst()
    ftpList.sort()
    for i in ftpList:
        print(i)
        print(i.encode('utf-8'))
        

From the output of the example:

し ゙しょけい.wav
b'\xe3\x81\x97\xe3\x82\x99\xe3\x81\x97\xe3\x82\x87\xe3\x81\x91\xe3\x81\x84.wav'
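
One way to see what the server actually returned is to decode the bytes from the output above and list the individual code points. The sketch below is not part of the original question: `describe` is a hypothetical helper name, and it uses only the standard `unicodedata` module.

```python
import unicodedata

# Bytes copied from the output above.
raw = b'\xe3\x81\x97\xe3\x82\x99\xe3\x81\x97\xe3\x82\x87\xe3\x81\x91\xe3\x81\x84.wav'
name = raw.decode('utf-8')


def describe(text):
    """Return (code point, Unicode name) pairs for every character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in text]


# Show the first three code points of the returned name.
for code_point, char_name in describe(name)[:3]:
    print(code_point, char_name)

# The same name written with the precomposed letter U+3058.
typed = '\u3058\u3057\u3087\u3051\u3044.wav'
print(name == typed)          # the two spellings are different strings
print(len(name), len(typed))  # the decomposed spelling is one code point longer
```

The first two entries are U+3057 and U+3099, that is, the base letter followed by a combining voiced sound mark. This suggests the name is stored in decomposed form rather than being damaged in transit.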


Solution 1:[1]

This appears to be just an issue with how your OS / STDOUT handles the characters.

As your encoded output shows, the character じ is represented by \xE3\x81\x97\xE3\x82\x99. When you inspect the variable in memory, those bytes decode to し followed by the combining voiced sound mark (U+3099), which together display as a single character.

If you paste the "desired" text into your IDE (PyCharm on Windows in my case), \xE3\x81\x97\xE3\x82\x99 is replaced with \xE3\x81\x97\x20\xE3\x82\x99, i.e. a space is inserted before the mark.

Apart from printing to the screen, this issue shouldn't affect the rest of your code, since the file names are preserved correctly in memory.

Solution 2:[2]

Correct answer: you need to run the script with Python 3.10.4 or later. For comparing strings, the following replacements can be used (Perl, copied from a website):

$str = "??????????????????";
 
# Hiragana
$str =~ s/\xE3\x81\x8B\xE3\x82\x99/\xE3\x81\x8C/g;  # か + ゛ => が
$str =~ s/\xE3\x81\x8D\xE3\x82\x99/\xE3\x81\x8E/g;  # き + ゛ => ぎ
$str =~ s/\xE3\x81\x8F\xE3\x82\x99/\xE3\x81\x90/g;  # く + ゛ => ぐ
$str =~ s/\xE3\x81\x91\xE3\x82\x99/\xE3\x81\x92/g;  # け + ゛ => げ
$str =~ s/\xE3\x81\x93\xE3\x82\x99/\xE3\x81\x94/g;  # こ + ゛ => ご
$str =~ s/\xE3\x81\x95\xE3\x82\x99/\xE3\x81\x96/g;  # さ + ゛ => ざ
$str =~ s/\xE3\x81\x97\xE3\x82\x99/\xE3\x81\x98/g;  # し + ゛ => じ
$str =~ s/\xE3\x81\x99\xE3\x82\x99/\xE3\x81\x9A/g;  # す + ゛ => ず
$str =~ s/\xE3\x81\x9B\xE3\x82\x99/\xE3\x81\x9C/g;  # せ + ゛ => ぜ
$str =~ s/\xE3\x81\x9D\xE3\x82\x99/\xE3\x81\x9E/g;  # そ + ゛ => ぞ
$str =~ s/\xE3\x81\x9F\xE3\x82\x99/\xE3\x81\xA0/g;  # た + ゛ => だ
$str =~ s/\xE3\x81\xA1\xE3\x82\x99/\xE3\x81\xA2/g;  # ち + ゛ => ぢ
$str =~ s/\xE3\x81\xA4\xE3\x82\x99/\xE3\x81\xA5/g;  # つ + ゛ => づ
$str =~ s/\xE3\x81\xA6\xE3\x82\x99/\xE3\x81\xA7/g;  # て + ゛ => で
$str =~ s/\xE3\x81\xA8\xE3\x82\x99/\xE3\x81\xA9/g;  # と + ゛ => ど
$str =~ s/\xE3\x81\xAF\xE3\x82\x99/\xE3\x81\xB0/g;  # は + ゛ => ば
$str =~ s/\xE3\x81\xAF\xE3\x82\x9A/\xE3\x81\xB1/g;  # は + ゜ => ぱ
$str =~ s/\xE3\x81\xB2\xE3\x82\x99/\xE3\x81\xB3/g;  # ひ + ゛ => び
$str =~ s/\xE3\x81\xB2\xE3\x82\x9A/\xE3\x81\xB4/g;  # ひ + ゜ => ぴ
$str =~ s/\xE3\x81\xB5\xE3\x82\x99/\xE3\x81\xB6/g;  # ふ + ゛ => ぶ
$str =~ s/\xE3\x81\xB5\xE3\x82\x9A/\xE3\x81\xB7/g;  # ふ + ゜ => ぷ
$str =~ s/\xE3\x81\xB8\xE3\x82\x99/\xE3\x81\xB9/g;  # へ + ゛ => べ
$str =~ s/\xE3\x81\xB8\xE3\x82\x9A/\xE3\x81\xBA/g;  # へ + ゜ => ぺ
$str =~ s/\xE3\x81\xBB\xE3\x82\x99/\xE3\x81\xBC/g;  # ほ + ゛ => ぼ
$str =~ s/\xE3\x81\xBB\xE3\x82\x9A/\xE3\x81\xBD/g;  # ほ + ゜ => ぽ

# Katakana
$str =~ s/\xE3\x82\xAB\xE3\x82\x99/\xE3\x82\xAC/g;  # カ + ゛ => ガ
$str =~ s/\xE3\x82\xAD\xE3\x82\x99/\xE3\x82\xAE/g;  # キ + ゛ => ギ
$str =~ s/\xE3\x82\xAF\xE3\x82\x99/\xE3\x82\xB0/g;  # ク + ゛ => グ
$str =~ s/\xE3\x82\xB1\xE3\x82\x99/\xE3\x82\xB2/g;  # ケ + ゛ => ゲ
$str =~ s/\xE3\x82\xB3\xE3\x82\x99/\xE3\x82\xB4/g;  # コ + ゛ => ゴ
$str =~ s/\xE3\x82\xB5\xE3\x82\x99/\xE3\x82\xB6/g;  # サ + ゛ => ザ
$str =~ s/\xE3\x82\xB7\xE3\x82\x99/\xE3\x82\xB8/g;  # シ + ゛ => ジ
$str =~ s/\xE3\x82\xB9\xE3\x82\x99/\xE3\x82\xBA/g;  # ス + ゛ => ズ
$str =~ s/\xE3\x82\xBB\xE3\x82\x99/\xE3\x82\xBC/g;  # セ + ゛ => ゼ
$str =~ s/\xE3\x82\xBD\xE3\x82\x99/\xE3\x82\xBE/g;  # ソ + ゛ => ゾ
$str =~ s/\xE3\x82\xBF\xE3\x82\x99/\xE3\x83\x80/g;  # タ + ゛ => ダ
$str =~ s/\xE3\x83\x81\xE3\x82\x99/\xE3\x83\x82/g;  # チ + ゛ => ヂ
$str =~ s/\xE3\x83\x84\xE3\x82\x99/\xE3\x83\x85/g;  # ツ + ゛ => ヅ
$str =~ s/\xE3\x83\x86\xE3\x82\x99/\xE3\x83\x87/g;  # テ + ゛ => デ
$str =~ s/\xE3\x83\x88\xE3\x82\x99/\xE3\x83\x89/g;  # ト + ゛ => ド
$str =~ s/\xE3\x83\x8F\xE3\x82\x99/\xE3\x83\x90/g;  # ハ + ゛ => バ
$str =~ s/\xE3\x83\x8F\xE3\x82\x9A/\xE3\x83\x91/g;  # ハ + ゜ => パ
$str =~ s/\xE3\x83\x92\xE3\x82\x99/\xE3\x83\x93/g;  # ヒ + ゛ => ビ
$str =~ s/\xE3\x83\x92\xE3\x82\x9A/\xE3\x83\x94/g;  # ヒ + ゜ => ピ
$str =~ s/\xE3\x83\x95\xE3\x82\x99/\xE3\x83\x96/g;  # フ + ゛ => ブ
$str =~ s/\xE3\x83\x95\xE3\x82\x9A/\xE3\x83\x97/g;  # フ + ゜ => プ
$str =~ s/\xE3\x83\x98\xE3\x82\x99/\xE3\x83\x99/g;  # ヘ + ゛ => ベ
$str =~ s/\xE3\x83\x98\xE3\x82\x9A/\xE3\x83\x9A/g;  # ヘ + ゜ => ペ
$str =~ s/\xE3\x83\x9B\xE3\x82\x99/\xE3\x83\x9C/g;  # ホ + ゛ => ボ
$str =~ s/\xE3\x83\x9B\xE3\x82\x9A/\xE3\x83\x9D/g;  # ホ + ゜ => ポ

I rewrote this for Python 3:

def hiragana_and_katakana_normalization(file_name):
    """Replace kana + combining (han)dakuten byte sequences with the precomposed kana."""
    byte_msk = file_name.encode('utf-8')
    # Hiragana
    byte_msk = byte_msk.replace(b'\xE3\x81\x8B\xE3\x82\x99', b'\xE3\x81\x8C')  # か + ゛ => が
    byte_msk = byte_msk.replace(b'\xE3\x81\x8D\xE3\x82\x99', b'\xE3\x81\x8E')  # き + ゛ => ぎ
    byte_msk = byte_msk.replace(b'\xE3\x81\x8F\xE3\x82\x99', b'\xE3\x81\x90')  # く + ゛ => ぐ
    byte_msk = byte_msk.replace(b'\xE3\x81\x91\xE3\x82\x99', b'\xE3\x81\x92')  # け + ゛ => げ
    byte_msk = byte_msk.replace(b'\xE3\x81\x93\xE3\x82\x99', b'\xE3\x81\x94')  # こ + ゛ => ご
    byte_msk = byte_msk.replace(b'\xE3\x81\x95\xE3\x82\x99', b'\xE3\x81\x96')  # さ + ゛ => ざ
    byte_msk = byte_msk.replace(b'\xE3\x81\x97\xE3\x82\x99', b'\xE3\x81\x98')  # し + ゛ => じ
    byte_msk = byte_msk.replace(b'\xE3\x81\x99\xE3\x82\x99', b'\xE3\x81\x9A')  # す + ゛ => ず
    byte_msk = byte_msk.replace(b'\xE3\x81\x9B\xE3\x82\x99', b'\xE3\x81\x9C')  # せ + ゛ => ぜ
    byte_msk = byte_msk.replace(b'\xE3\x81\x9D\xE3\x82\x99', b'\xE3\x81\x9E')  # そ + ゛ => ぞ
    byte_msk = byte_msk.replace(b'\xE3\x81\x9F\xE3\x82\x99', b'\xE3\x81\xA0')  # た + ゛ => だ
    byte_msk = byte_msk.replace(b'\xE3\x81\xA1\xE3\x82\x99', b'\xE3\x81\xA2')  # ち + ゛ => ぢ
    byte_msk = byte_msk.replace(b'\xE3\x81\xA4\xE3\x82\x99', b'\xE3\x81\xA5')  # つ + ゛ => づ
    byte_msk = byte_msk.replace(b'\xE3\x81\xA6\xE3\x82\x99', b'\xE3\x81\xA7')  # て + ゛ => で
    byte_msk = byte_msk.replace(b'\xE3\x81\xA8\xE3\x82\x99', b'\xE3\x81\xA9')  # と + ゛ => ど
    byte_msk = byte_msk.replace(b'\xE3\x81\xAF\xE3\x82\x99', b'\xE3\x81\xB0')  # は + ゛ => ば
    byte_msk = byte_msk.replace(b'\xE3\x81\xAF\xE3\x82\x9A', b'\xE3\x81\xB1')  # は + ゜ => ぱ
    byte_msk = byte_msk.replace(b'\xE3\x81\xB2\xE3\x82\x99', b'\xE3\x81\xB3')  # ひ + ゛ => び
    byte_msk = byte_msk.replace(b'\xE3\x81\xB2\xE3\x82\x9A', b'\xE3\x81\xB4')  # ひ + ゜ => ぴ
    byte_msk = byte_msk.replace(b'\xE3\x81\xB5\xE3\x82\x99', b'\xE3\x81\xB6')  # ふ + ゛ => ぶ
    byte_msk = byte_msk.replace(b'\xE3\x81\xB5\xE3\x82\x9A', b'\xE3\x81\xB7')  # ふ + ゜ => ぷ
    byte_msk = byte_msk.replace(b'\xE3\x81\xB8\xE3\x82\x99', b'\xE3\x81\xB9')  # へ + ゛ => べ
    byte_msk = byte_msk.replace(b'\xE3\x81\xB8\xE3\x82\x9A', b'\xE3\x81\xBA')  # へ + ゜ => ぺ
    byte_msk = byte_msk.replace(b'\xE3\x81\xBB\xE3\x82\x99', b'\xE3\x81\xBC')  # ほ + ゛ => ぼ
    byte_msk = byte_msk.replace(b'\xE3\x81\xBB\xE3\x82\x9A', b'\xE3\x81\xBD')  # ほ + ゜ => ぽ
    # Katakana
    byte_msk = byte_msk.replace(b'\xE3\x82\xAB\xE3\x82\x99', b'\xE3\x82\xAC')  # カ + ゛ => ガ
    byte_msk = byte_msk.replace(b'\xE3\x82\xAD\xE3\x82\x99', b'\xE3\x82\xAE')  # キ + ゛ => ギ
    byte_msk = byte_msk.replace(b'\xE3\x82\xAF\xE3\x82\x99', b'\xE3\x82\xB0')  # ク + ゛ => グ
    byte_msk = byte_msk.replace(b'\xE3\x82\xB1\xE3\x82\x99', b'\xE3\x82\xB2')  # ケ + ゛ => ゲ
    byte_msk = byte_msk.replace(b'\xE3\x82\xB3\xE3\x82\x99', b'\xE3\x82\xB4')  # コ + ゛ => ゴ
    byte_msk = byte_msk.replace(b'\xE3\x82\xB5\xE3\x82\x99', b'\xE3\x82\xB6')  # サ + ゛ => ザ
    byte_msk = byte_msk.replace(b'\xE3\x82\xB7\xE3\x82\x99', b'\xE3\x82\xB8')  # シ + ゛ => ジ
    byte_msk = byte_msk.replace(b'\xE3\x82\xB9\xE3\x82\x99', b'\xE3\x82\xBA')  # ス + ゛ => ズ
    byte_msk = byte_msk.replace(b'\xE3\x82\xBB\xE3\x82\x99', b'\xE3\x82\xBC')  # セ + ゛ => ゼ
    byte_msk = byte_msk.replace(b'\xE3\x82\xBD\xE3\x82\x99', b'\xE3\x82\xBE')  # ソ + ゛ => ゾ
    byte_msk = byte_msk.replace(b'\xE3\x82\xBF\xE3\x82\x99', b'\xE3\x83\x80')  # タ + ゛ => ダ
    byte_msk = byte_msk.replace(b'\xE3\x83\x81\xE3\x82\x99', b'\xE3\x83\x82')  # チ + ゛ => ヂ
    byte_msk = byte_msk.replace(b'\xE3\x83\x84\xE3\x82\x99', b'\xE3\x83\x85')  # ツ + ゛ => ヅ
    byte_msk = byte_msk.replace(b'\xE3\x83\x86\xE3\x82\x99', b'\xE3\x83\x87')  # テ + ゛ => デ
    byte_msk = byte_msk.replace(b'\xE3\x83\x88\xE3\x82\x99', b'\xE3\x83\x89')  # ト + ゛ => ド
    byte_msk = byte_msk.replace(b'\xE3\x83\x8F\xE3\x82\x99', b'\xE3\x83\x90')  # ハ + ゛ => バ
    byte_msk = byte_msk.replace(b'\xE3\x83\x8F\xE3\x82\x9A', b'\xE3\x83\x91')  # ハ + ゜ => パ
    byte_msk = byte_msk.replace(b'\xE3\x83\x92\xE3\x82\x99', b'\xE3\x83\x93')  # ヒ + ゛ => ビ
    byte_msk = byte_msk.replace(b'\xE3\x83\x92\xE3\x82\x9A', b'\xE3\x83\x94')  # ヒ + ゜ => ピ
    byte_msk = byte_msk.replace(b'\xE3\x83\x95\xE3\x82\x99', b'\xE3\x83\x96')  # フ + ゛ => ブ
    byte_msk = byte_msk.replace(b'\xE3\x83\x95\xE3\x82\x9A', b'\xE3\x83\x97')  # フ + ゜ => プ
    byte_msk = byte_msk.replace(b'\xE3\x83\x98\xE3\x82\x99', b'\xE3\x83\x99')  # ヘ + ゛ => ベ
    byte_msk = byte_msk.replace(b'\xE3\x83\x98\xE3\x82\x9A', b'\xE3\x83\x9A')  # ヘ + ゜ => ペ
    byte_msk = byte_msk.replace(b'\xE3\x83\x9B\xE3\x82\x99', b'\xE3\x83\x9C')  # ホ + ゛ => ボ
    byte_msk = byte_msk.replace(b'\xE3\x83\x9B\xE3\x82\x9A', b'\xE3\x83\x9D')  # ホ + ゜ => ポ
    return byte_msk.decode('utf-8')
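
The replacement table above matches Unicode canonical composition for these kana, so a shorter alternative may be to let the standard library do the work. The sketch below is not part of the original answer: `normalize_name` is a hypothetical helper, and it assumes NFC normalization is acceptable for every file name in the listing (NFC also composes non-kana sequences, such as Latin letters followed by combining accents).

```python
import unicodedata


def normalize_name(file_name):
    """Compose base letters and combining marks (NFC), e.g. U+3057 + U+3099 -> U+3058."""
    return unicodedata.normalize('NFC', file_name)


# Name as returned by the server in the question (decomposed form).
received = b'\xe3\x81\x97\xe3\x82\x99\xe3\x81\x97\xe3\x82\x87\xe3\x81\x91\xe3\x81\x84.wav'.decode('utf-8')
fixed = normalize_name(received)
print(fixed.encode('utf-8'))

# Spot-check a few pairs from the table: base + combining mark -> precomposed letter.
for base, mark, composed in [('\u304b', '\u3099', '\u304c'),   # hiragana ka + dakuten
                             ('\u306f', '\u309a', '\u3071'),   # hiragana ha + handakuten
                             ('\u30db', '\u309a', '\u30dd')]:  # katakana ho + handakuten
    assert normalize_name(base + mark) == composed
```

In the question's script this could be applied as `ftpList = [normalize_name(n) for n in ftp.nlst()]`. To request a file from the server afterwards, the original un-normalized name is probably still needed, since that is presumably how the file is stored there.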

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Bigbob556677
Solution 2