Write data in unknown encoding

Is it possible to write data to a file in an unknown encoding? I cannot decode email headers, for example Message-ID, because if I use the ignore or replace error handler (https://docs.python.org/3/library/codecs.html#error-handlers), a non-RFC-compliant header becomes RFC-compliant and the antispam filter no longer increases the spam score.

I get the string from Postfix over the milter protocol. I cannot save this data unchanged for the antispam filter, because writing it raises a UnicodeError. Example:

$ cat savefile

#!/usr/bin/python3

import sys
fh = open('test', 'w+')
fh.write(sys.argv[1])
$ echo žlutý | xargs ./savefile && cat test
žlutý
$ echo žlutý | iconv -f UTF-8 -t ISO8859-2 - | xargs ./savefile
Traceback (most recent call last):
  File "/root/./savefile", line 5, in <module>
    fh.write(sys.argv[1])
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcbe' in position 0: surrogates not allowed

The input may arrive in any of many unknown encodings. The same milter application works fine under Python 2.



Solution 1:[1]

You want to handle raw bytes then, not strings. Open the output file in binary mode. Note this:

sys.argv

..

Note: On Unix, command line arguments are passed by bytes from OS. Python decodes them with filesystem encoding and “surrogateescape” error handler. When you need original bytes, you can get it by [os.fsencode(arg) for arg in sys.argv].

https://docs.python.org/3/library/sys.html#sys.argv
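
The round trip described in that note can be checked without a shell. This is a minimal sketch that assumes the filesystem encoding is UTF-8 (as in the traceback above) and spells out the codec explicitly instead of calling os.fsencode; the byte string is "žlutý" in ISO8859-2.

```python
raw = b"\xbelut\xfd"  # "žlutý" encoded as ISO8859-2, which is not valid UTF-8

# Roughly what Python does to argv on Unix, assuming a UTF-8 filesystem encoding:
# each undecodable byte 0xNN becomes the lone surrogate U+DCNN.
text = raw.decode("utf-8", errors="surrogateescape")
print(ascii(text))  # '\udcbelut\udcfd'

# Encoding with the same handler turns the surrogates back into the original bytes.
restored = text.encode("utf-8", errors="surrogateescape")
print(restored == raw)  # True
```

The '\udcbe' in the output is the same character the UnicodeEncodeError complains about: a plain UTF-8 encode refuses lone surrogates, while the surrogateescape handler maps them back to bytes.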

So:

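import sys
import os

# Binary mode: the file receives bytes, so no encoding step can fail
with open('test', 'wb+') as fh:
    # os.fsencode() reverses the surrogateescape decoding applied to argv
    fh.write(os.fsencode(sys.argv[1]))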
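
For values that do not come from sys.argv, such as header strings handed over by a milter library, the same idea might look like the sketch below. The helper name save_raw is hypothetical, and the code assumes any str it receives was decoded as UTF-8 with surrogateescape; if your library decodes differently, the encode step has to match it. The second half shows a text-mode alternative using the errors parameter of open().

```python
import os
import tempfile


def save_raw(path, value):
    """Write value to path byte-for-byte (hypothetical helper)."""
    # Assumption: str input was decoded as UTF-8 with surrogateescape.
    if isinstance(value, str):
        value = value.encode("utf-8", errors="surrogateescape")
    with open(path, "wb") as fh:
        fh.write(value)


raw = b"\xbelut\xfd"  # ISO8859-2 bytes, invalid as UTF-8
arg = raw.decode("utf-8", errors="surrogateescape")  # stand-in for the received str

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test")

    # Binary mode with an explicit encode step
    save_raw(path, arg)
    with open(path, "rb") as fh:
        written = fh.read()

    # Text mode alternative: let the file object apply surrogateescape
    with open(path, "w", encoding="utf-8", errors="surrogateescape") as fh:
        fh.write(arg)
    with open(path, "rb") as fh:
        written_text_mode = fh.read()

print(written == raw, written_text_mode == raw)
```

Binary mode is usually the safer choice here, because text mode may also translate newlines on some platforms.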

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 deceze