zipfile header language encoding bit set differently between Python 2 and Python 3

I would like this code to behave the same when run with Python 2 or Python 3:

from zipfile import ZipFile, ZipInfo

with ZipFile("out.zip", 'w') as zf:
    content = "content"
    info = ZipInfo()
    info.filename = "file.txt"
    info.flag_bits = 0x800
    info.file_size = len(content)
    zf.writestr(info, content)

However, under Python 2 out.zip starts:

50 4b 03 04 14 00 00 08

Under Python 3, it starts:

50 4b 03 04 14 00 00 00

The differing part is flag_bits: 0x800 under Python 2, 0x00 under Python 3. That is bit 11, the language encoding flag, which marks the filename as UTF-8. Bit 11 seems to get set only if filename.encode("ascii") throws.
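To make the byte positions concrete, here is a small sketch that decodes the flag field from the eight bytes shown above. The helper name local_header_flags is made up for illustration; the layout it assumes (4-byte signature, 2-byte "version needed", 2-byte flags, all little-endian) is the standard start of a ZIP local file header.

```python
import struct

def local_header_flags(header_bytes):
    """Return the general purpose bit flag from the start of a local file header."""
    # 4-byte signature, 2-byte version needed, 2-byte flags (little-endian).
    signature, version, flags = struct.unpack("<4sHH", header_bytes[:8])
    if signature != b"PK\x03\x04":
        raise ValueError("not a ZIP local file header")
    return flags

# The two byte sequences quoted in the question.
py2_start = bytes(bytearray([0x50, 0x4b, 0x03, 0x04, 0x14, 0x00, 0x00, 0x08]))
py3_start = bytes(bytearray([0x50, 0x4b, 0x03, 0x04, 0x14, 0x00, 0x00, 0x00]))

print(hex(local_header_flags(py2_start)))  # 0x800
print(hex(local_header_flags(py3_start)))  # 0x0
```

Because the field is little-endian, the 08 in the last position is the high byte, which is why it reads as 0x800 rather than 0x08.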

I tried to force this bit on by setting the flag after creating the ZipInfo object, but it gets reset back to 0x00 in _open_to_write().

I wonder if anyone here has a good solution. Ideally I'd like both outputs to have the flag set, because that mirrors what the jar utility does.

EDIT: Updated to add the info.flag_bits = 0x800 line just to spell out what I'm trying to achieve.

I've reproduced this on Windows 10 (ActivePython 3.6.0.3600 vs. ActivePython 2.7.14.2717) and on Linux (Python 3.6.6 vs. Python 2.7.11).

In case it matters, I am running this exactly as in my example, with no hashbang, invoking the interpreter directly:

pythonX test.py


Solution 1:[1]

Edit: Here's code that works for me with Python 2.7 but not with 3.6 (a bit of a mystery, it seemed to work earlier this evening):

$ cat zipf.py
from __future__ import print_function

from zipfile import ZipFile, ZipInfo

with ZipFile("out.zip", 'w') as zf:
    content = "content"
    info = ZipInfo()
    info.filename = "file.txt"
    info.flag_bits = 0x800
    # don't set info.file_size here: zf.writestr() does that
    zf.writestr(info, content)

with open('out.zip', 'rb') as stream:
    byteseq = stream.read(8)
    for i in byteseq:
        if isinstance(i, str): i = ord(i)
        print('{:02x}'.format(i), end=' ')
    print()

Run as:

$ python2.7 zipf.py
50 4b 03 04 14 00 00 08 

but:

$ python3.6 zipf.py
50 4b 03 04 14 00 00 00 

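It's certainly possible to make it work, by setting the flag only after the entry has been opened for writing, i.e. after the internal reset has already happened. However, you then have to avoid writestr, and this works only with Python 3.6 and later, where zf.open() accepts mode 'w' (and it seems rather abusive):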

from __future__ import print_function

from zipfile import ZipFile, ZipInfo

with ZipFile("out.zip", 'w') as zf:
    info = ZipInfo()
    info.filename = "file.txt"
    content = "content"
    if not isinstance(content, bytes):
        content = content.encode('utf8')
    info.file_size = len(content)
    with zf.open(info, 'w') as stream:
        info.flag_bits = 0x800
        stream.write(content)

with open('out.zip', 'rb') as stream:
    byteseq = stream.read(8)
    for i in byteseq:
        if isinstance(i, str): i = ord(i)
        print('{:02x}'.format(i), end=' ')
    print()

Python 3.6 resetting info.flag_bits (in the internal open that writestr performs) is probably just incorrect, although that's not entirely clear to me.

Original answer below

I cannot reproduce this, but you're right that bit 11 in the flag bits is set if the file name is Unicode and encoding as ASCII fails:

def _encodeFilenameFlags(self):
    if isinstance(self.filename, unicode):
        try:
            return self.filename.encode('ascii'), self.flag_bits
        except UnicodeEncodeError:
            return self.filename.encode('utf-8'), self.flag_bits | 0x800
    else:
        return self.filename, self.flag_bits

(Python 2.7 zipfile.py source) or:

def _encodeFilenameFlags(self):
    try:
        return self.filename.encode('ascii'), self.flag_bits
    except UnicodeEncodeError:
        return self.filename.encode('utf-8'), self.flag_bits | 0x800

(Python 3.6 zipfile.py source).

To get the bit set you need a filename that cannot be encoded directly in ASCII, e.g.:

info.filename = u"sch\N{latin small letter o with diaeresis}n" # "file.txt"

(this notation works with both Python 2.7 and 3.6).
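A quick way to check this using only the public zipfile API is to write one entry to an in-memory archive and read the flag field back from the local header. This is a minimal sketch written with Python 3 in mind; flags_for_name is a hypothetical helper, not part of zipfile.

```python
import io
import struct
from zipfile import ZipFile, ZipInfo

def flags_for_name(name):
    """Write a one-entry archive in memory and return its local header flags."""
    buf = io.BytesIO()
    with ZipFile(buf, "w") as zf:
        zf.writestr(ZipInfo(name), b"content")
    # The flag field sits at bytes 6-7 of the local file header, little-endian.
    (flags,) = struct.unpack("<H", buf.getvalue()[6:8])
    return flags

print(hex(flags_for_name(u"file.txt")))                                          # 0x0
print(hex(flags_for_name(u"sch\N{LATIN SMALL LETTER O WITH DIAERESIS}n.txt")))   # 0x800
```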

You wrote: "I tried to force this bit on by setting the flag after creating the ZipInfo object, but it gets reset back to 0x00 in _open_to_write()."

If I add:

info.filename = "file.txt"
info.flag_bits |= 0x0800

(just after setting the filename to u"schön") and run this under Python 2.7 or 3.6, I get the bit set in the header (of course the file name in the zip directory changes back to file.txt).

Solution 2:[2]

I am using something like this for the time being:

from zipfile import ZipFile, ZipInfo
import struct

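# Keep a reference to the original method so the wrapper can delegate to it.
orig_function = ZipInfo.FileHeader

def new_function(self, zip64=None):
    header = orig_function(self, zip64)
    # Unpack the header into a list of byte values so one byte can be modified.
    fmt = "B"*len(header)
    blist = list(struct.unpack(fmt, header))
    # Byte 7 is the high byte of the little-endian flag field; 0x08 there is bit 11 (0x800).
    blist[7] |= 0x8
    return struct.pack(fmt, *blist)

# Monkeypatch ZipInfo so every local file header gets the flag.
setattr(ZipInfo, "FileHeader", new_function)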

with ZipFile("out.zip", 'w') as zf:
    content = "content"
    info = ZipInfo()
    info.filename = "file.txt"
    info.file_size = len(content)
    zf.writestr(info, content)

Hopefully it won't break too soon; FileHeader() seems like something that won't be changing in the future.
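One caveat with patching FileHeader() is that it only touches the local file header. As far as I can tell from the zipfile source, the central directory record takes its flags from _encodeFilenameFlags() separately, so it may still carry 0x00 there. A possible variation is to wrap _encodeFilenameFlags() instead. This is only a sketch: it relies on a private method, assumed to exist with the signature shown in the 2.7 and 3.6 sources quoted in Solution 1, so it could break in other versions.

```python
import io
import struct
from zipfile import ZipFile, ZipInfo

# Private helper; its existence and return shape are an assumption based on
# the 2.7 and 3.6 zipfile sources quoted earlier in this thread.
_orig_encode = ZipInfo._encodeFilenameFlags

def _encode_with_utf8_flag(self):
    filename, flag_bits = _orig_encode(self)
    # Always report bit 11, even for pure-ASCII names.
    return filename, flag_bits | 0x800

ZipInfo._encodeFilenameFlags = _encode_with_utf8_flag

buf = io.BytesIO()
with ZipFile(buf, "w") as zf:
    zf.writestr(ZipInfo("file.txt"), b"content")

# Flag field of the local file header (bytes 6-7, little-endian).
(local_flags,) = struct.unpack("<H", buf.getvalue()[6:8])

# Flag field as recorded in the central directory, read back by zipfile.
with ZipFile(io.BytesIO(buf.getvalue())) as zf:
    central_flags = zf.infolist()[0].flag_bits

print("%#x %#x" % (local_flags, central_flags))
```

Like the FileHeader() patch, this changes behaviour for every ZipInfo in the process, so it is best confined to the script that builds the archive.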

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2 Keeely