Why does Python zipfile not give the same output .zip file size as command-line zip?

Here is the size of the file generated by zip:

$ seq 10000 > 1.txt 
$ zip 1 1.txt
  adding: 1.txt (deflated 54%)
$ ls -og 1.zip 
-rw-r--r-- 1 22762 Aug 29 10:04 1.zip

Here is an equivalent Python script:

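import sys
import zipfile

# Usage: main.py ARCHIVE.zip MEMBER_NAME < input
z = zipfile.ZipFile(sys.argv[1], 'w', zipfile.ZIP_DEFLATED)
fn = sys.argv[2]  # name of the member inside the archive
z.writestr(zipfile.ZipInfo(fn), sys.stdin.read())
z.close()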

The size of the zip file generated is the following:

$ seq 10000 | ./main.py 2.zip 2.txt
$ ls -go 2.zip 
-rw-r--r-- 1 49002 Aug 29 10:15 2.zip

Does anybody know why the Python version does not produce a zip file as small as the one generated by zip?



Solution 1:[1]

It turns out (checked in Python 3) that when a ZipInfo instance is passed, writestr() ignores the compression and compresslevel given to zipfile.ZipFile.__init__(). This is an example of poor API design: the compression and compresslevel from the constructor should apply regardless of whether a ZipInfo is used.

The documentation for ZipFile.writestr() says: "When passing a ZipInfo instance as the zinfo_or_arcname parameter, the compression method used will be that specified in the compress_type member of the given ZipInfo instance. By default, the ZipInfo constructor sets this member to ZIP_STORED."
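The behaviour described above can be checked with a small sketch. It writes two members into an in-memory archive opened with ZIP_DEFLATED, one through a ZipInfo and one through a plain name, and records the compression method of each. The member names and the payload are made up for illustration.

```python
import io
import zipfile

# A freshly constructed ZipInfo defaults to "stored" (no compression).
default_type = zipfile.ZipInfo("a.txt").compress_type

buf = io.BytesIO()  # in-memory archive, so nothing is written to disk
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as z:
    # Passing a ZipInfo: its own compress_type is used.
    z.writestr(zipfile.ZipInfo("via_zipinfo.txt"), "x" * 1000)
    # Passing a plain name: the constructor's compression is used.
    z.writestr("via_name.txt", "x" * 1000)
    methods = {info.filename: info.compress_type for info in z.infolist()}

print(default_type == zipfile.ZIP_STORED)
print(methods)
```

Comparing the values in methods against zipfile.ZIP_STORED and zipfile.ZIP_DEFLATED shows which setting each member picked up.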

Because of this, the Python code in the original post stores the data without compressing it, which is why the resulting file is so large: 49002 bytes is roughly the size of the uncompressed input plus the ZIP headers.
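One possible fix, sketched below, is to set compress_type on the ZipInfo before writing. The helper archive_size() is hypothetical and exists only to compare the two cases. The data is generated in Python to mimic the output of seq 10000, and the archive is built in memory rather than from sys.argv and sys.stdin.

```python
import io
import zipfile

# Same content as `seq 10000`: the numbers 1..10000, one per line.
data = "\n".join(str(i) for i in range(1, 10001)) + "\n"

def archive_size(compress_type=None):
    """Build a one-member archive in memory and return its size in bytes."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as z:
        info = zipfile.ZipInfo("1.txt")
        if compress_type is not None:
            # Without this line the member keeps the ZIP_STORED default.
            info.compress_type = compress_type
        z.writestr(info, data)
    return len(buf.getvalue())

stored_size = archive_size()                        # as in the question
deflated_size = archive_size(zipfile.ZIP_DEFLATED)  # with the fix
print(stored_size, deflated_size)
```

The exact deflated size may differ from what the zip command produces, since the compression level and header details need not match, but it should be far smaller than the stored size.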

Another problem with this API is that the compression parameter of the constructor and the compress_type parameter of writestr() mean exactly the same thing but have different names. There is no reason to give different names to the same thing.
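To illustrate the two names side by side, here is a minimal sketch that passes compression to the constructor and compress_type to writestr(). In current Python 3 versions, a compress_type passed to writestr() takes precedence over both the constructor setting and the value on the ZipInfo. The member name and payload are placeholders.

```python
import io
import zipfile

buf = io.BytesIO()
# Constructor keyword: `compression`.
with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_STORED) as z:
    info = zipfile.ZipInfo("demo.txt")  # compress_type defaults to ZIP_STORED
    # Per-call keyword: `compress_type`; it overrides the ZipInfo's value.
    z.writestr(info, "x" * 1000, compress_type=zipfile.ZIP_DEFLATED)

# Re-open the archive and inspect how the member was actually written.
with zipfile.ZipFile(io.BytesIO(buf.getvalue())) as z:
    entry = z.getinfo("demo.txt")

print(entry.compress_type == zipfile.ZIP_DEFLATED)
print(entry.compress_size, entry.file_size)
```

Note that writestr() records the chosen method on the ZipInfo object that was passed in, so the same instance should not be assumed to keep its original compress_type afterwards.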

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 smci