ZipFile.testzip() returning different results on Python 2 and Python 3
Using the zipfile module to unzip a large data file in Python works correctly on Python 2 but produces the following error on Python 3.6.0:
BadZipFile: Bad CRC-32 for file 'myfile.csv'
I traced this to error handling code checking the CRC values.
Using ZipFile.testzip() on Python 2 returns nothing (all files are fine). Running it on Python 3 returns 'myfile.csv', indicating a problem with that file.
Code to reproduce on both Python 2 and Python 3 (involves a 300 MB download, sorry):
import sys
import zipfile
# urlretrieve lives in a different module on Python 2 and Python 3
if sys.version_info >= (3, 0, 0):
    from urllib.request import urlretrieve
else:
    from urllib import urlretrieve
url = "https://de.iplantcollaborative.org/anon-files//iplant/home/shared/commons_repo/curated/Vertnet_Amphibia_Sep2016/VertNet_Amphibia_Sept2016.zip"
urlretrieve(url, "vertnet_latest_amphibians.zip")
archive = zipfile.ZipFile("vertnet_latest_amphibians.zip")
# testzip() returns the name of the first bad member, or None if all pass
print(archive.testzip())
Does anyone understand why this difference exists and if there's a way to get Python 3 to properly extract the file using:
archive.extract("vertnet_latest_amphibians.csv")
Solution 1:[1]
The CRC value is OK. The CRC of 'vertnet_latest_amphibians.csv' recorded in the zip is 0x87203305. After extraction, this is indeed the CRC of the file.
However, the given uncompressed size is incorrect. The zip file records compressed size of 309,723,024 bytes, and uncompressed size of 292,198,614 bytes (that's smaller!). In reality, the uncompressed file is 4,587,165,910 bytes (4.3 GiB). This is bigger than the 4 GiB threshold where 32-bit counters break.
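The mismatch between the recorded and actual sizes is consistent with a 32-bit wraparound. As a quick check of the arithmetic, using only the byte counts quoted above (the variable names are illustrative):

```python
# Byte counts quoted in this answer.
recorded_size = 292198614     # uncompressed size stored in the zip
actual_size = 4587165910      # size of the CSV after extraction

# A 32-bit size field can only hold the true size modulo 2**32.
wrapped_size = actual_size % 2**32

print(wrapped_size == recorded_size)          # is the recorded size the wrapped value?
print(actual_size - recorded_size == 2**32)   # do the two differ by exactly 2**32?
```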
You can fix it like this (this worked in Python 3.5.2, at least):
archive = zipfile.ZipFile("vertnet_latest_amphibians.zip")
archive.getinfo("vertnet_latest_amphibians.csv").file_size += 2**32
archive.testzip() # now passes
archive.extract("vertnet_latest_amphibians.csv") # now works
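To see why a wrong size surfaces as a CRC error, here is a minimal sketch that simulates the situation on a tiny in-memory archive instead of the 300 MB download. The member name, the payload, and the understated size of 10 bytes are all made up for illustration. It assumes Python 3, where reading a member stops after file_size bytes and the CRC is then checked over the truncated data.

```python
import io
import zipfile

# Build a small, valid archive in memory (hypothetical member name and payload).
payload = b"id,species\n" + b"1,Rana temporaria\n" * 1000
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("data.csv", payload)

archive = zipfile.ZipFile(buffer)
info = archive.getinfo("data.csv")
true_size = info.file_size

# Simulate a header that understates the uncompressed size.
info.file_size = 10
bad_member = archive.testzip()    # name of the first failing member, or None

# Put the correct size back, analogous to adding 2**32 in the fix above.
info.file_size = true_size
after_fix = archive.testzip()

print(bad_member, after_fix)
```

The stored CRC never changes in this sketch; only the size that the reader trusts does.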
Solution 2:[2]
I was unable to get Python 3 to extract from the archive. Here are some results from an investigation (on Mac OS X) that might be helpful.
Check the health of the archive
Make the file read-only in order to prevent accidental changes:
$ chmod -w vertnet_latest_amphibians.zip
$ ls -lh vertnet_latest_amphibians.zip
-r--r--r-- 1 lawh 2045336417 296M Jan 6 10:10 vertnet_latest_amphibians.zip
Check the archive using zip and unzip:
$ zip -T vertnet_latest_amphibians.zip
test of vertnet_latest_amphibians.zip OK
$ unzip -t vertnet_latest_amphibians.zip
Archive: vertnet_latest_amphibians.zip
testing: VertNet_Amphibia_eml.xml OK
testing: __MACOSX/ OK
testing: __MACOSX/._VertNet_Amphibia_eml.xml OK
testing: vertnet_latest_amphibians.csv OK
testing: __MACOSX/._vertnet_latest_amphibians.csv OK
No errors detected in compressed data of vertnet_latest_amphibians.zip
As also found by @sam-mussmann, 7z reports a CRC error:
$ 7z t vertnet_latest_amphibians.zip
7-Zip [64] 16.02 : Copyright (c) 1999-2016 Igor Pavlov : 2016-05-21
p7zip Version 16.02 (locale=utf8,Utf16=on,HugeFiles=on,64 bits,4 CPUs x64)
Scanning the drive for archives:
1 file, 309726398 bytes (296 MiB)
Testing archive: vertnet_latest_amphibians.zip
--
Path = vertnet_latest_amphibians.zip
Type = zip
Physical Size = 309726398
ERROR: CRC Failed : vertnet_latest_amphibians.csv
Sub items Errors: 1
Archives with Errors: 1
Sub items Errors: 1
My zip and unzip are both rather old; 7z is pretty new:
$ zip -v | head -2
Copyright (c) 1990-2008 Info-ZIP - Type 'zip "-L"' for software license.
This is Zip 3.0 (July 5th 2008), by Info-ZIP.
$ unzip -v | head -1
UnZip 6.00 of 20 April 2009, by Debian. Original by Info-ZIP.
$ 7z --help |head -3
7-Zip [64] 16.02 : Copyright (c) 1999-2016 Igor Pavlov : 2016-05-21
p7zip Version 16.02 (locale=utf8,Utf16=on,HugeFiles=on,64 bits,4 CPUs x64)
Extract
Using unzip:
$ time unzip vertnet_latest_amphibians.zip vertnet_latest_amphibians.csv
Archive: vertnet_latest_amphibians.zip
inflating: vertnet_latest_amphibians.csv
real 0m17.201s
user 0m14.281s
sys 0m2.460s
Extract using Python 2.7.13, using zipfile's command-line interface for brevity:
$ time ~/local/python-2.7.13/bin/python2 -m zipfile -e vertnet_latest_amphibians.zip .
real 0m19.491s
user 0m12.996s
sys 0m5.897s
As you found, Python 3.6.0 (also 3.4.5 and 3.5.2) reports a bad CRC.
Hypothesis 1: The archive contains a bad CRC that zip, unzip and Python 2.7.13 are failing to detect; 7z and Python 3.4-3.6 are all doing the right thing.
Hypothesis 2: The archive is fine; 7z and Python 3.4-3.6 all contain a bug.
Given the relative ages of these tools, I would guess that H1 is correct.
Workaround
If you are not using Windows and trust the contents of the archive, it might be more straightforward to use regular shell commands. Something like:
wget <the-long-url> -O /tmp/vertnet_latest_amphibians.zip
unzip /tmp/vertnet_latest_amphibians.zip vertnet_latest_amphibians.csv
rm -rf /tmp/vertnet_latest_amphibians.zip
Or you could execute unzip from within Python:
import os
os.system('unzip vertnet_latest_amphibians.zip vertnet_latest_amphibians.csv')
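As a variation on the os.system call above, here is a sketch using subprocess (Python 3.5 or later), which raises an exception if unzip exits with a non-zero status. It assumes an Info-ZIP unzip on the PATH; the helper names are hypothetical.

```python
import subprocess


def build_unzip_command(archive_path, member, dest_dir="."):
    """Return the argument list for extracting one member with Info-ZIP unzip."""
    # -o overwrites existing files without prompting; -d sets the output directory.
    return ["unzip", "-o", archive_path, member, "-d", dest_dir]


def unzip_member(archive_path, member, dest_dir="."):
    """Run unzip and raise subprocess.CalledProcessError if it fails."""
    subprocess.run(build_unzip_command(archive_path, member, dest_dir), check=True)
```

Calling unzip_member("vertnet_latest_amphibians.zip", "vertnet_latest_amphibians.csv") would then extract the CSV into the current directory. Passing the arguments as a list avoids shell quoting problems with unusual file names.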
Incidental
It is slightly neater to catch ImportError than to check the version of the Python interpreter:
try:
    from urllib.request import urlretrieve
except ImportError:
    from urllib import urlretrieve
Solution 3:[3]
Building on @Kundor's answer: setting file_size to the 32-bit maximum (2**32 - 1) works, but fails for any file larger than 4 GiB minus 1 byte. Setting it to the ZIP64 maximum (2**64 - 1, that is, 16 EiB minus 1 byte) avoids that limit.
Tested on vertnet_latest_birds.csv (927 MB compressed, 11 GB extracted).
import sys
import zipfile
# urlretrieve lives in a different module on Python 2 and Python 3
if sys.version_info >= (3, 0, 0):
    from urllib.request import urlretrieve
else:
    from urllib import urlretrieve
url = "https://de.iplantcollaborative.org/anon-files//iplant/home/shared/commons_repo/curated/Vertnet_Amphibia_Sep2016/VertNet_Amphibia_Sept2016.zip"
zip_path = "vertnet_latest_amphibians.zip"
file_to_extract = "vertnet_latest_amphibians.csv"
urlretrieve(url, zip_path)
archive = zipfile.ZipFile(zip_path)
if archive.testzip():
    # set the recorded uncompressed size to the ZIP64 maximum
    archive.getinfo(file_to_extract).file_size = (2 ** 64) - 1
open_archive_file = archive.open(file_to_extract, 'r')
# or archive.extract(file_to_extract)
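The same idea can be sketched on a small in-memory archive, streaming the member to its destination in chunks rather than reading it whole, which matters for multi-gigabyte files. The member name and payload are made up, and this assumes Python 3, where reading ends once the compressed stream is exhausted, so an overstated file_size is harmless.

```python
import io
import shutil
import zipfile

# Build a small archive in memory (hypothetical member name and payload).
payload = b"0123456789" * 5000
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("big.csv", payload)

archive = zipfile.ZipFile(buffer)
# Overstate the uncompressed size, as in the snippet above.
archive.getinfo("big.csv").file_size = 2**64 - 1

extracted = io.BytesIO()   # stands in for a file opened with open(path, "wb")
with archive.open("big.csv", "r") as member:
    # Copy in 1 MiB chunks so the whole member is never held in memory.
    shutil.copyfileobj(member, extracted, 1024 * 1024)

print(len(extracted.getvalue()) == len(payload))
```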
Solution 4:[4]
In my case the issue was a wrong ZipInfo.file_size (Python 2.7) compared to the actual size of the file when extracted (as @nick-matteo discovered). The cause of the size mismatch was passing a unicode string to the ZipFile.writestr() method.
The solution was to encode the unicode string to UTF-8 before passing it to writestr():
zf = zipfile.ZipFile(...)
if isinstance(file_contents, unicode):
    file_contents = file_contents.encode("utf8")
zf.writestr("filename.txt", file_contents)
...
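The gap behind this kind of mismatch is that a unicode string's length counts characters, while the archive stores bytes. The sketch below uses a made-up string and an in-memory archive. It illustrates the character/byte difference and the size recorded when encoded bytes are written; it does not reproduce the Python 2.7 behaviour itself.

```python
import io
import zipfile

text = u"na\u00efve caf\u00e9"        # 10 characters, two of them non-ASCII
encoded = text.encode("utf8")         # each of those two takes 2 bytes in UTF-8

buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("filename.txt", encoded)
    recorded_size = zf.getinfo("filename.txt").file_size

print(len(text), len(encoded), recorded_size)   # 10 12 12
```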
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | |
Solution 2 | Lawrence |
Solution 3 | Community |
Solution 4 | Robert Lujo |