python gzip.open: zlib.error: Error -3 while decompressing data: too many length or distance symbols
I want to decompress a huge gz file (the Wikidata JSON dump latest-all.json.gz, 104 GB compressed) on the fly in Python with gzip.open.
It works fine for a while. However, after reading 39.7 million lines it yields the error:
zlib.error: Error -3 while decompressing data: too many length or distance symbols
The function where I do the decompressing and reading looks like this:
import gzip
import json
...
def wikidata(filename):
    with gzip.open(filename, mode='rt') as f:
        f.read(2) # skip first two bytes: "{\n"
        for line in f:
            try:
                yield json.loads(line.rstrip(',\n'))
            except json.decoder.JSONDecodeError:
                continue
The error in full is:
Traceback (most recent call last):
  File "parse.py", line 95, in <module>
    for line in lines:
  File "parse.py", line 21, in wikidata
    for line in f:
  File "/usr/lib/python3.8/gzip.py", line 305, in read1
    return self._buffer.read1(size)
  File "/usr/lib/python3.8/_compression.py", line 68, in readinto
    data = self.read(len(byte_view))
  File "/usr/lib/python3.8/gzip.py", line 487, in read
    uncompress = self._decompressor.decompress(buf, size)
zlib.error: Error -3 while decompressing data: too many length or distance symbols
What could be causing this, and how can I solve it?
Solution 1:[1]
It means that the compressed data is corrupted at that point, or some short distance before it. The only way to solve the problem is to replace the input with a gzip file that is not corrupted.
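To confirm that the file itself is damaged, and not the parsing code, it can help to decompress it once while discarding the output. The following is a minimal sketch, not part of the original answer: `find_gzip_corruption` is a hypothetical helper name, and the offset it reports is only approximate because the gzip reader buffers its input.

```python
import gzip
import zlib


def find_gzip_corruption(path, chunk_size=1 << 20):
    """Decompress `path` in chunks and throw the output away.

    Returns None if the whole file decompresses cleanly. Otherwise returns
    (approx_compressed_offset, error_description). The offset is how far the
    raw file had been read when the error surfaced, so the damage lies at or
    somewhat before that position.
    """
    with open(path, "rb") as raw:
        try:
            with gzip.GzipFile(fileobj=raw, mode="rb") as gz:
                while gz.read(chunk_size):
                    pass
        except (zlib.error, EOFError, OSError) as exc:
            # zlib.error: invalid deflate data (as in the traceback above)
            # EOFError:   file ends before the end-of-stream marker
            # OSError:    bad header or CRC/length mismatch
            return raw.tell(), repr(exc)
    return None


if __name__ == "__main__":
    import sys

    problem = find_gzip_corruption(sys.argv[1])
    print("file is intact" if problem is None else problem)
```

If this reports an error, the download is damaged and has to be fetched again. Running `gzip -t latest-all.json.gz` on the command line performs an equivalent integrity test.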
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source | 
|---|---|
| Solution 1 | Mark Adler | 
