fileinput.hook_compressed gives me strings sometimes, bytes other times
I'm trying to read lines from a number of files. Some are gzipped, and others are plain text files. In Python 2.7, I have been using the following code and it worked:
for line in fileinput.input(filenames, openhook=fileinput.hook_compressed):
    match = REGEX.match(line)
    if match:
        # do things with line...
Now I moved to Python 3.8, and it still works ok with plain text files, but when it encounters gzipped files I get the following error:
TypeError: cannot use a string pattern on a bytes-like object
What's the best way to fix this? I know I can check if line is a bytes object and decode it into a string, but I would rather use some flag that always iterates lines as strings, if possible. I would also prefer to write code that works with both Python 2 and 3.
Solution 1:[1]
With hook_compressed as the open hook, fileinput.input does fundamentally different things depending on whether it gets a gzipped file or not. Plain files are opened with the regular open, where the mode 'r' means text mode. Gzipped files are opened with gzip.open, which treats that same 'r' as binary (its default is 'rb'). A binary default is sensible for compressed files of unknown content.
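The difference in defaults can be checked directly. The following is a minimal sketch for Python 3; the function name first_line_types and the sample file names are made up for illustration.

```python
import gzip
import os
import tempfile


def first_line_types(text="hello\n"):
    """Return the types of the first line read three different ways."""
    with tempfile.TemporaryDirectory() as tmpdir:
        plain_path = os.path.join(tmpdir, "sample.txt")
        gz_path = os.path.join(tmpdir, "sample.txt.gz")
        with open(plain_path, "w") as f:
            f.write(text)
        with gzip.open(gz_path, "wb") as f:
            f.write(text.encode("utf-8"))

        with open(plain_path, "r") as f:       # 'r' is text mode for open
            plain_line = f.readline()
        with gzip.open(gz_path, "r") as f:     # 'r' is binary mode for gzip.open
            gz_line = f.readline()
        with gzip.open(gz_path, "rt") as f:    # text mode must be requested with 't'
            gz_text_line = f.readline()
    return type(plain_line), type(gz_line), type(gz_text_line)
```

Calling first_line_types() should show str for the plain file, bytes for the gzipped file opened with 'r', and str again once 'rt' is requested explicitly.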
You cannot simply pass mode='rt' to force text mode, because fileinput.FileInput artificially restricts the mode argument. From the code of the __init__ method:
# restrict mode argument to reading modes
if mode not in ('r', 'rU', 'U', 'rb'):
    raise ValueError("FileInput opening mode must be one of "
                     "'r', 'rU', 'U' and 'rb'")
if 'U' in mode:
    import warnings
    warnings.warn("'U' mode is deprecated",
                  DeprecationWarning, 2)
self._mode = mode
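A quick way to see this check in action is to try constructing a FileInput with different modes. This is only a sketch: mode_is_accepted is a hypothetical helper and "unused.txt" is a placeholder name. The file is never opened, because the constructor only validates its arguments.

```python
import fileinput


def mode_is_accepted(mode):
    """Return True if fileinput.FileInput accepts the given mode string."""
    try:
        # The constructor validates the mode but does not open any file yet.
        fileinput.FileInput(files=["unused.txt"], mode=mode)
    except ValueError:
        return False
    return True
```

Here mode_is_accepted('r') and mode_is_accepted('rb') return True, while mode_is_accepted('rt') returns False.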
Because of this restriction, there are a few possible workarounds.
Option 1
Set the _mode attribute after __init__. To avoid adding extra lines of code to your usage, you can subclass fileinput.FileInput and use the class directly:
import fileinput

class TextFileInput(fileinput.FileInput):
    def __init__(self, *args, **kwargs):
        # Hold back a text mode such as 'rt', which __init__ would reject
        if 'mode' in kwargs and 't' in kwargs['mode']:
            mode = kwargs.pop('mode')
        else:
            mode = ''
        super().__init__(*args, **kwargs)
        # Apply the text mode after validation has run
        if mode:
            self._mode = mode

for line in TextFileInput(filenames, openhook=fileinput.hook_compressed, mode='rt'):
    ...
Option 2
Messing with undocumented leading-underscore attributes is pretty hacky, so instead, you can create a custom hook for compressed files. This is actually pretty easy, since you can use the code for fileinput.hook_compressed as a template:
import os

def my_hook_compressed(filename, mode):
    # Force text mode unless binary was explicitly requested
    if 'b' not in mode:
        mode += 't'
    ext = os.path.splitext(filename)[1]
    if ext == '.gz':
        import gzip
        return gzip.open(filename, mode)
    elif ext == '.bz2':
        import bz2
        return bz2.open(filename, mode)
    else:
        return open(filename, mode)
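To check that a hook of this kind yields str lines for both kinds of file, something like the following can be run on Python 3. It is a sketch under stated assumptions: text_hook is a hypothetical, trimmed-down variant that only handles .gz, and the file names and contents are invented for the demonstration.

```python
import fileinput
import gzip
import os
import tempfile


def text_hook(filename, mode):
    # Hypothetical, trimmed-down hook: handles only .gz and forces text mode.
    if 'b' not in mode and 't' not in mode:
        mode += 't'
    if os.path.splitext(filename)[1] == '.gz':
        return gzip.open(filename, mode)
    return open(filename, mode)


def demo():
    """Read one plain and one gzipped file through the hook; return the lines."""
    with tempfile.TemporaryDirectory() as tmpdir:
        plain_path = os.path.join(tmpdir, 'a.txt')
        gz_path = os.path.join(tmpdir, 'b.txt.gz')
        with open(plain_path, 'w') as f:
            f.write('plain\n')
        with gzip.open(gz_path, 'wb') as f:
            f.write(b'zipped\n')
        # Both files go through the same hook and come back as str lines.
        with fileinput.input([plain_path, gz_path], openhook=text_hook) as stream:
            return list(stream)
```

Every element of the list returned by demo() is a str, so a string regex can be applied to lines from either file without a TypeError.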
Option 3
Finally, you can always decode your bytes to unicode strings. This is clearly not the preferable option.
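For completeness, the decoding fallback might look like the sketch below. The helper name to_text is hypothetical, and UTF-8 is an assumption about the file contents, so adjust the encoding to match your data.

```python
def to_text(line, encoding="utf-8"):
    """Return line as text, decoding it only if it is a bytes object."""
    if isinstance(line, bytes):
        # Assumption: the compressed files contain UTF-8 encoded text.
        return line.decode(encoding)
    return line
```

In the loop from the question this would be used as REGEX.match(to_text(line)).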
Solution 2:[2]
Extending the answer by Mad Physicist to include xz and zst extensions. Note that the .zst branch needs the third-party zstandard package.
import os

def my_hook_compressed(filename, mode):
    """hook for fileinput so we can also handle compressed files seamlessly"""
    # Force text mode unless binary was explicitly requested
    if 'b' not in mode:
        mode += 't'
    ext = os.path.splitext(filename)[1]
    if ext == '.gz':
        import gzip
        return gzip.open(filename, mode)
    elif ext == '.bz2':
        import bz2
        return bz2.open(filename, mode)
    elif ext == '.xz':
        import lzma
        return lzma.open(filename, mode)
    elif ext == '.zst':
        import zstandard, io
        # This branch always returns text, regardless of mode
        compressed = open(filename, 'rb')
        decompressor = zstandard.ZstdDecompressor()
        stream_reader = decompressor.stream_reader(compressed)
        return io.TextIOWrapper(stream_reader)
    else:
        return open(filename, mode)
I have not tested on 2.7, but this works with 3.8+:
for line in fileinput.input(filenames, openhook=my_hook_compressed):
    ...
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | |
Solution 2 | lab115 |