How does Pyarrow read_csv handle different file encodings?
I have a .dat file that I had been reading with pd.read_csv, and I always needed to pass encoding="latin" for it to read without error. When I use pyarrow.csv.read_csv, I don't see a parameter to select the file's encoding, but it still works without issue (which is great, but I don't understand why, or whether it only handles certain encodings automatically). The only options I'm setting are delimiter="|" (with ParseOptions) and auto_dict_encode=True (with ConvertOptions).
How is pyarrow handling different encoding types?
Solution 1:[1]
pyarrow currently has no functionality to deal with different encodings, and assumes UTF8 for string/text data.
It doesn't raise an error because pyarrow reads any column containing non-UTF8 strings as a "binary" type column instead of a "string" type column.
A small example:
# writing a small file with latin encoding
with open("test.csv", "w", encoding="latin") as f:
    f.writelines(["col1,col2\n", "u,ù"])
Reading with pyarrow gives string for the first column (which only contains ASCII characters, thus also valid UTF8), but reads the second column as binary:
>>> from pyarrow import csv
>>> csv.read_csv("test.csv")
pyarrow.Table
col1: string
col2: binary
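The difference between the two columns comes down to the bytes on disk. The sketch below uses only the Python standard library to show why "u" passes as UTF8 while the latin-encoded "ù" does not. The helper name is_valid_utf8 is made up for illustration, and this is not how pyarrow implements its check internally.

```python
# "u" is plain ASCII, so its latin-1 and UTF-8 encodings are the same byte.
ascii_bytes = "u".encode("latin1")
assert ascii_bytes == "u".encode("utf-8")

# "ù" is the single byte 0xf9 in latin-1, but two bytes in UTF-8.
latin_bytes = "ù".encode("latin1")
utf8_bytes = "ù".encode("utf-8")


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


print(latin_bytes, is_valid_utf8(latin_bytes))
print(utf8_bytes, is_valid_utf8(utf8_bytes))
print(ascii_bytes, is_valid_utf8(ascii_bytes))
```

A column whose bytes all pass this kind of check can be exposed as string; one that contains a lone 0xf9 cannot, which matches the binary type shown above.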
With pandas you do get an error by default (because pandas has no binary data type and tries to read all text columns as Python strings, assuming UTF8):
>>> import pandas as pd
>>> pd.read_csv("test.csv")
...
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf9 in position 0: invalid start byte
>>> pd.read_csv("test.csv", encoding="latin")
  col1 col2
0    u    ù
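If the pyarrow version in use assumes UTF8 and a string column is needed, one possible workaround is to re-encode the file as UTF8 before reading it. This is a minimal sketch using only the standard library. The function name transcode_to_utf8 and the file names are made up for illustration, and it assumes the source encoding is already known.

```python
import os
import tempfile


def transcode_to_utf8(src_path, dst_path, src_encoding="latin1"):
    """Copy a text file, re-encoding it from src_encoding to UTF-8."""
    # newline="" leaves the original line endings untouched on both sides.
    with open(src_path, "r", encoding=src_encoding, newline="") as src, \
            open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for line in src:
            dst.write(line)


# Recreate the small latin-encoded file in a temporary directory.
tmp_dir = tempfile.mkdtemp()
latin_path = os.path.join(tmp_dir, "test.csv")
utf8_path = os.path.join(tmp_dir, "test_utf8.csv")
with open(latin_path, "w", encoding="latin1") as f:
    f.writelines(["col1,col2\n", "u,ù"])

transcode_to_utf8(latin_path, utf8_path)

# The re-encoded file now contains only valid UTF-8 bytes.
with open(utf8_path, "rb") as f:
    utf8_content = f.read()
print(utf8_content)
```

The re-encoded file can then be passed to a reader that assumes UTF8. The cost is an extra pass over the data and a second copy on disk.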
Solution 2:[2]
It's now possible to specify encodings with pyarrow.csv.read_csv. According to the pyarrow docs for read_csv:
The encoding can be changed using the ReadOptions class.
A minimal example follows:
from pyarrow import csv
options = csv.ReadOptions(encoding='latin1')
table = csv.read_csv('path/to/file', read_options=options)
From what I can tell, the functionality was added in this PR, so it should work starting with pyarrow 1.0.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | joris |
| Solution 2 | daviewales |