'Can I store a Parquet file with a dictionary column having mixed types in their values?
I am trying to store a Python Pandas DataFrame as a Parquet file, but I am experiencing some issues. One of the columns of my Pandas DF contains dictionaries as such:
import pandas as pandas
df = pd.DataFrame({
"ColA": [1, 2, 3],
"ColB": ["X", "Y", "Z"],
"ColC": [
{ "Field": "Value" },
{ "Field": "Value2" },
{ "Field": "Value3" }
]
})
df.to_parquet("test.parquet")
Now, that works perfectly fine, the problem is when one of the nested values of the dictionary has a different type than the rest. For instance:
import pandas as pandas
df = pd.DataFrame({
"ColA": [1, 2, 3],
"ColB": ["X", "Y", "Z"],
"ColC": [
{ "Field": "Value" },
{ "Field": "Value2" },
{ "Field": ["Value3"] }
]
})
df.to_parquet("test.parquet")
This throws the following error:
ArrowInvalid: ('cannot mix list and non-list, non-null values', 'Conversion failed for column ColC with type object')
Notice how, for the last row of the DF, the Field
property of the ColC
dictionary is a list instead of a string.
Is there any workaround to be able to store this DF as a Parquet file?
Solution 1:[1]
ColC
is a UDT (user defined type) with one field called Field
of type Union of String, List of String
.
In theory arrow supports it, but in practice it has a hard time figuring out what the type of ColC
is. Even if you were providing the schema of your data frame explicitly, it wouldn't work because this type of conversion (converting unions from pandas to arrow/parquet) isn't supported yet.
union_type = pa.union(
[pa.field("0",pa.string()), pa.field("1", pa.list_(pa.string()))],
'dense'
)
col_c_type = pa.struct(
[
pa.field('Field', union_type)
]
)
schema=pa.schema(
[
pa.field('ColA', pa.int32()),
pa.field('ColB', pa.string()),
pa.field('ColC', col_c_type),
]
)
df = pd.DataFrame({
"ColA": [1, 2, 3],
"ColB": ["X", "Y", "Z"],
"ColC": [
{ "Field": "Value" },
{ "Field": "Value2" },
{ "Field": ["Value3"] }
]
})
pa.Table.from_pandas(df, schema)
This gives you this error:
('Sequence converter for type union[dense]<0: string=0, 1: list<item: string>=1> not implemented', 'Conversion failed for column ColC with type object'
Even if you create the arrow table manually it won't be able to convert it to parquet (again, union are not supported).
import io
import pyarrow.parquet as pq
col_a = pa.array([1, 2, 3], pa.int32())
col_b = pa.array(["X", "Y", "Z"], pa.string())
xs = pa.array(["Value", "Value2", None], type=pa.string())
ys = pa.array([None, None, ["value3"]], type=pa.list_(pa.string()))
types = pa.array([0, 0, 1], type=pa.int8())
col_c = pa.UnionArray.from_sparse(types, [xs, ys])
table = pa.Table.from_arrays(
[col_a, col_b, col_c],
schema=pa.schema([
pa.field('ColA', col_a.type),
pa.field('ColB', col_b.type),
pa.field('ColC', col_c.type),
])
)
with io.BytesIO() as buffer:
pq.write_table(table, buffer)
Unhandled type for Arrow to Parquet schema conversion: sparse_union<0: string=0, 1: list<item: string>=1>
I think your only option for now it to use a struct where fields have got different names for string value and list of string values.
df = pd.DataFrame({
"ColA": [1, 2, 3],
"ColB": ["X", "Y", "Z"],
"ColC": [
{ "Field1": "Value" },
{ "Field1": "Value2" },
{ "Field2": ["Value3"] }
]
})
df.to_parquet('/tmp/hello')
Solution 2:[2]
I just had the same problem and fixed by converting ColC
to string:
df['ColC'] = df['ColC'].astype(str)
I am not sure this would not create a problem in the future, don't quote me.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | 0x26res |
Solution 2 | Henrique Brisola |