How to flatten a nested dataclass while serializing to a pandas DataFrame?
I have a dataclass that contains another dataclass as one of its fields:
from dataclasses import dataclass

@dataclass
class Bar:
    abc: int
    bed: int
    asd: int

@dataclass
class Foo:
    xy: int
    yz: Bar
Then I try to serialize it to CSV with pandas like this:
dataset = [Foo(xy=1, yz=Bar(abc=1, bed=2, asd=3))]
pd_dataset = pandas.DataFrame(vars(row) for row in dataset)
pd_dataset.to_csv('dataset_example.csv', index=False)
but the result is not what I want. To be precise, I currently get:
xy,yz
1,"Bar(abc=1, bed=2, asd=3)"
and I want:
xy,yz_abc,yz_bed,yz_asd
1,1,2,3
Can you help me get this right? I tried to write my own serialization function and do something like:
pandas.DataFrame(asdict(row, dict_factory=row_to_dict) for row in dataset)
but I can't work out how to write row_to_dict correctly.
Solution 1:[1]
There is no need for an external library, as one of the other answers uses; pandas provides everything you need in the form of pd.json_normalize:
>>> import pandas as pd
... from dataclasses import asdict, dataclass
...
... @dataclass
... class Bar:
...     abc: int
...     bed: int
...     asd: int
...
... @dataclass
... class Foo:
...     xy: int
...     yz: Bar
...
... dataset = [
...     Foo(xy=1, yz=Bar(abc=1, bed=2, asd=3)),
...     Foo(xy=10, yz=Bar(abc=10, bed=20, asd=30)),
... ]
>>> dataset
[Foo(xy=1, yz=Bar(abc=1, bed=2, asd=3)),
Foo(xy=10, yz=Bar(abc=10, bed=20, asd=30))]
>>> df = pd.json_normalize(asdict(obj) for obj in dataset)
>>> df
   xy  yz.abc  yz.bed  yz.asd
0   1       1       2       3
1  10      10      20      30
>>> print(df.to_csv(index=False))
xy,yz.abc,yz.bed,yz.asd
1,1,2,3
10,10,20,30
I personally prefer the default "." separator shown above, but if you feel strongly about underscores, pandas has you covered there too:
>>> pd.json_normalize((asdict(obj) for obj in dataset), sep="_")
   xy  yz_abc  yz_bed  yz_asd
0   1       1       2       3
1  10      10      20      30
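The same call also handles nesting deeper than one level. The following is a minimal sketch, assuming hypothetical Leaf, Middle and Outer dataclasses that are not part of the question; each level of nesting adds one more segment to the column name.

```python
from dataclasses import asdict, dataclass

import pandas as pd


# Hypothetical classes, used only to illustrate nesting deeper than one level.
@dataclass
class Leaf:
    c: int


@dataclass
class Middle:
    b: int
    leaf: Leaf


@dataclass
class Outer:
    a: int
    inner: Middle


rows = [Outer(a=1, inner=Middle(b=2, leaf=Leaf(c=3)))]

# asdict() turns nested dataclasses into nested dicts;
# json_normalize() then joins the nested keys with `sep`.
deep_df = pd.json_normalize([asdict(row) for row in rows], sep="_")
print(sorted(deep_df.columns))
```

The columns here should be a, inner_b and inner_leaf_c, so no per-class handling is needed as the model grows.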
Solution 2:[2]
Building the desired keys by hand from vars() of the nested Bar instance does what you want.
dataset = [Foo(xy=1, yz=Bar(abc=1, bed=2, asd=3))]
res = []
for obj in dataset:
    d = {}
    for k, v in vars(obj).items():
        if isinstance(v, Bar):
            # Expand the nested Bar into prefixed keys such as 'yz_abc'.
            for k_, v_ in vars(v).items():
                d[f'{k}_{k_}'] = v_
        else:
            d[k] = v
    res.append(d)
print(res)
'''
[{'xy': 1, 'yz_abc': 1, 'yz_bed': 2, 'yz_asd': 3}]
'''
pd_dataset = pd.DataFrame.from_records(res)
print(pd_dataset)
'''
   xy  yz_abc  yz_bed  yz_asd
0   1       1       2       3
'''
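The loop above is tied to Bar and to one level of nesting. A possible generalisation is sketched below; the helper name flatten_dataclass and its prefix and sep parameters are made up for this illustration. It uses dataclasses.fields and dataclasses.is_dataclass from the standard library to recurse into any nested dataclass instance.

```python
from dataclasses import dataclass, fields, is_dataclass


@dataclass
class Bar:
    abc: int
    bed: int
    asd: int


@dataclass
class Foo:
    xy: int
    yz: Bar


def flatten_dataclass(obj, prefix="", sep="_"):
    """Return a flat dict of obj's fields, recursing into nested dataclasses.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    flat = {}
    for f in fields(obj):
        value = getattr(obj, f.name)
        key = f"{prefix}{f.name}"
        # is_dataclass() is also true for dataclass types, so exclude classes.
        if is_dataclass(value) and not isinstance(value, type):
            flat.update(flatten_dataclass(value, prefix=f"{key}{sep}", sep=sep))
        else:
            flat[key] = value
    return flat


dataset = [Foo(xy=1, yz=Bar(abc=1, bed=2, asd=3))]
records = [flatten_dataclass(row) for row in dataset]
print(records)
# [{'xy': 1, 'yz_abc': 1, 'yz_bed': 2, 'yz_asd': 3}]
```

The resulting records list can be passed to pd.DataFrame.from_records exactly like res above.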
Solution 3:[3]
OK, I figured it out myself shortly after posting the question. To solve this problem I installed a library called flatten-dict and used it like this:
from flatten_dict import flatten
pd_dataset = pandas.DataFrame(flatten(asdict(row), reducer='underscore') for row in dataset)
If there's room for improvement in this approach, let me know, but I find it really clean and simple.
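For completeness, the dict_factory idea from the question can also be made to work without any extra dependency. The sketch below is one possible row_to_dict, not the only way to write it. It relies on asdict() converting nested dataclasses first, so by the time the factory sees the outer fields, nested values are already dicts. One caveat: a field that holds an ordinary dict would be flattened in the same way.

```python
from dataclasses import asdict, dataclass


@dataclass
class Bar:
    abc: int
    bed: int
    asd: int


@dataclass
class Foo:
    xy: int
    yz: Bar


def row_to_dict(pairs):
    """dict_factory for asdict(): merge nested dicts into prefixed keys.

    One possible implementation of the function named in the question.
    """
    flat = {}
    for key, value in pairs:
        if isinstance(value, dict):
            # Nested dataclasses arrive here already converted (and flattened).
            for sub_key, sub_value in value.items():
                flat[f"{key}_{sub_key}"] = sub_value
        else:
            flat[key] = value
    return flat


dataset = [Foo(xy=1, yz=Bar(abc=1, bed=2, asd=3))]
flat_rows = [asdict(row, dict_factory=row_to_dict) for row in dataset]
print(flat_rows)
# [{'xy': 1, 'yz_abc': 1, 'yz_bed': 2, 'yz_asd': 3}]
```

Passing flat_rows to pandas.DataFrame then gives the columns xy, yz_abc, yz_bed and yz_asd that the question asks for.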
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | Thrastylon |
Solution 2 | Ynjxsjmh |
Solution 3 | Gustaw Ohler |