'How to Define Schema for Nested attributes with existing pyspark dataframe with string type and then print with json
I have a data frame with column model_code as string type and I have convert column to model.code and when I print df.toJSON()
then it is {"model.code":"xyz"}
but I have requirement to print the json like- {"model":{"code":"xyz"}}
. How can achieve this using either pyspark or pandas?
Sample dataframe:
model.code | model.name |
---|---|
700 | Desktop |
250 | Tablet |
expected output:
[{"model":{"code":"700","name":"Desktop"}},{"model":{"code":"250","name":"Tablet"}}]
Solution 1:[1]
One way to do it is using toJSON()
and then use a map
on the output RDD to manipulate the JSON string.
df = spark.createDataFrame(data=[(700, 'Desktop'), (250, 'Tablet')], schema = ['model.code', 'model.name'])
>>> df.show()
+----------+----------+
|model.code|model.name|
+----------+----------+
| 700| Desktop|
| 250| Tablet|
+----------+----------+
>>> import json
>>> df.toJSON() \
... .map(lambda x: json.loads('{"model":'+x.replace('model.','')+'}')) \
... .collect()
[{'model': {'code': 700, 'name': 'Desktop'}}, {'model': {'code': 250, 'name': 'Tablet'}}]
Solution 2:[2]
You need to construct the nested struct
and then use the to_json
function to convert to a json string.
df = df.select(F.collect_list(F.to_json(
F.struct(F.struct(F.col('`model.code`').alias('code'), F.col('`model.name`').alias('name')).alias('model'))))
)
df.show(truncate=False)
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | user2314737 |
Solution 2 | 过过招 |