'How to Define Schema for Nested attributes with existing pyspark dataframe with string type and then print with json

I have a data frame with column model_code as string type and I have convert column to model.code and when I print df.toJSON() then it is {"model.code":"xyz"} but I have requirement to print the json like- {"model":{"code":"xyz"}}. How can achieve this using either pyspark or pandas?

Sample dataframe:

model.code model.name
700 Desktop
250 Tablet

expected output:

[{"model":{"code":"700","name":"Desktop"}},{"model":{"code":"250","name":"Tablet"}}]


Solution 1:[1]

One way to do it is using toJSON() and then use a map on the output RDD to manipulate the JSON string.

df = spark.createDataFrame(data=[(700, 'Desktop'), (250, 'Tablet')], schema = ['model.code', 'model.name'])
>>> df.show()
+----------+----------+
|model.code|model.name|
+----------+----------+
|       700|   Desktop|
|       250|    Tablet|
+----------+----------+

>>> import json
>>> df.toJSON() \
...   .map(lambda x: json.loads('{"model":'+x.replace('model.','')+'}')) \
...   .collect()
[{'model': {'code': 700, 'name': 'Desktop'}}, {'model': {'code': 250, 'name': 'Tablet'}}]

Solution 2:[2]

You need to construct the nested struct and then use the to_json function to convert to a json string.

df = df.select(F.collect_list(F.to_json(
    F.struct(F.struct(F.col('`model.code`').alias('code'), F.col('`model.name`').alias('name')).alias('model'))))
)
df.show(truncate=False)

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 user2314737
Solution 2 过过招