Efficient way to parse a file with different JSON schemas in Spark
I am trying to find the best way to parse a JSON file with an inconsistent schema (though the schema of each given type is known and consistent) in Spark, in order to split it by "type" and store each part as Parquet:
{"type":1, "data":{"data_of_type1" : 1}}
{"type":2, "data":{"data_of_type2" : "can be any type"}}
{"type":3, "data":{"data_of_type3" : "value1", "anotherone": 1}}
I also want to minimize I/O because I am dealing with huge volumes, so I don't want to first split the file by type and then process each type independently.
Current idea (not working; see the sketch after this list):
- Load the JSON and parse only "type", keeping "data" as a raw string (as loaded above)
- Attach to each row the corresponding schema, as a DDL string in a new column
- Parse "data" with from_json, passing the DDL column from the previous step as the schema
=> Throws: Schema should be specified in DDL format as a string literal or output of the schema_of_json/schema_of_csv functions instead of schema
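Here is a minimal sketch of the failing step, building on the `df` loaded above (the DDL strings are illustrative placeholders for the real per-type schemas):

```python
from pyspark.sql import functions as F

# Known DDL schema for each type (illustrative values).
ddl_by_type = {
    1: "data_of_type1 INT",
    2: "data_of_type2 STRING",
    3: "data_of_type3 STRING, anotherone INT",
}

# Build a literal map column: type -> DDL string.
pairs = [item for k, v in ddl_by_type.items() for item in (F.lit(k), F.lit(v))]
ddl_map = F.create_map(*pairs)

parsed = (
    df.withColumn("ddl", ddl_map[F.col("type")])
      # from_json accepts a schema Column, but only a foldable one (a literal),
      # so this fails analysis with:
      # AnalysisException: Schema should be specified in DDL format as a string
      # literal or output of the schema_of_json/schema_of_csv functions ...
      .withColumn("parsed", F.from_json(F.col("data"), F.col("ddl")))
)
```

The map lookup itself resolves fine; `from_json` rejects the schema argument because, once it refers to another column, it is no longer a foldable literal.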
Do you have any idea whether this is possible?