Efficient way to parse a file with different JSON schemas in Spark

I am trying to find the best way to parse a JSON file with an inconsistent schema (although the schema for any given "type" is known and consistent) in Spark, in order to split it by "type" and store it as Parquet:

{"type":1, "data":{"data_of_type1" : 1}} 
{"type":2, "data":{"data_of_type2" : "can be any type"}} 
{"type":3, "data":{"data_of_type3" : "value1", "anotherone": 1}}

I also want to keep I/O down because I am dealing with huge volumes, so I don't want to first split the input by type and then process each type independently.
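The split-first approach I want to avoid looks roughly like this (continuing from the df above; the type-to-DDL mapping and the output layout are illustrative). Without caching, every write re-scans the whole raw input:

    from pyspark.sql.functions import col, from_json

    # Known, consistent schema per type (example DDL strings).
    schemas = {
        1: "data_of_type1 INT",
        2: "data_of_type2 STRING",
        3: "data_of_type3 STRING, anotherone INT",
    }

    # One filter-parse-write pass per type: each write triggers its own
    # scan of the raw input file.
    for t, ddl in schemas.items():
        (df.filter(col("type") == t)
           .withColumn("data", from_json(col("data"), ddl))
           .write.mode("overwrite")
           .parquet(f"output/type={t}"))  # illustrative output layout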

Current idea (not working):

  1. Load the JSON and parse only the "type" field ("data" is loaded as a plain string)
  2. Attach to each row the corresponding schema (a DDL string in a new column)
  3. Try to parse "data" with the DDL from the previous column (using from_json)
    => Throws an error: Schema should be specified in DDL format as a string literal or output of the schema_of_json/schema_of_csv functions instead of schema
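
Steps 2 and 3 as code, continuing from the df above (a sketch of my failing attempt; the DDL strings match the sample records):

    from pyspark.sql.functions import col, from_json, when

    # Step 2: attach the matching DDL string to each row.
    with_ddl = df.withColumn(
        "ddl",
        when(col("type") == 1, "data_of_type1 INT")
        .when(col("type") == 2, "data_of_type2 STRING")
        .otherwise("data_of_type3 STRING, anotherone INT"),
    )

    # Step 3: from_json only accepts a foldable schema (a literal or the
    # output of schema_of_json/schema_of_csv), so passing the per-row
    # "ddl" column raises the AnalysisException quoted above.
    parsed = with_ddl.withColumn("parsed", from_json(col("data"), col("ddl")))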

Do you have any idea whether this is possible?


