Comparing two StructType schemas with differing numbers of columns
In Spark 3.1.1+
Is there a way to diff two StructType schemas that have different numbers of columns, where a column with the same name can also have a different type in each schema?
For example:
Schema 1:
StructType {
column_a: Int,
column_b: StructType {
column_c: Int,
column_d: String
}
}
Schema 2:
StructType {
column_a: String
}
So I essentially want to know that column_a
was updated to a different type, and column_b
(along with everything under it) was removed.
Solution 1:
Just use the diff
function available on standard Scala arrays.
It returns the elements of the first array that are not present in the second. Since StructFields are compared by equality (name, type, and nullability together), a field whose type changed shows up in the diff just like a field that was removed.
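The semantics of diff can be seen with plain Scala, no Spark required. The Field case class below is a hypothetical stand-in for Spark's StructField, just to show how equality-based comparison behaves:

```scala
// Array.diff keeps elements of the first array that do not appear in the
// second, comparing by equality. Field is a hypothetical stand-in for
// Spark's StructField.
case class Field(name: String, dataType: String)

val schema1 = Array(Field("column_a", "integer"), Field("column_e", "integer"))
val schema2 = Array(Field("column_a", "string"), Field("column_e", "integer"))

// column_a differs by type, so it survives the diff even though the
// name exists in both schemas; column_e is identical and is dropped.
val changedOrRemoved = schema1.diff(schema2)
// changedOrRemoved: Array(Field(column_a,integer))
```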
Tree Structure
schema1.printTreeString()
/*
root
|-- column_a: integer (nullable = true)
|-- column_b: struct (nullable = true)
| |-- column_c: integer (nullable = true)
| |-- column_d: string (nullable = true)
|-- column_e: integer (nullable = true)
*/
schema2.printTreeString()
/*
root
|-- column_a: string (nullable = true)
|-- column_e: integer (nullable = true)
*/
Perform Diff
val diff = schema1.fields.diff(schema2.fields)
val diffSchema = StructType(diff)
diffSchema.printTreeString()
/*
root
|-- column_a: integer (nullable = true)
|-- column_b: struct (nullable = true)
| |-- column_c: integer (nullable = true)
| |-- column_d: string (nullable = true)
*/
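If you also need to distinguish columns whose type changed from columns that were removed outright, one approach (a sketch going beyond the original answer) is to partition the diff by whether the field name still exists in the second schema:

```scala
import org.apache.spark.sql.types._

val schema1 = StructType(Seq(
  StructField("column_a", IntegerType),
  StructField("column_b", StructType(Seq(
    StructField("column_c", IntegerType),
    StructField("column_d", StringType)))),
  StructField("column_e", IntegerType)))

val schema2 = StructType(Seq(
  StructField("column_a", StringType),
  StructField("column_e", IntegerType)))

val names2 = schema2.fieldNames.toSet

// Fields in schema1 that are missing or different in schema2, split by
// whether the name still exists (type changed) or is gone (removed).
val (typeChanged, removed) =
  schema1.fields.diff(schema2.fields).partition(f => names2.contains(f.name))

// typeChanged: column_a (integer in schema1, string in schema2)
// removed:     column_b (along with everything under it)
```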
Refer to the Scala collections documentation for more information on the diff function.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow