Comparing two StructType schemas with differing numbers of columns

In Spark 3.1.1+

Is there a way to diff two StructType schemas if they have a different number of columns, where column types can also differ for the same column name?

For example:

Schema 1:

StructType {
  column_a: Int,
  column_b: StructType {
    column_c: Int,
    column_d: String
  }
}

Schema 2:

StructType {
  column_a: String
}

So I essentially want to know that column_a was updated to a different type, and column_b (along with everything under it) was removed.
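For concreteness, the two schemas above could be written in Spark Scala as follows (a sketch; nullability is left at the default of true, since the question does not specify it):

```scala
import org.apache.spark.sql.types._

val schema1 = StructType(Seq(
  StructField("column_a", IntegerType),
  StructField("column_b", StructType(Seq(
    StructField("column_c", IntegerType),
    StructField("column_d", StringType)
  )))
))

val schema2 = StructType(Seq(
  StructField("column_a", StringType)
))
```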



Solution 1:[1]

Just use the diff method available on standard Scala arrays. Applied to the fields arrays of the two schemas, it returns the fields that appear in the first schema but not in the second. Because StructField equality covers the name, type, nullability, and metadata, a field whose type changed (like column_a) also appears in the result.

Tree Structure


schema1.printTreeString()
/*
root
|-- column_a: integer (nullable = true)
|-- column_b: struct (nullable = true)
|    |-- column_c: integer (nullable = true)
|    |-- column_d: string (nullable = true)
|-- column_e: integer (nullable = true)
*/

schema2.printTreeString()
/*
root
|-- column_a: string (nullable = true)
|-- column_e: integer (nullable = true)
*/

Perform Diff

val diff = schema1.fields.diff(schema2.fields)
val diffSchema = StructType(diff)

diffSchema.printTreeString()
/*
root
 |-- column_a: integer (nullable = true)
 |-- column_b: struct (nullable = true)
 |    |-- column_c: integer (nullable = true)
 |    |-- column_d: string (nullable = true)
*/
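To see why column_a appears in the diff even though its name exists in both schemas: StructField is a case class, so two fields are equal only if every attribute matches, including the data type. A minimal sketch (using just the two fields from the printed schemas that are needed to show this):

```scala
import org.apache.spark.sql.types._

val s1 = StructType(Seq(
  StructField("column_a", IntegerType),
  StructField("column_e", IntegerType)
))
val s2 = StructType(Seq(
  StructField("column_a", StringType),
  StructField("column_e", IntegerType)
))

// column_a differs by type, so it survives the diff;
// column_e is identical in both schemas and is dropped.
val changed = s1.fields.diff(s2.fields)
```

Note that this comparison is shallow: a nested struct such as column_b is compared as one whole field, so any change inside it causes the entire struct to show up in the diff.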

Refer to the Scala collections API documentation for more info on the diff method.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
