Comparing schemas of DataFrames using PySpark

I have a DataFrame (df1). To show its schema I use:

from pyspark.sql.functions import *
df1.printSchema()

And I get the following result:

#root
# |-- name: string (nullable = true)
# |-- age: long (nullable = true)

Sometimes the schema changes (the column type or name):

df2.printSchema()


#root
# |-- name: array (nullable = true)
# |-- gender: integer (nullable = true)
# |-- age: long (nullable = true)

I would like to compare the two schemas (df1 and df2) and get only the differences in types and column names (sometimes a column can also move to another position). The result should be a table (or DataFrame) like this:

   column    df1       df2        diff
   name      string    array      type
   gender    N/A       integer    new column

(The age column is the same and didn't change. If a column is dropped, the diff should say 'omitted'.) How can I do this efficiently if each DataFrame has many columns?



Solution 1:[1]

You can try creating two pandas DataFrames with the column metadata of df1 and df2, like below,

import pandas as pd

# df.dtypes returns a list of (column, type) tuples, one per column
pd_df1 = pd.DataFrame(df1.dtypes, columns=['column', 'data_type'])
pd_df2 = pd.DataFrame(df2.dtypes, columns=['column', 'data_type'])

and then joining those two pandas DataFrames with an 'outer' merge, as sketched below.
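A minimal sketch of that merge, building on the pd_df1/pd_df2 defined above; the classify helper and the 'diff' column are illustrative, not part of the original answer:

import pandas as pd

# the outer merge keeps columns that exist on only one side
# (the missing side shows up as NaN)
merged = pd_df1.merge(pd_df2, on='column', how='outer', suffixes=('_df1', '_df2'))

def classify(row):
    # label each column: added, removed, or type change
    if pd.isna(row['data_type_df1']):
        return 'new column'
    if pd.isna(row['data_type_df2']):
        return 'omitted'
    if row['data_type_df1'] != row['data_type_df2']:
        return 'type'
    return ''

merged['diff'] = merged.apply(classify, axis=1)
print(merged[merged['diff'] != ''])

Rows with an empty diff are columns that are identical in both DataFrames, which matches the output format the question asks for.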

Solution 2:[2]

Without any external library, we can find the schema difference using

from pyspark.sql.session import SparkSession
from pyspark.sql import DataFrame

def schema_diff(spark: SparkSession, df_1: DataFrame, df_2: DataFrame):
    # one row per column: (name, type) pairs taken from each schema
    s1 = spark.createDataFrame(df_1.dtypes, ["d1_name", "d1_type"])
    s2 = spark.createDataFrame(df_2.dtypes, ["d2_name", "d2_type"])
    difference = (
        # join on name AND type, so a column whose type changed fails
        # the match and survives the null filter below
        s1.join(
            s2,
            (s1.d1_name == s2.d2_name) & (s1.d1_type == s2.d2_type),
            how="outer",
        )
        .where(s1.d1_type.isNull() | s2.d2_type.isNull())
        .select(s1.d1_name, s1.d1_type, s2.d2_name, s2.d2_type)
        .fillna("")
    )
    return difference

  • fillna is optional; I prefer to see the missing values as empty strings.
  • the join condition includes the type, so a column that exists in both DataFrames but with a different type still fails the match and shows up in the result (once per side).
  • this also shows all columns that are in the second DataFrame but not in the first.

Usage:

diff = schema_diff(spark, df_1, df_2)
diff.show(diff.count(), truncate=False)
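A self-contained check of the function, with toy rows made up to mirror the question's schemas:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# schemas mirror the question: name changes from string to array,
# gender exists only in the second frame, age is unchanged
df_1 = spark.createDataFrame([("Alice", 30)], ["name", "age"])
df_2 = spark.createDataFrame([(["Alice"], 1, 30)], ["name", "gender", "age"])

schema_diff(spark, df_1, df_2).show(truncate=False)
# expect rows for "name" (string vs array<string>) and for "gender"
# (second frame only); "age" matches on both name and type, so it is
# filtered out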

Solution 3:[3]

A custom function that could be useful for someone.

def SchemaDiff(DF1, DF2):
    # map each column name to its data type for both DataFrames
    DF1Schema = {name: dtype for name, dtype in DF1.dtypes}
    DF2Schema = {name: dtype for name, dtype in DF2.dtypes}

    # columns present in DF1 but not in DF2, with their DF1 types
    DF1MinusDF2 = {name: DF1Schema[name] for name in set(DF1.columns) - set(DF2.columns)}

    # columns present in DF2 but not in DF1, with their DF2 types
    DF2MinusDF1 = {name: DF2Schema[name] for name in set(DF2.columns) - set(DF1.columns)}

    # columns present in both whose data type changed, mapped to their DF2 type
    DF1DataTypesChanged = {
        name: DF2Schema[name]
        for name in set(DF1.columns) & set(DF2.columns)
        if DF1Schema[name] != DF2Schema[name]
    }

    return DF1MinusDF2, DF2MinusDF1, DF1DataTypesChanged
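Usage might look like this; the variable names and the commented output values are illustrative, based on the question's df1/df2:

only_in_df1, only_in_df2, changed = SchemaDiff(df1, df2)
print(only_in_df1)  # {} - no column was dropped
print(only_in_df2)  # e.g. {'gender': 'int'} - columns added in df2
print(changed)      # e.g. {'name': 'array<string>'} - new type taken from df2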

Solution 4:[4]

you can simply compare the schemas directly:

df1.schema == df2.schema

Note that printSchema() only prints the schema tree and returns None, so df1.printSchema() == df2.printSchema() would always evaluate to True. Comparing the schema attributes yields a genuine boolean, although it only tells you whether the schemas differ, not which columns do.
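As a minimal sketch, this comparison fits naturally as a guard before downstream processing:

# boolean check only; use one of the solutions above to list *which*
# columns differ
if df1.schema != df2.schema:
    raise ValueError("df1 and df2 schemas differ")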

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 cts_superking
Solution 2 Koby
Solution 3 Ali Gohar
Solution 4 Pytrick