How to compare two dataframes and extract unmatched rows in PySpark?
Hi, I have two dataframes: a parent dataframe and an incremental dataframe. I want to extract the records that are present in the incremental dataframe but not in the parent dataframe, based on the key column.
Example:
Key Column : call_id
parent_dataframe:
call_id call_nm src
100 QC Darzalex MM
105 XY INVOKANA
107 CZ Simponi RA
117 NM Guselkumab PSA
118 YC STELARA
126 RF INVOKANA
Incremental Dataframe:
call_id call_nm src
118 YC STELARA
126 RF INVOKANA
131 VG STELARA
135 IJ Stelara CD
Unmatched Dataframe:
call_id call_nm src
131 VG STELARA
135 IJ Stelara CD
Solution 1:[1]
Use a left_anti join with the incremental dataframe on the left. A left_anti join keeps only the rows of the left dataframe whose join key has no match in the right dataframe, and it returns only the left dataframe's columns.
Incremental.join(parent_dataframe, on='call_id', how='left_anti').show()
+-------+-------+----------+
|call_id|call_nm|       src|
+-------+-------+----------+
|    135|     IJ|Stelara CD|
|    131|     VG|   STELARA|
+-------+-------+----------+
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | wwnde |