Delete multiple rows from a delta table/pyspark data frame given a list of IDs

I need to delete multiple rows from a Delta table/PySpark data frame, given a list of IDs that identify the rows. As far as I can tell, there isn't a way to delete them all at once using a list, only one at a time. Any advice/help would be greatly appreciated.



Solution 1:[1]

Let's say you have two dataframes, one being your data and the other one just a column with the IDs of rows to delete. A left-anti-JOIN can help you filter out the rows you want to delete.

df = df.join(dfWithIdsToDelete, "<idColumnName>", "left_anti")

This JOIN gives you all the rows of the df where the ID does not exist in the dfWithIdsToDelete, therefore filtering out all the rows you want to delete.

If your list of IDs to delete is a Python list, you can convert it to a single-column dataframe first.

Solution 2:[2]

In Spark's architecture, a DataFrame is built on top of RDDs, which are immutable, so DataFrames are immutable as well.
You cannot change one in place. To delete rows from a DataFrame, filter out the rows you do not want and assign the result to another DataFrame.
You can remove multiple rows from a PySpark DataFrame this way using either filter or where (where is an alias for filter).

Here I am using a Delta Lake table in Databricks.

I am deleting the rows using the list of IDs below.

id_list=[2,3,5,7]

Deleting rows using filter:

id_list=[2,3,5,7]
# Keep only the rows whose Id is NOT in id_list
df2=df2.filter(~df2.Id.isin(id_list))
df2.show()

The rows whose Ids are in the list no longer appear in the resulting dataframe.

Deleting rows using where:

# where is an alias for filter; keep only the rows whose Id is NOT in id_list
df2=df.where(~df.Id.isin(id_list))
df2.show()

This uses the same id_list as above.

An alternative method:

from pyspark.sql.functions import when
# Flag the rows to keep, filter on the flag, then drop the helper column
df=df.withColumn("Result",when(~df.Id.isin(id_list),True)).filter("Result == True").drop("Result")
df.show()

As before, the rows with the listed Ids are removed from the output.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 restlessmodem
Solution 2 RakeshGovindula-MT