SQL-like NOT IN clause for PySpark DataFrames

In SQL we can do, for example: select * from table where col1 not in ('A', 'B');

I was wondering whether there is a PySpark equivalent for this. I was able to find the isin function for the SQL-like IN clause, but nothing for NOT IN.



Solution 1:[1]

I just had the same issue and found a solution. If you want to negate any condition (represented in PySpark as a Column object), there is the negation operator ~, for example:

df.where(~df.flag.isin(1, 2, 3)) # records with flag NOT IN (1, 2, 3)

Solution 2:[2]

Alternatively, you can do this:

df = df.filter(df.sbg_digitalId.isin(1, 2, 3) == False)

Solution 3:[3]

It is also possible to call the isin() method on F.col(), like this:

import pyspark.sql.functions as F

df.filter(~F.col("col1").isin('A','B'))

This also gives you access to the isin documentation within your IDE (provided you are using pyspark-stubs).

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution 1: Mariusz
Solution 2: Tshilidzi Mudau
Solution 3: Purushothaman Srikanth