SQL-like NOT IN clause for PySpark DataFrames
In SQL we can, for example, do `select * from table where col1 not in ('A','B');`
I was wondering if there is a PySpark equivalent for this. I was able to find the `isin` function for a SQL-like IN clause, but nothing for NOT IN.
Solution 1:[1]
I just had the same issue and found a solution. If you want to negate any condition (represented in PySpark as a Column object), there is the negation operator `~`, for example:
df.where(~df.flag.isin(1, 2, 3)) # records with flag NOT IN (1, 2, 3)
Solution 2:[2]
Alternatively, you can compare the boolean Column produced by isin against False:
df = df.filter(df.sbg_digitalId.isin(1, 2, 3) == False)
Solution 3:[3]
It is also possible to call the isin() method on a column obtained via F.col(), like this:
import pyspark.sql.functions as F
df.filter(~F.col("col1").isin('A','B'))
This gives you access to the isin documentation within your IDE (provided you are using pyspark-stubs).
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Mariusz |
| Solution 2 | Tshilidzi Mudau |
| Solution 3 | Purushothaman Srikanth |
