SQL-like NOT IN clause for PySpark DataFrames
In SQL we can write, for example, select * from table where col1 not in ('A','B');. I was wondering if there is a PySpark equivalent for this. I was able to find the isin function for a SQL-like IN clause, but nothing for NOT IN.
Solution 1:[1]
I just had the same issue and found a solution. To negate any condition (represented in PySpark as a Column), use the negation operator ~, for example:
df.where(~df.flag.isin(1, 2, 3))  # records with flag NOT IN (1, 2, 3)
Solution 2:[2]
Alternatively, you can do this:
df = df.filter(df.sbg_digitalId.isin(1, 2, 3) == False)
Solution 3:[3]
It is also possible to call the isin() method from F.col(), like this:
import pyspark.sql.functions as F
df.filter(~F.col("col1").isin('A', 'B'))
This also gives your IDE access to the isin documentation (provided you are using pyspark-stubs).
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Mariusz |
| Solution 2 | Tshilidzi Mudau |
| Solution 3 | Purushothaman Srikanth |