How to quickly check if row exists in PySpark Dataframe?
I have a PySpark dataframe like this:
+------+------+
| A| B|
+------+------+
| 1| 2|
| 1| 3|
| 2| 3|
| 2| 5|
+------+------+
I want to do a lookup on the table to see if a specific row exists. For example, for the test of A = 2, B = 5 the code should return True, and for A = 2, B = 10 the code should return False.
I tried this:
df[(df['A'] == 1) & (df['B'] == 2)].rdd.isEmpty()
Unfortunately, this code takes a long time to execute, and since this is a lookup that will be performed many times (for different values of A and B), I would like to have a quicker method of accomplishing this task.
Other solutions that I am considering are:
- Converting the PySpark dataframe to a Pandas dataframe because the row lookups are faster
- Using .where() or .filter(), though from what I have tried, I do not anticipate either being substantially faster
- Using .count() over isEmpty()
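In the same spirit as the first option (moving the data out of Spark for faster lookups), here is a minimal sketch that uses a plain Python set instead of Pandas. It assumes the distinct (A, B) pairs fit in driver memory; the hard-coded rows and the name pair_exists are stand-ins for illustration, not part of the original code.

```python
# Hypothetical stand-in for the collected rows. In a real job they could come from:
#   rows = df.select("A", "B").distinct().collect()
rows = [(1, 2), (1, 3), (2, 3), (2, 5)]

# Build the set once; every later check is a hash lookup instead of a Spark job.
existing = {(a, b) for a, b in rows}


def pair_exists(a, b):
    """Return True if the pair (a, b) is among the collected rows."""
    return (a, b) in existing


print(pair_exists(2, 5))   # True
print(pair_exists(2, 10))  # False
```

This only pays off when the same table is queried many times, since the one-off collect still has to scan the whole dataframe.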
Solution 1:[1]
It would be better to create a Spark dataframe from the entries that you want to look up, and then do a semi join or an anti join against the original dataframe to get the lookup entries that do or do not exist in it. This should be more efficient than checking the entries one by one.
import pyspark.sql.functions as F
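# Entries to look up; df is the original dataframe from the question
lookup = spark.createDataFrame([[2,5],[2,10]],['A','B'])
# The semi join keeps lookup rows with a match in df; the anti join keeps those without one
result1 = lookup.join(df, ['A','B'], 'semi').withColumn('exists', F.lit(True))
result2 = lookup.join(df, ['A','B'], 'anti').withColumn('exists', F.lit(False))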
result = result1.unionAll(result2)
result.show()
+---+---+------+
| A| B|exists|
+---+---+------+
| 2| 5| true|
| 2| 10| false|
+---+---+------+
Solution 2:[2]
Spark function ANY offers a very quick way to check if a record exists inside a dataframe.
check = df.selectExpr('ANY((A = 2) AND (B = 5)) as chk')
check.show()
# +----+
# | chk|
# +----+
# |true|
# +----+
check = df.selectExpr('ANY((A = 2) AND (B = 10)) as chk')
check.show()
# +-----+
# | chk|
# +-----+
# |false|
# +-----+
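Since the lookup is meant to be repeated for different values of A and B, the expression above could be wrapped in a small helper that returns a plain Python bool. This is only a sketch: the names any_expr and row_exists_any are hypothetical, and it assumes integer values (other types would need proper SQL quoting).

```python
def any_expr(a, b):
    """Build the ANY(...) SQL expression for one (A, B) pair.

    Assumes a and b are integers; strings or dates would need SQL quoting.
    """
    return f"ANY((A = {int(a)}) AND (B = {int(b)})) AS chk"


def row_exists_any(df, a, b):
    """Hypothetical helper: evaluate the expression on df and return a bool."""
    # first() returns the single aggregated Row; "chk" is the alias set in any_expr.
    # On an empty dataframe the aggregate is NULL (None), which maps to False here.
    return bool(df.selectExpr(any_expr(a, b)).first()["chk"])


print(any_expr(2, 5))  # ANY((A = 2) AND (B = 5)) AS chk
```

With the dataframe from the question, row_exists_any(df, 2, 5) should give True and row_exists_any(df, 2, 10) should give False, matching the two outputs shown above.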
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | mck |
Solution 2 |