PySpark: select multiple columns from a list and filter on different values
I have a table with ~5k columns and ~1M rows that looks like this:
ID | Col1 | Col2 | Col3 | Col4 | Col5 | Col6 | Col7 | Col8 | Col9 | Col10 | Col11 |
---|---|---|---|---|---|---|---|---|---|---|---|
ID1 | 0 | 1 | 0 | 1 | 0 | 2 | 1 | 1 | 2 | 2 | 0 |
ID2 | 1 | 0 | 1 | 0 | 1 | 2 | 0 | 0 | 0 | 2 | 1 |
ID3 | 1 | 0 | 0 | 0 | 0 | 2 | 1 | 0 | 1 | 1 | 1 |
ID4 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 |
ID5 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 |
I want to select columns whose names match entries in different lists and subset the rows according to different criteria. For example, if list1 contains col1, col3, col4, col11 and list2 contains col2, col6, col9, col10, I want to keep rows where every column in list1 equals 0 AND every column in list2 equals 1, e.g.
df1 = df.filter((df.col1 == 0) & (df.col3 == 0) & (df.col4 == 0) & (df.col6 == 1) & (df.col9 == 1) & (df.col10 == 1))
Instead of typing out each column name, I want these columns to be picked up from the two lists. How can I achieve this in PySpark?
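(For reference, a minimal sketch of what such a dynamic filter can look like, folding per-column tests together with functools.reduce; the list contents below are taken from the example above, and df is the question's DataFrame.)

```python
from functools import reduce
from pyspark.sql.functions import col

list1 = ["col1", "col3", "col4", "col11"]  # columns that must equal 0
list2 = ["col2", "col6", "col9", "col10"]  # columns that must equal 1

# Build one Column expression per test, then AND them all together
conditions = [col(c) == 0 for c in list1] + [col(c) == 1 for c in list2]
df1 = df.filter(reduce(lambda a, b: a & b, conditions))
```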
Solution 1:[1]
I found the solution that I was looking for. I was able to filter a large number of columns using the following steps:
1. Create separate lists of column IDs. My column IDs were stored in a DataFrame, so I collected that column into a Python list:
list1 = ped.filter(ped.pheno == 2).select("IID").rdd.map(lambda row: row[0]).collect()
list2 = ped.filter(ped.pheno == 1).select("IID").rdd.map(lambda row: row[0]).collect()
2. Using these lists, filter the rows as follows:
df1 = df.where(" AND ".join(["(%s == 1)" % c for c in list1]))
df1 = df1.where(" AND ".join(["(%s == 0)" % c for c in list2]))
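One caveat not in the original answer: a string passed to where is parsed as a SQL expression, so column names containing spaces or other special characters need backtick quoting, e.g.:

```python
# Backquote each identifier so Spark's SQL parser accepts unusual column names
df1 = df.where(" AND ".join(["(`%s` == 1)" % c for c in list1]))
```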
Thanks for the other helpful solutions as well!
Solution 2:[2]
If you need to compare a lot of columns like this, consider a tuple-wise comparison:
from pyspark.sql.functions import col, lit, struct
source_tuple = struct(col("col1"), col("col3"), col("col4"), col("col6"), col("col9"), col("col10"))
target_tuple1 = struct(lit(0), lit(0), lit(0), lit(1), lit(1), lit(1))
df1 = df.where(source_tuple.isin(target_tuple1))
You can build the left- and right-hand sides dynamically and pass multiple tuples to the right-hand side. If you need multiple left-hand sides, combine the resulting DataFrames using UNION ALL (union in the DataFrame API).
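A sketch of building both sides from the question's lists (the variable names below are illustrative, not from the original answer):

```python
from pyspark.sql.functions import col, lit, struct

list1 = ["col1", "col3", "col4"]   # columns that must equal 0
list2 = ["col6", "col9", "col10"]  # columns that must equal 1

# Pack the source columns and the target literals into parallel structs
source_tuple = struct(*[col(c) for c in list1 + list2])
target_tuple = struct(*([lit(0)] * len(list1) + [lit(1)] * len(list2)))
df1 = df.where(source_tuple.isin(target_tuple))
```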
Solution 3:[3]
Try this; it worked for me:
from functools import reduce
from pyspark.sql.functions import col

tmp_cols = ['col1', 'col2']
filter_conditions = [col(c) == 1 for c in tmp_cols]
nestor = lambda u, v: u | v  # OR the conditions together
df = df.filter(reduce(nestor, filter_conditions))
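Note that | ORs the conditions, so a row passes if any listed column equals 1; for the AND semantics the question asks for, fold with & instead, e.g.:

```python
# Conjunction instead of disjunction: every listed column must equal 1
df = df.filter(reduce(lambda u, v: u & v, filter_conditions))
```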
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | |
Solution 2 | Carter Shanklin |
Solution 3 | Saurabh Verma |