'dropDuplicates with non-numeric condition
I have a dataframe that looks like this (I have a few more columns, but they aren't relevant):
+-----------+-----------+---------------+
|item_id |location_id|decision |
+-----------+-----------+---------------+
| 111111| A | True |
| 111111| A | False |
| 111111| A | False |
| 222222| B | False |
| 222222| B | False |
| 333333| C | True |
| 333333| C | True |
| 333333| C | Unsure |
+-----------+-----------+---------------+
I would like to do dropDuplicates("item_id", "location_id")
so I can remove rows that have the same item_id
and location_id
, but I want to keep a row that contains True
OR Unsure
if it exists. If none of the duplicate rows contain True
or Unsure
, any row with False
is fine. For the above example, I would like the resulting dataframe to look like this:
+-----------+-----------+---------------+
|item_id |location_id|decision |
+-----------+-----------+---------------+
| 111111| A | True |
| 222222| B | False |
| 333333| C | Unsure |
+-----------+-----------+---------------+
For item_id
111111 and location_id
A, I want the row with decision
True since one such row exists. For item_id
222222 and location_id
B, since none of the rows contain True, selecting either is fine. For item_id
333333 and location_id
C, all rows contain the desired values of either True
or Unsure
, so selecting any one of the three is fine.
I am using Scala, so solution in Scala would be appreciated.
Solution 1:[1]
Going off of your comment on Lingaraj's answer, there's a slightly more sophisticated way of solving it than adding literals to the decision
. You can instead create a boolean:
val decisionIsImportant = col("decision") === "True" || col("decision") === "Unsure" || col("decision") === Zebra || col("decision") == Xylophone
This boolean will evaluate to True
if the decision
column is equal to any of the values you want to prioritize.
Then, you can create a Window and apply it to the DataFrame:
val decisionSortWindow = Window
.partitionBy(col("item_id"), col("location_id")))
.orderBy(decisionIsImportant.desc)
df.withColumn("row_number", row_number().over(MyWindow))
.filter(col("row_number").equalTo(1))
.drop(col("row_number"))
.orderBy(col("item_id"))
.show(false)
Since all the decision
values that evaluate to True
will be sorted first, you can just filter by col("row_number").equalTo(1)
, since row number 1 is guaranteed to be one of the important decisions if one is present.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | sandbar |