'dropDuplicates with non-numeric condition

I have a dataframe that looks like this (I have a few more columns, but they aren't relevant):

+-----------+-----------+---------------+                                        
|item_id    |location_id|decision       |
+-----------+-----------+---------------+
|     111111|  A        |        True   |
|     111111|  A        |        False  |
|     111111|  A        |        False  |
|     222222|  B        |        False  |
|     222222|  B        |        False  |
|     333333|  C        |        True   |
|     333333|  C        |        True   |
|     333333|  C        |        Unsure |
+-----------+-----------+---------------+

I would like to do dropDuplicates("item_id", "location_id") so I can remove rows that have the same item_id and location_id, but I want to keep a row that contains TrueOR Unsure if it exists. If none of the duplicate rows contain True or Unsure, any row with False is fine. For the above example, I would like the resulting dataframe to look like this:

+-----------+-----------+---------------+                                        
|item_id    |location_id|decision       |
+-----------+-----------+---------------+
|     111111|  A        |        True   |
|     222222|  B        |        False  |
|     333333|  C        |        Unsure |
+-----------+-----------+---------------+

For item_id 111111 and location_id A, I want the row with decision True since one such row exists. For item_id 222222 and location_id B, since none of the rows contain True, selecting either is fine. For item_id 333333 and location_id C, all rows contain the desired values of either True or Unsure, so selecting any one of the three is fine.

I am using Scala, so solution in Scala would be appreciated.



Solution 1:[1]

Going off of your comment on Lingaraj's answer, there's a slightly more sophisticated way of solving it than adding literals to the decision. You can instead create a boolean:

val decisionIsImportant = col("decision") === "True" || col("decision") === "Unsure" || col("decision") === Zebra || col("decision") == Xylophone

This boolean will evaluate to True if the decision column is equal to any of the values you want to prioritize.

Then, you can create a Window and apply it to the DataFrame:

val decisionSortWindow = Window
      .partitionBy(col("item_id"), col("location_id")))
      .orderBy(decisionIsImportant.desc)

df.withColumn("row_number", row_number().over(MyWindow))
  .filter(col("row_number").equalTo(1))
  .drop(col("row_number"))
  .orderBy(col("item_id"))
  .show(false)

Since all the decision values that evaluate to True will be sorted first, you can just filter by col("row_number").equalTo(1), since row number 1 is guaranteed to be one of the important decisions if one is present.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 sandbar