Random sampling based on one column after groupBy
I have a Spark table that contains 400+ million records/rows. I used spark.table
to convert it into a DataFrame.
The DF looks like this:
id pub_date version unique_id c_id p_id type source
lni001 20220301 1 64WP-UI-POLI 002 P02 org internet
lni001 20220301 1 64WP-UI-POLI 002 P02 org internet
lni001 20220301 1 64WP-UI-POLI 002 P02 org internet
lni001 20220301 2 64WP-UI-CFGT 012 K21 location internet
lni001 20220301 2 64WP-UI-CFGT 012 K21 location internet
lni001 20220301 3 64WP-UI-CFGT 012 K21 location internet
lni001 20220301 3 64WP-UI-POLI 002 P02 org internet
lni002 20220301 85 64WP-UI-POLI 002 P02 org internet
lni002 20220301 85 64WP-UI-POLI 002 P02 org internet
lni002 20220301 5 64WP-UI-CFGT 012 K21 location internet
lni002 20220301 1 64WP-UI-CFGT 012 K21 location internet
::
::
I am trying to randomly select rows based on the id column. I want to randomly select a group of ids and get back all of their rows after doing a groupBy or partitionBy on the id column.
If I want 2 random samples, I should get back all the rows associated with the sampled ids. For example, under the id column, "lni001" has 7 records and "lni002" has 4 records, so I would need all the records under "lni001" and "lni002".
I have been trying groupBy and partitionBy but still couldn't figure out how to do it. It would be great if you could give me some ideas or suggestions. Thanks!
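One common pattern for this kind of "sample whole groups" problem is to draw the random sample from the distinct id values first, then join back to the full DataFrame so every row belonging to a sampled id is kept. Here is a minimal sketch, assuming the DataFrame is called df, the source table is named my_table, and n is the number of ids you want (both names are placeholders, not from the question):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Assumptions: "my_table" is the source table and n is the number of ids to sample.
df = spark.table("my_table")
n = 2

# 1. Take the distinct ids, shuffle them randomly, and keep n of them.
sampled_ids = (
    df.select("id")
      .distinct()
      .orderBy(F.rand(seed=42))   # fixed seed for reproducibility; drop it for a new sample each run
      .limit(n)
)

# 2. Inner-join back so that every row belonging to a sampled id is returned.
result = df.join(sampled_ids, on="id", how="inner")

result.show(truncate=False)
```

orderBy(rand()).limit(n) returns exactly n ids; if an approximate fraction is good enough, df.select("id").distinct().sample(...) avoids the full shuffle-sort. Since sampled_ids is tiny compared to the 400+ million row table, wrapping it in F.broadcast(...) in the join can also keep the join cheap.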
Sources
Stack Overflow, licensed under CC BY-SA 3.0.