Bucketing with QuantileDiscretizer using groupBy function in PySpark
I have a large dataset like so:
| SEQ_ID|RESULT|
+-------+------+
|3462099|239.52|
|3462099|239.66|
|3462099|239.63|
|3462099|239.64|
|3462099|239.57|
|3462099|239.58|
|3462099|239.53|
|3462099|239.66|
|3462099|239.63|
|3462099|239.52|
|3462099|239.58|
|3462099|239.52|
|3462099|239.64|
|3462099|239.71|
|3462099|239.64|
|3462099|239.65|
|3462099|239.54|
|3462099| 239.6|
|3462099|239.56|
|3462099|239.67|
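A toy version of this data can be recreated with something like the following (only the first few rows are reproduced, just for experimenting):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Small sample of the data shown above.
df = spark.createDataFrame(
    [(3462099, 239.52), (3462099, 239.66), (3462099, 239.63),
     (3462099, 239.64), (3462099, 239.57), (3462099, 239.58)],
    ['SEQ_ID', 'RESULT'])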
The RESULT column is grouped by the SEQ_ID column.
I want to bucket/bin the RESULT values based on the counts of each group. After applying some aggregations, I have a DataFrame with the number of buckets that each SEQ_ID must be binned into, like so:
| SEQ_ID|num_buckets|
+-------+----------+
|3760290| 12|
|3462099| 5|
|3462099| 5|
|3760290| 13|
|3462099| 13|
|3760288| 10|
|3760288| 5|
|3461201| 6|
|3760288| 13|
|3718665| 18|
So, for example, this tells me that the RESULT values that belong to SEQ_ID 3760290 must be binned into 12 buckets.
For a single group, I would collect() the num_buckets value and do:
from pyspark.ml.feature import QuantileDiscretizer

discretizer = QuantileDiscretizer(numBuckets=num_buckets, inputCol='RESULT', outputCol='buckets')
df_binned = discretizer.fit(df).transform(df)
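For context, the collect() step I mean might look something like this (buckets_df is just an assumed name for the aggregated counts DataFrame shown above, and the SEQ_ID value is picked arbitrarily for illustration):

# Assumed setup: buckets_df holds the per-SEQ_ID bucket counts shown above.
num_buckets = int(
    buckets_df.filter(buckets_df.SEQ_ID == 3462099)
              .first()['num_buckets'])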
I understand that when using QuantileDiscretizer this way, each group would result in a separate DataFrame, and I could then union them all; a rough loop-based sketch of that idea is below.
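This is roughly the loop I want to avoid (df and buckets_df are assumed names, and dropDuplicates is only there because the counts table above repeats some SEQ_IDs):

from functools import reduce
from pyspark.sql import DataFrame
from pyspark.ml.feature import QuantileDiscretizer

# Fit one QuantileDiscretizer per SEQ_ID, then union the per-group results.
binned_parts = []
for row in buckets_df.dropDuplicates(['SEQ_ID']).collect():
    group_df = df.filter(df.SEQ_ID == row['SEQ_ID'])
    discretizer = QuantileDiscretizer(numBuckets=int(row['num_buckets']),
                                      inputCol='RESULT',
                                      outputCol='buckets')
    binned_parts.append(discretizer.fit(group_df).transform(group_df))

df_binned = reduce(DataFrame.unionByName, binned_parts)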
But how can I use QuantileDiscretizer to bin the various groups without using a for loop?
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow