Bucketing with QuantileDiscretizer using a groupBy function in PySpark

I have a large dataset like so:

+-------+------+
| SEQ_ID|RESULT|
+-------+------+
|3462099|239.52|
|3462099|239.66|
|3462099|239.63|
|3462099|239.64|
|3462099|239.57|
|3462099|239.58|
|3462099|239.53|
|3462099|239.66|
|3462099|239.63|
|3462099|239.52|
|3462099|239.58|
|3462099|239.52|
|3462099|239.64|
|3462099|239.71|
|3462099|239.64|
|3462099|239.65|
|3462099|239.54|
|3462099| 239.6|
|3462099|239.56|
|3462099|239.67|
+-------+------+

The RESULT values are grouped by the SEQ_ID column, and I want to bucket/bin them based on the counts of each group. After applying some aggregations, I have a DataFrame with the number of buckets that each SEQ_ID must be binned into, like so:

+-------+-----------+
| SEQ_ID|num_buckets|
+-------+-----------+
|3760290|         12|
|3462099|          5|
|3462099|          5|
|3760290|         13|
|3462099|         13|
|3760288|         10|
|3760288|          5|
|3461201|          6|
|3760288|         13|
|3718665|         18|
+-------+-----------+

So, for example, this tells me that the RESULT values belonging to SEQ_ID 3760290 must be binned into 12 buckets.

For a single group, I would collect() the num_buckets value and do:

from pyspark.ml.feature import QuantileDiscretizer

discretizer = QuantileDiscretizer(numBuckets=num_buckets, inputCol='RESULT', outputCol='buckets')
df_binned = discretizer.fit(df).transform(df)
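
To make that concrete, here is a minimal, self-contained sketch of the single-group case. The toy rows, and the name df_counts for the aggregated SEQ_ID/num_buckets DataFrame, are just illustrative:

from pyspark.sql import SparkSession
from pyspark.ml.feature import QuantileDiscretizer

spark = SparkSession.builder.getOrCreate()

# Toy versions of the two DataFrames shown above (rows shortened for brevity)
df = spark.createDataFrame(
    [(3462099, 239.52), (3462099, 239.66), (3462099, 239.63),
     (3462099, 239.64), (3462099, 239.57), (3462099, 239.58)],
    ['SEQ_ID', 'RESULT'])
df_counts = spark.createDataFrame([(3462099, 5)], ['SEQ_ID', 'num_buckets'])

# Single-group case: collect the bucket count for one SEQ_ID, then fit
# QuantileDiscretizer on that group's RESULT values only
num_buckets = df_counts.filter(df_counts.SEQ_ID == 3462099).first()['num_buckets']
group = df.filter(df.SEQ_ID == 3462099)

discretizer = QuantileDiscretizer(numBuckets=num_buckets, inputCol='RESULT', outputCol='buckets')
df_binned = discretizer.fit(group).transform(group)
df_binned.show()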

I understand that when using QuantileDiscretizer this way, each group would end up in a separate DataFrame, which I could then union back together (a rough sketch of that approach is below). But how can I use QuantileDiscretizer to bin the various groups without resorting to a for loop?
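
For reference, this is roughly the loop-and-union pattern I am trying to avoid (reusing the toy df and df_counts from the sketch above, and assuming one row per SEQ_ID in df_counts):

from functools import reduce
from pyspark.ml.feature import QuantileDiscretizer

# One QuantileDiscretizer per group: filter the group's rows, fit and transform,
# then union all the per-group results back together
binned_parts = []
for row in df_counts.collect():
    group = df.filter(df.SEQ_ID == row['SEQ_ID'])
    discretizer = QuantileDiscretizer(numBuckets=row['num_buckets'],
                                      inputCol='RESULT', outputCol='buckets')
    binned_parts.append(discretizer.fit(group).transform(group))

df_binned_all = reduce(lambda a, b: a.union(b), binned_parts)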


