Spark pivot groupby performance very slow

I am trying to pivot a DataFrame with about 6 GB of raw data, and with sum as the aggregation function it takes around 30 minutes:

import org.apache.spark.sql.functions.sum
import org.apache.spark.sql.types.DoubleType

val x_pivot = raw_df.groupBy("a", "b", "c", "d", "e", "f")
  .pivot("g")
  .agg(sum(raw_df("h").cast(DoubleType)).alias(""),
       sum(raw_df("i")).alias("i"))

When I changed the aggregation function to first, it started taking 1.5 hours. Could you please help me understand why the aggregation function impacts performance so much, and how I can improve it?



Solution 1:[1]

For the best performance, specify the distinct values of your pivot column (if you know them). Otherwise, a job will be immediately launched to determine them.

Pass them in as a list, like this:

val x_pivot = raw_df.groupBy("a", "b", "c", "d", "e", "f")
  .pivot("g", Seq("V1", "V2", "V3"))
  .agg(sum(raw_df("h").cast(DoubleType)).alias(""),
       sum(raw_df("i")).alias("i"))

V1, V2, and V3 are the distinct values of the "g" column.
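
If you do not know the distinct values up front, one option is to compute them once yourself and reuse them. A minimal sketch, assuming "g" is a string column (the alias "h" is illustrative, not from the original code):

import org.apache.spark.sql.functions.sum
import org.apache.spark.sql.types.DoubleType

// Compute the distinct pivot values once, then pass them to pivot()
// so it does not have to launch its own job to discover them.
val gValues = raw_df.select("g").distinct().collect().map(_.getString(0)).toSeq

val x_pivot = raw_df.groupBy("a", "b", "c", "d", "e", "f")
  .pivot("g", gValues)
  .agg(sum(raw_df("h").cast(DoubleType)).alias("h"),
       sum(raw_df("i")).alias("i"))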

Solution 2:[2]

The above answer will improve performance slightly, but if I'm correct, using a list will give you O(n^2) time complexity. For better efficiency, store the distinct would-be column names in a list and use a for loop to filter and pivot on one value at a time (i.e., one list element at a time), as in the sketch below. This should then give you O(n) time complexity, I believe. I was able to turn a job that took a few hours into one that now runs in 2 minutes within Databricks.
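
A minimal Scala sketch of this idea, assuming the same column names as the question; the output aliases and the outer join on the group keys are illustrative choices, not part of the original answer:

import org.apache.spark.sql.functions.sum
import org.apache.spark.sql.types.DoubleType

// Distinct values of the pivot column (assumes "g" holds strings).
val gValues = raw_df.select("g").distinct().collect().map(_.getString(0))

// Filter and pivot on one value at a time, then join the per-value
// results back together on the group-by keys.
val keys = Seq("a", "b", "c", "d", "e", "f")
val perValue = gValues.map { v =>
  raw_df.filter(raw_df("g") === v)
    .groupBy(keys.head, keys.tail: _*)
    .pivot("g", Seq(v))
    .agg(sum(raw_df("h").cast(DoubleType)).alias("h"),
         sum(raw_df("i")).alias("i"))
}
val x_pivot = perValue.reduce(_.join(_, keys, "outer"))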

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution     Source
Solution 1
Solution 2   Everton Robinson