Spark pivot groupBy performance very slow
I am trying to pivot a DataFrame of about 6 GB of raw data, and it takes around 30 minutes with sum as the aggregation function:
import org.apache.spark.sql.functions.sum
import org.apache.spark.sql.types.DoubleType

val x_pivot = raw_df.groupBy("a", "b", "c", "d", "e", "f")
  .pivot("g")
  .agg(
    sum(raw_df("h").cast(DoubleType)).alias(""),
    sum(raw_df("i")).alias("i"))
When I changed the aggregation function to first, it started taking 1.5 hours. Could you please help me understand why the choice of aggregation function impacts performance, and how I can improve it?
Solution 1:[1]
For the best performance, specify the distinct values of your pivot column (if you know them). Otherwise, a job will be immediately launched to determine them.
If you do know them, pass them as a list, like this:
val x_pivot = raw_df.groupBy("a", "b", "c", "d", "e", "f")
  .pivot("g", Seq("V1", "V2", "V3"))  // pass the known distinct values of "g"
  .agg(
    sum(raw_df("h").cast(DoubleType)).alias(""),
    sum(raw_df("i")).alias("i"))
Here V1, V2, and V3 are the distinct values of the "g" column.
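If the distinct values of "g" are not known ahead of time, one option (not part of the original answer, just a common pattern) is to collect them once yourself and pass them in; this still costs one extra job, but you control when it runs and can inspect or cap the list. A minimal sketch, assuming "g" holds string values and using a hypothetical gValues name:

import org.apache.spark.sql.functions.sum
import org.apache.spark.sql.types.DoubleType

// Compute the distinct pivot values once, then pass them explicitly to pivot().
val gValues = raw_df.select("g").distinct().collect().map(_.getString(0)).toSeq

val x_pivot = raw_df.groupBy("a", "b", "c", "d", "e", "f")
  .pivot("g", gValues)
  .agg(
    sum(raw_df("h").cast(DoubleType)).alias(""),
    sum(raw_df("i")).alias("i"))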
Solution 2:[2]
The above answer will improve performance somewhat, but if I'm correct, pivoting over the whole list at once gives you roughly O(n^2) time complexity. For better efficiency, store the distinct values (the would-be column names) in a list and use a for loop to filter and pivot on one value at a time, i.e. one list element per iteration. This should give you roughly O(n) time complexity, I believe. Using this approach in Databricks, I was able to turn a job that took a few hours into one that now runs in 2 minutes. A sketch of the loop is shown below.
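A minimal, self-contained sketch of that loop, assuming the same raw_df, grouping keys, and columns as in the question and a string-typed "g" column; the result and single names are illustrative, not from the original answer:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, sum}

// Grouping keys from the question; the result starts as the distinct key combinations.
val keys = Seq("a", "b", "c", "d", "e", "f")
var result: DataFrame = raw_df.select(keys.map(col): _*).distinct()

// Distinct pivot values, collected once (assumes "g" holds strings).
val gValues = raw_df.select("g").distinct().collect().map(_.getString(0))

for (v <- gValues) {
  // Filter to a single pivot value, aggregate it, and attach it as two new columns.
  val single = raw_df
    .filter(col("g") === v)
    .groupBy(keys.head, keys.tail: _*)
    .agg(sum("h").alias(s"${v}_h"), sum("i").alias(s"${v}_i"))
  result = result.join(single, keys, "left")
}

Note that each iteration adds a left join, so with many distinct pivot values the query plan grows long; persisting or checkpointing result every few iterations may help keep planning time under control.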
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | |
Solution 2 | Everton Robinson |