How to avoid showString error after dataframe groupBy
I have a dataframe with the following schema, which I extract from a Hive table using the SQL below:
Id | Group_name | Sub_group_number | Year_Month
---|---|---|---
1 | Active | 1 | 202110
2 | Active | 3 | 202110
3 | Inactive | 4 | 202110
4 | Active | 1 | 202110
The SQL to extract the information is:
```sql
SELECT Id, Group_Name, Sub_group_number, Year_Month
FROM table
WHERE Year_Month = 202110
AND id IN (SELECT Id FROM table WHERE Year_Month = 202109 AND Sub_group_number = 1)
```
After extracting this information, I want to group by Sub_group_number to get the count of Ids, as below:
```python
from pyspark.sql import functions as F

df = (df.withColumn('FROM', F.lit(1))
        .groupBy('Year_Month', 'FROM', 'Sub_group_number')
        .count())
```
The result is a table as below:
Year_Month | From | Sub_group_number | Quantity
---|---|---|---
202110 | 1 | 1 | 2
202110 | 1 | 3 | 1
202110 | 1 | 4 | 1
Up to this point there is no issue in my code, and I'm able to run Spark actions on the result. The issue happens when I try to turn the year_month and sub_group into parameters of my SQL query, so that I can build the complete table. I'm using the following code:
```python
sub_groups = [i for i in range(22)]
year_months = [202101, 202102, 202103]

for month in year_months:
    for group in sub_groups:
        query = f"""SELECT Id, Group_Name, Sub_group_number, Year_Month
                    FROM table
                    WHERE Year_Month = {month + 1}
                    AND id IN (SELECT Id FROM table WHERE Year_Month = {month} AND Sub_group_number = {group})"""
        df_temp = (spark.sql(query)
                   .withColumn('FROM', F.lit(group))
                   .groupBy('Year_Month', 'FROM', 'Sub_group_number')
                   .count())
        df = df.union(df_temp).dropDuplicates()
```
When I execute df.show() or try to write the result as a table, I get the error:
An error occurred while calling o8522.showString
Any idea what is causing this error?
Solution 1:[1]
You're attempting string interpolation.
If using Python, maybe try this:
query = "SELECT Id, Group_Name, Sub_group_number, Year_Month
FROM table
WHERE Year_Month = {0}
AND id IN (SELECT Id FROM table WHERE Year_Month = {1}
AND Sub_group_number = {2})".format(month + 1, month, group)
Solution 2:[2]
The error states it is a StackOverflowError, which can happen when the DAG plan grows too large. Because of Spark's lazy evaluation, this can easily happen with for-loops, and especially with nested for-loops like yours. If you are curious, try df.explain() where you did df.show(); you should see a very long physical plan that Spark cannot actually execute.
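For illustration, here is a minimal, self-contained sketch (toy data, not your tables) of how an iterative union inflates the plan:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Toy loop: each iteration nests another Union + Deduplicate on top of the old plan.
df = spark.range(10).toDF("Id")
for i in range(50):
    df = df.union(spark.range(10).toDF("Id")).dropDuplicates()

# The printed physical plan grows with the iteration count; with enough
# iterations, planning alone can end in a StackOverflowError.
df.explain()
```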
To solve this, you want to avoid the for-loop as much as possible, and in your case it seems you don't need it:
```python
sub_groups = [i for i in range(22)]
year_months = [202101, 202102, 202103]
# Modify this to use a datetime lib for more robustness (ex: handle 202112 -> 202201).
month_plus = [x + 1 for x in year_months]

def _to_str_elms(li):
    # [202102, 202103] -> "202102, 202103" (comma-separated values for the IN clause)
    return str(li)[1:-1]

spark.sql(f"""
    SELECT Id, Group_Name, Sub_group_number, Year_Month
    FROM table
    WHERE Year_Month IN ({_to_str_elms(month_plus)})
    AND id IN (SELECT Id FROM table
               WHERE Year_Month IN ({_to_str_elms(year_months)})
               AND Sub_group_number IN ({_to_str_elms(sub_groups)}))
""")
```
UPDATE:
I think I understand why you are looping. You need the "parent" group to sit alongside the Sub_group_number of each record, and you are injecting it with lit and the looped value. One way is to rethink the problem: first run a single query to fetch all records in [202101, 202102, 202103, 202104], then use a window function or a self-join to figure out the parent group. I am not yet foreseeing how it looks, so if you can give us some sample records and the logic for how you want to get the "group", I can perhaps provide updates.
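Purely as a starting point, here is one rough, untested self-join sketch of that idea; it assumes the question's naive month arithmetic (Year_Month - 1, which breaks across year boundaries) and may not match the intended logic:

```python
# Rough, untested sketch: pair each record with the same Id's prior-month
# record, so the previous Sub_group_number becomes the `FROM` group.
# `table` is the question's placeholder name; Year_Month - 1 is the naive
# month arithmetic from the question and breaks across year boundaries.
result = spark.sql("""
    SELECT cur.Year_Month,
           prev.Sub_group_number AS `FROM`,
           cur.Sub_group_number,
           COUNT(*) AS Quantity
    FROM table cur
    JOIN table prev
      ON prev.Id = cur.Id
     AND prev.Year_Month = cur.Year_Month - 1
    WHERE cur.Year_Month IN (202102, 202103, 202104)
    GROUP BY cur.Year_Month, prev.Sub_group_number, cur.Sub_group_number
""")
```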
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source
---|---
Solution 1 | user2314737
Solution 2 |