How to avoid showString error after dataframe groupBy

I have a dataframe with the following schema, which I extract from a Hive table using the SQL below:

Id  Group_name  Sub_group_number  Year_Month
1   Active      1                 202110
2   Active      3                 202110
3   Inactive    4                 202110
4   Active      1                 202110

The SQL to extract the information is:

SELECT Id, Group_Name, Sub_group_number, Year_Month
FROM table
WHERE Year_Month = 202110
AND id IN (SELECT Id FROM table WHERE Year_Month = 202109 AND Sub_group_number = 1)

After extracting this information, I want to group by Sub_group_number to get the count of Ids, as below:

from pyspark.sql import functions as F

df = (df.withColumn('FROM', F.lit(1))
        .groupBy('Year_Month', 'FROM', 'Sub_group_number')
        .count())

The result is a table as below:

Year_Month  From  Sub_group_number  Quantity
202110      1     1                 2
202110      1     3                 1
202110      1     4                 1

Up to this point there is no issue with my code, and I am able to run action commands in Spark. The issue happens when I try to turn the year_month and sub_group values into parameters of my SQL in order to build a complete table. I'm using the following code:

sub_groups = [i for i in range(22)]
year_months = [202101, 202102, 202103]
for month in year_months:
    for group in sub_groups:
        query = f"""SELECT Id, Group_Name, Sub_group_number, Year_Month
                    FROM table
                    WHERE Year_Month = {month + 1}
                    AND id IN (SELECT Id FROM table WHERE Year_Month = {month} AND Sub_group_number = {group})"""
        df_temp = (spark.sql(query)
                        .withColumn('FROM', F.lit(group))
                        .groupBy('Year_Month', 'FROM', 'Sub_group_number')
                        .count())
        df = df.union(df_temp).dropDuplicates()

When I execute df.show() or try to write the result as a table, I get this error:

An error occurred while calling o8522.showString

[Full error trace omitted]

Any idea what is causing this error?



Solution 1:[1]

You're attempting string interpolation.

If using Python, maybe try this:

query = "SELECT Id, Group_Name, Sub_group_number, Year_Month
                    FROM table
                    WHERE Year_Month = {0}
                    AND id IN (SELECT Id FROM table WHERE Year_Month = {1} 
                    AND Sub_group_number = {2})".format(month + 1, month, group)
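A quick sanity check (my own suggestion, not part of the original answer) is to print the interpolated string once before submitting it, to confirm the substituted values:

# With month = 202101 and group = 1, the printed SQL should contain
# Year_Month = 202102 and Sub_group_number = 1.
print(query)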

Solution 2:[2]

The error says it is a StackOverflowError, which can happen when the DAG plan grows too large. Because of Spark's lazy evaluation this can easily happen with for-loops, especially nested ones: every iteration stacks another union onto the plan. If you are curious, call df.explain() where you did df.show(); you should see a physical plan so long that Spark cannot actually execute it.
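For illustration, that check looks like this (the printed plan gets longer with every loop iteration):

# Replace the failing action with a plan inspection:
df.explain()      # prints the physical plan
df.explain(True)  # also prints the parsed, analyzed, and optimized logical plans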

To solve this, you want to avoid for-loops as much as possible, and in your case it seems you don't need one at all.

sub_groups = list(range(22))
year_months = [202101, 202102, 202103]

# Modify this to use a datetime lib for more robustness (ex: handle 202112 -> 202201).
month_plus = [x + 1 for x in year_months]

def _to_str_elms(li):
    # [202101, 202102] -> '202101, 202102'
    return str(li)[1:-1]

df = spark.sql(f"""
    SELECT Id, Group_Name, Sub_group_number, Year_Month
    FROM table
    WHERE Year_Month IN ({_to_str_elms(month_plus)})
        AND id IN (SELECT Id FROM table
                   WHERE Year_Month IN ({_to_str_elms(year_months)})
                     AND Sub_group_number IN ({_to_str_elms(sub_groups)}))
""")

UPDATE:

I think I understood why you are looping. You need the "parent" group alongside the Sub_group_number of each record, and you are attaching it with lit and the looped value. I think one way is to rethink the problem: first run a single query to fetch all records that are in [202101, 202102, 202103, 202104], then use some window functions to figure out the parent group. I am not yet foreseeing how it looks, so if you can give us some sample records and the logic for how you want to get the "group", I can perhaps provide updates.
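For what it's worth, here is a minimal sketch of one shape that idea could take (my own assumption, not confirmed by the answerer): a plain self-join that pairs every record with the Sub_group_number its Id had in the previous month, replacing the nested loop with a single plan. It assumes the Hive table is literally named table and uses naive Year_Month + 1 arithmetic, which breaks across year boundaries:

from pyspark.sql import functions as F

t = spark.table("table")  # assumes the table name from the question

year_months = [202101, 202102, 202103]

# Previous-month snapshot: each Id with the group it belonged to,
# keyed by the month that follows it (naive + 1; breaks at 202112 -> 202201).
prev = (t.where(F.col("Year_Month").isin(year_months))
         .select(F.col("Id").alias("Prev_Id"),
                 (F.col("Year_Month") + 1).alias("Next_Month"),
                 F.col("Sub_group_number").alias("FROM")))

# Current-month records, joined back to the group their Id came from.
curr = t.where(F.col("Year_Month").isin([m + 1 for m in year_months]))

df = (curr.join(prev, (curr["Id"] == prev["Prev_Id"]) &
                      (curr["Year_Month"] == prev["Next_Month"]))
          .groupBy("Year_Month", "FROM", "Sub_group_number")
          .count())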

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution 1: user2314737
Solution 2: