PySpark: join and union in for loop

I have some really simple logic that I would like to make work in PySpark, but I can't figure out how.

for data in df1.collect():  # a DataFrame is not directly iterable; collect() brings the rows to the driver
    spark_data_row = spark.createDataFrame(data=[data])
    spark_data_row = spark_data_row.join(df2)
    df2 = df2.union(spark_data_row)

Basically, I want to join each row of df1 to df2 and then append the result back to df2, which is initially empty. The resulting df2 is the dataframe I am interested in.

However, Spark runs seemingly forever even on a small df1 with only 100 rows. I am not sure why this is happening. Can someone suggest another way of doing this?

Side note: If I comment out either the join or the union statement in this loop, it works in seconds!



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
