How to parallelize a function the correct way

I currently have a setup with a very large pandas DataFrame df that can be filtered into separate batches, e.g. df_batch_1 = df[df['column'] == <filter 1>], df_batch_2 = df[df['column'] == <filter 2>], etc. In addition, I have a very "heavy" function heavy_function(df_batch) that must do some expensive computation on each batch df_batch_1, df_batch_2, etc. My plan is to do this heavy lifting on a Databricks (PySpark) cluster, since it runs far too slowly on a regular computer.
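For concreteness, here is a minimal, runnable sketch of the setup I described; the tiny stand-in DataFrame, the filter values, and the body of heavy_function are all placeholders:

import pandas as pd

def heavy_function(df_batch: pd.DataFrame) -> pd.DataFrame:
    # placeholder for the expensive per-batch computation
    return df_batch

# stand-in for the very large DataFrame (the real one is far bigger)
df = pd.DataFrame({"column": ["filter 1", "filter 1", "filter 2"], "value": [1, 2, 3]})

filters = ["filter 1", "filter 2"]                  # placeholder filter values
batches = [df[df["column"] == f] for f in filters]  # one batch per filter value
results = [heavy_function(b) for b in batches]      # run sequentially: far too slow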

So far, I have been running this with threads on the Databricks cluster, like this:

import threading

threads = [threading.Thread(target=heavy_function, args=(df[df['column'] == f],))
           for f in [<filter 1>, <filter 2>, ...]]
for t in threads:
    t.start()

I have been told that this is an anti-pattern and that I should find another way of doing it. I hope you can point me in the right direction. What is the right "pysparkian" way of doing this?
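For reference, here is the kind of approach I imagine the answer might involve, based on my (possibly wrong) understanding of PySpark's grouped pandas UDFs: convert df to a Spark DataFrame, group it by the filter column, and let applyInPandas run heavy_function once per group. I am not sure this is actually the recommended pattern, and the output schema handling and the sink path below are just placeholders:

from pyspark.sql import SparkSession
import pandas as pd

spark = SparkSession.builder.getOrCreate()

def heavy_function(pdf: pd.DataFrame) -> pd.DataFrame:
    # expensive per-batch computation; assumed here to return a DataFrame
    # with the same columns as its input (placeholder)
    return pdf

sdf = spark.createDataFrame(df)   # df is the large pandas DataFrame from above
result = (
    sdf.groupBy("column")         # one group per filter value
       .applyInPandas(heavy_function, schema=sdf.schema)
)
result.write.mode("overwrite").parquet("/tmp/heavy_output")  # placeholder path, just to trigger execution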

Any help is appreciated!

Best regards, DK


