Is it possible to load multiple directories separately in PySpark but process them in parallel?
I have an S3 or Azure Blob directory structure like the following:
    parent_dir
        child_dir1
            avro_1
            avro_2
            ...
        child_dir2
            ...
There are one to two hundred child_dirs, each containing a couple of files of a few GB each.
I want to use PySpark to apply a transformation to each Avro file and write the output back using the same directory structure (the number of Avro files can change, but the directory structure and data ownership have to stay the same, i.e. data in one child_dir cannot be output to another child_dir).
Right now I list the directories under parent_dir and convert all the Avro files within each child_dir into a DataFrame (a simplified sketch of this step is shown after the layout below):

    parent_dir
        child_dir1 -> df_for_child1
        child_dir2 -> df_for_child2
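For reference, the loading step looks roughly like this. This is only a simplified sketch: list_child_dirs is a placeholder for however the child directories are enumerated (e.g. boto3 for S3 or the Azure SDK), and reading Avro assumes the spark-avro package is on the classpath.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("per-dir-avro").getOrCreate()

    parent_dir = "s3a://my-bucket/parent_dir"  # placeholder path

    # One DataFrame per child directory, keyed by directory name
    dfs_by_child = {}
    for child in list_child_dirs(parent_dir):  # placeholder helper
        dfs_by_child[child] = (
            spark.read.format("avro")
                 .load(f"{parent_dir}/{child}/*.avro")
        )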
I then iterate over the list of DataFrames, transform each one, and write it out based on which child_dir it came from:

    for df_container in df_container_list:
        df = df_container.getdf()
        df = df.transform(my_transform_func)  # transform returns a new DataFrame
        df.write.format("avro").save(generate_output_path(df_container.get_sub_dir(), df))
I was wondering: is it possible to parallelize the transformation, or are the transformation and read/write already parallelized internally by Spark?
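To make the question concrete, this is the kind of pattern I am wondering about (only a sketch, and I am not sure it is the right approach): each individual read/transform/write is distributed across the cluster, but the for loop above submits those jobs one at a time from the driver, so one idea is to submit them from a thread pool instead and let Spark schedule the jobs concurrently.

    from concurrent.futures import ThreadPoolExecutor

    def process_one(df_container):
        # Same placeholder objects and helpers as in the loop above.
        df = df_container.getdf().transform(my_transform_func)
        (df.write.format("avro")
           .mode("overwrite")
           .save(generate_output_path(df_container.get_sub_dir(), df)))

    # 8 workers is an arbitrary choice; each call triggers its own Spark job,
    # and jobs submitted from different driver threads can run concurrently.
    with ThreadPoolExecutor(max_workers=8) as pool:
        list(pool.map(process_one, df_container_list))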
Sources
Source: Stack Overflow, licensed under CC BY-SA 3.0.