Is it possible to load multiple directories separately in PySpark but process them in parallel?
I have an S3 or Azure Blob directory structure like the following:
    parent_dir
        child_dir1
            avro_1
            avro_2
            ...
        child_dir2
            ...
There are one to two hundred child_dirs, each containing a couple of files of a few GB each.
I want to use PySpark to apply a transformation to each Avro file and write the output back using the same directory structure (the number of Avro files can change, but the directory structure and data ownership have to stay the same, i.e. data in one child_dir cannot be output to another child_dir).
Right now I list the directories under parent_dir and convert all the Avro files within each child_dir into a DataFrame (a simplified sketch of this step is shown after the layout below):

    parent_dir
        child_dir1 -> df_for_child1
        child_dir2 -> df_for_child2
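For reference, the loading step looks roughly like this. This is only a simplified sketch: list_child_dirs is a placeholder for however the child directories are enumerated (e.g. boto3 for S3 or the Azure SDK), and reading Avro assumes the spark-avro package is on the classpath.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("per-dir-avro").getOrCreate()

    parent_dir = "s3a://my-bucket/parent_dir"  # placeholder path

    # One DataFrame per child directory, keyed by directory name
    dfs_by_child = {}
    for child in list_child_dirs(parent_dir):  # placeholder helper
        dfs_by_child[child] = (
            spark.read.format("avro")
                 .load(f"{parent_dir}/{child}/*.avro")
        )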
I then iterate over the list of DataFrames, transform each one, and write it out based on which child_dir it came from:

    for df_container in df_container_list:
        df = df_container.getdf()
        df = df.transform(my_transform_func)  # transform returns a new DataFrame
        df.write.format("avro").save(generate_output_path(df_container.get_sub_dir(), df))
I was wondering: is it possible to parallelize the transformation, or are the transformation and read/write already parallelized internally by Spark?
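To make the question concrete, this is the kind of pattern I am wondering about (only a sketch, and I am not sure it is the right approach): each individual read/transform/write is distributed across the cluster, but the for loop above submits those jobs one at a time from the driver, so one idea is to submit them from a thread pool instead and let Spark schedule the jobs concurrently.

    from concurrent.futures import ThreadPoolExecutor

    def process_one(df_container):
        # Same placeholder objects and helpers as in the loop above.
        df = df_container.getdf().transform(my_transform_func)
        (df.write.format("avro")
           .mode("overwrite")
           .save(generate_output_path(df_container.get_sub_dir(), df)))

    # 8 workers is an arbitrary choice; each call triggers its own Spark job,
    # and jobs submitted from different driver threads can run concurrently.
    with ThreadPoolExecutor(max_workers=8) as pool:
        list(pool.map(process_one, df_container_list))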
Sources
Source: Stack Overflow, licensed under CC BY-SA 3.0.