Is there any way to read multiple Parquet paths from S3 in parallel using Spark?

My data is stored in S3 (Parquet format) under different paths, and I'm using spark.read.parquet(paths: _*) to read all the paths into one DataFrame. Unfortunately, Spark reads the Parquet metadata sequentially (path after path) rather than in parallel. Once the metadata has been read, the data itself is read in parallel, but the metadata phase is very slow and the machines are underutilized.
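For reference, a minimal sketch of the single-call form I'm using today (the bucket and path names below are placeholders):

val paths = Seq("s3://my-bucket/path-a", "s3://my-bucket/path-b") // placeholder paths
val df = spark.read.parquet(paths: _*) // Spark lists each path's Parquet footers one after another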

Is there any way to read multiple Parquet paths from S3 in parallel using Spark?

I would appreciate hearing your opinion on this.



Solution 1:[1]

So after some time I figured out that I can achieve this by reading each path on a separate thread and unioning the results, e.g.:

import scala.collection.parallel.ForkJoinTaskSupport

val paths = List[String]("a", "b", "c")
// Turn the list into a parallel collection so each path is handled on its own thread
val parallelPaths = paths.par
// Size the pool to the number of paths (deprecated alias; on newer Scala use java.util.concurrent.ForkJoinPool)
parallelPaths.tasksupport = new ForkJoinTaskSupport(new scala.concurrent.forkjoin.ForkJoinPool(paths.length))
// Read from the parallel collection (not the original list) and union the resulting DataFrames
parallelPaths.map(path => spark.read.parquet(path)).reduce(_ union _)
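Two caveats: union matches columns by position, so all paths should share the same schema (on Spark 2.3+, unionByName can be used if column order differs), and this only parallelizes the driver-side metadata reads; the data itself is read on the executors as before.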

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

[1] Solution 1 by psyduck, Stack Overflow