Spark: need to process 20 MB of data spread across 10k files in HDFS

Can someone please share views/suggestions on the scenario below:

Scenario: I need to process 20 MB of data in Spark, split across 10k files in HDFS. YARN is the cluster manager, and the HDFS block size is 256 MB.
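
For context, a minimal sketch of how reading such data might look, assuming the files are plain text under a hypothetical path hdfs:///data/small-files/; with 10k small files, coalescing the resulting partitions down to a handful is one common way to avoid launching thousands of tiny tasks:

    import org.apache.spark.sql.SparkSession

    object SmallFilesSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("small-files-sketch")
          .getOrCreate()

        // Hypothetical input path: 10k small files totalling ~20 MB.
        // textFile creates at least one partition per file, so this RDD
        // starts with roughly 10k partitions.
        val raw = spark.sparkContext.textFile("hdfs:///data/small-files/*")

        // Coalesce to a small number of partitions so the job runs a few
        // reasonably sized tasks instead of ~10k tiny ones.
        val compacted = raw.coalesce(8)

        println(s"partitions after coalesce: ${compacted.getNumPartitions}")
        println(s"line count: ${compacted.count()}")

        spark.stop()
      }
    }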

Questions:

  1. How many executors and cores should I launch?
  2. If I use just 1 executor and 1 core (since the total data size is < 256 MB), am I not giving up Spark's advantage here? A sketch of that configuration follows this list.
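
For reference, a minimal 1-executor/1-core setup as described in question 2 might look like the following; the property names are standard Spark settings, but the values and paths here are only illustrative:

    import org.apache.spark.sql.SparkSession

    object SingleExecutorSketch {
      def main(args: Array[String]): Unit = {
        // Illustrative only: pins the job to a single executor with one core.
        // Equivalent flags can be passed to spark-submit
        // (--num-executors 1 --executor-cores 1 --executor-memory 1g).
        val spark = SparkSession.builder()
          .appName("single-executor-sketch")
          .config("spark.executor.instances", "1")
          .config("spark.executor.cores", "1")
          .config("spark.executor.memory", "1g")
          .getOrCreate()

        // With one core, partitions are processed one task at a time,
        // so there is no parallelism across the 10k files.
        val data = spark.sparkContext.textFile("hdfs:///data/small-files/*")
        println(s"total lines: ${data.count()}")

        spark.stop()
      }
    }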

