Spark: need to process 20 MB of data spread across 10k files in HDFS

Can someone please share views/suggestions on the scenario below:

Scenario: I need to process 20 MB of data in Spark, split across 10k files in HDFS. YARN is the cluster manager, and the HDFS block size is 256 MB.
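
For context, a minimal sketch of how reading such data might look, assuming the files are plain text under a hypothetical path hdfs:///data/small-files/; with 10k small files, coalescing the resulting partitions down to a handful is one common way to avoid launching thousands of tiny tasks:

    import org.apache.spark.sql.SparkSession

    object SmallFilesSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("small-files-sketch")
          .getOrCreate()

        // Hypothetical input path: 10k small files totalling ~20 MB.
        // textFile creates at least one partition per file, so this RDD
        // starts with roughly 10k partitions.
        val raw = spark.sparkContext.textFile("hdfs:///data/small-files/*")

        // Coalesce to a small number of partitions so the job runs a few
        // reasonably sized tasks instead of ~10k tiny ones.
        val compacted = raw.coalesce(8)

        println(s"partitions after coalesce: ${compacted.getNumPartitions}")
        println(s"line count: ${compacted.count()}")

        spark.stop()
      }
    }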

Questions:

  1. How many executors and cores should I launch?
  2. If I use just 1 executor and 1 core (since the total data size is < 256 MB), am I not giving up Spark's advantage here? A sketch of that configuration follows this list.
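
For reference, a minimal 1-executor/1-core setup as described in question 2 might look like the following; the property names are standard Spark settings, but the values and paths here are only illustrative:

    import org.apache.spark.sql.SparkSession

    object SingleExecutorSketch {
      def main(args: Array[String]): Unit = {
        // Illustrative only: pins the job to a single executor with one core.
        // Equivalent flags can be passed to spark-submit
        // (--num-executors 1 --executor-cores 1 --executor-memory 1g).
        val spark = SparkSession.builder()
          .appName("single-executor-sketch")
          .config("spark.executor.instances", "1")
          .config("spark.executor.cores", "1")
          .config("spark.executor.memory", "1g")
          .getOrCreate()

        // With one core, partitions are processed one task at a time,
        // so there is no parallelism across the 10k files.
        val data = spark.sparkContext.textFile("hdfs:///data/small-files/*")
        println(s"total lines: ${data.count()}")

        spark.stop()
      }
    }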

