'Do I need to use Spark with YARN to achieve NODE LOCAL data locality with HDFS?
Do I need to use Spark with YARN to achieve NODE LOCAL data locality with HDFS?
If I use Spark standalone cluster manager and have my data distributed in HDFS cluster, how will Spark know that data is located locally on the nodes?
Solution 1:[1]
YARN is a resource manager. It deals with memory and processes, and not with the workings of HDFS or data-locality.
Since Spark can read from HDFS sources, and the namenodes & datanodes take care of all that HDFS block data management outside of YARN, then I believe the answer is no, you don't need YARN. But you already have HDFS, which means you have Hadoop, so why not take advantage of integrating Spark into YARN?
Solution 2:[2]
the stand alone mode has its own cluster manager/resource manager which will talk to name node for locality. client/driver will put tasks based on the results.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | OneCricketeer |
Solution 2 | Cobb Von |