Spark job loses executors: ERROR TaskSchedulerImpl: Lost executor 1... -> ./app.jar: No space left on device

I'm running both the master and 1 worker on a GPU server in standalone mode. After I submit the job, it repeatedly acquires and then loses executors a number of times before timing out.

Spark-submit

spark-submit                                            \
--conf spark.plugins=com.nvidia.spark.SQLPlugin         \
--conf spark.rapids.memory.gpu.pooling.enabled=false    \
--conf spark.executor.resource.gpu.amount=1             \
--conf spark.task.resource.gpu.amount=1                 \
--jars ${SPARK_CUDF_JAR},${SPARK_RAPIDS_PLUGIN_JAR}     \
--master spark://<ip>:7077                              \
--driver-memory 2g                                      \
--executor-memory 10g                                   \
--conf spark.cores.max=1                                \
--class com.spark.examples.Class                        \
app.jar                                                 \
-dataPath=spark/data.csv                                \
-format=csv                                             \
-numWorkers=1                                           \
-treeMethod=gpu_hist                                    \
-numRound=100                                           \
-maxDepth=8

Logs

StandaloneSchedulerBackend: Granted executor ID app on hostPort <ip:port> with 1 core(s), 10.0 GiB RAM
21/03/30 05:29:29 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app- is now RUNNING
21/03/30 05:29:31 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (<ip:port>) with ID 2,  ResourceProfileId 0
21/03/30 05:29:31 INFO BlockManagerMasterEndpoint: Registering block manager (<ip:port>) with 5.8 GiB RAM, BlockManagerId(2, <ip>, 45302, None)
21/03/30 05:29:37 ERROR TaskSchedulerImpl: Lost executor 2 <ip>: Unable to create executor due to /tmp/spark-d0f43315/executor-ec041ccd/spark-295e56db/-716_cache -> ./app.jar: No space left on device
21/03/30 05:29:37 INFO DAGScheduler: Executor lost: 2 (epoch 2)

Specs

I'm using an AWS EC2 G4dn instance.

GPU: TU104GL [Tesla T4], 15109 MiB
Driver Version: 460.32.03
CUDA Version: 11.2

1 worker: 1 core, 10GB of memory.


Solution 1

From the logs, it seems that Spark is using /tmp for its executor data, and as the error message says, "No space left on device", that directory's partition has run out of space.
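
You can confirm this on the worker host by checking the partition backing /tmp. The paths below just match the defaults shown in your logs, so adjust them if yours differ:

# how full is the filesystem that backs /tmp?
df -h /tmp

# how much of it is taken up by Spark's scratch directories?
du -sh /tmp/spark-* 2>/dev/null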

My suggestion is to use a partition with more space for the job's scratch data. The configuration you are looking for is spark.local.dir; just point it to another directory in your spark-submit, as shown below.
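
For example, assuming /mnt/spark-tmp is a directory on a larger partition that the Spark user can write to (the path here is only illustrative), create it and add a single --conf line to the spark-submit command above, keeping everything else unchanged:

mkdir -p /mnt/spark-tmp

spark-submit                                            \
--conf spark.local.dir=/mnt/spark-tmp                   \
--conf spark.plugins=com.nvidia.spark.SQLPlugin         \
...

One thing to be aware of: in standalone mode, if SPARK_LOCAL_DIRS is set in conf/spark-env.sh on the worker, it takes precedence over spark.local.dir for the executors, so you may need to point that variable at the larger partition as well:

# conf/spark-env.sh on the worker machine
export SPARK_LOCAL_DIRS=/mnt/spark-tmp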

Best regards.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution 1: Erebus