Setting environment variables on Dataproc cluster nodes
I would like an environment variable to be set on each node of my dataproc cluster so that it is available to a pyspark job that will be running on that cluster. What is the best way to do this?
I'm wondering if there is a way to do it using Compute Engine metadata (although my research so far indicates that Compute Engine metadata is available via the metadata server on Compute Engine instances, not via environment variables).
Other than that, the only approach I can think of is issuing an export command in a Dataproc initialisation script.
Can anyone suggest any other alternatives?
Solution 1:[1]
OS level
Dataproc doesn't have first-class support for OS-level custom environment variables that apply to all processes, but you can achieve it with init actions by adding your env variables to /etc/environment. You might need to restart the services in the init action.
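For example, a minimal init action might look like the sketch below (the variable name, value, and service name are illustrative, not part of the original answer):
#!/usr/bin/env bash
# Illustrative init action: append a custom variable to /etc/environment
# so that processes started after this point inherit it.
set -euxo pipefail
echo "FOO=hello" >> /etc/environment
# Services that were already running when this script executed read their
# environment at startup, so restart them if they need the new variable,
# e.g.: systemctl restart hadoop-yarn-nodemanager
You would then attach the script with --initialization-actions gs://your-bucket/set-env.sh when creating the cluster (the bucket path is illustrative).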
Hadoop and Spark services
For Hadoop and Spark services, you can set properties with the hadoop-env or spark-env prefix when creating the cluster, for example:
gcloud dataproc clusters create \
  --properties hadoop-env:FOO=hello,spark-env:BAR=world \
  ...
See this doc for more details.
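As a quick sanity check (the path assumes Dataproc's default configuration layout), you can confirm on a cluster node that a spark-env: property landed in Spark's environment file:
# Run on a cluster node; FOO comes from the example above.
grep FOO /etc/spark/conf/spark-env.sh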
Spark jobs
Spark allows setting env variables at the job level. For executors, you can always use spark.executorEnv.[NAME] to set env variables, but for drivers there is a difference depending on whether you are running the job in cluster mode or client mode.
Client mode (default)
In client mode, driver env variables need to be set in spark-env.sh when creating the cluster. You can use --properties spark-env:[NAME]=[VALUE] as described above.
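Concretely, a client-mode driver variable would be supplied at cluster creation time (the value is illustrative):
gcloud dataproc clusters create \
  --properties spark-env:FOO=hello \
  ...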
Executor env variables can be set when submitting the job, for example:
gcloud dataproc jobs submit spark \
  --properties spark.executorEnv.BAR=world \
  ...
or
spark-submit --conf spark.executorEnv.BAR=world ...
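Since the question concerns a PySpark job, the equivalent submission would be (file, cluster, and value are illustrative):
gcloud dataproc jobs submit pyspark job.py \
  --cluster my-cluster \
  --properties spark.executorEnv.BAR=world \
  ...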
Cluster mode
In cluster mode, driver env variables can be set with spark.yarn.appMasterEnv.[NAME], for example:
gcloud dataproc jobs submit spark \
  --properties spark.submit.deployMode=cluster,spark.yarn.appMasterEnv.FOO=hello,spark.executorEnv.BAR=world \
  ...
or
spark-submit \
  --deploy-mode cluster \
  --conf spark.yarn.appMasterEnv.FOO=hello \
  --conf spark.executorEnv.BAR=world \
  ...
See this doc for more details.
Solution 2:[2]
There is no cluster-level env variable mechanism in Dataproc as such; however, most components have their own env variable settings, and you can set those through Dataproc cluster properties.
Solution 3:[3]
You can use GCE metadata together with a startup-script-url to write to /etc/environment.
gcloud dataproc clusters create NAME \
  --metadata foo=bar,startup-script-url=gs://some-bucket/startup.sh \
  ...
gs://some-bucket/startup.sh
#!/usr/bin/env bash
# Read the custom "foo" metadata value and persist it in /etc/environment.
ENV_VAR=$(curl -s "http://metadata.google.internal/computeMetadata/v1/instance/attributes/foo" -H "Metadata-Flavor: Google")
echo "foo=${ENV_VAR}" >> /etc/environment
Hope it helps...
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source
---|---
Solution 1 |
Solution 2 | Henry Gong
Solution 3 | Luis Armando