Setting environment variables on Dataproc cluster nodes

I would like an environment variable to be set on each node of my Dataproc cluster so that it is available to a PySpark job that will be running on that cluster. What is the best way to do this?

I'm wondering if there is a way to do it using Compute Engine metadata (although my research so far indicates that Compute Engine metadata is available via the metadata server on Compute Engine instances, not via environment variables).

Other than that, the only approach I can think of is issuing an export command in a Dataproc initialisation script.

Can anyone suggest any other alternatives?



Solution 1:[1]

OS level

Dataproc doesn't have first-class support for OS-level custom environment variables that apply to all processes, but you can achieve this with init actions that add your env variables to /etc/environment. You might need to restart the services in the init action.
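For example, a minimal init action sketch, assuming a placeholder variable FOO (the value and the commented-out service restart are illustrative and depend on your image version):

#!/usr/bin/env bash
# Hypothetical init action: runs on every node during cluster creation and
# makes FOO visible system-wide by appending it to /etc/environment.
echo "FOO=hello" >> /etc/environment
# Services that are already running only see the change after a restart,
# for example (adjust the service name to your image version):
# systemctl restart hadoop-yarn-nodemanager

You would then pass the script when creating the cluster with --initialization-actions gs://some-bucket/set-env.sh (bucket and file name are placeholders).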

Hadoop and Spark services

For Hadoop and Spark services, you can set properties with hadoop-env or spark-env prefix when creating the cluster, for example:

gcloud dataproc clusters create NAME \
    --properties hadoop-env:FOO=hello,spark-env:BAR=world \
    ...

See this doc for more details.

Spark jobs

Spark allows setting env variables at the job level. For executors, you can always use spark.executorEnv.[NAME] to set env variables; for drivers, it depends on whether you are running the job in client mode or cluster mode.

Client mode (default)

In client mode, driver env variables need to be set in spark-env.sh when creating the cluster. You can use --properties spark-env:[NAME]=[VALUE] as described above.

Executor env variables can be set when submitting the job, for example:

gcloud dataproc jobs submit spark \
    --properties spark.executorEnv.BAR=world \
    ...

or

spark-submit --conf spark.executorEnv.BAR=world ...

Cluster mode

In cluster mode, driver env variables can be set with spark.yarn.appMasterEnv.[NAME], for example:

gcloud dataproc jobs submit spark \
    --properties spark.submit.deployMode=cluster,spark.yarn.appMasterEnv.FOO=hello,spark.executorEnv.BAR=world \
    ...

or

spark-submit \
    --deploy-mode cluster \
    --conf spark.yarn.appMasterEnv.FOO=hello \
    --conf spark.executorEnv.BAR=world \
    ...

See this doc for more details.
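One way to check where each variable actually lands is to submit a small PySpark job that prints the variables as seen by the driver and by an executor. This is only a sketch; check_env.py, NAME and REGION are placeholders:

#!/usr/bin/env bash
# Write a tiny PySpark check script, then submit it to the cluster.
cat > check_env.py <<'EOF'
import os
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# FOO comes from spark-env (client mode) or spark.yarn.appMasterEnv.FOO (cluster mode).
print("driver FOO =", os.environ.get("FOO"))
# BAR comes from spark.executorEnv.BAR and is read inside an executor.
vals = spark.sparkContext.parallelize([0], 1).map(lambda _: os.environ.get("BAR")).collect()
print("executor BAR =", vals)
EOF

gcloud dataproc jobs submit pyspark check_env.py \
    --cluster=NAME \
    --region=REGION \
    --properties spark.executorEnv.BAR=world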

Solution 2:[2]

There is no cluster-level env variable in Dataproc; however, most components have their own env variable settings, and you can set those through Dataproc cluster properties.
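For example, reusing the placeholder variable FOO with the property prefixes already shown in Solution 1:

gcloud dataproc clusters create NAME \
    --properties hadoop-env:FOO=hello,spark-env:FOO=hello \
    ...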

Solution 3:[3]

You can use GCE metadata plus a startup-script-url that writes the value to /etc/environment.

gcloud dataproc clusters create NAME \
  --metadata foo=bar,startup-script-url=gs://some-bucket/startup.sh \
  ...

gs://some-bucket/startup.sh

#!/usr/bin/env bash
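# Read the custom "foo" metadata value and persist it system-wide.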

ENV_VAR=$(curl -s "http://metadata.google.internal/computeMetadata/v1/instance/attributes/foo" -H "Metadata-Flavor: Google")
echo "foo=${ENV_VAR}" >> /etc/environment

Hope it helps...

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution    Source
Solution 1
Solution 2  Henry Gong
Solution 3  Luis Armando