Google Cloud Dataproc cluster created with an environment.yaml and the Jupyter component, but the environment is not available as a Jupyter kernel
I have created a new Dataproc cluster with a specific environment.yaml. Here is the command that I used to create the cluster:
gcloud dataproc clusters create dataproc-testing1 \
    --enable-component-gateway \
    --bucket my-test-bucket \
    --region us-central1 --zone us-central1-c \
    --master-machine-type n1-standard-2 \
    --master-boot-disk-size 32 \
    --num-workers 3 \
    --worker-machine-type n1-standard-2 \
    --worker-boot-disk-size 32 \
    --num-secondary-workers 3 \
    --preemptible-worker-boot-disk-type \
    --preemptible-worker-boot-disk-size 32 \
    --num-preemptible-worker-local-ssds 0 \
    --image-version 2.0-ubuntu18 \
    --properties dataproc:conda.env.config.uri=gs://my-test-bucket/environment.yaml \
    --optional-components JUPYTER \
    --scopes 'https://www.googleapis.com/auth/cloud-platform' \
    --project my-project
This successfully creates the cluster.
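For reference, here is a minimal sketch of the kind of environment.yaml that dataproc:conda.env.config.uri points at. The actual file contents are not shown in the question, so the environment name, Python pin, and packages below are assumptions, and the gsutil upload is only illustrative:

cat > environment.yaml <<'EOF'
# Hypothetical contents; the real file is not shown in the question.
name: pyspark
channels:
  - conda-forge
dependencies:
  - python=3.9.7
  - pandas   # example dependency, assumed
  - numpy    # example dependency, assumed
EOF

# Upload it to the bucket referenced by dataproc:conda.env.config.uri.
gsutil cp environment.yaml gs://my-test-bucket/environment.yaml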
I have been able to SSH into the master and worker nodes, and they all have a pyspark environment that was created from the environment.yaml I specified in the cluster creation command above. All the dependencies are there, and the Python version is 3.9.7.
After SSHing into a worker or master node, running python --version gives me Python 3.9.7, and running conda env list gives me
# conda environments:
#
base                     /opt/conda/miniconda3
pyspark               *  /opt/conda/miniconda3/envs/pyspark
Hence, the activated environment is pyspark. I can deactivate this environment with conda deactivate, after which the base environment is activated; there, python --version results in Python 3.8.12.
So far, everything is as I expect.
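As a side note, the same check can be made without relying on which environment happens to be activated, by addressing each environment's interpreter by its path (a sketch; the paths are taken from the conda env list output above):

/opt/conda/miniconda3/envs/pyspark/bin/python --version   # expected: Python 3.9.7
/opt/conda/miniconda3/bin/python --version                # base env: Python 3.8.12
conda list -n pyspark                                     # packages installed from environment.yaml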
Now, I launched a Jupyter notebook from the Web Interfaces tab in the cluster console, and the problem is:
It has only the 'PySpark' (note this is not the same as pyspark), 'Python 3', 'spylon-kernel', and 'R' kernels available. 'R' is for R and 'spylon-kernel' is for Scala.
I activate the 'PySpark' kernel and run
import sys
sys.version
and the output is
'3.8.12 | packaged by conda-forge | (default, Oct 12 2021, 21:59:51) \n[GCC 9.4.0]'
I activate the 'Python 3' kernel and run
import sys
sys.version
and the output is
'3.8.12 | packaged by conda-forge | (default, Oct 12 2021, 21:59:51) \n[GCC 9.4.0]'
In both of these kernels, none of the packages from environment.yaml are available.
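A way to see which interpreter each of these kernels actually launches is to inspect the kernel specs on the master node (a sketch; the exact kernelspec directory is an assumption and may differ):

jupyter kernelspec list
# Then inspect a spec, e.g. the default Python 3 kernel (path assumed):
cat /opt/conda/miniconda3/share/jupyter/kernels/python3/kernel.json
# The "argv" entry shows which python binary the kernel starts; here it points
# at the base env rather than /opt/conda/miniconda3/envs/pyspark.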
In conclusion, I cannot access the pyspark environment created from environment.yaml.
Can you please help me access the pyspark environment created from environment.yaml?
Solution 1:[1]
Dataproc launches the Jupyter notebook server from the root/base environment, and the user-specified environment is not used there. This is currently not supported by Dataproc.
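One way to confirm this on the master node is to check which interpreter the running notebook server and its kernels use (a sketch, not part of the original answer; process names and paths may differ slightly):

ps aux | grep -i '[j]upyter'
# The server and kernel processes typically run from /opt/conda/miniconda3/bin
# (the base environment), not from /opt/conda/miniconda3/envs/pyspark/bin.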
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Ranu Vikram |