Google Cloud Dataproc cluster created with an environment.yaml and the Jupyter component, but the environment is not available as a Jupyter kernel

I have created a new Dataproc cluster with a specific environment.yaml. Here is the command I used to create the cluster:

gcloud dataproc clusters create dataproc-testing1 \
    --enable-component-gateway \
    --bucket my-test-bucket \
    --region us-central1 --zone us-central1-c \
    --master-machine-type n1-standard-2 \
    --master-boot-disk-size 32 \
    --num-workers 3 \
    --worker-machine-type n1-standard-2 \
    --worker-boot-disk-size 32 \
    --num-secondary-workers 3 \
    --preemptible-worker-boot-disk-type \
    --preemptible-worker-boot-disk-size 32 \
    --num-preemptible-worker-local-ssds 0 \
    --image-version 2.0-ubuntu18 \
    --properties dataproc:conda.env.config.uri=gs://my-test-bucket/environment.yaml \
    --optional-components JUPYTER \
    --scopes 'https://www.googleapis.com/auth/cloud-platform' \
    --project my-project

This successfully creates the cluster.

I have been able to SSH into the master and worker nodes, and they all have a conda environment named pyspark that was created from the environment.yaml I specified in the cluster creation command above. All the dependencies are there, and the Python version is 3.9.7.

After SSHing into a worker or master node, running python --version gives me Python 3.9.7.

Running conda env list gives

# conda environments:
#
base                     /opt/conda/miniconda3
pyspark               *  /opt/conda/miniconda3/envs/pyspark

Hence, the pyspark environment is the one that is active by default.

I can deactivate this environment with conda deactivate; the base environment is then activated, and there python --version gives Python 3.8.12.
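
In short, the checks on the master node look like this (output abbreviated from the runs described above):

python --version    # Python 3.9.7  (pyspark env active by default)
conda env list      # base -> /opt/conda/miniconda3, pyspark (active) -> /opt/conda/miniconda3/envs/pyspark
conda deactivate
python --version    # Python 3.8.12 (base env)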

So far everything is as I expect.

Now, I launched a Jupyter notebook from the Web Interfaces tab in the cluster console, and the problem is:

It has only the 'PySpark' (note: this is not the same as the pyspark conda environment), 'Python 3', 'spylon-kernel', and 'R' kernels available. 'R' is for R and 'spylon-kernel' is for Scala.

I select the 'PySpark' kernel and run

import sys
sys.version

and the output is

'3.8.12 | packaged by conda-forge | (default, Oct 12 2021, 21:59:51) \n[GCC 9.4.0]'

I select the 'Python 3' kernel and run

import sys
sys.version

and the output is '3.8.12 | packaged by conda-forge | (default, Oct 12 2021, 21:59:51) \n[GCC 9.4.0]'

In both of these kernels, none of the packages from environment.yaml are available.

In conclusion, I cannot access the pyspark environment created from environment.yaml in any of the available Jupyter kernels.
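
A quick way to confirm this from the master node is to list the kernelspecs that the base-environment Jupyter has registered (a diagnostic sketch; it assumes jupyter is on the PATH of the base environment, and the exact kernel names may differ):

# Run on the master node with the base environment active.
jupyter kernelspec list
# Expected to show only the stock kernels (PySpark, Python 3, spylon-kernel, R);
# none of them point at /opt/conda/miniconda3/envs/pyspark.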

Can you please help me access the pyspark environment created by environment.yaml?



Solution 1:[1]

Dataproc launches the Jupyter notebook server from the root/base environment, so the user-specified conda environment is not used there. This is currently not supported by Dataproc.
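
That said, a manual workaround (not an official Dataproc feature; a sketch that assumes the env path shown by conda env list above, and that installing ipykernel into that env is acceptable) is to register the pyspark environment as an additional kernel so the base-environment Jupyter can see it:

# On the master node. Paths assume the layout shown by `conda env list` above.
# Install ipykernel into the pyspark env if it is not already there.
sudo /opt/conda/miniconda3/envs/pyspark/bin/pip install ipykernel

# Register a kernelspec for that env where the base-environment Jupyter will find it.
sudo /opt/conda/miniconda3/envs/pyspark/bin/python -m ipykernel install \
    --prefix /opt/conda/miniconda3 \
    --name pyspark-env \
    --display-name "Python (pyspark env)"

After reloading the Jupyter UI, a "Python (pyspark env)" kernel should appear and use the 3.9.7 interpreter and packages from that environment.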

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1: Ranu Vikram