I've got a Dataproc cluster running, configured this way: { "worker_config": { "num_instances": 20 }, "secondary_worker_config": { "
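For reference, here is a minimal sketch of how a worker/secondary-worker layout like this can be expressed with the google-cloud-dataproc Python client. The project, region, machine type, and the secondary-worker count are placeholders (the original config is cut off above), not the asker's actual values.

```python
# Minimal sketch using the google-cloud-dataproc client library.
# Project, region, machine type, and secondary-worker count are assumed placeholders.
from google.cloud import dataproc_v1


def create_cluster(project_id: str, region: str, cluster_name: str):
    client = dataproc_v1.ClusterControllerClient(
        client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
    )
    cluster = {
        "project_id": project_id,
        "cluster_name": cluster_name,
        "config": {
            # Primary workers, matching the num_instances shown in the question.
            "worker_config": {"num_instances": 20, "machine_type_uri": "n1-standard-4"},
            # Secondary (preemptible) workers -- the count here is illustrative only.
            "secondary_worker_config": {"num_instances": 10},
        },
    }
    operation = client.create_cluster(
        request={"project_id": project_id, "region": region, "cluster": cluster}
    )
    return operation.result()
```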
I am trying to build a data migration pipeline using Airflow, with the source being a Hive table on a Dataproc cluster and the destination being BigQuery. I'm using Datapro
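One possible shape for such a pipeline, sketched below with the Airflow Google provider: export the Hive table to Cloud Storage as Parquet via a Dataproc Hive job, then load the files into BigQuery. All names (project, region, cluster, bucket, table, dataset) are placeholders, and the two-step export/load approach is an assumption, not necessarily what the asker intends.

```python
# Sketch of a Hive-on-Dataproc -> GCS -> BigQuery pipeline (Airflow 2.x provider syntax).
# Every identifier below is a placeholder.
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.dataproc import DataprocSubmitJobOperator
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator

with DAG("hive_to_bq_migration", start_date=datetime(2024, 1, 1), schedule=None) as dag:
    # Step 1: dump the Hive table as Parquet into a GCS staging path.
    export_hive = DataprocSubmitJobOperator(
        task_id="export_hive_table",
        project_id="my-project",
        region="us-central1",
        job={
            "placement": {"cluster_name": "my-dataproc-cluster"},
            "hive_job": {
                "query_list": {
                    "queries": [
                        "INSERT OVERWRITE DIRECTORY 'gs://my-staging-bucket/export/' "
                        "STORED AS PARQUET SELECT * FROM my_db.my_table;"
                    ]
                }
            },
        },
    )

    # Step 2: load the exported Parquet files into a BigQuery table.
    load_to_bq = GCSToBigQueryOperator(
        task_id="load_into_bigquery",
        bucket="my-staging-bucket",
        source_objects=["export/*"],
        destination_project_dataset_table="my-project.my_dataset.my_table",
        source_format="PARQUET",
        write_disposition="WRITE_TRUNCATE",
    )

    export_hive >> load_to_bq
```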
I'm aware of this existing thread: Where are the individual Dataproc Spark logs? However, if I SSH into a worker node VM and navigate to the /tmp fo
I would like an environment variable to be set on each node of my Dataproc cluster so that it is available to a PySpark job that will be running on that cluster
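If a per-job setting would also work here, a sketch of one alternative is below: passing the variable to the executors and the YARN application master through Spark configuration rather than setting it on every node. MY_SETTING and its value are hypothetical names used only for illustration.

```python
# Per-job sketch: expose a variable to executors (and to the driver in YARN
# cluster mode) via Spark conf. MY_SETTING is a placeholder name.
import os

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("env-var-demo")
    .config("spark.executorEnv.MY_SETTING", "some-value")        # visible on executors
    .config("spark.yarn.appMasterEnv.MY_SETTING", "some-value")  # visible to the YARN application master
    .getOrCreate()
)


def read_env(_):
    # Read the variable back on an executor to confirm it is set.
    return os.environ.get("MY_SETTING", "<not set>")


print(spark.sparkContext.parallelize(range(2), 2).map(read_env).collect())
```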
In a Google Cloud data lake environment, what is the Dataproc Metastore service used for? I'm watching a Google Cloud Tech video, and around the 17:33 mar
I am getting two types of errors when running a job on Google Dataproc, and they are causing executors to be lost one by one until the last executor is lost and the j
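The actual error messages are cut off above, so this is purely illustrative: if the executor losses turn out to be memory-related (a common cause is YARN killing containers that exceed their memory limits), these are the usual knobs to adjust. The values shown are placeholders, not recommendations.

```python
# Illustrative only -- applies if the losses are memory-related; values are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("executor-loss-investigation")
    .config("spark.executor.memory", "6g")
    .config("spark.executor.memoryOverhead", "2g")   # extra off-heap headroom per executor
    .config("spark.executor.cores", "2")
    .config("spark.dynamicAllocation.enabled", "true")
    .getOrCreate()
)
```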
I have created a new Dataproc cluster with a specific environment.yaml. Here is the command I used to create that cluster: gcloud dataproc clusters cr
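One way to check whether the conda environment from environment.yaml actually took effect is to submit a trivial PySpark job that reports the interpreter path and a package version on both the driver and an executor. The sketch below assumes this kind of check is useful here; "pandas" is just a placeholder for whatever the yaml is expected to install.

```python
# Small check job: confirm which Python interpreter and packages the job runs under.
# "pandas" is a placeholder package name.
import importlib.metadata
import sys

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("conda-env-check").getOrCreate()


def describe(_):
    try:
        version = importlib.metadata.version("pandas")
    except importlib.metadata.PackageNotFoundError:
        version = "<missing>"
    return f"python={sys.executable} pandas={version}"


print("driver  :", describe(None))
print("executor:", spark.sparkContext.parallelize([0], 1).map(describe).collect()[0])
```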