Category "amazon-emr"

Multiple values in EMR Cluster Configuration template

Within my EMR module I have a template that is deployed for the cluster configuration. This template contains all the cluster configuration requirements for t

PySpark self-signed certificate to access Artifactory from inside an EMR Jupyter Notebook

I am attempting to use a PySpark kernel from inside an EMR Notebook that is hosted on an AWS managed service (EMR) and I am unable to access Artifactory to inst

Getting duplicate records while querying Hudi table using Hive on Spark Engine in EMR 6.3.1

I am querying a Hudi table using Hive, which is running on the Spark engine in an EMR 6.3.1 cluster. The Hudi version is 0.7. I have inserted a few records and then updated t

Problem reading data from HBase on an AWS EMR cluster using a Java Spring Boot client

I'm trying to write a simple API application to read data from HBase on an AWS EMR cluster. But I get an UnknownHostException when I try to send the reques

AWS EMR EMRFS Kerberos login on policy refresh

I installed Kerberos on an EC2 server, and on a second EC2 server I installed Apache Ranger (with Kerberos auth added in the core-site file, hadoop.security.authentica

Package sparse_dot_topn install error in PySpark Jupyter on AWS EMR

I'm running a PySpark Jupyter notebook on AWS EMR and trying to install the Python package "sparse_dot_topn", version 0.2.9. I'm getting an error I don't understand

Spark on Rapids single node

I'm trying to run TPC-DS on single-node RAPIDS on EMR using this guide: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-rapids.html But I'm getting res

S3DistCp - Split source into multiple jobs

I have to copy data from S3 to HDFS on an EMR cluster. I'm trying to reduce the execution time of my job. Looking at the logs, the map input of the job is 1_000_

Airflow 2 XCom in Task Groups

I have two tasks inside a TaskGroup that need to pull xcom values to supply the job_flow_id and step_id. Here's the code: with TaskGroup('execute_my_steps') a

mount_workspace_dir notebook magic not working in EMR Studio

In an EMR Studio Python3 notebook, I execute the following: %mount_workspace_dir . And receive the following error: UsageError: Line magic function `%mount_wor

How to use Java runtime 11 in an AWS EMR cluster

I'm creating a cluster in AWS EMR, and when Spark runs my application I'm getting the error below: Exception in thread "main" java.lang.UnsupportedClassVersionError:

AWS EMRFS S3 Ranger plugin error for Amazon S3

I am trying to integrate AWS EMR with Apache Ranger. Of the three plugins (Hive, Spark, and S3), the Hive and Spark plugins are working, but I am getting an error while trying

Delta Table / Athena And Spark

I have a Delta table, which can be read from Athena. When I try to get the data through a query from Spark I get the following error: Caused by: org.apache.sp

Spark SQL error from EMR notebook with AWS Glue table partition

I'm testing some PySpark code in an EMR notebook before I deploy it and keep running into this strange error with Spark SQL. I have all my tables and metadata i

AWS EMR s3a filesystem not found

I am running an EMR instance. It was working fine, but it suddenly started giving the error below when I try to access S3 files from a Python Spark script: py4

AWS EMR: Enable auto-termination policy in CloudFormation

I am trying to enable an auto-termination policy in EMR. Here is the documentation https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-auto-termination-poli

Spark Catalog w/ AWS Glue: database not found

I've created an EMR cluster with the Glue Data Catalog. When I invoke the spark-shell, I am able to successfully list tables stored within a Glue database via s

Missing Jupyter Notebook Kernels in VSCode

I have multiple people working on the same AWS EMR cluster to run some Spark jobs. This is being done through Jupyter Notebooks which are created/modified usin
