Category "apache-spark"

ScalaTest: how to assert a lengthy exception message securely and cleanly without hardcoding?

I have the following code, which is used to (SHA) hash columns in a Spark DataFrame: import org.apache.spark.sql.DataFrame import org.apache.spark.sql.functions

Can two instances of the same Spark Streaming job conflict with each other?

I want to run the same Java Spark Streaming job (10-second micro-batches) as 2 instances (sparkStr1 and sparkStr2). Mainly, they consume the same Kafka topic (3

PySpark Py4JJavaError while creating a Delta table

Here is the PySpark code, which is running in a Jupyter notebook. import pyspark from delta import * builder = pyspark.sql.SparkSession.builder.appName("MyApp") \
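
A common cause of a Py4JJavaError at this point is that the Delta Lake jars were never attached to the session. A minimal sketch of the documented delta-spark setup, assuming the delta-spark package is installed via pip (the app name and output path are illustrative):

```python
import pyspark
from delta import configure_spark_with_delta_pip

builder = (
    pyspark.sql.SparkSession.builder.appName("MyApp")
    # register the Delta SQL extension and catalog
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)

# pulls in the delta-core jar matching the installed delta-spark package
spark = configure_spark_with_delta_pip(builder).getOrCreate()

spark.range(5).write.format("delta").mode("overwrite").save("/tmp/delta-table")
```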

Accessing the current sliding window in foreachBatch in Spark Structured Streaming

I am using foreachBatch() in Spark Structured Streaming to manually maintain a sliding window consisting of the last 200000 entries. With every micro-batch I re
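
For context, a minimal PySpark sketch of the foreachBatch() pattern being described, assuming a streaming DataFrame named stream_df and a driver-side accumulator; the event_time column and the buffering approach are assumptions, not the asker's actual code:

```python
from pyspark.sql import functions as F

WINDOW_SIZE = 200000
window_df = None  # hypothetical accumulator for the manually maintained window

def update_window(batch_df, batch_id):
    """Union the new micro-batch into the window and trim to the newest rows."""
    global window_df
    combined = batch_df if window_df is None else window_df.unionByName(batch_df)
    window_df = (combined
                 .orderBy(F.col("event_time").desc())   # hypothetical ordering column
                 .limit(WINDOW_SIZE)
                 .localCheckpoint())                     # materialize so old batch lineage can be dropped

query = stream_df.writeStream.foreachBatch(update_window).start()
```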

Apache Beam: run Docker in a pipeline

The Apache Beam pipeline (Python) I'm currently working on contains a transformation which runs a Docker container. While that works well during local testing w
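
For reference, one way this kind of transformation is often done locally is to shell out to Docker from a DoFn; a minimal sketch, assuming Docker is available on the worker and my-image is a hypothetical image name (managed runners such as Dataflow generally need a custom worker or container setup for this to work):

```python
import subprocess
import apache_beam as beam

class RunInContainer(beam.DoFn):
    """Runs one container per element and emits its stdout."""
    def process(self, element):
        result = subprocess.run(
            ["docker", "run", "--rm", "my-image", str(element)],
            capture_output=True, text=True, check=True,
        )
        yield result.stdout.strip()

with beam.Pipeline() as p:
    (p
     | beam.Create(["a", "b", "c"])
     | beam.ParDo(RunInContainer())
     | beam.Map(print))
```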

Spark read CSV option to escape the delimiter

I have an input CSV like the one below and need to escape the delimiter within one of the columns (the 2nd column): f1|f2|f3 v1|v2\|2|v3 x1|x2\|2|x3 spark.read.option(
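
Worth noting: Spark's CSV escape option is about escaping quote characters, not the delimiter itself, so for a backslash-escaped delimiter like v2\|2 a common workaround is to read the file as text and split on unescaped pipes. A sketch, with the path as a placeholder:

```python
from pyspark.sql import functions as F

raw = spark.read.text("path/to/input.csv")     # placeholder path
header = raw.first()["value"].split("|")        # ['f1', 'f2', 'f3']

# split on '|' only when it is not preceded by a backslash
parts = F.split(F.col("value"), r"(?<!\\)\|")

df = (raw.filter(F.col("value") != "|".join(header))   # drop the header line
         .select(*[parts.getItem(i).alias(name) for i, name in enumerate(header)])
         .withColumn("f2", F.regexp_replace("f2", r"\\\|", "|")))  # unescape the delimiter
```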

How to use dbt seed properly with dbt-spark[PyHive] running in EMR?

Problem I am trying to implement a new process using dbt seeds. When I use it in a Redshift connection there is no problem, but when I try to use it with dbt-sp

Sharing an Oracle table among Spark nodes using Python

I have a huge Oracle table to process, so I define a list of where clauses, one to be read by each Spark node. In the middle of the processing I need to join the data
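
For illustration, this where-clause-per-partition pattern maps onto DataFrameReader.jdbc with the predicates argument; a sketch with hypothetical connection details, table name, and clauses:

```python
predicates = [
    "region_id = 1",   # hypothetical where clauses, one Spark partition each
    "region_id = 2",
    "region_id = 3",
]

oracle_df = spark.read.jdbc(
    url="jdbc:oracle:thin:@//dbhost:1521/ORCLPDB",   # hypothetical URL
    table="APP.BIG_TABLE",                           # hypothetical table
    predicates=predicates,
    properties={
        "user": "app_user",
        "password": "secret",
        "driver": "oracle.jdbc.OracleDriver",
    },
)

# each predicate becomes one partition, so nodes read disjoint slices in parallel
print(oracle_df.rdd.getNumPartitions())
```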

Import a custom UDF from a jar into Spark

I am using a Jupyter notebook for running Spark. My problem arises when I am trying to register a UDF from my custom imported jar. This is how I create the UDF in
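
For reference, a minimal PySpark sketch of registering a Java/Scala UDF shipped in a jar; the jar path and the fully qualified class name are placeholders:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType

spark = (SparkSession.builder
         .appName("custom-udf")
         .config("spark.jars", "/path/to/my-udfs.jar")   # placeholder jar path
         .getOrCreate())

# placeholder class implementing a Spark Java UDF
spark.udf.registerJavaFunction("my_udf", "com.example.udfs.MyUdf", StringType())

spark.sql("SELECT my_udf('hello') AS result").show()
```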

Why is my Spark mapPartitions function being slowed down?

My algorithm is simple: I am using Spark to distribute a process that runs cross-validation in Python. I have 3 workers and all I do is assi
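
For context, a minimal sketch of the distribution pattern being described, with run_cv as a hypothetical stand-in for the local cross-validation routine; a common cause of slowdowns in this setup is skewed partitions, where one worker ends up holding most of the work:

```python
param_grid = [{"alpha": a} for a in (0.01, 0.1, 1.0)]   # hypothetical parameter sets

def run_cv(params):
    """Hypothetical stand-in for the local cross-validation routine."""
    return {"params": params, "score": 0.0}

# one partition per worker so each gets a comparable share of the work
rdd = spark.sparkContext.parallelize(param_grid, numSlices=3)

def run_partition(params_iter):
    for params in params_iter:
        yield run_cv(params)

results = rdd.mapPartitions(run_partition).collect()
```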

Spark on RAPIDS, single node

I'm trying to run TPC-DS on RAPIDS on a single EMR node using this guide: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-rapids.html But I'm getting res

Receive a Kafka message through Spark Streaming and delete Phoenix/HBase data

In my project, I have the current workflow: Kafka message => Spark Streaming/processing => Insert/Update to HBase and/or Phoenix Both the Insert and Updat

Unable to fetch secrets using an instance profile from Databricks for a Spring Boot application

I am using spring-cloud-starter-aws-secrets-manager-config 2.3.3 for a Spring Boot application, which works perfectly on my local machine pointing to the stage environment

Release date for Apache Spark 3.3

Does anyone know the release date for Apache Spark 3.3? We have a Log4j vulnerability reported in Apache Spark 3.2.1 and want to see if the next patch is

Standard deviation coming out as NaN in a PySpark rolling window

I have a dataset with 4 sensor values, 'volt', 'pressure', 'rotate' and 'vibration'. For these sensor values I am calculating rolling mean and rolling standard
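
For context, a sketch of the usual rolling-window calculation (the id and ordering columns are assumptions); note that F.stddev is the sample standard deviation, which comes out as NaN for a window containing a single row, so stddev_pop or a null guard is a common fix:

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# assumed columns; the window covers the current row and the 2 before it
w = Window.partitionBy("machine_id").orderBy("datetime").rowsBetween(-2, 0)

df = (df
      .withColumn("volt_rollingmean", F.avg("volt").over(w))
      .withColumn("volt_rollingstd", F.stddev("volt").over(w))          # NaN when the window has 1 row
      .withColumn("volt_rollingstd_pop", F.stddev_pop("volt").over(w)))  # 0.0 instead of NaN
```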

Error while reading date and datetime columns from MariaDB via Spark

I am reading a MariaDB table from Spark which has date and datetime fields. Spark is throwing an error while reading. Below is the schema of the MariaDB table: Spar
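
For reference, a sketch of a JDBC read where the problematic date/datetime columns are overridden with the customSchema option; the connection details, table, and column names are placeholders, not taken from the question:

```python
mariadb_df = (spark.read.format("jdbc")
    .option("url", "jdbc:mariadb://dbhost:3306/mydb")   # placeholder URL
    .option("dbtable", "events")                         # placeholder table
    .option("user", "app")
    .option("password", "secret")
    .option("driver", "org.mariadb.jdbc.Driver")
    # read the offending columns with an explicit type mapping
    .option("customSchema", "event_date DATE, created_at TIMESTAMP")  # placeholder columns
    .load())
```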

At what point should you force a cache in Spark when performing heavy transformations?

Say you have something like this: big_table1 = spark.table('db.big_table1').cache() big_table2 = spark.table('db.big_table2').cache() big_table2 = spark.table('
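
As a rough illustration of the usual guidance: cache the result of the expensive transformation that more than one downstream action reuses, and materialize it once; a sketch with placeholder table, key, and column names:

```python
big_table1 = spark.table("db.big_table1")
big_table2 = spark.table("db.big_table2")

# heavy, reused transformation: cache the *result*, not the raw inputs
joined = (big_table1.join(big_table2, "customer_id")   # placeholder key
                     .groupBy("region")                 # placeholder column
                     .count()
                     .cache())

joined.count()                              # materializes the cache once

report_a = joined.filter("count > 1000")    # both reuse the cached result
report_b = joined.orderBy("count")
```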

Building Cube in Apache Kylin hangs on the second step

I am trying to build a cube in Kylin. It successfully completes the first step and then just keeps running, always at 50%. Log from step 2: 2022-05-13 13:54:40,640

HDFS date-partitioned directory loop

I have an HDFS directory as below: /user/staging/app_name/2022_05_06 Under this directory I have around 1000 part files. I want to loop over each of the part file
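
For reference, a sketch of looping over the part files with the Hadoop FileSystem API exposed through the JVM gateway; the directory path comes from the question, while the per-file processing is a placeholder:

```python
hadoop_conf = spark._jsc.hadoopConfiguration()
Path = spark._jvm.org.apache.hadoop.fs.Path
fs = spark._jvm.org.apache.hadoop.fs.FileSystem.get(hadoop_conf)

dir_path = Path("/user/staging/app_name/2022_05_06")

for status in fs.listStatus(dir_path):
    part_file = status.getPath().toString()
    if "part-" not in part_file:
        continue
    # placeholder per-file processing
    df = spark.read.text(part_file)
    print(part_file, df.count())
```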

Partitioning not working in a MongoDB Spark read with the Java connector

I was trying to read data using the MongoDB Spark connector, and want to partition the dataset on a key, reading from a standalone mongod instance. I was looking at t