Category "apache-spark"

HBase | HBase column qualifier hidden using HBase shell commands but visible via hbaserdd Spark code

I am stuck in a very odd situation related to HBase design, I would say. HBase version: 2.1.0-cdh6.2.1. So, the problem statement is, in HBase, w

PySpark pipe OSError: [WinError 87] The parameter is incorrect

I have installed Spark 3.0.0 on a 64-bit Windows machine with Python 3.9.7, using an Anaconda base environment. I'm trying to execute the following code in the pyspar
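
RDD.pipe() shells out to an external command, so on Windows the command has to resolve to a real executable. A minimal sketch of the call the question is about (the findstr command is used purely as an illustrative Windows built-in):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pipe-demo").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(["hello", "world"])

# RDD.pipe() writes each element to the external command's stdin,
# one per line, and yields the command's stdout lines. On Windows
# the command must resolve to a real executable on PATH; findstr
# (a Windows built-in .exe) is used here purely for illustration.
piped = rdd.pipe("findstr /r .")
print(piped.collect())
```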

Flatten nested JSON string column into tabular format

I am currently trying to flatten data in a Databricks table. Since some of the columns are deeply nested and of 'String' type, I couldn't use explode f
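
When the nesting lives inside a String column, explode only helps after the string is parsed. A minimal sketch of the usual approach, assuming a made-up payload and schema:

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("flatten-json").getOrCreate()

# Hypothetical stand-in for the Databricks table; the JSON sits in
# a plain String column.
df = spark.createDataFrame(
    [('{"user": {"id": 1, "tags": ["a", "b"]}}',)], ["payload"]
)

# Parse the string against an explicit DDL schema (schema_of_json
# can infer one from a sample row), then pull nested fields up and
# explode the array.
schema = "struct<user: struct<id: int, tags: array<string>>>"
parsed = df.withColumn("j", F.from_json("payload", schema))

parsed.select(
    F.col("j.user.id").alias("user_id"),
    F.explode("j.user.tags").alias("tag"),
).show()
```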

Why is the Spark bucket number not equal to the number of files in the partition?

val spark = SparkSession.builder().appName("Spark SQL basic example").config("spark.master", "local").getOrCreate() import spark.implicits._ case class Someth
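
Bucketing fixes the number of buckets, not the number of output files: each write task can emit one file per bucket it touches, so the file count is roughly tasks × buckets. A PySpark rendering of the idea (the question's code is Scala; table and column names here are made up):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("buckets").getOrCreate()

df = spark.range(1_000_000).withColumnRenamed("id", "key")

# Each write task may produce one file per bucket, so the file
# count is (tasks x buckets), not just the bucket count.
# Repartitioning by the bucket column first collapses that to
# roughly one file per bucket.
(df.repartition(4, "key")
   .write.bucketBy(4, "key")
   .sortBy("key")
   .mode("overwrite")
   .saveAsTable("bucketed_demo"))
```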

regexp_extract_all not working with Spark SQL

I'm using a Databricks notebook to extract all field occurrences from a text column using the regexp_extract_all function. Here is the input: field_map#'IFDSIMP.7
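
Worth noting that regexp_extract_all only exists in Spark SQL from 3.1 onward; on older runtimes it simply does not resolve. A small sketch of calling it from PySpark via expr(), with an illustrative pattern and sample value:

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("regexp-demo").getOrCreate()

# Illustrative stand-in for the text column in the question.
df = spark.createDataFrame([("field_map#'A.1'#'B.2'",)], ["txt"])

# The third argument selects which capture group to return.
df.select(
    F.expr("""regexp_extract_all(txt, "'([^']+)'", 1)""").alias("fields")
).show(truncate=False)
```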

ScalaTest: how to assert a lengthy exception message securely and cleanly without hardcoding?

I have the following code, which is used to (SHA) hash columns in a Spark dataframe: import org.apache.spark.sql.DataFrame import org.apache.spark.sql.functions
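
The question is about ScalaTest, but the underlying pattern is language-neutral: assert on a stable fragment or regex of the message instead of hardcoding the whole string. The same idea rendered in Python/pytest, purely for illustration, with a hypothetical helper standing in for the Spark hashing code:

```python
import pytest

def hash_columns(df_columns, cols):
    # Hypothetical stand-in for the Spark hashing helper in the
    # question; it exists only to raise a long, detailed message.
    missing = [c for c in cols if c not in df_columns]
    if missing:
        raise ValueError(
            f"Cannot hash columns {missing}: not present among {sorted(df_columns)}"
        )

def test_error_mentions_missing_column():
    # Assert on a stable fragment via regex instead of hardcoding
    # the entire lengthy message (pytest.raises matches with re.search).
    with pytest.raises(ValueError, match=r"Cannot hash columns \['x'\]"):
        hash_columns(["a", "b"], ["x"])
```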

Can two instances of the same Spark Streaming job conflict?

I want to run the same Java Spark Streaming job (10-second micro-batches) as two instances (sparkStr1 and sparkStr2). Mainly, they consume the same Kafka topic (3
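
Whether the two instances conflict mostly comes down to the Kafka consumer group and to state/checkpoint paths: each instance needs its own. The question is about the Java DStream API; a sketch of the Structured Streaming analogue in Python, with illustrative broker, topic, and paths:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sparkStr1").getOrCreate()

# Two instances reading the same topic do not conflict as long as
# each keeps its own checkpoint location (and, for DStream-style
# consumers, its own group.id); otherwise they fight over offsets
# and state.
df = (spark.readStream.format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "events")
      .option("startingOffsets", "latest")
      .load())

query = (df.selectExpr("CAST(value AS STRING) AS value")
         .writeStream.format("console")
         .option("checkpointLocation", "/tmp/checkpoints/sparkStr1")
         .start())
query.awaitTermination()
```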

PySpark Py4JJavaError while creating a Delta table

Here is the PySpark code, which is running in a Jupyter notebook. import pyspark from delta import * builder = pyspark.sql.SparkSession.builder.appName("MyApp") \
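
For reference, the documented delta-spark quickstart wiring looks like the sketch below; a common cause of Py4JJavaError at table-creation time is skipping configure_spark_with_delta_pip or mixing incompatible Spark/Delta versions:

```python
import pyspark
from delta import configure_spark_with_delta_pip

builder = (pyspark.sql.SparkSession.builder.appName("MyApp")
    .config("spark.sql.extensions",
            "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog"))

# configure_spark_with_delta_pip adds the matching Delta jars to
# the session; without it (or with a Spark/Delta version mismatch)
# table creation tends to fail with a Py4JJavaError.
spark = configure_spark_with_delta_pip(builder).getOrCreate()

spark.range(5).write.format("delta").mode("overwrite").save("/tmp/delta-demo")
```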

Accessing the current sliding window in foreachBatch in Spark Structured Streaming

I am using foreachBatch() in Spark Structured Streaming to manually maintain a sliding window consisting of the last 200000 entries. With every microbatch I re
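
foreachBatch() only hands the callback the current micro-batch, so any "last N rows" window has to be kept by your own code. A minimal driver-side sketch, assuming the rows are small enough to collect and that the callback runs on the driver (as it does in PySpark); names are illustrative:

```python
from collections import deque

# Driver-side buffer for the most recent 200000 rows.
WINDOW = deque(maxlen=200_000)

def process_batch(batch_df, batch_id):
    # Append this micro-batch; deque(maxlen=...) drops the oldest
    # rows automatically, which is what makes the window slide.
    WINDOW.extend(batch_df.collect())
    print(f"batch {batch_id}: window now holds {len(WINDOW)} rows")

# Hooked up to a streaming DataFrame (not defined here):
# query = stream_df.writeStream.foreachBatch(process_batch).start()
```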

Apache Beam: run Docker in pipeline

The Apache Beam pipeline (Python) I'm currently working on contains a transformation which runs a Docker container. While that works well during local testing w
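
One way to run a container per element is to shell out from a DoFn; this works locally but only on remote runners whose workers actually expose a Docker daemon. A hedged sketch (image name illustrative):

```python
import subprocess
import apache_beam as beam

class RunInDocker(beam.DoFn):
    # Shells out to the docker CLI once per element. This works only
    # where the worker host has a usable Docker daemon; on managed
    # runners (e.g. Dataflow) that is not guaranteed.
    def process(self, element):
        result = subprocess.run(
            ["docker", "run", "--rm", "alpine", "echo", str(element)],
            capture_output=True, text=True, check=True,
        )
        yield result.stdout.strip()

with beam.Pipeline() as p:
    (p
     | beam.Create(["a", "b"])
     | beam.ParDo(RunInDocker())
     | beam.Map(print))
```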

Spark read CSV option to escape the delimiter

I have an input CSV like the one below and need to escape the delimiter within one of the columns (the 2nd column): f1|f2|f3 v1|v2\|2|v3 x1|x2\|2|x3 spark.read.option(
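
Spark's CSV escape option applies to the quote character, not the separator, so backslash-escaped delimiters like v2\|2 won't parse with read.csv alone. One common workaround, sketched with an illustrative path: read the file as plain text and split on pipes not preceded by a backslash.

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("csv-escape").getOrCreate()

# Read as text and drop the header line (path is illustrative).
raw = spark.read.text("/tmp/input.csv").filter(F.col("value") != "f1|f2|f3")

# Split on pipes that are NOT preceded by a backslash, then
# unescape the surviving \| sequences in the affected column.
parts = F.split(F.col("value"), r"(?<!\\)\|")

df = raw.select(
    parts.getItem(0).alias("f1"),
    F.regexp_replace(parts.getItem(1), r"\\\|", "|").alias("f2"),
    parts.getItem(2).alias("f3"),
)
df.show()
```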

How to use dbt seed properly with dbt-spark[PyHive] running on EMR?

Problem: I am trying to implement a new process using dbt seeds. When I use it with a Redshift connection there is no problem, but when I try to use it with dbt-sp

Sharing an Oracle table among Spark nodes using Python

I have a huge Oracle table to process, so I define a list of where clauses, one to be read by each Spark node. In the middle of the processing I need to join the data
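
The where-clause list maps directly onto the JDBC reader's predicates argument: each predicate becomes one partition, and the resulting DataFrame is then shared cluster-wide like any other. A sketch with placeholder URL, credentials, and ranges:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("oracle-split").getOrCreate()

# Each predicate becomes one partition, so this list decides which
# slice each executor reads. The Oracle JDBC driver jar must be on
# the classpath; connection details here are placeholders.
predicates = [
    "ID BETWEEN 1 AND 1000000",
    "ID BETWEEN 1000001 AND 2000000",
    "ID BETWEEN 2000001 AND 3000000",
]

df = spark.read.jdbc(
    url="jdbc:oracle:thin:@//dbhost:1521/SERVICE",
    table="BIG_TABLE",
    predicates=predicates,
    properties={"user": "scott", "password": "tiger",
                "driver": "oracle.jdbc.OracleDriver"},
)

# Once loaded, the DataFrame can be cached and joined like any other.
df.cache()
```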

Import a custom UDF from a JAR into Spark

I am using a Jupyter notebook to run Spark. My problem arises when I try to register a UDF from my custom imported JAR. This is how I create the UDF in
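
From PySpark, a JVM UDF shipped in a jar can be registered with spark.udf.registerJavaFunction once the jar is on the session classpath. A sketch with a hypothetical jar path and class name:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType

# The jar must be on the driver/executor classpath, e.g. via
# --jars or spark.jars when the session is created (path illustrative).
spark = (SparkSession.builder.appName("jar-udf")
         .config("spark.jars", "/path/to/my-udfs.jar")
         .getOrCreate())

# Registers a JVM implementation of org.apache.spark.sql.api.java.UDF1
# under a SQL-callable name; the class name is a placeholder.
spark.udf.registerJavaFunction("my_udf", "com.example.MyUdf", StringType())

spark.sql("SELECT my_udf('hello')").show()
```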

Why is my Spark mapPartition function being slowed down?

My algorithm is simple: I am using Spark to distribute a process that runs cross-validation in Python. I have 3 workers, and all I do is assi
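
With one partition per worker, mapPartitions runs only as fast as its slowest partition, and repeating expensive setup per record instead of per partition is another common drag. A hedged sketch of the intended layout, with placeholder CV logic:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cv-split").getOrCreate()
sc = spark.sparkContext

def run_cv(partition):
    # Do expensive setup once per partition, not once per record;
    # re-creating models inside the inner loop is a common slowdown.
    rows = list(partition)
    # ... fit/score on `rows` here (placeholder for the CV logic) ...
    yield ("partition_size", len(rows))

# One partition per worker: with 3 workers, the slowest (most
# skewed) partition gates the whole job, which can look like
# mapPartitions itself "being slowed".
work = sc.parallelize(range(30), numSlices=3)
print(work.mapPartitions(run_cv).collect())
```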

Spark on RAPIDS single node

I'm trying to run TPC-DS on RAPIDS on a single EMR node, using this guide: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-rapids.html But getting res
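
For reference, the RAPIDS Accelerator hangs off two Spark confs; on EMR these are normally set via spark-defaults or spark-submit --conf rather than in code. A minimal in-code sketch of that wiring:

```python
from pyspark.sql import SparkSession

# Minimal RAPIDS wiring: load the plugin and turn SQL acceleration
# on. Requires the rapids-4-spark jar and a GPU on the node.
spark = (SparkSession.builder.appName("tpcds-rapids")
         .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
         .config("spark.rapids.sql.enabled", "true")
         .getOrCreate())

spark.range(10).selectExpr("sum(id)").show()
```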

Receive a Kafka message through Spark Streaming and delete Phoenix/HBase data

In my project, I have the current workflow: Kafka message => Spark Streaming/processing => Insert/Update to HBase and/or Phoenix Both the Insert and Updat

Unable to fetch secrets using an Instance Profile from Databricks for a Spring Boot application

I am using spring-cloud-starter-aws-secrets-manager-config 2.3.3 for a Spring Boot application, which works perfectly on my local machine pointing to the stage environment

Release date for Apache Spark 3.3

Does anyone know the release date for Apache Spark 3.3? We have a Log4j vulnerability reported in Apache Spark 3.2.1 and want to see if the next patch is

Standard deviation coming out as NaN in PySpark rolling window

I have a dataset with 4 sensor values: 'volt', 'pressure', 'rotate' and 'vibration'. For these sensor values I am calculating the rolling mean and rolling standard
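
Sample standard deviation needs at least two rows, so the first row of every rolling window produces NaN (Spark <= 3.0) or NULL (Spark >= 3.1 by default). A small sketch of the usual guard, with toy data standing in for the sensor readings:

```python
from pyspark.sql import SparkSession, Window
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("rolling-std").getOrCreate()

# Toy stand-in for the sensor data; real code would also
# partition the window by machine id.
df = spark.createDataFrame(
    [(1, 10.0), (2, 12.0), (3, 11.0), (4, 13.0)], ["ts", "volt"]
)

w = Window.orderBy("ts").rowsBetween(-2, 0)

# Guard both the NaN (older Spark) and NULL (newer Spark) cases
# before downstream math chokes on them.
df = df.withColumn(
    "volt_rstd",
    F.coalesce(F.nanvl(F.stddev("volt").over(w), F.lit(0.0)), F.lit(0.0)),
)
df.show()
```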