Category "apache-spark"

Spark on RAPIDS single node

I'm trying to run TPC-DS on RAPIDS on a single node on EMR using this guide: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-rapids.html, but getting res…

Receive a Kafka message through Spark Streaming and delete Phoenix/HBase data

In my project I have the following workflow: Kafka message => Spark Streaming/processing => insert/update to HBase and/or Phoenix. Both the insert and update…

Unable to fetch secrets using an Instance Profile from Databricks for a Spring Boot application

I am using spring-cloud-starter-aws-secrets-manager-config 2.3.3 for a Spring Boot application, which works perfectly locally when pointing to the stage environment…

Release date for Apache Spark 3.3

Does anyone know the release date for Apache Spark 3.3? We have a Log4j vulnerability reported in Apache Spark 3.2.1 and want to see if the next patch is…

Standard deviation coming out NaN in a PySpark rolling window

I have a dataset with 4 sensor values: 'volt', 'pressure', 'rotate' and 'vibration'. For these sensor values I am calculating the rolling mean and rolling standard…
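
A NaN here usually means the window contained a single row: the sample standard deviation of one value is undefined, and Spark tends to surface it as NaN rather than NULL. A minimal sketch, assuming hypothetical 'machine'/'step' columns alongside the 'volt' reading:

```python
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("m1", 1, 160.0), ("m1", 2, 162.5), ("m1", 3, 158.1)],
    ["machine", "step", "volt"],
)

# Rolling window over the current row and the two preceding rows.
w = Window.partitionBy("machine").orderBy("step").rowsBetween(-2, 0)

# The sample stddev of a one-row window is undefined and surfaces as NaN;
# nanvl (not coalesce, which only handles NULL) swaps the NaN for 0.0.
df = (df
      .withColumn("volt_roll_mean", F.avg("volt").over(w))
      .withColumn("volt_roll_std",
                  F.nanvl(F.stddev_samp("volt").over(w), F.lit(0.0))))
df.show()
```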

Error while reading date and datetime columns from MariaDB via Spark

I am reading a MariaDB table from Spark which has date and datetime fields. Spark throws an error while reading. Below is the schema of the MariaDB table: Spar…
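
Without the exact error, two common workarounds are worth sketching: have the driver null out zero dates, and override the column mapping with Spark's customSchema JDBC option. Hypothetical host/table/column names throughout (zeroDateTimeBehavior is a MySQL Connector/J URL parameter; check the MariaDB driver's equivalent):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# zeroDateTimeBehavior=convertToNull asks the JDBC driver to return NULL for
# invalid '0000-00-00' values, a common cause of date/datetime read failures.
df = (spark.read.format("jdbc")
      .option("url", "jdbc:mysql://db-host:3306/mydb"
                     "?zeroDateTimeBehavior=convertToNull")
      .option("dbtable", "my_table")
      .option("user", "app_user")
      .option("password", "secret")
      # Last resort: map the troublesome columns to plain strings and
      # parse them on the Spark side afterwards.
      .option("customSchema", "created_date STRING, updated_at STRING")
      .load())
df.printSchema()
```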

At what point should you force a cache in Spark when performing heavy transformations?

Say you have something like this: big_table1 = spark.table('db.big_table1').cache() big_table2 = spark.table('db.big_table2').cache() big_table2 = spark.table('…
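
As a rule of thumb, cache() only pays off when the same DataFrame feeds more than one action, and it is lazy, so a cheap action is needed to force materialization. A hedged sketch of that pattern, reusing the question's hypothetical db.big_table names (the join key and aggregation columns are made up):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Expensive join reused by two separate actions below: a caching candidate.
base = spark.table("db.big_table1").join(spark.table("db.big_table2"), "id")

base = base.cache()   # lazy: only marks the plan for caching
base.count()          # cheap action to force materialization once, up front

summary = base.groupBy("country").agg(F.sum("amount").alias("total"))
top = base.orderBy(F.desc("amount")).limit(100)
summary.show()
top.show()

base.unpersist()      # release executor memory once both branches are done
```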

Building a Cube in Apache Kylin hangs on the second step

I am trying to build a Cube in Kylin. It successfully completes the first step and then just keeps running, always at 50%. Log from step 2: 2022-05-13 13:54:40,640…

HDFS date partition directory loop

I have an HDFS directory as below: /user/staging/app_name/2022_05_06. Under this directory I have around 1000 part files. I want to loop over each of the part file…
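
If the goal is just to enumerate the part files from PySpark, the Hadoop FileSystem API is reachable through the session's JVM gateway (a common trick that relies on the private _jvm handle). A sketch against the path from the question:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Reach the Hadoop FileSystem API through the JVM gateway; no extra install.
jvm = spark.sparkContext._jvm
hconf = spark.sparkContext._jsc.hadoopConfiguration()
Path = jvm.org.apache.hadoop.fs.Path

date_dir = Path("/user/staging/app_name/2022_05_06")
fs = date_dir.getFileSystem(hconf)

for status in fs.listStatus(date_dir):
    part_file = status.getPath().toString()
    print(part_file)  # hand each part file to whatever per-file step follows
```

If the loop is only a means to read the data, note that pointing spark.read at the directory itself already picks up every part file in one go.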

Partitioning not working in MongoDB Spark read with the Java connector

I was trying to read data using the MongoDB Spark connector and want to partition the dataset on a key, reading from a MongoDB standalone instance. I was looking at t…

Scala Spark UDF function that takes input and puts it in an Array

I am trying to create a Scala UDF for Spark that can be used in Spark SQL. The objective of the function is to accept any column type as input and put it in a…

What does the HiveWarehouseConnector executeUpdate() function return?

I can't believe I have to ask this here, but there seems to be no documentation on what the HWC actually does. All I can find is that it returns a boolean: publi…

Spark: how to convert a JSON string to a struct column without a schema

Spark: 3.0.0, Scala: 2.12.8. My data frame has a column with a JSON string, and I want to create a new column from it with a StructType. |temp_json_string…
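
Since from_json normally wants an explicit schema, the usual trick without one is to let spark.read.json infer a StructType from the column's own contents first. A sketch with a made-up JSON payload:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [('{"a": 1, "b": {"c": "x"}}',)], ["temp_json_string"]
)

# Let spark.read.json infer a StructType from the strings themselves,
# then apply it with from_json to get a proper struct column.
inferred = spark.read.json(
    df.select("temp_json_string").rdd.map(lambda row: row[0])
).schema

parsed = df.withColumn("temp_json", F.from_json("temp_json_string", inferred))
parsed.printSchema()
```

If every row shares one shape, F.schema_of_json on a single literal sample is a lighter alternative to the full inference pass.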

Is there any way to read multiple Parquet paths from S3 in parallel using Spark?

My data is stored in S3 (Parquet format) under different paths, and I'm using spark.read.parquet(pathes:_*) in order to read all the paths into one dataframe. Un…
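
If the slow part is the eager per-path file listing and schema inference rather than the scan itself, one common workaround is to submit the reads from a thread pool (job submission on a single SparkSession is thread-safe) and union the results. A sketch with made-up S3 prefixes:

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce
from pyspark.sql import DataFrame, SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical prefixes standing in for the question's `pathes`.
paths = ["s3://bucket/data/p1/", "s3://bucket/data/p2/", "s3://bucket/data/p3/"]

# Each thread triggers its path's listing/schema inference concurrently.
with ThreadPoolExecutor(max_workers=8) as pool:
    dfs = list(pool.map(lambda p: spark.read.parquet(p), paths))

df = reduce(DataFrame.unionByName, dfs)
```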

(py)spark weighted average taking missing values into account

Is there a canonical way to compute the weighted average in PySpark, ignoring missing values in the denominator sum? Take the following example: # create data da…
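
One way to keep numerator and denominator consistent is to mask the weight wherever the value is missing, since sum() already skips NULLs. A small sketch with hypothetical column names:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("a", 1.0, 2.0), ("a", None, 3.0), ("a", 3.0, 5.0)],
    ["key", "x", "w"],
)

# Null out the weight when the value is missing, so the denominator only
# counts weights that actually contributed to the numerator.
w_eff = F.when(F.col("x").isNotNull(), F.col("w"))

df.groupBy("key").agg(
    (F.sum(F.col("x") * F.col("w")) / F.sum(w_eff)).alias("weighted_avg")
).show()
# numerator: 1.0*2.0 + 3.0*5.0 = 17.0; denominator: 2.0 + 5.0 = 7.0
```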

Facing an issue with a Spark SQL delete query based on a timestamp

I am running a delete query with < (less than) and > (greater than) conditions on the timestamp field, but we are not getting the desired results. Fir…
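
Without the full query, one frequent culprit is comparing the timestamp column to a bare string with mismatched formatting; explicit TIMESTAMP literals remove that ambiguity. A sketch with hypothetical table/column names, assuming a Delta table (plain Parquet/Hive tables don't support DELETE):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Explicit TIMESTAMP literals avoid string-vs-timestamp comparison surprises.
spark.sql("""
    DELETE FROM events
    WHERE event_ts > TIMESTAMP '2022-05-01 00:00:00'
      AND event_ts < TIMESTAMP '2022-05-02 00:00:00'
""")
```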

Issue while running the spark-shell command on Windows

I have been trying to set up Spark to use it with the PySpark library. I installed the JDK, Hadoop and Spark, and set the environment variables correctly.

Java 17 solution for Spark - java.lang.NoClassDefFoundError: Could not initialize class org.apache.spark.storage.StorageUtils

There are some solutions here: Windows Spark Error java.lang.NoClassDefFoundError: Could not initialize class org.apache.spark.storage.StorageUtils. The mentioned…
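
For Spark builds that predate official Java 17 support (added in Spark 3.3), the commonly cited workaround is opening the sun.nio.ch package that StorageUtils reflects into. A hedged sketch of wiring that flag in from PySpark; in client mode the driver option may need to go on the launcher command line instead of in code:

```python
from pyspark.sql import SparkSession

# Commonly cited Java 17 workaround; treat the exact flag as an assumption
# to verify against your Spark version's documentation.
java17_opts = "--add-exports=java.base/sun.nio.ch=ALL-UNNAMED"

spark = (SparkSession.builder
         .config("spark.driver.extraJavaOptions", java17_opts)
         .config("spark.executor.extraJavaOptions", java17_opts)
         .getOrCreate())
```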

How to select all columns except 2 of them from a large table in PySpark SQL?

In joining two tables, I would like to select all columns except 2 of them from a large table with many columns using PySpark SQL on Databricks. My PySpark SQL: %…
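
On the DataFrame side, drop() after the join avoids listing every column you want to keep. A sketch with hypothetical table and column names:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

big = spark.table("db.big_table")      # hypothetical wide table
small = spark.table("db.small_table")

# Keep everything from the join except the two unwanted columns.
result = big.join(small, "id").drop("unwanted_col1", "unwanted_col2")
```

Recent Databricks runtimes also accept a star-except form in SQL, along the lines of SELECT * EXCEPT (unwanted_col1, unwanted_col2) FROM ..., though the exact syntax is worth checking against your runtime's docs.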

Errors when running spark-submit on a local machine with Apache Spark (standalone, single node)

I've installed Apache Spark on my Mac with 16 GB of RAM to test my PySpark code locally with small data sets before I test it on a real cluster. I've installed…