I'm trying to run TPC-DS on RAPIDS on a single-node EMR cluster using this guide: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-rapids.html But I'm getting res…
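For context, the RAPIDS accelerator is switched on through Spark configuration; a minimal sketch of the relevant settings (the linked EMR guide has the authoritative list, and the GPU count here is an assumption):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.plugins", "com.nvidia.spark.SQLPlugin")  # enables the RAPIDS SQL plugin
    .config("spark.rapids.sql.enabled", "true")
    .config("spark.executor.resource.gpu.amount", "1")      # one GPU per executor (assumption)
    .getOrCreate()
)
```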
In my project, I have the following workflow: Kafka message => Spark Streaming/processing => insert/update to HBase and/or Phoenix. Both the Insert and Update…
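A common shape for this pipeline is Structured Streaming with foreachBatch, since Phoenix UPSERT makes inserts and updates the same operation; a sketch assuming the phoenix-spark connector is on the classpath, with broker, topic, table, and ZooKeeper quorum as placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-to-phoenix").getOrCreate()

stream = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # hypothetical broker
    .option("subscribe", "events")                     # hypothetical topic
    .load()
)

def upsert_batch(batch_df, batch_id):
    # Phoenix UPSERT semantics cover both insert and update;
    # the payload-to-columns mapping here is a placeholder.
    (batch_df.selectExpr("CAST(value AS STRING) AS payload")
     .write
     .format("org.apache.phoenix.spark")
     .option("table", "EVENTS")        # hypothetical Phoenix table
     .option("zkUrl", "zk-host:2181")  # hypothetical ZooKeeper quorum
     .mode("overwrite")                # required by the connector; still upserts
     .save())

query = stream.writeStream.foreachBatch(upsert_batch).start()
```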
I am using spring-cloud-starter-aws-secrets-manager-config 2.3.3 for a Spring Boot application, which works perfectly locally when pointing to the stage environment…
Does anyone know the release date for Apache Spark 3.3? We have a Log4j vulnerability reported in Apache Spark 3.2.1 and want to see if the next patch is…
I have a dataset with 4 sensor values: 'volt', 'pressure', 'rotate', and 'vibration'. For these sensor values I am calculating the rolling mean and rolling standard deviation…
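A minimal sketch of the rolling statistics with a window function; the partition and ordering columns ("machineID", "datetime") and the 3-row window are assumptions:

```python
from pyspark.sql import Window, functions as F

sensor_cols = ["volt", "pressure", "rotate", "vibration"]

# Rolling window over the current row and the 2 preceding rows,
# per machine, ordered by time (column names are assumptions).
w = Window.partitionBy("machineID").orderBy("datetime").rowsBetween(-2, 0)

for c in sensor_cols:
    df = (df.withColumn(f"{c}_rollmean", F.avg(c).over(w))
            .withColumn(f"{c}_rollstd", F.stddev(c).over(w)))
```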
I am reading a MariaDB table from Spark which has date and datetime fields. Spark throws an error while reading. Below is the schema of the MariaDB table: Spar…
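A frequent culprit with MariaDB date columns is zero dates ('0000-00-00'), which the JDBC driver cannot map to java.sql.Date; assuming that is the failure here, a sketch passing zeroDateTimeBehavior in the JDBC URL (host, database, and table names are placeholders):

```python
df = (
    spark.read.format("jdbc")
    # convertToNull turns '0000-00-00' values into NULLs instead of errors
    .option("url", "jdbc:mariadb://db-host:3306/mydb?zeroDateTimeBehavior=convertToNull")
    .option("driver", "org.mariadb.jdbc.Driver")
    .option("dbtable", "my_table")
    .option("user", "...")
    .option("password", "...")
    .load()
)
```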
Say you have something like this:

```python
big_table1 = spark.table('db.big_table1').cache()
big_table2 = spark.table('db.big_table2').cache()
big_table2 = spark.table('…
```
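Worth noting about the snippet above: .cache() is lazy, and rebinding a Python name does not evict the previously cached DataFrame. A minimal sketch of explicit materialization and eviction:

```python
big_table2 = spark.table("db.big_table2").cache()
big_table2.count()   # .cache() is lazy; an action materializes it in memory

# Rebinding the name does not unpersist the old cached plan; do it explicitly
big_table2.unpersist()
big_table2 = spark.table("db.big_table3").cache()  # hypothetical table name
```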
I am trying to build a cube in Kylin. It successfully completes the first step and then just keeps running, always at 50%. Log from step 2: 2022-05-13 13:54:40,640…
I have an HDFS directory as below: /user/staging/app_name/2022_05_06 Under this directory I have around 1000 part files. I want to loop over each of the part files…
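A minimal sketch of looping over the part files with the Hadoop FileSystem API through PySpark's JVM gateway; reading each part as Parquet is an assumption, since the file format isn't stated:

```python
jvm = spark.sparkContext._jvm
conf = spark.sparkContext._jsc.hadoopConfiguration()
fs = jvm.org.apache.hadoop.fs.FileSystem.get(conf)

dir_path = jvm.org.apache.hadoop.fs.Path("/user/staging/app_name/2022_05_06")
for status in fs.listStatus(dir_path):
    name = status.getPath().getName()
    if not name.startswith("part-"):
        continue  # skip _SUCCESS and similar marker files
    part_df = spark.read.parquet(status.getPath().toString())  # format is an assumption
    # ... process each part file individually here
```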
I was trying to read data using the MongoDB Spark connector and want to partition the dataset on a key, reading from a MongoDB standalone instance. I was looking at t…
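For the 3.x mongo-spark-connector, partitioning is controlled through the read configuration; a hedged sketch, with the URI, partition key, and partition size as assumptions:

```python
df = (
    spark.read.format("mongo")  # short name registered by mongo-spark-connector 3.x
    .option("spark.mongodb.input.uri", "mongodb://localhost:27017/mydb.mycoll")  # hypothetical
    .option("spark.mongodb.input.partitioner", "MongoSamplePartitioner")
    .option("spark.mongodb.input.partitionerOptions.partitionKey", "customerId")  # hypothetical key
    .option("spark.mongodb.input.partitionerOptions.partitionSizeMB", "64")
    .load()
)
```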
I am trying to create a Scala UDF for Spark that can be used in Spark SQL. The objective of the function is to accept any column type as input and put it in a…
I can't believe I have to ask this here, but there seems to be no documentation on what the HWC (Hive Warehouse Connector) actually does. All I can find is that it returns a boolean: publi…
Spark: 3.0.0, Scala: 2.12.8. My data frame has a column with a JSON string, and I want to create a new column from it with a StructType. |temp_json_string…
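The question is in Scala, but from_json has the same shape in both APIs; a minimal PySpark sketch with a hypothetical schema (schema_of_json can infer one from a sample string instead):

```python
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType

# Hypothetical schema for the JSON payload
schema = StructType([
    StructField("field1", StringType()),
    StructField("field2", StringType()),
])

# Parse the JSON string column into a struct column
df = df.withColumn("temp_json", F.from_json(F.col("temp_json_string"), schema))
```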
My data is stored in S3 (Parquet format) under different paths, and I'm using spark.read.parquet(pathes:_*) in order to read all the paths into one dataframe. Un…
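For reference, the Python spelling of that Scala varargs call, with hypothetical paths:

```python
paths = ["s3://bucket/path_a/", "s3://bucket/path_b/"]  # hypothetical paths
df = spark.read.parquet(*paths)  # equivalent of parquet(pathes: _*) in Scala
```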
Is there a canonical way to compute the weighted average in PySpark, ignoring missing values in the denominator sum? Take the following example: # create data da…
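One way to get the desired behavior is to null out the weight wherever the value is null, so those rows drop out of both sums; a minimal sketch, with "value" and "weight" as assumed column names:

```python
from pyspark.sql import functions as F

# Rows with a null value contribute to neither the numerator nor the
# denominator, so missing values don't inflate the weight sum.
weighted_avg = (
    F.sum(F.when(F.col("value").isNotNull(), F.col("value") * F.col("weight")))
    / F.sum(F.when(F.col("value").isNotNull(), F.col("weight")))
)

df.agg(weighted_avg.alias("weighted_avg")).show()
```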
I am running a delete query with < (less than) and > (greater than) conditions on the timestamp field, but we are not getting the desired results. Fir…
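One frequent cause is comparing a timestamp column against a bare date string, which is interpreted as midnight of that day, so rows later in the boundary day are silently included or excluded. A sketch with explicit timestamp literals, assuming Spark SQL against a Delta table (DELETE is only supported on such table formats); names and bounds are hypothetical:

```python
spark.sql("""
    DELETE FROM events                               -- hypothetical Delta table
    WHERE event_ts > timestamp'2022-05-01 00:00:00'  -- explicit bounds instead of
      AND event_ts < timestamp'2022-05-02 00:00:00'  -- bare date strings
""")
```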
I have been trying to set up Spark to use with the pyspark library. I installed the JDK, Hadoop, and Spark, and provided the environment variables correctly.
There are some solutions here: Windows Spark Error java.lang.NoClassDefFoundError: Could not initialize class org.apache.spark.storage.StorageUtils. The mentioned…
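This error is commonly reported when running Spark on Java 17+, where the sun.nio.ch package is no longer exported to unnamed modules; a hedged workaround sketch that opens it via JVM options, set before the first session starts:

```python
from pyspark.sql import SparkSession

# Assumption: the failure comes from Java 17+ module encapsulation
java_opts = "--add-exports=java.base/sun.nio.ch=ALL-UNNAMED"

spark = (
    SparkSession.builder
    .appName("storageutils-workaround")
    .config("spark.driver.extraJavaOptions", java_opts)
    .config("spark.executor.extraJavaOptions", java_opts)
    .getOrCreate()
)
```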
In joining two tables, I would like to select all columns except 2 of them from a large table with many columns, using PySpark SQL on Databricks. My PySpark SQL: %…
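A simple PySpark route is to join first and drop the two unwanted columns afterwards; the DataFrame, key, and column names below are hypothetical:

```python
result = (
    large_df.join(small_df, on="id", how="inner")  # hypothetical join key
    .drop("unwanted_col1", "unwanted_col2")        # hypothetical columns to exclude
)
```

Newer Databricks runtimes also accept SELECT * EXCEPT (col1, col2) directly in SQL, which avoids listing the remaining columns by hand.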
I've installed Apache Spark on my Mac with 16 GB of RAM to test my PySpark code locally with small data sets before I test it on a real cluster. I've installed…
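A minimal local session sketch for that setup; the memory figure is a guess for a 16 GB machine:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[*]")                   # use all local cores
    .config("spark.driver.memory", "8g")  # assumption; must be set before the JVM starts
    .appName("local-pyspark-testing")
    .getOrCreate()
)
```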