Category "apache-spark"

Error while reading date and datetime columns from MariaDB via Spark

I am reading a MariaDB table from Spark which has date and datetime fields. Spark throws an error while reading. Below is the schema of the MariaDB table: Spar
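A frequent cause of this kind of failure is zero dates ('0000-00-00') in the MariaDB table, which the JDBC driver cannot map to java.sql.Date. A minimal sketch of the usual workaround, assuming the MySQL Connector/J driver (which supports the zeroDateTimeBehavior URL parameter) and hypothetical host, database, table and credential names:

```python
# Hedged sketch: read a MariaDB table over JDBC, converting zero dates to NULL.
# Assumes an existing SparkSession named `spark`; connection details are placeholders.
df = (
    spark.read.format("jdbc")
    .option("url",
            "jdbc:mysql://db-host:3306/mydb?zeroDateTimeBehavior=convertToNull")
    .option("driver", "com.mysql.cj.jdbc.Driver")  # assumes MySQL Connector/J on the classpath
    .option("dbtable", "my_table")
    .option("user", "user")
    .option("password", "password")
    .load()
)
df.printSchema()  # date/datetime columns should map to DateType/TimestampType
```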

At what point should you force a cache in Spark when performing heavy transformations?

Say you have something like this: big_table1 = spark.table('db.big_table1').cache() big_table2 = spark.table('db.big_table2').cache() big_table2 = spark.table('
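As a rough illustration of where a cache usually pays off: cache after an expensive transformation whose result is reused more than once, not on the raw source tables, and materialize it explicitly. A minimal sketch under that assumption (table and column names are hypothetical):

```python
# Hedged sketch: cache the reused intermediate result, not the raw inputs.
big1 = spark.table("db.big_table1")
big2 = spark.table("db.big_table2")

joined = (
    big1.join(big2, "id")        # heavy, wide transformation
        .groupBy("id")
        .count()
        .cache()                 # cache only the result that is reused below
)
joined.count()                   # force materialization once

out_a = joined.filter("count > 10")    # both branches now reuse the cached data
out_b = joined.filter("count <= 10")
```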

Building Cube in Apache Kylin hangs on the second step

I am trying to build a Cube in Kylin. It successfully completes the first step and then just keeps running, always at 50%. Log from step 2: 2022-05-13 13:54:40,640

HDFS Date partition directory loop

I have an HDFS directory as below: /user/staging/app_name/2022_05_06. Under this directory I have around 1000 part files. I want to loop over each of the part file
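One way to loop over the part files from PySpark, without shelling out to `hdfs dfs`, is to go through the Hadoop FileSystem API exposed on the JVM gateway. A sketch using the path from the question, assuming a running SparkSession named `spark` and that the parts are parquet (swap the reader for the actual format):

```python
# Hedged sketch: list and loop over part files in an HDFS date partition directory.
hadoop_conf = spark._jsc.hadoopConfiguration()
Path = spark._jvm.org.apache.hadoop.fs.Path
fs = spark._jvm.org.apache.hadoop.fs.FileSystem.get(hadoop_conf)

base = "/user/staging/app_name/2022_05_06"
for status in fs.listStatus(Path(base)):
    part_path = status.getPath().toString()
    if "part-" in part_path:
        # process one part file at a time
        part_df = spark.read.parquet(part_path)  # assumes parquet part files
        part_df.show(1)
```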

Partitioning not working in MongoDB Spark read with the Java connector

I was trying to read data using the MongoDB Spark connector, and want to partition the dataset on a key, reading from a MongoDB standalone instance. I was looking at t

Scala spark UDF function that takes input and puts it in an Array

I am trying to create a Scala UDF for Spark that can be used in Spark SQL. The objective of the function is to accept any column type as input, and put it in a

What does the HiveWarehouseConnector executeUpdate() function return?

I can't believe I have to ask this here but there seems to be no documentation on what the HWC actually does. All I can find is that it returns a boolean: publi

Spark: how to convert a JSON string to a struct column without a schema

Spark: 3.0.0 Scala: 2.12.8 My data frame has a column with a JSON string and I want to create a new column from it with the StructType. |temp_json_string
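A common approach when no schema is available up front is to infer one from a sample value with `schema_of_json` and feed the resulting DDL string to `from_json`. A minimal PySpark sketch, assuming the column is named `temp_json_string` and that all rows share the same JSON shape:

```python
from pyspark.sql import functions as F

# Hedged sketch: infer the struct schema from one sample row, then parse every row.
sample = df.select("temp_json_string").first()[0]            # one representative JSON string
ddl = df.select(F.schema_of_json(F.lit(sample))).first()[0]  # DDL-formatted schema string

parsed = df.withColumn("temp_json_struct",
                       F.from_json(F.col("temp_json_string"), ddl))
parsed.printSchema()
```

If a single sample row is not representative, an alternative is to infer the schema over all rows with `spark.read.json(df.rdd.map(lambda r: r.temp_json_string)).schema` and pass that to `from_json` instead.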

Is there any way to read multiple parquet paths from s3 in parallel using spark?

My data is stored in s3 (parquet format) under different paths and I'm using spark.read.parquet(pathes:_*) in order to read all the paths into one dataframe. Un
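The PySpark equivalent of the Scala varargs call is to unpack the list of paths, and file listing for many paths can itself be pushed into a distributed job via a Spark SQL setting. A sketch with hypothetical S3 paths, assuming `spark.sql.sources.parallelPartitionDiscovery.threshold` behaves this way on your Spark version:

```python
# Hedged sketch: read several S3 parquet prefixes into one dataframe.
paths = [
    "s3a://bucket/data/2022_05_01/",   # hypothetical example paths
    "s3a://bucket/data/2022_05_02/",
]

# Below this threshold Spark lists files on the driver; with more paths (or a
# lower threshold) the listing runs as a distributed Spark job instead.
spark.conf.set("spark.sql.sources.parallelPartitionDiscovery.threshold", "1")

df = spark.read.parquet(*paths)
```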

(py)spark weighted average taking account of missing values

Is there a canonical way to compute the weighted average in pyspark ignoring missing values in the denominator sum? Take the following example: # create data da
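One way to make the denominator ignore rows where the value is missing is to zero out the weight whenever the value is null; the numerator already ignores nulls because `sum` skips them. A minimal sketch with hypothetical column names `value` and `weight`:

```python
from pyspark.sql import functions as F

# Hedged sketch: weighted average that drops null values from both numerator and denominator.
w = F.when(F.col("value").isNotNull(), F.col("weight")).otherwise(F.lit(0.0))

result = df.agg(
    (F.sum(F.col("value") * F.col("weight")) / F.sum(w)).alias("weighted_avg")
)
result.show()
```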

Facing an issue with a Spark SQL delete query based on a timestamp

I am running the delete query with the < (less than) and > (greater than) condition on the timestamp field but am not getting the desired results. Fir
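A frequent cause of "no rows deleted" with timestamp comparisons is that the filter value ends up compared as a plain string rather than a timestamp. A hedged sketch, assuming a table format that supports SQL DELETE (for example a Delta table) and hypothetical table and column names:

```python
# Hedged sketch: delete rows in a time window using explicit timestamp literals.
spark.sql("""
    DELETE FROM my_db.events
    WHERE event_ts > timestamp '2022-05-01 00:00:00'
      AND event_ts < timestamp '2022-05-02 00:00:00'
""")
```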

Issue while running command spark-shell on windows

I have been trying to set up Spark to use it with the pyspark library. I installed JDK, Hadoop and Spark, and provided the environment variables correctly.

Java 17 solution for Spark - java.lang.NoClassDefFoundError: Could not initialize class org.apache.spark.storage.StorageUtils

There are some solutions here: Windows Spark Error java.lang.NoClassDefFoundError: Could not initialize class org.apache.spark.storage.StorageUtils. The mentioned
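The workaround usually cited for this error on newer JDKs is to export Spark's access to the internal `sun.nio.ch` package via JVM options; the exact set of `--add-opens`/`--add-exports` flags can vary by Spark version. A sketch of how that might be passed when building the session from PySpark in local mode:

```python
from pyspark.sql import SparkSession

# Hedged sketch: expose sun.nio.ch to Spark on Java 17 (single-node, local mode).
# When launching with spark-submit directly, the same flag can go in --driver-java-options.
java17_opts = "--add-exports=java.base/sun.nio.ch=ALL-UNNAMED"

spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("java17-test")
    .config("spark.driver.extraJavaOptions", java17_opts)
    .config("spark.executor.extraJavaOptions", java17_opts)
    .getOrCreate()
)
```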

How to select all columns except 2 of them from a large table in PySpark SQL?

When joining two tables, I would like to select all columns except 2 of them from a large table with many columns using PySpark SQL on Databricks. My PySpark SQL: %
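Since there is no general "select everything except" in plain Spark SQL, a common workaround is to do the join with the DataFrame API and either drop the unwanted columns or build the select list programmatically. A minimal sketch with hypothetical dataframe and column names:

```python
# Hedged sketch: keep every column from the join except two from the large table.
joined = large_df.join(small_df, "id")

# Option 1: drop the unwanted columns explicitly
result = joined.drop("unwanted_col1", "unwanted_col2")

# Option 2: build the select list programmatically
keep = [c for c in joined.columns if c not in {"unwanted_col1", "unwanted_col2"}]
result = joined.select(*keep)
```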

Errors when running spark-submit on a local machine with Apache Spark (stand alone, single node)

I've installed Apache Spark on my Mac with 16 GB of RAM to test my PySpark code locally with small data sets before I test it on a real cluster. I've installed

Spark writing extra rows when saving to CSV

I wrote a file to parquet containing 1,000,000 rows. When I read the parquet file back, the result is 1,000,000 rows. df = spark.read.parquet(parquet_path) df.
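A common reason for a CSV round trip producing extra rows is embedded newlines in string columns: parquet preserves them, but a CSV reader splits on them unless the values are quoted and the reader accepts multi-line records. A hedged sketch of write/read options that usually keeps the counts consistent (the path is a placeholder):

```python
# Hedged sketch: round-trip a dataframe through CSV when string columns may contain newlines.
(df.write
   .option("header", True)
   .option("quoteAll", True)     # quote every field so embedded newlines stay inside quotes
   .option("escape", '"')
   .mode("overwrite")
   .csv(csv_path))

df_back = (spark.read
           .option("header", True)
           .option("multiLine", True)   # allow records to span physical lines
           .option("escape", '"')
           .csv(csv_path))

print(df.count(), df_back.count())
```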

How to close the spark instance

I want to stop my Spark instance once I complete my job running in a Jupyter notebook. I did execute spark.stop() at the end, but when I open my terminal, I'
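Stopping the session only shuts down the Spark driver JVM; the Jupyter kernel's own process stays alive until the kernel is shut down, and any standalone master/worker daemons started separately keep running. A minimal sketch of an explicit shutdown at the end of a notebook:

```python
# Hedged sketch: stop the Spark JVM at the end of a notebook job.
spark.stop()   # tears down the SparkContext / driver JVM

# If the process still visible in the terminal is a standalone master or worker
# started with start-master.sh / start-worker.sh, it must be stopped separately
# (e.g. with sbin/stop-all.sh); spark.stop() does not affect it.
```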

How to find the number of Inserts and Updates of Merge command?

I have code similar to this in Spark (Scala). I would like to know the number of records this code updated/inserted when execute() is complete. Is there a way?
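If the merge in question is a Delta Lake MERGE, the insert/update counts are recorded in the table history's operationMetrics once execute() completes. A hedged sketch (in PySpark rather than Scala, assuming the delta-spark package and a hypothetical table path):

```python
from delta.tables import DeltaTable

# Hedged sketch: read the metrics of the most recent operation from the Delta history.
target = DeltaTable.forPath(spark, "/path/to/delta/table")  # hypothetical path

last_op = target.history(1).collect()[0]
metrics = last_op["operationMetrics"]

print(metrics.get("numTargetRowsInserted"),
      metrics.get("numTargetRowsUpdated"))
```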

Processing data from a kafka stream using Pyspark

What the console of the kafka consumer looks like: ["2017-12-31 16:06:01", 12472391, 1] ["2017-12-31 16:06:01", 12472097, 1] ["2017-12-31 16:05:59", 12471979,
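Since the values shown are JSON arrays, one way to process them is to read the Kafka value as a string and parse it with from_json against an array schema, casting the elements afterwards. A sketch with hypothetical broker and topic names:

```python
from pyspark.sql import functions as F

# Hedged sketch: parse kafka messages shaped like ["2017-12-31 16:06:01", 12472391, 1]
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "localhost:9092")  # hypothetical broker
       .option("subscribe", "events")                         # hypothetical topic
       .load())

parsed = (raw
          .select(F.from_json(F.col("value").cast("string"),
                              "array<string>").alias("fields"))
          .select(F.to_timestamp(F.col("fields")[0]).alias("event_time"),
                  F.col("fields")[1].cast("long").alias("id"),
                  F.col("fields")[2].cast("int").alias("flag")))

query = (parsed.writeStream
         .format("console")
         .outputMode("append")
         .start())
```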

Error writing a dataframe to a Cassandra table on Amazon Keyspaces

I'm trying to write a dataframe to AWS Keyspaces, but I'm getting the messages below: Stack: dfExploded.write.cassandraFormat(table = "table", keyspa
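Without the full stack trace this is only a guess, but a frequent stumbling block with Amazon Keyspaces is that writes are accepted only at LOCAL_QUORUM consistency, while the Spark Cassandra Connector defaults to a different level. A hedged sketch of a write with that setting (PySpark instead of the Scala cassandraFormat helper; keyspace and table names are hypothetical, and depending on the connector version the consistency setting may need to go in the SparkSession config instead):

```python
# Hedged sketch: write to Amazon Keyspaces via the Spark Cassandra Connector.
spark.conf.set("spark.cassandra.output.consistency.level", "LOCAL_QUORUM")

(dfExploded.write
    .format("org.apache.spark.sql.cassandra")
    .option("keyspace", "my_keyspace")   # hypothetical keyspace
    .option("table", "my_table")         # hypothetical table
    .mode("append")
    .save())
```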