Category "apache-spark"

How to find the quantile of a row in a PySpark DataFrame?

I have the following PySpark dataframe and I want to find the percentile row-wise: value col_a col_b col_c row_a 5.0 0.0 11.0 row_b 3394.0 0
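One way to approach this is to collect each row's values into a sorted array and index into it. A minimal PySpark sketch, assuming hypothetical sample data shaped like the excerpt above:

    from pyspark.sql import SparkSession
    import pyspark.sql.functions as F

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical data matching the layout in the question
    df = spark.createDataFrame(
        [("row_a", 5.0, 0.0, 11.0), ("row_b", 3394.0, 0.0, 4543.0)],
        ["value", "col_a", "col_b", "col_c"],
    )

    cols = ["col_a", "col_b", "col_c"]

    # Sort the row's values into an array, then pick the element at the desired
    # quantile position (for 3 values, the median is the 2nd element, 1-based).
    result = (
        df.withColumn("row_values", F.array_sort(F.array(*[F.col(c) for c in cols])))
          .withColumn("row_median", F.element_at("row_values", (len(cols) + 1) // 2))
    )
    result.show(truncate=False)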

Azure Databricks Delta Table modifies the TIMESTAMP format while writing from Spark DataFrame

I am new to Azure Databricks. I am trying to write a DataFrame output to a Delta table which contains a TIMESTAMP column. But strangely it changes the TIMESTAMP pa

Creating a Spark application with VSCode - Synapse PySpark installation error - exit with non-zero 3221225477

I am on Windows and I am trying to follow this doc to create Spark applications with VSCode using a Synapse workspace. I can sign into Azure and set a default S

h2o-pysparkling-2.4 and Glue Jobs with: {"error":"TypeError: 'JavaPackage' object is not callable","errorType":"EXECUTION_FAILURE"}

I am trying to use pysparkling.ml.H2OMOJOModel to predict on a Spark dataframe using a MOJO model trained with h2o==3.32.0.2 in AWS Glue Jobs; however, I got the e

How can Parquet columns be skipped when reading from HDFS?

We all know Parquet is column-oriented, so we can read only the columns we want and reduce IO. But what if the Parquet file is stored in HDFS? Should we download t
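Column pruning happens inside the Parquet reader itself: when only some columns are selected, only those column chunks are fetched from HDFS rather than the whole file. A short sketch, with a hypothetical HDFS path:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical path; only the column chunks for the selected columns are
    # read from HDFS, not the entire file.
    df = spark.read.parquet("hdfs:///data/events.parquet").select("user_id", "event_time")

    df.explain()  # ReadSchema in the plan shows the pruned column set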

"Array dimensions exceeded supported range" when creating a dataframe

I'm getting this exception when trying to create a DataFrame with only 50 million rows. Any ideas how to avoid this problem? [Exception] [JvmBridge] Array dime

Upgrading from Spark 3.1 to Spark 3.2

In Spark 3.2, special datetime values such as epoch, today, yesterday, tomorrow, and now are supported in typed literals or in cast of foldable strings only, fo
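A small sketch of the behaviour change described in that migration note (column names and values are illustrative):

    from pyspark.sql import SparkSession
    import pyspark.sql.functions as F

    spark = SparkSession.builder.getOrCreate()

    # Still supported in 3.2: special values in typed literals or casts of foldable strings
    spark.sql("SELECT DATE 'today' AS d, TIMESTAMP 'now' AS ts").show()
    spark.sql("SELECT CAST('tomorrow' AS DATE) AS d").show()

    # No longer resolved in 3.2: casting a non-foldable string column
    df = spark.createDataFrame([("today",)], ["s"])
    df.select(F.col("s").cast("date")).show()  # returns NULL instead of the current date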

Alter multiple column comments simultaneously in spark/delta lake

Short version: Need a faster/better way to update many column comments at once in spark/databricks. I have a pyspark notebook that can do this sequentially acro
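For reference, the sequential baseline looks roughly like the sketch below (table name and comments are hypothetical); Delta's DDL only takes one column per statement, which is why this ends up as a loop:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical table and comment mapping
    table = "my_schema.my_table"
    comments = {
        "customer_id": "Natural key from the source system",
        "created_at": "Record creation timestamp (UTC)",
    }

    # One ALTER statement per column; single quotes in the comment are escaped
    for col, comment in comments.items():
        escaped = comment.replace("'", "\\'")
        spark.sql(f"ALTER TABLE {table} ALTER COLUMN {col} COMMENT '{escaped}'")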

Spark-Java: How to add an array column in a Spark DataFrame

I am trying to add a new column to my Spark DataFrame. The new column will be of a size based on a variable (say, salt), after which I will use that column to ex
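The question targets the Java API, but the same idea in PySpark form (with a hypothetical salt value) is a short sketch:

    from pyspark.sql import SparkSession
    import pyspark.sql.functions as F

    spark = SparkSession.builder.getOrCreate()

    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])
    salt = 4  # hypothetical size variable from the question

    # Add an array column with `salt` elements (0 .. salt-1), then explode it
    # so each original row is repeated once per salt value.
    salted = (
        df.withColumn("salt_arr", F.array(*[F.lit(i) for i in range(salt)]))
          .withColumn("salt", F.explode("salt_arr"))
          .drop("salt_arr")
    )
    salted.show()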

HBase | HBase column qualifier hidden using HBase shell commands but visible via HBase RDD Spark code

I am stuck in a very odd situation related to HBase design, I would say. HBase version: 2.1.0-cdh6.2.1. So, the problem statement is: in HBase, w

Pipe Pyspark OSError: [WinError 87] The parameter is incorrect

I have installed Spark 3.0.0 on a Windows 64-bit machine with Python 3.9.7 using an Anaconda base environment. I'm trying to execute the following code in the pyspar

Flatten a nested JSON string column into tabular format

I am currently trying to flatten data in a Databricks table. Since some of the columns are deeply nested and are of 'String' type, I couldn't use explode f
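When the nested payload is stored as a string, one common pattern is to infer a schema from a sample value with schema_of_json, parse with from_json, and then select the nested fields by path. A sketch with hypothetical data:

    from pyspark.sql import SparkSession
    import pyspark.sql.functions as F

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical table with a nested JSON payload stored as a string
    df = spark.createDataFrame(
        [('{"id": 1, "user": {"name": "a", "address": {"city": "x"}}}',)],
        ["payload"],
    )

    # Infer a DDL schema string from one sample value, then parse and flatten
    sample = df.select("payload").first()[0]
    schema = spark.range(1).select(F.schema_of_json(F.lit(sample))).first()[0]

    flat = (
        df.withColumn("parsed", F.from_json("payload", schema))
          .select(
              F.col("parsed.id").alias("id"),
              F.col("parsed.user.name").alias("user_name"),
              F.col("parsed.user.address.city").alias("city"),
          )
    )
    flat.show()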

Why is the Spark bucket number not equal to the number of files in the partition?

val spark = SparkSession.builder().appName("Spark SQL basic example").config("spark.master", "local").getOrCreate() import spark.implicits._ case class Someth

regexp_extract_all not working with Spark SQL

I'm using a Databricks notebook to extract all field occurrences from a text column using the regexp_extract_all function. Here is the input: field_map#'IFDSIMP.7
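regexp_extract_all is available as a SQL function in Spark 3.1+, so in a notebook it can be called through expr(). A sketch with illustrative data and a hypothetical pattern:

    from pyspark.sql import SparkSession
    import pyspark.sql.functions as F

    spark = SparkSession.builder.getOrCreate()

    # Illustrative text with repeated key=value occurrences
    df = spark.createDataFrame([("a=1;b=2;c=3",)], ["raw"])

    # The third argument picks the capture group to return for every match
    result = df.select(F.expr(r"regexp_extract_all(raw, '(\\w+)=(\\d+)', 2)").alias("values"))
    result.show(truncate=False)  # -> [1, 2, 3]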

ScalaTest: how to assert a lengthy exception message securely and cleanly without hardcoding?

I have the following code, which is used to (sha) hash columns in a spark dataframe: import org.apache.spark.sql.DataFrame import org.apache.spark.sql.functions

Can two instances of the same Spark Streaming application be in conflict?

I want to run the same Java Spark Streaming application (10-second micro-batches) through 2 instances (sparkStr1 and sparkStr2). Mainly, they consume the same Kafka topic (3

PySpark Py4JJavaError while creating a Delta table

Here is the PySpark code, which is running in a Jupyter notebook. import pyspark from delta import * builder = pyspark.sql.SparkSession.builder.appName("MyApp") \
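For context, a minimal delta-spark setup for a local notebook usually follows the quickstart pattern below; a pyspark / delta-spark version mismatch is a common cause of Py4JJavaError at this point (paths and app name are illustrative):

    import pyspark
    from delta import configure_spark_with_delta_pip

    # Extensions and catalog must be configured before the session is created;
    # configure_spark_with_delta_pip() adds the matching Delta jars.
    builder = (
        pyspark.sql.SparkSession.builder.appName("MyApp")
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog",
                "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    )
    spark = configure_spark_with_delta_pip(builder).getOrCreate()

    # Illustrative write to confirm the Delta setup works
    spark.range(5).write.format("delta").mode("overwrite").save("/tmp/delta/table")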

Accessing the current sliding window in foreachBatch in Spark Structured Streaming

I am using foreachBatch() in Spark Structured Streaming to manually maintain a sliding window consisting of the last 200000 entries. With every micro-batch I re

Apache Beam: run Docker in a pipeline

The Apache Beam pipeline (Python) I'm currently working on contains a transformation which runs a Docker container. While that works well during local testing w

Spark read csv option to escape delimiter

I have an input CSV like the one below and need to escape the delimiter within one of the columns (the 2nd column): f1|f2|f3 v1|v2\|2|v3 x1|x2\|2|x3 spark.read.option(
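Spark's CSV reader does not reliably honour a backslash-escaped delimiter outside quotes, so one workaround is to read the file as text and split on the delimiter only when it is not preceded by a backslash. A sketch using the layout above (the path is hypothetical):

    from pyspark.sql import SparkSession
    import pyspark.sql.functions as F

    spark = SparkSession.builder.getOrCreate()

    # Read as plain text; drop the header row for this simple sketch
    raw = spark.read.text("/tmp/input.csv").filter(F.col("value") != "f1|f2|f3")

    # Split on '|' only when it is not preceded by a backslash
    parts = F.split(F.col("value"), r"(?<!\\)\|")
    df = raw.select(
        parts.getItem(0).alias("f1"),
        F.regexp_replace(parts.getItem(1), r"\\\|", "|").alias("f2"),  # drop the escape
        parts.getItem(2).alias("f3"),
    )
    df.show()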