Category "apache-spark"

Pyspark AttributeError: 'NoneType' object has no attribute 'split'

I am working in PySpark with the flatMap function, and I am using split inside that function. But I am getting an error which says: AttributeError: 'NoneTy
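The excerpt is cut off, so the actual cause is not visible here. One common cause of this error is that some records handed to the flatMap function are None (for example, null values in the input). The following is a minimal sketch under that assumption. `split_words` and `flat_map` are hypothetical names, and `flat_map` is only a local stand-in for RDD.flatMap so the guard can be tried without a cluster.

```python
from typing import Callable, Iterable, List, Optional


def split_words(line: Optional[str]) -> List[str]:
    """Return the whitespace-separated tokens of line, or [] when line is None."""
    if line is None:
        # Calling .split on None is what raises the AttributeError.
        return []
    return line.split()


def flat_map(func: Callable, records: Iterable) -> list:
    """Local stand-in for RDD.flatMap: apply func to each record and flatten."""
    return [item for record in records for item in func(record)]


# Hypothetical input with one missing record.
records = ["a b", None, "c"]
tokens = flat_map(split_words, records)
```

The same guarded function could be passed to flatMap on a real RDD. Filtering out None records before the flatMap is an alternative with the same effect.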

How can I use snowflake jar in Bitnami Spark Docker container?

I was able to create a Docker-based Bitnami standalone Spark instance and run Spark jobs on it. However, I'm not able to write data to Snowflake from the

Getting duplicate records while querying Hudi table using Hive on Spark Engine in EMR 6.3.1

I am querying a Hudi table using Hive, which is running on the Spark engine in an EMR 6.3.1 cluster. The Hudi version is 0.7. I have inserted a few records and then updated t

Extract value from ArrayType column in Scala and reshape to long

I have a DataFrame that contains a column of ArrayType, and the array may have a different length in each row of the data. I have provided some example cod

How to find position of substring in another column of dataframe using spark scala

I have a Spark Scala DataFrame with two columns, text and subtext, where subtext is guaranteed to occur somewhere within text. How would I calculate the positio

Only Two Applications ever running for Synapse Notebooks in For Each activity

I am running a Synapse Notebook in a For Each activity in a Synapse Pipeline. The notebook loads some data from the datalake into the database and some custom

How "stable" is monotonically_increasing_id() in Spark?

I'm looking for an inexpensive way to distinguish duplicates and/or uniquely identify rows. I've been looking at the Spark built-ins monotonically_increasing_id
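As background for the stability question: the Spark documentation describes the current implementation as placing the partition ID in the upper 31 bits and the record number within each partition in the lower 33 bits. The sketch below only models that documented layout in plain Python. `sketch_monotonic_id` and `assign_ids` are hypothetical helpers, not Spark APIs, and the row layouts are made-up examples.

```python
PARTITION_SHIFT = 33  # lower 33 bits hold the record number within a partition


def sketch_monotonic_id(partition_id: int, row_in_partition: int) -> int:
    """Model of the documented layout: partition ID above, record number below."""
    return (partition_id << PARTITION_SHIFT) + row_in_partition


def assign_ids(partitions):
    """Assign a modelled ID to every row, given rows grouped by partition."""
    return [
        [sketch_monotonic_id(p, i) for i, _ in enumerate(rows)]
        for p, rows in enumerate(partitions)
    ]


# The same three rows under two hypothetical partitionings.
ids_two_partitions = assign_ids([["a", "b"], ["c"]])
ids_three_partitions = assign_ids([["a"], ["b"], ["c"]])
```

In this model, row "b" receives a different ID in each layout, which suggests the generated IDs depend on how the data happens to be partitioned and ordered when the expression is evaluated.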

Databricks Error: AnalysisException: Incompatible format detected. with Delta

I'm getting the following error when I attempt to write to my data lake with Delta on Databricks fulldf = spark.read.format("csv").option("header", True).option

Spark 3.2.1 fetch HBase data not working with NewAPIHadoopRDD

Below is the sample code snippet that is used for data fetch from HBase. This worked fine with Spark 3.1.2. However after upgrading to Spark 3.2.1, it is not wo

Glue-Spark transform to Postgres time data type

Postgres has a time data type. I am trying to insert rows into Postgres from a Glue job. Given the code: applymapping1 = ApplyMapping.apply(frame = SelectFromCo

Pyspark caches dataframe by default or not?

If I read a file in PySpark: Data = spark.read(file.csv) Then for the life of the Spark session, the ‘data’ is available in memory, correct? So if I

Split corresponding column values in pyspark

The table below is the input dataframe:
col1  col2      col3
1     12;34;56  Aus;SL;NZ
2     31;54;81  Ind;US;UK
3     null      Ban
4     Ned       null
Expected output dataframe [values of c

How to find quantile of a row in PySpark dataframe?

I have the following PySpark dataframe and I want to find percentile row-wise. value col_a col_b col_c row_a 5.0 0.0 11.0 row_b 3394.0 0

Azure Databricks Delta Table modifies the TIMESTAMP format while writing from Spark DataFrame

I am new to Azure Databricks. I am trying to write a dataframe output to a Delta table which contains a TIMESTAMP column. But strangely it changes the TIMESTAMP pa

Creating spark application with VSCode - Synapse PySpark installation error - Exit with non zero 3221225477

I am on Windows and I am trying to follow this doc to create Spark applications with VSCode using a Synapse workspace. I can sign into Azure and set a default S

h2o-pysparkling-2.4 and Glue Jobs with: {"error":"TypeError: 'JavaPackage' object is not callable","errorType":"EXECUTION_FAILURE"}

I am trying to use pysparkling.ml.H2OMOJOModel to predict on a Spark dataframe using a MOJO model trained with h2o==3.32.0.2 in AWS Glue Jobs; however, I got the e

How parquet columns can be skipped when reading from hdfs?

We all know Parquet is column-oriented, so we can read only the columns we want and reduce IO. But what if the Parquet file is stored in HDFS, should we download t

"Array dimensions exceeded supported range" when creating a dataframe

I'm getting this exception when trying to create a DataFrame with only 50 million rows. Any ideas how to avoid this problem? [Exception] [JvmBridge] Array dime

Upgrading from Spark 3.1 to Spark 3.2

In Spark 3.2, special datetime values such as epoch, today, yesterday, tomorrow, and now are supported in typed literals or in cast of foldable strings only, fo

Alter multiple column comments simultaneously in spark/delta lake

Short version: Need a faster/better way to update many column comments at once in spark/databricks. I have a pyspark notebook that can do this sequentially acro