Category "apache-spark"

Vertica data into pySpark throws "Failed to find data source"

I have Spark 3.2 and Vertica 9.2. spark = SparkSession.builder.appName("Ukraine").master("local[*]")\ .config("spark.jars", '/home/shivamanand/spark-3.2.1-bin-hado
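
A minimal sketch of the usual cause of "Failed to find data source": the connector jar (and the Vertica JDBC jar) are not on spark.jars. The jar paths, the source name "com.vertica.spark.datasource.VerticaSource", and the options below are assumptions that depend on the connector version installed.

    # Sketch: put both the Vertica Spark connector and the Vertica JDBC driver on spark.jars.
    # Paths, source name and options are placeholders; older connectors register a different
    # data source class, so check the connector's documentation for your version.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("Ukraine")
             .master("local[*]")
             .config("spark.jars",
                     "/path/to/vertica-spark-connector.jar,/path/to/vertica-jdbc.jar")
             .getOrCreate())

    df = (spark.read
          .format("com.vertica.spark.datasource.VerticaSource")  # assumed source name
          .option("host", "vertica-host")
          .option("db", "mydb")
          .option("user", "dbadmin")
          .option("password", "***")
          .option("table", "my_table")
          .load())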

reading and writing from hive tables with spark after aggregation

We have a Hive warehouse and want to use Spark for various tasks (mainly classification), at times writing the results back as a Hive table. For example, we wr
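
A minimal sketch of the usual round trip, assuming Hive support is enabled on the session and the table names below are placeholders.

    # Sketch: read a Hive table, aggregate, and write the result back as a new Hive table.
    # The warehouse location and metastore come from hive-site.xml on the classpath.
    from pyspark.sql import SparkSession, functions as F

    spark = (SparkSession.builder
             .appName("hive-aggregation")
             .enableHiveSupport()
             .getOrCreate())

    src = spark.table("warehouse_db.events")              # placeholder table name
    agg = src.groupBy("label").agg(F.count("*").alias("n"))
    agg.write.mode("overwrite").saveAsTable("warehouse_db.events_summary")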

Spark Delta table restore to version

I am trying to restore a Delta table to a previous version via Spark Java; I am using a local IDE. The code is as below: import io.delta.tables.*; DeltaTable deltaTa
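
The question uses the Java API; here is a PySpark equivalent sketch, assuming Delta Lake 1.2+ and a session configured with the Delta SQL extension and catalog. The table path and version number are placeholders.

    # Sketch: restore a Delta table at a given path back to an earlier version.
    from pyspark.sql import SparkSession
    from delta.tables import DeltaTable

    spark = (SparkSession.builder
             .appName("delta-restore")
             .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
             .config("spark.sql.catalog.spark_catalog",
                     "org.apache.spark.sql.delta.catalog.DeltaCatalog")
             .getOrCreate())

    DeltaTable.forPath(spark, "/tmp/delta/my_table").restoreToVersion(1)  # placeholder path/version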

Spark hangs on union with zero running task

I have two RDDs of type RDD[T]. For example: val a: RDD[Integer] = .... val b: RDD[Integer] = ... When I perform val z = a.union(b) println(z) I find the s
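
One detail worth keeping in mind when debugging this: union is a lazy transformation, so println(z) only prints the RDD lineage, and work only starts when an action runs. A small PySpark sketch of that distinction, using placeholder data.

    # Sketch: nothing executes at union time; the job is triggered by the action.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").appName("union-demo").getOrCreate()
    sc = spark.sparkContext

    a = sc.parallelize(range(5))
    b = sc.parallelize(range(5, 10))

    z = a.union(b)
    print(z)          # prints only the RDD's lineage description, no job runs yet
    print(z.count())  # the action that actually triggers execution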

How to tail yarn logs?

I am submitting a Spark job using the command below. I want to tail the YARN log by application ID, similar to the tail command on a Linux box. export SPARK
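
Since the other examples here are Python, this is a small wrapper around the yarn CLI rather than a shell snippet. It assumes log aggregation is enabled; the application ID is a placeholder.

    # Sketch: fetch the aggregated YARN logs for an application ID and show the last lines.
    import subprocess

    app_id = "application_1234567890123_0001"  # placeholder
    subprocess.run(f"yarn logs -applicationId {app_id} | tail -n 100",
                   shell=True, check=True)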

What are broadcast variables? What problems do they solve?

I am going through the Spark programming guide, which says: Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather th
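
A minimal sketch of the idea: a broadcast variable ships one read-only copy of a lookup table to each executor instead of serializing it into every task's closure.

    # Sketch: broadcast a small lookup dict and use it inside a map.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").appName("broadcast-demo").getOrCreate()
    sc = spark.sparkContext

    lookup = {"a": 1, "b": 2, "c": 3}
    b_lookup = sc.broadcast(lookup)            # cached once per executor

    rdd = sc.parallelize(["a", "b", "c", "a"])
    mapped = rdd.map(lambda k: b_lookup.value.get(k, 0))
    print(mapped.collect())                    # [1, 2, 3, 1]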

Concatenating string by rows in pyspark

I have a PySpark dataframe as DOCTOR | PATIENT JOHN | SAM JOHN | PETER JOHN | ROBIN BEN | ROSE BEN | GRAY and need to concatenate patient n
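
A minimal sketch of one way to do this: collect the patient names per doctor and join them into a single comma-separated string. Note that collect_list does not guarantee ordering.

    # Sketch: group by DOCTOR and concatenate PATIENT values per group.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.master("local[*]").appName("concat-rows").getOrCreate()
    df = spark.createDataFrame(
        [("JOHN", "SAM"), ("JOHN", "PETER"), ("JOHN", "ROBIN"),
         ("BEN", "ROSE"), ("BEN", "GRAY")],
        ["DOCTOR", "PATIENT"])

    result = (df.groupBy("DOCTOR")
                .agg(F.concat_ws(",", F.collect_list("PATIENT")).alias("PATIENTS")))
    result.show(truncate=False)   # JOHN -> SAM,PETER,ROBIN ; BEN -> ROSE,GRAY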

Spark job loses executors: ERROR TaskSchedulerImpl: Lost executor 1... -> ./app.jar: No space left on device

I'm running both the master and one worker on a GPU server in standalone mode. After submitting the job, it acquires and loses executors X times be
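
A hedged sketch of one common mitigation: "No space left on device" in standalone mode usually means the scratch or worker work directory has filled up, so pointing spark.local.dir at a volume with free space (and cleaning the standalone worker's work directory, controlled by SPARK_WORKER_DIR) often helps. The path below is a placeholder.

    # Sketch: move Spark's scratch space to a disk with free room; not the only possible fix.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("gpu-job")
             .config("spark.local.dir", "/data/spark-tmp")   # placeholder path with free space
             .getOrCreate())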

How to efficiently remove duplicate rows in Spark Dataframe, keeping row with highest timestamp

I have a large data set which I am reading from Postgres. It has an ID column, a timestamp column and several other columns which may have been updated. For eac
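
A minimal sketch of the usual window-function approach, assuming columns named "id" and "updated_at" (placeholders): rank rows within each ID by timestamp descending and keep only the first.

    # Sketch: keep the newest row per id using row_number over a window.
    from pyspark.sql import SparkSession, Window, functions as F

    spark = SparkSession.builder.master("local[*]").appName("dedupe").getOrCreate()
    df = spark.createDataFrame(
        [(1, "2023-01-01", "old"), (1, "2023-02-01", "new"), (2, "2023-01-15", "only")],
        ["id", "updated_at", "payload"])       # stand-in for the Postgres read

    w = Window.partitionBy("id").orderBy(F.col("updated_at").desc())
    deduped = (df.withColumn("rn", F.row_number().over(w))
                 .filter(F.col("rn") == 1)
                 .drop("rn"))
    deduped.show()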

How to extract values from key value map?

I have a column of type map, where the keys and values change. I am trying to extract the value and create a new column. Input: ----------------+ |symbols
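
A sketch assuming a MapType column named "symbols" (a placeholder name): map_values() pulls the values out as an array, and element_at() grabs a single value when each map holds one entry.

    # Sketch: extract map values into a new column.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.master("local[*]").appName("map-values").getOrCreate()
    df = spark.createDataFrame([({"AAPL": 150.0},), ({"MSFT": 310.0},)], ["symbols"])

    out = (df.withColumn("values", F.map_values("symbols"))
             .withColumn("first_value", F.element_at(F.map_values("symbols"), 1)))
    out.show(truncate=False)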

SPARK SQL - case when then

I'm new to Spark SQL. Is there an equivalent to "CASE WHEN 'CONDITION' THEN 0 ELSE 1 END" in Spark SQL? select case when 1=1 then 1 else 0 end from table Tha
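
A minimal sketch showing that both the SQL form and the DataFrame API (when/otherwise) express the same CASE WHEN; the table and column names are placeholders.

    # Sketch: CASE WHEN in Spark SQL and its DataFrame equivalent.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.master("local[*]").appName("case-when").getOrCreate()
    df = spark.createDataFrame([(1,), (2,)], ["x"])
    df.createOrReplaceTempView("table_a")

    # SQL form
    spark.sql("SELECT x, CASE WHEN x = 1 THEN 1 ELSE 0 END AS flag FROM table_a").show()

    # DataFrame API equivalent
    df.withColumn("flag", F.when(F.col("x") == 1, 1).otherwise(0)).show()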

How to use spark with large decimal numbers?

My database has numeric values that go up to 256-bit unsigned integers. However, Spark's DecimalType has a limit of Decimal(38,18). When I try to do calculatio
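
A sketch of one workaround (not the only option): keep the oversized values as strings in Spark and do the arithmetic with Python's arbitrary-precision integers inside a UDF. The column name and sample value are placeholders.

    # Sketch: exact arithmetic on values wider than Decimal(38,18) via a string-in/string-out UDF.
    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.types import StringType

    spark = SparkSession.builder.master("local[*]").appName("bigint-demo").getOrCreate()
    df = spark.createDataFrame(
        [("115792089237316195423570985008687907853269984665640564039457",)],
        ["amount"])

    @F.udf(returnType=StringType())
    def double_amount(s):
        # Python ints are unbounded, so this stays exact; the result goes back as a string.
        return str(int(s) * 2)

    df.withColumn("doubled", double_amount("amount")).show(truncate=False)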

How can I increase spark.driver.memoryOverhead in Google dataproc?

I am getting two types of errors when running a job on Google Dataproc, and they cause executors to be lost one by one until the last executor is lost and the j
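
A hedged sketch: memory overhead settings have to be in place before the JVMs start, so they are usually passed as job properties at submit time rather than set inside the job. The cluster name, region, file name, and values below are all placeholders.

    # Sketch: submit a PySpark job to Dataproc with larger memory overheads.
    import subprocess

    subprocess.run([
        "gcloud", "dataproc", "jobs", "submit", "pyspark", "job.py",
        "--cluster=my-cluster", "--region=us-central1",
        "--properties=spark.driver.memoryOverhead=2048,spark.executor.memoryOverhead=2048",
    ], check=True)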

Do I need to use Spark with YARN to achieve NODE LOCAL data locality with HDFS?

Do I need to use Spark with YARN to achieve NODE LOCAL data locality with HDFS? If I use Spark standalone cluster manager and have my data distributed in HDFS c

I am trying to start a Spark Session in Jupyter Notebook and get the following error

This is my first time using PySpark. I am using a Mac and I am trying to start a session within a Jupyter notebook using the code below: import pyspark from py
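
A minimal sketch of a local session that typically works inside Jupyter once pyspark and Java are reachable from the notebook kernel; the third-party findspark helper is optional and only needed when the Spark installation is not already on the Python path.

    # Sketch: start a local SparkSession from a notebook cell.
    import findspark
    findspark.init()           # optional: locates the Spark installation

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .master("local[*]")
             .appName("notebook")
             .getOrCreate())
    spark.range(5).show()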

Class com.hadoop.compression.lzo.LzoCodec not found for Spark on CDH 5?

I have been working on this problem for two days and still have not found a way. Problem: our Spark, installed via the newest CDH 5, always complains about the lost
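
A hedged sketch of the usual direction for this error: make the hadoop-lzo jar and its native libraries visible to both the driver and the executors. The parcel paths below are placeholders that depend on the CDH layout.

    # Sketch: put hadoop-lzo on the class path and its native libs on the library path.
    from pyspark.sql import SparkSession

    lzo_jar = "/opt/cloudera/parcels/GPLEXTRAS/lib/hadoop/lib/hadoop-lzo.jar"   # placeholder
    lzo_native = "/opt/cloudera/parcels/GPLEXTRAS/lib/hadoop/lib/native"        # placeholder

    spark = (SparkSession.builder
             .appName("lzo-test")
             .config("spark.driver.extraClassPath", lzo_jar)
             .config("spark.executor.extraClassPath", lzo_jar)
             .config("spark.driver.extraLibraryPath", lzo_native)
             .config("spark.executor.extraLibraryPath", lzo_native)
             .getOrCreate())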

Different default persist for RDD and Dataset

I was trying to find a good answer as to why the default persist level for an RDD is MEMORY_ONLY whereas for a Dataset it is MEMORY_AND_DISK, but I couldn't find one. Does an
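
A small sketch for reference: the defaults differ, but both APIs accept an explicit StorageLevel, so the behaviour can be made uniform either way.

    # Sketch: spell out the default storage levels explicitly.
    from pyspark import StorageLevel
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").appName("persist-demo").getOrCreate()

    rdd = spark.sparkContext.parallelize(range(10))
    rdd.persist(StorageLevel.MEMORY_ONLY)        # RDD default when persist() takes no argument

    df = spark.range(10)
    df.persist(StorageLevel.MEMORY_AND_DISK)     # Dataset/DataFrame default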

Why does persist(StorageLevel.MEMORY_AND_DISK) give different results than cache() with HBase?

I may sound naive asking this question, but it is a problem I recently faced in my project and need a better understanding of. df.persist(Stora

Pyspark UDF monitoring with prometheus

I am trying to monitor some logic in a UDF using counters, i.e. counter = Counter(...).labels("value") @udf def do_smthng(col): if col: counter.label(
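
A sketch of one workaround under the assumption that the counter lives in a driver-side Prometheus registry: the UDF runs on executors, so those increments never reach the driver's registry. A Spark accumulator aggregates the count back to the driver, where it can then be exported to Prometheus. The names below are placeholders.

    # Sketch: count UDF events with an accumulator instead of a driver-side counter.
    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.types import StringType

    spark = SparkSession.builder.master("local[*]").appName("udf-metrics").getOrCreate()
    non_empty = spark.sparkContext.accumulator(0)

    @F.udf(returnType=StringType())
    def do_smthng(col):
        if col:
            non_empty.add(1)          # executor-side increment, merged on the driver
        return col

    df = spark.createDataFrame([("a",), (None,), ("b",)], ["col"])
    df.select(do_smthng("col")).collect()   # an action so the UDF actually runs
    print(non_empty.value)                  # read on the driver, e.g. export to Prometheus here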

spark elasticsearch: Multiple ES-Hadoop versions detected in the classpath

I'm new to Spark. I'm trying to run a Spark job that loads data into Elasticsearch. I've built a fat jar from my code and used it during spark-submit. spark-subm
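
A hedged sketch of one way out: this error usually means more than one es-hadoop artifact ended up on the classpath (for example elasticsearch-hadoop bundled in the fat jar plus elasticsearch-spark pulled in separately), so shipping exactly one connector, here via spark.jars.packages, avoids the clash. The Maven coordinate, version, host, and index name are placeholders.

    # Sketch: rely on a single es-hadoop artifact instead of bundling it into the fat jar.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("es-load")
             .config("spark.jars.packages",
                     "org.elasticsearch:elasticsearch-spark-30_2.12:8.11.0")   # placeholder version
             .getOrCreate())

    df = spark.createDataFrame([(1, "hello")], ["id", "text"])
    (df.write
       .format("org.elasticsearch.spark.sql")
       .option("es.nodes", "localhost")      # placeholder host
       .option("es.resource", "my-index")    # placeholder index
       .mode("append")
       .save())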