Category "apache-spark"

I am trying to start a Spark Session in Jupyter Notebook and get the following error

This is my first time using PySpark. I am using a Mac and I am trying to start up a session within Jupyter Notebook using the code below: import pyspark from py
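
For context, a minimal sketch of what starting a local session from a notebook typically looks like (the app name and master URL below are illustrative, not taken from the question):

```python
# Minimal sketch: starting a local SparkSession from a notebook.
# The app name and master URL are illustrative placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[*]")          # run Spark locally, using all available cores
    .appName("jupyter-test")     # hypothetical application name
    .getOrCreate()
)

print(spark.version)
```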

Class com.hadoop.compression.lzo.LzoCodec not found for Spark on CDH 5?

I have been working on this problem for two days and still have not found a way. Problem: Our Spark, installed via the newest CDH 5, always complains about the lost

Different default persist for RDD and Dataset

I was trying to find a good answer to why the default persist level for an RDD is MEMORY_ONLY whereas for a Dataset it is MEMORY_AND_DISK, but I couldn't find one. Does an
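
A small sketch that makes the comparison concrete: call persist() with no arguments and print the resulting storage level (the defaults are as the question states):

```python
# Sketch comparing the default storage levels when persist() gets no arguments.
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()

rdd = spark.sparkContext.parallelize(range(10))
rdd.persist()                 # RDD default is MEMORY_ONLY
print(rdd.getStorageLevel())

df = spark.range(10)
df.persist()                  # DataFrame/Dataset default is MEMORY_AND_DISK
print(df.storageLevel)        # (deserialized in recent Spark versions)
```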

Why does persist(StorageLevel.MEMORY_AND_DISK) give different results than cache() with HBase?

I may sound naive asking this question, but this is a problem I have recently faced in my project. I need a better understanding of it. df.persist(Stora
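
A minimal sketch of the two calls being compared, using a stand-in DataFrame rather than the HBase-backed one from the question; cache() is simply persist() with the default storage level:

```python
from pyspark.sql import SparkSession
from pyspark import StorageLevel

spark = SparkSession.builder.master("local[*]").getOrCreate()
df = spark.range(100)   # stand-in for the DataFrame read from HBase in the question

# Explicit storage level:
df.persist(StorageLevel.MEMORY_AND_DISK)
print(df.storageLevel)

df.unpersist()

# cache() is shorthand for persist() with the default storage level:
df.cache()
print(df.storageLevel)
```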

PySpark UDF monitoring with Prometheus

I am trying to monitor some logic in a UDF using counters, i.e. counter = Counter(...).labels("value") @udf def do_smthng(col): if col: counter.label(
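
The usual wrinkle is that the counter lives in each executor's Python worker, so its values never reach the driver (or a single /metrics endpoint) on their own. As a simpler illustration of counting inside a UDF, the sketch below uses a Spark accumulator instead; this is a swapped-in technique, not the Prometheus approach from the question:

```python
# Counting inside a UDF with a Spark accumulator (swapped in for the
# Prometheus counter from the question, which lives per executor worker).
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.master("local[*]").getOrCreate()
non_null_count = spark.sparkContext.accumulator(0)

@udf(returnType=StringType())
def do_smthng(value):
    if value:
        non_null_count.add(1)
    return value

df = spark.createDataFrame([("a",), (None,), ("b",)], ["value"])
df.select(do_smthng("value")).collect()
print(non_null_count.value)   # expected: 2 in this local run
```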

spark elasticsearch: Multiple ES-Hadoop versions detected in the classpath

I'm new to Spark. I'm trying to run a Spark job that loads data into Elasticsearch. I've built a fat jar from my code and used it during spark-submit. spark-subm

to_date gives null on format yyyyww (202001 and 202053)

I have a dataframe with a yearweek column that I want to convert to a date. The code I wrote seems to work for every week except for weeks '202001' and '202053',
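
A hedged reconstruction of the kind of call the question describes. Note that week-based patterns such as 'ww' are rejected outright in Spark 3.x, so the sketch sets spark.sql.legacy.timeParserPolicy to LEGACY so the call at least parses the way it did in Spark 2.x:

```python
# Reconstruction of the attempted conversion; '202001' and '202053' are the
# weeks the question says come back as null.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local[*]").getOrCreate()
# Spark 3.x refuses week-based patterns unless the legacy parser is enabled.
spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")

df = spark.createDataFrame([("202001",), ("202052",), ("202053",)], ["yearweek"])
df.withColumn("as_date", F.to_date(F.col("yearweek"), "yyyyww")).show()
```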

Sum a column's values based on a condition using Spark Scala

I have a dataframe like this: JoiKey period Age Amount Jk1 2022-02 2 200 Jk1 2022-02 3 450 Jk2 2022-03 5 500 Jk3 2022-03 0 200 Jk2 2022-02 8 300 Jk3 2022-03 9
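
A sketch of a conditional sum per key, written in PySpark for consistency with the other examples even though the question asks for Scala; the condition Age > 4 is an illustrative placeholder since the excerpt is truncated:

```python
# Conditional sum per key: only rows matching the (assumed) condition
# contribute to the total.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local[*]").getOrCreate()
df = spark.createDataFrame(
    [("Jk1", "2022-02", 2, 200), ("Jk1", "2022-02", 3, 450), ("Jk2", "2022-03", 5, 500)],
    ["JoiKey", "period", "Age", "Amount"],
)

result = df.groupBy("JoiKey", "period").agg(
    F.sum(F.when(F.col("Age") > 4, F.col("Amount")).otherwise(0)).alias("Amount_sum")
)
result.show()
```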

Converting Pandas dataframe into Spark dataframe error

I'm trying to convert a Pandas DF into a Spark one. DF head: 10000001,1,0,1,12:35,OK,10002,1,0,9,f,NA,24,24,0,3,9,0,0,1,1,0,0,4,543 10000001,2,0,1,12:36,OK,10002,1
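
A minimal sketch of the conversion the question describes, with illustrative stand-in columns; mixed or object-typed Pandas columns are a common source of this kind of error, and casting to explicit dtypes (or supplying a schema) usually helps:

```python
# Converting a Pandas DataFrame to a Spark DataFrame; the column names and
# rows below are stand-ins for the real data in the question.
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()

pdf = pd.DataFrame(
    {"id": [10000001, 10000001], "seq": [1, 2], "status": ["OK", "OK"]}
)

sdf = spark.createDataFrame(pdf)
sdf.show()
```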

DAG of Spark Sort application spanning two jobs

I've written a very simple Sort Scala program with Spark. object Sort { def main(args: Array[String]): Unit = { if (args.length < 2) {
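
For reference, a minimal sketch (in PySpark rather than the question's Scala) of a sort that typically shows up as two jobs in the UI, because the range partitioner first runs a sampling job to compute partition boundaries before the actual sort runs:

```python
# A key-based sort: the range partitioner samples the data (one job)
# before the sorted result is computed (a second job).
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()
sc = spark.sparkContext

pairs = sc.parallelize([(3, "c"), (1, "a"), (2, "b")])
print(pairs.sortByKey().collect())
```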

OutOfMemoryError : Java heap space in Spark

I'm facing some problems regarding memory, but I'm unable to solve them. Any help is highly appreciated. I am new to Spark and pyspark functionalities a
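
A common first step is to raise driver and executor memory; the sketch below shows where those settings go when building the session. The sizes are placeholders, not a diagnosis of the question's job:

```python
# Raising memory limits when building the session. spark.driver.memory must
# be set before the JVM starts (or passed via spark-submit); the values here
# are illustrative.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[*]")
    .config("spark.driver.memory", "4g")
    .config("spark.executor.memory", "4g")
    .appName("oom-sketch")
    .getOrCreate()
)
```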

Livy and Elasticsearch-Spark: Multiple ES-Hadoop versions detected

I'm trying to read from Elasticsearch in a Livy job using the elasticsearch-spark jar. When I upload the jar to a Livy client (like the example here) I get this e

Spark saveAsTextFile() results in Mkdirs failed to create for half of the directory

I am currently running a Java Spark application in Tomcat and receiving the following exception: Caused by: java.io.IOException: Mkdirs failed to create file:/

DataFrame partitionBy to a single Parquet file (per partition)

I would like to repartition / coalesce my data so that it is saved into one Parquet file per partition. I would also like to use the Spark SQL partitionBy API.
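
A sketch of the usual pattern: repartition by the same column you pass to partitionBy, so each partition directory is written by a single task. Column and path names are illustrative:

```python
# One Parquet file per partition value: repartition on the partition column
# first, then write with partitionBy on the same column.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local[*]").getOrCreate()
df = spark.createDataFrame(
    [("2024-01-01", 1), ("2024-01-01", 2), ("2024-01-02", 3)], ["day", "value"]
)

(
    df.repartition(F.col("day"))
    .write.mode("overwrite")
    .partitionBy("day")
    .parquet("/tmp/example_output")
)
```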

Comparing schema of dataframe using Pyspark

I have a data frame (df). To show its schema I use: from pyspark.sql.functions import * df1.printSchema() And I get the following result: #root # |-- na
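
Two common ways to compare schemas, sketched with illustrative DataFrames: a strict equality check on the StructType, and a looser comparison on (name, type) pairs that ignores nullability:

```python
# Comparing the schemas of two DataFrames.
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()
df1 = spark.createDataFrame([(1, "a")], ["id", "name"])
df2 = spark.createDataFrame([(2, "b")], ["id", "name"])

# Strict: field names, types, and nullability must all match.
print(df1.schema == df2.schema)

# Looser: compare (column name, type) pairs only, ignoring nullability.
print(set(df1.dtypes) == set(df2.dtypes))
```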

Spark job failing on jackson dependencies

I have a Spark job that is failing after the upgrade of CDH from 5.5.4, which had Spark 1.5.0, to CDH 5.13.0, which has Spark 1.6.0. The job is running with the

Is there an issue with the package name openjdk-8-jdk-headless?

I am trying to install Spark, which requires Java, using !apt-get install openjdk-8-jdk-headless -qq > /dev/null and I get an error after it. E: Failed

Zeppelin+Spark+Kubernetes: Run Zeppelin jobs on an existing Spark cluster

In a k8s cluster, how do you configure Zeppelin to run Spark jobs on an existing Spark cluster instead of spinning up a new pod? I've got a k8s cluster up and r

Another SparkContext is being constructed Error

I am using Spark and get this error when I try to enter 'pyspark' in the Windows command prompt. I tried to install PySpark on my Windows machine with this tutorial (h
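
The usual remedy is to reuse the running context rather than constructing a second one; a minimal sketch using getOrCreate():

```python
# Reuse an existing context instead of building a second one:
# getOrCreate() returns the running SparkContext/SparkSession if one exists.
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession

sc = SparkContext.getOrCreate(SparkConf().setMaster("local[*]"))
spark = SparkSession.builder.getOrCreate()   # wraps the same context
print(sc is spark.sparkContext)              # expected: True
```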

java.io.FileNotFoundException: YAML file does not exist

When I submit the Spark job from the terminal I get the error below saying the file does not exist, although I have already placed the config file on my local machine. spa
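
A hedged sketch, assuming the YAML is shipped alongside the job with spark-submit --files; files passed that way can then be resolved on every node with SparkFiles (the file name below is illustrative):

```python
# Assumes the job was launched with something like:
#   spark-submit --files /local/path/config.yaml your_app.py
# Files passed via --files are distributed to every node and can be
# located with SparkFiles.get().
from pyspark import SparkFiles
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
config_path = SparkFiles.get("config.yaml")  # illustrative file name
print(config_path)
```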