Category "apache-spark"

Calling Kubernetes Spark Operator with Java api

There is a good of examples of creating Spark jobs using the Kubernetes Spark Operator and simply submitting a request with the following kubectl apply -f spa

Spark streamming take long time read from kafka

I build a cluster use CDH5.14.2, includes 5 nodes, each node has 130G momery and 40 cpu cores. I builded the spark streamming application to read from multiple

Trigger IF Statement only when two Spark dataframe meet the conditions

I have two identical Spark DataFrame. They have the same columns. I am trying to create a IF-Else statement in one line but couldnt find a better way to do it.

KernelRestarter: restart failed in jupyter , Kernel died

[I 10:43:53.627 NotebookApp] 启动notebooks 在本地路径: /opt/soft/recommender/jupyter [I 10:43:53.627 NotebookApp]

Cannot connect to Cassandra in spark-shell

I am trying to connect to a remote cassandra cluster in my spark shell using the Spark-cassandra connector. But its throwing some unusual errors. I do the usual

Py4JJavaError in an Azure Databricks notebook pipeline

I have a curious issue, when launching a databricks notebook from a caller notebook through dbutils.notebook.run (I am working in Azure Databricks). One intere

Spark DataFrame is Untyped vs DataFrame has schema?

I am beginner to Spark, while reading about Dataframe, I have found below two statements for dataframe very often- 1) DataFrame is untyped 2) DataFrame has sch

Provider com.google.cloud.spark.bigquery.BigQueryRelationProvider could not be instantiated while reading from bigquery in Jupyter lab

I have followed this post pyspark error reading bigquery: java.lang.ClassNotFoundException: org.apache.spark.internal.Logging$class and followed the resolution

Unable to register with external shuffle server. Failed to connect on standalone Spark cluster

I followed the Dynamic allocation setup configuration however, getting the following error when starting the executors. ERROR TaskSchedulerImpl: Lost execu

Spark SQL - org.apache.spark.sql.AnalysisException

The error described below occurs when I run Spark job on Databricks the second time (the first less often). The sql query just performs create table as select

Programmatically add/remove executors to a Spark Session

I'm looking for a reliable way in Spark (v2+) to programmatically adjust the number of executors in a session. I know about dynamic allocation and the ability

sbt package is trying to download a package whose path does not exist

These are the contents of my build.sbt file: name := "WordCounter" version := "0.1" scalaVersion := "2.13.1" libraryDependencies ++= Seq( "org.apache.spar

Why do I got TypeError: cannot pickle '_thread.RLock' object when using pyspark

I'm using spark to deal with my data, like that: dataframe_mysql = spark.read.format('jdbc').options( url='jdbc:mysql://xxxxxxx',

Compare two dataframes Pyspark

I'm trying to compare two data frames with have same number of columns i.e. 4 columns with id as key column in both data frames df1 = spark.read.csv("/path/to/

Join two dataframes using the closest timestamp pyspark

So I am very new to pyspark but I am still unable to correctly create my own query. I try googling my problems but I just don't understand how most of this work

Setup Apache Sedona on EMR

I want to be able to use Apache Sedona for distributed GIS computing on AWS EMR. We need the right bootstrap script to have all dependencies. I tried setting up

Find the Max value of an Array column and find associated value in another Array with in the dataframe

I have a csv file with below data. Id Subject Marks 1 M,P,C 10,8,6 2 M,P,C 5,7,9 3 M,P,C 6,7,4 I Need to find out Max value in the Marks column for each Id an

spark save simple string to text file

I have a spark job that needs to store the last time it ran to a text file. This has to work both on HDFS but also on local fs (for testing). However it seems

Spark pivot groupby performance very slow

I am trying to pivot the dataframe of raw data size 6 GB and it used to take 30 minutes time (aggregation function sum): x_pivot = raw_df.groupBy("a", "b", "c"

Spark : skip top rows with spark-excel

I have an excel file with damaged rows on the top (3 first rows) which needs to be skipped, I'm using spark-excel library to read the excel file, on their githu

Category "apache-spark"

Other Categories