Category "apache-spark"

Spark Streaming "ERROR JobScheduler: error in job generator"

I built a Spark Streaming application that keeps receiving messages from Kafka and writes them into an HBase table. The app runs well for the first 25 mins

How to override log4j with log4j2 to resolve "SocketServer class vulnerable to deserialization" for spark-core_2.12

How to override log4j version 1.2.17 with log4j-core version 2.16.0 to resolve "SocketServer class vulnerable to deserialization" for spark-core_2.12 binaries.

Google Cloud Dataproc cluster created with an environment.yaml with a Jupyter resource, but the environment is not available as a Jupyter kernel

I have created a new Dataproc cluster with a specific environment.yaml. Here is the command that I used to create that cluster: gcloud dataproc clusters cr

Getting java.lang.NoSuchMethodError: org.yaml.snakeyaml.Yaml.<init> while running spark based spring boot application

SnakeYaml jar present at classPath: snakeyaml-1.26.jar 2330 [main] ERROR org.springframework.boot.SpringApplication - Application run failed java.lang.NoSuchMe

How do you create merge_asof functionality in PySpark?

Table A has many columns with a date column, Table B has a datetime and a value. The data in both tables are generated sporadically with no regular interval. Ta

How to apply the describe function after grouping a PySpark DataFrame?

I want to find the cleanest way to apply the describe function to a grouped DataFrame (this question can also be generalized to applying any DF function to a grouped DF). I

Spark Streaming JDBC: read the stream as and when data arrives - Data source jdbc does not support streamed reading

I am using PostgreSQL as the database. I want to capture one table's data for each batch, convert it to a Parquet file, and store it in S3. I tried to connect using JDB

How to use a Scala class inside PySpark

I've been searching for a while for a way to use a Scala class in PySpark, and I haven't found any documentation or guide on the subject. Let's