I wrote a DataFrame containing 1,000,000 rows to Parquet. When I read the Parquet file back, the result is 1,000,000 rows. df = spark.read.parquet(parquet_path) df.
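A minimal sketch of that round trip, assuming a local SparkSession and a writable placeholder path:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    parquet_path = "/tmp/example_parquet"  # placeholder location

    # Write 1,000,000 rows out as Parquet, then read them back and count.
    source_df = spark.range(1_000_000)
    source_df.write.mode("overwrite").parquet(parquet_path)

    df = spark.read.parquet(parquet_path)
    print(df.count())  # expected: 1000000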
I want to stop my Spark instance once I complete my job running in a Jupyter notebook. I executed spark.stop() at the end, but when I open my terminal, I'
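For reference, a minimal sketch of shutting the session down from a notebook, assuming Spark 3.x where SparkSession.getActiveSession() is available; spark.stop() releases Spark's resources but does not terminate the Jupyter kernel process itself:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # ... notebook job runs here ...

    spark.stop()

    # After stop(), Spark 3.x reports no active session.
    print(SparkSession.getActiveSession())  # expected: None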
I have code similar to this in Spark (Scala). I would like to know the number of records this code updated/inserted when execute() is complete. Is there a way?
What the console of the Kafka consumer looks like:
["2017-12-31 16:06:01", 12472391, 1]
["2017-12-31 16:06:01", 12472097, 1]
["2017-12-31 16:05:59", 12471979,
I'm trying to write a DataFrame to AWS Keyspaces, but I'm getting the messages below: Stack: dfExploded.write.cassandraFormat(table = "table", keyspa
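The question uses the Scala cassandraFormat helper; in PySpark the equivalent write goes through the connector's data source name, roughly as below. The table and keyspace names are placeholders, and the spark-cassandra-connector package plus the SSL/auth settings that AWS Keyspaces requires are assumed to be configured already:

    # dfExploded is the DataFrame from the question.
    (dfExploded.write
        .format("org.apache.spark.sql.cassandra")
        .options(table="table", keyspace="keyspace")  # placeholder names
        .mode("append")
        .save())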
I'm using Databricks Autoloader to incrementally stream from a Delta Lake table into a SQL database. If an OPTIMIZE or VACUUM statement is run against the Delt
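For context, a sketch of reading the Delta table as a stream with the Delta source rather than the cloudFiles (Autoloader) source, since that is where OPTIMIZE/VACUUM interactions usually show up; paths are placeholders, and ignoreChanges is a Delta streaming option that keeps the stream alive when upstream files are rewritten:

    # Stream from a Delta table; the console sink stands in for the SQL sink.
    stream_df = (spark.readStream
        .format("delta")
        .option("ignoreChanges", "true")
        .load("/path/to/delta/table"))      # placeholder path

    query = (stream_df.writeStream
        .format("console")
        .option("checkpointLocation", "/path/to/checkpoint")  # placeholder
        .start())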
When I try to write the DataFrame to S3 as Parquet, I always get an error like the one below. In the S3 bucket, an empty folder is generated automatically every time, b
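For reference, the write itself is a one-liner; a sketch assuming the s3a filesystem (hadoop-aws) and credentials are already configured, with bucket and prefix as placeholders:

    # df is the DataFrame from the question.
    (df.write
        .mode("overwrite")
        .parquet("s3a://my-bucket/output/parquet/"))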
spark-submit in a Databricks cluster is giving this error. I am using Spark 3.1.2, Scala 2.12, and Spring Boot 2.6.3. However, spark-submit runs fine in m
I have two tables, one with 50K records and the other with 2.5K records, and I want to update these 2.5K records into the first table. Currently I am doing this by us
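One common join-based upsert pattern for this, sketched with plain DataFrames, hypothetical names big_df/small_df, and a hypothetical key column id (if the large table is a Delta table, MERGE INTO is the usual alternative):

    # Drop the big-table rows that have a replacement, then append the
    # 2.5K updated rows. Assumes both DataFrames share the same schema.
    updated = (big_df.join(small_df, on="id", how="left_anti")
                     .unionByName(small_df))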
Is there a way of counting approximately after a GROUP BY on an SQL dataset in Spark? Or, more generally, what is the fastest way of group-by counting in Spark?
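For approximate counting, Spark exposes approx_count_distinct; an exact per-group row count is just groupBy().count(). A sketch with hypothetical column names group_col and user_id:

    from pyspark.sql import functions as F

    # Exact number of rows per group.
    exact = df.groupBy("group_col").count()

    # Approximate number of distinct users per group (HyperLogLog-based),
    # with a 5% relative standard deviation.
    approx = df.groupBy("group_col").agg(
        F.approx_count_distinct("user_id", rsd=0.05).alias("approx_users")
    )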
I have four questions. Suppose in Spark I have 3 worker nodes. Each worker node has 3 executors, and each executor has 3 cores. Each executor has 5 GB of memory. (T
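For reference, the layout described (3 workers × 3 executors × 3 cores, 5 GB each, i.e. 9 executors and 27 cores in total) roughly corresponds to the standard properties below; how they map onto physical nodes depends on the cluster manager:

    from pyspark.sql import SparkSession

    # 9 executors overall (3 per worker), each with 3 cores and 5 GB heap.
    spark = (SparkSession.builder
        .config("spark.executor.instances", "9")
        .config("spark.executor.cores", "3")
        .config("spark.executor.memory", "5g")
        .getOrCreate())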
I'm creating a cluster in AWS EMR, and when Spark runs my application I get the error below: Exception in thread "main" java.lang.UnsupportedClassVersionError:
I am attempting to use Scala with Apache Spark locally to query a Hive table which is secured with Kerberos. I have no issues connecting and querying the data pro
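A minimal sketch of a Hive-enabled session for this kind of setup, assuming a valid Kerberos ticket already exists (e.g. via kinit or spark-submit's principal/keytab options) and with the metastore URI as a placeholder:

    from pyspark.sql import SparkSession

    # Hive support also picks up hive-site.xml if it is on the classpath.
    spark = (SparkSession.builder
        .appName("kerberized-hive-query")
        .config("hive.metastore.uris", "thrift://metastore.example.com:9083")  # placeholder
        .enableHiveSupport()
        .getOrCreate())

    spark.sql("SELECT COUNT(*) FROM secured_db.secured_table").show()  # placeholder table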
I want to use Spark SQL or PySpark to reformat a date field from 'dd/mm/yyyy' to 'yyyy/mm/dd'. The field type is string: from pyspark.sql import SparkSession fr
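A sketch of the conversion with to_date plus date_format; note that month is MM in Spark's datetime patterns (mm is minutes):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("31/12/2017",)], ["date_str"])

    # Parse with the source pattern, then render with the target pattern.
    df = df.withColumn(
        "date_reformatted",
        F.date_format(F.to_date("date_str", "dd/MM/yyyy"), "yyyy/MM/dd"),
    )
    df.show()  # 31/12/2017 -> 2017/12/31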
I've been trying to submit applications to a Kubernetes cluster. I have followed the tutorial at https://spark.apache.org/docs/latest/running-on-kubernetes.html, such as
Consider a PySpark DataFrame. I would like to summarize the entire DataFrame, per column, and append the result to every row. +-----+----------+-----------+
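One way to get per-column summaries onto every row is to aggregate once and cross-join the single-row result back, sketched here with a hypothetical numeric column value:

    from pyspark.sql import functions as F

    # Single-row DataFrame holding the per-column summaries.
    summary = df.agg(
        F.min("value").alias("value_min"),
        F.max("value").alias("value_max"),
        F.avg("value").alias("value_avg"),
    )

    # Attach the summary columns to every row of the original DataFrame.
    df_with_summary = df.crossJoin(summary)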
I have a Delta table which can be read from Athena. When I try to get the data through a query from Spark I get the following error: Caused by: org.apache.sp
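For comparison, a plain Delta read from Spark looks roughly like this, assuming the delta-spark package and its SQL extension/catalog settings are configured on the session; the path is a placeholder:

    df = (spark.read
        .format("delta")
        .load("s3://my-bucket/path/to/delta-table"))  # placeholder path
    df.show(5)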
I'm running Spark 2.4.5 on my Mac. When I execute spark-submit --version, the output begins with the Spark ASCII-art version banner
I am trying to read a stream from Kafka using PySpark. I am using Spark version 3.0.0-preview2 and spark-streaming-kafka-0-10_2.12. Before this I just start zoo
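A minimal Structured Streaming read from Kafka looks roughly like this, assuming the spark-sql-kafka-0-10 package is on the classpath (the spark-streaming-kafka-0-10 artifact mentioned above is the separate DStream API); broker and topic are placeholders:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    stream_df = (spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "localhost:9092")  # placeholder broker
        .option("subscribe", "my_topic")                      # placeholder topic
        .option("startingOffsets", "earliest")
        .load())

    # Kafka values arrive as binary; cast to string for inspection.
    query = (stream_df.selectExpr("CAST(value AS STRING)")
        .writeStream
        .format("console")
        .start())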
I am running Spark 3.1.1 on Kubernetes 1.19. Once the job finishes, the executor pods get cleaned up, but the driver pod remains in the Completed state. How do I clean up the driver po
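Spark intentionally leaves the completed driver pod in place so its logs and status remain inspectable, so cleanup happens outside Spark. A sketch using the official Kubernetes Python client, assuming Spark's default spark-role=driver label and a placeholder namespace:

    from kubernetes import client, config

    config.load_kube_config()  # or config.load_incluster_config() inside the cluster
    v1 = client.CoreV1Api()

    namespace = "spark-jobs"   # placeholder
    drivers = v1.list_namespaced_pod(namespace, label_selector="spark-role=driver")

    # Delete only driver pods that have finished successfully.
    for pod in drivers.items:
        if pod.status.phase == "Succeeded":
            v1.delete_namespaced_pod(pod.metadata.name, namespace)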