Category "apache-spark"

Spark SQL: Parse date string from dd/mm/yyyy to yyyy/mm/dd

I want to use spark SQL or pyspark to reformat a date field from 'dd/mm/yyyy' to 'yyyy/mm/dd'. The field type is string: from pyspark.sql import SparkSession fr

java.lang.NoClassDefFoundError: org/apache/log4j/spi/Filter in SparkSubmit

I've been trying to submit applications to Kubernetes. I have followed the tutorial at https://spark.apache.org/docs/latest/running-on-kubernetes.html such as

Pyspark Window function on entire data frame

Consider a pyspark data frame. I would like to summarize the entire data frame, per column, and append the result for every row. +-----+----------+-----------+

Delta Table / Athena And Spark

I have my delta table, which can be read from Athena. When I try to get the data through a query from spark I get the following error: Caused by: org.apache.sp

How to Install specific version of spark using specific version of scala

I'm running Spark 2.4.5 on my Mac. When I execute spark-submit --version (the output begins with the Spark ASCII-art banner)

pyspark.sql.utils.AnalysisException: Failed to find data source: kafka

I am trying to read a stream from Kafka using PySpark. I am using Spark version 3.0.0-preview2 and spark-streaming-kafka-0-10_2.12. Before this I just stat zoo

Spark on Kubernetes driver pod cleanup

I am running Spark 3.1.1 on Kubernetes 1.19. Once the job finishes, the executor pods get cleaned up but the driver pod remains in the Completed state. How to clean up driver po

Convert date to ISO week date in Spark

Given dates in one column, how can I create a column containing the ISO week date? An ISO week date is composed of year, week number and weekday. The year is not the same as

How did spark RDD map to Cassandra table?

I am new to Spark, and recently I saw code that saves data in RDD format to a Cassandra table. But I am not able to figure out how it is doing the column mapp

access objects in pyspark user-defined function from outer scope, avoid PicklingError: Could not serialize object

How do I avoid initializing a class within a pyspark user-defined function? Here is an example. Creating a spark session and DataFrame representing four latitu

Auto increment id in delta table while inserting

I have a problem merging CSV files using PySpark SQL with a Delta table. I managed to create an upsert function that updates if matched and inserts if not mat

Convert UTC timestamp to local time based on time zone in PySpark

I have a PySpark DataFrame, df, with some columns as shown below. The hour column is in UTC time and I want to create a new column that has the local time based

Scala error - Exception in thread "main" java.lang.UnsatisfiedLinkError: org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Ljava/lang/String;I)Z

I have a requirement where I am reading data from a CSV file and writing it to a Delta table using Scala on Windows. My Scala code is given below: import co

Connection between kafka and spark : Failed to find data source : kafka

I am trying to link Kafka and Spark by reading data from one topic and trying to print the content of this topic into a DataFrame, but by doing connect

Has anyone found good learning resources for the "Databricks Certified Data Engineer Associate" exam?

I have been studying for the above exam using Databricks' learning platform, but I have not found any external resources such as study guides or practice exams

TypeError: 'str' object is not callable -Pyspark

df1=df.withColumn('etl_load_dt_part_new', concat_ws("-",year(df.ETL_LOAD_DT_PART),lit('12'),lit('31')).cast('date') ) I am trying to add a new column named as e

Start of the week on Monday in Spark

This is my dataset: from pyspark.sql import SparkSession, functions as F spark = SparkSession.builder.getOrCreate() df = spark.createDataFrame([('2021-02-07',)

Convert columns to rows in Spark SQL

I have some data like this:
ID   Value1  Value2  Value40
101  3       520     2001
102  29      530     2020
I want to take this data and convert it into KV-style pairs instead: ID Val

PATINDEX in spark sql

I have this statement in SQL: CASE WHEN AAAA IS NOT NULL THEN AAAA ELSE RTRIM(LEFT(BBBB, PATINDEX('%[0-9]%', BBBB) - 1)) END AS NAME. I need to co

SPARK SQL create table does not show / read all columns as expected

I am trying to create a table in Spark SQL by providing the schema and giving the location. However, when I run a select on the table, I see only half the columns. (