Category "apache-spark-sql"

When to cache a DataFrame?

My question is: when should I call dataframe.cache(), and when is it useful? Also, in my code, should I cache the dataframes at the commented lines? Note: my dataframes …
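
A minimal sketch of the usual rule of thumb: cache only when the same DataFrame feeds more than one action, and unpersist when done. The path and the status column here are illustrative, not from the question.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.read.parquet("/tmp/events")  # hypothetical input

    df.cache()                  # worthwhile only if df feeds multiple actions
    total = df.count()          # first action materializes the cache
    errors = df.filter(df.status == "error").count()  # reuses the cached data
    df.unpersist()              # release the memory once done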

Unable to start spark-shell; spark-submit failing

I am trying to run spark-submit but it is failing with a weird message: Error: Could not find or load main class org.apache.spark.launcher.Main /opt/spark/b…

Explode a JSON array in a Spark Scala dataframe

Let's say I have a dataframe which looks like this: +--------------------+--------------------+…
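
For reference, the usual pattern (sketched in PySpark; the Scala API has the same shape) is to parse the string with from_json against an explicit array schema and then explode. The payload column and its schema are assumptions.

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.types import ArrayType, StructType, StructField, StringType

    spark = SparkSession.builder.getOrCreate()
    # assumed shape: one string column holding a JSON array of objects
    df = spark.createDataFrame([('[{"name": "a"}, {"name": "b"}]',)], ["payload"])

    schema = ArrayType(StructType([StructField("name", StringType())]))
    (df.withColumn("items", F.from_json("payload", schema))
       .select(F.explode("items").alias("item"))
       .select("item.name")
       .show())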

Pyspark throwing error while trying to read parquet

I am a newbie in PySpark. While trying to read a parquet file through PySpark I get the error below. I have tried various things like reinstalling the JRE and JDK…
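
For reference, the read itself is a one-liner, so failures at this point usually come from the Java/Hadoop environment rather than the call; a minimal sketch with an assumed path:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.read.parquet("/path/to/file.parquet")  # path is illustrative
    df.printSchema()
    df.show(5)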

Spark DataFrame is Untyped vs DataFrame has schema?

I am a beginner with Spark. While reading about DataFrames, I have very often found the two statements below: 1) DataFrame is untyped; 2) DataFrame has a schema…
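
The two statements are compatible: a DataFrame carries a full schema at runtime, but column references are only checked when the query is analyzed, not at compile time. A small sketch with illustrative data:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, "alice")], ["id", "name"])

    df.printSchema()        # runtime schema: id bigint, name string
    df.select("id").show()  # resolves fine
    # df.select("missing")  # parses fine, but fails at runtime with
    #                       # AnalysisException: cannot resolve 'missing'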

Spark SQL - org.apache.spark.sql.AnalysisException

The error described below occurs when I run a Spark job on Databricks a second time (less often the first time). The SQL query just performs a CREATE TABLE AS SELECT…
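
One common cause of CREATE TABLE AS SELECT failing only on reruns is that the table already exists from the first run; assuming that diagnosis, a hedged workaround is to drop it first (all names below are illustrative):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    # drop any leftover table from a previous run, then recreate it
    spark.sql("DROP TABLE IF EXISTS my_db.my_table")
    spark.sql("CREATE TABLE my_db.my_table AS SELECT * FROM my_db.source_table")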

Why do I get TypeError: cannot pickle '_thread.RLock' object when using pyspark

I'm using Spark to process my data, like this: dataframe_mysql = spark.read.format('jdbc').options(url='jdbc:mysql://xxxxxxx', …
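
This error typically means a non-picklable driver object (the SparkSession, a DataFrame, a live connection) was captured inside a UDF or closure; a sketch of the anti-pattern and the fix, with illustrative data:

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.types import LongType

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1,), (2,)], ["x"])

    # BAD: the lambda closes over df/spark, which cannot be pickled:
    # add_one = F.udf(lambda x: df.count() + x, LongType())

    # GOOD: the UDF only touches plain Python values
    add_one = F.udf(lambda x: x + 1, LongType())
    df.withColumn("y", add_one("x")).show()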

Compare two dataframes Pyspark

I'm trying to compare two data frames that have the same number of columns, i.e. 4 columns with id as the key column in both: df1 = spark.read.csv("/path/to/…
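
Assuming both frames share the same schema, exceptAll in both directions (Spark 2.4+) is a common way to get the rows unique to each side:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df1 = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])
    df2 = spark.createDataFrame([(1, "a"), (2, "c")], ["id", "val"])

    df1.exceptAll(df2).show()  # rows in df1 missing from df2
    df2.exceptAll(df1).show()  # rows in df2 missing from df1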

Join two dataframes using the closest timestamp pyspark

I am very new to PySpark and still unable to correctly create my own query. I try googling my problems, but I just don't understand how most of this works…
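
One workable pattern (all column names assumed): join on the key, compute the absolute time difference, and keep the closest right-hand row per left-hand row with a window:

    from pyspark.sql import SparkSession, Window, functions as F

    spark = SparkSession.builder.getOrCreate()
    left = spark.createDataFrame([(1, 100)], ["id", "l_ts"])
    right = spark.createDataFrame([(1, 90), (1, 130)], ["id", "r_ts"])

    joined = (left.join(right, "id")
              .withColumn("diff", F.abs(F.col("l_ts") - F.col("r_ts"))))
    w = Window.partitionBy("id", "l_ts").orderBy("diff")
    (joined.withColumn("rn", F.row_number().over(w))
           .filter("rn = 1").drop("rn", "diff")
           .show())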

Spark: skip top rows with spark-excel

I have an Excel file with damaged rows at the top (the first 3 rows) which need to be skipped. I'm using the spark-excel library to read the file; on their GitHub…
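
If memory serves, the com.crealytics spark-excel reader can start at an arbitrary cell via its dataAddress option, which would skip the first three rows; treat the option name as an assumption to verify against the library's README for your version:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    # dataAddress "'Sheet1'!A4" starts reading at row 4 (assumed option name)
    df = (spark.read.format("com.crealytics.spark.excel")
          .option("dataAddress", "'Sheet1'!A4")
          .option("header", "true")
          .load("/path/to/file.xlsx"))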

Update DeltaTable properties on S3

I have a DeltaTable at an AWS S3 location (s3://bucket/myDeltaTable) with the default table property delta.logRetentionDuration set to 30 days. Is there a way I…
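
Delta table properties can generally be changed with ALTER TABLE ... SET TBLPROPERTIES, addressing the table by path; the interval value below is illustrative, and the session must have Delta Lake enabled:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    spark.sql("""
        ALTER TABLE delta.`s3://bucket/myDeltaTable`
        SET TBLPROPERTIES ('delta.logRetentionDuration' = 'interval 7 days')
    """)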

Spark dataframe transform multiple rows to column

I am a novice with Spark, and I want to transform the source dataframe below (loaded from a JSON file): +--+-----+-----+ |A |count|major| +--+-----+-----+ | a| 1| m…
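
Rows-to-columns in Spark is usually groupBy().pivot(); a sketch built around the A/count/major columns visible in the snippet (the values are made up):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [("a", 1, "m1"), ("a", 2, "m2"), ("b", 3, "m1")],
        ["A", "count", "major"])

    # one row per A, one column per distinct major, filled with count
    df.groupBy("A").pivot("major").agg(F.first("count")).show()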

How can I access python variable in Spark SQL?

I have a Python variable created under %python in my Jupyter notebook in Azure Databricks. How can I access the same variable to make comparisons under %sql?
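
One route that avoids string interpolation is to publish the variable as a one-row temp view from the %python cell and subquery it from %sql; the names below are assumptions, and the spark.sql call stands in for the %sql cell:

    # %python cell (spark is predefined in Databricks notebooks)
    threshold = 100
    spark.createDataFrame([(threshold,)], ["threshold"]) \
         .createOrReplaceTempView("params")

    # %sql cell (written here via spark.sql; my_table/amount are illustrative)
    spark.sql("""
        SELECT * FROM my_table
        WHERE amount > (SELECT threshold FROM params)
    """)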

SparkSQL error: collect_set() cannot have map type data

For Spark SQL on Hive, when I use named_struct in the query, it returns results: SELECT id, collect_set(emp_info) AS employee_info FROM (SELECT t.id, …
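
collect_set cannot hash map-typed elements; a common workaround is to serialize the map first, e.g. with to_json (supported for map columns in recent Spark versions), or to keep using a struct as the question already does. A sketch with assumed names:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [(1, {"name": "ann"}), (1, {"name": "bob"})], ["id", "emp_info"])
    df.createOrReplaceTempView("employees")

    # serialize the map to a JSON string so collect_set can deduplicate it
    spark.sql("""
        SELECT id, collect_set(to_json(emp_info)) AS employee_info
        FROM employees GROUP BY id
    """).show(truncate=False)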

How to split a list to multiple columns in Pyspark?

I have:
    key  value
    a    [1,2,3]
    b    [2,3,4]
I want:
    key  value1  value2  value3
    a    1       2       3
    b    2       3       4
It seems that in Scala I can write…
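
Assuming the value column is an array of known maximum length, getItem does the splitting in PySpark:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("a", [1, 2, 3]), ("b", [2, 3, 4])],
                               ["key", "value"])

    df.select(
        "key",
        *[F.col("value").getItem(i).alias(f"value{i + 1}") for i in range(3)]
    ).show()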

Save Spark dataframe as dynamic partitioned table in Hive

I have a sample application that reads from CSV files into a dataframe. The dataframe can be stored to a Hive table in parquet format using the method df.…
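
The usual recipe is to enable Hive dynamic partitioning and write with partitionBy (saveAsTable for a new table, insertInto for an existing one); the table and column names below are illustrative:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()
    spark.conf.set("hive.exec.dynamic.partition", "true")
    spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")

    # stand-in for the dataframe loaded from CSV in the question
    df = spark.createDataFrame([(1, "2024-01-01")], ["id", "dt"])
    (df.write.mode("overwrite").format("parquet")
       .partitionBy("dt").saveAsTable("mydb.events"))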

pyspark SQL cannot resolve 'explode()' due to data type mismatch

Running a PySpark script, I get the following error depending on which XML I query: cannot resolve 'explode(...)' due to data type mismatch. The PySpark code: fr…
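
explode only accepts array or map input; with XML sources, a tag that occurs once per record is often inferred as a struct rather than an array, which produces exactly this mismatch. A hedged sketch that guards on the actual type (the items column is a stand-in):

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.types import ArrayType

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, [10, 20])], ["id", "items"])

    if isinstance(df.schema["items"].dataType, ArrayType):
        df = df.select("id", F.explode("items").alias("item"))
    # else: the column is a single struct and needs no explode
    df.show()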

how to sequentially iterate rows in Pyspark Dataframe

I have a Spark DataFrame like this: +-------+------+-----+---------------+ |Account|nature|value| time| +-------+------+-----+---------------+ |…
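
Row-by-row loops do not map well onto Spark; ordered per-group logic is normally expressed with window functions such as lag over the time column. A sketch using the Account/nature/value/time columns from the snippet (values invented):

    from pyspark.sql import SparkSession, Window, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [("acc1", "debit", 10.0, 1), ("acc1", "credit", 5.0, 2)],
        ["Account", "nature", "value", "time"])

    w = Window.partitionBy("Account").orderBy("time")
    df.withColumn("prev_value", F.lag("value").over(w)).show()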

How to extract values from key value map?

I have a column of type map, where the keys and values change. I am trying to extract the value and create a new column. Input: |symbols|…
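
For a MapType column whose key varies, map_values (Spark 2.3+) returns the values as an array; a sketch assuming one entry per map, with an invented symbols column:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([({"AAPL": 170.0},), ({"MSFT": 320.0},)],
                               ["symbols"])

    # map_values yields an array; take its first element (one entry per map)
    df.withColumn("price", F.map_values("symbols").getItem(0)).show()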

Why does persist(StorageLevel.MEMORY_AND_DISK) give different results than cache() with HBase?

This may sound naive, but it is a problem I recently faced in my project, and I need a better understanding of it. df.persist(StorageLevel.MEMORY_AND_DISK)…
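
For reference, on DataFrames cache() is just persist with the MEMORY_AND_DISK default (deserialized in newer Spark versions), so differing results usually point to the mutable HBase source being re-read between runs, not to the storage level. A small sketch of the equivalence:

    from pyspark import StorageLevel
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(10)

    # equivalent to df.cache() on a DataFrame; on an RDD,
    # cache() would mean MEMORY_ONLY instead
    df.persist(StorageLevel.MEMORY_AND_DISK)
    print(df.storageLevel)
    df.unpersist()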