Category "apache-spark-sql"

Save Spark dataframe as dynamic partitioned table in Hive

I have a sample application that reads CSV files into a DataFrame. The DataFrame can be stored to a Hive table in Parquet format using the method df.…
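
A minimal sketch of the write side, assuming a hypothetical partition column "load_date" and target table "mydb.events"; the dynamic-partition settings only matter when inserting into an existing partitioned table.

```python
# Sketch only: "load_date" and "mydb.events" are hypothetical names.
spark.conf.set("hive.exec.dynamic.partition", "true")
spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")

# Create a partitioned Hive table in Parquet directly from the DataFrame:
df.write.mode("overwrite").partitionBy("load_date").format("parquet").saveAsTable("mydb.events")

# Or append into an already existing partitioned table; here the partition column
# must be the last column in the DataFrame's column order.
df.write.mode("append").insertInto("mydb.events")
```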

pyspark SQL cannot resolve 'explode()' due to data type mismatch

Running a PySpark script, I get the following error depending on which XML I query: cannot resolve 'explode(...)' due to data type mismatch. The PySpark code: …
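
This error usually means explode() was pointed at a struct (or scalar) rather than an array or map, which happens when one XML file contains a single element and the schema is inferred differently. A hedged sketch, with "items.item" as a hypothetical field name:

```python
from pyspark.sql import functions as F

df.printSchema()  # explode() accepts only array or map columns

# If the field was inferred as an array, explode it directly:
df.select(F.explode("items.item").alias("item"))

# If it was inferred as a struct (single element in that particular file),
# wrap it in an array first so the same code handles both shapes:
df.select(F.explode(F.array("items.item")).alias("item"))
```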

How to sequentially iterate rows in a PySpark DataFrame

I have a Spark DataFrame with the columns Account, nature, value, and time: …
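
A minimal sketch of row-by-row iteration, using the column names from the excerpt; pulling rows to the driver only suits small results, and per-account ordering logic usually belongs in a Window instead.

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

ordered = df.orderBy("Account", "time")

# toLocalIterator() streams one partition at a time to the driver instead of
# collecting everything at once:
for row in ordered.toLocalIterator():
    print(row["Account"], row["nature"], row["value"], row["time"])

# Distributed alternative: look at the previous row per account with lag()
w = Window.partitionBy("Account").orderBy("time")
df.withColumn("prev_value", F.lag("value").over(w))
```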

How to extract values from key value map?

I have a column of type map, where the keys and values vary. I am trying to extract the value and create a new column. Input (a symbols column): …
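
A sketch of the usual options, assuming a hypothetical MapType column named "symbols":

```python
from pyspark.sql import functions as F

# map_keys / map_values (Spark 2.3+) turn the map into arrays:
df.select(F.map_keys("symbols").alias("keys"),
          F.map_values("symbols").alias("values"))

# If each row's map holds a single entry, grab it directly:
df.select(F.element_at(F.map_values("symbols"), 1).alias("value"))

# One output row per key/value pair, whatever the keys are:
df.select(F.explode("symbols").alias("key", "value"))
```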

Why does persist(StorageLevel.MEMORY_AND_DISK) give different results than cache() with HBase?

I may sound naive asking this question, but it is a problem I have recently faced in my project and I need a better understanding of it. df.persist(StorageLevel.MEMORY_AND_DISK) …
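
A small sketch of the two calls being compared. For the DataFrame API, cache() is shorthand for persist(StorageLevel.MEMORY_AND_DISK), so the two should behave the same; for plain RDDs, cache() defaults to MEMORY_ONLY, where partitions that do not fit in memory are recomputed (i.e. re-read from HBase) rather than spilled to disk.

```python
from pyspark import StorageLevel

df.persist(StorageLevel.MEMORY_AND_DISK)
df.count()      # persistence is lazy: an action materializes the data

df.unpersist()

df.cache()      # for DataFrames this is the same storage level as above
df.count()
```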

to_date gives null on format yyyyww (202001 and 202053)

I have a dataframe with a yearweek column that I want to convert to a date. The code I wrote seems to work for every week except weeks '202001' and '202053', …
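
to_date is unreliable for week-based patterns (and Spark 3's parser rejects them outright), so one workaround is parsing the ISO year/week in a Python UDF. A sketch assuming the column holds strings like '202001' and that the Monday of that ISO week is the wanted date:

```python
import datetime
from pyspark.sql import functions as F, types as T

@F.udf(T.DateType())
def yearweek_to_date(yw):
    if yw is None:
        return None
    # %G = ISO year, %V = ISO week, %u = ISO weekday (1 = Monday)
    return datetime.datetime.strptime(f"{yw[:4]}-{yw[4:]}-1", "%G-%V-%u").date()

df = df.withColumn("week_start", yearweek_to_date(F.col("yearweek")))
```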

How to query CSV using pure spark sql

I hope to get output from the spark-sql CLI, but the data is in CSV separated by "\t". Is there any way to do this using pure SQL, with a command like spark-sql -e '…'?
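
A sketch of the pure-SQL route: register the file as a temporary view backed by the csv data source, then query it; the same two statements can be passed to the CLI with spark-sql -e "...". The path and column usage are hypothetical, and the sep option below is a literal tab character.

```python
spark.sql("""
    CREATE TEMPORARY VIEW my_tsv
    USING csv
    OPTIONS (path '/data/input.tsv', sep '\t', header 'false', inferSchema 'true')
""")

# Without a header row the columns are auto-named _c0, _c1, ...
spark.sql("SELECT _c0, _c1 FROM my_tsv LIMIT 10").show()
```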

DataFrame partitionBy to a single Parquet file (per partition)

I would like to repartition / coalesce my data so that it is saved into one Parquet file per partition. I would also like to use the Spark SQL partitionBy API.
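
A minimal sketch: repartitioning by the same column used in partitionBy() means each output directory is written by exactly one task, giving one Parquet file per partition. The column name "date" and the output path are hypothetical.

```python
(df
 .repartition("date")
 .write
 .partitionBy("date")
 .mode("overwrite")
 .parquet("/output/events"))
```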

Comparing schema of dataframe using Pyspark

I have a data frame (df). To show its schema I use df1.printSchema() (after from pyspark.sql.functions import *) and get the following result: root …
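
A short sketch of two ways to compare DataFrame schemas, assuming two frames df1 and df2:

```python
# Strict comparison: names, types, nullability and order all have to match.
schemas_match = df1.schema == df2.schema

# Looser comparison that ignores column order and nullability:
cols_match = set(df1.dtypes) == set(df2.dtypes)

# Show what differs between the two:
print(set(df1.dtypes) ^ set(df2.dtypes))
```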

Update using JOIN or CTE in Databricks

I am trying to update a Delta table in Databricks, using the Databricks documentation here as an example. That document only talks about updating a literal value …
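
Delta's UPDATE statement does not accept a JOIN or CTE directly; the usual workaround is MERGE INTO, which expresses "update the target from another table". A sketch with hypothetical table and column names:

```python
spark.sql("""
    MERGE INTO target t
    USING updates u
    ON t.id = u.id
    WHEN MATCHED THEN
      UPDATE SET t.value = u.value, t.updated_at = u.updated_at
""")
```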

pyspark get element from array Column of struct based on condition

I have a Spark df with the following schema: col1: string, col2: string, and customer: struct containing smt: string and attributes: array (null…
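
A sketch using the filter higher-order function (Spark 2.4+ via expr()); the struct field names "key" and "value" and the literal 'color' are hypothetical.

```python
from pyspark.sql import functions as F

# Keep only the structs in customer.attributes that satisfy the condition:
df = df.withColumn(
    "matched",
    F.expr("filter(customer.attributes, a -> a.key = 'color')")
)

# Take the first (usually only) match and pull a field out of it:
df = df.withColumn("color_value", F.col("matched")[0]["value"])
```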

How do I add a new date column with constant value to a Spark DataFrame (using PySpark)?

I want to add a column with a default date ('1901-01-01') to an existing DataFrame using PySpark. I used the code snippet below: from pyspark.sql import functions a…
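
A minimal sketch: lit() supplies the constant and to_date() makes it a proper DateType column rather than a plain string; the column name "default_date" is an assumption.

```python
from pyspark.sql import functions as F

df = df.withColumn("default_date", F.to_date(F.lit("1901-01-01"), "yyyy-MM-dd"))

df.printSchema()   # default_date: date
```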

When is Spark groupBy preferred over reduceByKey?

My dataset is pretty big, and I would like to understand when groupBy makes more sense than reduceByKey.
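
A small RDD sketch of the trade-off behind the question: reduceByKey combines values on each partition before the shuffle (map-side combine), so far less data crosses the network than with groupByKey, which ships every value. For large datasets that usually makes reduceByKey (or DataFrame groupBy().agg(), which also pre-aggregates) preferable, while groupByKey only makes sense when every raw value per key is genuinely needed.

```python
pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3), ("b", 4)])

summed_reduce = pairs.reduceByKey(lambda x, y: x + y)            # pre-aggregates per partition
summed_group = pairs.groupByKey().mapValues(lambda vs: sum(vs))  # ships every value

print(summed_reduce.collect(), summed_group.collect())
```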

How do I get the last item from a list using pyspark?

Why does the column 1st_from_end contain null? from pyspark.sql.functions import split; df = sqlContext.createDataFrame([('a b c d',)], ['s']); df.select(split(d…
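
The null comes from the negative index: Spark's array indexing does not support [-1], it simply returns null. A sketch of two fixes for Spark 2.4+, element_at with a negative position or reverse() plus [0]:

```python
from pyspark.sql import functions as F

df = spark.createDataFrame([("a b c d",)], ["s"])
parts = F.split(df["s"], " ")

df.select(
    F.element_at(parts, -1).alias("1st_from_end"),   # 'd'
    F.reverse(parts)[0].alias("alternative"),        # same idea via reverse()
).show()
```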

How to write Spark SQL batch job results to the Apache Druid?

I want to write Spark batch results to Apache Druid. I know Druid has native batch ingestion such as index_parallel, and that Druid runs Map-Reduce jobs in the …
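
A heavily hedged sketch of the common two-step approach when no Spark-to-Druid writer is available: have Spark land JSON files on shared storage, then trigger Druid's own index_parallel batch ingestion through its task API. The path, data source, columns, host, port and the spec skeleton below are all hypothetical and need adapting to the actual cluster.

```python
import json
import requests  # assumption: the ingestion task is submitted over HTTP from the driver

# Step 1: let Spark write plain files somewhere Druid can read.
df.write.mode("overwrite").json("hdfs:///tmp/druid_staging/events")

# Step 2: submit a native index_parallel task to the Overlord's task API.
ingestion_spec = {
    "type": "index_parallel",
    "spec": {
        "dataSchema": {
            "dataSource": "events",
            "timestampSpec": {"column": "ts", "format": "iso"},
            "dimensionsSpec": {"dimensions": ["country", "device"]},
        },
        "ioConfig": {
            "type": "index_parallel",
            "inputSource": {"type": "hdfs", "paths": "hdfs:///tmp/druid_staging/events"},
            "inputFormat": {"type": "json"},
        },
    },
}
requests.post(
    "http://druid-overlord:8090/druid/indexer/v1/task",
    headers={"Content-Type": "application/json"},
    data=json.dumps(ingestion_spec),
)
```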

How do you create merge_asof functionality in PySpark?

Table A has many columns, including a date column; Table B has a datetime and a value. The data in both tables is generated sporadically, with no regular interval. …
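
There is no built-in merge_asof in PySpark; one common emulation is a range join followed by a window that keeps only the most recent match. A sketch under hypothetical column names ("a_id" uniquely identifying rows of Table A); note the non-equi join can be expensive without an additional equality key.

```python
from pyspark.sql import functions as F, Window

# Join every A row to all earlier-or-equal B rows, then keep the latest B row per A row.
joined = table_a.join(table_b, on=table_b["time"] <= table_a["date"], how="left")

w = Window.partitionBy("a_id").orderBy(F.col("time").desc())
asof = (joined
        .withColumn("rn", F.row_number().over(w))
        .filter("rn = 1")
        .drop("rn"))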

mismatched input ';' expecting <EOF>(line 1, pos 90)

I am trying to fetch multiple rows in Zeppelin using Spark SQL. Here's my SQL statement: select id, name from target where updated_at = "val1", "val2", "val3" …
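
A sketch of the corrected statement; the same SQL works unchanged in a Zeppelin %sql paragraph.

```python
# The parser stops at the comma because a column cannot be compared to a
# comma-separated list with '='; IN expresses that, and single quotes are the
# safer choice for SQL string literals.
spark.sql("""
    SELECT id, name
    FROM target
    WHERE updated_at IN ('val1', 'val2', 'val3')
""").show()
```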

How to use a Scala class inside Pyspark

I've been searching for a while for any way to use a Scala class in PySpark, and I haven't found any documentation or guide on the subject. Let's …
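
A hedged sketch of the usual route via Py4J's JVM gateway (an internal, not officially supported API); all class and method names below are hypothetical.

```python
# The Scala class is compiled into a JAR and shipped with the job, e.g.
#   spark-submit --jars my-scala-lib.jar script.py
sc = spark.sparkContext

instance = sc._jvm.com.example.MyScalaClass()   # Py4J constructs the JVM object
result = instance.someMethod("argument")

# To hand over a DataFrame, pass its underlying Java object and wrap the result back:
# java_df = instance.transform(df._jdf)
# from pyspark.sql import DataFrame
# wrapped = DataFrame(java_df, spark)   # older Spark versions expect a SQLContext here
```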