Category "apache-spark-sql"

Insert Overwrite in Databricks overwriting complete data in table?

I have two tables: one with 50K records and the other with 2.5K records, and I want to update these 2.5K records into table one. Currently I was doing this by us
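A minimal Delta MERGE sketch for this kind of partial update, assuming the 50K-row target is a Delta table named `table_one`, the 2.5K-row source is `table_two`, and `id` is a hypothetical join key (none of these names come from the question):

```python
from delta.tables import DeltaTable

# MERGE only rewrites files containing matching rows, unlike INSERT OVERWRITE,
# which replaces the whole table (or partition) it touches.
target = DeltaTable.forName(spark, "table_one")   # 50K-row Delta table
updates = spark.table("table_two")                # 2.5K-row updates

(target.alias("t")
    .merge(updates.alias("u"), "t.id = u.id")     # hypothetical key `id`
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())
```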

Spark/Scala approximate group by

Is there a way of counting approximately after a group by on an SQL dataset in Spark? Or more generally, what is the fastest way of group-by counting in Spark?
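A sketch of the approximate aggregate Spark ships with, using hypothetical `key`/`value` column names: `approx_count_distinct` estimates distinct values per group via HyperLogLog, while a plain row count per group is already a cheap exact aggregation:

```python
from pyspark.sql import functions as F

counts = df.groupBy("key").agg(
    F.approx_count_distinct("value", rsd=0.05).alias("approx_distinct"),  # HyperLogLog estimate
    F.count(F.lit(1)).alias("row_count"),                                 # exact rows per group
)
```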

How to use Apache Spark to query Hive table with Kerberos?

I am attempting to use Scala with Apache Spark locally to query a Hive table which is secured with Kerberos. I have no issues connecting and querying the data pro
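A rough sketch of the usual ingredients rather than a verified recipe: a keytab (or a valid kinit ticket), hive-site.xml on the classpath, and Hive support enabled. The property names shown are the Spark 3.x ones (spark.kerberos.*); older releases used spark.yarn.keytab / spark.yarn.principal, and the paths, principal, and table names below are placeholders:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName("kerberized-hive")
    # placeholder keytab and principal
    .config("spark.kerberos.keytab", "/path/to/user.keytab")
    .config("spark.kerberos.principal", "user@EXAMPLE.COM")
    .enableHiveSupport()   # requires hive-site.xml on the classpath
    .getOrCreate())

spark.sql("SELECT * FROM secured_db.some_table LIMIT 10").show()
```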

Spark SQL: Parse date string from dd/mm/yyyy to yyyy/mm/dd

I want to use Spark SQL or PySpark to reformat a date field from 'dd/mm/yyyy' to 'yyyy/mm/dd'. The field type is string: from pyspark.sql import SparkSession fr
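A sketch of the usual two-step conversion, with a hypothetical `date_str` column: `to_date` parses the string with an explicit pattern and `date_format` renders it in the target layout (note that `MM` is the month; `mm` would be minutes):

```python
from pyspark.sql import functions as F

df = df.withColumn(
    "date_reformatted",
    F.date_format(F.to_date("date_str", "dd/MM/yyyy"), "yyyy/MM/dd"),
)
```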

Pyspark Window function on entire data frame

Consider a PySpark DataFrame. I would like to summarize the entire data frame, per column, and append the result for every row. +-----+----------+-----------+
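A minimal sketch, assuming a hypothetical `value` column: an empty `partitionBy()` makes the window span the whole DataFrame, so every row receives the same per-column summary (Spark warns that this pulls all rows into a single partition):

```python
from pyspark.sql import Window, functions as F

w = Window.partitionBy()  # no partition columns -> window over the entire DataFrame
df = (df
    .withColumn("value_min", F.min("value").over(w))
    .withColumn("value_max", F.max("value").over(w))
    .withColumn("value_avg", F.avg("value").over(w)))
```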

Spark SQL error from EMR notebook with AWS Glue table partition

I'm testing some pyspark code in an EMR notebook before I deploy it and keep running into this strange error with Spark SQL. I have all my tables and metadata i

Convert date to ISO week date in Spark

Having dates in one column, how do I create a column containing the ISO week date? The ISO week date is composed of year, week number and weekday. The year is not the same as
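A sketch assuming the date column is called `dt`: `weekofyear` already follows ISO-8601 numbering, the ISO weekday can be derived from `dayofweek` (which counts Sunday as 1), and the ISO year is the calendar year of that week's Thursday, which is exactly why it can differ from `year()` near January 1:

```python
from pyspark.sql import functions as F

df = (df
    # remap Sunday=1..Saturday=7 to ISO Monday=1..Sunday=7
    .withColumn("iso_weekday", (F.dayofweek("dt") + 5) % 7 + 1)
    # weekofyear() uses ISO-8601 week numbering
    .withColumn("iso_week", F.weekofyear("dt"))
    # ISO year = calendar year of the Thursday in the same ISO week
    .withColumn("iso_year",
        F.year(F.expr("date_add(dt, 4 - ((dayofweek(dt) + 5) % 7 + 1))"))))
```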

Auto increment id in delta table while inserting

I have a problem regarding merging CSV files using PySpark SQL with a Delta table. I managed to create an upsert function that updates if matched and inserts if not mat
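Delta has no classic auto-increment, so one common workaround, sketched here with hypothetical names (`target_df`, `new_rows`, `id`, `some_key`), is to offset a `row_number` over the incoming batch by the current maximum id before running the existing upsert:

```python
from pyspark.sql import Window, functions as F

# current highest id in the target table (0 if empty)
max_id = target_df.agg(F.coalesce(F.max("id"), F.lit(0)).alias("m")).first()["m"]

new_rows = new_rows.withColumn(
    "id",
    F.row_number().over(Window.orderBy("some_key")) + F.lit(max_id),
)
# new_rows can now be passed to the existing upsert/merge function
```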

Convert UTC timestamp to local time based on time zone in PySpark

I have a PySpark DataFrame, df, with some columns as shown below. The hour column is in UTC time and I want to create a new column that has the local time based
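A sketch of the per-row conversion, assuming the zone lives in a hypothetical `time_zone` column: `from_utc_timestamp` accepts the time zone as a column (Spark 2.4+), so each row is shifted by its own zone:

```python
from pyspark.sql import functions as F

df = df.withColumn(
    "local_time",
    F.from_utc_timestamp(F.col("hour"), F.col("time_zone")),  # time_zone is a hypothetical column
)
```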

Scala error - Exception in thread "main" java.lang.UnsatisfiedLinkError: org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Ljava/lang/String;I)Z

I have a requirement where I am reading data from a CSV file and writing data to a Delta table with Scala on Windows OS. My Scala code is given below: import co

TypeError: 'str' object is not callable - PySpark

df1=df.withColumn('etl_load_dt_part_new', concat_ws("-",year(df.ETL_LOAD_DT_PART),lit('12'),lit('31')).cast('date') ) I am trying to add a new column named as e
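This error usually means a name such as `year`, `lit` or `concat_ws` has been shadowed by a plain string earlier in the session (e.g. `year = '2021'`). A sketch of the same line with the functions kept under a namespace so nothing can be shadowed:

```python
from pyspark.sql import functions as F

df1 = df.withColumn(
    "etl_load_dt_part_new",
    F.concat_ws("-", F.year(df.ETL_LOAD_DT_PART), F.lit("12"), F.lit("31")).cast("date"),
)
```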

Start of the week on Monday in Spark

This is my dataset: from pyspark.sql import SparkSession, functions as F spark = SparkSession.builder.getOrCreate() df = spark.createDataFrame([('2021-02-07',)
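A sketch assuming the single column is called `dt` (the excerpt cuts off before the column name): `date_trunc('week', ...)` snaps a timestamp back to the Monday that starts its ISO week, and a cast to date drops the time part:

```python
from pyspark.sql import functions as F

df = df.withColumn(
    "week_start_monday",
    F.date_trunc("week", F.col("dt")).cast("date"),
)
```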

Reading image dataset into data frame and feature extraction [spark with python]

In my project, I need to read an image dataset [each folder has a different object, and I want to read these folders in a stream one by one], and then need to extrac
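A sketch of Spark's built-in image data source (the path is a placeholder): each row carries an `image` struct with origin, height, width, nChannels, mode and the raw bytes, which can then be fed to a feature-extraction UDF:

```python
images = (spark.read.format("image")
    .option("recursiveFileLookup", "true")   # Spark 3.0+, walks sub-folders
    .load("/data/image_folders/"))           # placeholder path

images.select("image.origin", "image.height", "image.width", "image.nChannels") \
      .show(truncate=False)
```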

SPARK SQL create table does not show / read all columns as expected

I am trying to create a table in Spark SQL by providing the schema and giving the location. However, when I run a select on the table, I see only half the columns. (

How to return null in SUM if some values are null?

I have a case where I may have null values in the column that needs to be summed up in a group. If I encounter a null in a group, I want the sum of that group t
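SUM skips nulls by default, so the group has to be flagged explicitly. A sketch with hypothetical `grp`/`val` columns: if any value in the group is null, return null, otherwise return the sum:

```python
from pyspark.sql import functions as F

result = df.groupBy("grp").agg(
    F.when(F.count(F.when(F.col("val").isNull(), 1)) > 0, F.lit(None))  # any null in the group?
     .otherwise(F.sum("val"))
     .alias("sum_or_null")
)
```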

Exception in thread "main" java.lang.IllegalAccessError: class org.apache.spark.storage.StorageUtils$

Hi, I am trying to run Spark on my local laptop. I created a Maven project in IntelliJ IDEA, and in my main class I have one line like below, and when I try to run the projec

bucketing with QuantileDiscretizer using groupBy function in pyspark

I have a large dataset like so:

| SEQ_ID|RESULT|
+-------+------+
|3462099|239.52|
|3462099|239.66|
|3462099|239.63|
|3462099|239.64|
|3462099|239.57|
|3462099|
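QuantileDiscretizer fits one global set of splits, so for per-SEQ_ID buckets a window `ntile()` is a simple stand-in (a sketch; the 4 buckets are chosen arbitrarily): it assigns each RESULT to one of n roughly equal-sized buckets within its own group:

```python
from pyspark.sql import Window, functions as F

w = Window.partitionBy("SEQ_ID").orderBy("RESULT")
df = df.withColumn("bucket", F.ntile(4).over(w))  # 4 quantile-style buckets per SEQ_ID
```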

PySpark: How do I convert DataFrame column values into a comma-separated string?

I am running this on Databricks. My goal is to make a select statement with all the values in the column comma separated. Content of my df: For example, I want
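A minimal sketch with a hypothetical `col_name` column: `collect_list` gathers the values into an array during the aggregation and `concat_ws` joins them with commas, giving one string that can be dropped into the select statement:

```python
from pyspark.sql import functions as F

csv_string = df.agg(
    F.concat_ws(",", F.collect_list(F.col("col_name").cast("string")))
).first()[0]
```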

PySpark: recover the two values of a median for an even number of rows

Is there a way in PySpark to recover the two values of a median for an even number of rows? For example, I have this DataFrame: df1 = spark.createDataFrame
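One sketch with hypothetical `grp`/`val` columns: rank the rows in each group and keep positions n/2 and n/2 + 1 when the group size n is even; for an odd count both expressions point at the same middle row, so only the single median is kept:

```python
from pyspark.sql import Window, functions as F

w = Window.partitionBy("grp").orderBy("val")
middles = (df
    .withColumn("rn", F.row_number().over(w))
    .withColumn("n", F.count(F.lit(1)).over(Window.partitionBy("grp")))  # group size
    .filter((F.col("rn") == F.floor((F.col("n") + 1) / 2)) |
            (F.col("rn") == F.floor(F.col("n") / 2 + 1))))
```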

What is "est" in the filter on the Spark UI SQL tab?

I am trying to debug via the Spark UI, and in the SQL tab I am getting this red mark on the filter description and trying to figure out what it means. Spark UI s