Category "apache-spark-sql"

How to process JSON data in a column using Python/PySpark?

Trying to process JSON data in a column on Databricks. Below is sample data from a table (it holds weather device record info): JSON_Info {"sampleData":"dataD
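
A common approach for JSON held in a string column is pyspark.sql.functions.from_json with an explicit schema. A minimal sketch, assuming the column is named JSON_Info and using a placeholder schema (the real structure of the weather payload is cut off above):

    from pyspark.sql import functions as F
    from pyspark.sql.types import StructType, StructField, StringType

    # Placeholder schema; replace with the actual fields of the JSON payload
    schema = StructType([StructField("sampleData", StringType())])

    parsed = (df
              .withColumn("parsed", F.from_json("JSON_Info", schema))
              .select("parsed.*"))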

Is there an elegant, easy and fast way to move data out of HBase into MongoDB?

Is there an elegant, easy and fast way to move data out of HBase into MongoDB? I want to migrate HBase to MongoDB. I am new to MongoDB. Could someone please hel

Consolidating address columns from multiple tables into one column (3 million rows)

I have a table that looks like this:

common_id   table1_address   table2_address   table3_address     table4_address
123         null             null             stack building12   null
157         123road stree
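
If the goal is one address per common_id taken from the first non-null source, F.coalesce is the usual tool. A minimal sketch, assuming the joined table already has the column names shown above:

    from pyspark.sql import functions as F

    result = df.select(
        "common_id",
        F.coalesce(
            F.col("table1_address"), F.col("table2_address"),
            F.col("table3_address"), F.col("table4_address"),
        ).alias("address"),
    )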

Spark dataframe from dictionary

I'm trying to create a Spark dataframe from a dictionary which has data in the format {'33_45677': 0, '45_3233': 25, '56_4599': 43524}, etc. dict_pairs={'33
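
One straightforward route is to turn the dict items into (key, value) rows and pass them to spark.createDataFrame. A minimal sketch under that assumption:

    dict_pairs = {'33_45677': 0, '45_3233': 25, '56_4599': 43524}

    # Each (key, value) pair becomes one row
    df = spark.createDataFrame(list(dict_pairs.items()), ["key", "value"])
    df.show()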

Encountering an Error Converting an RDD to a DataFrame in PySpark

I am trying to turn an RDD into a DataFrame. The operation seems to be successful, but when I try to count the number of elements in the dataframe I get an error.
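
Because DataFrame creation is lazy, bad rows usually only blow up at the count(); giving createDataFrame an explicit schema instead of relying on inference is one way to surface the mismatch early. A sketch with hypothetical field names:

    from pyspark.sql.types import StructType, StructField, StringType, IntegerType

    # Explicit schema (hypothetical fields); rows that don't match fail fast
    schema = StructType([
        StructField("name", StringType(), True),
        StructField("value", IntegerType(), True),
    ])

    df = spark.createDataFrame(rdd, schema=schema)
    print(df.count())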

Azure Databricks - Write to parquet file using spark.sql with union and subqueries

Issue: I'm trying to write to a parquet file using spark.sql; however, I encounter issues when the query has unions or subqueries. I know there's some syntax I can't seem
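
For reference, spark.sql() returns a DataFrame even when the query contains UNION or subqueries, so the parquet write is a separate step from the SQL text. A minimal sketch with hypothetical table names and output path:

    df = spark.sql("""
        SELECT id, amount FROM sales_2021
        UNION ALL
        SELECT id, amount
        FROM (SELECT id, amount FROM sales_2022 WHERE amount > 0) t
    """)

    df.write.mode("overwrite").parquet("/mnt/output/sales_parquet")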

How to have a single CSV file after applying partitionBy in PySpark

I have to first partition by "customer_group", but I also want to make sure that I have a single CSV file per "customer_group". This is because it is timeseri
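
A common pattern is to repartition by the same column before partitionBy, so each output directory ends up with exactly one file. A sketch assuming the column is literally named customer_group and a hypothetical output path:

    (df
     .repartition("customer_group")     # all rows of a group land in one task
     .write
     .partitionBy("customer_group")     # one directory (and thus one file) per group
     .mode("overwrite")
     .csv("/mnt/output/by_customer_group", header=True))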

Spark SQL: Find the number of extensions for a record

I have a dataset as below:

col1   extension_col1
2345   2246
2246   2134
2134   2091
2091   Null
1234   1111
1111   Null

I need to find the number of extensions available fo
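
One way to count how many hops each record's extension chain has is an iterative self-join (GraphFrames is another option). A rough sketch, assuming Null in the sample is a real null and the chain depth is bounded by a small constant:

    from pyspark.sql import functions as F

    lookup = df.select(F.col("col1").alias("k"),
                       F.col("extension_col1").alias("k_next"))

    result = df.select("col1",
                       F.col("extension_col1").alias("nxt"),
                       F.lit(0).alias("n_ext"))

    for _ in range(20):   # assumed upper bound on chain length
        result = (result
                  .join(lookup, result["nxt"] == lookup["k"], "left")
                  .withColumn("n_ext",
                              F.when(F.col("nxt").isNotNull(), F.col("n_ext") + 1)
                               .otherwise(F.col("n_ext")))
                  .withColumn("nxt", F.col("k_next"))
                  .drop("k", "k_next"))

    result.select("col1", "n_ext").show()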

Group by id and create a column based on priority in PySpark

Can someone help me with the below. I have an input dataframe:

ID   process_type    STP_stagewise
1    loan_creation   Manual
1    loan creation   NSTP
1    reimbursement   STP
2
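
A common approach is to map STP_stagewise to a numeric rank with when/otherwise, take the minimum rank per ID, and map it back. A sketch that assumes Manual takes precedence over NSTP, which takes precedence over STP (the intended priority order is not stated above):

    from pyspark.sql import functions as F

    # Assumed precedence: Manual (1) wins over NSTP (2), which wins over STP (3)
    rank = (F.when(F.col("STP_stagewise") == "Manual", 1)
             .when(F.col("STP_stagewise") == "NSTP", 2)
             .otherwise(3))

    best = (df.withColumn("rank", rank)
              .groupBy("ID")
              .agg(F.min("rank").alias("rank"))
              .withColumn("overall_status",
                          F.when(F.col("rank") == 1, "Manual")
                           .when(F.col("rank") == 2, "NSTP")
                           .otherwise("STP")))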

Why compute row_number() order by monotonically_increasing_id() in Spark?

It is suggested that you can 'generate unique increasing numeric values' by select row_number() over (order by monotonically_increasing_id()) from /* ... */ Bu
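
For reference, the pattern the question refers to looks like this in PySpark: monotonically_increasing_id() alone gives unique but non-consecutive ids, and the row_number() window turns them into consecutive ones, at the cost of a global sort into a single partition:

    from pyspark.sql import functions as F, Window

    df = df.withColumn(
        "seq",
        F.row_number().over(Window.orderBy(F.monotonically_increasing_id()))
    )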

Update a highly nested column from string to struct

|-- x: array (nullable = true)
|    |-- element: struct (containsNull = true)
|    |    |-- y: long (nullable = true)
|    |    |-- z: array (nullable = tru
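
On Spark 3.1+, one way to rewrite a field buried inside an array of structs is transform combined with Column.withField. A rough sketch with a hypothetical string field s and a placeholder schema, since the full schema is cut off above:

    from pyspark.sql import functions as F

    # Hypothetical: element field "s" holds a JSON string to be parsed into a struct
    df2 = df.withColumn(
        "x",
        F.transform(
            "x",
            lambda e: e.withField("s", F.from_json(e["s"], "a STRING, b BIGINT"))
        )
    )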

Spark - Update a nested column to string

|-- x: array (nullable = true)
|    |-- element: struct (containsNull = true)
|    |    |-- y: struct (nullable = true)
|    |    |-- z: struct (nullable =
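
For the opposite direction (nested struct to string), to_json inside the same transform/withField combination is a common route; again a sketch with a hypothetical target field, assuming Spark 3.1+:

    from pyspark.sql import functions as F

    # Hypothetical: serialize the element's "z" struct into a JSON string
    df2 = df.withColumn(
        "x",
        F.transform("x", lambda e: e.withField("z", F.to_json(e["z"])))
    )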

How to EFFICIENTLY upload a PySpark dataframe as a zipped CSV or Parquet file (similar to .gz format)

I have a 130 GB csv.gz file in S3 that was loaded using a parallel unload from Redshift to S3. Since it contains multiple files, I wanted to reduce the number of f
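
One typical route is to read the gzipped CSV parts, repartition to a target file count, and write compressed Parquet (or CSV with a compression codec). A minimal sketch with hypothetical S3 paths and read options:

    df = spark.read.csv("s3://my-bucket/unload/", sep="|", header=False)

    (df.repartition(64)                     # target number of output files
       .write
       .option("compression", "snappy")     # parquet also accepts "gzip"
       .mode("overwrite")
       .parquet("s3://my-bucket/parquet_out/"))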

An error occurred while calling o590.save. : java.lang.RuntimeException: quote cannot be more than one character

When I use PySpark to write to the CSV file:

    sql_df.write\
        .format("csv")\
        .option('sep', '\t')\
        .option("compression", "gzip")\
        .option("quote"
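
The CSV writer's quote option must be exactly one character, which is what the RuntimeException is complaining about. A hedged example of accepted values (output path hypothetical):

    (sql_df.write
        .format("csv")
        .option("sep", "\t")
        .option("compression", "gzip")
        .option("quote", '"')         # a single character is accepted
        # .option("quote", "\u0000")  # common trick to effectively disable quoting
        .save("/output/path"))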

pyspark.sql.functions.lit() not nullable conversion [duplicate]

When I create a new column with F.lit(1), calling printSchema() shows column_name: integer (nullable = false), as the lit function docs is qui
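
A frequently used workaround for a nullable literal column is to route the literal through when() without an otherwise(), which makes the analyzer treat the result as nullable:

    from pyspark.sql import functions as F

    # F.lit(1) alone yields nullable = false; wrapping it in when() makes it nullable
    df = df.withColumn("one", F.when(F.lit(True), F.lit(1)))
    df.printSchema()   # one: integer (nullable = true)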

Calculate a sequence of Markov chain values

I have a Spark question: for each input entity k I have a sequence of probabilities p_i, each with an associated value v_i; for example, the data can look like

ParseException: SQL CTE

    result = aml_identity_g.connectedComponents()
    conn_comps = result.select("id", "component", 'type') \
        .createOrReplaceTempView("components")
    display(result)
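
Two things tend to trip this up: createOrReplaceTempView returns None (so conn_comps above is not a DataFrame), and a CTE has to live inside a single spark.sql call. A sketch of the intended usage, keeping the view name from the snippet (the CTE body here is a placeholder since the failing query is not shown):

    result = aml_identity_g.connectedComponents()
    result.select("id", "component", "type").createOrReplaceTempView("components")

    conn_comps = spark.sql("""
        WITH comp AS (
            SELECT id, component, type FROM components
        )
        SELECT component, COUNT(*) AS n_ids
        FROM comp
        GROUP BY component
    """)
    display(conn_comps)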

LEAD function with date scenario

I have multiple files, but let's consider 2 files which have filename and start date columns:

Start_Date   FileName
2022-01-01   product 1
2022-02-02   product 2

pl
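
For reference, lead() over a window ordered by the date gives each row the next file's start date. A minimal sketch assuming the column names shown above:

    from pyspark.sql import functions as F, Window

    w = Window.orderBy("Start_Date")   # add partitionBy(...) if the lead should reset per group

    df = df.withColumn("Next_Start_Date", F.lead("Start_Date").over(w))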

Glue Dynamic Frame Parse text file with ¶ delimiter

I have a text file which looks like below:

HDR¶20200101
BDY¶1¶Jimmy
BDY¶1¶Something
TRL¶123

I would like to parse it into a Glue Dyn
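
One route (not necessarily the only Glue-native one) is to read the file with Spark using "¶" as the separator and then convert to a DynamicFrame; the path and frame name below are hypothetical:

    from awsglue.dynamicframe import DynamicFrame

    df = (spark.read
          .option("sep", "¶")
          .option("header", "false")
          .csv("s3://my-bucket/input/file.txt"))

    dyf = DynamicFrame.fromDF(df, glueContext, "parsed_dyf")   # glueContext from the usual Glue job boilerplate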

Spark Scala - Split DataFrame column into multiple depending on the size of the column

I need to split a column into several columns depending on the number of fields that each record has. For example, if I have the following DF: +---+--------------