Category "pyspark"

Combining address columns from multiple tables into one column (3 million rows)

I have a table that looks like this common_id table1_address table2_address table3_address table4_address 123 null null stack building12 null 157 123road stree
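Picking the first non-null address across the joined columns is typically done with coalesce. A minimal sketch, assuming `joined_df` holds the merged table and using the column names from the excerpt as placeholders:

```python
from pyspark.sql import functions as F

# coalesce returns the first non-null value, left to right;
# joined_df and the column names are placeholders from the excerpt.
combined = joined_df.select(
    "common_id",
    F.coalesce(
        "table1_address", "table2_address",
        "table3_address", "table4_address",
    ).alias("address"),
)
```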

Spark dataframe from dictionary

I'm trying to create a spark dataframe from a dictionary which has data in the format {'33_45677': 0, '45_3233': 25, '56_4599': 43524} .. etc. dict_pairs={'33
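A minimal sketch of one way to do this: dict.items() already yields the two-element tuples that createDataFrame expects (the output column names here are illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

dict_pairs = {'33_45677': 0, '45_3233': 25, '56_4599': 43524}

# Column names "id" and "value" are illustrative.
df = spark.createDataFrame(list(dict_pairs.items()), ["id", "value"])
df.show()
```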

Encountering an error when converting an RDD to a DataFrame in PySpark

I am trying to turn an RDD into a DataFrame. The operation seems to be successful, but when I try to count the number of elements in the dataframe I get an error.
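Because Spark is lazy, a bad record or a failed schema inference often only surfaces when the count() action runs. A hedged sketch with an explicit schema (the fields and sample data are placeholders):

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.getOrCreate()
rdd = spark.sparkContext.parallelize([("a", 1), ("b", 2)])  # placeholder data

# An explicit schema avoids inference problems that only appear
# when an action such as count() finally runs.
schema = StructType([
    StructField("name", StringType(), True),
    StructField("value", IntegerType(), True),
])
df = spark.createDataFrame(rdd, schema)
print(df.count())
```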

Python function to iterate over each column and transform it using PySpark

I'm building the following global function in PySpark to go through the columns in my CSV, which are in different formats, and convert them all to one consistent format
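A sketch of the looping pattern, assuming date columns; the column list and format strings below are placeholders, not the asker's actual ones:

```python
from pyspark.sql import functions as F

def normalise_dates(df, columns, in_fmt="MM/dd/yyyy", out_fmt="yyyy-MM-dd"):
    """Rewrite each listed column into a single date format (formats assumed)."""
    for c in columns:
        df = df.withColumn(c, F.date_format(F.to_date(F.col(c), in_fmt), out_fmt))
    return df
```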

How to have a single CSV file after applying partitionBy in PySpark

I have to first partition by a "customer_group", but I also want to make sure that I have a single CSV file per "customer_group". This is because it is timeseri
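One hedged approach: repartition on the same column before partitionBy, so each customer_group value lands in a single task and therefore a single part file. The column name and output path are placeholders:

```python
(df
 .repartition("customer_group")        # one shuffle partition per group value
 .write
 .partitionBy("customer_group")        # one directory (and here, one file) per group
 .mode("overwrite")
 .csv("s3://my-bucket/output/"))
```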

Python argparse unexpected behavior when passing "``" to the argument string in PySpark cluster mode

I am trying to pass a string in my PySpark code and it works fine, but when I pass the following string to escape the reserved keyword `date` or any value passed in
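A sketch of a workaround, assuming a hypothetical --partition-column argument: adding the backticks inside the job avoids shipping "`date`" through spark-submit, where the shell or launcher may strip the quoting.

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--partition-column", default="date")  # illustrative argument
args = parser.parse_args()

# Backticks are added here rather than on the command line.
query = f"SELECT `{args.partition_column}` FROM some_table"
```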

How to effectively run tasks in parallel in PySpark

I am working on writing a framework that basically does a data sanity check. I have a set of inputs like { "check_1": [ sql_query_1, sql_query_2 ], "check_2":
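Spark jobs submitted from separate Python threads run concurrently inside one SparkSession, so a thread pool is a common pattern for this kind of check framework. A minimal sketch, where the check names and queries are placeholders and `spark` is an existing session:

```python
from concurrent.futures import ThreadPoolExecutor

checks = {
    "check_1": ["SELECT count(*) FROM t1", "SELECT count(*) FROM t2"],
    "check_2": ["SELECT count(*) FROM t3"],
}

def run_check(queries):
    # Each thread submits its queries; Spark schedules the jobs concurrently.
    return [spark.sql(q).collect() for q in queries]

with ThreadPoolExecutor(max_workers=4) as pool:
    results = dict(zip(checks, pool.map(run_check, checks.values())))
```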

Java gateway process exited before sending its port number

I am trying to install PySpark on my Windows 10 machine to be used in Jupyter Lab. I have already installed Java and am running Python 3.7.3: openjdk version "1.8.0_242" O
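This error usually means PySpark cannot locate a working JVM. A hedged sketch of pointing JAVA_HOME at the JDK before the session is built (the path is a placeholder for the actual install directory):

```python
import os
from pyspark.sql import SparkSession

# Placeholder path -- set it to wherever JDK 8 is actually installed.
os.environ["JAVA_HOME"] = r"C:\Program Files\AdoptOpenJDK\jdk8"

spark = SparkSession.builder.master("local[*]").getOrCreate()
```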

Is there a way to configure the memory resources for Spark using PySpark

I'm working on an ETL job in a SageMaker notebook that uses Spark 2.4.0. After joining a couple of tables I keep getting the following errors: Update-- I was
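Memory settings have to be applied before the session exists; a minimal sketch, with sizes that are purely illustrative for a notebook instance:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.driver.memory", "8g")     # illustrative sizes
         .config("spark.executor.memory", "8g")
         .getOrCreate())
```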

Why is PySpark taking so long to create a SparkSession in Jupyter?

Well, I'm learning PySpark. I installed ipykernel, jupyterlab, notebook, and pyspark via pip, and Java 8 via an .exe installer. The problem is when I need to create the sessi

Group by ID and create a column based on priority in PySpark

Can someone help me with the below. I have an input dataframe. ID process_type STP_stagewise 1 loan_creation Manual 1 loan creation NSTP 1 reimbursement STP 2
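One hedged way to read the excerpt: map each stage value to a numeric rank and keep the highest-priority one per ID. The ordering Manual > NSTP > STP and the column names are assumptions taken from the sample rows.

```python
from pyspark.sql import functions as F

# Lower number = higher priority; the ordering is an assumption.
priority = (F.when(F.col("STP_stagewise") == "Manual", 1)
             .when(F.col("STP_stagewise") == "NSTP", 2)
             .otherwise(3))

result = (df.withColumn("prio", priority)
            .groupBy("ID")
            .agg(F.min("prio").alias("best_priority")))
```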

Why does Spark need S3 to connect to a Redshift warehouse, while Python pandas can read a Redshift table directly?

Sorry in advance for this dumb question. I am just beginning with AWS and PySpark. I was reviewing the pyspark library and I see PySpark needs a tempdir in S3 to be a
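The spark-redshift connector moves data in bulk through S3 with UNLOAD/COPY instead of streaming rows over JDBC the way pandas does, which is why tempdir is mandatory. A hedged read sketch using Databricks-style connector options (the format name varies by connector version, and all URLs and credentials are placeholders):

```python
df = (spark.read
      .format("com.databricks.spark.redshift")     # connector name varies by version
      .option("url", "jdbc:redshift://host:5439/db?user=USER&password=PASS")
      .option("dbtable", "public.my_table")
      .option("tempdir", "s3a://my-bucket/tmp/")   # staging area for UNLOAD/COPY
      .option("forward_spark_s3_credentials", "true")
      .load())
```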

Programmatic way to find the cluster version from CDSW - Cloudera Data Science Workbench

Is there any programmatic way to find out the cluster version (CDH6 or CDP7) from a CDSW session? Could any environment variable give a fool-proof way to determi
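A hypothetical probe rather than a guaranteed answer: scan the session's environment for distribution-related variables and fall back to the Spark version string, which often embeds the distribution build.

```python
import os
from pyspark.sql import SparkSession

# Hypothetical check -- variable names are not guaranteed across releases.
for key, value in os.environ.items():
    if any(tag in key.upper() for tag in ("CDH", "CDP", "CDSW", "HADOOP")):
        print(key, "=", value)

print(SparkSession.builder.getOrCreate().version)
```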

Update a highly nested column from string to struct

|-- x: array (nullable = true) | |-- element: struct (containsNull = true) | | |-- y: long (nullable = true) | | |-- z: array (nullable = tru
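On Spark 3.1+, one hedged approach combines transform, withField, and from_json; the field being parsed, its target schema, and all names below are placeholders rather than the asker's actual schema.

```python
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType

# Placeholder schema for the struct the string field should become.
z_schema = StructType([StructField("a", StringType()), StructField("b", StringType())])

df2 = df.withColumn(
    "x",
    # Rebuild each array element, replacing the string field "z"
    # with its parsed struct value.
    F.transform("x", lambda e: e.withField("z", F.from_json(e["z"], z_schema)))
)
```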

Spark - Update a nested column to string

|-- x: array (nullable = true) | |-- element: struct (containsNull = true) | | |-- y: struct (nullable = true) | | |-- z: struct (nullable =
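The reverse direction can use the same pattern with to_json (again Spark 3.1+; the names are placeholders):

```python
from pyspark.sql import functions as F

df2 = df.withColumn(
    "x",
    # Serialise the nested struct "y" of every array element to a JSON string.
    F.transform("x", lambda e: e.withField("y", F.to_json(e["y"])))
)
```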

Is it possible to connect to a serverless SQL pool via Azure Databricks?

I'm trying to connect to a Synapse serverless pool via Databricks. I need to create Synapse views and external tables directly in Databricks as part of an existin
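A hedged sketch of reading from a serverless pool over plain JDBC with the SQL Server driver; host, database, and credentials are placeholders, and DDL such as CREATE VIEW would be issued as separate statements rather than through the DataFrame reader.

```python
jdbc_url = ("jdbc:sqlserver://<workspace>-ondemand.sql.azuresynapse.net:1433;"
            "database=<db>;encrypt=true;trustServerCertificate=false;")

df = (spark.read
      .format("jdbc")
      .option("url", jdbc_url)
      .option("query", "SELECT TOP 10 * FROM sys.objects")  # placeholder query
      .option("user", "<user>")
      .option("password", "<password>")
      .load())
```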

Add comments to Delta

If a PySpark dataframe is reading some data from a table and writing it to Azure Delta Lake, can we add comments to this newly written file? For e.g. Df = sql("se
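Comments in Delta live in the table and column metadata rather than inside the written files; a hedged sketch using Databricks SQL-style DDL (table and column names are placeholders):

```python
df.write.format("delta").mode("overwrite").saveAsTable("my_schema.my_table")

# Table-level comment.
spark.sql("COMMENT ON TABLE my_schema.my_table IS 'Loaded from source table X'")

# Column-level comment.
spark.sql("ALTER TABLE my_schema.my_table ALTER COLUMN id COMMENT 'surrogate key'")
```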

PySpark - explode returns an empty dataframe when a nested collection has no items

I have the following dataframe +---------------+--------+ |book_id |Chapters| +---------------+--------+ |865731 |[] | +---------------+----
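explode() silently drops rows whose array is empty or null; explode_outer() keeps them, which is usually what is wanted here:

```python
from pyspark.sql import functions as F

# Rows with an empty Chapters array survive with a null chapter value.
df.select("book_id", F.explode_outer("Chapters").alias("chapter")).show()
```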

Reuse Spark Session Across Modules/Packages

We are building a reusable data framework using PySpark. As part of this, we had built one big utilities package that hosted all the methods. But now, we are pl
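A common pattern, sketched here rather than prescribed: keep a tiny session module and let every package call it, relying on getOrCreate() to hand back the one live session.

```python
# spark_session.py -- helper module shared by all packages (names are illustrative)
from pyspark.sql import SparkSession

def get_spark(app_name: str = "data-framework") -> SparkSession:
    # getOrCreate() returns the existing session if one is already running.
    return SparkSession.builder.appName(app_name).getOrCreate()
```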

How to EFFICIENTLY upload a PySpark dataframe as a zipped CSV or Parquet file (similar to .gz format)

I have a 130 GB csv.gz file in S3 that was loaded using a parallel unload from Redshift to S3. Since it contains multiple files, I wanted to reduce the number of f
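Parquet output is compressed by default (snappy), and CSV output can be gzipped via the compression option, while coalesce() controls how many files come out; a sketch with illustrative paths and partition counts:

```python
# Gzipped CSV, 16 output files (the count is illustrative).
(df.coalesce(16)
   .write
   .option("compression", "gzip")
   .mode("overwrite")
   .csv("s3://my-bucket/output/csv_gz/"))

# Parquet with explicit snappy compression.
(df.write
   .option("compression", "snappy")
   .mode("overwrite")
   .parquet("s3://my-bucket/output/parquet/"))
```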