Category "pyspark"

Reuse Spark Session Across Modules/Packages

We are building a reusable data framework using PySpark. As part of this, we had built one big utilities package that hosted all the methods. But now, we are pl
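
A minimal sketch of one way to share a session: keep a small module that every package imports, relying on SparkSession.builder.getOrCreate() to return the already-running session (the module and app name below are hypothetical).

# spark_session.py
from pyspark.sql import SparkSession

def get_spark(app_name="data-framework"):
    # getOrCreate() returns the active session if one already exists,
    # so every module importing this helper shares the same session
    return SparkSession.builder.appName(app_name).getOrCreate()

Each module then calls get_spark() instead of constructing its own builder.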

How to efficiently upload a PySpark dataframe as a zipped CSV or Parquet file (similar to .gz format)

I have a 130 GB csv.gz file in S3 that was loaded using a parallel unload from Redshift to S3. Since it contains multiple files I wanted to reduce the number of f
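
One hedged approach, assuming the data is already loaded into a dataframe df: coalesce() lowers the output file count without a full shuffle, and the writer's compression option produces gzipped part files (paths and file counts below are hypothetical).

# Fewer, larger gzipped CSV part files
df.coalesce(16).write.option("compression", "gzip").csv("s3://bucket/out-csv/")

# Parquet compresses internally (snappy by default) and is usually the better target
df.coalesce(16).write.parquet("s3://bucket/out-parquet/")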

An error occurred while calling o590.save. : java.lang.RuntimeException: quote cannot be more than one character

When I use pyspark to write to the csv file: sql_df.write.format("csv").option('sep', '\t').option("compression", "gzip").option("quote"
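
This error usually means the value passed to the quote option was longer than one character; CSV options such as quote, sep, and escape each take a single character. A sketch of a valid write (the output path is hypothetical):

(sql_df.write
    .format("csv")
    .option("sep", "\t")
    .option("compression", "gzip")
    .option("quote", '"')   # must be exactly one character
    .save("s3://bucket/out/"))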

java.lang.IllegalArgumentException: Illegal Capacity: -102 when reading a large parquet file by pyspark

I have a large parquet file (~5GB) and I want to load it in spark. The following command executes without any error: df = spark.read.parquet("path/to/file.parqu

pyspark.sql.functions.lit() not nullable conversion [duplicate]

When I create a new column with F.lit(1) and call printSchema(), I get column_name: integer (nullable = false), as the lit function docs is qui
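
A commonly cited workaround, sketched below: route the literal through when() without an otherwise(), which leaves the column nullable because the analyzer can no longer prove it is non-null.

from pyspark.sql import functions as F

# F.lit(1) alone yields nullable = false; wrapped in when() it becomes nullable = true
df = df.withColumn("flag", F.when(F.lit(True), F.lit(1)))
df.printSchema()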

Calculate a sequence of Markov chain values

I have a Spark question: for each entity k I have, as input, a sequence of probabilities p_i, each with an associated value v_i; for example the data can look like

Sort by key (Month) using RDDs in Pyspark

I have this RDD and want to sort it by Month (Jan --> Dec). How can I do it in PySpark? Note: I don't want to use spark.sql or DataFrame. +-----+-----+ |Month|co
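
One plain-RDD sketch, assuming the RDD holds (month, count) pairs with three-letter month names: build an ordinal lookup and sort with sortBy().

# Calendar order for the month keys
order = {m: i for i, m in enumerate(
    ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
     "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"])}

sorted_rdd = rdd.sortBy(lambda kv: order[kv[0]])  # Jan first, Dec last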

Show method for DynamicFrame in AWS Glue returns empty field

When I try to use dyF.show() it returns an empty field, even though I checked the schema and count() and I know the table is populated. I transformed it int
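
A frequently used workaround, sketched here: convert the DynamicFrame to a Spark DataFrame and call its show(), which prints the rows even when DynamicFrame.show() comes back empty.

dyF.toDF().show(20, truncate=False)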

In a pyspark dataframe, when I rename a column, the previous name can still be used for filtering. Bug or feature?

I work on Databricks with a PySpark dataframe containing string-type columns. I use .withColumnRenamed() to rename one of them. Later in the process I use a .filt
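
A minimal repro sketch of the behaviour being asked about (column names hypothetical). On many Spark versions the analyzer can still resolve a filter column from the plan underneath the rename, which is why the old name keeps working:

df = spark.createDataFrame([("a", 1)], ["old_name", "n"])
df2 = df.withColumnRenamed("old_name", "new_name")

# Still resolves on many Spark versions even though the column was renamed
df2.filter(df["old_name"] == "a").show()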

What is the difference between running a pyspark program with and without a cluster?

I have a program that contains a few functions that use pyspark (the rest is normal Python). The portion of my code that uses pyspark: X.to_csv(r'first.t

AWS Multi-Region Access Point

Need help on AWS Multi-Region Access Point (MRAP). I'm using a Spark dataframe to write data to an MRAP and it is erroring out: Df.write(<mrap alias>.acce

Using XSD in PySpark

I am building a data warehouse in Azure Synapse where one of the sources is about 20 different types of XML files (each with a different XSD schema) and 1 base schem
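
One hedged option is the spark-xml package (com.databricks:spark-xml), which can validate rows against an XSD while reading; the rowTag and paths below are hypothetical.

df = (spark.read
      .format("xml")
      .option("rowTag", "record")
      .option("rowValidationXSDPath", "base-schema.xsd")  # rows failing validation are treated as corrupt records
      .load("path/to/xml/*.xml"))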

PySpark performance tuning - cache or not to cache?

I am trying to speed up calculations for multiple operations that I am adding as columns in a pyspark dataframe, when I found the sparkbyexamples article
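
A rule-of-thumb sketch: cache() pays off only when the same lineage feeds more than one action; for a single write it just adds memory pressure. Paths and column names below are hypothetical.

from pyspark.sql import functions as F

base = spark.read.parquet("path/to/input/")
enriched = base.withColumn("doubled", F.col("x") * 2)

enriched.cache()                 # materialized on the first action
print(enriched.count())          # first action pays the compute cost
enriched.write.parquet("out/")   # second action reuses the cached rows
enriched.unpersist()             # release executor memory when finished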

Glue DynamicFrame: parse text file with ¶ delimiter

I have a text file which looks like the one below: HDR¶20200101 BDY¶1¶Jimmy BDY¶1¶Something TRL¶123 I would like to parse it into a Glue Dyn
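
One sketch, assuming a Glue job where glueContext is already defined: read the file with Spark's CSV reader using the pilcrow as separator, then wrap the result in a DynamicFrame (the input path is hypothetical).

from awsglue.dynamicframe import DynamicFrame

df = (spark.read
      .option("sep", "\u00b6")   # the ¶ character
      .option("header", "false")
      .csv("s3://bucket/input/"))

dyf = DynamicFrame.fromDF(df, glueContext, "parsed")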

Spark Cache with TTL option

Does Spark have a cache with a TTL option? I need to do a lookup on reference data to perform some transformations in my Spark streaming application. Also the lookup dataset
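
Spark has no built-in per-dataset TTL for cache(); a common hand-rolled pattern, sketched below, is to unpersist and reload the lookup once it is older than a chosen interval (the path and interval are hypothetical).

import time

_lookup = {"df": None, "loaded_at": 0.0}

def get_lookup(spark, ttl_seconds=600):
    # Reload the reference data once the cached copy is older than the TTL
    if _lookup["df"] is None or time.time() - _lookup["loaded_at"] > ttl_seconds:
        if _lookup["df"] is not None:
            _lookup["df"].unpersist()
        _lookup["df"] = spark.read.parquet("path/to/reference/").cache()
        _lookup["loaded_at"] = time.time()
    return _lookup["df"]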

Use RDD to map dataframe rows into custom objects in PySpark

I want to convert each row of my dataframe into a Python class object called Fruit. I have a dataframe df with the following columns: Identifier, Name, Quant
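
A minimal sketch, assuming Fruit is a plain Python class: drop to the underlying RDD and build one object per Row.

from dataclasses import dataclass

@dataclass
class Fruit:
    identifier: str
    name: str
    quantity: int

# Each Row becomes one Fruit instance
fruits = df.rdd.map(lambda r: Fruit(r["Identifier"], r["Name"], r["Quantity"]))
print(fruits.take(3))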

PySpark: How to transform data from string to date (or integer) in an easy-to-read manner

I have a date column in my dataframe that looks like this: "JAN20, FEB20, MAR20 .... JAN21, FEB21, MAR21..." This created a problem when I tried to plot numb
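
One hedged approach, assuming values like "JAN20" in a column called month_str (a hypothetical name): Spark 3's datetime parser is case-sensitive about month abbreviations, so normalize to "Jan20" first, then parse with the MMMyy pattern.

from pyspark.sql import functions as F

df = df.withColumn(
    "month_date",
    F.to_date(F.initcap(F.col("month_str")), "MMMyy"))  # "JAN20" -> 2020-01-01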

Can PySpark ML models be run on only parts of a dataframe, depending on a condition?

I have trained a logistic regression algorithm to match job titles and descriptions to a set of 4-digit numeric codes. It does this very well. It will form part
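
One sketch of conditional scoring (column names are hypothetical): split on the condition, run transform() only on the matching rows, then reunite the halves with aligned schemas.

from pyspark.sql import functions as F

to_score = df.filter(F.col("code").isNull())
untouched = df.filter(F.col("code").isNotNull())

scored = model.transform(to_score).select(*df.columns, "prediction")
untouched = untouched.withColumn("prediction", F.lit(None).cast("double"))

result = scored.unionByName(untouched.select(*df.columns, "prediction"))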

Pyspark: Write CSV from JSON file with struct column

I'm reading a .json file that contains the structure below, and I need to generate a CSV with this data in column form. I know that I can't directly write an ar
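
A sketch of the usual fix, since the CSV writer rejects struct columns: expand the struct into top-level columns (or serialize it with to_json) before writing; field and path names are hypothetical.

from pyspark.sql import functions as F

df = spark.read.option("multiLine", True).json("input.json")

# Either flatten the struct into ordinary columns...
flat = df.select("id", "payload.*")
# ...or keep it as a single JSON string column:
# flat = df.withColumn("payload", F.to_json("payload"))

flat.write.option("header", True).csv("out/")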