Category "pyspark"

In a pyspark dataframe, when I rename a column, the previous name can still be used for filtering. Bug or feature?

I work on Databricks with a PySpark dataframe containing string-type columns. I use .withColumnRenamed() to rename one of them. Later in the process I use a .filt

What is the difference between running a pyspark program with and without a cluster?

I have a program that contains a few lines of functions that use pyspark (the rest is normal Python). The portion of my code that uses pyspark: X.to_csv(r'first.t

AWS Multi-Region Access Point

Need help with an AWS Multi-Region Access Point (MRAP). I'm using a Spark data frame to write data to an MRAP and it is erroring out: Df.write(<mrap alias>.acce

Using XSD in PySpark

I am building a data warehouse in Azure Synapse where one of the sources is about 20 different types of XML files (each with a different XSD schema) and 1 base schem

Pyspark performance tuning - cache or not to cache?

I am trying to speed up the calculations from multiple operations that I am adding as columns in a pyspark data frame, when I found the sparkbyexamples article

Glue Dynamic Frame Parse text file with ¶ delimiter

I have a text file which looks like below: HDR¶20200101 BDY¶1¶Jimmy BDY¶1¶Something TRL¶123 I would like to parse it to a Glue Dyn

Spark Cache with TTL option

Does Spark have a cache with a TTL option? I need to do lookups on reference data to perform some transformation in my Spark streaming application. Also lookup dataset

Use RDD to map dataframe rows into custom objects pyspark

I want to convert each row of my dataframe into a Python class object called Fruit. I have a dataframe df with the following columns: Identifier, Name, Quant

PySpark: How to transform data from string to date (or integer) in an easy-to-read manner

I have a date column in a dataframe that looks like this: "JAN20, FEB20, MAR20 .... JAN21, FEB21, MAR21..." This created a problem when I tried to plot numb

Can PySpark ML models be run on only parts of a dataframe, depending on a condition?

I have trained a logistic regression algorithm to match job titles and descriptions to a set of 4 digit numeric codes. This it does very well. It will form part

Pyspark: Write CSV from JSON file with struct column

I'm reading a .json file that contains the structure below, and I need to generate a csv with this data in column form. I know that I can't directly write an ar

Pyspark: Extract Json Objects from Array

I need to extract objects from an array. Where there's more than one object in that array, I need to repeat for every id, and if the field is null then I want to

PySpark Self Signed certificate to access Artifactory from inside an EMR Jupyter Notebook

I am attempting to use a PySpark kernel from inside an EMR Notebook that is hosted on an AWS managed service (EMR) and I am unable to access Artifactory to inst

Class org.apache.hadoop.fs.s3a.auth.IAMInstanceCredentialsProvider not found when trying to write data on S3 bucket from Spark

I am trying to write data on an S3 bucket from my local computer: spark = SparkSession.builder \ .appName('application') \ .config("spark.hadoop.fs.s3a.

How can I extract columns from a dataframe according to an array and if it does not find a column, that column should contain null values? - pyspark

I have an array named "extractColumns" and I have a dataframe named "raw_data". I wanted to create a new dataframe according to the array and the dataframe. Eve

pyspark read file from S3 Compatible Storage (Dell ECS) not working

I have a Spark standalone cluster configured with 3 nodes. I want to read csv data stored in s3-compatible storage (Dell ECS) in PySpark. Here's the method and con

PicklingError: Could not serialize object: ValueError: Cell is empty when training elephas-keras model inside pyspark

I am new to using pyspark with elephas and tensorflow. I am trying to train a deep learning model inside pyspark using the elephas module. My code: https://www.kaggl

How to create a list of all elements present in a single cell of a dataframe?

Let's say I have a dataframe. Now I want a list of the elements present in the column NAME, like this: ['s', 'a', 'c', 'h', 'i', 'n']. How can we do this in pyspark

Date from week date format: 2022-W02-1 (ISO 8601) [duplicate]

Having a date, I create a column with ISO 8601 week date format: from pyspark.sql import functions as F df = spark.createDataFrame([('2019-03-