Category "pyspark"

Pyspark to parse multiple dictionaries

Imagine I have the following list of dicts in Python: list = [dict1, dict2, dict3] I want to parse these dicts, transform them into a dataframe, and save it as
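
A minimal sketch of one way to do this, assuming the dicts share the same keys; the dict contents, output path, and parquet format below are placeholders, not part of the question:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# hypothetical dicts standing in for dict1, dict2, dict3
dicts = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}, {"id": 3, "name": "c"}]

# createDataFrame accepts a list of dicts; each dict becomes one row
df = spark.createDataFrame(dicts)

# write out; the path and format are placeholders
df.write.mode("overwrite").parquet("/tmp/dicts_parquet")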

SparkFatalException root cause

I am using spark 3.0.2 with java 8 version. I am trying to write data to an s3 path using a spark job. I am getting the exception below and am not able to tell what caused thi

py4j.protocol.Py4JJavaError: An error occurred while calling o63.save. : java.lang.NoClassDefFoundError: org/apache/spark/Logging

I am new to Spark and the BigData component HBase. I am trying to write Python code in PySpark that connects to HBase and reads data from it. I'm using the followi

Is there a way to slice dataframe based on index in pyspark?

In python or R, there are ways to slice DataFrame using index. For example, in pandas: df.iloc[5:10,:] Is there a similar way in pyspark to slice data bas
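
Spark DataFrames have no positional index, so a common workaround is to attach one; a minimal sketch, assuming a column named "id" defines the order to slice by:

from pyspark.sql import functions as F, Window

# assumes a column ("id" here) that defines the order you want to slice by
w = Window.orderBy("id")
indexed = df.withColumn("row_idx", F.row_number().over(w) - 1)

# rough equivalent of pandas df.iloc[5:10, :]
sliced = indexed.filter((F.col("row_idx") >= 5) & (F.col("row_idx") < 10)).drop("row_idx")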

Delete multiple rows from a delta table/pyspark data frame given a list of IDs

I need to find a way to delete multiple rows from a delta table/pyspark data frame given a list of IDs to identify the rows. As far as I can tell there isn't a
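
A sketch of one approach using the delta-spark package's DeltaTable API; the table path, the ID values, and the column name "id" are assumptions:

from delta.tables import DeltaTable
from pyspark.sql import functions as F

ids_to_drop = [101, 102, 103]                          # hypothetical ID list

dt = DeltaTable.forPath(spark, "/mnt/delta/my_table")  # path is a placeholder
dt.delete(F.col("id").isin(ids_to_drop))               # column name "id" is an assumption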

AWS EMR s3a filesystem not found

I am running an EMR instance. It was working fine, but it suddenly started giving the error below when I try to access S3 files from a Python Spark script: py4
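
A "filesystem not found" error for s3a usually means the hadoop-aws jar is missing from the classpath; a hedged sketch of one fix, where the package version is only an example and must match the cluster's Hadoop build:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("s3a-check")
    # the hadoop-aws version must match the cluster's Hadoop build; 3.2.1 is only an example
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.2.1")
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    .getOrCreate()
)

df = spark.read.text("s3a://my-bucket/some/key.txt")  # bucket and key are placeholders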

pyspark create dictionary from data in two columns

I have a pyspark dataframe with two columns: [Row(zip_code='58542', dma='MIN'), Row(zip_code='58701', dma='MIN'), Row(zip_code='57632', dma='MIN'), Row(zip_
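
A minimal sketch of one approach, collecting the two columns to the driver and building a plain Python dict; fine for lookup-sized data, not for very large frames:

# collect the two columns, then build {zip_code: dma} on the driver
pairs = df.select("zip_code", "dma").collect()
zip_to_dma = {row["zip_code"]: row["dma"] for row in pairs}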

Is it possible to load multiple directory separately in pyspark but process them in parallel?

I have an S3 or Azure Blob directory structure like the following: parent_dir child_dir1 avro_1 avro_2 ... child_dir2 ... There
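
One possible sketch: issue a single read over all the directories so Spark parallelises the scan itself, and keep the origin of each row via input_file_name(). The bucket paths are placeholders, and the Avro reader assumes the spark-avro package is on the classpath:

from pyspark.sql import functions as F

paths = [
    "s3a://bucket/parent_dir/child_dir1",  # placeholder paths
    "s3a://bucket/parent_dir/child_dir2",
]

# one read over all directories; input_file_name() records which file
# (and hence which directory) each row came from
df = (
    spark.read.format("avro")   # needs the spark-avro package on the classpath
    .load(paths)
    .withColumn("source_file", F.input_file_name())
)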

How to test mocked (moto/boto) S3 read/write in PySpark

I am trying to unit test a function that writes data to S3 and then reads the same data back from the same S3 location. I am trying to use moto and boto (2.x) to ac
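
A rough sketch of one workaround, not a documented recipe: it assumes a moto release that ships standalone server mode (moto[server]) plus boto3, because the JVM-side s3a client cannot see moto's in-process Python mocks; hadoop-aws still has to be on the classpath:

import boto3
from moto.server import ThreadedMotoServer   # assumes a moto version with server mode
from pyspark.sql import SparkSession

# run moto as a local HTTP endpoint and point s3a at it
server = ThreadedMotoServer(port=5000)
server.start()

boto3.client(
    "s3", endpoint_url="http://127.0.0.1:5000",
    aws_access_key_id="testing", aws_secret_access_key="testing",
    region_name="us-east-1",
).create_bucket(Bucket="test-bucket")

spark = (
    SparkSession.builder
    .config("spark.hadoop.fs.s3a.endpoint", "http://127.0.0.1:5000")
    .config("spark.hadoop.fs.s3a.access.key", "testing")
    .config("spark.hadoop.fs.s3a.secret.key", "testing")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)

# round-trip a small frame through the mocked bucket
spark.range(5).write.mode("overwrite").parquet("s3a://test-bucket/out")
roundtrip = spark.read.parquet("s3a://test-bucket/out")

server.stop()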

How to create a new column with a null value using Pyspark DataFrame?

I'm having issues with using pyspark dataframes. I have a column called eventkey which is a concatenation of the following elements: account_type, counter_type
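
A minimal sketch of one way to add a typed null column; the column name and the StringType are assumptions:

from pyspark.sql import functions as F
from pyspark.sql.types import StringType

# lit(None) alone has NullType; casting gives the new column a concrete type
df = df.withColumn("new_col", F.lit(None).cast(StringType()))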

DF.toPandas() - Failed to locate the winutils binary in the hadoop binary path

I am processing a huge text file using PyCharm and PySpark. This is what I am trying to do: spark_home = os.environ.get('SPARK_HOME', None) os.environ["SPARK_HOM
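
On Windows this error usually means HADOOP_HOME is not set before the JVM starts; a sketch under that assumption, where the path is a placeholder and winutils.exe must already exist under its bin directory:

import os

# winutils.exe must live under %HADOOP_HOME%\bin; set this before creating the SparkSession
os.environ["HADOOP_HOME"] = r"C:\hadoop"
os.environ["PATH"] = os.environ["PATH"] + os.pathsep + r"C:\hadoop\bin"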

Search and filter text from a column using Pyspark

I am new to data scraping. I am reading the data from a file with one JSON object per row: {"name": "Soul Sweet \u2018Taters (Step-by-Step!)", "ingredients":
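
A minimal sketch of one approach, assuming one JSON object per line (the file path and search term are placeholders):

from pyspark.sql import functions as F

# one JSON object per line is exactly the format spark.read.json expects
df = spark.read.json("recipes.json")   # path is a placeholder

# keep rows whose name contains a search term; a case-insensitive variant is shown second
matches = df.filter(F.col("name").contains("Taters"))
matches_ci = df.filter(F.lower(F.col("name")).contains("taters"))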

Invalid labels for classification logistic regression model in pyspark databricks

I am using Spark ML library for a classification problem using logistic regression. I have vectorized the input features and created a training dataset and a test datas
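
"Invalid labels" usually points at a label column that is not encoded as 0.0 .. k-1.0; a sketch of one fix with StringIndexer, where the column names and train_df are assumptions:

from pyspark.ml.feature import StringIndexer
from pyspark.ml.classification import LogisticRegression

# StringIndexer maps the raw labels onto 0.0 .. k-1.0, which LogisticRegression expects
indexer = StringIndexer(inputCol="label_raw", outputCol="label")
train_indexed = indexer.fit(train_df).transform(train_df)

lr = LogisticRegression(featuresCol="features", labelCol="label")
model = lr.fit(train_indexed)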

Presto fails to import PARQUET files from S3

I have a presto table that imports PARQUET files based on partitions from s3 as follows: create table hive.data.datadump ( tUnixEpoch varchar, tDateTi

Translate Python to Pyspark

I have Python code that takes all values containing NCO - ETD in Type, grouped by ID and Date: cond = 'NCO - ETD' df_ = (data.where(data.assign(new = data['Type'].str
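
A hedged sketch of a PySpark equivalent; since the original pandas snippet is cut off, the count() aggregation below is only a placeholder for whatever the original groupby produced:

from pyspark.sql import functions as F

cond = "NCO - ETD"

# keep rows whose Type contains the pattern, then group by ID and Date
result = (
    data.filter(F.col("Type").contains(cond))
        .groupBy("ID", "Date")
        .count()   # placeholder aggregation
)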

Finding the middle values with the min distance in pyspark

I need some help please. I have this dataframe with an even number of values for the column 'b': df1 = spark.createDataFrame([ ('c',1), ('c',2), ('c',
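
One possible sketch for picking the two middle rows per group, assuming the two columns are named 'a' and 'b' (the excerpt does not show the column names):

from pyspark.sql import functions as F, Window

# attach a 1-based position ordered by b and a per-group row count,
# then keep the two rows straddling the middle
w_order = Window.partitionBy("a").orderBy("b")
w_group = Window.partitionBy("a")

ranked = (
    df1.withColumn("pos", F.row_number().over(w_order))
       .withColumn("cnt", F.count("*").over(w_group))
)

middle_two = ranked.filter(
    (F.col("pos") == F.col("cnt") / 2) | (F.col("pos") == F.col("cnt") / 2 + 1)
).drop("pos", "cnt")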

How to quickly check if row exists in PySpark Dataframe?

I have a PySpark dataframe like this:

+------+------+
|     A|     B|
+------+------+
|     1|     2|
|     1|     3|
|     2|     3|
|     2|     5|
+------+--
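
A minimal sketch of one fast existence check; the column names and values below are taken from the excerpt, the rest is an assumption:

from pyspark.sql import functions as F

# limit(1) lets Spark stop as soon as one matching row is found
exists = df.filter((F.col("A") == 1) & (F.col("B") == 3)).limit(1).count() > 0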

How to connect Snowflake with PySpark?

I am trying to connect to Snowflake with Pyspark on my local machine. My code is as follows: from pyspark.sql.types import * from pyspark.sql import SparkSessio
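
A hedged sketch of one way to read from Snowflake with the Snowflake Spark connector; every connection value is a placeholder, and the package coordinates are examples that must match your Spark/Scala version:

from pyspark.sql import SparkSession

# the Snowflake Spark connector and JDBC driver must be on the classpath
spark = (
    SparkSession.builder
    .config(
        "spark.jars.packages",
        "net.snowflake:spark-snowflake_2.12:2.9.3-spark_3.1,"
        "net.snowflake:snowflake-jdbc:3.13.14",
    )
    .getOrCreate()
)

sf_options = {  # every value here is a placeholder
    "sfURL": "myaccount.snowflakecomputing.com",
    "sfUser": "my_user",
    "sfPassword": "my_password",
    "sfDatabase": "MY_DB",
    "sfSchema": "PUBLIC",
    "sfWarehouse": "MY_WH",
}

df = (
    spark.read.format("net.snowflake.spark.snowflake")
    .options(**sf_options)
    .option("dbtable", "MY_TABLE")
    .load()
)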

How to find the position of a substring column in another column using PySpark?

If I have a PySpark DataFrame with two columns, text and subtext, where subtext is guaranteed to occur somewhere within text, how would I calculate the positio
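
A minimal sketch of one approach: the Python-level locate()/instr() helpers take the substring as a literal, so expr() is used to pass the subtext column instead (the output column name is an assumption):

from pyspark.sql import functions as F

# instr returns a 1-based position of subtext within text
df = df.withColumn("subtext_pos", F.expr("instr(text, subtext)"))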