Category "apache-spark"

PySpark - explode returns an empty dataframe when a nested collection has no items

I have the following dataframe:

```
+---------------+--------+
|book_id        |Chapters|
+---------------+--------+
|865731         |[]      |
+----
```
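
For reference, the usual fix when rows vanish because the array is empty is explode_outer, which keeps the row and emits a null instead of dropping it. A minimal sketch using the column names from the excerpt above:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(865731, [])], "book_id long, Chapters array<string>")

# explode() drops rows whose array is empty or null;
# explode_outer() keeps them, yielding null in the exploded column.
df.select("book_id", F.explode_outer("Chapters").alias("chapter")).show()
```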

AttributeError: 'AioClientCreator' object has no attribute '_register_lazy_block_unknown_fips_pseudo_regions'

Recently, I have started using the AWS platform, but when trying to use SageMaker I get the following error, and I don't know if it is because of SageMaker or

Reuse Spark Session Across Modules/Packages

We are building a reusable data framework using PySpark. As part of this, we built one big utilities package that hosted all the methods. But now, we are pl
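
One common approach is to centralize session creation behind SparkSession.builder.getOrCreate(), which returns the already-active session when one exists, so every module shares it. A minimal sketch (module and function names are illustrative):

```python
# session.py - hypothetical shared module for the framework
from pyspark.sql import SparkSession

def get_spark(app_name: str = "data-framework") -> SparkSession:
    # getOrCreate() reuses the active session if one exists, so every
    # package that calls this helper sees the same SparkSession.
    return SparkSession.builder.appName(app_name).getOrCreate()
```

Each package then imports get_spark() instead of constructing its own session, keeping configuration in one place.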

Spark agg with multiple collect_list: can we guarantee that values at the same index come from the same row?

Is it possible to ensure that the values at the same index of each collect_set come from a single row of the original dataframe? ("a", 1), ("b", 2)
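
Spark gives no ordering guarantee across two independent collect_list/collect_set aggregations, so the usual workaround is to collect one struct of the paired columns, which keeps values from the same row together. A hedged sketch:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("k", "a", 1), ("k", "b", 2)], ["id", "letter", "num"])

# Collecting one struct per row keeps ("a", 1) and ("b", 2) paired,
# instead of hoping two separate collect_list calls share an order.
pairs = df.groupBy("id").agg(F.collect_list(F.struct("letter", "num")).alias("pairs"))
pairs.show(truncate=False)
```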

pyspark.sql.functions.lit() not nullable conversion [duplicate]

When I create a new column with F.lit(1), calling printSchema() shows column_name: integer (nullable = false), as the lit function docs is qui
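
One commonly cited workaround, if a nullable literal column is needed, is to wrap the literal in when() with no otherwise clause, since Spark infers the result of such an expression as nullable. A sketch under that assumption:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.range(3)

# lit(1) alone yields nullable = false; wrapping it in when()
# (without otherwise) makes Spark infer nullable = true.
df = df.withColumn("flag", F.when(F.lit(True), F.lit(1)))
df.printSchema()
```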

Calculate a sequence of Markov chain values

I have a Spark question: for the input, each entity k has a sequence of probabilities p_i with an associated value v_i; for example, the data can look like
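
The excerpt is cut off, but one plausible reading, aggregating each entity's (p_i, v_i) pairs into an expected value sum over p_i * v_i, reduces to a plain groupBy in Spark. A sketch under that assumption (column names are hypothetical):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("k1", 0.2, 10.0), ("k1", 0.8, 5.0), ("k2", 1.0, 3.0)],
    ["entity", "p", "v"],
)

# Expected value per entity: sum over p_i * v_i.
df.groupBy("entity").agg(F.sum(F.col("p") * F.col("v")).alias("expected_v")).show()
```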

Show method for DynamicFrame in AWS Glue returns empty field

When I try to use dyF.show(), it returns an empty field, even though I checked the schema and count() and I know the table is populated. I transformed it int
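
A workaround often used when DynamicFrame.show() prints nothing is to convert to a Spark DataFrame first, whose show() renders reliably. A sketch (dyF is assumed to be an existing Glue DynamicFrame):

```python
# dyF is assumed to be an existing AWS Glue DynamicFrame.
df = dyF.toDF()            # convert to a plain Spark DataFrame
df.show(20, truncate=False)
print(df.count())          # sanity-check that rows are present
```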

Generate Kafka message with Headers using Apache Spark

I have an ETL job (Spark/Scala). After writing to a table, a message with a header must be sent to Kafka, but I couldn't add the header to the message. I have a Spark D
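
The question is Scala, but the mechanism is the same in PySpark and briefer to show: since Spark 3.0 the Kafka sink accepts an optional headers column of type array&lt;struct&lt;key: string, value: binary&gt;&gt;. A hedged sketch (broker, topic, and header values are placeholders):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("k1", "payload")], ["key", "value"])

# The Kafka sink (Spark 3.0+) picks up an optional "headers" column
# shaped as array<struct<key: string, value: binary>>.
with_headers = df.withColumn(
    "headers",
    F.array(F.struct(
        F.lit("source").alias("key"),
        F.lit("etl-job").cast("binary").alias("value"),
    )),
)

(with_headers.write.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # placeholder
    .option("topic", "my-topic")                       # placeholder
    .save())
```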

How to get list of all leaf folders from ADLS Gen2 path via Scala code?

We have folders with year, month, and day subfolders nested inside them. How can we get only the leaf-level folder list using the dbutils.fs.ls utility? Exampl
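
On Databricks this is usually a small recursion over dbutils.fs.ls: a directory with no subdirectories is a leaf. The question asks for Scala, but the logic is identical; a Python sketch (dbutils exists only inside Databricks; the path is a placeholder):

```python
def leaf_dirs(path):
    """Recursively collect directories that contain no subdirectories."""
    subdirs = [e.path for e in dbutils.fs.ls(path) if e.isDir()]
    if not subdirs:
        return [path]
    leaves = []
    for sub in subdirs:
        leaves.extend(leaf_dirs(sub))
    return leaves

# leaf_dirs("abfss://container@account.dfs.core.windows.net/root/")  # placeholder path
```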

How to pass Cardinality Threshold value for Histogram in Deequ package?

By default, the variable DEFAULT_CARDINALITY_THRESHOLD is set to 120 in Deequ. This is very low for our use case. Can anyone please suggest whether we can set this va

LEAD function with date scenario

I have multiple files, but let's consider 2 files which have FileName and Start_Date columns.

```
Start_Date  FileName
2022-01-01  product 1
2022-02-02  product 2
```

pl
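
For context, lead() reads the next row within a window, so the next file's start date can be placed on the current row. A minimal sketch with the columns from the excerpt:

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("2022-01-01", "product 1"), ("2022-02-02", "product 2")],
    ["Start_Date", "FileName"],
)

# lead() pulls the following row's Start_Date onto the current row;
# the last row gets null because nothing follows it.
w = Window.orderBy("Start_Date")
df.withColumn("Next_Start_Date", F.lead("Start_Date").over(w)).show()
```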

Multi-schema data pipeline

We want to create a Spark-based streaming data pipeline that consumes from a source (e.g. Kinesis), applies some basic transformations, and writes the data to a fi
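
One common shape for such a pipeline is readStream, transform, writeStream, with foreachBatch handling any custom routing or sink logic per micro-batch. A hedged sketch (the built-in rate source stands in for Kinesis, whose connector options vary by environment; paths are placeholders):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Stand-in source: the built-in "rate" source; a real pipeline would
# plug in a Kinesis connector here.
stream = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

transformed = stream.withColumn("doubled", F.col("value") * 2)

# foreachBatch hands each micro-batch over as a normal DataFrame,
# which makes multi-schema routing or custom sinks straightforward.
def write_batch(batch_df, batch_id):
    batch_df.write.mode("append").parquet("/tmp/out")  # placeholder path

query = (transformed.writeStream
    .foreachBatch(write_batch)
    .option("checkpointLocation", "/tmp/chk")          # placeholder path
    .start())
```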

What is the difference between running a PySpark program with and without a cluster?

I have a program that contains a few functions that use PySpark (the rest is normal Python). The portion of my code that uses PySpark: X.to_csv(r'first.t

Best way to create a custom Transformer in Java Spark ML

I am learning big data using Apache Spark and I want to create a custom transformer for Spark ML so that I can execute some aggregate functions or perform o
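
The question targets Java, but the moving part is the same in any language: subclass Transformer and override its transform hook. A PySpark sketch of the idea (class and column names are illustrative):

```python
from pyspark.ml import Transformer
from pyspark.sql import DataFrame, SparkSession, functions as F

class SumColumns(Transformer):
    """Illustrative transformer that appends a column summing two inputs."""

    def __init__(self, input_cols, output_col):
        super().__init__()
        self.input_cols = input_cols
        self.output_col = output_col

    def _transform(self, df: DataFrame) -> DataFrame:
        a, b = self.input_cols
        return df.withColumn(self.output_col, F.col(a) + F.col(b))

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 2)], ["a", "b"])
SumColumns(["a", "b"], "total").transform(df).show()
```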

Performing a groupBy on a dataframe while limiting the number of rows

I have a dataframe that contains an "id" column and a "publication" column. The "id" column contains duplicates, and represents a researcher. The "publication"
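
The usual approach is to rank rows per id with row_number() and filter before grouping, which caps how many publications each researcher contributes. A sketch with the columns named above (the limit of 2 is arbitrary):

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, "paper A"), (1, "paper B"), (1, "paper C"), (2, "paper D")],
    ["id", "publication"],
)

# Keep at most 2 publications per id, then aggregate.
w = Window.partitionBy("id").orderBy("publication")
limited = df.withColumn("rn", F.row_number().over(w)).filter("rn <= 2")
limited.groupBy("id").agg(F.collect_list("publication").alias("publications")).show()
```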

Spark Cache with TTL option

Does Spark have a cache with a TTL option? I need to do a lookup on reference data to perform some transformations in my Spark streaming application. Also, the lookup dataset
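
Spark itself has no TTL on cached DataFrames, so a common workaround is a small wrapper that unpersists and reloads the reference data once it is older than the chosen TTL. A hedged sketch (load_reference is a hypothetical loader function):

```python
import time

class TtlCachedDataFrame:
    """Reloads and re-caches a DataFrame once it is older than ttl_seconds."""

    def __init__(self, loader, ttl_seconds):
        self.loader = loader      # zero-arg function returning a fresh DataFrame
        self.ttl = ttl_seconds
        self._df = None
        self._loaded_at = 0.0

    def get(self):
        if self._df is None or time.time() - self._loaded_at > self.ttl:
            if self._df is not None:
                self._df.unpersist()
            self._df = self.loader().cache()
            self._loaded_at = time.time()
        return self._df

# ref = TtlCachedDataFrame(load_reference, ttl_seconds=600)  # load_reference is hypothetical
```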

Use RDD to map dataframe rows into custom objects in PySpark

I want to convert each row of my dataframe into a Python class object called Fruit. I have a dataframe df with the following columns: Identifier, Name, Quant
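
For reference, the direct route is df.rdd.map with a constructor call per Row; the class only has to be importable on the workers. A sketch using the columns from the excerpt:

```python
from pyspark.sql import SparkSession

class Fruit:
    def __init__(self, identifier, name, quantity):
        self.identifier = identifier
        self.name = name
        self.quantity = quantity

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "apple", 3)], ["Identifier", "Name", "Quantity"])

# Each Row is unpacked into a Fruit instance on the executors.
fruits = df.rdd.map(lambda r: Fruit(r.Identifier, r.Name, r.Quantity))
print(fruits.map(lambda f: f.name).collect())
```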

Can PySpark ML models be run on only parts of a dataframe, depending on a condition?

I have trained a logistic regression algorithm to match job titles and descriptions to a set of 4-digit numeric codes. It does this very well. It will form part
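
A fitted model can't be applied row-conditionally in a single call, but the standard workaround is to filter the DataFrame, score only the matching subset, and union the rest back. A hedged sketch (model, df, and the condition are placeholders; "prediction" is the conventional Spark ML output column):

```python
from pyspark.sql import functions as F

# `model` is assumed to be a fitted Spark ML model (e.g. a PipelineModel)
# and `df` the full input DataFrame; the filter condition is a placeholder.
needs_scoring = df.filter(F.col("code").isNull())
already_coded = df.filter(F.col("code").isNotNull())

scored = model.transform(needs_scoring).select(*df.columns, "prediction")

# Give the untouched rows a null prediction so the schemas line up.
untouched = already_coded.withColumn("prediction", F.lit(None).cast("double"))
result = scored.unionByName(untouched)
```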

PySpark: Write CSV from JSON file with struct column

I'm reading a .json file that contains the structure below, and I need to generate a CSV with this data in column form. I know that I can't directly write an ar
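
For background, the CSV writer rejects nested types, so the typical fix is to flatten struct fields into top-level columns and serialize any array columns with to_json first. A sketch over a hypothetical payload shape (paths are placeholders):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical input shape: one struct column plus one array column.
df = spark.createDataFrame(
    [(1, ("Alice", 30), ["a", "b"])],
    "id int, person struct<name:string, age:int>, tags array<string>",
)

flat = df.select(
    "id",
    F.col("person.name").alias("name"),    # struct fields become plain columns
    F.col("person.age").alias("age"),
    F.to_json("tags").alias("tags_json"),  # arrays must be serialized for CSV
)
flat.write.mode("overwrite").csv("/tmp/out_csv")  # placeholder path
```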