Category "apache-spark"

Spark SQL: Find the number of extensions for a record

I have a dataset as below:

col1  extension_col1
2345  2246
2246  2134
2134  2091
2091  Null
1234  1111
1111  Null

I need to find the number of extensions available for each record.
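
The data forms chains (2345 -> 2246 -> 2134 -> 2091 -> Null), so one way to count extensions is to repeatedly follow the link column. A minimal PySpark sketch, assuming the column names from the excerpt and an arbitrary upper bound of 10 hops; everything else is illustrative:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [("2345", "2246"), ("2246", "2134"), ("2134", "2091"),
         ("2091", None), ("1234", "1111"), ("1111", None)],
        ["col1", "extension_col1"])

    # Walk the extension links until they run out; each matched hop adds one.
    result = df.select("col1", F.col("extension_col1").alias("nxt"), F.lit(0).alias("n_ext"))
    links = df.select(F.col("col1").alias("nxt"), F.col("extension_col1").alias("nxt2"))

    for _ in range(10):  # assumed upper bound on the chain length
        result = (result.join(links, "nxt", "left")
                        .withColumn("n_ext", F.when(F.col("nxt").isNotNull(),
                                                    F.col("n_ext") + 1).otherwise(F.col("n_ext")))
                        .withColumn("nxt", F.col("nxt2"))
                        .drop("nxt2"))

    result.select("col1", "n_ext").show()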

How to effectively run tasks in parallel in PySpark

I am writing a framework that basically does data sanity checks. I have a set of inputs like { "check_1": [sql_query_1, sql_query_2], "check_2": … }
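
Since Spark's scheduler is thread-safe, one common pattern for this kind of check framework is to submit the SQL for each check from separate Python threads so the resulting jobs run concurrently. A minimal sketch, with hypothetical queries standing in for sql_query_1/sql_query_2:

    from concurrent.futures import ThreadPoolExecutor
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical check definitions mirroring the excerpt's structure.
    checks = {
        "check_1": ["SELECT COUNT(*) FROM t1 WHERE col IS NULL",
                    "SELECT COUNT(*) FROM t1 WHERE col < 0"],
        "check_2": ["SELECT COUNT(*) FROM t2 WHERE id IS NULL"],
    }

    def run_check(name, queries):
        # Each thread submits its own Spark jobs, which can run concurrently
        # on the cluster subject to the scheduler's resources.
        return name, [spark.sql(q).collect()[0][0] for q in queries]

    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = [pool.submit(run_check, n, qs) for n, qs in checks.items()]
        results = dict(f.result() for f in futures)

    print(results)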

Is there a way to configure the memory resources for Spark using PySpark?

I'm working on an ETL job in a SageMaker notebook that uses Spark 2.4.0. After joining a couple of tables I keep getting the following errors. Update: I was…
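
For reference, memory settings such as spark.driver.memory and spark.executor.memory normally have to be supplied when the SparkSession is built (or via spark-defaults), not changed on a running session. A hedged sketch of how that looks from PySpark; the values and app name are placeholders:

    from pyspark.sql import SparkSession

    # Memory settings generally take effect only if applied before the
    # SparkSession (and its JVM) is created.
    spark = (SparkSession.builder
             .appName("etl-job")                          # placeholder name
             .config("spark.driver.memory", "8g")
             .config("spark.executor.memory", "8g")
             .config("spark.executor.memoryOverhead", "2g")
             .config("spark.sql.shuffle.partitions", "200")
             .getOrCreate())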

Oozie initial-instance and start-time giving error on missing dataset

I am new to Oozie and trying to understand dataset.xml. I have the following dataset and am trying to understand what exactly Oozie is validating here. What is…

Group by id and create a column based on priority in PySpark

Can someone help me with the below? I have an input dataframe:

ID  process_type   STP_stagewise
1   loan_creation  Manual
1   loan creation  NSTP
1   reimbursement  STP
2   …
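
One way to derive a single value per ID from a priority order is to map each STP_stagewise value to a rank and take the minimum struct per group. A sketch using the rows from the excerpt; the priority order Manual > NSTP > STP and the output column name are assumptions:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [(1, "loan_creation", "Manual"), (1, "loan creation", "NSTP"),
         (1, "reimbursement", "STP"), (2, "loan_creation", "STP")],
        ["ID", "process_type", "STP_stagewise"])

    # Assumed priority: Manual beats NSTP, which beats STP.
    priority = (F.when(F.col("STP_stagewise") == "Manual", 1)
                 .when(F.col("STP_stagewise") == "NSTP", 2)
                 .otherwise(3))

    result = (df.withColumn("prio", priority)
                .groupBy("ID")
                .agg(F.min(F.struct("prio", "STP_stagewise")).alias("best"))
                .select("ID", F.col("best.STP_stagewise").alias("derived_status")))
    result.show()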

How to fix 'php spark serve' not working: error thrown in C:\xampp\htdocs\ci-news\system\CLI\CLI.php on line 758

This is what cmd said and I don't know how to fix it. I saw similar cases on Stack Overflow but their suggestions didn't fix my problem. I hope you…

How to connect Spark workers to the Spark driver in Kubernetes (standalone cluster)

I created a Dockerfile with just Debian and Apache Spark downloaded from the main website. I then created a Kubernetes deployment to have 1 pod running the Spark driver…
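
Whatever the deployment layout, the workers need a route back to the driver, which usually means exposing the driver pod through a Service and pinning spark.driver.host and the driver ports. A sketch of the driver-side configuration; the service names and port numbers are assumptions:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .master("spark://spark-master:7077")           # assumed master service
             .config("spark.driver.host", "spark-driver-svc")  # assumed driver Service
             .config("spark.driver.bindAddress", "0.0.0.0")
             .config("spark.driver.port", "7078")
             .config("spark.blockManager.port", "7079")
             .getOrCreate())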

Databricks Runtime 10.4 LTS - AnalysisException: No such struct field id in 0, 1 after upgrading

We are working to migrate to Databricks Runtime 10.4 LTS from 9.1 LTS but we're running into weird behavioral issues. Our existing code works up until runtime…

Why compute row_number() over (order by monotonically_increasing_id()) in Spark?

It is suggested that you can 'generate unique increasing numeric values' by select row_number() over (order by monotonically_increasing_id()) from /* ... */. But…
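
The two functions solve different problems: monotonically_increasing_id() is unique and increasing but leaves large gaps between partitions, while row_number() over that ordering produces a gapless 1..N sequence, at the cost of pulling everything through a single window partition. A small sketch illustrating the difference:

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.window import Window

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(0, 10).repartition(3)

    # Unique and increasing, but sparse: the partition id is encoded
    # in the upper bits, so values jump between partitions.
    df = df.withColumn("mono_id", F.monotonically_increasing_id())

    # Gapless 1..N sequence derived from that ordering.
    w = Window.orderBy("mono_id")
    df = df.withColumn("row_num", F.row_number().over(w))
    df.show()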

Unable to open or query .parquet files due to corrupted column

I am sending JSON telemetry data from Azure Stream Analytics to Azure Data Lake Gen2, serialized as .parquet files. From the data lake I've then created a view in…
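
One way to narrow down which file or column is actually corrupt is to read with an explicit schema instead of schema inference and let Spark skip unreadable files. A hedged sketch; the schema and path are placeholders, not the real telemetry layout:

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

    spark = SparkSession.builder.getOrCreate()

    # Built-in option to skip files that cannot be read at all.
    spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")

    schema = StructType([
        StructField("deviceId", StringType()),     # placeholder columns
        StructField("timestamp", TimestampType()),
        StructField("value", DoubleType()),
    ])

    df = spark.read.schema(schema).parquet("abfss://container@account.dfs.core.windows.net/telemetry/")
    df.show(5)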

Programmatic way to find the cluster version from CDSW - Cloudera Data Science Workbench

Is there any programmatic way to find out the cluster version (CDH6 or CDP7) from a CDSW session? Could any environment variable give a fool-proof way to determine this?
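
Short of an official variable, one indirect approach is to inspect the Spark/Hadoop versions the session was built against and dump any CDH/CDP/CDSW-related environment variables for inspection. A sketch; no claim is made that a dedicated version variable exists, and the Hadoop lookup goes through Spark's internal py4j gateway:

    import os
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # The Spark/Hadoop builds differ between CDH6 and CDP7, so they can
    # serve as an indirect signal of the cluster version.
    print("Spark version:", spark.version)
    print("Hadoop version:",
          spark.sparkContext._jvm.org.apache.hadoop.util.VersionInfo.getVersion())

    # Anything in the environment mentioning CDH/CDP/CDSW.
    for k, v in os.environ.items():
        if any(tag in k.upper() for tag in ("CDH", "CDP", "CDSW")):
            print(k, "=", v)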

Update a highly nested column from string to struct

 |-- x: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- y: long (nullable = true)
 |    |    |-- z: array (nullable = true)
…
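
On Spark 3.1+, transform() combined with Column.withField() lets you rewrite one field inside each array element without rebuilding the whole struct. Since the excerpt's schema is truncated, the sketch assumes the string to convert is a hypothetical field called payload with a two-field target schema; if it sits deeper (e.g. inside the z array), nest another transform:

    from pyspark.sql import functions as F
    from pyspark.sql.types import StructType, StructField, StringType, LongType

    # Hypothetical target type for the string field that should become a struct.
    payload_schema = StructType([StructField("a", StringType()),
                                 StructField("b", LongType())])

    df2 = df.withColumn(
        "x",
        F.transform("x", lambda e: e.withField("payload",
                                               F.from_json(e["payload"], payload_schema))))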

Spark - Update a nested column to string

 |-- x: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- y: struct (nullable = true)
 |    |    |-- z: struct (nullable = …)
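
The reverse direction (nested struct back to string) can use the same transform()/withField() pattern with to_json(), again assuming Spark 3.1+ and the y field shown in the schema:

    from pyspark.sql import functions as F

    # Serialize the nested struct y to a JSON string for every element of x,
    # leaving the remaining fields untouched.
    df2 = df.withColumn(
        "x",
        F.transform("x", lambda e: e.withField("y", F.to_json(e["y"]))))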

Spark write to CSV/Hive takes too much time; performance benchmark

I am having a very simple problem with Spark, but there is very little information on the web. I have encountered this problem using both PySpark and Scala. The…
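
To tell a slow transformation apart from a slow write, it can help to materialise the dataframe once and then time only the write while controlling the number of output files. A rough benchmarking sketch; the paths, table name and partition counts are placeholders:

    import time

    df = df.cache()
    df.count()                               # forces the upstream computation once

    start = time.time()
    df.coalesce(16).write.mode("overwrite").option("header", True).csv("/tmp/out_csv")
    print("csv write took", time.time() - start, "seconds")

    start = time.time()
    df.write.mode("overwrite").saveAsTable("db.benchmark_table")   # hypothetical table
    print("hive write took", time.time() - start, "seconds")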

Solr distinct query: I want only certain fields to be listed

I want to find the number of unique records based on the myparam value, and I want only certain fields to be listed. There are too many ifs in the distinctVal…

PySpark - explode returns an empty dataframe when a nested collection has no items

I have the following dataframe:

+---------------+--------+
|book_id        |Chapters|
+---------------+--------+
|865731         |[]      |
+---------------+--------+
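
explode() drops rows whose array is empty or null, which would explain the empty result; explode_outer() keeps them with a null element. A small sketch reproducing the excerpt's row:

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.types import StructType, StructField, LongType, ArrayType, StringType

    spark = SparkSession.builder.getOrCreate()
    schema = StructType([
        StructField("book_id", LongType()),
        StructField("Chapters", ArrayType(StringType())),
    ])
    df = spark.createDataFrame([(865731, [])], schema)

    # explode() removes the row because the array is empty.
    df.select("book_id", F.explode("Chapters").alias("chapter")).show()        # 0 rows
    # explode_outer() keeps the row with a null chapter.
    df.select("book_id", F.explode_outer("Chapters").alias("chapter")).show()  # 1 row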

AttributeError: 'AioClientCreator' object has no attribute '_register_lazy_block_unknown_fips_pseudo_regions'

Recently I have started to use the AWS platform, but when trying to use SageMaker I get the following error, and I don't know if it is because of SageMaker or…

Reuse Spark Session Across Modules/Packages

We are building a reusable data framework using PySpark. As part of this, we had built one big utilities package that hosted all the methods. But now we are planning…
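
A common way to share one session across modules is to route all access through a single helper that calls getOrCreate(), which returns the already-running session if one exists. A sketch of that layout; the module and table names are hypothetical:

    # session.py - one place that owns session creation
    from pyspark.sql import SparkSession

    def get_spark() -> SparkSession:
        # getOrCreate() hands back the existing session when there is one,
        # so every module calling this shares the same SparkSession.
        return SparkSession.builder.appName("data-framework").getOrCreate()

    # orders_module.py - any other package in the framework
    from session import get_spark

    def load_orders():
        spark = get_spark()              # same session, no new context
        return spark.table("orders")     # table name is hypothetical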

Spark agg with multiple collect_list: can we guarantee that values at the same index across columns come from the same row?

Is it possible to ensure that the values at the same index of each collect_set come from a single row of the original dataframe? ("a", 1), ("b", 2)
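
Two independent collect_list/collect_set aggregations give no guarantee that their elements line up with each other; collecting a single list of structs keeps the values of one input row glued together. A sketch of that pattern, with assumed column names:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("k", "a", 1), ("k", "b", 2)], ["key", "c1", "c2"])

    # Collect pairs as structs, then split them back into aligned arrays.
    paired = (df.groupBy("key")
                .agg(F.collect_list(F.struct("c1", "c2")).alias("pairs"))
                .select("key",
                        F.col("pairs.c1").alias("c1_list"),
                        F.col("pairs.c2").alias("c2_list")))
    paired.show(truncate=False)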

pyspark.sql.functions.lit() not nullable conversion [duplicate]

When I create a new column with F.lit(1) and call printSchema(), I get column_name: integer (nullable = false). The lit function docs are quite…
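
A literal can never be null, so Spark marks the column as nullable = false. If a nullable column is needed (for example to match a target schema), one commonly used workaround is a CASE WHEN without an ELSE branch, which is always nullable. A small sketch:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(3)

    # Plain literal: nullable = false.
    df.withColumn("flag", F.lit(1)).printSchema()

    # CASE WHEN with no ELSE: same values, but the column is nullable.
    df.withColumn("flag", F.when(F.lit(True), F.lit(1))).printSchema()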