Category "pyspark"

SPARK SQL create table does not show / read all columns as expected

I am trying to create a table in Spark SQL by providing the schema and the location. However, when I run a SELECT on the table, I see only half the columns. …
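
A minimal sketch of the pattern in question, assuming Parquet files at a hypothetical location (table, column, and path names are placeholders). A frequent cause of missing columns is a declared schema that disagrees with the schema actually stored in the files, so comparing the two is a good first check:

    # Declared schema vs. the schema Spark infers from the files
    spark.sql("""
        CREATE TABLE IF NOT EXISTS my_table (
            id   INT,
            name STRING,
            ts   TIMESTAMP
        )
        USING parquet
        LOCATION 's3://my-bucket/my-table/'
    """)

    # If columns are missing from SELECT, compare against the inferred schema
    spark.read.parquet("s3://my-bucket/my-table/").printSchema()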

'DecisionTreeClassificationModel' object has no attribute 'stages'

    tree = dtModel.stages[-1]
    print(tree)  # visualize the decision tree model

    AttributeError Traceback (most recent call last) Attribute…
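
.stages only exists on a PipelineModel; if dtModel was fit directly from a DecisionTreeClassifier, it already is the tree. A minimal defensive sketch (dtModel is the variable from the question):

    from pyspark.ml import PipelineModel

    # dtModel may be a fitted pipeline or the tree model itself
    if isinstance(dtModel, PipelineModel):
        tree = dtModel.stages[-1]   # last stage of the pipeline
    else:
        tree = dtModel              # already a DecisionTreeClassificationModel

    print(tree.toDebugString)       # text rendering of the decision tree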

pyspark how to collect map keys into list

I have a dataframe with a map column. I want to collect the non-null keys into a new column:
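
A minimal sketch using map_keys, plus map_filter (Spark 3.0+) on the assumption that "non-null" refers to entries whose values are null; the column name m is also an assumption:

    from pyspark.sql import functions as F

    # Drop entries with null values, then take the remaining keys as an array
    result = df.withColumn(
        "keys",
        F.map_keys(F.map_filter("m", lambda k, v: v.isNotNull()))
    )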

Printing secret value in Databricks

Even though secrets are for masking confidential information, I need to see the value of the secret to use it outside Databricks. When I simply print the sec…
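
Databricks replaces a printed secret with [REDACTED]. A commonly cited workaround is to print the characters separated so the masking filter no longer matches; the scope and key names below are placeholders:

    secret = dbutils.secrets.get(scope="my-scope", key="my-key")
    print(secret)            # shows [REDACTED]
    print(" ".join(secret))  # characters separated, so the value is visible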

How to return null in SUM if some values are null?

I have a case where I may have null values in the column that needs to be summed up within a group. If I encounter a null in a group, I want the sum of that group to be null.
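
Spark's sum skips nulls, so one sketch is to compare the non-null count against the total row count per group (the column names grp and val are assumptions):

    from pyspark.sql import functions as F

    # when() without otherwise() yields null, which is exactly what we want
    result = df.groupBy("grp").agg(
        F.when(F.count("val") == F.count(F.lit(1)), F.sum("val"))
         .alias("sum_or_null")
    )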

bucketing with QuantileDiscretizer using groupBy function in pyspark

I have a large dataset like so:

    | SEQ_ID|RESULT|
    +-------+------+
    |3462099|239.52|
    |3462099|239.66|
    |3462099|239.63|
    |3462099|239.64|
    |3462099|239.57|
    |3462099|…
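
QuantileDiscretizer fits one set of splits for the whole dataframe, not per group. A sketch of per-SEQ_ID quantile bucketing with a window function instead (the bucket count of 4 is an assumption):

    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    # ntile assigns each row to one of 4 quantile buckets within its SEQ_ID
    w = Window.partitionBy("SEQ_ID").orderBy("RESULT")
    bucketed = df.withColumn("bucket", F.ntile(4).over(w))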

pyspark recover for an even number the two values of a median

Is there a way in PySpark to recover, when the row count is even, the two middle values of the median? For example, I have this dataframe: df1 = spark.createDataFrame…
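
A sketch that numbers the ordered rows and picks the two middle ones when the count is even (the column name value is an assumption):

    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    n = df1.count()
    w = Window.orderBy("value")

    # For an even count the median straddles rows n/2 and n/2 + 1
    middle_two = (
        df1.withColumn("rn", F.row_number().over(w))
           .filter(F.col("rn").isin(n // 2, n // 2 + 1))
    )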

what is est in filter in the Spark UI SQL tab

I am trying to debug my Spark job, and in the SQL tab of the Spark UI I am getting this red mark on a filter's description; I am trying to figure out what it means. Spark UI s…

Pyspark's df.writeStream generates no output

I'm trying to store the tweets from my Kafka cluster into Elasticsearch. Initially, I set the output format to 'org.elasticsearch.spark.sql'. But it creat…
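
A silent lack of output often comes down to a missing checkpoint location or a query that is never started. A minimal sketch of the Elasticsearch sink (host, checkpoint path, and index name are placeholders):

    query = (
        df.writeStream
          .format("org.elasticsearch.spark.sql")
          .option("checkpointLocation", "/tmp/es-checkpoint")  # required
          .option("es.nodes", "localhost:9200")                # placeholder
          .start("tweets/_doc")                                # placeholder index
    )
    query.awaitTermination()  # keep the driver alive so batches actually run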

Pyspark select multiple columns from list and filter on different values

I have a table with ~5k columns and ~1M rows that looks like this:

    ID   Col1 Col2 Col3 Col4 Col5 Col6 Col7 Col8 Col9 Col10 Col11
    ID1  0    1    0    1    0    2    1    1    2    2     0
    ID2  1…
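
A sketch that selects a subset of columns from a Python list and keeps rows where any of them matches a value (the list contents and the value 2 are assumptions):

    from functools import reduce
    from pyspark.sql import functions as F

    cols = ["Col1", "Col3", "Col7"]  # assumed subset of the ~5k columns

    # OR together one equality test per column
    any_match = reduce(lambda a, b: a | b, [F.col(c) == 2 for c in cols])
    result = df.select(["ID"] + cols).filter(any_match)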

Read and group json files by date element using pyspark

I have multiple JSON files (~10 TB) in an S3 bucket, and I need to organize these files by a date element present in every JSON document. What I think my c…
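
A sketch of the usual read-then-partition pattern (the bucket paths and the name of the date element are placeholders); partitionBy writes one directory per distinct date, which is normally what "organize by date" comes down to:

    df = spark.read.json("s3://source-bucket/raw/")  # placeholder path

    (df.write
       .partitionBy("date")                          # assumed date element
       .mode("overwrite")
       .json("s3://target-bucket/by-date/"))         # placeholder path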

Could not create lake database from synapse notebooks

New to Azure Synapse, I am trying to create a lake database (managed table) from a Synapse notebook. I also added Storage Blob Data Contributor for the Synapse workspace and spec…
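
For reference, a minimal sketch of creating a lake database and a managed table from a Synapse Spark notebook (all names are placeholders); when this fails despite correct SQL, the storage permission is the usual suspect:

    spark.sql("CREATE DATABASE IF NOT EXISTS demo_lake_db")

    (df.write
       .mode("overwrite")
       .saveAsTable("demo_lake_db.my_managed_table"))  # placeholder name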

Unable to write data using spark-submit

When I'm doing spark-submit using this command on Cloudera:

    time spark-submit \
      --deploy-mode client \
      --conf spark.app.name='XXXxxxxxx' \
      --conf spark.master=l…

Dataproc YARN container logs location

I'm aware of the existence of this thread: where are the individual dataproc spark logs? However, if I SSH into a worker node VM and navigate to the /tmp fo…

Pyspark - iterate on a big dataframe

I'm using the following code:

    events_df = []
    for i in df.collect():
        v = generate_event(i)
        events_df.append(v)
    events_df = spark.cr…
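
collect() pulls every row onto the driver, which defeats Spark for a big dataframe. A sketch that keeps the work on the executors, assuming generate_event (the function from the question) returns a Row-like value:

    # Map on the executors instead of looping on the driver
    events_df = df.rdd.map(generate_event).toDF()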

How to process a 64 GB CSV file in Pyspark efficiently?

I have a very large CSV file, nearly 64 GB in size, in blob storage. I need to do some processing on every row and push the data to a DB. What should be th…
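
A sketch of the usual recipe: supply an explicit schema so Spark skips the inference scan over 64 GB, and push to the database with foreachPartition so each partition reuses one connection (the path, schema, and write_batch_to_db helper are hypothetical):

    from pyspark.sql.types import StructType, StructField, StringType

    schema = StructType([StructField("col1", StringType())])  # placeholder

    df = (spark.read
              .schema(schema)  # avoid a full inference pass over the file
              .csv("wasbs://container@account.blob.core.windows.net/big.csv"))

    def push_partition(rows):
        write_batch_to_db(list(rows))  # hypothetical batched DB writer

    df.foreachPartition(push_partition)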

How can we use the multimap_agg function in Spark SQL, and is there an equivalent or alternative function?

Can anyone explain how the multimap_agg function works in SQL and how it can be used in Spark SQL?
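
Spark has no multimap_agg, but the Presto behavior (each key mapped to an array of all its values) can be approximated by aggregating twice; a sketch with assumed column names id, k, and v:

    from pyspark.sql import functions as F

    # First collect the values per key, then assemble key -> values into a map
    result = (
        df.groupBy("id", "k")
          .agg(F.collect_list("v").alias("vals"))
          .groupBy("id")
          .agg(F.map_from_entries(F.collect_list(F.struct("k", "vals")))
                .alias("multimap"))
    )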

How to create child dataframe from xml file using Pyspark?

I have all the supporting libraries in PySpark and I am able to create a dataframe for the parent:

    def xmlReader(root, row, filename):
        df = spark.read.format("…
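
With the spark-xml package, a child dataframe usually comes from exploding the repeated nested element of the parent; a sketch with placeholder tag and file names, assuming <child> is an array under <parent>:

    from pyspark.sql import functions as F

    parent = (spark.read.format("com.databricks.spark.xml")
                  .option("rowTag", "parent")   # placeholder row tag
                  .load("data.xml"))            # placeholder file

    # One row per repeated <child> element, flattened to its own columns
    child = parent.select(F.explode("child").alias("c")).select("c.*")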

Problem with cassandra-connector at "load()"

I successfully downloaded this connector: com.datastax.spark:spark-cassandra-connector_2.11:2.5.1. And when I try to load the information with this line: data = s…
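
For reference, a sketch of the usual load() call for that connector (keyspace and table names are placeholders); load() failures are often a Scala/Spark version mismatch with the _2.11 artifact rather than the call itself:

    data = (spark.read
                .format("org.apache.spark.sql.cassandra")
                .options(table="my_table", keyspace="my_keyspace")
                .load())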

Pyspark to parse multiples dictionaries

Imagine I have the following list of dicts in Python: list = [dict1, dict2, dict3]. I want to parse these dicts, transform them into a dataframe, and save it as…
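
A sketch of going from a list of dicts to a dataframe and writing it out (the dict shape and output path are assumptions; the variable is renamed because list shadows the Python builtin):

    dicts = [
        {"id": 1, "name": "a"},  # assumed shape
        {"id": 2, "name": "b"},
    ]

    df = spark.createDataFrame(dicts)
    df.write.mode("overwrite").parquet("/tmp/out")  # placeholder path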