Category "pyspark"

Pyspark to parse multiple dictionaries

Imagine I have the following list of dicts in Python: list = [dict1, dict2, dict3] I want to parse these dicts, transform them into a dataframe, and save it as
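
A minimal sketch of one way to do this, assuming the dicts share the same keys; the dict contents, output path, and parquet format below are placeholders, not part of the question:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# hypothetical dicts standing in for dict1, dict2, dict3
dicts = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}, {"id": 3, "name": "c"}]

# createDataFrame accepts a list of dicts; each dict becomes one row
df = spark.createDataFrame(dicts)

# write out; the path and format are placeholders
df.write.mode("overwrite").parquet("/tmp/dicts_parquet")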

SparkFatalException root cause

I am using spark 3.0.2 with java 8 version. I am trying to write data to an s3 path using a spark job. I am getting the exception below and am not able to tell what caused thi

py4j.protocol.Py4JJavaError: An error occurred while calling o63.save. : java.lang.NoClassDefFoundError: org/apache/spark/Logging

I am new to Spark and the BigData component HBase. I am trying to write Python code in PySpark that connects to HBase and reads data from it. I'm using the followi

Is there a way to slice dataframe based on index in pyspark?

In python or R, there are ways to slice DataFrame using index. For example, in pandas: df.iloc[5:10,:] Is there a similar way in pyspark to slice data bas
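
Spark DataFrames have no positional index, so a common workaround is to attach one; a minimal sketch, assuming a column named "id" defines the order to slice by:

from pyspark.sql import functions as F, Window

# assumes a column ("id" here) that defines the order you want to slice by
w = Window.orderBy("id")
indexed = df.withColumn("row_idx", F.row_number().over(w) - 1)

# rough equivalent of pandas df.iloc[5:10, :]
sliced = indexed.filter((F.col("row_idx") >= 5) & (F.col("row_idx") < 10)).drop("row_idx")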

Delete multiple rows from a delta table/pyspark data frame given a list of IDs

I need to find a way to delete multiple rows from a delta table/pyspark data frame given a list of IDs to identify the rows. As far as I can tell there isn't a
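
A sketch of one approach using the delta-spark package's DeltaTable API; the table path, the ID values, and the column name "id" are assumptions:

from delta.tables import DeltaTable
from pyspark.sql import functions as F

ids_to_drop = [101, 102, 103]                          # hypothetical ID list

dt = DeltaTable.forPath(spark, "/mnt/delta/my_table")  # path is a placeholder
dt.delete(F.col("id").isin(ids_to_drop))               # column name "id" is an assumption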

AWS EMR s3a filesystem not found

I am running an EMR instance. It was working fine, but it suddenly started giving the error below when I try to access S3 files from a Python Spark script: py4
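
A "filesystem not found" error for s3a usually means the hadoop-aws jar is missing from the classpath; a hedged sketch of one fix, where the package version is only an example and must match the cluster's Hadoop build:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("s3a-check")
    # the hadoop-aws version must match the cluster's Hadoop build; 3.2.1 is only an example
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.2.1")
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    .getOrCreate()
)

df = spark.read.text("s3a://my-bucket/some/key.txt")  # bucket and key are placeholders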

pyspark create dictionary from data in two columns

I have a pyspark dataframe with two columns: [Row(zip_code='58542', dma='MIN'), Row(zip_code='58701', dma='MIN'), Row(zip_code='57632', dma='MIN'), Row(zip_
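
A minimal sketch of one approach, collecting the two columns to the driver and building a plain Python dict; fine for lookup-sized data, not for very large frames:

# collect the two columns, then build {zip_code: dma} on the driver
pairs = df.select("zip_code", "dma").collect()
zip_to_dma = {row["zip_code"]: row["dma"] for row in pairs}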

Is it possible to load multiple directory separately in pyspark but process them in parallel?

I have an S3 or Azure Blob directory structure like the following: parent_dir child_dir1 avro_1 avro_2 ... child_dir2 ... There
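
One possible sketch: issue a single read over all the directories so Spark parallelises the scan itself, and keep the origin of each row via input_file_name(). The bucket paths are placeholders, and the Avro reader assumes the spark-avro package is on the classpath:

from pyspark.sql import functions as F

paths = [
    "s3a://bucket/parent_dir/child_dir1",  # placeholder paths
    "s3a://bucket/parent_dir/child_dir2",
]

# one read over all directories; input_file_name() records which file
# (and hence which directory) each row came from
df = (
    spark.read.format("avro")   # needs the spark-avro package on the classpath
    .load(paths)
    .withColumn("source_file", F.input_file_name())
)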

How to test mocked (moto/boto) S3 read/write in PySpark

I am trying to unit test a function that writes data to S3 and then reads the same data back from the same S3 location. I am trying to use moto and boto (2.x) to ac
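
A rough sketch of one workaround, not a documented recipe: it assumes a moto release that ships standalone server mode (moto[server]) plus boto3, because the JVM-side s3a client cannot see moto's in-process Python mocks; hadoop-aws still has to be on the classpath:

import boto3
from moto.server import ThreadedMotoServer   # assumes a moto version with server mode
from pyspark.sql import SparkSession

# run moto as a local HTTP endpoint and point s3a at it
server = ThreadedMotoServer(port=5000)
server.start()

boto3.client(
    "s3", endpoint_url="http://127.0.0.1:5000",
    aws_access_key_id="testing", aws_secret_access_key="testing",
    region_name="us-east-1",
).create_bucket(Bucket="test-bucket")

spark = (
    SparkSession.builder
    .config("spark.hadoop.fs.s3a.endpoint", "http://127.0.0.1:5000")
    .config("spark.hadoop.fs.s3a.access.key", "testing")
    .config("spark.hadoop.fs.s3a.secret.key", "testing")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)

# round-trip a small frame through the mocked bucket
spark.range(5).write.mode("overwrite").parquet("s3a://test-bucket/out")
roundtrip = spark.read.parquet("s3a://test-bucket/out")

server.stop()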

How to create a new column with a null value using Pyspark DataFrame?

I'm having issues with using pyspark dataframes. I have a column called eventkey which is a concatenation of the following elements: account_type, counter_type
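
A minimal sketch of one way to add a typed null column; the column name and the StringType are assumptions:

from pyspark.sql import functions as F
from pyspark.sql.types import StringType

# lit(None) alone has NullType; casting gives the new column a concrete type
df = df.withColumn("new_col", F.lit(None).cast(StringType()))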

DF.toPandas() - Failed to locate the winutils binary in the hadoop binary path

I am processing a huge text file using PyCharm and PySpark. This is what I am trying to do: spark_home = os.environ.get('SPARK_HOME', None) os.environ["SPARK_HOM
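
On Windows this error usually means HADOOP_HOME is not set before the JVM starts; a sketch under that assumption, where the path is a placeholder and winutils.exe must already exist under its bin directory:

import os

# winutils.exe must live under %HADOOP_HOME%\bin; set this before creating the SparkSession
os.environ["HADOOP_HOME"] = r"C:\hadoop"
os.environ["PATH"] = os.environ["PATH"] + os.pathsep + r"C:\hadoop\bin"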

Search and filter text from a column using Pyspark

I am new to data scraping. I am reading the data from a file with one JSON object per row: {"name": "Soul Sweet \u2018Taters (Step-by-Step!)", "ingredients":
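
A minimal sketch of one approach, assuming one JSON object per line (the file path and search term are placeholders):

from pyspark.sql import functions as F

# one JSON object per line is exactly the format spark.read.json expects
df = spark.read.json("recipes.json")   # path is a placeholder

# keep rows whose name contains a search term; a case-insensitive variant is shown second
matches = df.filter(F.col("name").contains("Taters"))
matches_ci = df.filter(F.lower(F.col("name")).contains("taters"))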

Invalid labels for classification logistic regression model in pyspark databricks

I am using Spark ML library for a classification problem using logistic regression. I have vectorized the input features and created a training dataset and a test datas
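
"Invalid labels" usually points at a label column that is not encoded as 0.0 .. k-1.0; a sketch of one fix with StringIndexer, where the column names and train_df are assumptions:

from pyspark.ml.feature import StringIndexer
from pyspark.ml.classification import LogisticRegression

# StringIndexer maps the raw labels onto 0.0 .. k-1.0, which LogisticRegression expects
indexer = StringIndexer(inputCol="label_raw", outputCol="label")
train_indexed = indexer.fit(train_df).transform(train_df)

lr = LogisticRegression(featuresCol="features", labelCol="label")
model = lr.fit(train_indexed)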

Presto fails to import PARQUET files from S3

I have a presto table that imports PARQUET files based on partitions from s3 as follows: create table hive.data.datadump ( tUnixEpoch varchar, tDateTi

Translate Python to Pyspark

I have Python code that takes all values containing NCO - ETD in Type, grouped by ID and Date: cond = 'NCO - ETD' df_ = (data.where(data.assign(new = data['Type'].str
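
A hedged sketch of a PySpark equivalent; since the original pandas snippet is cut off, the count() aggregation below is only a placeholder for whatever the original groupby produced:

from pyspark.sql import functions as F

cond = "NCO - ETD"

# keep rows whose Type contains the pattern, then group by ID and Date
result = (
    data.filter(F.col("Type").contains(cond))
        .groupBy("ID", "Date")
        .count()   # placeholder aggregation
)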

Finding the middle values with the min distance in pyspark

I need some help please. I have this dataframe with an even number of values for the column 'b': df1 = spark.createDataFrame([ ('c',1), ('c',2), ('c',
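
One possible sketch for picking the two middle rows per group, assuming the two columns are named 'a' and 'b' (the excerpt does not show the column names):

from pyspark.sql import functions as F, Window

# attach a 1-based position ordered by b and a per-group row count,
# then keep the two rows straddling the middle
w_order = Window.partitionBy("a").orderBy("b")
w_group = Window.partitionBy("a")

ranked = (
    df1.withColumn("pos", F.row_number().over(w_order))
       .withColumn("cnt", F.count("*").over(w_group))
)

middle_two = ranked.filter(
    (F.col("pos") == F.col("cnt") / 2) | (F.col("pos") == F.col("cnt") / 2 + 1)
).drop("pos", "cnt")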

How to quickly check if row exists in PySpark Dataframe?

I have a PySpark dataframe like this:

+------+------+
|     A|     B|
+------+------+
|     1|     2|
|     1|     3|
|     2|     3|
|     2|     5|
+------+--
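
A minimal sketch of one fast existence check; the column names and values below are taken from the excerpt, the rest is an assumption:

from pyspark.sql import functions as F

# limit(1) lets Spark stop as soon as one matching row is found
exists = df.filter((F.col("A") == 1) & (F.col("B") == 3)).limit(1).count() > 0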

How to connect Snowflake with PySpark?

I am trying to connect to Snowflake with Pyspark on my local machine. My code is as follows: from pyspark.sql.types import * from pyspark.sql import SparkSessio
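
A hedged sketch of one way to read from Snowflake with the Snowflake Spark connector; every connection value is a placeholder, and the package coordinates are examples that must match your Spark/Scala version:

from pyspark.sql import SparkSession

# the Snowflake Spark connector and JDBC driver must be on the classpath
spark = (
    SparkSession.builder
    .config(
        "spark.jars.packages",
        "net.snowflake:spark-snowflake_2.12:2.9.3-spark_3.1,"
        "net.snowflake:snowflake-jdbc:3.13.14",
    )
    .getOrCreate()
)

sf_options = {  # every value here is a placeholder
    "sfURL": "myaccount.snowflakecomputing.com",
    "sfUser": "my_user",
    "sfPassword": "my_password",
    "sfDatabase": "MY_DB",
    "sfSchema": "PUBLIC",
    "sfWarehouse": "MY_WH",
}

df = (
    spark.read.format("net.snowflake.spark.snowflake")
    .options(**sf_options)
    .option("dbtable", "MY_TABLE")
    .load()
)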

How to find the position of a substring column in another column using PySpark?

If I have a PySpark DataFrame with two columns, text and subtext, where subtext is guaranteed to occur somewhere within text, how would I calculate the positio
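
A minimal sketch of one approach: the Python-level locate()/instr() helpers take the substring as a literal, so expr() is used to pass the subtext column instead (the output column name is an assumption):

from pyspark.sql import functions as F

# instr returns a 1-based position of subtext within text
df = df.withColumn("subtext_pos", F.expr("instr(text, subtext)"))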