Category "pyspark"

How to test mocked (moto/boto) S3 read/write in PySpark

I am trying to unit-test a function that writes data to S3 and then reads the same data back from the same S3 location. I am trying to use moto and boto (2.x) to ac
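
A minimal sketch of the usual moto pattern (the bucket name, key, and test body are made up; moto 5+ renames the decorator to mock_aws):

```python
import boto3
from moto import mock_s3  # in moto >= 5 this is `from moto import mock_aws`

@mock_s3
def test_s3_roundtrip():
    # Inside the decorator, boto3 talks to moto's in-memory S3, not real AWS.
    s3 = boto3.client("s3", region_name="us-east-1")
    s3.create_bucket(Bucket="test-bucket")  # hypothetical bucket
    s3.put_object(Bucket="test-bucket", Key="data.csv", Body=b"a,b\n1,2\n")
    body = s3.get_object(Bucket="test-bucket", Key="data.csv")["Body"].read()
    assert body == b"a,b\n1,2\n"
```

Note that Spark's s3a:// reads go through the JVM rather than boto, so the decorator alone does not intercept them; running moto in standalone server mode and pointing fs.s3a.endpoint at it is the usual workaround.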

How to create a new column with a null value using Pyspark DataFrame?

I'm having issues using PySpark DataFrames. I have a column called eventkey which is a concatenation of the following elements: account_type, counter_type
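
The standard trick is lit(None) with an explicit cast, since a bare null literal has no usable type; a minimal sketch (data and column names are made up):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("premium|clicks",), ("basic|views",)], ["eventkey"])

# lit(None) alone has NullType; casting gives the new column a concrete type.
df = df.withColumn("new_col", F.lit(None).cast("string"))
df.show()
```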

DF.toPandas() - Failed to locate the winutils binary in the hadoop binary path

I am processing a huge text file using PyCharm and PySpark. This is what I am trying to do: spark_home = os.environ.get('SPARK_HOME', None) os.environ["SPARK_HOM
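
The winutils error is Windows-specific: Spark's Hadoop layer expects HADOOP_HOME to point at a directory containing bin\winutils.exe. A sketch of the usual fix, assuming winutils.exe was placed under a made-up C:\hadoop\bin:

```python
import os

# Point Hadoop at a local folder containing bin\winutils.exe (path is an assumption).
os.environ["HADOOP_HOME"] = r"C:\hadoop"
os.environ["PATH"] += os.pathsep + r"C:\hadoop\bin"
```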

Search and filter text from a column using Pyspark

I am new to data scraping. I am reading data from a file that has one JSON object per row: {"name": "Soul Sweet \u2018Taters (Step-by-Step!)", "ingredients":
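
Since the file holds one JSON object per line, spark.read.json handles it directly; a sketch with a made-up file name and search term:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Each input line is parsed as one JSON object; its keys become columns.
df = spark.read.json("recipes.jsonl")  # hypothetical path

# Keep only rows whose name column contains the search term.
df.filter(F.col("name").contains("Taters")).show(truncate=False)
```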

Invalid labels for classification logistic regression model in pyspark databricks

I am using the Spark ML library for a classification problem, using logistic regression. I have vectorized the input features and created a training dataset and a test datas
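
Spark ML's LogisticRegression expects labels to be doubles 0.0, 1.0, ..., which is the usual cause of invalid-label errors; a sketch using StringIndexer to produce such labels (data and column names are made up):

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import StringIndexer
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.getOrCreate()
train_df = spark.createDataFrame(
    [("yes", Vectors.dense([0.0, 1.1])), ("no", Vectors.dense([2.0, 1.0]))],
    ["label_raw", "features"],
)

# StringIndexer maps raw labels to 0.0, 1.0, ... as the classifier requires.
indexer = StringIndexer(inputCol="label_raw", outputCol="label")
train_indexed = indexer.fit(train_df).transform(train_df)

model = LogisticRegression(featuresCol="features", labelCol="label").fit(train_indexed)
```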

Presto fails to import PARQUET files from S3

I have a Presto table that imports Parquet files based on partitions from S3, as follows: create table hive.data.datadump ( tUnixEpoch varchar, tDateTi
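
The Presto side is plain SQL, but on the PySpark side the files need to be laid out as Hive-style partition directories for the hive connector to pick them up; a sketch with a made-up bucket and columns:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("1651363200", "2022-05-01")], ["tUnixEpoch", "tDate"])

# partitionBy creates tDate=.../ subdirectories, the layout Presto/Hive expects.
df.write.partitionBy("tDate").parquet("s3a://my-bucket/datadump/")  # bucket is an assumption
```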

Translate Python to Pyspark

I have Python code that takes all values containing 'NCO - ETD' in Type, with a groupby on ID and Date: cond = 'NCO - ETD' df_ = (data.where(data.assign(new = data['Type'].str
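
The pandas original is cut off, so the intended aggregation is a guess; a sketch of the PySpark shape of "filter rows whose Type contains the marker, then group by ID and Date", with count() standing in for the real aggregation:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
data = spark.createDataFrame(
    [(1, "2022-01-01", "NCO - ETD x"), (1, "2022-01-01", "other")],
    ["ID", "Date", "Type"],
)

cond = "NCO - ETD"
df_ = (
    data.filter(F.col("Type").contains(cond))  # pandas .str.contains equivalent
        .groupBy("ID", "Date")
        .count()                               # placeholder aggregation
)
df_.show()
```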

finding the middle values with the min distance in pyspark

I need some help, please. I have this dataframe with an even number of values for the column 'b': df1 = spark.createDataFrame([ ('c',1), ('c',2), ('c',
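
The excerpt is cut off, so exactly what "min distance" means here is unclear; a sketch that pulls the two middle values of 'b' per group, assuming an even group size:

```python
from pyspark.sql import SparkSession, functions as F, Window

spark = SparkSession.builder.getOrCreate()
df1 = spark.createDataFrame([("c", 1), ("c", 2), ("c", 3), ("c", 4)], ["a", "b"])

# Rank rows within each group, count the group size, keep the two middle ranks.
w = Window.partitionBy("a").orderBy("b")
ranked = (df1.withColumn("rn", F.row_number().over(w))
             .withColumn("n", F.count(F.lit(1)).over(Window.partitionBy("a"))))
middle = ranked.filter(
    (F.col("rn") == F.col("n") / 2) | (F.col("rn") == F.col("n") / 2 + 1)
)
middle.show()
```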

How to quickly check if row exists in PySpark Dataframe?

I have a PySpark dataframe like this: +------+------+ | A| B| +------+------+ | 1| 2| | 1| 3| | 2| 3| | 2| 5| +------+--
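
A sketch of the usual fast existence check: head(1) lets Spark stop after the first match instead of counting every row:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 2), (1, 3), (2, 3), (2, 5)], ["A", "B"])

# head(1) returns at most one Row, so Spark can short-circuit the scan.
exists = len(df.filter((F.col("A") == 1) & (F.col("B") == 2)).head(1)) > 0
print(exists)  # True
```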

How to connect Snowflake with PySpark?

I am trying to connect to Snowflake with Pyspark on my local machine. My code is as follows: from pyspark.sql.types import * from pyspark.sql import SparkSessio
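
A sketch of the standard Spark-Snowflake connector usage; every connection value below is a placeholder, and the connector plus the Snowflake JDBC driver must be on the classpath (e.g. via --packages):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

sf_options = {  # all values here are placeholders
    "sfURL": "myaccount.snowflakecomputing.com",
    "sfUser": "USER",
    "sfPassword": "***",
    "sfDatabase": "DB",
    "sfSchema": "PUBLIC",
    "sfWarehouse": "WH",
}

df = (spark.read
      .format("net.snowflake.spark.snowflake")  # the connector's source name
      .options(**sf_options)
      .option("dbtable", "MY_TABLE")            # placeholder table
      .load())
```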

How to find the position of a substring column in another column using PySpark?

I have a PySpark DataFrame with two columns, text and subtext, where subtext is guaranteed to occur somewhere within text. How would I calculate the positio
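
F.locate and F.instr only accept a literal substring in the Python API, so for a column-against-column search the usual workaround is expr(); a minimal sketch:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("where is it", "is"), ("hello world", "world")], ["text", "subtext"]
)

# SQL instr(str, substr) accepts two columns; result is 1-based, 0 if absent.
df = df.withColumn("position", F.expr("instr(text, subtext)"))
df.show()
```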

pyspark: hours diff between two dates columns

I would like to calculate the number of hours between two date columns in PySpark. I could only find how to calculate the number of days between dates. dfs_4.show()
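
One common approach is to cast both timestamps to epoch seconds and divide; a sketch with made-up columns:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("2023-01-01 00:00:00", "2023-01-02 06:00:00")], ["start", "end"]
)

# Casting a timestamp to long gives epoch seconds; 3600 seconds per hour.
df = df.withColumn(
    "hours_diff",
    (F.col("end").cast("timestamp").cast("long")
     - F.col("start").cast("timestamp").cast("long")) / 3600,
)
df.show()  # 30.0 hours for the sample row
```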

Modify date (month) in spark date column based on condition

I would like to modify the date column in my Spark df to subtract 1 month, but only if certain months appear, i.e. only if the date is yyyy-07-31 or yyyy-04-30: chang
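
A sketch of the conditional month shift described, using when/otherwise and add_months (the column name is an assumption):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("2021-07-31",), ("2021-01-15",)], ["date"])
df = df.withColumn("date", F.col("date").cast("date"))

# Subtract one month only for the two month-end dates named in the question.
df = df.withColumn(
    "date",
    F.when(F.date_format("date", "MM-dd").isin("07-31", "04-30"),
           F.add_months("date", -1))
     .otherwise(F.col("date")),
)
df.show()  # 2021-07-31 becomes 2021-06-30; 2021-01-15 is untouched
```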

Py4JJavaError when trying to write pyspark DataFrame to parquet

I wanted to convert a large .csv file into .parquet format using PySpark. I am using Python 3. I tried changing the codec used for compression, as suggested in
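
Py4JJavaError is only the Py4J wrapper; the actual cause sits in the Java stack trace beneath it (often memory pressure or a broken Hadoop/winutils setup). For reference, the conversion itself reduces to this sketch (paths are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read the CSV with a header, let Spark infer column types, then write Parquet.
df = spark.read.option("header", True).option("inferSchema", True).csv("input.csv")
df.write.mode("overwrite").parquet("output.parquet")
```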

How to create a CSV file with PySpark?

I have a short question about pyspark write. read_jdbc = spark.read \ .format("jdbc") \ .option("url", "jdbc:postgresql:dbserver") \ .option("dbtabl
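
Assuming read_jdbc is the DataFrame loaded in the question, writing it out as CSV is one call; note that csv() writes a directory of part files, and coalesce(1) is only advisable for small data:

```python
# `read_jdbc` is the DataFrame from the question; the output path is a placeholder.
(read_jdbc
 .coalesce(1)                 # single part-*.csv file; a bottleneck for big data
 .write
 .option("header", True)
 .mode("overwrite")
 .csv("/tmp/out_csv"))        # produces a directory, not a single .csv file
```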

Using pyspark in Google Colab

This is my first question here after using a lot of StackOverflow, so correct me if I give inaccurate or incomplete info. Up until this week I had a Colab notebo
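
These days the simplest Colab setup is the pyspark package from PyPI, which bundles Spark itself; a sketch of a Colab cell:

```python
# Colab cell: the PyPI pyspark package ships Spark, so no manual tgz download is needed.
!pip install -q pyspark

from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()
```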

When to cache a DataFrame?

My question is: when should I call dataframe.cache(), and when is it useful? Also, should I cache the dataframes at the commented lines in my code? Note: My datafr
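
Roughly: cache() pays off when the same DataFrame feeds more than one action; with a single action it only costs memory. A sketch (the path is a placeholder):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("big.parquet")  # placeholder path

# Worth caching only because two separate actions reuse df below.
df.cache()
df.filter("a > 0").count()        # first action computes df and fills the cache
df.groupBy("a").count().show()    # second action reads from the cache
```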

Error while installing Spark on Google Colab

I am getting an error while installing Spark on Google Colab. It says: tar: spark-2.2.1-bin-hadoop2.7.tgz: Cannot open: No such file or directory tar: Error
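
That tar message usually means the download step failed, so the .tgz never existed locally; older Spark releases move off the main mirrors to archive.apache.org. A sketch of the Colab cells, assuming that archive URL:

```python
# Old releases live on archive.apache.org, not the rotating download mirrors.
!wget -q https://archive.apache.org/dist/spark/spark-2.2.1/spark-2.2.1-bin-hadoop2.7.tgz
!tar xf spark-2.2.1-bin-hadoop2.7.tgz
```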

No module named 'pyspark.streaming.kafka' even with older spark version

In another, similar question, they hint to 'install older spark 2.4.5.' EDIT: the solution from the above link says 'install spark 2.4.5 and it does have kafkautils'. Bu
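
pyspark.streaming.kafka (the old DStream KafkaUtils) exists only up to Spark 2.4.x; from Spark 3.0 onward the supported path is Structured Streaming's Kafka source. A sketch with a placeholder broker and topic, assuming the spark-sql-kafka-0-10 package is on the classpath:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Needs --packages org.apache.spark:spark-sql-kafka-0-10_2.12:<spark version>.
stream = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")  # placeholder broker
          .option("subscribe", "my-topic")                      # placeholder topic
          .load())
```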

Pyspark throwing error while trying to read parquet

I am a newbie in PySpark. While trying to read a parquet file through PySpark I get the error below. I have tried various things, like reinstallation of the jre and jd
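
For reference, the read itself is minimal; if this sketch fails, the part worth posting is the Java traceback under the Py4J error (a Java/Spark version mismatch is a common culprit):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("data.parquet")  # placeholder path
df.printSchema()
```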