Category "pyspark"

pyspark: hours diff between two date columns

I would like to calculate the number of hours between two date columns in pyspark. I could only find how to calculate the number of days between the dates. dfs_4.show()
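A minimal sketch of one common approach, assuming the two columns (here named start_ts and end_ts, which are assumptions) hold timestamps or parseable timestamp strings: convert both to Unix seconds with unix_timestamp and divide the difference by 3600.

    from pyspark.sql import SparkSession
    import pyspark.sql.functions as F

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical example data; column names are assumptions.
    df = spark.createDataFrame(
        [("2021-01-01 00:00:00", "2021-01-02 06:30:00")],
        ["start_ts", "end_ts"],
    )

    # unix_timestamp() returns seconds since the epoch, so the
    # difference divided by 3600 is the gap in hours.
    df.withColumn(
        "hours_diff",
        (F.unix_timestamp("end_ts") - F.unix_timestamp("start_ts")) / 3600,
    ).show()  # hours_diff = 30.5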

Modify date (month) in spark date column based on condition

I would like to modify my date column in a Spark df to subtract 1 month, but only if certain months appear, i.e. only if the date is yyyy-07-31 or yyyy-04-30 chang
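A hedged sketch of one way to do this with when() and add_months(), assuming a DateType column named date; note that add_months snaps to the last valid day of a shorter month, so yyyy-07-31 minus one month becomes yyyy-06-30.

    import pyspark.sql.functions as F

    # Column name "date" is an assumption.
    df = df.withColumn(
        "date",
        F.when(
            ((F.month("date") == 7) & (F.dayofmonth("date") == 31))
            | ((F.month("date") == 4) & (F.dayofmonth("date") == 30)),
            F.add_months("date", -1),  # e.g. 2020-07-31 -> 2020-06-30
        ).otherwise(F.col("date")),
    )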

Py4JJavaError when trying to write pyspark DataFrame to parquet

I wanted to convert a large .csv file into .parquet format using pyspark. I am using python 3. I tried changing the codec used for compression, as suggested in
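For reference, a minimal sketch of the conversion itself with the compression codec set explicitly (paths are placeholders); a Py4JJavaError on a large write is often a driver/executor memory problem rather than a codec one.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Paths are assumptions; inferSchema adds an extra pass over the data.
    df = spark.read.csv("/path/to/large.csv", header=True, inferSchema=True)

    # Snappy is the default parquet codec; setting it explicitly helps
    # while debugging codec-related suggestions.
    df.write.option("compression", "snappy").parquet("/path/to/output_parquet")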

How to create a CSV file with PySpark?

I have a short question about pyspark write. read_jdbc = spark.read \ .format("jdbc") \ .option("url", "jdbc:postgresql:dbserver") \ .option("dbtabl
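A hedged sketch of the write side, continuing from the question's read_jdbc DataFrame; the output path is an assumption. Note that csv() writes a directory of part files, so coalesce(1) is only needed if a single file is required.

    # Write the JDBC result out as CSV; path and options are assumptions.
    read_jdbc.coalesce(1) \
        .write \
        .option("header", True) \
        .mode("overwrite") \
        .csv("/tmp/output_csv")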

Using pyspark in Google Colab

This is my first question here after using a lot of StackOverflow, so correct me if I give inaccurate or incomplete info. Up until this week I had a colab notebo
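On current Colab images, a minimal sketch is usually just a pip install, since the PyPI pyspark package bundles its own Spark runtime (no manual tgz download needed):

    # In a Colab cell:
    !pip install -q pyspark

    from pyspark.sql import SparkSession
    spark = SparkSession.builder.master("local[*]").getOrCreate()
    spark.range(5).show()  # smoke test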

When to cache a DataFrame?

My question is: when should I do dataframe.cache(), and when is it useful? Also, should I cache the dataframes at the commented lines in my code? Note: My datafr
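As a rule of thumb, cache() pays off when the same DataFrame feeds more than one action; otherwise each action recomputes the full lineage. A hedged sketch (the path and columns are assumptions):

    from pyspark.sql import SparkSession
    import pyspark.sql.functions as F

    spark = SparkSession.builder.getOrCreate()

    df = spark.read.parquet("/data/events")                # path is an assumption
    filtered = df.filter(F.col("status") == "ok").cache()  # reused twice below

    filtered.count()                             # 1st action materializes the cache
    filtered.groupBy("user_id").count().show()   # 2nd action reads from the cache

    filtered.unpersist()                         # release memory when done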

Error while installing Spark on Google Colab

I am getting an error while installing Spark on Google Colab. It says tar: spark-2.2.1-bin-hadoop2.7.tgz: Cannot open: No such file or directory tar: Error
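That tar message usually means the download step failed before tar ran, so the .tgz never arrived. A sketch, assuming the Apache archive still hosts this version (the URL is worth verifying first):

    # In a Colab cell; -q silences wget's progress output.
    !wget -q https://archive.apache.org/dist/spark/spark-2.2.1/spark-2.2.1-bin-hadoop2.7.tgz
    !tar xf spark-2.2.1-bin-hadoop2.7.tgz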

No module named 'pyspark.streaming.kafka' even with older spark version

In another similar question, they hint 'install older spark 2.4.5.' EDIT: the solution from the above link says 'install spark 2.4.5 and it does have kafkautils. Bu
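For context, pyspark.streaming.kafka belongs to the old DStream API, which shipped through Spark 2.4.x and was removed in 3.x. A sketch of the 2.4.x import, assuming the matching spark-streaming-kafka package is on the classpath:

    # Works on Spark 2.4.x only; this module does not exist in Spark 3.x.
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils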

Pyspark throwing error while trying to read parquet

I am a newbie in pyspark. While trying to read a parquet file through pyspark I get the error below. I have tried various things like reinstalling the jre and jd

How to convert from Pandas' DatetimeIndex to DataFrame in PySpark?

I have the following code: # Get the min and max dates minDate, maxDate = df2.select(f.min("MonthlyTransactionDate"), f.max("MonthlyTransactionDate")).first()
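A hedged sketch of one way to turn that range into a Spark DataFrame: build a pandas date_range between the two bounds, wrap it in a pandas DataFrame, and hand it to createDataFrame (the month-start frequency is an assumption):

    import pandas as pd

    # freq="MS" (month start) is an assumption about the desired granularity;
    # minDate/maxDate come from the question's first() call above.
    dates = pd.date_range(minDate, maxDate, freq="MS")
    pdf = pd.DataFrame({"MonthlyTransactionDate": dates})

    sdf = spark.createDataFrame(pdf)  # datetime64 becomes a Spark timestamp
    sdf.show()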

Pyspark Fetching MongoDB records using MongoConnector and Where Clause

I'm trying to read from MongoDB using this guide df = spark.read.format("com.mongodb.spark.sql.DefaultSource").load() df = df.select(['my_cols']) df = df.where('date
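A sketch of the same flow with the filter expressed as a column predicate instead of a raw string; the URI, column, and cutoff date are assumptions. The Mongo Spark connector can push simple predicates like this down to MongoDB.

    import pyspark.sql.functions as F

    df = (spark.read
          .format("com.mongodb.spark.sql.DefaultSource")
          .option("uri", "mongodb://host:27017/db.collection")  # assumed URI
          .load())

    df = (df.select("my_cols")
            .where(F.col("date") >= "2021-01-01"))  # cutoff is an assumption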

Py4JJavaError in an Azure Databricks notebook pipeline

I have a curious issue when launching a databricks notebook from a caller notebook through dbutils.notebook.run (I am working in Azure Databricks). One intere

Provider com.google.cloud.spark.bigquery.BigQueryRelationProvider could not be instantiated while reading from bigquery in Jupyter lab

I have followed this post pyspark error reading bigquery: java.lang.ClassNotFoundException: org.apache.spark.internal.Logging$class and applied the resolution
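That provider error is usually a Spark/Scala/connector version mismatch. A hedged sketch, assuming Spark 3.x on Scala 2.12; the artifact version below is an assumption and should be matched to the cluster:

    from pyspark.sql import SparkSession

    # Connector version is an assumption; pick the one built for your
    # Spark/Scala combination.
    spark = (SparkSession.builder
             .config("spark.jars.packages",
                     "com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.36.1")
             .getOrCreate())

    df = (spark.read.format("bigquery")
          .option("table", "bigquery-public-data.samples.shakespeare")
          .load())
    df.show(5)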

Unable to register with external shuffle server. Failed to connect on standalone Spark cluster

I followed the Dynamic allocation setup configuration; however, I am getting the following error when starting the executors. ERROR TaskSchedulerImpl: Lost execu
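On a standalone cluster, the external shuffle service must actually be running on each worker, not just enabled in the application config. A hedged config sketch:

    # Application side (spark-defaults.conf or --conf flags):
    spark.dynamicAllocation.enabled  true
    spark.shuffle.service.enabled    true

    # Worker side: per the Spark docs, standalone workers start their own
    # shuffle service when this is set before the worker launches, e.g. in
    # conf/spark-env.sh:
    # SPARK_WORKER_OPTS="-Dspark.shuffle.service.enabled=true"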

ValueError: RDD is empty - Pyspark (Windows Standalone)

I am trying to create an RDD, but Spark is not creating it and throws back the error pasted below: data = records.map(lambda r: LabeledPoint(extract_label(r), extract_
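A sketch of the same mapping with an emptiness guard; extract_label and extract_features are the question's own helpers and are not defined here. "RDD is empty" typically means records produced zero rows, so check the input upstream first.

    from pyspark.mllib.regression import LabeledPoint

    data = records.map(
        lambda r: LabeledPoint(extract_label(r), extract_features(r))
    )

    # Guard before training; an empty source RDD is the usual culprit.
    if data.isEmpty():
        raise ValueError("`records` produced no rows; check the input path/filters")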

exclusions doesn't work in AWS Glue ETL job s3 connection

According to the AWS Glue documentation, we can use exclusions to exclude files when the connection type is s3: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-
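A sketch of the documented shape of the option, with the bucket and patterns as placeholders; note that the exclusions value is a JSON-encoded array passed as a string, which is an easy place to go wrong:

    # Inside a Glue job script with an existing glueContext.
    dyf = glueContext.create_dynamic_frame.from_options(
        connection_type="s3",
        connection_options={
            "paths": ["s3://my-bucket/data/"],               # placeholder bucket
            "exclusions": '["**.csv", "**/_temporary/**"]',  # JSON array as a string
        },
        format="parquet",
    )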

Why do I get a TypeError: cannot pickle '_thread.RLock' object when using pyspark

I'm using spark to deal with my data, like that: dataframe_mysql = spark.read.format('jdbc').options( url='jdbc:mysql://xxxxxxx',
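The usual cause of that pickle error is a closure that captures the SparkSession, SparkContext, or a DataFrame, all of which hold unpicklable locks. A minimal sketch of the failing shape and a working one:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    rdd = spark.sparkContext.parallelize([1, 2, 3])

    # Fails: the lambda closes over `spark`, which contains an RLock.
    # rdd.map(lambda x: spark.createDataFrame([(x,)])).collect()

    # Works: ship only plain values and functions to the executors.
    print(rdd.map(lambda x: x * 2).collect())  # [2, 4, 6]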

Compare two dataframes Pyspark

I'm trying to compare two data frames which have the same number of columns, i.e. 4 columns with id as the key column in both data frames. df1 = spark.read.csv("/path/to/
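A hedged sketch of two common approaches, assuming both frames share the same schema and id is the key:

    # Row-level diff: rows present in one frame but not the other
    # (exceptAll keeps duplicates; available since Spark 2.4).
    only_in_df1 = df1.exceptAll(df2)
    only_in_df2 = df2.exceptAll(df1)

    # Key-level diff: df1 rows whose id does not appear in df2 at all.
    missing_ids = df1.join(df2, on="id", how="left_anti")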

Join two dataframes using the closest timestamp pyspark

So I am very new to pyspark and still unable to correctly create my own query. I try googling my problems, but I just don't understand how most of this work
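One common pattern, sketched under assumptions (a shared key id and a timestamp column ts in both frames): join on the key, rank the pairs by absolute time difference, and keep the closest row.

    from pyspark.sql import Window
    import pyspark.sql.functions as F

    # Column names are assumptions; rename df2's timestamp to avoid a clash.
    joined = df1.join(df2.withColumnRenamed("ts", "ts2"), on="id")
    joined = joined.withColumn(
        "abs_diff",
        F.abs(F.unix_timestamp("ts") - F.unix_timestamp("ts2")),
    )

    # For each (id, ts) in df1, keep the df2 row with the smallest gap.
    w = Window.partitionBy("id", "ts").orderBy("abs_diff")
    closest = (joined.withColumn("rn", F.row_number().over(w))
                     .filter(F.col("rn") == 1)
                     .drop("rn", "abs_diff"))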

installing pyspark on windows

I have a few questions which I would like to clarify before installation. Please bear with me, as I am still new to data science and installing packages. 1)
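For a plain local setup, a minimal sketch: install a JDK, then pip install pyspark, which bundles Spark itself (on Windows, a winutils.exe matching the bundled Hadoop version may additionally be needed for some filesystem operations):

    # After `pip install pyspark` in a shell with JAVA_HOME set:
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").getOrCreate()
    print(spark.version)  # smoke test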