Category "apache-spark"

Modify date (month) in spark date column based on condition

I would like to modify my date column in spark df to subtract 1 month only if certain months appear. I.e. only if date is yyyy-07-31 or date is yyyy-04-30 chang

Remove special character from a column in dataframe

I am trying to remove a special character (å) from a column in a dataframe. My data looks like: ClientID,PatientID AR0001å,DH_HL704221157198295_9

Remove special character from a column in dataframe

I am trying to remove a special character (å) from a column in a dataframe. My data looks like: ClientID,PatientID AR0001å,DH_HL704221157198295_9

How do I interpret Input size / records in Spark Stage UI

I'm looking at the Spark UI (Spark v1.6.0) for a stage of a job I'm currently running and I don't understand how to interpret what its telling me: The number o

How to create Dataframe form presto db table of Array Data type column using spark

I am trying to create spark Dataframe from presto db table which has few columns as Array DataType. I tried multiple ways but I am getting same exception java.s

Spark Catalog w/ AWS Glue: database not found

Ive created an EMR cluster with the Glue Data catalog. When I invoke the spark-shell, I am able to successfully list tables stored within a Glue database via s

how to use bm25 in spark

I have more than 1 million documents to search, and more than 100,000 keywords. Each keyword needs to search 10 most similar documents in the offline way. So ho

How to create a CSV file with PySpark?

I have a short question about pyspark write. read_jdbc = spark.read \ .format("jdbc") \ .option("url", "jdbc:postgresql:dbserver") \ .option("dbtabl

When to cache a DataFrame?

My question is - when should I do dataframe.cache() and when it's useful? Also, in my code should I cache the dataframes in the commented lines? Note: My datafr

Error while installing Spark on Google Colab

I am getting error while installing spark on Google Colab. It says tar: spark-2.2.1-bin-hadoop2.7.tgz: Cannot open: No such file or directory tar: Error

spark-streaming-kafka-0-8 vs spark-streaming-kafka-0-10

I am a new beginner in the big data field, I need to make a demo which streams data from Kafka topic using spark stream then make some aggregation and filtering

Unable to start spark-shell failing to submit spark-submit

I am trying to submit spark-submit but its failing with as weird message. Error: Could not find or load main class org.apache.spark.launcher.Main /opt/spark/b

Spark executor metrics don't reach prometheus sink

Circumstances: I have read through these:https://spark.apache.org/docs/3.1.2/monitoring.html https://dzlab.github.io/bigdata/2020/07/03/spark3-monitoring-1/ ver

No module named 'pyspark.streaming.kafka' even with older spark version

In another similar question, they hint 'install older spark 2.4.5.' EDIT: the solution from above link says 'install spark 2.4.5 and it does have kafkautils. Bu

dataframe Spark scala explode json array

Let's say I have a dataframe which looks like this: +--------------------+--------------------+--------------------------------------------------------------+

How to escape ( parentheses in Spark Scala?

I am trying to replace parentheses in a string (i.e. column names). It is working fine with white spaces but not with ( parentheses. I tried """, \(, \\( but I

Spark 3.0 is much slower to read json files than Spark 2.4

I have large amount of json files that Spark can read in 36 seconds but Spark 3.0 takes almost 33 minutes to read the same. On closer analysis, looks like Spark

Pyspark throwing error while trying to read parquet

I am a newbie in pyspark, While trying to read parquet file through pyspark I get the below error. I have tried various things like reinstallation of jre and jd

How to convert from Pandas' DatetimeIndex to DataFrame in PySpark?

I have the following code: # Get the min and max dates minDate, maxDate = df2.select(f.min("MonthlyTransactionDate"), f.max("MonthlyTransactionDate")).first()

Pyspark Fetching MongoDB records using MongoConnector and Where Clause

I'm trying to read MongoDB using this guide df = spark.read.format("com.mongodb.spark.sql.DefaultSource").load() df = df.select(['my_cols']) df = df.where('date