Category "apache-spark"

How to create Dataframe form presto db table of Array Data type column using spark

I am trying to create spark Dataframe from presto db table which has few columns as Array DataType. I tried multiple ways but I am getting same exception java.s

Spark Catalog w/ AWS Glue: database not found

Ive created an EMR cluster with the Glue Data catalog. When I invoke the spark-shell, I am able to successfully list tables stored within a Glue database via s

how to use bm25 in spark

I have more than 1 million documents to search, and more than 100,000 keywords. Each keyword needs to search 10 most similar documents in the offline way. So ho

How to create a CSV file with PySpark?

I have a short question about pyspark write. read_jdbc = spark.read \ .format("jdbc") \ .option("url", "jdbc:postgresql:dbserver") \ .option("dbtabl

When to cache a DataFrame?

My question is - when should I do dataframe.cache() and when it's useful? Also, in my code should I cache the dataframes in the commented lines? Note: My datafr

Error while installing Spark on Google Colab

I am getting error while installing spark on Google Colab. It says tar: spark-2.2.1-bin-hadoop2.7.tgz: Cannot open: No such file or directory tar: Error

spark-streaming-kafka-0-8 vs spark-streaming-kafka-0-10

I am a new beginner in the big data field, I need to make a demo which streams data from Kafka topic using spark stream then make some aggregation and filtering

Unable to start spark-shell failing to submit spark-submit

I am trying to submit spark-submit but its failing with as weird message. Error: Could not find or load main class org.apache.spark.launcher.Main /opt/spark/b

Spark executor metrics don't reach prometheus sink

Circumstances: I have read through these:https://spark.apache.org/docs/3.1.2/monitoring.html https://dzlab.github.io/bigdata/2020/07/03/spark3-monitoring-1/ ver

No module named 'pyspark.streaming.kafka' even with older spark version

In another similar question, they hint 'install older spark 2.4.5.' EDIT: the solution from above link says 'install spark 2.4.5 and it does have kafkautils. Bu

dataframe Spark scala explode json array

Let's say I have a dataframe which looks like this: +--------------------+--------------------+--------------------------------------------------------------+

How to escape ( parentheses in Spark Scala?

I am trying to replace parentheses in a string (i.e. column names). It is working fine with white spaces but not with ( parentheses. I tried """, \(, \\( but I

Spark 3.0 is much slower to read json files than Spark 2.4

I have large amount of json files that Spark can read in 36 seconds but Spark 3.0 takes almost 33 minutes to read the same. On closer analysis, looks like Spark

Pyspark throwing error while trying to read parquet

I am a newbie in pyspark, While trying to read parquet file through pyspark I get the below error. I have tried various things like reinstallation of jre and jd

How to convert from Pandas' DatetimeIndex to DataFrame in PySpark?

I have the following code: # Get the min and max dates minDate, maxDate = df2.select(f.min("MonthlyTransactionDate"), f.max("MonthlyTransactionDate")).first()

Pyspark Fetching MongoDB records using MongoConnector and Where Clause

I'm trying to read MongoDB using this guide df = spark.read.format("com.mongodb.spark.sql.DefaultSource").load() df = df.select(['my_cols']) df = df.where('date

Calling Kubernetes Spark Operator with Java api

There is a good of examples of creating Spark jobs using the Kubernetes Spark Operator and simply submitting a request with the following kubectl apply -f spa

Spark streamming take long time read from kafka

I build a cluster use CDH5.14.2, includes 5 nodes, each node has 130G momery and 40 cpu cores. I builded the spark streamming application to read from multiple

Trigger IF Statement only when two Spark dataframe meet the conditions

I have two identical Spark DataFrame. They have the same columns. I am trying to create a IF-Else statement in one line but couldnt find a better way to do it.

KernelRestarter: restart failed in jupyter , Kernel died

[I 10:43:53.627 NotebookApp] 启动notebooks 在本地路径: /opt/soft/recommender/jupyter [I 10:43:53.627 NotebookApp]