Category "apache-spark"

Spark History Server not listing completed jobs

I'm running Spark standalone jobs on Windows. I would like to monitor my Spark jobs using the Spark history server. I have launched the Spark history server with be
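
For the history server to list completed applications, event logging generally has to be enabled and both the application and the history server must point at the same log directory. A minimal sketch, assuming a local file:// event-log directory on Windows (the path is a placeholder and must exist before jobs run):

```
# spark-defaults.conf (directory is a placeholder)
spark.eventLog.enabled           true
spark.eventLog.dir               file:///C:/tmp/spark-events
spark.history.fs.logDirectory    file:///C:/tmp/spark-events

# launch on Windows (from %SPARK_HOME%\bin):
#   spark-class.cmd org.apache.spark.deploy.history.HistoryServer
```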

Spark RDD: Find the single row that has the highest count and, for that row, report the month, count, and hashtag name. Output using println

[Spark RDD] Find the single row that has the highest count and, for that row, report the month, count, and hashtag name. Print the result to the terminal output us
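
A minimal sketch, assuming an RDD of (month, hashtag, count) tuples (the data and field order here are made up): RDD.max with a key function returns the single row with the highest count, which can then be printed (println in Scala, print in PySpark).

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Hypothetical data: (month, hashtag, count) tuples.
rows = sc.parallelize([
    ("2021-01", "#spark", 42),
    ("2021-02", "#bigdata", 17),
    ("2021-02", "#scala", 99),
])

# max() with a key function returns the single row with the highest count.
month, hashtag, count = rows.max(key=lambda r: r[2])
print(f"month={month}, count={count}, hashtag={hashtag}")
```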

Unable to write data using spark-submit

When I run spark-submit with this command on Cloudera: time spark-submit \ --deploy-mode client \ --conf spark.app.name='XXXxxxxxx' --conf spark.master=l

Random sampling based on one column after groupBy

I have a Spark table which contains 400+ million records/rows. I used spark.table to convert it into a DF. The DF looks like this below: id pub_date
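
One common approach is to assign a random row number within each group and keep the first N rows per group. A PySpark sketch, assuming the grouping column is id, a table name of my_table, and a per-group sample size of 10 (all placeholders):

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

df = spark.table("my_table")  # placeholder table name

# Order rows randomly within each id group, then keep the first 10 per group.
w = Window.partitionBy("id").orderBy(F.rand(seed=42))
sampled = (df.withColumn("rn", F.row_number().over(w))
             .filter(F.col("rn") <= 10)
             .drop("rn"))
```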

How can we use the multimap_agg function in Spark SQL, and is there any equivalent or alternative function?

Can anyone explain how the multimap_agg function from SQL can be used in Spark SQL?
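
Spark SQL has no multimap_agg, but the same result (a map from key to an array of values) can usually be built from collect_list, struct, and map_from_entries (available in Spark 2.4+). A sketch against a hypothetical table t(id, k, v):

```python
# Hypothetical table t(id, k, v); mimic Presto's multimap_agg(k, v) per id.
result = spark.sql("""
  SELECT id,
         map_from_entries(collect_list(struct(k, vals))) AS multimap
  FROM (
    SELECT id, k, collect_list(v) AS vals
    FROM t
    GROUP BY id, k
  )
  GROUP BY id
""")
```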

I am trying to set up Spark locally but am getting an error

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties Setting default log level to "WARN". To adjust logging level use sc.setLogLevel(

Problem with cassandra-connector at "load()"

I successfully downloaded this connector: com.datastax.spark:spark-cassandra-connector_2.11:2.5.1. And when I try to load the information with this line: data = s
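
The usual pattern with the DataStax connector is to name the format and pass keyspace/table options before calling load(), with the connector package on the classpath. A PySpark sketch; the keyspace, table, and host are placeholders:

```python
# Submit with the connector available, e.g.:
#   spark-submit --packages com.datastax.spark:spark-cassandra-connector_2.11:2.5.1 app.py

spark.conf.set("spark.cassandra.connection.host", "127.0.0.1")  # placeholder host

data = (spark.read
        .format("org.apache.spark.sql.cassandra")
        .options(table="my_table", keyspace="my_keyspace")      # placeholders
        .load())
data.show(5)
```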

Spark: load a CSV file from the resources folder inside a jar

I am trying to create a Spark application in Scala that reads a .csv file located in the src/main/resources directory and saves it to the local HDFS

TypeError while tokenizing a column in Spark dataframe

I'm trying to tokenize a 'string' column from a spark dataset. The spark dataframe is as follows: df: index ---> Integer question ---> String This is h
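
Without the full traceback it is hard to be sure, but a frequent cause is a non-string or null value in the input column, since Tokenizer only accepts a StringType column. A sketch of the usual usage, with a cast and null fill first (column names taken from the question):

```python
from pyspark.ml.feature import Tokenizer
from pyspark.sql import functions as F

# Cast to string and replace missing values before tokenizing.
df_clean = (df.withColumn("question", F.col("question").cast("string"))
              .na.fill({"question": ""}))

tokenizer = Tokenizer(inputCol="question", outputCol="question_tokens")
tokenized = tokenizer.transform(df_clean)
tokenized.select("index", "question", "question_tokens").show(truncate=False)
```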

How to implement Slowly Changing Dimensions Type 2 (SCD2) in Spark

We want to implement SCD2 in Spark using a SQL join. I found a reference on GitHub, https://gist.github.com/rampage644/cc4659edd11d9a288c1b, but it's not very cle
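
As a rough illustration of the join-based idea only (table names, column names, and the change-detection rule below are all placeholder assumptions): compare the incoming batch with the current dimension rows on the business key, close out rows whose tracked attributes changed, and append the new versions.

```python
from pyspark.sql import functions as F

dim = spark.table("customer_dim")        # existing SCD2 table (placeholder)
stg = spark.table("customer_staging")    # incoming snapshot (placeholder)

current = dim.filter(F.col("is_current") == True)

# Keys whose tracked attribute changed -> their open rows must be closed.
changed_keys = (current.alias("d")
                .join(stg.alias("s"),
                      F.col("d.customer_id") == F.col("s.customer_id"))
                .filter(F.col("d.address") != F.col("s.address"))
                .select(F.col("d.customer_id").alias("customer_id")))

closed = (current.join(changed_keys, "customer_id")
          .withColumn("is_current", F.lit(False))
          .withColumn("end_date", F.current_date()))

# New versions: changed rows plus brand-new business keys.
new_keys = stg.join(current, "customer_id", "left_anti")
changed_rows = stg.join(changed_keys, "customer_id")
new_versions = (changed_rows.unionByName(new_keys)
                .withColumn("start_date", F.current_date())
                .withColumn("end_date", F.lit(None).cast("date"))
                .withColumn("is_current", F.lit(True)))

# Untouched history + closed rows + new versions.
untouched = dim.join(closed.select("customer_id", "start_date"),
                     ["customer_id", "start_date"], "left_anti")
result = untouched.unionByName(closed).unionByName(new_versions)
```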

PySpark - Convert a heterogeneous JSON array to a Spark dataframe and flatten it

I have streaming data coming in as a JSON array and I want to flatten it out into a single row in a Spark dataframe using Python. Here is what the JSON data looks like
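
Without the actual payload this can only be a sketch, but the usual pattern is to parse the string with from_json against an explicit schema, explode the array, and pivot the elements back into one row. The column names (json_str, message_id) and the name/value element shape below are made up:

```python
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StructType, StructField, StringType

# Made-up shape: each array element is {"name": ..., "value": ...}.
schema = ArrayType(StructType([
    StructField("name", StringType()),
    StructField("value", StringType()),
]))

parsed = df.withColumn("items", F.from_json(F.col("json_str"), schema))

# Explode the array, then pivot the name/value pairs into one row per message.
flat = (parsed
        .select("message_id", F.explode("items").alias("item"))
        .select("message_id",
                F.col("item.name").alias("name"),
                F.col("item.value").alias("value"))
        .groupBy("message_id")
        .pivot("name")
        .agg(F.first("value")))
```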

SparkFatalException root cause

I am using Spark 3.0.2 with Java 8. I am trying to write data to an S3 path using a Spark job. I am getting the exception below and cannot tell what caused thi

Distinct Count on Column in Dataset in Structured Streaming

I am new to Structured Streaming, so I am facing an issue while calculating a distinct count on a column in a Dataset/DataFrame. //DataFrame val readFromKafka = sparks
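
Exact distinct counts are awkward on an unbounded stream; a common workaround is approx_count_distinct inside a windowed aggregation with a watermark. Sketched here in PySpark with placeholder Kafka settings and column names; the same API exists in the Scala Dataset API:

```python
from pyspark.sql import functions as F

readFromKafka = (spark.readStream
                 .format("kafka")
                 .option("kafka.bootstrap.servers", "localhost:9092")  # placeholder
                 .option("subscribe", "events")                        # placeholder topic
                 .load())

events = readFromKafka.selectExpr("CAST(value AS STRING) AS user_id", "timestamp")

counts = (events
          .withWatermark("timestamp", "10 minutes")
          .groupBy(F.window("timestamp", "5 minutes"))
          .agg(F.approx_count_distinct("user_id").alias("distinct_users")))

query = (counts.writeStream
         .outputMode("update")
         .format("console")
         .start())
```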

py4j.protocol.Py4JJavaError: An error occurred while calling o63.save. : java.lang.NoClassDefFoundError: org/apache/spark/Logging

I am new to Spark and the big data component HBase. I am trying to write Python code in PySpark to connect to HBase and read data from it. I'm using the followi

Is there a way to slice dataframe based on index in pyspark?

In Python or R, there are ways to slice a DataFrame using an index. For example, in pandas: df.iloc[5:10,:] Is there a similar way in pyspark to slice data bas
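
Spark DataFrames have no positional index, but one can be manufactured with row_number over an explicit ordering (or with zipWithIndex on the underlying RDD) and then filtered on. A sketch reproducing df.iloc[5:10], assuming an ordering column called some_col (placeholder):

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# A deterministic ordering column is required; "some_col" is a placeholder.
w = Window.orderBy("some_col")

sliced = (df.withColumn("idx", F.row_number().over(w) - 1)   # 0-based index
            .filter((F.col("idx") >= 5) & (F.col("idx") < 10))
            .drop("idx"))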

Spark error : java.lang.NoClassDefFoundError: org/apache/spark/sql/sources/v2/StreamWriteSupport

I am using Spark on Hortonworks; when I execute the code below I get an exception. I also have a separate Spark instance running on my system - the same code i

Is it possible to load multiple directories separately in PySpark but process them in parallel?

I have an S3 or Azure Blob directory structure like the following: parent_dir child_dir1 avro_1 avro_2 ... child_dir2 ... There
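
DataFrameReader.load accepts a list of paths, so the child directories can be read in a single job (and therefore processed in parallel) while input_file_name() keeps track of where each record came from. A sketch with placeholder paths; the spark-avro package is assumed to be on the classpath:

```python
from pyspark.sql import functions as F

paths = [
    "s3a://bucket/parent_dir/child_dir1",   # placeholders
    "s3a://bucket/parent_dir/child_dir2",
]

df = (spark.read.format("avro").load(paths)
      .withColumn("source_dir", F.input_file_name()))

# Per-directory logic can then be expressed as grouped or filtered operations
# on source_dir instead of separate sequential reads.
```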

How to use countDistinct with a window function in Spark/Scala?

I need to use a window function that is partitioned by 2 columns and do a distinct count on the 3rd column, output as the 4th column. I can do a count without any is
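
countDistinct is not supported as a window function, but size(collect_set(...)) over the same window gives the same result. Sketched in PySpark with placeholder column names; the same functions exist in the Scala API:

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy("col1", "col2")

result = df.withColumn("distinct_col3",
                       F.size(F.collect_set("col3").over(w)))
```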

Get the count of partitions in a Kafka topic with Scala 2.12

With Scala 2.11 and spark-streaming-kafka-0-8_2.11 I could do: import org.apache.spark.streaming.kafka.KafkaCluster val params = Map[String, Object]( "bootstr

Getting an error while encoding data read from a text file

I have a pipe-delimited file I need to strip the first two rows off of. So I read it into an RDD, exclude the first two rows, and make it into a data frame. va
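
One way to drop exactly the first two lines is zipWithIndex, which attaches a global position to each line before filtering. Sketched in PySpark; the file path and column names are placeholders, and the same pattern carries over to Scala:

```python
lines = spark.sparkContext.textFile("/path/to/file.psv")   # placeholder path

rows = (lines.zipWithIndex()
             .filter(lambda pair: pair[1] >= 2)    # drop the first two lines
             .map(lambda pair: pair[0].split("|")))

# Column names are made up here; use the real header instead.
df = spark.createDataFrame(rows, ["col1", "col2", "col3"])
```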