I am new to Structured Streaming, so I am facing an issue while calculating a distinct count on a column in a Dataset/DataFrame.

    //DataFrame
    val readFromKafka = spark…
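A minimal sketch of the usual workaround, assuming a local broker and a topic named events (both hypothetical): exact distinct counts are not supported in streaming aggregations, so approx_count_distinct is the common substitute.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.approx_count_distinct

    val spark = SparkSession.builder.appName("DistinctCount").getOrCreate()

    // Read the stream from Kafka (broker and topic are hypothetical).
    val readFromKafka = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "events")
      .load()

    // Exact countDistinct is not allowed on a streaming Dataset, so use
    // the approximate (HyperLogLog-based) variant instead.
    val distinctPerKey = readFromKafka
      .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")
      .groupBy("key")
      .agg(approx_count_distinct("value").as("distinct_values"))

    distinctPerKey.writeStream
      .outputMode("complete")
      .format("console")
      .start()
      .awaitTermination()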
I am new to Spark and to the big-data component HBase. I am trying to write Python code in PySpark that connects to HBase and reads data from it. I'm using the following…
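For reference, a sketch of the read side using the Apache hbase-spark connector (shown in Scala; the same format and options can be passed through the PySpark DataFrame reader). The table name and column mapping are hypothetical, and the connector jar must be on the classpath.

    // Read an HBase table as a DataFrame via the hbase-spark connector.
    val df = spark.read
      .format("org.apache.hadoop.hbase.spark")
      .option("hbase.table", "my_table")  // hypothetical table
      .option("hbase.columns.mapping",
        "id STRING :key, name STRING cf:name, amount DOUBLE cf:amount")
      .load()

    df.show()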
In Python or R there are ways to slice a DataFrame by index. For example, in pandas: df.iloc[5:10,:]. Is there a similar way in PySpark to slice data based…
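Spark DataFrames have no inherent row order, so there is no direct iloc equivalent; one common workaround is to attach an explicit index and filter on it. A sketch in Scala (the helper name sliceByIndex is hypothetical):

    import org.apache.spark.sql.{DataFrame, Row}
    import org.apache.spark.sql.types.{LongType, StructField, StructType}

    // Attach a contiguous row index with zipWithIndex, then keep [start, end).
    def sliceByIndex(df: DataFrame, start: Long, end: Long): DataFrame = {
      val withIndex = df.rdd.zipWithIndex.map { case (row, idx) =>
        Row.fromSeq(row.toSeq :+ idx)
      }
      val schema = StructType(
        df.schema.fields :+ StructField("_row_idx", LongType, nullable = false))
      val indexed = df.sparkSession.createDataFrame(withIndex, schema)
      indexed
        .filter(indexed("_row_idx") >= start && indexed("_row_idx") < end)
        .drop("_row_idx")
    }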
I am using Spark on Hortonworks; when I execute the code below I get an exception. I also have a separate Spark instance running on my system, and the same code i…
I have an S3 or Azure Blob directory structure like the following:

    parent_dir
      child_dir1
        avro_1
        avro_2
        ...
      child_dir2
      ...

There…
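A sketch of reading all the nested Avro files in one pass, assuming Spark 3.x with the built-in Avro source; the bucket name is hypothetical:

    // recursiveFileLookup descends into all child_dir* subdirectories.
    val df = spark.read
      .format("avro")
      .option("recursiveFileLookup", "true")
      .load("s3a://my-bucket/parent_dir/")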
I need to use a window function that is partitioned by 2 columns and does a distinct count on the 3rd column, added as a 4th column. I can do a plain count without any issue…
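countDistinct is not supported as a window function, but the same result can be obtained with size(collect_set(...)) over the window. A sketch (column names are hypothetical):

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{collect_set, size}

    // Exact distinct count of col3 per (col1, col2) partition, as a 4th column.
    val w = Window.partitionBy("col1", "col2")
    val result = df.withColumn("distinct_col3", size(collect_set("col3").over(w)))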
With Scala 2.11 and spark-streaming-kafka-0-8_2.11 I could do:

    import org.apache.spark.streaming.kafka.KafkaCluster
    val params = Map[String, Object](
      "bootstr…
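The 0-10 integration no longer exposes KafkaCluster, so metadata and offset lookups are usually done against the plain Kafka consumer API instead. A sketch under that assumption (broker address and topic are hypothetical):

    import java.util.Properties
    import scala.collection.JavaConverters._
    import org.apache.kafka.clients.consumer.KafkaConsumer

    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")
    props.put("group.id", "metadata-lookup")
    props.put("key.deserializer",
      "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("value.deserializer",
      "org.apache.kafka.common.serialization.StringDeserializer")

    // Query partition metadata the way KafkaCluster used to.
    val consumer = new KafkaConsumer[String, String](props)
    val partitions = consumer.partitionsFor("my-topic").asScala
    partitions.foreach(p => println(s"partition ${p.partition()}"))
    consumer.close()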
I have a pipe-delimited file and I need to strip the first two rows off of it. So I read it into an RDD, exclude the first two rows, and make it into a data frame. val…
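A sketch of that approach, using zipWithIndex so "the first two rows" are well defined across partitions (file path and column names are hypothetical):

    import spark.implicits._

    val raw = spark.sparkContext.textFile("data.psv")

    // Drop the first two rows by index, then split on the pipe delimiter.
    val body = raw.zipWithIndex
      .filter { case (_, idx) => idx >= 2 }
      .map { case (line, _) => line.split('|') }

    // Assuming three columns; adjust to the real layout.
    val df = body.map(a => (a(0), a(1), a(2))).toDF("c1", "c2", "c3")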
I am processing a huge text file using PyCharm and PySpark. This is what I am trying to do:

    spark_home = os.environ.get('SPARK_HOME', None)
    os.environ["SPARK_HOM…
I have a web service built around Spark that, based on a JSON request, builds a series of DataFrame/Dataset operations. These operations involve multiple joins,…
Is it possible to use Google Guice as a dependency injection provider for an Apache Spark Java application? I am able to achieve this if the execution is happening…
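One known wrinkle: the Guice Injector is not serializable, so it cannot be shipped inside task closures; a common pattern is to build it lazily once per executor JVM. A sketch, shown in Scala with a hypothetical module and binding:

    import com.google.inject.{AbstractModule, Guice, Injector}

    class AppModule extends AbstractModule {
      override def configure(): Unit = {
        // bind(classOf[MyService]).to(classOf[MyServiceImpl])  // hypothetical
      }
    }

    object Injection {
      // A Scala object is initialized once per JVM, so each executor
      // builds its own injector instead of deserializing one.
      lazy val injector: Injector = Guice.createInjector(new AppModule)
    }

    // Usage inside a task: rdd.map(x => Injection.injector.getInstance(...))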
I'm trying to run a pitest report on a Gradle + Kotlin project, but I get the following error: Exception in thread "main" org.pitest.help.PitHelpError: No mutat…
How do I set a Spark job to pick up a Kafka topic from a specific offset based on a timestamp? Let's say I need to get all data from a Kafka topic starting…
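Since Spark 3.0 the Kafka source accepts startingOffsetsByTimestamp, a JSON map of topic to per-partition epoch-millis timestamps; Kafka resolves each to the earliest offset at or after that time. A sketch (topic name, partition count, and timestamp are hypothetical):

    val df = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "events")
      // Start both partitions of "events" at 2023-01-01T00:00:00Z.
      .option("startingOffsetsByTimestamp",
        """{"events": {"0": 1672531200000, "1": 1672531200000}}""")
      .load()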
I was trying to install the latest Mesos version (1.9.0) on Ubuntu 20.04 using a Dockerfile:

    FROM ubuntu:20.04
    ENV MESOS_VERSION 1.9.0
    ENV MESOS_ARTIFACT_FILENAME…
I'm trying to instantiate a SparkContext inside an SBT console, using the following Scala commands:

    import org.apache.spark.SparkConf
    import org.apache.spark.Spa…
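For reference, a minimal sketch that typically works in an sbt console session, assuming the Spark dependencies are on the build's classpath:

    import org.apache.spark.{SparkConf, SparkContext}

    // local[*] keeps everything inside the console JVM.
    val conf = new SparkConf()
      .setAppName("sbt-console")
      .setMaster("local[*]")
    val sc = new SparkContext(conf)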
I am trying to add a UUID column to my dataset:

    getDataset(Transaction.class).withColumn("uniqueId", functions.lit(UUID.randomUUID().toStrin…
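Note that lit(UUID.randomUUID().toString()) is evaluated once on the driver, so every row receives the same UUID. The built-in SQL uuid() function runs per row instead; a sketch in Scala, mirroring the getDataset call from the question:

    import org.apache.spark.sql.functions.expr

    // uuid() is evaluated for each row, giving a distinct value per record.
    val withId = getDataset(classOf[Transaction]).withColumn("uniqueId", expr("uuid()"))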
Say I have a dataframe which contains a column (called colA) which is a seq of rows. I want to append a new field to each record of colA. (And the new field…
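With Spark 3.1+ this can be done without a UDF: transform rewrites each element of the array, and withField appends the field. A sketch (the field name and value are hypothetical):

    import org.apache.spark.sql.functions.{col, lit, transform}

    // Append newField to every struct inside the colA array.
    val result = df.withColumn(
      "colA",
      transform(col("colA"), x => x.withField("newField", lit("someValue")))
    )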
I have a PySpark dataframe like this:

    +------+------+
    |     A|     B|
    +------+------+
    |     1|     2|
    |     1|     3|
    |     2|     3|
    |     2|     5|
    +------+--…
Why is Spark faster than Hadoop MapReduce? As per my understanding, if Spark is faster due to in-memory processing, then Hadoop also loads data into RAM, so then i…
A peer of mine has created code that opens a RESTful API web service within an interactive Spark job. The intent of our company is to use his code as a means of…