I am processing a huge text file using PyCharm and PySpark. This is what I am trying to do:
    spark_home = os.environ.get('SPARK_HOME', None)
    os.environ["SPARK_HOM
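One common way to wire this up inside an IDE, sketched here under the assumption that the findspark helper package is installed and that /opt/spark and big_file.txt are placeholder paths, not taken from the question:

    import os
    import findspark  # assumes the findspark package is available

    # Point the environment at a local Spark install (placeholder path) and let
    # findspark add Spark's python/ and py4j libraries to sys.path.
    os.environ["SPARK_HOME"] = "/opt/spark"
    findspark.init()

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").appName("huge-text-file").getOrCreate()
    lines = spark.read.text("big_file.txt")  # placeholder input path
    print(lines.count())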
I have a web service built around Spark that, based on a JSON request, builds a series of dataframe/dataset operations. These operations involve multiple joins,
Is it possible to use Google Guice as a dependency injection provider for an Apache Spark Java application? I am able to achieve this if the execution is happening
I'm trying to run a pitest report on a Gradle + Kotlin project, but I get the following error:
    Exception in thread "main" org.pitest.help.PitHelpError: No mutat
How do I set a Spark job to pick up a Kafka topic from a specific offset based on a timestamp? Let's say that I need to get all data from a Kafka topic startin
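Assuming Spark 3.x with the Structured Streaming Kafka source, a hedged sketch using the startingOffsetsByTimestamp option, which maps each partition to an epoch-millisecond timestamp (the broker, topic name and partition ids below are placeholders):

    import json
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("kafka-from-timestamp").getOrCreate()

    start_ts_ms = 1572480000000  # placeholder epoch-millisecond start time

    df = (
        spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")  # placeholder broker
        .option("subscribe", "my_topic")                    # placeholder topic
        # Start each listed partition at the first offset whose timestamp is >= start_ts_ms.
        .option("startingOffsetsByTimestamp",
                json.dumps({"my_topic": {"0": start_ts_ms, "1": start_ts_ms}}))
        .load()
    )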
I was trying to install the latest Mesos version (1.9.0) on Ubuntu 20.04 using a Dockerfile.
    FROM ubuntu:20.04
    ENV MESOS_VERSION 1.9.0
    ENV MESOS_ARTIFACT_FILENAME
I'm trying to instantiate a SparkContext inside an SBT console, using the following Scala commands:
    import org.apache.spark.SparkConf
    import org.apache.spark.Spa
I am trying to add a UUID column to my dataset.
    getDataset(Transaction.class)).withColumn("uniqueId", functions.lit(UUID.randomUUID().toStrin
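The snippet above is Java, but one point about it is language-independent: lit(UUID.randomUUID().toString()) is evaluated once on the driver, so every row receives the same value. A hedged PySpark sketch of a per-row alternative using the built-in SQL uuid() function:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("uuid-column").getOrCreate()
    df = spark.range(3)  # toy stand-in for the Transaction dataset

    # expr("uuid()") is evaluated per row, unlike lit(<one pre-generated string>).
    df_with_id = df.withColumn("uniqueId", F.expr("uuid()"))
    df_with_id.show(truncate=False)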
Say I have a dataframe which contains a column (called colA) which is a seq of rows. I want to append a new field to each record of colA. (And the new field
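Assuming colA is an array of structs and Spark 3.1+, a hedged PySpark sketch using transform together with Column.withField (the field name newField and its value are placeholders):

    from pyspark.sql import Row, SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("append-struct-field").getOrCreate()

    df = spark.createDataFrame(
        [(1, [Row(x=1), Row(x=2)])],
        "id INT, colA ARRAY<STRUCT<x: INT>>",
    )

    # Rebuild each struct in the array with one extra field appended.
    df2 = df.withColumn(
        "colA",
        F.transform("colA", lambda s: s.withField("newField", F.lit("value"))),
    )
    df2.printSchema()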
I have a PySpark dataframe like this:
    +------+------+
    |     A|     B|
    +------+------+
    |     1|     2|
    |     1|     3|
    |     2|     3|
    |     2|     5|
    +------+--
Why is Spark faster than Hadoop MapReduce? As per my understanding, if Spark is faster due to in-memory processing, then Hadoop also loads data into RAM, then i
A peer of mine has created code that opens a RESTful API web service within an interactive Spark job. The intent of our company is to use his code as a means o
I want to convert all TIMESTAMP columns of a Spark dataframe into String columns. Could anybody say how to do that automatically for each dataframe? The position of
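A hedged sketch of one generic approach: inspect the schema and cast every TimestampType column to string (the column names here are illustrative):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.types import TimestampType

    spark = SparkSession.builder.appName("ts-to-string").getOrCreate()

    df = spark.createDataFrame([("a",)], "name STRING").withColumn(
        "created_at", F.current_timestamp()
    )

    # Cast every TimestampType column to string, leaving other columns untouched.
    ts_cols = [f.name for f in df.schema.fields if isinstance(f.dataType, TimestampType)]
    for c in ts_cols:
        df = df.withColumn(c, F.col(c).cast("string"))

    df.printSchema()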
I have a PySpark DataFrame with two columns, text and subtext, where subtext is guaranteed to occur somewhere within text. How would I calculate the positio
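A hedged sketch using the SQL instr function via expr, which allows comparing two columns (the DataFrame-API instr/locate helpers only take a literal search string):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("subtext-position").getOrCreate()

    df = spark.createDataFrame([("hello world", "world")], ["text", "subtext"])

    # instr is 1-based; 0 means "not found".
    df = df.withColumn("position", F.expr("instr(text, subtext)"))
    df.show()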
I am forming a query in a String Builder like below:
    println(dataQuery)
    Execution started at 2019-10-31 02:58:24.006019 PST
    res245: String = " SELECT transac
What is the most efficient way to read only a subset of columns in Spark from a parquet file that has many columns? Is using spark.read.format("parquet").load(
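A hedged sketch of the usual approach: select the wanted columns right after load, so Spark can push column pruning down to the Parquet reader (the path and column names below are placeholders):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("parquet-column-pruning").getOrCreate()

    # Only the selected columns are read from disk thanks to Parquet column pruning.
    df = spark.read.parquet("/path/to/data.parquet").select("colA", "colB")
    df.explain()  # ReadSchema in the physical plan should list only colA and colB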
I would like to modify my date column in a Spark df to subtract 1 month only if certain months appear, i.e. only if the date is yyyy-07-31 or yyyy-04-30, chang
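A hedged PySpark sketch of one way to do this with when/otherwise and add_months, keying on the month-day part of the date (sample dates are placeholders):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("conditional-month-shift").getOrCreate()

    df = spark.createDataFrame(
        [("2019-07-31",), ("2019-04-30",), ("2019-05-15",)], ["date"]
    ).withColumn("date", F.to_date("date"))

    # Subtract a month only when the month/day is 07-31 or 04-30.
    is_target = F.date_format("date", "MM-dd").isin("07-31", "04-30")
    df = df.withColumn(
        "date", F.when(is_target, F.add_months("date", -1)).otherwise(F.col("date"))
    )
    df.show()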
I am trying to remove a special character (å) from a column in a dataframe. My data looks like:
    ClientID,PatientID
    AR0001å,DH_HL704221157198295_9
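A hedged sketch using regexp_replace, either removing that specific character or stripping everything outside an allowed set (the single-column sample below is illustrative):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("strip-special-char").getOrCreate()

    df = spark.createDataFrame([("AR0001å",)], ["ClientID"])

    # Remove the specific character...
    df = df.withColumn("ClientID", F.regexp_replace("ClientID", "å", ""))
    # ...or, alternatively, keep only ASCII alphanumerics and underscores:
    # df = df.withColumn("ClientID", F.regexp_replace("ClientID", "[^A-Za-z0-9_]", ""))
    df.show()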
I'm looking at the Spark UI (Spark v1.6.0) for a stage of a job I'm currently running and I don't understand how to interpret what it's telling me: The number o