Category "apache-spark"

DF.toPandas() - Failed to locate the winutils binary in the hadoop binary path

I am processing a huge text file using PyCharm and PySpark. This is what I am trying to do: spark_home = os.environ.get('SPARK_HOME', None) os.environ["SPARK_HOM
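
On Windows this error usually means winutils.exe cannot be found on the Hadoop binary path. A minimal PySpark sketch of the usual workaround, assuming a hypothetical C:\hadoop folder that contains bin\winutils.exe:

```python
import os
from pyspark.sql import SparkSession

# Hypothetical location: HADOOP_HOME must point at a folder with bin\winutils.exe
os.environ["HADOOP_HOME"] = r"C:\hadoop"
os.environ["PATH"] += os.pathsep + r"C:\hadoop\bin"

spark = SparkSession.builder.master("local[*]").appName("winutils-check").getOrCreate()

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
print(df.toPandas())  # toPandas() also needs pandas installed in the driver environment
```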

Custom sort order on a Spark dataframe/dataset

I have a web service built around Spark that, based on a JSON request, builds a series of dataframe/dataset operations. These operations involve multiple joins,
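
A common way to express a custom (non-alphabetical) sort in the DataFrame API is to map values to an explicit rank expression and order by it. A minimal sketch with made-up column names and statuses:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.master("local[*]").getOrCreate()

# Hypothetical data: sort statuses in a business-defined order, not alphabetically
df = spark.createDataFrame(
    [("order-1", "SHIPPED"), ("order-2", "NEW"), ("order-3", "PAID")],
    ["id", "status"],
)

rank = (F.when(F.col("status") == "NEW", 0)
         .when(F.col("status") == "PAID", 1)
         .when(F.col("status") == "SHIPPED", 2)
         .otherwise(99))

df.orderBy(rank).show()
```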

Apache Spark: is it possible to have Google Guice as a dependency injection technique

Is it possible to use Google Guice as a dependency injection provider for an Apache Spark Java application? I am able to achieve this if the execution is happening

Pitest is failing with: No mutations found due to the supplied classpath or filters + Gradle

I'm trying to run a Pitest report on a Gradle + Kotlin project, but I get the following error: Exception in thread "main" org.pitest.help.PitHelpError: No mutat

Spark + Read Kafka topic from a specific offset based on timestamp

How do I set a Spark job to pick up a Kafka topic from a specific offset based on a timestamp? Let's say that I need to get all data from a Kafka topic startin
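
The Kafka source in Spark 3.0+ accepts a startingOffsetsByTimestamp option: a JSON map of topic → partition → epoch milliseconds. A minimal PySpark sketch, assuming a hypothetical topic "events" with two partitions and the spark-sql-kafka connector package on the classpath:

```python
import json
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()

# Epoch milliseconds per partition (values here are illustrative)
starting = json.dumps({"events": {"0": 1635638400000, "1": 1635638400000}})

df = (spark.read.format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "events")
      .option("startingOffsetsByTimestamp", starting)  # Spark 3.0+ only
      .load())

df.selectExpr("CAST(value AS STRING)", "timestamp").show(truncate=False)
```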

Installing Mesos on Ubuntu 20.04 causes a Makefile issue

I was trying to install the latest Mesos version (1.9.0) on Ubuntu 20.04 using a Dockerfile. FROM ubuntu:20.04 ENV MESOS_VERSION 1.9.0 ENV MESOS_ARTIFACT_FILENAME

Spark in SBT console: "Could not find spark-version-info.properties"

I'm trying to instantiate a SparkContext inside an SBT console, using the following Scala commands: import org.apache.spark.SparkConf import org.apache.spark.Spa

Add UUID to Spark dataset [duplicate]

I am trying to add a UUID column to my dataset. getDataset(Transaction.class)).withColumn("uniqueId", functions.lit(UUID.randomUUID().toStrin
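
The usual catch with functions.lit(UUID.randomUUID().toString()) is that the UUID is generated once on the driver, so every row gets the same value. A minimal PySpark sketch of the per-row alternative using Spark's built-in uuid() SQL expression:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.master("local[*]").getOrCreate()
df = spark.createDataFrame([(1,), (2,), (3,)], ["id"])

# uuid() is a Spark SQL built-in; unlike a lit() value it is evaluated per row
df.withColumn("uniqueId", F.expr("uuid()")).show(truncate=False)
```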

Spark Scala dataframe UDF returning rows

Say I have a dataframe which contains a column (called colA) which is a seq of row. I want to append a new field to each record of colA. (And the new field

How to quickly check if a row exists in a PySpark DataFrame?

I have a PySpark dataframe like this:

+------+------+
|     A|     B|
+------+------+
|     1|     2|
|     1|     3|
|     2|     3|
|     2|     5|
+------+------+
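
One inexpensive pattern is to filter and stop after the first match rather than counting the whole dataframe. A minimal sketch against the sample data above:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.master("local[*]").getOrCreate()
df = spark.createDataFrame([(1, 2), (1, 3), (2, 3), (2, 5)], ["A", "B"])

# limit(1) lets Spark stop scanning as soon as one matching row is found
exists = df.filter((F.col("A") == 1) & (F.col("B") == 3)).limit(1).count() > 0
print(exists)  # True
```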

Why is Spark 100 times faster than Hadoop MapReduce?

Why is Spark faster than Hadoop MapReduce? As per my understanding, if Spark is faster due to in-memory processing, then Hadoop also loads data into RAM, then i

Spark job as a web service?

A peer of mine has created code that opens a RESTful API web service within an interactive Spark job. The intent of our company is to use his code as a means o

How to convert timestamp column of Spark Dataframe to string column

I want to convert all TIMESTAMP columns of a Spark dataframe into String columns. Could anybody say how to do that automatically for each dataframe? The position of
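
One way to do this generically is to inspect df.dtypes and cast every timestamp column to string, regardless of name or position. A minimal PySpark sketch (the same idea carries over to the Scala API):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.master("local[*]").getOrCreate()
df = spark.sql("SELECT 1 AS id, current_timestamp() AS created_at, current_timestamp() AS updated_at")

# Find timestamp columns by dtype, then cast each one to string
for name, dtype in df.dtypes:
    if dtype == "timestamp":
        df = df.withColumn(name, F.col(name).cast("string"))

df.printSchema()
```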

How to find position of substring column in another column using PySpark?

I have a PySpark DataFrame with two columns, text and subtext, where subtext is guaranteed to occur somewhere within text. How would I calculate the positio
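
Since functions.locate/instr in the Python API only accept a literal substring, a common workaround is a SQL expression so both arguments can be columns. A minimal sketch with made-up rows (instr returns a 1-based position):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.master("local[*]").getOrCreate()
df = spark.createDataFrame(
    [("where is it", "is"), ("hello world", "world")],
    ["text", "subtext"],
)

# expr() lets instr take both arguments as columns; the result is 1-based
df.withColumn("position", F.expr("instr(text, subtext)")).show()
```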

Spark SQL error : org.apache.spark.sql.catalyst.parser.ParseException: extraneous input '$' expecting

I am forming a query in a StringBuilder like below: println(dataQuery) Execution started at 2019-10-31 02:58:24.006019 PST res245: String = " SELECT transac

Efficient way to read specific columns from a Parquet file in Spark

What is the most efficient way to read only a subset of columns in Spark from a Parquet file that has many columns? Is using spark.read.format("parquet").load(
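
Because Parquet is columnar, selecting a subset of columns right after the read lets Spark prune the scan to just those columns. A minimal sketch with a hypothetical path and column names:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()

# Hypothetical file and columns; column pruning shows up in the plan's ReadSchema
df = spark.read.parquet("/tmp/events.parquet").select("user_id", "event_time")
df.explain()  # ReadSchema should list only the two selected columns
```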

Modify date (month) in a Spark date column based on condition

I would like to modify my date column in a Spark df to subtract 1 month only if certain months appear, i.e. only if the date is yyyy-07-31 or yyyy-04-30, chang
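
One way to express this is a when/otherwise that applies add_months only to the matching dates. A minimal sketch that checks the month (July or April); matching the exact day as well would just mean adding another condition:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.master("local[*]").getOrCreate()
df = spark.createDataFrame(
    [("2020-07-31",), ("2020-04-30",), ("2020-01-15",)], ["date_str"]
).withColumn("date", F.to_date("date_str"))

# Subtract one month only for July/April dates, leave everything else unchanged
df = df.withColumn(
    "date",
    F.when(F.month("date").isin(7, 4), F.add_months("date", -1))
     .otherwise(F.col("date")),
)
df.show()
```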

Remove special character from a column in a dataframe

I am trying to remove a special character (å) from a column in a dataframe. My data looks like: ClientID,PatientID AR0001å,DH_HL704221157198295_9
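
regexp_replace handles this directly. A minimal sketch using the sample values from the question; a pattern like [^\x00-\x7F] would strip all non-ASCII characters instead of just å:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.master("local[*]").getOrCreate()
df = spark.createDataFrame([("AR0001å", "DH_HL704221157198295_9")], ["ClientID", "PatientID"])

# Remove the specific character from ClientID
df = df.withColumn("ClientID", F.regexp_replace("ClientID", "å", ""))
df.show(truncate=False)
```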

How do I interpret Input size / records in Spark Stage UI

I'm looking at the Spark UI (Spark v1.6.0) for a stage of a job I'm currently running and I don't understand how to interpret what it's telling me: The number o