Category "apache-spark"

Join two dataframes using the closest timestamp pyspark

So I am very new to pyspark and still unable to correctly create my own query. I try googling my problems but I just don't understand how most of this works
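
A common pattern for a "closest timestamp" join is to join on the shared key, rank the candidates by absolute time difference, and keep the top row. A minimal sketch with hypothetical toy data and column names (`id`, `ts_a`, `ts_b`, `value`):

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# Toy data: events in df_a, readings in df_b, joined on "id".
df_a = spark.createDataFrame(
    [(1, "2021-01-01 10:00:05"), (1, "2021-01-01 10:30:00")], ["id", "ts_a"]
).withColumn("ts_a", F.to_timestamp("ts_a"))
df_b = spark.createDataFrame(
    [(1, "2021-01-01 10:00:00", 7.0), (1, "2021-01-01 10:29:00", 9.5)],
    ["id", "ts_b", "value"],
).withColumn("ts_b", F.to_timestamp("ts_b"))

# Join on the key, rank candidates by absolute time difference,
# and keep only the closest reading for each row of df_a.
joined = df_a.join(df_b, "id").withColumn(
    "diff", F.abs(F.col("ts_a").cast("long") - F.col("ts_b").cast("long"))
)
w = Window.partitionBy("id", "ts_a").orderBy("diff")
closest = joined.withColumn("rn", F.row_number().over(w)).filter("rn = 1").drop("rn", "diff")
closest.show(truncate=False)
```

Note the join-then-rank approach shuffles every candidate pair per key; for very large tables a bucketed or range-bounded pre-filter on the timestamps is usually worth adding first.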

Setup Apache Sedona on EMR

I want to be able to use Apache Sedona for distributed GIS computing on AWS EMR. We need the right bootstrap script to have all dependencies. I tried setting up
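
Beyond the bootstrap script, the Spark session itself needs the Sedona jars and the Kryo registrator. A minimal sketch; the Maven coordinates and versions below are assumptions and must match the Spark/Scala version of the EMR release:

```python
from pyspark.sql import SparkSession

# Versions and coordinates are assumptions; pick the Sedona release matching
# the cluster's Spark and Scala versions.
spark = (
    SparkSession.builder
    .appName("sedona-on-emr")
    .config("spark.jars.packages",
            "org.apache.sedona:sedona-spark-shaded-3.0_2.12:1.4.1,"
            "org.datasyslab:geotools-wrapper:1.4.0-28.2")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .config("spark.kryo.registrator", "org.apache.sedona.core.serde.SedonaKryoRegistrator")
    .getOrCreate()
)

# Register Sedona's SQL functions (ST_Point, ST_Contains, ...).
from sedona.register import SedonaRegistrator
SedonaRegistrator.registerAll(spark)
```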

Find the Max value of an Array column and find associated value in another Array with in the dataframe

I have a csv file with below data. Id Subject Marks 1 M,P,C 10,8,6 2 M,P,C 5,7,9 3 M,P,C 6,7,4 I need to find out the max value in the Marks column for each Id an
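
One way to pair each mark with its subject is to split both delimited strings into arrays, zip them, explode, and keep the highest-mark row per Id. A sketch using the sample data from the question:

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, "M,P,C", "10,8,6"), (2, "M,P,C", "5,7,9"), (3, "M,P,C", "6,7,4")],
    ["Id", "Subject", "Marks"],
)

# Split the delimited strings into arrays, zip them together,
# then explode so each (subject, mark) pair becomes its own row.
exploded = (
    df.withColumn("subj_arr", F.split("Subject", ","))
      .withColumn("marks_arr", F.split("Marks", ",").cast("array<int>"))
      .withColumn("pair", F.explode(F.arrays_zip("subj_arr", "marks_arr")))
      .select("Id",
              F.col("pair.subj_arr").alias("subject"),
              F.col("pair.marks_arr").alias("mark"))
)

# Keep the row with the highest mark per Id.
w = Window.partitionBy("Id").orderBy(F.col("mark").desc())
result = exploded.withColumn("rn", F.row_number().over(w)).filter("rn = 1").drop("rn")
result.show()
```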

spark save simple string to text file

I have a spark job that needs to store the last time it ran to a text file. This has to work both on HDFS but also on local fs (for testing). However it seems
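
One simple approach that works for both HDFS and the local filesystem is to let the path scheme pick the filesystem and write a single-partition RDD. A sketch with a hypothetical path:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

last_run = "2021-06-01T12:00:00"  # the string to persist

# A single-partition RDD writes one part file; the path scheme decides whether
# it lands on HDFS ("hdfs://...") or the local FS ("file:///...").
# Note: saveAsTextFile fails if the directory already exists, so remove or
# version the directory between runs.
path = "file:///tmp/last_run_dir"  # hypothetical; use hdfs://... in production
spark.sparkContext.parallelize([last_run], 1).saveAsTextFile(path)

# Reading it back later:
value = spark.sparkContext.textFile(path).first()
```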

Spark pivot groupby performance very slow

I am trying to pivot a dataframe of raw data (about 6 GB) and it takes around 30 minutes (aggregation function: sum): x_pivot = raw_df.groupBy("a", "b", "c"
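
One common speed-up is to pass the distinct pivot values explicitly, which lets Spark skip the extra job it otherwise runs just to discover them. A sketch against the question's `raw_df`; the pivot column and its values are assumptions:

```python
from pyspark.sql import functions as F

# Supplying the pivot values up front avoids a separate distinct() pass.
# "d", "value" and the category list are hypothetical names.
pivot_values = ["cat1", "cat2", "cat3"]
x_pivot = (
    raw_df.groupBy("a", "b", "c")
          .pivot("d", pivot_values)
          .agg(F.sum("value"))
)
```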

Spark : skip top rows with spark-excel

I have an excel file with damaged rows at the top (the first 3 rows) which need to be skipped. I'm using the spark-excel library to read the excel file; on their githu
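
With the crealytics spark-excel reader, the `dataAddress` option controls where reading starts, so pointing it at row 4 skips the three damaged rows. A sketch; the sheet name and path are assumptions and option behavior can vary by library version:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Assumes the com.crealytics:spark-excel package is on the classpath.
df = (
    spark.read.format("com.crealytics.spark.excel")
         .option("dataAddress", "'Sheet1'!A4")  # start at row 4, skipping the 3 damaged rows
         .option("header", "true")
         .option("inferSchema", "true")
         .load("/path/to/file.xlsx")
)
```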

Spark Calculate Standard deviation row wise

I need to calculate the standard deviation row-wise, assuming that I already have a column with the calculated mean per row. I tried this: SD= (reduce(sqrt((add, (abs(col
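
Row-wise, the sample standard deviation is sqrt( sum((x - mean)^2) / (n - 1) ) taken across the value columns. A sketch with hypothetical columns `c1..c3` and a precomputed `mean` column:

```python
from functools import reduce
from operator import add
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Toy frame with three value columns and a precomputed per-row mean.
df = spark.createDataFrame([(2.0, 4.0, 6.0, 4.0)], ["c1", "c2", "c3", "mean"])

cols = ["c1", "c2", "c3"]
n = len(cols)

# Sample standard deviation per row: sqrt( sum((x - mean)^2) / (n - 1) ).
squared_diffs = [(F.col(c) - F.col("mean")) ** 2 for c in cols]
df = df.withColumn("stddev_row", F.sqrt(reduce(add, squared_diffs) / (n - 1)))
df.show()
```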

Spark Compressing files from Multiple partitions into Single partition with larger files

I would like to take small parquet files that are spread out through multiple partition layers on s3 and compress them into larger files with a single partition
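
A common compaction pattern is to read everything, repartition down to fewer, larger partitions, and rewrite with the same partition column. A sketch; the S3 paths, partition column, and target partition count are assumptions to tune against actual data volume:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical S3 locations and partition column.
src = "s3://bucket/raw/table/"
dst = "s3://bucket/compacted/table/"

df = spark.read.parquet(src)

# repartition() shuffles into a target number of larger files; keying it by the
# partition column keeps each output partition's data together.
(df.repartition(16, "dt")
   .write.mode("overwrite")
   .partitionBy("dt")
   .parquet(dst))
```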

installing pyspark on windows

I have a few questions which I would like to clarify before installation. Please bear with me as I am still new to data science and installation packages. 1)
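
On Windows, a typical local setup is an extracted Spark distribution plus winutils.exe, with `findspark` pointing the Python session at it. A sketch; all paths below are assumptions for your machine:

```python
import os

# Hypothetical local paths; adjust to where Spark and winutils.exe actually live.
os.environ["SPARK_HOME"] = r"C:\spark\spark-3.4.1-bin-hadoop3"
os.environ["HADOOP_HOME"] = r"C:\hadoop"  # must contain bin\winutils.exe

import findspark
findspark.init()

from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()
print(spark.version)
```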

Error building maven project in Intellij : "object apache is not a member of package org"

Whenever I try to run my main program directly in IntelliJ I get this error: Error:(5, 12) object apache is not a member of package org import org.apache.common

Using scalamock: Could not find implicit value for evidence parameter of type error

I am writing unit tests for my spark/scala application. I am using scalamock as well to mock objects, specifically Session / Session Factory. In one of my test

How to make sure my DataFrame frees its memory?

I have a Spark/Scala job in which I do this: 1: Compute a big DataFrame df1 + cache it into memory 2: Use df1 to compute dfA 3: Read raw data into df2 (again,
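
The question is Scala, but the API is the same in PySpark: once everything downstream of the cached frame has been materialized, `unpersist()` releases its blocks. A minimal sketch with hypothetical paths:

```python
df1 = spark.read.parquet("/path/raw1").cache()   # hypothetical source
dfA = df1.groupBy("key").count()
dfA.write.parquet("/path/out_a")                 # materialize everything that needs df1

# Once nothing else depends on df1's cached blocks, release them explicitly.
# blocking=True waits until the executors have actually dropped the blocks.
df1.unpersist(blocking=True)
```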

Databricks Spark: java.lang.OutOfMemoryError: GC overhead limit exceeded i

I am executing a Spark job in a Databricks cluster. I am triggering the job via an Azure Data Factory pipeline and it executes at a 15-minute interval, so after the su
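
"GC overhead limit exceeded" usually points at driver or executor memory pressure rather than a code bug. A sketch of the kind of knobs to check; the values are placeholders, and on Databricks these are normally set on the cluster or job configuration rather than in code:

```python
from pyspark.sql import SparkSession

# Placeholder values; size them against the actual data and cluster.
spark = (
    SparkSession.builder
    .config("spark.driver.memory", "8g")
    .config("spark.executor.memory", "8g")
    .config("spark.sql.shuffle.partitions", "400")  # more, smaller shuffle partitions
    .getOrCreate()
)
```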

ImportError: No module named numpy on spark workers

Launching pyspark in client mode: bin/pyspark --master yarn-client --num-executors 60. Importing numpy in the shell goes fine but it fails in the kmeans. Someho
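
The executors run their own Python interpreter, so numpy must exist on every worker node and the workers must be pointed at that interpreter. A sketch; the interpreter path is an assumption:

```python
from pyspark.sql import SparkSession

# "spark.pyspark.python" must name an interpreter that exists on every worker
# node and has numpy installed; the path below is a placeholder.
spark = (
    SparkSession.builder
    .master("yarn")
    .config("spark.pyspark.python", "/opt/conda/envs/spark_env/bin/python")
    .getOrCreate()
)
```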

pyspark delta-lake metastore

Using "spark.sql.warehouse.dir" in the same jupyter session (no databricks) works. But after a kernel restart in jupyter the catalog db and tables arent't re

Create index for tables within Delta Lake

I'm new to the Delta Lake, but I want to create some indexes for fast retrieval for some tables in Delta Lake. Based on the docs, it shows that the closest is b
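
Delta Lake has no classic secondary indexes; the usual substitute is data skipping plus Z-ordering, which co-locates rows with similar values so file-level statistics can prune most files. A sketch using `OPTIMIZE ... ZORDER BY` (available on Databricks and in recent OSS Delta releases); the table and column names are hypothetical:

```python
# Z-order the table on the column most queries filter by.
spark.sql("OPTIMIZE events ZORDER BY (customer_id)")

# Queries filtering on the z-ordered column then scan far fewer files.
spark.sql("SELECT * FROM events WHERE customer_id = 42").show()
```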

Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext

Anyone know why I keep getting this error in Jupyter Notebooks? I've been trying to load my TensorFlow model into Apache Spark via SparkFlow but I can't seem to
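
This constructor error often means a SparkContext already exists in the long-lived notebook kernel, or the installed pyspark version does not match the Spark installation. Reusing the existing context sidesteps the duplicate-context case; a sketch:

```python
from pyspark import SparkConf, SparkContext

# Reuse an existing context instead of constructing a second one, which is a
# common cause of this error inside a Jupyter kernel.
conf = SparkConf().setAppName("sparkflow-demo").setMaster("local[*]")
sc = SparkContext.getOrCreate(conf)
```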

coding reduceByKey(lambda) in map doesn't work pySpark

I can't understand why my code isn't working. The last line is the problem: import findspark findspark.init() from pyspark import SparkConf, SparkContext from p
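
`reduceByKey` only works on an RDD of (key, value) pairs, so the `map` step must emit 2-tuples before it is called. A minimal working word count along those lines (the input lines are made up for illustration):

```python
import findspark
findspark.init()
from pyspark import SparkConf, SparkContext

sc = SparkContext.getOrCreate(SparkConf().setMaster("local[*]").setAppName("wordcount"))

data = sc.parallelize(["spark makes word count easy", "word count with spark"])

# flatMap -> words, map -> (word, 1) pairs, reduceByKey -> summed counts.
counts = (
    data.flatMap(lambda line: line.split(" "))
        .map(lambda word: (word, 1))
        .reduceByKey(lambda a, b: a + b)
)
print(counts.collect())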

Why Spark Submit causes NoSuchMethodError when I run a uber jar made though maven shade plugin?

I have an Apache Beam project which works fine if I run it directly. But if I try to create a jar using maven clean:package it creates an uber jar using maven sha

pyspark wordcount sort by value

I'm learning pyspark and trying the code below. Can someone help me understand what's wrong? >>> pairs=data.flatMap(lambda x:x.split(' ')).map(lambda x
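
After reducing to (word, count) pairs, sorting by value is usually done with `sortBy` on the second element of each pair. A sketch with made-up input lines:

```python
from pyspark import SparkConf, SparkContext

sc = SparkContext.getOrCreate(SparkConf().setMaster("local[*]").setAppName("wc-sort"))

data = sc.parallelize(["to be or not to be", "to do or not to do"])

pairs = (
    data.flatMap(lambda x: x.split(" "))
        .map(lambda w: (w, 1))
        .reduceByKey(lambda a, b: a + b)
)

# Sort by the count (the second element of each pair), highest first.
for word, count in pairs.sortBy(lambda kv: kv[1], ascending=False).collect():
    print(word, count)
```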