Category "apache-spark"

Pyspark-pandas not working on Spark 3.1.2

I am using Spark 3.1.2 and attempting to use pyspark-pandas. However, when attempting from pyspark import pandas as ps, I am getting the following error: ImportError…
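
The pandas API only ships inside PySpark from Spark 3.2 onward, which would explain an ImportError on 3.1.2. A minimal sketch of both import paths, assuming the separate Koalas package is installed for the 3.1.x case:

```python
# On Spark >= 3.2 the pandas API is bundled with PySpark:
import pyspark.pandas as ps

psdf = ps.DataFrame({"a": [1, 2, 3]})
print(psdf.head())

# On Spark 3.1.x the pyspark.pandas module does not exist; the predecessor
# package Koalas (pip install koalas) offers the same pandas-like API:
#   import databricks.koalas as ks
#   kdf = ks.DataFrame({"a": [1, 2, 3]})
```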

Job aborted due to stage failure: ShuffleMapStage 20 (repartition at data_prep.scala:87) has failed the maximum allowable number of times: 4

I am submitting a Spark job with the following specification (the same program has been used to run different sizes of data, ranging from 50GB to 400GB): /usr/hdp/2.6.0.3-8/…
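
Spark aborts a stage after a fixed number of consecutive failed attempts (4 by default, matching the message above). A hedged sketch of settings sometimes tuned for repartition-heavy jobs; the values are illustrative, not a known fix for this particular job:

```python
from pyspark.sql import SparkSession

# Illustrative settings only; the right values depend on cluster and data size.
spark = (
    SparkSession.builder
    .appName("repartition-job")
    # raise the per-stage retry limit (default is 4, as in the error above)
    .config("spark.stage.maxConsecutiveAttempts", "8")
    # more, smaller shuffle partitions can reduce per-task memory pressure
    .config("spark.sql.shuffle.partitions", "800")
    .getOrCreate()
)
```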

Databricks display() function equivalent or alternative to Jupyter

I'm in the process of migrating current Databricks Spark notebooks to Jupyter notebooks. Databricks provides the convenient and beautiful display(data_frame) function…
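
A rough stand-in is to pull a bounded sample to the driver and let Jupyter render it as a pandas table; a minimal sketch, assuming a live spark session as in a notebook:

```python
# Rough stand-in for Databricks' display(df) in Jupyter: pull a bounded
# sample to the driver and let Jupyter render the pandas DataFrame as an
# HTML table.
def display(df, n=100):
    return df.limit(n).toPandas()

df = spark.range(1000).toDF("id")  # toy DataFrame for illustration
display(df)  # as the last expression in a cell, Jupyter renders it richly
```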

How can you parse a string that is json from an existing temp table using PySpark?

I have an existing Spark dataframe with columns like this:

    pid | response
    ----+------------------
    12  | {"status":"200"}

response is a string…
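
The usual approach is from_json with an explicit schema; a minimal sketch, where the temp-table name is invented and the pid/response columns come from the snippet above:

```python
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType

# schema of the JSON held in the `response` string column
schema = StructType([StructField("status", StringType())])

df = spark.table("my_temp_table")  # hypothetical temp-table name
parsed = df.withColumn("response", from_json(col("response"), schema))
parsed.select("pid", "response.status").show()
```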

Where to set the S3 configuration in Spark locally?

I've set up a Docker container that starts a Jupyter notebook using Spark. I've integrated the necessary jars into Spark's directory to be able to access…
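
Locally, the S3A settings belong on the Hadoop configuration, reachable through spark.hadoop.* keys on the session builder. A sketch with placeholder credentials; it assumes hadoop-aws and a matching aws-java-sdk jar are on the classpath:

```python
from pyspark.sql import SparkSession

# Placeholder credentials; never hard-code real keys in notebooks.
spark = (
    SparkSession.builder
    .appName("s3-local")
    .config("spark.hadoop.fs.s3a.access.key", "MY_ACCESS_KEY")
    .config("spark.hadoop.fs.s3a.secret.key", "MY_SECRET_KEY")
    .config("spark.hadoop.fs.s3a.impl",
            "org.apache.hadoop.fs.s3a.S3AFileSystem")
    .getOrCreate()
)

df = spark.read.csv("s3a://my-bucket/some.csv", header=True)  # hypothetical path
```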

SparkSQL error: collect_set() cannot have map type data

For SparkSQL on Hive, when I use named_struct in the query, it returns results: SELECT id, collect_set(emp_info) as employee_info FROM (SELECT t.id, …
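
collect_set rejects map-typed elements, which is why the named_struct variant works. One workaround is to serialize the map to JSON before aggregating; a sketch with toy data, assuming Spark 2.4+ where to_json accepts maps:

```python
from pyspark.sql import functions as F

# Toy frame with a map column; collect_set over the raw map would fail,
# so the map is serialized to a JSON string first (a struct, as in the
# question, also works).
df = spark.createDataFrame(
    [(1, {"name": "ann", "dept": "eng"}), (1, {"name": "bob", "dept": "hr"})],
    ["id", "emp_info"],
)
result = df.groupBy("id").agg(
    F.collect_set(F.to_json(F.col("emp_info"))).alias("employee_info")
)
result.show(truncate=False)
```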

How to split a list to multiple columns in Pyspark?

I have:

    key  value
    a    [1,2,3]
    b    [2,3,4]

I want:

    key  value1  value2  value3
    a    1       2       3
    b    2       3       4

It seems that in Scala I can write…
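
In PySpark the same result falls out of indexing the array column; a minimal sketch that assumes every list has exactly three elements:

```python
from pyspark.sql.functions import col

df = spark.createDataFrame(
    [("a", [1, 2, 3]), ("b", [2, 3, 4])], ["key", "value"]
)

# one output column per array position (assumes 3-element lists)
split_df = df.select(
    "key",
    *[col("value")[i].alias(f"value{i + 1}") for i in range(3)],
)
split_df.show()
```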

Save Spark dataframe as dynamic partitioned table in Hive

I have a sample application that reads from CSV files into a dataframe. The dataframe can be stored to a Hive table in Parquet format using the method df.…
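
With Hive support enabled, dynamic partitioning needs two Hive settings plus partitionBy on the writer. A sketch where the table name and the event_date partition column are invented:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hive-dynamic-partitions")
    .enableHiveSupport()  # picks up hive-site.xml / the metastore
    .getOrCreate()
)

# Hive-side switches required for dynamic (non-static) partition inserts
spark.sql("SET hive.exec.dynamic.partition=true")
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

# Toy frame; `event_date` is an invented partition column
df = spark.createDataFrame(
    [(1, "2024-01-01"), (2, "2024-01-02")], ["id", "event_date"]
)
(df.write
   .mode("overwrite")
   .format("parquet")
   .partitionBy("event_date")
   .saveAsTable("mydb.events"))
```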

Run spark program locally with intellij

I tried to run a simple piece of test code in IntelliJ IDEA. Here is my code: import org.apache.spark.sql.functions._ import org.apache.spark.{SparkConf} import org.apa…

Airflow/Luigi for AWS EMR automatic cluster creation and pyspark deployment

I am new to Airflow automation. I don't know if it is possible to do this with Apache Airflow (or Luigi, etc.) or whether I should just write a long bash file to do this. I…
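
It is possible: the Amazon provider for Airflow ships operators covering exactly this flow (create cluster, add a PySpark step, wait, terminate). A compact sketch, assuming a recent Airflow 2.x with the Amazon provider installed; the cluster spec and script path are placeholders:

```python
import pendulum
from airflow import DAG
from airflow.providers.amazon.aws.operators.emr import (
    EmrAddStepsOperator,
    EmrCreateJobFlowOperator,
    EmrTerminateJobFlowOperator,
)
from airflow.providers.amazon.aws.sensors.emr import EmrStepSensor

# Minimal placeholder cluster spec; a real one also sets instance types, etc.
JOB_FLOW_OVERRIDES = {"Name": "pyspark-cluster", "ReleaseLabel": "emr-6.9.0"}
SPARK_STEPS = [{
    "Name": "pyspark_job",
    "ActionOnFailure": "CONTINUE",
    "HadoopJarStep": {
        "Jar": "command-runner.jar",
        "Args": ["spark-submit", "s3://my-bucket/job.py"],  # placeholder script
    },
}]

with DAG("emr_pyspark", start_date=pendulum.datetime(2023, 1, 1),
         schedule=None, catchup=False) as dag:
    create = EmrCreateJobFlowOperator(task_id="create",
                                      job_flow_overrides=JOB_FLOW_OVERRIDES)
    add_step = EmrAddStepsOperator(task_id="add_step",
                                   job_flow_id=create.output, steps=SPARK_STEPS)
    wait = EmrStepSensor(task_id="wait", job_flow_id=create.output,
                         step_id="{{ ti.xcom_pull(task_ids='add_step')[0] }}")
    terminate = EmrTerminateJobFlowOperator(task_id="terminate",
                                            job_flow_id=create.output,
                                            trigger_rule="all_done")
    create >> add_step >> wait >> terminate
```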

Microsoft Spark - JVM method execution failed: Nonstatic method 'csv' failed for class

I am trying to save the result as one "csv" file on Windows Server 2019. I'm using the "Microsoft.Spark" library. An empty folder is created with no "csv" file. The q…

How to fix org.apache.spark.SparkException: Job aborted due to stage failure Task & com.datastax.spark.connector.rdd.partitioner.CassandraPartition

In my project I am using the spark-cassandra-connector to read from a Cassandra table and process it further into a JavaRDD, but I am facing an issue while processing C…
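
A frequent cause of deserialization failures on connector classes such as CassandraPartition is that the connector jar never reaches the executors. Declaring it as a package ships it to both sides; a sketch where the version, host, keyspace, and table names are placeholders:

```python
from pyspark.sql import SparkSession

# Declaring the connector as a package ships it to the executors as well.
spark = (
    SparkSession.builder
    .appName("cassandra-read")
    .config("spark.jars.packages",
            "com.datastax.spark:spark-cassandra-connector_2.12:3.4.1")
    .config("spark.cassandra.connection.host", "127.0.0.1")
    .getOrCreate()
)

df = (spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(keyspace="my_keyspace", table="my_table")
      .load())
df.show()
```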

AttributeError: Can't get attribute '_fill_function' on <module 'pyspark.cloudpickle' from 'pyspark/cloudpickle/__init__.py'>

While executing PySpark code from a script, I am getting the following error on df.show(): from pyspark.sql.types import StructType, StructField, StringType, IntegerType…
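
This error commonly comes from a PySpark version mismatch between the driver and the cluster runtime, so functions pickled on one side cannot be reloaded on the other. A hedged sketch of pinning both sides to one interpreter; the path is an example:

```python
import os

# Point driver and workers at the same interpreter before creating the
# session; the path is an example and must match your environment.
os.environ["PYSPARK_PYTHON"] = "/usr/bin/python3"
os.environ["PYSPARK_DRIVER_PYTHON"] = "/usr/bin/python3"

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("version-check").getOrCreate()
print(spark.version)  # confirm this matches the pip-installed pyspark version
```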

PySpark error: AnalysisException: 'Cannot resolve column name

I am trying to transform an entire df into a single vector column, using df_vec = vectorAssembler.transform(df.drop('col200')), and I am being thrown this error: F…
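
This exception usually means a name handed to VectorAssembler (or drop) does not match the schema exactly; dots and spaces in column names are the classic triggers. A sketch that normalizes names first, on a toy frame:

```python
from pyspark.ml.feature import VectorAssembler

# Toy frame whose names contain a dot and a space -- common triggers for
# "Cannot resolve column name"; assumes a live `spark` session.
df = spark.createDataFrame([(1.0, 2.0, 3.0)], ["f.1", "f 2", "col200"])

# Normalize names so they match what VectorAssembler looks up; dots would
# otherwise be parsed as struct-field references.
clean = df.toDF(*[c.replace(".", "_").replace(" ", "_") for c in df.columns])

feature_cols = [c for c in clean.columns if c != "col200"]
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
df_vec = assembler.transform(clean.drop("col200"))
df_vec.show()
```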

Spark Scala Split dataframe into equal number of rows

I have a DataFrame and wish to divide it into chunks with an equal number of rows. In other words, I want a list of dataframes where each one is a disjoint subset of the…
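
The question is in Scala, but the pattern is the same in either API: assign global row numbers, then filter by chunk. A PySpark sketch on a toy frame:

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

df = spark.range(10)  # toy DataFrame; assumes a live `spark` session
n_chunks = 4          # illustrative chunk count

# Global row numbers; an un-partitioned window funnels rows through a single
# partition, so this pattern suits modest DataFrame sizes.
w = Window.orderBy(F.monotonically_increasing_id())
numbered = df.withColumn("rn", F.row_number().over(w))

total = numbered.count()
size = (total + n_chunks - 1) // n_chunks  # ceil; the last chunk may be smaller
chunks = [
    numbered.filter(
        (F.col("rn") > i * size) & (F.col("rn") <= (i + 1) * size)
    ).drop("rn")
    for i in range(n_chunks)
]
```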

pyspark SQL cannot resolve 'explode()' due to data type mismatch

Running a PySpark script, I get the following error depending on which XML I query: cannot resolve 'explode(...)' due to data type mismatch. The PySpark code: fr…
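
explode only accepts array or map columns; when the XML reader infers a field as a struct or string for some files (for example, when an element occurs only once), exactly this mismatch appears. A defensive sketch with an invented items column:

```python
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType

# Toy frame; in practice the schema comes from the XML reader, and `items`
# is an invented column name.
df = spark.createDataFrame([(1, [10, 20])], ["pid", "items"])

# Only explode when the inferred schema actually made the field an array;
# single occurrences often come back as a struct instead.
field_type = df.schema["items"].dataType
if isinstance(field_type, ArrayType):
    exploded = df.withColumn("item", F.explode("items"))
else:
    exploded = df.withColumn("item", F.col("items"))  # already a single element
```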

Vertica data into pySpark throws "Failed to find data source"

I have Spark 3.2 and Vertica 9.2. spark = SparkSession.builder.appName("Ukraine").master("local[*]")\ .config("spark.jars", '/home/shivamanand/spark-3.2.1-bin-hado…
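
"Failed to find data source" generally means no jar on the classpath registers the name passed to format(...). With the V2 Vertica Spark connector that name is com.vertica.spark.datasource.VerticaSource; a sketch under that assumption, with placeholder jar paths and connection options (the connector's docs list the full required option set):

```python
from pyspark.sql import SparkSession

# Both the Vertica Spark connector jar and the Vertica JDBC jar must be on
# spark.jars, or the source name below will fail to resolve.
spark = (
    SparkSession.builder
    .appName("Ukraine")
    .master("local[*]")
    .config("spark.jars",
            "/path/to/vertica-spark-connector.jar,/path/to/vertica-jdbc.jar")
    .getOrCreate()
)

df = (spark.read
      .format("com.vertica.spark.datasource.VerticaSource")
      .option("host", "vertica-host")
      .option("user", "dbadmin")
      .option("password", "secret")
      .option("db", "mydb")
      .option("table", "my_table")
      .load())
```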

reading and writing from hive tables with spark after aggregation

We have a Hive warehouse and want to use Spark for various tasks (mainly classification), at times writing the results back as a Hive table. For example, we wr…
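
The standard pattern is enableHiveSupport plus spark.sql for reads and saveAsTable for writes; a sketch with invented warehouse table names:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hive-roundtrip")
    .enableHiveSupport()  # picks up hive-site.xml / the metastore
    .getOrCreate()
)

# read from the warehouse, aggregate, write back as a managed table
src = spark.sql("SELECT label, feature FROM warehouse_db.training_data")
agg = src.groupBy("label").count()
agg.write.mode("overwrite").saveAsTable("warehouse_db.label_counts")
```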

Spark Delta table restore to version

I am trying to restore a Delta table to its previous version via Spark in Java, using a local IDE. The code is as below: import io.delta.tables.*; DeltaTable deltaTa…
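
Delta's Python API exposes the same restore call the Java snippet is building toward (DeltaTable.restoreToVersion, available since Delta Lake 1.2). A sketch with a placeholder path and version, assuming a session already configured for Delta Lake (delta-spark):

```python
from delta.tables import DeltaTable

# Placeholder table path and version number.
dt = DeltaTable.forPath(spark, "/tmp/delta/events")
dt.restoreToVersion(1)                  # restore by version number
# dt.restoreToTimestamp("2023-01-01")   # or restore by timestamp
```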

Spark hangs on union with zero running task

I have two RDDs of type RDD[T]. For example: val a: RDD[Integer] = ... val b: RDD[Integer] = ... When I perform val z = a.union(b) followed by println(z), I find the s…
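
union is a lazy transformation, so println(z) only prints the RDD's lineage string; no job runs (or hangs) until the first action. A small PySpark illustration of where execution actually starts:

```python
a = spark.sparkContext.parallelize([1, 2, 3])
b = spark.sparkContext.parallelize([4, 5, 6])

z = a.union(b)
print(z)          # prints only the lineage (UnionRDD...); nothing executes yet
print(z.count())  # an action like count() is what actually runs the job
```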