Category "apache-spark"

How to prevent SQL Server from stripping leading zeros when importing data

A data file is imported to a SQL Server table. One of the columns in data file is of text data type with values in this column to be integers only. The correspo

Why Apache Beam's Spark runner giving error Error: Failed to load org/apache/beam/sdk/options/PipelineOptions?

I tried to run a working Apache Beam code with Spark-Runner using command spark-submit --class org.apache.beam.examples.WordCount --master spark://localhost:404

Spark: retrieving old values of rows after casting made invalid input nulls

I am having trouble retrieving the old value before a cast of a column in spark. initially, all my inputs are strings and I want to cast the column num1 into a

Why would anybody choose Flink over spark? [closed]

I see spark to be superior to Flink. Below is my research. I see that most of features of Spark are covered in Flink, except for The Fair sche

Exception: Java gateway process exited before sending its port number

I'm facing an issue when trying to use pyspark=3.1.2. I have java 1.8 installed and added in my user path. But according to the docs it does not need any other

Limiting an RDD size

I have an RDD as follows: rdd .filter { case (_, record) => predicates.forall(_.accept(record)) } .toDS() .cache() It basically filters down an RDD

Spark structured streaming- checkpoint metadata growing indefinitely

I use spark struture streaming 3.1.2. I need to use s3 for storing checkpoint metadata (I know, it's not optimal storage for checkpoint metadata). Compaction in

spark-shell commands throwing error : “error: not found: value spark”

problem screenshot :14: error: not found: value spark import spark.implicits._ ^ :14: error: not found: value spark import spark.sql ^ here is my enviroment con

Reading from cassandra in Spark does not return all the data when using JoinWithCassandraTable

I'm querying data from Cassandra in Spark using SCC 2.5 and Spark 2.4.7 (pyspark). The table I'm reading from has a composite partition key (partition_key_1, pa

Better/Efficient way to filter out Spark Dataframe rows with multiple conditions

I have a dataframe look like this below id pub_date version unique_id c_id p_id type source lni001 20220301 1

Converting PySpark's consecutive withColumn to SQL

I need help in converting the below function into an SQL query: start_time :- 1649289600end_time :- 1649375999 test_data = df.withColumn("from_timestamp",to_t

Does this function computeSVD use MapReduce in Pyspark

Does computeSVD() use map , reduce since it is a predefined function? i couldn't know the code of the function. from pyspark.mllib.linalg import Vectors from py

Time Serie with delta time travel in databricks

I'm storing in a delta table the prices of products. The schema of the table is like this: id | price | updated 1 | 3 | 2022-03-21 2 | 4 | 2022-03-20

How to split csv comma separated value as single row in a new column using pyspark

I have a log file in csv which has a column contains a list of filepaths separated by comma. I want to split those filepaths into new rows using pyspark(or exce

java.lang.NoClassDefFoundError: org/apache/hadoop/hive/ql/metadata/HiveException while running DAG in Airflow

I am running a python project through DAG in airflow, and I encounter the following exception when the dag runs this line from the project - df = spark.sql(quer

Extract value from array in Spark

I am trying to extract a value from an array in SparkSQL, but getting the error below: Example column customer_details {"original_customer_id":"ch_382820","fi

Computing number of business days between start/end columns

I have two Dataframes facts: columns: data, start_date and end_date holidays: column: holiday_date What I want is a way to produce another Dataframe that has

Insert Spark dataframe to partitioned table

I have seen methods for inserting into Hive table, such as insertInto(table_name, overwrite =True, but I couldn't work out how to handle the scenario below. For

Hive beeline and spark load count doesn't match for hive tables

I am using spark 2.4.4 and hive 2.3 ... Using spark, I am loading a dataframe as Hive table using DF.insertInto(hiveTable) if new table is created during run (o

Issue with display()/collect() Large DataFrame In Pyspark

Getting The Following Issue In PySpark to perform display()/collect() operation on top of a generated dataframe. The df contains single column & Row (JSON d