Background: We are working on a solution to ingest huge sets of telemetry data from various clients. The data is in XML format and contains multiple independent
The code below is supposed to filter the data that contains string x or y. It works fine in the Spark shell, but when I run the script in bash it only finds the data contain
I am reading data through Kafka with a Spark windowing operation and union the Kafka and Hive data to make a single dataframe. I am trying to find the latest data with ti
We use Synapse Notebooks to perform data transformations and load the data into fact and dimension tables within our ADLSG2 data lake. We are disappointed with
val creation_timestamp = df.groupBy().agg(min($"userCreation_timestamp").alias("ts")).col("ts") df.filter(col("userCreation_timestamp").cast("timestamp") >=
A Databricks notebook is taking 2 hours to write to /dbfs/mnt (blob storage). The same job takes 8 minutes to write to /dbfs/FileStore. I would like to understand
I have a scala dataframe with the following schema: root |-- time: string (nullable = true) |-- itemId: string (nullable = true) |-- itemFeatures: map (nulla
I have various columns in a Spark DataFrame; they are nested JSON columns. In the configuration I will provide a list of columns and fields to remove from the JSON. For e
What does it mean for AWS to throw the following exception with a 200 status code Caused by: org.apache.hadoop.fs.s3a.AWSS3IOException: copyFile(vKg4OA16S76ItqD
I have many MongoDB dumps in gzip compressed BSON files, each with multiple documents. I would like to read them directly to Spark, ideally partitioning on indi
I have a dataframe like the one below. It is a dynamic dataframe and will grow as more Topic fields are added. val ds = Seq(("T1",0,44), ("T1",1,54),
I have two different dataframes in PySpark, both of String type. The first dataframe holds single words while the second holds strings of words, i.e., sentences. I have to check
I have set up a docker network with: 4 producer containers (each scraping a different forum) that produce to a Kafka container (1 topic for each of the forums) A
I have data as below +-----+---------+----------+ | TYPE|DTIN_MNTH|DTOUT_MNTH| +-----+---------+----------+ | A| 2022-03| 2022-05| | B| 2022-04|
In the following Python code I can successfully connect to an MS Azure SQL DB using an ODBC connection, and can load data into an Azure SQL table using pandas' datafra
I have this df: df = spark.createDataFrame( [('row_a', 5.0, 0.0, 11.0), ('row_b', 3394.0, 0.0, 4543.0), ('row_c', 136111.0, 0.0, 219255.0), (
I came across this question recently in an interview and haven't been able to find a satisfying answer. The incremental merge could co
I have a dataframe like below. No comp_value 1 [[ -> 10]] 2 [[ -> 35]] The schema type of column - value is. comp_value: array (nullable = tru
I have a Spark dataframe that looks like this: +-----+----------+--------+-----+ |key1 |date |variable|value| +-----+----------+--------+-----+ | A49|2022