I have the following PySpark dataframe and I want to find percentile row-wise. value col_a col_b col_c row_a 5.0 0.0 11.0 row_b 3394.0 0
I am new to Azure Databricks,I am trying to write a dataframe output to a delta table which consists TIMESTAMP column. But strangely it changes the TIMESTAMP pa
I am on windows and I am trying to follow this doc to create spark applications with VSCode using a Synapse workspace. I can sign into Azure and set a default S
I am try to using pysparkling.ml.H2OMOJOModel for predict a spark dataframe using a MOJO model trained with h2o==3.32.0.2 in AWS Glue Jobs, how ever a got the e
We all know parquet is column-oriented so we can get only columns we desired and reduce IO. But what if the parquet file is stored in HDFS, should we download t
I'm getting this exception when trying to create a DataFrame with only 50 millions rows. Any ideas how to avoid this problem? [Exception] [JvmBridge] Array dime
In Spark 3.2, special datetime values such as epoch, today, yesterday, tomorrow, and now are supported in typed literals or in cast of foldable strings only, fo
Short version: Need a faster/better way to update many column comments at once in spark/databricks. I have a pyspark notebook that can do this sequentially acro
I am trying to add a new column to my Spark Dataframe. New column added will be of a size based on a variable (say salt) post which I will use that column to ex
I am stuck in a very odd situation related to Hbase design i would say. Hbase version >> Version 2.1.0-cdh6.2.1 So, the problem statement is, in Hbase, w
I have installed Spark 3.0.0 on a Windows 64 bit machine with Python 3.9.7 using an anaconda base environment. I'm trying to execute the next code in the pyspar
I am currently trying to get a flatten a data in databricks table. Since some of the columns are deeply nested and is of 'String' type, i couldn't use explode f
val spark = SparkSession.builder().appName("Spark SQL basic example").config("spark.master", "local").getOrCreate() import spark.implicits._ case class Someth
I'm using databricks notebook to extract all field occurrences from a text column using the regexp_extract_all function. Here is the input: field_map#'IFDSIMP.7
I have the following code, which is used to (sha) hash columns in a spark dataframe: import org.apache.spark.sql.DataFrame import org.apache.spark.sql.functions
I want to run the same Java Spark Streaming (10 seconds micro batch) through 2 instances (sparkStr1 and sparkStr2). Mainly, they consume the same kafka topic (3
Here is the pyspark code which is running on jupyter notebook. import pyspark from delta import * builder = pyspark.sql.SparkSession.builder.appName("MyApp") \
I am using in Spark Structured Streaming foreachBatch() to maintain manually a sliding window, consisting of the last 200000 entries. With every microbatch I re
The apache beam pipeline (python) I'm currently working on contains a transformation which runs a docker container. While that works well during local testing w
Have an input csv like the one below, Need to escape the delimiter within one of the columns (2nd column): f1|f2|f3 v1|v2\|2|v3 x1|x2\|2|x3 spark.read.option(