Category "azure-databricks"

Spark binary file and Delta Table

I receive binary files (~3 MB each) in batches of roughly 20,000 files at a time. These files are used downstream for further processing, but I wa…
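If the goal is to land each batch in a Delta table for the downstream steps, a minimal sketch using Spark's built-in binaryFile reader; the paths and the glob pattern are placeholders, and `spark` is the session the Databricks notebook provides.

```python
# Read the raw files with the binaryFile source; each row carries path,
# modificationTime, length and the file content as a byte array.
raw_df = (
    spark.read.format("binaryFile")
    .option("pathGlobFilter", "*.bin")  # hypothetical extension
    .load("abfss://landing@myaccount.dfs.core.windows.net/batch-2024-01-01/")  # placeholder path
)

# Append the batch to a Delta table so downstream jobs can pick it up incrementally.
(
    raw_df.write.format("delta")
    .mode("append")
    .save("abfss://bronze@myaccount.dfs.core.windows.net/binary_files/")  # placeholder path
)
```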

Load Data Using Azure Batch Service and Spark Databricks

I have files in Azure Blob Storage that I need to load daily into the Data Lake. I am not clear on which approach I should use (Azure Batch account, Custom Activity…
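If Databricks itself is an option, a scheduled notebook job that copies the day's files from Blob Storage into the lake can be as small as the sketch below; the paths, file format and credential setup are assumptions, and `spark` is provided by the notebook runtime.

```python
# Daily copy sketch: read the source container for the day and land it in ADLS Gen2.
src = "wasbs://daily@sourceaccount.blob.core.windows.net/2024/01/01/"   # placeholder
dst = "abfss://raw@lakeaccount.dfs.core.windows.net/daily/2024/01/01/"  # placeholder

df = spark.read.format("csv").option("header", "true").load(src)  # adjust format to the real files
df.write.mode("overwrite").format("delta").save(dst)
```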

Read outlook emails in databricks

I would like to read mail from Microsoft Outlook using Python and run the script on a Databricks cluster. I'm using win32com on my local machine and am able to…
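win32com drives a locally installed Outlook over COM, which isn't available on a Linux Databricks cluster, so a common alternative is the Microsoft Graph API. A rough sketch, assuming an Azure AD app registration with application-level Mail.Read permission; all IDs, the mailbox address and the secret name are placeholders.

```python
import msal
import requests

TENANT_ID = "<tenant-id>"
CLIENT_ID = "<client-id>"
CLIENT_SECRET = dbutils.secrets.get("my-scope", "graph-client-secret")  # hypothetical secret

# Acquire an app-only token for Microsoft Graph.
app = msal.ConfidentialClientApplication(
    CLIENT_ID,
    authority=f"https://login.microsoftonline.com/{TENANT_ID}",
    client_credential=CLIENT_SECRET,
)
token = app.acquire_token_for_client(scopes=["https://graph.microsoft.com/.default"])

# List the latest messages in the target mailbox.
resp = requests.get(
    "https://graph.microsoft.com/v1.0/users/someone@contoso.com/messages?$top=10",
    headers={"Authorization": f"Bearer {token['access_token']}"},
)
for msg in resp.json().get("value", []):
    print(msg["subject"])
```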

How do I Insert Overwrite with parquet format?

I have two parquet files in Azure Data Lake Gen2 and I want to INSERT OVERWRITE one with the other. I was trying the same in Azure Databricks by doing the below. Reading…
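A minimal sketch of overwriting one parquet location with the contents of the other; both paths are placeholders, and the commented SQL variant applies if the locations are registered as tables.

```python
src_path = "abfss://data@myaccount.dfs.core.windows.net/source_table/"  # placeholder
tgt_path = "abfss://data@myaccount.dfs.core.windows.net/target_table/"  # placeholder

# Read the source parquet and overwrite the target location.
src_df = spark.read.parquet(src_path)
src_df.write.mode("overwrite").parquet(tgt_path)

# Alternatively, if both locations are registered as tables:
# spark.sql("INSERT OVERWRITE TABLE target_table SELECT * FROM source_table")
```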

Pyspark: join and union in for loop

I have a really simple piece of logic that I would like to understand how to make work in PySpark: for data in df1: spark_data_row = spark.createDataFrame(data…
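Rather than creating a one-row DataFrame per iteration, one common pattern is to collect the per-iteration DataFrames in a list and union them once at the end. A small sketch with stand-in loop logic:

```python
from functools import reduce
from pyspark.sql import DataFrame

parts = []
for i in range(3):  # stand-in for the real loop
    # stand-in for whatever per-iteration DataFrame you build
    part = spark.range(5).withColumnRenamed("id", "value")
    parts.append(part)

# Union everything in one pass instead of unioning inside the loop.
result = reduce(DataFrame.unionByName, parts)
result.show()
```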

Azure Databricks - Generate SQL Select Statement with Columns

I have tables in Azure Databricks that I interact with using SQL in a notebook. I need to select all columns from a table with 200 columns; I need to sel…
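One way to avoid typing 200 column names is to generate the SELECT list from the table's schema and then edit it. A minimal sketch; the table name is a placeholder.

```python
table_name = "my_db.my_wide_table"          # placeholder
cols = spark.table(table_name).columns      # list of column names from the schema

# Build an explicit SELECT statement that can be edited column by column.
select_stmt = "SELECT\n  " + ",\n  ".join(cols) + f"\nFROM {table_name}"
print(select_stmt)  # paste into a %sql cell, or run with spark.sql(select_stmt)
```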

Is it possible to set only one branch at a Databricks shared Git folder (highlighted in screenshot)?

I would like to set only one branch on a shared Git folder in the Databricks workspace. I am attaching a screenshot to give more clarity. All of the Data Factory pipeli…
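The Git folder UI doesn't appear to restrict which branches can be checked out, but the Repos REST API can switch a Git folder to a specific branch programmatically, e.g. from a scheduled job that keeps it pinned. A sketch with placeholder host, repo ID and token:

```python
import requests

DATABRICKS_HOST = "https://adb-1234567890123456.7.azuredatabricks.net"  # placeholder
TOKEN = dbutils.secrets.get("my-scope", "databricks-pat")               # hypothetical secret
REPO_ID = "1234567890"                                                  # placeholder

# Check the shared Git folder out on a single, fixed branch.
resp = requests.patch(
    f"{DATABRICKS_HOST}/api/2.0/repos/{REPO_ID}",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"branch": "main"},
)
resp.raise_for_status()
print(resp.json())
```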

VACUUM/OPTIMIZE Effect on Autoloader Checkpoints

I'm using Databricks Autoloader to incrementally stream from a Delta Lake table into a SQL database. If an OPTIMIZE or VACUUM statement is run against the Delt…
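As I understand it, VACUUM only deletes files that are no longer referenced (so it only matters if the stream lags past the retention window), while OPTIMIZE rewrites data files, which a Delta streaming source will refuse to process unless told to ignore rewrites. A sketch of streaming the Delta table with that option set; the paths, JDBC details and table names are placeholders.

```python
def write_to_sql(batch_df, batch_id):
    (batch_df.write.format("jdbc")
        .option("url", "jdbc:sqlserver://myserver.database.windows.net;database=mydb")  # placeholder
        .option("dbtable", "dbo.events")   # placeholder
        .option("user", "sqluser")         # placeholder
        .option("password", "***")
        .mode("append")
        .save())

stream = (
    spark.readStream.format("delta")
    .option("ignoreChanges", "true")       # tolerate files rewritten by OPTIMIZE
    .load("/mnt/delta/events")             # placeholder path
)

query = (
    stream.writeStream
    .option("checkpointLocation", "/mnt/checkpoints/events_to_sql")  # placeholder
    .foreachBatch(write_to_sql)
    .start()
)
```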

IoT - Databricks Deltalake - access in C# api or Node js API

I am working on an IoT solution where multiple sensors send data. I have one job which listens to Event Hub, gets the IoT sensor data, and sto…
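One way to expose the Delta table to a C# or Node.js API without embedding Spark is the Databricks SQL Statement Execution REST API, which is plain HTTPS + JSON and therefore callable from any language. A Python sketch of the call; the host, warehouse ID, token and table name are placeholders.

```python
import requests

HOST = "https://adb-1234567890123456.7.azuredatabricks.net"  # placeholder
TOKEN = "<personal-access-token>"                            # placeholder

resp = requests.post(
    f"{HOST}/api/2.0/sql/statements",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={
        "warehouse_id": "<warehouse-id>",  # placeholder SQL warehouse
        "statement": "SELECT * FROM iot.sensor_readings ORDER BY event_time DESC LIMIT 100",
        "wait_timeout": "30s",
    },
)
resp.raise_for_status()
print(resp.json().get("result", {}).get("data_array"))
```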

How to handle a memory issue when writing data in which a particular column contains very large data in each record, in Databricks with PySpark

I have a set of records with 10 columns. There is a column 'x' which contains an array of float values, and the length of the array can be very large (e.g., the len…
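One mitigation is to spread the rows across many more, smaller partitions and cap the rows per output file, so no single task or file has to hold too many of the huge arrays at once. A sketch with illustrative paths and numbers:

```python
df = spark.read.format("delta").load("/mnt/bronze/wide_arrays")  # placeholder path

(
    df.repartition(2000)                    # many small partitions -> less memory per task
    .write.format("delta")
    .option("maxRecordsPerFile", 10000)     # cap rows per file so no single file balloons
    .mode("overwrite")
    .save("/mnt/silver/wide_arrays")        # placeholder path
)
```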

pycaret.time_series.TSForecastingExperiment ImportError: cannot import name '_check_param_grid' from 'sklearn.model_selection._search'

I am getting the below error while importing the PyCaret time-series (beta) module in Databricks (it was running successfully earlier). Requesting your help in so…
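`_check_param_grid` was removed from newer scikit-learn releases, so a commonly suggested workaround is pinning scikit-learn to an older version and restarting the Python process; the version bound below is an assumption to verify against the PyCaret release in use.

```python
# In one notebook cell: pin scikit-learn (the "<1.3" bound is an assumption --
# match it to the release where _check_param_grid disappeared).
%pip install "scikit-learn<1.3"

# In the next cell: restart the Python process so the pinned version is loaded,
# then import PyCaret again.
dbutils.library.restartPython()
```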

How to get workspace name inside a python notebook in databricks

I am trying to get the workspace name inside a Python notebook. Is there any way we can do this? For example, my workspace name is databricks-test; I want to capture this i…
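There is no single official "workspace name" helper that I know of, but the workspace URL is exposed as a Spark conf and the notebook context tags carry the host name; both are sketched below. The context object is an internal API that may change between runtimes.

```python
import json

# Workspace URL from Spark conf; the first DNS label is often the workspace's name.
workspace_url = spark.conf.get("spark.databricks.workspaceUrl")
workspace_name = workspace_url.split(".")[0]
print(workspace_url, workspace_name)

# Notebook context tags (internal API) also expose the browser host name.
ctx = json.loads(dbutils.notebook.entry_point.getDbutils().notebook().getContext().toJson())
print(ctx["tags"].get("browserHostName"))
```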

Create a database with a name from a variable on Databricks (in SQL, not in Spark)

How to create a database with a name from a variable (in SQL, not in Spark)? I've written this: %sql SET myVar = CONCAT(getArgument('env'), 'BackOffice'); CRE…
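If dropping to Python for this one statement is acceptable, a minimal fallback sketch; the widget name and database naming follow the question, and the exact pure-SQL substitution syntax (which varies by runtime) is sidestepped.

```python
# Same value that getArgument('env') returns in a SQL cell.
env = dbutils.widgets.get("env")
db_name = f"{env}BackOffice"

# Build the database name in Python and run the DDL once.
spark.sql(f"CREATE DATABASE IF NOT EXISTS {db_name}")
```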

Shap value plotting error on Databricks but works locally

I want to do a simple shap analysis and plot a shap.force_plot. I noticed that it works without any issues locally in a .ipynb file, but fails on Databricks wit…
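shap.force_plot emits JavaScript by default, which a Databricks notebook cell doesn't render the way Jupyter does; a common workaround for a single observation is the matplotlib backend. A sketch with placeholder model and data:

```python
import shap
import matplotlib.pyplot as plt

# 'model' and 'X_sample' are placeholders for a fitted tree model and its feature frame.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_sample)

shap.force_plot(
    explainer.expected_value,
    shap_values[0, :],
    X_sample.iloc[0, :],
    matplotlib=True,   # render with matplotlib instead of JS
    show=False,
)
plt.show()
```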

IllegalArgumentException: File must be dbfs or s3n: /

dbutils.fs.mount(source = f"wasbs://{blob.storage_account_container}@{blob.storage_account_name}.blob.core.windows.net/", mount_point = "/mnt/MLRExtract/"…
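That error usually points at a malformed source URI or mount point rather than the storage itself. A complete mount sketch for comparison; the account, container and secret names are placeholders.

```python
storage_account = "myaccount"   # placeholder
container = "mlrextract"        # placeholder

dbutils.fs.mount(
    # Full wasbs:// URL for the container root.
    source=f"wasbs://{container}@{storage_account}.blob.core.windows.net/",
    mount_point="/mnt/MLRExtract",
    # Account key (or SAS) goes in extra_configs.
    extra_configs={
        f"fs.azure.account.key.{storage_account}.blob.core.windows.net":
            dbutils.secrets.get("my-scope", "storage-account-key")  # hypothetical secret
    },
)
```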

Recursive View Message - Azure Databricks

Error: AnalysisException: Recursive view management_db.v_extract detected (cycle: management_db.v_extract -> management_db.v_extract). Query outside of the v…
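The cycle typically means the CREATE OR REPLACE VIEW statement selects from the very view it is redefining. One fix is to point the new definition at the underlying table(s) instead; a sketch with placeholder table and column names:

```python
spark.sql("""
    CREATE OR REPLACE VIEW management_db.v_extract AS
    SELECT id, payload, load_date
    FROM management_db.extract_base          -- base table, not the view itself
    WHERE load_date >= date_sub(current_date(), 30)
""")
```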

Executing Stored Procedure in Databricks when using Azure Apache Spark connector

The following example from the Azure team uses the Apache Spark connector for SQL Server to write data to a table. Question: how can we execute a stored procedure in an…
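The Spark connector itself only reads and writes DataFrames, so a common workaround is to call the procedure over plain JDBC using the SQL Server driver already on the cluster. A sketch that goes through the JVM DriverManager via py4j internals, so treat it as a workaround; the connection details and procedure name are placeholders.

```python
jdbc_url = "jdbc:sqlserver://myserver.database.windows.net:1433;database=mydb"  # placeholder
user = "sqluser"                                                                # placeholder
password = dbutils.secrets.get("my-scope", "sql-password")                      # hypothetical secret

# Reach the JVM's DriverManager through py4j (internal handles, may change).
driver_manager = spark._sc._gateway.jvm.java.sql.DriverManager
conn = driver_manager.getConnection(jdbc_url, user, password)
try:
    # Standard JDBC call escape syntax for a stored procedure with one parameter.
    stmt = conn.prepareCall("{call dbo.usp_refresh_staging(?)}")  # hypothetical procedure
    stmt.setString(1, "2024-01-01")
    stmt.execute()
finally:
    conn.close()
```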

localhost refused to connect in a databricks notebook calling the google api

I read the Google API documentation pages (Drive API, pyDrive) and created a Databricks notebook to connect to Google Drive. I used the sample code in the d…
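The quickstart samples use an OAuth flow that opens a browser and redirects to localhost, which the remote Databricks driver cannot serve; a service account avoids the interactive step. A sketch, assuming the Drive content is shared with the service account's e-mail address; the key file path is a placeholder.

```python
from google.oauth2 import service_account
from googleapiclient.discovery import build

# Service-account credentials instead of the browser-based flow.
creds = service_account.Credentials.from_service_account_file(
    "/dbfs/FileStore/keys/drive-sa.json",                     # hypothetical key file
    scopes=["https://www.googleapis.com/auth/drive.readonly"],
)
drive = build("drive", "v3", credentials=creds)

# List a few files visible to the service account.
files = drive.files().list(pageSize=10, fields="files(id, name)").execute()
for f in files.get("files", []):
    print(f["id"], f["name"])
```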

Printing secret value in Databricks

Even though secrets are meant to mask confidential information, I need to see the value of a secret in order to use it outside Databricks. When I simply print the sec…
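Notebook output redacts anything that exactly matches a secret value, so the usual trick is to print it in a slightly altered form that the redaction pattern does not match (knowingly defeating the masking). A sketch with placeholder scope and key names:

```python
secret = dbutils.secrets.get(scope="my-scope", key="my-key")  # placeholders

# Insert a space between characters so the output no longer matches the secret
# exactly and therefore is not replaced by [REDACTED].
print(" ".join(secret))
```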

SQL order of execution

I wonder how this query executes successfully. As we know, the 'having' clause is evaluated before the 'select' one, so how is the alias name used in the 'select' statement…
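In Spark SQL (and several other engines), the analyzer accepts a SELECT-list alias inside HAVING and substitutes it with the underlying aggregate expression, even though HAVING is logically evaluated before the projection. A small runnable illustration:

```python
# Build a tiny table and reference the SELECT-list alias 'cnt' in HAVING.
spark.range(10).withColumnRenamed("id", "amount").createOrReplaceTempView("orders")

spark.sql("""
    SELECT amount % 3 AS bucket, COUNT(*) AS cnt
    FROM orders
    GROUP BY amount % 3
    HAVING cnt > 2          -- alias resolved to COUNT(*) by the analyzer
""").show()
```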