Is there a canonical way to compute a weighted average in PySpark while ignoring missing values in the denominator sum? Take the following example:
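A minimal sketch of one way to do this, assuming hypothetical columns grp, value and weight: nulls are excluded from the numerator automatically by sum(), and from the denominator by masking the weight.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("a", 10.0, 2.0), ("a", None, 3.0), ("b", 20.0, 1.0)],
    ["grp", "value", "weight"],
)

# Count a weight in the denominator only when the value is non-null;
# sum() already skips the null numerator terms.
masked_weight = F.when(F.col("value").isNotNull(), F.col("weight"))
result = df.groupBy("grp").agg(
    (F.sum(F.col("value") * F.col("weight")) / F.sum(masked_weight)).alias("wavg")
)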
I have two Parquet files in Azure Data Lake Gen2 and I want to insert-overwrite one with the other. I was trying to do this in Azure Databricks as shown below.
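A minimal sketch with placeholder abfss paths: read the source and overwrite the target. Note that overwriting a path while simultaneously reading from it fails, so source and target must be distinct (or the source must be materialized first).

src = "abfss://container@account.dfs.core.windows.net/source/"   # placeholder path
dst = "abfss://container@account.dfs.core.windows.net/target/"   # placeholder path

df = spark.read.parquet(src)
df.write.mode("overwrite").parquet(dst)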
What the console of the Kafka consumer looks like:
["2017-12-31 16:06:01", 12472391, 1]
["2017-12-31 16:06:01", 12472097, 1]
["2017-12-31 16:05:59", 12471979,
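Assuming the goal is to parse those JSON-array records in Structured Streaming (the broker, topic, and field meanings below are all assumptions), get_json_object can pull the elements out by position:

from pyspark.sql import functions as F

raw = (spark.readStream.format("kafka")
       .option("kafka.bootstrap.servers", "localhost:9092")  # assumed broker
       .option("subscribe", "events")                        # assumed topic
       .load())

parsed = raw.select(F.col("value").cast("string").alias("v")).select(
    F.get_json_object("v", "$[0]").alias("event_time"),
    F.get_json_object("v", "$[1]").cast("long").alias("some_id"),  # hypothetical meaning
    F.get_json_object("v", "$[2]").cast("int").alias("count"),     # hypothetical meaning
)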
The table looks like this:
id | ADDRESS
0  | 6101 SUMMITVIEW AVE STE 200 YAKIMA
1  | 527 CEDAR WAY SUITE 105 OAKMONT
2  | 1700 N ROSE AVE SUITE 460 OXNARD
3  | 1275 YORK AVE NEW YOR
I have a unit test for Databricks code, and I want to run it locally on Windows. Unfortunately, when I run pytest from PyCharm, it throws an exception.
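The exception itself is cut off above, but a common baseline for running such tests locally is a session-scoped SparkSession fixture (a sketch; on Windows, Spark generally also needs HADOOP_HOME pointing at a directory containing winutils.exe):

# conftest.py
import pytest
from pyspark.sql import SparkSession

@pytest.fixture(scope="session")
def spark():
    session = (SparkSession.builder
               .master("local[2]")
               .appName("local-unit-tests")
               .getOrCreate())
    yield session
    session.stop()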
I have two tables, one with 50K records and the other with 2.5K records, and I want to update those 2.5K records into table one.
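If both are Delta tables, a MERGE avoids rewriting the 50K-row table by hand; this sketch assumes table names table_one/table_two and a join key id, none of which appear in the question:

from delta.tables import DeltaTable

target = DeltaTable.forName(spark, "table_one")   # assumed target name
updates = spark.table("table_two")                # the 2.5K-row table (assumed name)

(target.alias("t")
 .merge(updates.alias("u"), "t.id = u.id")        # assumed key column
 .whenMatchedUpdateAll()
 .whenNotMatchedInsertAll()
 .execute())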
Is there a way of counting approximately after a GROUP BY on a SQL dataset in Spark? Or, more generally, what is the fastest way to do grouped counting in Spark?
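A plain groupBy().count() is already a partial (map-side) aggregation, so it is usually the fastest exact option; if "approximate" means distinct values per group, approx_count_distinct trades accuracy for speed. A sketch with assumed column names key and value:

from pyspark.sql import functions as F

exact = df.groupBy("key").count()

# HyperLogLog++-based approximation with a 5% relative error target.
approx = df.groupBy("key").agg(
    F.approx_count_distinct("value", rsd=0.05).alias("approx_distinct")
)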
I am attempting to use Scala with Apache Spark locally to query a Hive table which is secured with Kerberos. I have no issues connecting and querying the data.
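For reference, a hedged sketch of the session setup (shown in PySpark; the same configs apply from Scala's SparkSession.builder). The principal and keytab values are placeholders:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("kerberized-hive")
         .config("spark.kerberos.principal", "user@EXAMPLE.COM")   # placeholder
         .config("spark.kerberos.keytab", "/path/to/user.keytab")  # placeholder
         .enableHiveSupport()
         .getOrCreate())

spark.sql("SELECT * FROM secured_db.some_table LIMIT 10").show()  # hypothetical table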
I want to use Spark SQL or PySpark to reformat a date field from 'dd/mm/yyyy' to 'yyyy/mm/dd'. The field type is string.
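A minimal sketch, assuming the column is called date_str: parse with to_date, then re-render with date_format. Note that pattern letters are case-sensitive in Spark (MM is month, mm is minutes).

from pyspark.sql import functions as F

df = df.withColumn(
    "date_reformatted",
    F.date_format(F.to_date("date_str", "dd/MM/yyyy"), "yyyy/MM/dd"),
)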
Consider a PySpark DataFrame. I would like to summarize the entire DataFrame, per column, and append the result to every row.
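One way is to compute the single-row summary once and broadcast-cross-join it back onto every row; the column names and aggregates here are assumptions:

from pyspark.sql import functions as F

summary = df.agg(
    F.max("col1").alias("max_col1"),   # assumed columns/aggregates
    F.avg("col2").alias("avg_col2"),
)

# summary has exactly one row, so the cross join just appends the
# aggregate columns to every row of df.
result = df.crossJoin(F.broadcast(summary))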
I'm testing some PySpark code in an EMR notebook before I deploy it, and I keep running into a strange error with Spark SQL.
Given dates in one column, how can I create a column containing the ISO week date? The ISO week date is composed of the year, the week number and the weekday; the ISO year is not the same as the calendar year near the year boundary.
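A sketch using only Spark SQL builtins, assuming the date column is called d. The ISO year is taken from the Thursday of the date's week, which is exactly what makes it differ from the calendar year at the boundaries; dayofweek is remapped because it counts Sunday as 1.

from pyspark.sql import functions as F

# ISO weekday: 1 = Monday ... 7 = Sunday.
df = df.withColumn(
    "iso_week_date",
    F.expr("""
        concat_ws('-',
            year(date_add(d, 4 - (pmod(dayofweek(d) + 5, 7) + 1))),
            lpad(weekofyear(d), 2, '0'),
            pmod(dayofweek(d) + 5, 7) + 1
        )
    """),
)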
I have a problem merging CSV files into a Delta table using PySpark SQL. I managed to create an upsert function that updates if matched and inserts if not matched.
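A SQL-flavored sketch of such an upsert; the target table name, key column, and CSV path are assumptions:

updates = spark.read.option("header", "true").csv("/path/to/incoming.csv")  # placeholder path
updates.createOrReplaceTempView("updates")

spark.sql("""
    MERGE INTO target_delta_table t   -- assumed Delta table name
    USING updates u
    ON t.id = u.id                    -- assumed key column
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")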
I have a PySpark DataFrame, df, with some columns as shown below. The hour column is in UTC time, and I want to create a new column that has the local time based on each row's time zone.
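Assuming there is a timezone column holding IANA zone names (e.g. 'America/New_York'), from_utc_timestamp accepts a column for the zone since Spark 2.4:

from pyspark.sql import functions as F

df = df.withColumn(
    "local_time",
    F.from_utc_timestamp(F.col("hour"), F.col("timezone")),  # timezone column assumed
)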
I have a requirement where I am reading data from a CSV file and writing it to a Delta table, using Scala on Windows. My Scala code is given below.
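The Scala code is cut off above; for reference, the equivalent pipeline as a PySpark sketch with placeholder paths. On Windows, Spark typically also needs HADOOP_HOME set to a directory containing winutils.exe.

df = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("C:/data/input.csv"))                                          # placeholder path

df.write.format("delta").mode("overwrite").save("C:/data/delta/events")  # placeholder path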
df1 = df.withColumn('etl_load_dt_part_new', concat_ws("-", year(df.ETL_LOAD_DT_PART), lit('12'), lit('31')).cast('date')) I am trying to add a new column named etl_load_dt_part_new.
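For completeness, the same expression with its imports; concat_ws casts the integer year to string, and the resulting 'yyyy-12-31' string casts cleanly to a date:

from pyspark.sql.functions import concat_ws, year, lit

df1 = df.withColumn(
    "etl_load_dt_part_new",
    concat_ws("-", year(df.ETL_LOAD_DT_PART), lit("12"), lit("31")).cast("date"),
)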
This is my dataset:
from pyspark.sql import SparkSession, functions as F
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([('2021-02-07',)
In my project, I need to read an image dataset (each folder holds a different object, and I want to read these folders as a stream, one by one) and then extract features.
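A sketch using the binaryFile source with Structured Streaming, which processes files in bounded micro-batches; the path, glob, and trigger size are assumptions:

images = (spark.readStream
          .format("binaryFile")
          .option("pathGlobFilter", "*.jpg")   # assumed image extension
          .option("maxFilesPerTrigger", 100)   # bound each micro-batch
          .load("/data/images/*"))             # placeholder path; * covers the per-object folders

# Downstream, the `content` binary column can be handed to a feature
# extractor, for example inside a pandas UDF.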
I am trying to create a table in Spark SQL by providing the schema and the location. However, when I run a SELECT on the table, I see only half the columns.
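A sketch of the pattern, with hypothetical names. One thing worth checking when columns come back missing or null: for file-based tables the declared column names must match the names stored in the data files, and partition columns must not be repeated in the main column list.

spark.sql("""
    CREATE TABLE IF NOT EXISTS my_db.events (   -- hypothetical name
        id BIGINT,
        name STRING,
        event_date DATE
    )
    USING PARQUET
    LOCATION '/data/events'                     -- placeholder path
""")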
I have a case where I may have null values in the column that needs to be summed up in a group. If I encounter a null in a group, I want the sum of that group to be null.
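sum() skips nulls by default, so the usual trick is to compare the non-null count with the row count and return null when they differ; the column names grp/val are assumptions:

from pyspark.sql import functions as F

result = df.groupBy("grp").agg(
    # when() without otherwise() yields null, which fires whenever
    # count(val) dropped any null rows.
    F.when(F.count("val") == F.count(F.lit(1)), F.sum("val")).alias("sum_val")
)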