Is there a canonical way to compute the weighted average in pyspark ignoring missing values in the denominator sum? Take the following example: # create data da
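For the weighted-average question above, a minimal sketch (assuming hypothetical `value` and `weight` columns) that keeps a weight out of the denominator whenever the corresponding value is missing:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical data: 'value' may contain nulls, 'weight' is always present
df = spark.createDataFrame(
    [(1.0, 2.0), (3.0, 1.0), (None, 4.0)], ["value", "weight"]
)

# Only count a weight in the denominator when the corresponding value is present
weighted = df.agg(
    (
        F.sum(F.col("value") * F.col("weight"))
        / F.sum(F.when(F.col("value").isNotNull(), F.col("weight")))
    ).alias("weighted_avg")
)
weighted.show()
```

Because `sum` skips nulls and the `when` without an `otherwise` yields null, rows with a missing value drop out of both the numerator and the denominator.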
I have two parquet files in Azure Data Lake Gen2 and I want to Insert Overwrite one with the other. I was trying to do this in Azure Databricks as shown below. Reading
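For the insert-overwrite question, a hedged sketch with placeholder ADLS Gen2 paths: read the source Parquet data and overwrite the target location with it.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical ADLS Gen2 paths; replace container/account with your own
source_path = "abfss://container@account.dfs.core.windows.net/source/"
target_path = "abfss://container@account.dfs.core.windows.net/target/"

# Read the source Parquet data and overwrite the target location with it
src_df = spark.read.parquet(source_path)
src_df.write.mode("overwrite").parquet(target_path)
```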
When joining two tables, I would like to select all columns except 2 of them from a large table with many columns, using pyspark sql on Databricks. My pyspark sql: %
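For the select-all-but-two question, a small sketch with hypothetical column names: joining and then dropping the unwanted columns avoids listing every column you want to keep.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical tables; 'big' has many columns, two of which we don't want
big = spark.createDataFrame([(1, "a", "x", 10)], ["id", "c1", "c2", "c3"])
small = spark.createDataFrame([(1, "extra")], ["id", "extra"])

joined = big.join(small, on="id", how="inner")

# Drop the unwanted columns after the join instead of listing every kept column
result = joined.drop("c2", "c3")
result.show()
```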
I've installed apache spark on my mac with 16 GB of RAM to test my pyspark code locally with small data sets before I test it on a real cluster. I've installed
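For local testing as described above, a minimal sketch of a local session; `local[4]` and the app name are just example values.

```python
from pyspark.sql import SparkSession

# A minimal local session for testing small data sets;
# "local[4]" runs Spark in-process on 4 cores (adjust as needed)
spark = (
    SparkSession.builder
    .master("local[4]")
    .appName("local-testing")
    .getOrCreate()
)

print(spark.range(10).count())  # quick smoke test
spark.stop()
```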
I wrote a file to parquet containing 1,000,000 rows. When I read the parquet file back, the result is 1,000,000 rows. df = spark.read.parquet(parquet_path) df.
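A minimal round-trip sketch of the scenario described (hypothetical path): write 1,000,000 rows, read them back, and count.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical path; write 1,000,000 rows and read them back
parquet_path = "/tmp/million_rows.parquet"

spark.range(1_000_000).write.mode("overwrite").parquet(parquet_path)

df = spark.read.parquet(parquet_path)
print(df.count())  # expected: 1000000
```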
So I have parquet files in an S3 bucket and I want to load them using pyspark in Python, but I'm getting an error. Here's what I have tried so far. I'm using Juput
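For reading Parquet from S3, a hedged sketch: the `hadoop-aws` version must match the local Spark/Hadoop build, and the bucket name and credentials are placeholders.

```python
from pyspark.sql import SparkSession

# Sketch only: pick the hadoop-aws version that matches your Spark/Hadoop build,
# and replace the placeholder bucket and credentials
spark = (
    SparkSession.builder
    .appName("read-s3-parquet")
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.3.4")
    .config("spark.hadoop.fs.s3a.access.key", "<ACCESS_KEY>")
    .config("spark.hadoop.fs.s3a.secret.key", "<SECRET_KEY>")
    .getOrCreate()
)

df = spark.read.parquet("s3a://my-bucket/path/to/data/")
df.show(5)
```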
I have some really simple logic and I would like to understand how I can make it work in pyspark. for data in df1: spark_data_row = spark.createDataFrame(data
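For the row-by-row loop above, a sketch with hypothetical data showing the two usual options: collect the rows to the driver when the data is small, or (preferably) express the per-row logic as a column transformation.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical input
df1 = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])

# Option 1: small data only - bring rows to the driver and loop over them
for row in df1.collect():
    print(row["id"], row["val"])

# Option 2 (preferred): express the per-row logic as a column transformation
result = df1.withColumn("id_plus_one", F.col("id") + 1)
result.show()
```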
I want to stop my spark instance here once I complete my job running in a Jupyter notebook. I executed spark.stop() at the end, but when I open my terminal, I'
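A minimal sketch of the shutdown pattern: `spark.stop()` releases the SparkContext and its resources, while the Jupyter kernel (the Python process) keeps running until the notebook itself is shut down.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("notebook-job").getOrCreate()

# ... job logic ...

# Stops the SparkContext and releases its resources; the Jupyter kernel
# (the Python process) itself keeps running until the notebook is shut down
spark.stop()
```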
I need to give the correct schema to an RDD I have, but I'm struggling with a MapType that has different value types. I guess the problem is that one specific key ha
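For the MapType question, a sketch under the assumption that mixed value types are normalized to strings, since a `MapType` allows only a single value type; the field names are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, MapType

spark = SparkSession.builder.getOrCreate()

# A MapType needs one value type, so mixed values are stored as strings here
schema = StructType([
    StructField("id", StringType(), True),
    StructField("attributes", MapType(StringType(), StringType(), True), True),
])

rdd = spark.sparkContext.parallelize([
    ("a", {"height": "180", "active": "true"}),  # values pre-cast to str
])
df = spark.createDataFrame(rdd, schema)
df.printSchema()
```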
This is what the console of the Kafka consumer looks like:
["2017-12-31 16:06:01", 12472391, 1]
["2017-12-31 16:06:01", 12472097, 1]
["2017-12-31 16:05:59", 12471979,
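A hedged Structured Streaming sketch for consuming the same topic from Spark; the broker address and topic name are placeholders, and the `spark-sql-kafka` package matching your Spark version must be on the classpath.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholders: broker address and topic name
stream = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "my_topic")
    .load()
)

# Kafka delivers key/value as binary; cast the value to a string
# (the JSON-array payload can then be parsed further)
messages = stream.selectExpr("CAST(value AS STRING) AS value")

query = messages.writeStream.format("console").start()
query.awaitTermination()
```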
Table is like this
id  ADDRESS
0   6101 SUMMITVIEW AVE STE 200 YAKIMA
1   527 CEDAR WAY SUITE 105 OAKMONT
2   1700 N ROSE AVE SUITE 460 OXNARD
3   1275 YORK AVE NEW YOR
I'm trying to return next week's Saturday date from a date-type column rel_d. Normally, in Python, I'd subtract the number of days till next Saturday and add it to the
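For the next-Saturday question, a sketch using `next_day`, which returns the first Saturday strictly after the given date; add 7 days if "next week's Saturday" should skip the coming one.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical data with a date column rel_d
df = spark.createDataFrame([("2017-12-31",)], ["rel_d"]).withColumn(
    "rel_d", F.to_date("rel_d")
)

# next_day returns the first Saturday strictly after rel_d
df = df.withColumn("next_saturday", F.next_day(F.col("rel_d"), "Sat"))
df.show()
```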
I have a set of records with 10 columns. There is a column 'x' which contains an array of float values, and the length of the array can be very large (for eg, the len
I have a delta table, and am trying to append data to it and then checkpoint that table. By default I believe it checkpoints every 10 commits, but I would like
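For the checkpoint-interval question, a hedged sketch (hypothetical table name): the `delta.checkpointInterval` table property controls how many commits pass between checkpoints.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical table name; lower the checkpoint interval from the default (10)
# so that a checkpoint is written more often
spark.sql("""
    ALTER TABLE my_delta_table
    SET TBLPROPERTIES ('delta.checkpointInterval' = '5')
""")
```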
I have four questions. Suppose in Spark I have 3 worker nodes. Each worker node has 3 executors and each executor has 3 cores. Each executor has 5 GB of memory. (T
I want to divide the sum of two columns in pyspark. For example, I have a dataset like the one below:
    A B C
1   1 2 3
2   1 2 3
3   1 2 3
What I want is t
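Since the question is cut off, here is a sketch assuming the goal is to divide the sum of A and B by C on each row; adjust the expression if the intended ratio differs.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Data from the question: an index column plus A, B, C
df = spark.createDataFrame(
    [(1, 1, 2, 3), (2, 1, 2, 3), (3, 1, 2, 3)], ["idx", "A", "B", "C"]
)

# Assumed goal: (A + B) / C per row
df = df.withColumn("ratio", (F.col("A") + F.col("B")) / F.col("C"))
df.show()
```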
I want to use spark SQL or pyspark to reformat a date field from 'dd/mm/yyyy' to 'yyyy/mm/dd'. The field type is string: from pyspark.sql import SparkSession fr
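A sketch for the date reformatting: parse the string with the source pattern `dd/MM/yyyy`, then render it with `date_format` using the target pattern.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical string column holding dates as dd/mm/yyyy
df = spark.createDataFrame([("31/12/2017",)], ["date_str"])

# Parse with the source pattern, then re-render with the target pattern
df = df.withColumn(
    "date_fmt",
    F.date_format(F.to_date(F.col("date_str"), "dd/MM/yyyy"), "yyyy/MM/dd"),
)
df.show()
```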
I've been trying to submit applications to a Kubernetes cluster. I have followed the tutorial at https://spark.apache.org/docs/latest/running-on-kubernetes.html such as
Consider a pyspark data frame. I would like to summarize the entire data frame, per column, and append the result for every row. +-----+----------+-----------+
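For appending per-column summaries to every row, a sketch with a hypothetical two-column frame: compute the aggregates once and attach them with a cross join (a window over the whole frame would also work).

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical frame; the real one has more columns
df = spark.createDataFrame([(1, 10.0), (2, 20.0), (3, 30.0)], ["id", "val"])

# Compute the per-column summary once, then attach it to every row via a cross join
summary = df.agg(
    F.mean("val").alias("val_mean"),
    F.max("val").alias("val_max"),
)
result = df.crossJoin(summary)
result.show()
```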
I have my delta table, which can be read from Athena. When I try to get the data through a query from Spark, I get the following error: Caused by: org.apache.sp
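A hedged sketch of reading a Delta table from Spark directly by path; the `delta-core` coordinates must match the Spark version in use, and the storage path is a placeholder.

```python
from pyspark.sql import SparkSession

# Sketch: pick the delta-core version that matches your Spark version
spark = (
    SparkSession.builder
    .config("spark.jars.packages", "io.delta:delta-core_2.12:2.4.0")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config(
        "spark.sql.catalog.spark_catalog",
        "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    )
    .getOrCreate()
)

# Placeholder path to the Delta table's storage location
df = spark.read.format("delta").load("s3a://my-bucket/path/to/delta_table/")
df.show(5)
```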