In the following Python code, I can successfully connect to an MS Azure SQL DB using an ODBC connection, and can load data into an Azure SQL table using pandas' datafra
I have this df: df = spark.createDataFrame( [('row_a', 5.0, 0.0, 11.0), ('row_b', 3394.0, 0.0, 4543.0), ('row_c', 136111.0, 0.0, 219255.0), (
I came across this question recently in an interview and haven't been able to find a satisfying answer to it. The incremental merge could co
I have a dataframe like below. No comp_value 1 [[ -> 10]] 2 [[ -> 35]] The schema type of column - value is. comp_value: array (nullable = tru
I have a Spark dataframe that looks like this: +-----+----------+--------+-----+ |key1 |date |variable|value| +-----+----------+--------+-----+ | A49|2022
I have multiple files which ideally would all be the input to the map-reduce job. But each takes some time to download, so I was wondering if I can begin in-mem
Part of an unfinished Spark pipeline that I am working on saves data as a Hadoop Sequence file. However, as I tried to finish the pipeline, I found some bug
I have 2 dataframes: the first one has 53 columns and the second one has 132 columns. I want to compare the 2 dataframes and remove all the columns that are not
The failure happens quite rarely and on different tasks, but every occurrence is connected with the checkpoint() call. HDFS is used for checkpointing; perhaps the problem
I am currently using spark 3.1, and I am using spark_context._jsc.hadoopConfiguration().set("fs.s3a.access.key", config.access_id) spark_context._jsc.hadoopConf
I'm trying to filter a DataFrame by salary values and then save the results as CSV files using PySpark. spark = SparkSession.builder.appName('SparkByExamples.com')
I am trying to validate a date received in a file against a configured date format (using to_timestamp / to_date). schema = StructType([ \ StructField("date",String
import org.apache.spark.sql.SparkSession object RDDBroadcast extends App { val spark = SparkSession.builder() .appName("SparkByExamples.com") .maste
A data file is imported into a SQL Server table. One of the columns in the data file is of text data type, and its values are expected to be integers only. The correspo
I am using Spark 3.1.2 with Hadoop 3.2.0 to run Spark Structured Streaming (SSS) aggregation jobs on Spark on K8s. These jobs are reading files from S3 usi
Question: In an Apache Spark DataFrame, using Python, how can we get the data type and length of each column? I'm using the latest version of Python. Using pandas data
I tried to run working Apache Beam code with the Spark runner using the command spark-submit --class org.apache.beam.examples.WordCount --master spark://localhost:404
I am having trouble retrieving the old value of a column before a cast in Spark. Initially, all my inputs are strings and I want to cast the column num1 into a
I see Spark as superior to Flink. Below is my research. I see that most of the features of Spark are covered in Flink, except for the Fair sche