I have 2 dataframes: the first one has 53 columns and the second one has 132 columns. I want to compare the 2 dataframes and remove all the columns that are not
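A minimal sketch of one way to approach this, assuming the goal is to keep only the columns common to both frames (df1 and df2 are stand-ins for the 53- and 132-column dataframes):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("common-columns").getOrCreate()
df1 = spark.createDataFrame([(1, "a", 1.0)], ["id", "name", "score"])  # stand-in for the 53-column frame
df2 = spark.createDataFrame([(2, "b")], ["id", "name"])                # stand-in for the 132-column frame

common = [c for c in df1.columns if c in df2.columns]  # columns present in both frames
df1_common = df1.select(*common)
df2_common = df2.select(*common)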
Failure happens quite rarely, and on different tasks, but everything is connected with the checkpoint() call. HDFS is used for checkpointing; perhaps the problem
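For reference, a minimal sketch of the checkpointing setup being described (the HDFS path is a placeholder):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("checkpoint-demo").getOrCreate()
sc = spark.sparkContext
sc.setCheckpointDir("hdfs:///tmp/checkpoints")  # placeholder HDFS path

rdd = sc.parallelize(range(100))
rdd.checkpoint()  # marks the RDD for checkpointing
rdd.count()       # materializes the RDD, writing the checkpoint files to HDFS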
I am currently using Spark 3.1, and I am using

spark_context._jsc.hadoopConfiguration().set("fs.s3a.access.key", config.access_id)
spark_context._jsc.hadoopConf
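A sketch of the full configuration pattern the snippet starts; the credential strings are placeholders for however the values are actually loaded:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s3a-demo").getOrCreate()
hconf = spark.sparkContext._jsc.hadoopConfiguration()
hconf.set("fs.s3a.access.key", "YOUR_ACCESS_KEY")  # placeholder credentials
hconf.set("fs.s3a.secret.key", "YOUR_SECRET_KEY")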
I'm trying to filter the data frame by values of salary and then save them as CSV files using PySpark. spark = SparkSession.builder.appName('SparkByExamples.com')
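A minimal sketch of the filter-then-save pattern, with a toy salary column and a placeholder output path:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SparkByExamples.com").getOrCreate()
df = spark.createDataFrame(
    [("James", 3000), ("Anna", 4100), ("Robert", 6200)],
    ["name", "salary"],
)

high = df.filter(df.salary >= 4000)  # filter rows by salary value
high.write.mode("overwrite").csv("/tmp/high_salary", header=True)  # placeholder path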
I am trying to validate the date received in a file against a configured date format (using to_timestamp/to_date). schema = StructType([ StructField("date", String
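One common validation pattern, sketched under the assumption that the configured format is e.g. "yyyy-MM-dd": to_date yields null for values it cannot parse (behavior in Spark 3 also depends on spark.sql.legacy.timeParserPolicy), so null rows flag invalid dates.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("date-validation").getOrCreate()
df = spark.createDataFrame([("2022-03-01",), ("2022-13-45",)], ["date"])

parsed = df.withColumn("parsed", F.to_date("date", "yyyy-MM-dd"))  # assumed format
invalid = parsed.filter(F.col("parsed").isNull())  # rows that failed validation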
import org.apache.spark.sql.SparkSession

object RDDBroadcast extends App {
  // builder chain as in the original snippet; the master URL is assumed
  val spark = SparkSession.builder()
    .appName("SparkByExamples.com")
    .master("local[*]")
    .getOrCreate()
}
A data file is imported into a SQL Server table. One of the columns in the data file is of text data type, with the values in this column expected to be integers only. The correspo
I am using Spark 3.1.2 with Hadoop 3.2.0 to run a Spark Structured Streaming (SSS) aggregation job, running on Spark on K8S. These jobs are reading files from S3 usi
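A minimal sketch of the kind of job described: a file-source streaming read from S3 feeding an aggregation (the bucket, paths, and schema are placeholders):

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.appName("sss-s3-agg").getOrCreate()
schema = StructType([StructField("key", StringType()), StructField("value", LongType())])

stream = spark.readStream.schema(schema).json("s3a://bucket/input/")  # placeholder bucket
agg = stream.groupBy("key").agg(F.sum("value").alias("total"))

query = (agg.writeStream
    .outputMode("complete")
    .format("console")
    .option("checkpointLocation", "s3a://bucket/checkpoints/agg")  # placeholder path
    .start())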
Question: In an Apache Spark DataFrame, using Python, how can we get the data type and length of each column? I'm using the latest version of Python. Using a pandas data
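A sketch of both pieces: df.dtypes gives the (name, type) pairs, and an aggregation over length() gives the longest value per column (cast to string so it also works for non-string columns):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("dtypes-demo").getOrCreate()
df = spark.createDataFrame([("abc", 12345)], ["s", "n"])

print(df.dtypes)  # [('s', 'string'), ('n', 'bigint')]

df.select([
    F.max(F.length(F.col(c).cast("string"))).alias(c) for c in df.columns
]).show()  # max length per column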
I tried to run working Apache Beam code with the Spark runner using the command spark-submit --class org.apache.beam.examples.WordCount --master spark://localhost:404
I am having trouble retrieving the old value before a cast of a column in Spark. Initially, all my inputs are strings, and I want to cast the column num1 into a
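One way to keep the pre-cast value, sketched with a hypothetical num1 string column: copy it to a new column before casting in place.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("cast-demo").getOrCreate()
df = spark.createDataFrame([("42",), ("not-a-number",)], ["num1"])

df2 = (df
    .withColumn("num1_raw", F.col("num1"))              # preserve the original string
    .withColumn("num1", F.col("num1").cast("double")))  # unparseable values become null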
I see Spark as superior to Flink. Below is my research. I see that most of the features of Spark are covered in Flink, except for the Fair Scheduler
I'm facing an issue when trying to use pyspark==3.1.2. I have Java 1.8 installed and added to my user PATH. But according to the docs it does not need any other
I have an RDD as follows:

rdd
  .filter { case (_, record) => predicates.forall(_.accept(record)) }
  .toDS()
  .cache()

It basically filters down an RDD
I use Spark Structured Streaming 3.1.2. I need to use S3 for storing checkpoint metadata (I know it's not the optimal storage for checkpoint metadata). Compaction in
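For context, a sketch of how the checkpoint metadata location is wired up in this kind of setup (the source is a stand-in and the bucket paths are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sss-s3-checkpoint").getOrCreate()
stream = spark.readStream.format("rate").load()  # stand-in source

query = (stream.writeStream
    .format("parquet")
    .option("path", "s3a://bucket/output")                     # placeholder output path
    .option("checkpointLocation", "s3a://bucket/checkpoints")  # placeholder checkpoint path
    .start())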
problem screenshot:

:14: error: not found: value spark
       import spark.implicits._
              ^
:14: error: not found: value spark
       import spark.sql
              ^

here is my environment con
I'm querying data from Cassandra in Spark using SCC 2.5 and Spark 2.4.7 (pyspark). The table I'm reading from has a composite partition key (partition_key_1, pa
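For reference, a sketch of the read path with the Spark Cassandra Connector (the contact point, keyspace, and table names are placeholders):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName("scc-demo")
    .config("spark.cassandra.connection.host", "127.0.0.1")  # placeholder contact point
    .getOrCreate())

df = (spark.read
    .format("org.apache.spark.sql.cassandra")
    .options(keyspace="my_ks", table="my_table")  # placeholder names
    .load())

# Filtering on the full composite partition key lets SCC push the predicate down to Cassandra
hit = df.filter((df.partition_key_1 == "a") & (df.partition_key_2 == "b"))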
I have a dataframe that looks like the one below:

id      pub_date  version  unique_id  c_id  p_id  type  source
lni001  20220301  1
I need help in converting the below function into an SQL query:

start_time: 1649289600
end_time: 1649375999

test_data = df.withColumn("from_timestamp", to_t
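A hedged sketch of the SQL translation, assuming the function wraps the epoch bounds with from_unixtime/to_timestamp as the withColumn line suggests (the table name t and the column aliases are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-translation").getOrCreate()
df = spark.createDataFrame([(1,)], ["id"])  # stand-in data
df.createOrReplaceTempView("t")

result = spark.sql("""
    SELECT *,
           to_timestamp(from_unixtime(1649289600)) AS from_timestamp,
           to_timestamp(from_unixtime(1649375999)) AS to_timestamp
    FROM t
""")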