I have multiple files which ideally would all be the input to the map-reduce job. But each takes some time to download, so I was wondering if I can begin in-mem
So part of an unfinished spark pipeline that I am working on is saving data as a Hadoop Sequence file. However, as I try to finish the pipeline I found some bug
I have 2 dataframes, the first one has 53 columns and the second one has 132 column. I want to compare the 2 dataframes and remove all the columns that are not
Failure happens quite rarely, and on different tasks but everything is connected with the checkpoint() call. hdfs is used for checkpointing perhaps the problem
I am currently using spark 3.1, and I am using spark_context._jsc.hadoopConfiguration().set("fs.s3a.access.key", config.access_id) spark_context._jsc.hadoopConf
I'm trying to filter the data frame by values of salary then saving them as CSV files using pyspark. spark = SparkSession.builder.appName('SparkByExamples.com')
I am trying to validate date received in file against configured date format(using to_timestamp /to_date). schema = StructType([ \ StructField("date",String
import org.apache.spark.sql.SparkSession object RDDBroadcast extends App { val spark = SparkSession.builder() .appName("SparkByExamples.com") .maste
A data file is imported to a SQL Server table. One of the columns in data file is of text data type with values in this column to be integers only. The correspo
I am using Spark 3.1.2 with hadoop 3.2.0 to run Spark Structured Streaming (SSS) aggregation job, running on Spark K8S. Theses job are reading files from S3 usi
Question: In Apache Spark Dataframe, using Python, how can we get the data type and length of each column? I'm using latest version of python. Using pandas data
A data file is imported to a SQL Server table. One of the columns in data file is of text data type with values in this column to be integers only. The correspo
I tried to run a working Apache Beam code with Spark-Runner using command spark-submit --class org.apache.beam.examples.WordCount --master spark://localhost:404
I am having trouble retrieving the old value before a cast of a column in spark. initially, all my inputs are strings and I want to cast the column num1 into a
I see spark to be superior to Flink. Below is my research. I see that most of features of Spark are covered in Flink, except for The Fair sche
I'm facing an issue when trying to use pyspark=3.1.2. I have java 1.8 installed and added in my user path. But according to the docs it does not need any other
I have an RDD as follows: rdd .filter { case (_, record) => predicates.forall(_.accept(record)) } .toDS() .cache() It basically filters down an RDD
I use spark struture streaming 3.1.2. I need to use s3 for storing checkpoint metadata (I know, it's not optimal storage for checkpoint metadata). Compaction in
problem screenshot :14: error: not found: value spark import spark.implicits._ ^ :14: error: not found: value spark import spark.sql ^ here is my enviroment con
I'm querying data from Cassandra in Spark using SCC 2.5 and Spark 2.4.7 (pyspark). The table I'm reading from has a composite partition key (partition_key_1, pa