I have multiple files, but lets consider 2 files which have filename and start dates columns. Start_Date FileName 2022-01-01 product 1 2022-02-02 product 2 pl
We want to create a Spark-based streaming data pipeline that consumes from a source (e.g. Kinesis), apply some basic transformations, and write the data to a fi
I have a program that contain few lines of functions that uses pyspark (the rest is normal Python). The portion of my code that uses pyspark: X.to_csv(r'first.t
I am learning Big data using Apache spark and I want to create a custom transformer for Spark ml so that I can execute some aggregate functions or can perform o
I have a dataframe that contains an "id" column and a "publication" column. The "id" column contains duplicates, and represents a researcher. The "publication"
Do Spark have cache with TTL option. I need to do lookup on reference data to perform some transformation in my Spark streaming application. Also lookup dataset
I want to convert each row of my dataframe into to a Python class object called Fruit. I have a dataframe df with the following columns: Identifier, Name, Quant
I want to convert each row of my dataframe into to a Python class object called Fruit. I have a dataframe df with the following columns: Identifier, Name, Quant
I have trained a logistic regression algorithm to match job titles and descriptions to a set of 4 digit numeric codes. This it does very well. It will form part
I'm reading a .json file that contains the structure below, and I need to generate a csv with this data in column form, I know that I can't directly write an ar
I need to extract objects from an array, where there's more than one object in that array I need to repeat for every id and if the field is null then I want to
HIVE has a metastore and HIVESERVER2 listens for SQL requests; with the help of metastore, the query is executed and the result is passed back. The Thrift frame
I am from Linkedin, we are having compatibility issue with spark-cdm-connector, to give a little context I have a cdm data in ADLS which I’m trying to rea
I am attempting to use a PySpark kernel from inside an EMR Notebook that is hosted on an AWS managed service (EMR) and I am unable to access Artifactory to inst
I am trying to write data on an S3 bucket from my local computer: spark = SparkSession.builder \ .appName('application') \ .config("spark.hadoop.fs.s3a.
I have a dataframe having a column of type MapType<StringType, StringType>. |-- identity: map (nullable = true) | |-- key: string | |-- value: st
I have a spark standalone configured with 3 nodes. I want to read csv data stored in s3-compatible storage (dell ecs) in this pySpark. Here's the method and con
I am using spark version 3.1.2, and I need to load data from a csv with encoding utf-16le. df = spark.read.format("csv") .option("delimiter", ",") .opti
Having a date, I create a column with ISO 8601 week date format: from pyspark.sql import functions as F df = spark.createDataFrame([('2019-03-
I'm not experienced in Java or Hadoop ecosystem. I configured my Spark cluster to connect to Amazon Keyspaces by using spark-cassandra-connector from Datastax.