I have a table that looks like this:

```
common_id  table1_address  table2_address  table3_address    table4_address
123        null            null            stack building12  null
157        123road         stree
```
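A minimal sketch, assuming the goal is to pick the first non-null address across the four columns (the question is cut off, so this is a guess at the intent); `coalesce` does exactly that:

```python
from pyspark.sql import functions as F

# Assuming df holds the table above: take the first non-null address per row.
df = df.withColumn(
    "address",
    F.coalesce("table1_address", "table2_address",
               "table3_address", "table4_address"),
)
```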
I'm trying to create a Spark dataframe from a dictionary which has data in the format {'33_45677': 0, '45_3233': 25, '56_4599': 43524}, etc. dict_pairs={'33
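A minimal sketch of one way to do this, assuming a plain key/value layout is acceptable (column names here are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

dict_pairs = {'33_45677': 0, '45_3233': 25, '56_4599': 43524}

# createDataFrame accepts a list of (key, value) tuples plus column names.
df = spark.createDataFrame(list(dict_pairs.items()), ["key", "value"])
df.show()
```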
I am trying to turn an RDD into a DataFrame. The operation seems to be successful, but when I try to count the number of elements in the dataframe I get an error.
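For context: because Spark is lazy, the conversion itself rarely fails; the error only surfaces when `count()` forces evaluation, often because some rows don't match the inferred schema. A hedged sketch of the usual conversion path, with a stand-in RDD:

```python
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize([("a", 1), ("b", 2)])  # stand-in for the real RDD

# Map to Rows with explicit field names so every record has the same shape;
# ragged or mistyped records are a common cause of count() failing later.
df = spark.createDataFrame(rdd.map(lambda t: Row(key=t[0], value=t[1])))
print(df.count())
```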
I'm building the following global function in PySpark to go through each column in my CSV, whose values come in different formats, and convert them all to one uniform format
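A common pattern for this (shown with hypothetical date formats; the real list would come from the CSVs) is to try each known format and keep the first successful parse:

```python
from pyspark.sql import functions as F

def normalize_date(col_name):
    # to_date returns null when a parse fails, so coalesce keeps the
    # result of the first format that succeeds.
    return F.coalesce(
        F.to_date(F.col(col_name), "yyyy-MM-dd"),
        F.to_date(F.col(col_name), "MM/dd/yyyy"),
        F.to_date(F.col(col_name), "dd-MMM-yyyy"),
    )

# df = df.withColumn("order_date", normalize_date("order_date"))
```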
I have to first partition by a "customer group", but I also want to make sure that I have a single CSV file per "customer_group". This is because it is timeseri
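One way to guarantee a single file per group is to repartition by the same column before `partitionBy`, so all rows of a group land in one task (a sketch, assuming the column is literally named `customer_group` and `df` already exists):

```python
# Repartitioning by the grouping column places each group's rows in a
# single task, so each partitionBy directory gets exactly one part file.
(df.repartition("customer_group")
   .write
   .partitionBy("customer_group")
   .mode("overwrite")
   .csv("s3a://my-bucket/output/"))  # hypothetical output path
```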
I am trying to pass a string in my PySpark code and it works fine, but when I pass the following string to escape the reserved keyword `date` or any value passed in
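In Spark SQL, reserved words such as `date` are escaped with backticks, so the interpolated string would look roughly like this (table name and filter are placeholders, and a live `spark` session is assumed):

```python
col = "date"  # reserved keyword
query = f"SELECT `{col}` FROM transactions WHERE `{col}` >= '2020-01-01'"
df = spark.sql(query)
```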
I am working on writing a framework that basically does a data sanity check. I have a set of inputs like { "check_1": [ sql_query_1, sql_query_2 ], "check_2":
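One simple driver for such a structure is to loop over the dict and run each query, collecting a pass/fail per check (a sketch; the query strings and the "zero flagged rows means pass" criterion are placeholders):

```python
checks = {
    "check_1": ["SELECT COUNT(*) AS n FROM t1 WHERE id IS NULL",
                "SELECT COUNT(*) AS n FROM t1 WHERE amount < 0"],
    "check_2": ["SELECT COUNT(*) AS n FROM t2 WHERE dup_flag = 1"],
}

results = {}
for name, queries in checks.items():
    # A check passes only if every one of its queries flags zero rows.
    results[name] = all(spark.sql(q).first()["n"] == 0 for q in queries)

print(results)  # e.g. {'check_1': True, 'check_2': False}
```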
I am trying to install PySpark on my Windows 10 machine for use in JupyterLab. I have already installed Java and am running Python 3.7.3: openjdk version "1.8.0_242" O
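A common Windows setup is to point the environment variables at the installs and let `findspark` wire things up inside the notebook (all paths below are hypothetical):

```python
import os

os.environ["JAVA_HOME"] = r"C:\Program Files\Java\jdk1.8.0_242"  # hypothetical path
os.environ["HADOOP_HOME"] = r"C:\hadoop"  # needs winutils.exe in its bin\ folder

import findspark
findspark.init()  # locates the pip-installed pyspark distribution

from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()
```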
I'm working on an ETL job in a SageMaker notebook that uses Spark 2.4.0. After joining a couple of tables I keep getting the following errors: Update-- I was
Well, I'm learning PySpark. I installed ipykernel, jupyterlab, notebook and pyspark via pip, and Java 8 via an .exe installer. The problem is when I need to create the sessi
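For a pip-installed PySpark with a local Java 8, session creation is typically just the snippet below; failures at this step usually trace back to JAVA_HOME not pointing at the JDK (the path shown is hypothetical):

```python
import os
os.environ["JAVA_HOME"] = r"C:\Program Files\Java\jdk1.8.0_311"  # hypothetical Java 8 path

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[*]")
         .appName("learning")
         .getOrCreate())
print(spark.version)
```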
Can someone help me with the below? I have an input dataframe:

```
ID  process_type   STP_stagewise
1   loan_creation  Manual
1   loan creation  NSTP
1   reimbursement  STP
2
```
Sorry in advance for this dumb question. I am just beginning with AWS and PySpark. I was reviewing the pyspark library and I see PySpark needs a tempdir in S3 to be a
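For context, that tempdir is the S3 staging area the spark-redshift connector uses for its UNLOAD/COPY round-trips. A hedged sketch of how it is wired in (connector format string, URL, table, and bucket are all placeholders):

```python
# The connector stages data in tempdir via Redshift UNLOAD/COPY.
df = (spark.read
      .format("io.github.spark_redshift_community.spark.redshift")
      .option("url", "jdbc:redshift://host:5439/db?user=u&password=p")
      .option("dbtable", "my_table")
      .option("tempdir", "s3a://my-bucket/tmp/")
      .load())
```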
Is there any programmatic way to find out the cluster version (CDH6 or CDP7) from a CDSW session? Could any environment variable give a fool-proof way to determi
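I'm not aware of a single documented variable for this; one hedged workaround is to parse the Hadoop build string, which embeds the distribution version:

```python
import subprocess

# `hadoop version` prints a build string such as "3.0.0-cdh6.3.2" on CDH6
# or "3.1.1.7.1.7.0-551" on CDP; treat this as a heuristic, not a guarantee.
out = subprocess.run(["hadoop", "version"], capture_output=True, text=True).stdout
is_cdh6 = "cdh6" in out.lower()
print("CDH6" if is_cdh6 else "likely CDP7")
```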
```
 |-- x: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- y: long (nullable = true)
 |    |    |-- z: array (nullable = true)
```
```
 |-- x: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- y: struct (nullable = true)
 |    |    |-- z: struct (nullable =
```
I'm trying to connect to a Synapse serverless pool via Databricks. I need to create Synapse views and external tables directly from Databricks as part of an existin
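Spark's JDBC data source can only read, so DDL like CREATE VIEW has to go through a driver-side connection. A hedged sketch using pyodbc against the serverless endpoint (server, database, credentials, and the OPENROWSET path are all placeholders):

```python
import pyodbc

# Hypothetical serverless endpoint; DDL must run over a direct connection
# because Spark's JDBC source does not execute arbitrary statements.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myworkspace-ondemand.sql.azuresynapse.net;"
    "DATABASE=mydb;UID=user;PWD=secret", autocommit=True)

conn.execute(
    "CREATE VIEW dbo.v_sales AS "
    "SELECT * FROM OPENROWSET("
    "BULK 'https://acct.dfs.core.windows.net/container/path/', "
    "FORMAT='PARQUET') AS rows")
```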
If a PySpark dataframe is reading some data from a table and writing it to Azure Delta Lake, can we add comments to this newly written file? E.g. Df = sql("se
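If the data lands as a Delta table, one hedged option is a table-level comment via TBLPROPERTIES after the write; note this annotates the table metadata rather than the individual data files, and the path below is a placeholder:

```python
df.write.format("delta").mode("overwrite").save("/mnt/lake/my_table")  # hypothetical path

# Attach a comment to the Delta table's metadata (not to the data files).
spark.sql("""
  ALTER TABLE delta.`/mnt/lake/my_table`
  SET TBLPROPERTIES ('comment' = 'written by nightly ETL job')
""")
```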
I have the following dataframe:

```
+---------------+--------+
|book_id        |Chapters|
+---------------+--------+
|865731         |[]      |
+----
```
We are building a reusable data framework using PySpark. As part of this, we had built one big utilities package that hosted all the methods. But now, we are pl
I have a 130 GB csv.gz file in S3 that was loaded using a parallel unload from Redshift to S3. Since it contains multiple files, I wanted to reduce the number of f
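A hedged sketch of the usual consolidation: read the unload directory and repartition down before writing. Gzipped CSV is not splittable, so each input file maps to one task on read; the paths, delimiter, and target file count below are placeholders:

```python
df = spark.read.csv("s3a://my-bucket/unload/", sep="|", header=False)  # hypothetical path/delimiter

# repartition (not coalesce) forces a shuffle that rebalances the skewed
# per-file partitions before writing a fixed number of output files.
df.repartition(64).write.mode("overwrite").csv("s3a://my-bucket/consolidated/")
```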