Category "pyspark"

Using monotonically_increasing_id() for assigning row number to pyspark dataframe

I am using monotonically_increasing_id() to assign a row number to a PySpark dataframe using the syntax below: df1 = df1.withColumn("idx", monotonically_increasing_id()) …
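
A minimal sketch of the usual caveat and alternative, assuming an ordering column ("some_col" here is hypothetical): monotonically_increasing_id() yields unique, increasing IDs but not consecutive ones, so row_number() over a window is the common choice when a true 1..N row number is needed.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    spark = SparkSession.builder.getOrCreate()
    df1 = spark.createDataFrame([("a",), ("c",), ("b",)], ["some_col"])  # stand-in data

    # monotonically_increasing_id(): unique and increasing, but values jump
    # between partitions, so they are not consecutive row numbers.
    df1 = df1.withColumn("idx", F.monotonically_increasing_id())

    # row_number() over an ordered window gives consecutive 1..N numbering.
    w = Window.orderBy("some_col")
    df1 = df1.withColumn("row_num", F.row_number().over(w))
    df1.show()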

pyspark wordcount sort by value

I'm learning PySpark and trying the code below. Can someone help me understand what's wrong? >>> pairs=data.flatMap(lambda x:x.split(' ')).map(lambda x …
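
A minimal sketch of the standard RDD word count sorted by count, assuming "data" is an RDD of text lines (the sample input is a stand-in):

    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()
    data = sc.parallelize(["the cat sat", "the cat"])

    counts = (data.flatMap(lambda line: line.split(' '))      # one record per word
                  .map(lambda word: (word, 1))                # pair each word with 1
                  .reduceByKey(lambda a, b: a + b)            # sum counts per word
                  .sortBy(lambda kv: kv[1], ascending=False)) # sort by value, descending
    print(counts.collect())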

How can I access python variable in Spark SQL?

I have a Python variable created under %python in my Jupyter notebook file in Azure Databricks. How can I access the same variable to make comparisons under %sql?
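
A %sql cell cannot read Python locals directly, so one common workaround is to interpolate the value into a spark.sql() call from the Python side. A minimal sketch, assuming an active SparkSession named spark (as in Databricks notebooks) and hypothetical names my_table and threshold:

    threshold = 100  # the Python variable to use in SQL

    df = spark.range(200).withColumnRenamed("id", "value")
    df.createOrReplaceTempView("my_table")

    # Interpolate the Python value into the SQL text and run it from Python.
    spark.sql(f"SELECT * FROM my_table WHERE value > {threshold}").show()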

Pyspark-pandas not working on Spark 3.1.2

I am using Spark 3.1.2 and attempting to use pyspark-pandas. However, when attempting from pyspark import pandas as ps, I am getting the following error: ImportError …
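
For context, pyspark.pandas shipped with Spark 3.2; on Spark 3.1.x the same API lived in the separate Koalas package. A minimal sketch under that assumption:

    # pip install koalas   (the pre-3.2 predecessor of pyspark.pandas)
    import databricks.koalas as ks

    kdf = ks.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
    print(kdf.head())

    # On Spark >= 3.2 the equivalent import would be:
    # import pyspark.pandas as ps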

How can you parse a string that is json from an existing temp table using PySpark?

I have an existing Spark dataframe that has columns as such:

    --------------------
    pid | response
    --------------------
    12  | {"status":"200"}

response is a st…
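
A minimal sketch using from_json() with an assumed schema for the JSON held in the response column; the temp table name is hypothetical:

    from pyspark.sql import functions as F
    from pyspark.sql.types import StructType, StructField, StringType

    # Schema of the JSON string in the "response" column (assumed).
    schema = StructType([StructField("status", StringType())])

    df = spark.sql("SELECT * FROM my_temp_table")  # hypothetical temp table
    df = df.withColumn("parsed", F.from_json(F.col("response"), schema))
    df.select("pid", F.col("parsed.status").alias("status")).show()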

Where to set the S3 configuration in Spark locally?

I've set up a Docker container that starts a Jupyter notebook using Spark. I've integrated the necessary jars into Spark's directory to be able to access …
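
A minimal sketch of setting the S3A configuration directly on a local SparkSession via the spark.hadoop.* passthrough; the credential values and bucket path are placeholders:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("local-s3")
             .config("spark.hadoop.fs.s3a.access.key", "YOUR_ACCESS_KEY")
             .config("spark.hadoop.fs.s3a.secret.key", "YOUR_SECRET_KEY")
             .config("spark.hadoop.fs.s3a.endpoint", "s3.amazonaws.com")
             .getOrCreate())

    df = spark.read.csv("s3a://my-bucket/path/data.csv", header=True)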

How to split a list to multiple columns in Pyspark?

I have:

    key | value
    a   | [1,2,3]
    b   | [2,3,4]

I want:

    key | value1 | value2 | value3
    a   | 1      | 2      | 3
    b   | 2      | 3      | 4

It seems that in Scala I can wr…
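
A minimal sketch in PySpark: getItem(i) indexes into the array column, so the split falls out of a small list comprehension (the fixed length 3 is assumed from the example):

    from pyspark.sql import functions as F

    df = spark.createDataFrame([("a", [1, 2, 3]), ("b", [2, 3, 4])], ["key", "value"])

    # getItem(i) pulls element i out of the array column.
    df.select(
        "key",
        *[F.col("value").getItem(i).alias("value" + str(i + 1)) for i in range(3)]
    ).show()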

pickle.PicklingError: args[0] from __newobj__ args has the wrong class with hadoop python

I am trying to delete stop words via Spark; the code is as follows: from nltk.corpus import stopwords from pyspark.context import SparkContext from …
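
Pickling errors like this often come from shipping an unpicklable object to the executors inside a closure. A minimal sketch of the usual workaround, assuming the goal is filtering stop words from lines of text: materialise the stop words into a plain Python set on the driver and broadcast it.

    from nltk.corpus import stopwords
    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()

    # A plain set is safe to pickle; referencing nltk corpus objects in the
    # lambda is what tends to trigger the error.
    stops = sc.broadcast(set(stopwords.words("english")))

    lines = sc.parallelize(["this is a test", "keep the useful words"])
    filtered = lines.map(lambda line: [w for w in line.split() if w not in stops.value])
    print(filtered.collect())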

Airflow/Luigi for AWS EMR automatic cluster creation and pyspark deployment

I am new to Airflow automation. I don't know if it is possible to do this with Apache Airflow (or Luigi etc.), or whether I should just make a long bash file to do this. I …
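
It is possible in Airflow via the Amazon provider's EMR operators. A minimal sketch, assuming Airflow 2.x with apache-airflow-providers-amazon installed; the DAG name, S3 script path, and cluster settings are all placeholders, not a working config:

    from datetime import datetime
    from airflow import DAG
    from airflow.providers.amazon.aws.operators.emr import (
        EmrAddStepsOperator,
        EmrCreateJobFlowOperator,
    )

    SPARK_STEP = [{
        "Name": "run_pyspark_job",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://my-bucket/jobs/job.py"],  # hypothetical script
        },
    }]

    with DAG("emr_pyspark", start_date=datetime(2024, 1, 1), schedule_interval=None) as dag:
        create_cluster = EmrCreateJobFlowOperator(
            task_id="create_cluster",
            job_flow_overrides={},  # instance types, release label, etc. go here
        )
        add_step = EmrAddStepsOperator(
            task_id="add_step",
            # The create operator pushes the new job flow id to XCom.
            job_flow_id="{{ task_instance.xcom_pull(task_ids='create_cluster') }}",
            steps=SPARK_STEP,
        )
        create_cluster >> add_step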

AttributeError: Can't get attribute '_fill_function' on <module 'pyspark.cloudpickle' from 'pyspark/cloudpickle/__init__.py'>

While executing PySpark code from a script, I am getting the following error on df.show(): from pyspark.sql.types import StructType, StructField, StringType, IntegerType …
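
This error usually points to mismatched PySpark versions between the environment submitting the script and the cluster executing it (cloudpickle's internals, including _fill_function, changed between releases). A minimal first diagnostic sketch:

    import pyspark
    from pyspark.sql import SparkSession

    print(pyspark.__version__)  # version of the pyspark package the script imports

    spark = SparkSession.builder.getOrCreate()
    print(spark.version)        # version of the Spark runtime actually executing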

PySpark error: AnalysisException: 'Cannot resolve column name

I am trying to transform an entire df to a single vector column using df_vec = vectorAssembler.transform(df.drop('col200')), and I am being thrown this error: F…
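
"Cannot resolve column name" from VectorAssembler usually means inputCols includes a name that is not in the dataframe (for example, one that was just dropped). A minimal sketch, assuming df exists and its remaining columns are numeric: rebuilding inputCols from the post-drop columns avoids the mismatch.

    from pyspark.ml.feature import VectorAssembler

    df2 = df.drop('col200')

    # Build inputCols from the columns that actually exist after the drop,
    # so the assembler cannot reference a missing (or mistyped) name.
    assembler = VectorAssembler(inputCols=list(df2.columns), outputCol="features")
    df_vec = assembler.transform(df2)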

pyspark SQL cannot resolve 'explode()' due to data type mismatch

Running a PySpark script, I am getting the following error depending on which XML I query: cannot resolve 'explode(...)' due to data type mismatch. The PySpark code: fr…
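
explode() only accepts array or map columns. With XML sources, an element that appears once per record is typically inferred as a struct while a repeated element becomes an array, which is why the error depends on the file. A minimal sketch of a guard, with "item" as a hypothetical column name:

    from pyspark.sql import functions as F
    from pyspark.sql.types import ArrayType

    # Wrap the single-struct case in array() so explode() always sees an array.
    col_type = df.schema["item"].dataType
    as_array = F.col("item") if isinstance(col_type, ArrayType) else F.array("item")
    df = df.withColumn("item", F.explode(as_array))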

Vertica data into pySpark throws "Failed to find data source"

I have Spark 3.2 and Vertica 9.2.

    spark = SparkSession.builder.appName("Ukraine").master("local[*]")\
        .config("spark.jars", '/home/shivamanand/spark-3.2.1-bin-hado…
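
"Failed to find data source" usually means the connector jar is not on the classpath. A minimal sketch, assuming both the Vertica Spark connector and the Vertica JDBC driver are needed; the jar paths, connection options, and the VerticaSource format name (used by recent connector versions) should all be checked against the installed connector:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("Ukraine")
             .master("local[*]")
             .config("spark.jars",
                     "/path/to/vertica-spark-connector.jar,/path/to/vertica-jdbc.jar")
             .getOrCreate())

    df = (spark.read.format("com.vertica.spark.datasource.VerticaSource")
          .option("host", "vertica-host")
          .option("user", "dbadmin")
          .option("password", "secret")
          .option("db", "mydb")
          .option("table", "mytable")
          .load())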

Can you get filename from input_file_name() in aws s3 when using gunzipped files

I've been searching for an answer to this for quite a while now but can't seem to figure it out. I've read Why is input_file_name() empty for S3 catalog sources in …
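
A minimal sketch of plain input_file_name() usage against gzipped files read directly from S3; whether the column is populated depends on the source (catalog-backed sources can return empty strings, as the linked question discusses). The bucket path is a placeholder:

    from pyspark.sql import functions as F

    df = spark.read.json("s3a://my-bucket/logs/*.json.gz")  # hypothetical path
    df = df.withColumn("source_file", F.input_file_name())
    df.select("source_file").distinct().show(truncate=False)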

Pyspark dataframe returns different results each time I run

Every time I run a simple groupBy, PySpark returns different values, even though I haven't made any modification to the dataframe. Here is the code I am using: df…
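
One common cause: if any step in the dataframe's lineage is non-deterministic (sampling, a UDF with randomness, a source that changes between reads), every action recomputes it and can differ. A minimal sketch of the usual mitigation, with a hypothetical grouping column:

    # Caching pins one materialisation of the data, so repeated actions
    # see the same rows instead of recomputing the lineage.
    df = df.cache()
    df.count()  # force the cache to fill
    df.groupBy("account").count().show()  # hypothetical column name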

how to sequentially iterate rows in Pyspark Dataframe

I have a Spark DataFrame like this:

    +-------+------+-----+---------------+
    |Account|nature|value|           time|
    +-------+------+-----+---------------+
    |…
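
Row-by-row iteration is usually replaced with a window ordered by the time column; lag() exposes the previous row's value within each Account without any explicit loop. A minimal sketch, assuming df has the columns shown above:

    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    w = Window.partitionBy("Account").orderBy("time")
    df = df.withColumn("prev_value", F.lag("value").over(w))
    df.show()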

Concatenating string by rows in pyspark

I have a PySpark dataframe as

    DOCTOR | PATIENT
    JOHN   | SAM
    JOHN   | PETER
    JOHN   | ROBIN
    BEN    | ROSE
    BEN    | GRAY

and need to concatenate the patient n…
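
A minimal sketch using collect_list() to gather the names per doctor and concat_ws() to join them into one string (the separator is an assumption):

    from pyspark.sql import functions as F

    df = spark.createDataFrame(
        [("JOHN", "SAM"), ("JOHN", "PETER"), ("JOHN", "ROBIN"),
         ("BEN", "ROSE"), ("BEN", "GRAY")],
        ["DOCTOR", "PATIENT"],
    )

    df.groupBy("DOCTOR").agg(
        F.concat_ws(",", F.collect_list("PATIENT")).alias("PATIENTS")
    ).show(truncate=False)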

define environment variable in databricks init script

I want to define an environment variable in a Databricks init script and then read it in a PySpark notebook. I wrote this: dbutils.fs.put("/databricks/scripts/i…
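
A minimal sketch of one approach, assuming the script name and variable are hypothetical: variables exported by an init script are only visible to later notebook processes if written somewhere persistent, and appending to /etc/environment is a common way to do that.

    # Write the init script to DBFS; the cluster must be configured to run it.
    dbutils.fs.put(
        "/databricks/scripts/set-env.sh",
        "#!/bin/bash\necho MY_VAR=my_value >> /etc/environment\n",
        overwrite=True,
    )

    # Then, in a notebook on a cluster using this init script:
    import os
    print(os.environ.get("MY_VAR"))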

I am trying to start a Spark Session in Jupyter Notebook and get the following error

This is my first time using PySpark. I am using a Mac and trying to start up a session within Jupyter Notebook using the code below: import pyspark from py…
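
A minimal sketch of the usual session bootstrap in a notebook; if this raises, the error text narrows the cause (on a Mac it is often a missing or incompatible Java installation, which is an assumption here, not a diagnosis):

    from pyspark.sql import SparkSession

    # getOrCreate() connects to an existing session or starts a local one;
    # printing the version confirms the JVM came up.
    spark = (SparkSession.builder
             .master("local[*]")
             .appName("notebook")
             .getOrCreate())
    print(spark.version)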

Pyspark UDF monitoring with prometheus

I am trying to monitor some logic in a UDF using counters, i.e.:

    counter = Counter(...).labels("value")

    @udf
    def do_smthng(col):
        if col:
            counter.label(…
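
UDFs run in executor processes, so a driver-side Prometheus Counter never sees their increments. A minimal sketch of one alternative, assuming an active SparkSession named spark: a Spark accumulator surfaces executor-side counts back to the driver, where they can then be exported to Prometheus.

    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType

    sc = spark.sparkContext
    hits = sc.accumulator(0)

    @F.udf(returnType=StringType())
    def do_smthng(col):
        if col:
            hits.add(1)      # counted on the executor, merged on the driver
        return col

    df = spark.createDataFrame([("x",), (None,)], ["c"])
    df.select(do_smthng("c")).collect()
    print(hits.value)        # expose this driver-side total to Prometheus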