Category "pyspark"

PySpark SQL cannot resolve 'explode()' due to data type mismatch

Running a PySpark script, I get the following error depending on which XML I query: cannot resolve 'explode(...)' due to data type mismatch. The PySpark code: fr
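
A minimal sketch of one common cause, assuming the error comes from calling explode() on a column that the XML reader inferred as a struct rather than an array (the column name "record" is hypothetical):

```python
from pyspark.sql import functions as F

# explode() only accepts array or map columns. If the XML reader inferred a
# single child element as a struct, wrap it in an array before exploding.
col_type = dict(df.dtypes)["record"]  # e.g. "struct<...>" or "array<struct<...>>"

if col_type.startswith("array"):
    exploded = df.withColumn("record", F.explode("record"))
else:
    exploded = df.withColumn("record", F.explode(F.array("record")))
```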

Loading Vertica data into PySpark throws "Failed to find data source"

I have Spark 3.2 and Vertica 9.2. spark = SparkSession.builder.appName("Ukraine").master("local[*]")\ .config("spark.jars", '/home/shivamanand/spark-3.2.1-bin-hado

Can you get the filename from input_file_name() in AWS S3 when using gunzipped files?

I've been searching for an answer to this for quite a while now but can't seem to figure it out. I've read Why is input_file_name() empty for S3 catalog sources in

PySpark DataFrame returns different results each time I run

Every time I run a simple groupBy, PySpark returns different values, even though I haven't made any modifications to the DataFrame. Here is the code I am using: df

How to sequentially iterate rows in a PySpark DataFrame

I have a Spark DataFrame like this: +-------+------+-----+---------------+ |Account|nature|value| time| +-------+------+-----+---------------+ |
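
A minimal sketch of the usual alternative to row-by-row iteration, assuming the goal is to process rows in time order within each Account (the prev_value column is hypothetical):

```python
from pyspark.sql import Window
from pyspark.sql import functions as F

# Order rows within each Account by time and look at the previous row's value,
# instead of collecting and looping over rows on the driver.
w = Window.partitionBy("Account").orderBy("time")
df_with_prev = df.withColumn("prev_value", F.lag("value").over(w))
```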

Concatenating strings by rows in PySpark

I have a PySpark DataFrame like DOCTOR | PATIENT JOHN | SAM JOHN | PETER JOHN | ROBIN BEN | ROSE BEN | GRAY and need to concatenate patient n
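
A minimal sketch, assuming the goal is one row per DOCTOR with the patient names joined into a single string:

```python
from pyspark.sql import functions as F

# Collect the patients for each doctor and join them with a comma.
result = (
    df.groupBy("DOCTOR")
      .agg(F.concat_ws(",", F.collect_list("PATIENT")).alias("PATIENTS"))
)
result.show(truncate=False)
```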

Define an environment variable in a Databricks init script

I want to define an environment variable in a Databricks init script and then read it in a PySpark notebook. I wrote this: dbutils.fs.put("/databricks/scripts/i
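
A sketch of one common pattern, assuming a cluster-scoped init script that appends to /etc/environment; the script path, variable name, and value are hypothetical, and the cluster must be restarted with the script attached before the variable becomes visible:

```python
# Write a cluster init script that defines the variable on every node.
dbutils.fs.put(
    "/databricks/scripts/set-env.sh",
    """#!/bin/bash
echo "MY_ENV_VAR=some_value" >> /etc/environment
""",
    True,  # overwrite
)

# After restarting the cluster with this init script attached:
import os
print(os.environ.get("MY_ENV_VAR"))
```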

I am trying to start a Spark Session in Jupyter Notebook and get the following error

This is my first time using PySpark. I am using a Mac and I am trying to start up a session within Jupyter Notebook using the code below: import pyspark from py
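
A minimal sketch of starting a local session from a notebook cell; the app name is a placeholder:

```python
from pyspark.sql import SparkSession

# Build (or reuse) a local SparkSession; local[*] uses all cores on the machine.
spark = (
    SparkSession.builder
    .appName("jupyter-test")
    .master("local[*]")
    .getOrCreate()
)
spark.range(5).show()
```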

PySpark UDF monitoring with Prometheus

I am trying to monitor some logic in a UDF using counters, i.e. counter = Counter(...).labels("value") @udf def do_smthng(col): if col: counter.label(

to_date gives null on format yyyyww (202001 and 202053)

I have a dataframe with a yearweek column that I want to convert to a date. The code I wrote seems to work for every week except '202001' and '202053',
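
A sketch of one workaround, assuming the values are ISO year-week strings: rather than relying on week-based to_date patterns, which are either rejected or ambiguous depending on the parser policy, the parsing is done in a Python UDF with ISO-week directives (the yearweek column name comes from the question; week_start is hypothetical):

```python
from datetime import datetime
from pyspark.sql import functions as F
from pyspark.sql import types as T

@F.udf(T.DateType())
def yearweek_to_date(yw):
    # Interpret e.g. "202001" as ISO year 2020, ISO week 01, and return the
    # Monday of that week (%G = ISO year, %V = ISO week, %u = ISO weekday).
    return datetime.strptime(yw + "1", "%G%V%u").date() if yw else None

df = df.withColumn("week_start", yearweek_to_date("yearweek"))
```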

Unable to install PySpark

I am trying to install pyspark like this: python setup.py install I get this error: Could not import pypandoc - required to package PySpark. pypandoc is inst

Is there any way to unnest BigQuery columns in Databricks in a single PySpark script?

I am trying to connect to BigQuery using the latest Databricks version (7.1+, Spark 3.0) with PySpark as the script editor/base language. We ran the below PySpark script to f
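
A minimal sketch of flattening nested columns after the BigQuery read, assuming one struct column and one repeated (array-of-struct) column; all column names here are hypothetical:

```python
from pyspark.sql import functions as F

# Expand a struct column's fields into top-level columns.
flat = df.select("id", "address.*")

# Unnest a repeated (array-of-struct) column: one output row per array element.
exploded = (
    df.withColumn("item", F.explode("line_items"))
      .select("id", "item.*")
)
```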

Pandas DataFrame in PySpark to Hive

How to send a pandas DataFrame to a Hive table? I know that if I have a Spark DataFrame, I can register it as a temporary table using df.registerTempTable("table_
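
A minimal sketch, assuming Hive support is enabled on the session and that pandas_df and the table name are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Convert the pandas DataFrame to a Spark DataFrame, then persist it as a Hive table.
sdf = spark.createDataFrame(pandas_df)
sdf.write.mode("overwrite").saveAsTable("my_db.my_table")
```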

OutOfMemoryError: Java heap space in Spark

I'm facing a memory issue that I'm unable to solve. Any help is highly appreciated. I am new to Spark and PySpark functionalities a

How to efficiently read data from MongoDB and convert it into a Spark DataFrame?

I have already researched a lot but could not find a solution. The closest question I could find here is Why my SPARK works very slowly with mongoDB. I am trying t
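
A sketch of the usual read path, assuming the MongoDB Spark connector (3.x series, short format name "mongo") is on the classpath; the URI, database, and collection are placeholders:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("mongo-read")
    .config("spark.mongodb.input.uri", "mongodb://localhost:27017/mydb.mycoll")
    .getOrCreate()
)

# Load the collection directly into a Spark DataFrame; applying .filter() early
# lets the connector push predicates down and pull less data from MongoDB.
df = spark.read.format("mongo").load()
df.printSchema()
```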

Comparing schemas of DataFrames using PySpark

I have a DataFrame (df). To show its schema I use: from pyspark.sql.functions import * df1.printSchema() And I get the following result: #root # |-- na
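
A minimal sketch of comparing schemas programmatically instead of eyeballing printSchema() output; df1 comes from the question, df2 is hypothetical:

```python
# Exact comparison, including nullability and field order.
schemas_equal = df1.schema == df2.schema

# Looser comparison on (column name, data type) pairs only.
fields1 = {(f.name, f.dataType) for f in df1.schema.fields}
fields2 = {(f.name, f.dataType) for f in df2.schema.fields}
print("only in df1:", fields1 - fields2)
print("only in df2:", fields2 - fields1)
```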

Another SparkContext is being constructed error

I am using Spark, and get this error when I try to enter 'pyspark' in the Windows command prompt. I tried to install PySpark on my Windows machine with this tutorial (h
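
A minimal sketch of the usual guard when the error comes from user code rather than the pyspark shell itself: reuse the existing context instead of constructing a second one (the app name is a placeholder):

```python
from pyspark import SparkConf, SparkContext

# Reuse the SparkContext already running in this JVM if there is one,
# instead of constructing another (which raises the error above).
conf = SparkConf().setMaster("local[*]").setAppName("test")
sc = SparkContext.getOrCreate(conf)
```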

Spark ETL and Spark Thrift Server

Some details: Spark SQL (version 3.2.1) Driver: Hive JDBC (version 2.3.9) ThriftCLIService: Starting ThriftBinaryCLIService on port 10000 with 5...500 worker t

Distribute group tasks evenly using pandas_udf in PySpark

I have a Spark DataFrame which contains groups of training data. Each group is identified by the "group" column. group | feature_1 | feature_2 | label --------
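
A minimal sketch of per-group processing with grouped pandas UDFs (applyInPandas in Spark 3); the output schema and the work done per group are hypothetical placeholders:

```python
import pandas as pd

# Each call receives all rows for one value of the "group" column as a pandas
# DataFrame; Spark distributes the groups across the executors.
def train_one_group(pdf: pd.DataFrame) -> pd.DataFrame:
    return pd.DataFrame({
        "group": [pdf["group"].iloc[0]],
        "n_rows": [len(pdf)],
    })

result = df.groupBy("group").applyInPandas(
    train_one_group,
    schema="group string, n_rows long",
)
```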