Category "pyspark"

to_date gives null on format yyyyww (202001 and 202053)

I have a dataframe with a yearweek column that I want to convert to a date. The code I wrote seems to work for every week except for weeks '202001' and '202053',
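
A workaround sketch, assuming the nulls come from Spark 3.x dropping week-based datetime patterns (Y, w) from its formatter: Python's strptime still understands ISO week fields, so a small UDF can do the parse (the column and function names here are illustrative).

```python
from datetime import datetime

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import DateType

spark = SparkSession.builder.getOrCreate()

# %G = ISO year, %V = ISO week, %u = ISO weekday; appending "1" pins the
# parse to Monday of that week, which handles weeks 01 and 53 correctly.
@F.udf(DateType())
def yearweek_to_date(yw):
    return datetime.strptime(yw + "1", "%G%V%u").date() if yw else None

df = spark.createDataFrame([("202001",), ("202053",)], ["yearweek"])
df.withColumn("week_start", yearweek_to_date("yearweek")).show()
```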

unable to install pyspark

I am trying to install pyspark like this: python setup.py install. I get this error: Could not import pypandoc - required to package PySpark. pypandoc is inst

Is there any way to unnest BigQuery columns in Databricks in a single pyspark script

I am trying to connect to BigQuery using the latest Databricks version (7.1+, Spark 3.0) with pyspark as the script editor/base language. We ran the below pyspark script to f
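
A sketch of the usual unnesting pattern, assuming the spark-bigquery connector is installed; the table and column names (events, items, id) are placeholders.

```python
from pyspark.sql import functions as F

# Read a nested BigQuery table through the spark-bigquery connector.
df = (spark.read.format("bigquery")
      .option("table", "project.dataset.events")  # placeholder table
      .load())

# RECORD (struct) columns flatten with a "*" selector; REPEATED (array)
# columns need an explode first.
flat = (df
        .withColumn("item", F.explode("items"))  # items: array<struct<...>>
        .select("id", "item.*"))                 # expand the struct's fields
```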

Pandas dataframe in pyspark to hive

How to send a pandas dataframe to a Hive table? I know that if I have a Spark dataframe, I can register it as a temporary table using df.registerTempTable("table_
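
A minimal sketch of one common route, assuming a Hive-enabled session; the database and table names are placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# pandas_df stands in for the existing pandas.DataFrame. Converting it to a
# Spark DataFrame and calling saveAsTable skips the registerTempTable detour.
sdf = spark.createDataFrame(pandas_df)
sdf.write.mode("overwrite").saveAsTable("mydb.my_table")
```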

OutOfMemoryError : Java heap space in Spark

I'm facing a memory issue that I'm unable to solve; any help is highly appreciated. I am new to Spark and pyspark functionalities a
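
Not a diagnosis on its own, but heap sizes are the usual first lever; a sketch of raising them at session creation (the 8g values are arbitrary).

```python
from pyspark.sql import SparkSession

# Illustrative sizes only; tune to the machine. spark.driver.memory must be
# set before the driver JVM starts, so configure it on a fresh session or
# via spark-submit (--driver-memory 8g), not on an already-running one.
spark = (SparkSession.builder
         .config("spark.driver.memory", "8g")
         .config("spark.executor.memory", "8g")
         .getOrCreate())
```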

How to efficiently read data from mongodb and convert it into spark's dataframe?

I have already researched a lot but could not find a solution. The closest question I could find here is Why my SPARK works very slowly with mongoDB. I am trying t
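
A read sketch using the MongoDB Spark connector (10.x option names shown); the URI, database, and collection are placeholders.

```python
# Requires the mongo-spark-connector package on the classpath.
df = (spark.read.format("mongodb")
      .option("connection.uri", "mongodb://host:27017")
      .option("database", "mydb")
      .option("collection", "mycoll")
      .load())

# Filtering on the DataFrame lets the connector push the predicate down to
# MongoDB instead of pulling the whole collection ("status" is a placeholder).
df.filter("status = 'active'").show()
```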

Comparing schema of dataframe using Pyspark

I have a data frame (df). To show its schema I use: from pyspark.sql.functions import * df1.printSchema() and I get the following result: #root # |-- na
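
One way to diff two schemas, sketched below; comparing (name, type) pairs sidesteps false mismatches from column order or nullability that a plain df1.schema == df2.schema would report (df1 and df2 stand in for the frames being compared).

```python
# Set difference of (name, dataType) pairs shows what each frame is missing.
fields1 = {(f.name, f.dataType) for f in df1.schema.fields}
fields2 = {(f.name, f.dataType) for f in df2.schema.fields}

print("only in df1:", fields1 - fields2)
print("only in df2:", fields2 - fields1)
```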

Another SparkContext is being constructed error

I am using Spark and get this error when I try to enter 'pyspark' in the Windows command prompt. I tried to install pyspark on Windows with this tutorial (h
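
Only one SparkContext may be live per JVM, so the usual fix is to reuse rather than reconstruct; a minimal sketch.

```python
from pyspark.sql import SparkSession

# getOrCreate returns the already-running session/context instead of
# building a second one, which is what raises this error.
spark = (SparkSession.builder
         .master("local[*]")
         .appName("myapp")  # placeholder app name
         .getOrCreate())
sc = spark.sparkContext
```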

spark ETL and spark thrift server

Some details: Spark SQL (version 3.2.1) Driver: Hive JDBC (version 2.3.9) ThriftCLIService: Starting ThriftBinaryCLIService on port 10000 with 5...500 worker t

Distribute group tasks evenly using pandas_udf in PySpark

I have a Spark DataFrame that contains groups of training data. Each group is identified by the "group" column. group | feature_1 | feature_2 | label --------
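
A sketch using groupBy().applyInPandas, the grouped-map flavour of pandas UDFs in Spark 3.x; repartitioning on the key first helps spread groups across executors. Here df stands in for the frame from the question and fit_group for the real training step.

```python
import pandas as pd

def fit_group(pdf: pd.DataFrame) -> pd.DataFrame:
    # Placeholder per-group training; emit one summary row per group.
    return pd.DataFrame({"group": [pdf["group"].iloc[0]],
                         "n_rows": [len(pdf)]})

result = (df.repartition("group")  # spread the groups over partitions
            .groupBy("group")
            .applyInPandas(fit_group, schema="group string, n_rows long"))
```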

Spark read from S3 working, but I am unable to write using the same session [duplicate]

I am using a pyspark test script to read and write files to S3. Here is how I initialize the spark-session: import findspark from pyspark.sql

Synapse spark job fails as input folder does not exist

How do I do exception handling for file reading? For example, I have a daily job that runs at 8:00 am. It reads files from Azure Data Lake Storage (Gen2). The
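
A defensive-read sketch: Spark surfaces a missing input folder as an AnalysisException ("Path does not exist"), so the job can catch it and exit cleanly (the path is a placeholder).

```python
from pyspark.sql.utils import AnalysisException

input_path = "abfss://container@account.dfs.core.windows.net/daily/"  # placeholder

try:
    df = spark.read.parquet(input_path)
except AnalysisException as e:
    # Missing folders land here; log and skip instead of failing the job.
    print(f"Input path not found, skipping this run: {e}")
```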

pyspark get element from array Column of struct based on condition

I have a spark df with the following schema: |-- col1 : string |-- col2 : string |-- customer: struct | |-- smt: string | |-- attributes: array (null
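
A sketch with the filter higher-order function (Spark 2.4+); the field name key and the 'color' literal are assumptions, since the schema in the question is cut off.

```python
from pyspark.sql import functions as F

# Keep only the array elements matching the predicate, then take the first;
# a.key and 'color' are hypothetical stand-ins for the real condition.
df2 = df.withColumn(
    "matched_attr",
    F.expr("filter(customer.attributes, a -> a.key = 'color')[0]")
)
```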

How to add custom method to Pyspark Dataframe class by inheritance

I am trying to inherit the DataFrame class and add additional custom methods as below so that I can chain fluently and also ensure all methods refer to the same dataf
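
Subclassing tends to break because every transformation returns a plain DataFrame, not the subclass; two common alternatives, sketched with a hypothetical lowercase_columns helper.

```python
from pyspark.sql import DataFrame

def lowercase_columns(df: DataFrame) -> DataFrame:
    return df.toDF(*[c.lower() for c in df.columns])

# Option 1: attach the method to DataFrame itself (monkey-patching).
DataFrame.lowercase_columns = lowercase_columns
out = df.lowercase_columns()

# Option 2: keep fluent chaining with the built-in transform().
out = df.transform(lowercase_columns)
```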

Reading file from s3 in pyspark using org.apache.hadoop:hadoop-aws

I am trying to read files from S3 using hadoop-aws. The command used to run the code is mentioned below. Please help me resolve this and understand what I am doing wrong
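
A session sketch; the key point is that the hadoop-aws version must match the Hadoop build your Spark ships with, and the credentials and bucket below are placeholders.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         # Match this version to the Hadoop version bundled with your Spark.
         .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.3.4")
         .config("spark.hadoop.fs.s3a.access.key", "YOUR_KEY")     # placeholder
         .config("spark.hadoop.fs.s3a.secret.key", "YOUR_SECRET")  # placeholder
         .getOrCreate())

# Note the s3a:// scheme; s3:// and s3n:// point at older filesystems.
df = spark.read.csv("s3a://my-bucket/path/", header=True)
```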

How do I get the last item from a list using pyspark?

Why does column 1st_from_end contain null? from pyspark.sql.functions import split df = sqlContext.createDataFrame([('a b c d',)], ['s',]) df.select( split(d
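
Without the full expression it is hard to say exactly where the index misses, but element_at (Spark 2.4+) accepts negative positions, which makes "last element" straightforward; a sketch.

```python
from pyspark.sql import functions as F

df = spark.createDataFrame([('a b c d',)], ['s'])
df.select(
    F.element_at(F.split('s', ' '), -1).alias('1st_from_end')  # -1 = last item
).show()
```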

Pyspark connection to the Microsoft SQL server?

I have a huge dataset in SQL Server and I want to connect to SQL Server from Python, then use pyspark to run the query. I've seen the JDBC driver but I don't fin
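
A JDBC read sketch; it assumes Microsoft's mssql-jdbc driver jar is on the classpath, and the host, database, table, and credentials are placeholders.

```python
df = (spark.read.format("jdbc")
      .option("url", "jdbc:sqlserver://myhost:1433;databaseName=mydb")
      .option("dbtable", "dbo.my_table")  # or .option("query", "SELECT ...")
      .option("user", "username")         # placeholder credentials
      .option("password", "password")
      .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
      .load())
```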

How to correctly import pyspark.sql.functions?

from pyspark.sql.functions import isnan, when, count, sum, etc. It is very tiresome adding all of them. Is there a way to import everything at once?
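
The usual idiom is a namespace import, which also avoids shadowing Python builtins such as sum and max.

```python
from pyspark.sql import functions as F

# Everything is reachable under F. without a long import list
# ("x" is a placeholder column).
df.select(F.count("*"), F.sum("x")).show()
df.select(F.when(F.isnan("x"), 0).otherwise(F.col("x")).alias("x_clean")).show()
```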

Google cloud dataproc cluster created with an environment.yaml with a jupyter resource but environment not available as a jupyter kernel

I have created a new dataproc cluster with a specific environment.yaml. Here is the command that I have used to create that cluster: gcloud dataproc clusters cr