Category "pyspark"

Pyspark's df.writeStream generates no output

I'm trying to store the tweets from my Kafka cluster into Elasticsearch. Initially, I set the output format to 'org.elasticsearch.spark.sql'. But it creat
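
A minimal sketch of what a working Elasticsearch sink can look like, assuming the elasticsearch-spark connector package is on the classpath; the node address, index name, and checkpoint path are placeholders:

    # Streaming write to Elasticsearch; the sink requires a checkpoint location.
    query = (
        tweets_df.writeStream
        .format("org.elasticsearch.spark.sql")
        .option("checkpointLocation", "/tmp/es-checkpoint")  # required by the ES sink
        .option("es.nodes", "localhost")   # placeholder node address
        .option("es.port", "9200")
        .start("tweets")                   # placeholder index name
    )
    query.awaitTermination()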

Pyspark select multiple columns from list and filter on different values

I have a table with ~5k columns and ~1 M rows that looks like this: ID Col1 Col2 Col3 Col4 Col5 Col6 Col7 Col8 Col9 Col10 Col11 ID1 0 1 0 1 0 2 1 1 2 2 0 ID2 1
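
A sketch of one way to select a column subset from a Python list and filter on the values in those columns; the column names and target value are placeholders:

    from pyspark.sql import functions as F

    cols = ["Col1", "Col5", "Col9"]        # placeholder subset of the ~5k columns
    subset = df.select("ID", *cols)

    # Keep rows where any selected column equals 2 (placeholder criterion).
    condition = F.lit(False)
    for c in cols:
        condition = condition | (F.col(c) == 2)
    subset.filter(condition).show()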

Read and group json files by date element using pyspark

I have multiple JSON files (~10 TB) in an S3 bucket, and I need to organize these files by a date element present in every JSON document. What I think that my c
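
One hedged approach: let Spark read the documents and write them back partitioned by the date element, so each date lands in its own directory. The field and bucket names are placeholders, and at ~10 TB the partition count would need tuning:

    from pyspark.sql import functions as F

    df = spark.read.json("s3a://source-bucket/path/")
    (
        df.withColumn("event_date", F.to_date("date"))  # assumes a 'date' field
          .write.partitionBy("event_date")
          .mode("overwrite")
          .json("s3a://target-bucket/by-date/")
    )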

Could not create lake database from synapse notebooks

New to Azure Synapse; trying to create a database (managed table) from a Synapse notebook. I also added Storage Blob Data Contributor for the Synapse workspace and spec
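
For reference, a minimal sketch of creating a lake database and a managed table from a Synapse notebook via Spark SQL; names are placeholders, and this assumes the workspace identity really does hold the Storage Blob Data Contributor role on the default storage account:

    spark.sql("CREATE DATABASE IF NOT EXISTS demo_lake_db")
    spark.sql("""
        CREATE TABLE IF NOT EXISTS demo_lake_db.people (id INT, name STRING)
        USING PARQUET
    """)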

Unable to write data using spark-submit

When I'm doing spark-submit using this command on Cloudera: time spark-submit \ --deploy-mode client \ --conf spark.app.name='XXXxxxxxx' --conf spark.master=l

Dataproc YARN container logs location

I'm aware of the existence of this thread: Where are the individual Dataproc Spark logs? However, if I SSH into a worker node VM and navigate to the /tmp fo

Pyspark - iterate on a big dataframe

I'm using the following code: events_df = [] for i in df.collect(): v = generate_event(i) events_df.append(v) events_df = spark.cr
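
Collecting every row to the driver defeats Spark's parallelism; a hedged alternative is to apply generate_event on the executors instead. generate_event is the asker's function, and event_schema is a placeholder schema for its output:

    # Run generate_event in parallel on the executors instead of the driver.
    events_rdd = df.rdd.map(lambda row: generate_event(row))
    events_df = spark.createDataFrame(events_rdd, schema=event_schema)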

How to process a 64 GB CSV file in Pyspark efficiently?

I have a very large CSV file in blob storage, nearly 64 GB in size. I need to do some processing on top of every row and push the data to a DB. What should be th
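
A common pattern is to let Spark split the file and write to the database from the executors rather than looping on the driver. A sketch, where the storage path, JDBC URL, credentials, and partition count are all placeholders:

    df = spark.read.csv("wasbs://container@account.blob.core.windows.net/big.csv",
                        header=True)
    (
        df.repartition(200)                 # placeholder parallelism for ~64 GB
          .write.format("jdbc")
          .option("url", "jdbc:postgresql://dbhost:5432/mydb")
          .option("dbtable", "target_table")
          .option("user", "app_user")
          .option("password", "secret")
          .mode("append")
          .save()
    )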

How can we use the multimap_agg function in Spark SQL, and is there any equivalent or alternative function?

Can anyone explain how the multimap_agg function works in SQL and how it can be used in Spark SQL?
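
Presto's multimap_agg(k, v) aggregates into a map<k, array<v>>. Spark SQL has no built-in with that name, but the same shape can be assembled from collect_list and map_from_entries; the table and column names below are placeholders:

    result = spark.sql("""
        SELECT map_from_entries(collect_list(struct(k, vs))) AS multimap
        FROM (
            SELECT k, collect_list(v) AS vs
            FROM my_table
            GROUP BY k
        )
    """)
    result.show(truncate=False)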

How to create a child dataframe from an XML file using Pyspark?

I have all the supporting libraries in pyspark and I am able to create a dataframe for the parent: def xmlReader(root, row, filename): df = spark.read.format("
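
With the spark-xml package, a child dataframe can be built either by pointing rowTag at the child element directly or by exploding the nested column out of the parent read. A sketch with placeholder tag names and path:

    from pyspark.sql import functions as F

    # Option 1: read child elements as rows directly.
    child_df = (
        spark.read.format("com.databricks.spark.xml")
        .option("rowTag", "child")          # placeholder child tag
        .load("/path/to/file.xml")
    )

    # Option 2: explode the nested child column from the parent dataframe.
    parent_df = (
        spark.read.format("com.databricks.spark.xml")
        .option("rowTag", "parent")         # placeholder parent tag
        .load("/path/to/file.xml")
    )
    child_from_parent = (
        parent_df.select(F.explode("child").alias("child")).select("child.*")
    )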

Problem with cassandra-connector at "load()"

I successfully downloaded this connector: com.datastax.spark:spark-cassandra-connector_2.11:2.5.1. And when I try to load the information with this line: data = s
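
The usual read path for the DataStax connector looks like the sketch below; keyspace and table names are placeholders, and spark.cassandra.connection.host must point at a reachable node:

    data = (
        spark.read.format("org.apache.spark.sql.cassandra")
        .options(table="my_table", keyspace="my_keyspace")
        .load()
    )
    data.show()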

Pyspark to parse multiple dictionaries

Imagine I have the following list of dicts in Python: list = [dict1, dict2, dict3] I want to parse these dicts and transform them into a dataframe and save it as
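
spark.createDataFrame accepts such a list directly; converting each dict to a Row first avoids the deprecation warning about schema inference from dicts. The dict contents and output path are placeholders:

    from pyspark.sql import Row

    dicts = [{"a": 1, "b": "x"}, {"a": 2, "b": "y"}, {"a": 3, "b": "z"}]
    df = spark.createDataFrame([Row(**d) for d in dicts])
    df.write.mode("overwrite").parquet("/tmp/dicts_parquet")  # placeholder sink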

SparkFatalException root cause

I am using Spark 3.0.2 with Java 8. I am trying to write data to an S3 path using a Spark job. I am getting the below exception and am not able to tell what caused thi

py4j.protocol.Py4JJavaError: An error occurred while calling o63.save. : java.lang.NoClassDefFoundError: org/apache/spark/Logging

I am new to Spark and the big data component HBase. I am trying to write Python code in Pyspark and connect to HBase to read data. I'm using the followi

Is there a way to slice dataframe based on index in pyspark?

In Python or R, there are ways to slice a DataFrame using an index. For example, in pandas: df.iloc[5:10,:] Is there a similar way in pyspark to slice data bas
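
Spark dataframes have no positional index, but one can be attached and filtered on. A sketch mimicking df.iloc[5:10, :], where the ordering column is a placeholder (without one, row order is undefined):

    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    w = Window.orderBy("some_ordering_column")   # placeholder ordering
    indexed = df.withColumn("idx", F.row_number().over(w) - 1)
    sliced = indexed.filter((F.col("idx") >= 5) & (F.col("idx") < 10)).drop("idx")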

Delete multiple rows from a delta table/pyspark data frame given a list of IDs

I need to find a way to delete multiple rows from a delta table/pyspark data frame given a list of IDs to identify the rows. As far as I can tell there isn't a
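
For a Delta table, the Python API does support a conditional delete. A sketch, assuming the delta-spark package is configured; the path, column name, and ID list are placeholders:

    from delta.tables import DeltaTable
    from pyspark.sql import functions as F

    ids_to_drop = ["ID1", "ID7", "ID42"]         # placeholder IDs
    dt = DeltaTable.forPath(spark, "/mnt/delta/my_table")
    dt.delete(F.col("ID").isin(ids_to_drop))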

AWS EMR s3a filesystem not found

I am running an EMR instance. It was working fine, but suddenly it started giving the below error when I try to access S3 files from a Python Spark script: py4
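
This error usually means the S3A implementation class cannot be resolved. Pinning it explicitly sometimes helps, assuming hadoop-aws and the matching AWS SDK jars are actually on the classpath; the bucket path is a placeholder:

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.appName("s3a-check")
        .config("spark.hadoop.fs.s3a.impl",
                "org.apache.hadoop.fs.s3a.S3AFileSystem")
        .getOrCreate()
    )
    df = spark.read.text("s3a://my-bucket/some/key.txt")  # placeholder path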

pyspark create dictionary from data in two columns

I have a pyspark dataframe with two columns: [Row(zip_code='58542', dma='MIN'), Row(zip_code='58701', dma='MIN'), Row(zip_code='57632', dma='MIN'), Row(zip_
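
Assuming the result is small enough to collect to the driver, a plain dict can be built from the two columns like this:

    # Collect (zip_code, dma) pairs and build a Python dict on the driver.
    zip_to_dma = {row["zip_code"]: row["dma"] for row in
                  df.select("zip_code", "dma").collect()}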

Is it possible to load multiple directories separately in pyspark but process them in parallel?

I have an S3 or Azure Blob directory structure like the following:
    parent_dir
        child_dir1
            avro_1
            avro_2
            ...
        child_dir2
        ...
There
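
Spark's scheduler is thread-safe, so separate reads can be submitted from driver threads and run as concurrent jobs. A sketch with placeholder paths and per-directory processing (reading Avro also assumes the spark-avro package is available):

    from concurrent.futures import ThreadPoolExecutor

    child_dirs = ["s3a://bucket/parent_dir/child_dir1",
                  "s3a://bucket/parent_dir/child_dir2"]   # placeholder paths

    def process(path):
        df = spark.read.format("avro").load(path)
        return path, df.count()                           # placeholder work

    with ThreadPoolExecutor(max_workers=4) as pool:
        for path, n in pool.map(process, child_dirs):
            print(path, n)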