Category "pyspark"

SPARK SQL create table does not show / read all columns as expected

I am trying to create a table in Spark SQL by providing the schema and the location. However, when I run a SELECT on the table, I see only half the columns. …
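
A minimal sketch of the pattern in question, assuming Parquet files at a hypothetical location (table, column, and path names are placeholders). A frequent cause of missing columns is a declared schema that disagrees with the schema actually stored in the files, so comparing the two is a good first check:

    # Declared schema vs. the schema Spark infers from the files
    spark.sql("""
        CREATE TABLE IF NOT EXISTS my_table (
            id   INT,
            name STRING,
            ts   TIMESTAMP
        )
        USING parquet
        LOCATION 's3://my-bucket/my-table/'
    """)

    # If columns are missing from SELECT, compare against the inferred schema
    spark.read.parquet("s3://my-bucket/my-table/").printSchema()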

'DecisionTreeClassificationModel' object has no attribute 'stages'

    tree = dtModel.stages[-1]
    print(tree)  # visualize the decision tree model

    AttributeError Traceback (most recent call last) Attribute…
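
.stages only exists on a PipelineModel; if dtModel was fit directly from a DecisionTreeClassifier, it already is the tree. A minimal defensive sketch (dtModel is the variable from the question):

    from pyspark.ml import PipelineModel

    # dtModel may be a fitted pipeline or the tree model itself
    if isinstance(dtModel, PipelineModel):
        tree = dtModel.stages[-1]   # last stage of the pipeline
    else:
        tree = dtModel              # already a DecisionTreeClassificationModel

    print(tree.toDebugString)       # text rendering of the decision tree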

pyspark how to collect map keys into list

I have a dataframe with a map column. I want to collect the non-null keys into a new column:
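
A minimal sketch using map_keys, plus map_filter (Spark 3.0+) on the assumption that "non-null" refers to entries whose values are null; the column name m is also an assumption:

    from pyspark.sql import functions as F

    # Drop entries with null values, then take the remaining keys as an array
    result = df.withColumn(
        "keys",
        F.map_keys(F.map_filter("m", lambda k, v: v.isNotNull()))
    )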

Printing secret value in Databricks

Even though secrets are for masking confidential information, I need to see the value of the secret to use it outside Databricks. When I simply print the sec…
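
Databricks replaces a printed secret with [REDACTED]. A commonly cited workaround is to print the characters separated so the masking filter no longer matches; the scope and key names below are placeholders:

    secret = dbutils.secrets.get(scope="my-scope", key="my-key")
    print(secret)            # shows [REDACTED]
    print(" ".join(secret))  # characters separated, so the value is visible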

How to return null in SUM if some values are null?

I have a case where I may have null values in the column that needs to be summed up within a group. If I encounter a null in a group, I want the sum of that group to be null.
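
Spark's sum skips nulls, so one sketch is to compare the non-null count against the total row count per group (the column names grp and val are assumptions):

    from pyspark.sql import functions as F

    # when() without otherwise() yields null, which is exactly what we want
    result = df.groupBy("grp").agg(
        F.when(F.count("val") == F.count(F.lit(1)), F.sum("val"))
         .alias("sum_or_null")
    )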

bucketing with QuantileDiscretizer using groupBy function in pyspark

I have a large dataset like so:

    | SEQ_ID|RESULT|
    +-------+------+
    |3462099|239.52|
    |3462099|239.66|
    |3462099|239.63|
    |3462099|239.64|
    |3462099|239.57|
    |3462099|…
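
QuantileDiscretizer fits one set of splits for the whole dataframe, not per group. A sketch of per-SEQ_ID quantile bucketing with a window function instead (the bucket count of 4 is an assumption):

    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    # ntile assigns each row to one of 4 quantile buckets within its SEQ_ID
    w = Window.partitionBy("SEQ_ID").orderBy("RESULT")
    bucketed = df.withColumn("bucket", F.ntile(4).over(w))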

pyspark recover for an even number the two values of a median

Is there a way in PySpark to recover, when the row count is even, the two middle values of the median? For example, I have this dataframe: df1 = spark.createDataFrame…
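
A sketch that numbers the ordered rows and picks the two middle ones when the count is even (the column name value is an assumption):

    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    n = df1.count()
    w = Window.orderBy("value")

    # For an even count the median straddles rows n/2 and n/2 + 1
    middle_two = (
        df1.withColumn("rn", F.row_number().over(w))
           .filter(F.col("rn").isin(n // 2, n // 2 + 1))
    )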

what is est in filter in the Spark UI SQL tab

I am trying to debug my Spark job, and in the SQL tab of the Spark UI I am getting this red mark on a filter's description; I am trying to figure out what it means. Spark UI s…

Pyspark's df.writeStream generates no output

I'm trying to store the tweets from my Kafka cluster into Elasticsearch. Initially, I set the output format to 'org.elasticsearch.spark.sql'. But it creat…
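
A silent lack of output often comes down to a missing checkpoint location or a query that is never started. A minimal sketch of the Elasticsearch sink (host, checkpoint path, and index name are placeholders):

    query = (
        df.writeStream
          .format("org.elasticsearch.spark.sql")
          .option("checkpointLocation", "/tmp/es-checkpoint")  # required
          .option("es.nodes", "localhost:9200")                # placeholder
          .start("tweets/_doc")                                # placeholder index
    )
    query.awaitTermination()  # keep the driver alive so batches actually run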

Pyspark select multiple columns from list and filter on different values

I have a table with ~5k columns and ~1M rows that looks like this:

    ID   Col1 Col2 Col3 Col4 Col5 Col6 Col7 Col8 Col9 Col10 Col11
    ID1  0    1    0    1    0    2    1    1    2    2     0
    ID2  1…
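
A sketch that selects a subset of columns from a Python list and keeps rows where any of them matches a value (the list contents and the value 2 are assumptions):

    from functools import reduce
    from pyspark.sql import functions as F

    cols = ["Col1", "Col3", "Col7"]  # assumed subset of the ~5k columns

    # OR together one equality test per column
    any_match = reduce(lambda a, b: a | b, [F.col(c) == 2 for c in cols])
    result = df.select(["ID"] + cols).filter(any_match)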

Read and group json files by date element using pyspark

I have multiple JSON files (~10 TB) in an S3 bucket, and I need to organize these files by a date element present in every JSON document. What I think my c…
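
A sketch of the usual read-then-partition pattern (the bucket paths and the name of the date element are placeholders); partitionBy writes one directory per distinct date, which is normally what "organize by date" comes down to:

    df = spark.read.json("s3://source-bucket/raw/")  # placeholder path

    (df.write
       .partitionBy("date")                          # assumed date element
       .mode("overwrite")
       .json("s3://target-bucket/by-date/"))         # placeholder path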

Could not create lake database from synapse notebooks

New to Azure Synapse, I am trying to create a lake database (managed table) from a Synapse notebook. I also added Storage Blob Data Contributor for the Synapse workspace and spec…
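
For reference, a minimal sketch of creating a lake database and a managed table from a Synapse Spark notebook (all names are placeholders); when this fails despite correct SQL, the storage permission is the usual suspect:

    spark.sql("CREATE DATABASE IF NOT EXISTS demo_lake_db")

    (df.write
       .mode("overwrite")
       .saveAsTable("demo_lake_db.my_managed_table"))  # placeholder name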

Unable to write data using spark-submit

When I'm doing spark-submit using this command on Cloudera:

    time spark-submit \
      --deploy-mode client \
      --conf spark.app.name='XXXxxxxxx' \
      --conf spark.master=l…

Dataproc YARN container logs location

I'm aware of the existence of this thread: where are the individual dataproc spark logs? However, if I SSH into a worker node VM and navigate to the /tmp fo…

Pyspark - iterate on a big dataframe

I'm using the following code:

    events_df = []
    for i in df.collect():
        v = generate_event(i)
        events_df.append(v)
    events_df = spark.cr…
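
collect() pulls every row onto the driver, which defeats Spark for a big dataframe. A sketch that keeps the work on the executors, assuming generate_event (the function from the question) returns a Row-like value:

    # Map on the executors instead of looping on the driver
    events_df = df.rdd.map(generate_event).toDF()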

How to process a 64 GB CSV file in Pyspark efficiently?

I have a very large CSV file, nearly 64 GB in size, in blob storage. I need to do some processing on every row and push the data to a DB. What should be th…
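
A sketch of the usual recipe: supply an explicit schema so Spark skips the inference scan over 64 GB, and push to the database with foreachPartition so each partition reuses one connection (the path, schema, and write_batch_to_db helper are hypothetical):

    from pyspark.sql.types import StructType, StructField, StringType

    schema = StructType([StructField("col1", StringType())])  # placeholder

    df = (spark.read
              .schema(schema)  # avoid a full inference pass over the file
              .csv("wasbs://container@account.blob.core.windows.net/big.csv"))

    def push_partition(rows):
        write_batch_to_db(list(rows))  # hypothetical batched DB writer

    df.foreachPartition(push_partition)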

How can we use the multimap_agg function in Spark SQL, and is there an equivalent or alternative function?

Can anyone explain how the multimap_agg function works in SQL and how it can be used in Spark SQL?
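
Spark has no multimap_agg, but the Presto behavior (each key mapped to an array of all its values) can be approximated by aggregating twice; a sketch with assumed column names id, k, and v:

    from pyspark.sql import functions as F

    # First collect the values per key, then assemble key -> values into a map
    result = (
        df.groupBy("id", "k")
          .agg(F.collect_list("v").alias("vals"))
          .groupBy("id")
          .agg(F.map_from_entries(F.collect_list(F.struct("k", "vals")))
                .alias("multimap"))
    )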

How to create child dataframe from xml file using Pyspark?

I have all the supporting libraries in PySpark and I am able to create a dataframe for the parent:

    def xmlReader(root, row, filename):
        df = spark.read.format("…
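
With the spark-xml package, a child dataframe usually comes from exploding the repeated nested element of the parent; a sketch with placeholder tag and file names, assuming <child> is an array under <parent>:

    from pyspark.sql import functions as F

    parent = (spark.read.format("com.databricks.spark.xml")
                  .option("rowTag", "parent")   # placeholder row tag
                  .load("data.xml"))            # placeholder file

    # One row per repeated <child> element, flattened to its own columns
    child = parent.select(F.explode("child").alias("c")).select("c.*")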

Problem with cassandra-connector at "load()"

I successfully downloaded this connector: com.datastax.spark:spark-cassandra-connector_2.11:2.5.1. And when I try to load the information with this line: data = s…
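
For reference, a sketch of the usual load() call for that connector (keyspace and table names are placeholders); load() failures are often a Scala/Spark version mismatch with the _2.11 artifact rather than the call itself:

    data = (spark.read
                .format("org.apache.spark.sql.cassandra")
                .options(table="my_table", keyspace="my_keyspace")
                .load())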

Pyspark to parse multiples dictionaries

Imagine I have the following list of dicts in Python: list = [dict1, dict2, dict3]. I want to parse these dicts, transform them into a dataframe, and save it as…
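
A sketch of going from a list of dicts to a dataframe and writing it out (the dict shape and output path are assumptions; the variable is renamed because list shadows the Python builtin):

    dicts = [
        {"id": 1, "name": "a"},  # assumed shape
        {"id": 2, "name": "b"},
    ]

    df = spark.createDataFrame(dicts)
    df.write.mode("overwrite").parquet("/tmp/out")  # placeholder path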