I'd like to take a Spark LDA model's term indices from the .describeTopics() output and match them to the corresponding terms in the CountVectorizer's vocabulary.
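The mapping can be done by indexing into the fitted CountVectorizerModel's vocabulary list. A minimal sketch, assuming cv_model and lda_model are the fitted models (the variable names are mine):

```python
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StringType

vocab = cv_model.vocabulary  # list of terms, indexed by term id

@F.udf(returnType=ArrayType(StringType()))
def indices_to_terms(indices):
    # termIndices from describeTopics() are positions in the vocabulary
    return [vocab[i] for i in indices]

topics = lda_model.describeTopics()
topics_with_terms = topics.withColumn("terms", indices_to_terms("termIndices"))
```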
I have the following PySpark DataFrame and I want to find the percentile row-wise.

value   col_a    col_b   col_c
row_a   5.0      0.0     11.0
row_b   3394.0   0
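Since the built-in percentile functions aggregate down a column rather than across a row, one hedged approach is a UDF over the row's values; the 50th percentile below is an assumed example:

```python
import numpy as np
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

@F.udf(returnType=DoubleType())
def row_percentile(*vals):
    xs = [v for v in vals if v is not None]  # ignore nulls in the row
    return float(np.percentile(xs, 50)) if xs else None

df = df.withColumn("p50", row_percentile("col_a", "col_b", "col_c"))
```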
I'm currently working on PySpark 1.6.3 and I'm getting this error. Do you know what the reason could be? Code:
I am on Windows and I am trying to follow this doc to create Spark applications with VS Code using a Synapse workspace. I can sign in to Azure and set a default S
Running on AWS EMR with a Jupyter PySpark notebook, I'm trying to install the Python package "sparse_dot_topn" version 0.2.9 and I'm getting an error I don't understand
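For reference, EMR Notebooks expose notebook-scoped installs on the SparkContext (EMR 5.26+); a sketch, assuming that is the install path being attempted:

```python
# Install the pinned version from the question into the notebook session
sc.install_pypi_package("sparse_dot_topn==0.2.9")
sc.list_packages()  # confirm it shows up
```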
I am trying to open a file that I uploaded to the DBFS location. However, I get an error when trying to open the file, even though I can see the file when I do an ls. Also
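A common cause (an assumption here, since the error itself is cut off): plain Python I/O does not understand dbfs:/ URIs, while the FUSE mount under /dbfs does. The file name below is hypothetical:

```python
# Listing works with the Spark-style path...
display(dbutils.fs.ls("dbfs:/FileStore/tables/"))

# ...but open() needs the local-style /dbfs mount point
with open("/dbfs/FileStore/tables/my_file.csv") as f:
    print(f.readline())
```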
I have some data in an Aurora MySQL DB, and I would like to do two things: HISTORICAL DATA: read the data from Aurora (say TABLE A), do some processing and updat
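A sketch of the historical-load half via a standard JDBC read; the endpoint, credentials, and driver choice are placeholders:

```python
df = (spark.read.format("jdbc")
      .option("url", "jdbc:mysql://<aurora-endpoint>:3306/<db>")
      .option("dbtable", "TABLE_A")
      .option("user", "<user>")
      .option("password", "<password>")
      .option("driver", "com.mysql.cj.jdbc.Driver")
      .load())
```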
Short version: I need a faster/better way to update many column comments at once in Spark/Databricks. I have a PySpark notebook that can do this sequentially acro
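The sequential baseline presumably looks something like this, with a hypothetical column-to-comment dict and one ALTER statement per column:

```python
comments = {"col_a": "first column", "col_b": "second column"}  # hypothetical
for col, text in comments.items():
    spark.sql(
        f"ALTER TABLE my_db.my_table CHANGE COLUMN {col} COMMENT '{text}'"
    )
```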
I have installed Spark 3.0.0 on a 64-bit Windows machine with Python 3.9.7 using an Anaconda base environment. I'm trying to execute the following code in the pyspar
I am currently trying to flatten data in a Databricks table. Since some of the columns are deeply nested and are of 'String' type, I couldn't use explode f
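One hedged route: parse the string column with from_json first, after which explode works; the schema and column names here are invented for illustration:

```python
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StructType, StructField, StringType

schema = ArrayType(StructType([
    StructField("id", StringType()),
    StructField("name", StringType()),
]))

df = (df.withColumn("parsed", F.from_json("nested_str_col", schema))
        .withColumn("item", F.explode("parsed"))      # now a proper array
        .select("item.id", "item.name"))
```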
Here is the PySpark code, which is running in a Jupyter notebook:

import pyspark
from delta import *

builder = pyspark.sql.SparkSession.builder.appName("MyApp") \
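The builder chain is cut off above; a minimal completion, assuming the standard delta-spark quickstart configuration:

```python
import pyspark
from delta import *

builder = (pyspark.sql.SparkSession.builder.appName("MyApp")
           .config("spark.sql.extensions",
                   "io.delta.sql.DeltaSparkSessionExtension")
           .config("spark.sql.catalog.spark_catalog",
                   "org.apache.spark.sql.delta.catalog.DeltaCatalog"))

# configure_spark_with_delta_pip is provided by the delta package
spark = configure_spark_with_delta_pip(builder).getOrCreate()
```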
I am currently working with big data in PySpark. I need to do record linkage. I have seen some implementations in pandas using the recordlinkage library. I would like
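A native-PySpark sketch of the usual block-then-compare pattern, with invented DataFrames df_a/df_b, an assumed blocking key, and an assumed distance threshold:

```python
from pyspark.sql import functions as F

# Blocking: only compare records that share a cheap exact key
candidates = df_a.alias("a").join(
    df_b.alias("b"), F.col("a.zipcode") == F.col("b.zipcode"))

# Comparison: fuzzy-match names within each block
matches = (candidates
           .withColumn("dist", F.levenshtein("a.name", "b.name"))
           .filter(F.col("dist") <= 2))
```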
I have a huge Oracle table to process, so I define a list of where clauses so that each Spark node reads its own slice. In the middle of the processing I need to join the data
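Such per-node where clauses map naturally onto the predicates argument of spark.read.jdbc, which creates one partition per clause; connection details below are placeholders:

```python
predicates = [
    "ID BETWEEN 1 AND 1000000",
    "ID BETWEEN 1000001 AND 2000000",
]

df = spark.read.jdbc(
    url="jdbc:oracle:thin:@//<host>:1521/<service>",
    table="BIG_TABLE",
    predicates=predicates,  # one Spark partition per where clause
    properties={"user": "<user>", "password": "<password>"})
```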
Given a PySpark SQL statement such as spark.sql('''select 10 % 4 as hello'''), what is the best way to throw an exception any time the % operator is used?
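One hedged option is to gate queries through a thin wrapper before they reach spark.sql; this naive check does not distinguish % inside string literals:

```python
def checked_sql(spark, query):
    # Reject the modulo operator; pmod()/mod() remain available
    if "%" in query:
        raise ValueError("The % operator is not allowed in SQL queries.")
    return spark.sql(query)

checked_sql(spark, "select 10 % 4 as hello")  # raises ValueError
```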
My algorithm is simple: I am using Spark to distribute a process that runs a cross-validation in Python. I have 3 workers and all I do is assi
Long story short, I'm tasked with converting files from Spark SQL to PySpark as my first task at my new job. However, I'm unable to see many differences outside
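That is unsurprising, since both front ends compile to the same query plans; an illustrative pair over a hypothetical employees table:

```python
from pyspark.sql import functions as F

# Spark SQL form...
sql_df = spark.sql("SELECT dept, count(*) AS n FROM employees GROUP BY dept")

# ...and the equivalent DataFrame-API form
api_df = (spark.table("employees")
          .groupBy("dept")
          .agg(F.count("*").alias("n")))
```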
I have a dataset with 4 sensor values: 'volt', 'pressure', 'rotate' and 'vibration'. For these sensor values I am calculating the rolling mean and rolling standard
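Rolling statistics like these are typically window functions; a sketch using the question's column names, with the partition key, ordering column, and 3-row window all assumed:

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = (Window.partitionBy("machine_id")   # hypothetical grouping column
           .orderBy("timestamp")        # hypothetical time column
           .rowsBetween(-2, 0))         # current row plus the 2 before it

for c in ["volt", "pressure", "rotate", "vibration"]:
    df = (df.withColumn(f"{c}_roll_mean", F.avg(c).over(w))
            .withColumn(f"{c}_roll_std", F.stddev(c).over(w)))
```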
I am attempting to read the first X rows of a Delta table into a DataFrame, and then write (overwrite) that back to the Delta table. Here is the code: # r
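The round trip presumably looks roughly like this (the path and row count are placeholders); Delta pins the read to a table snapshot, which is what makes overwriting the same table you read from workable at all:

```python
# Read only the first N rows of the table
df = spark.read.format("delta").load("/mnt/delta/my_table").limit(100)

# Overwrite the same table with just those rows
(df.write.format("delta")
   .mode("overwrite")
   .save("/mnt/delta/my_table"))
```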
Say you have something like this:

big_table1 = spark.table('db.big_table1').cache()
big_table2 = spark.table('db.big_table2').cache()
big_table2 = spark.table('
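One detail worth keeping in mind when reasoning about code like this: cache() is lazy, so nothing is materialized until an action runs:

```python
big_table1 = spark.table('db.big_table1').cache()  # only marks the plan as cacheable
big_table1.count()  # the action is what actually populates the cache
```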
I receive binary files (~3 MB each) in batches of ~20,000 files at a time. These files are used downstream for further processing, but I wa
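If the goal is to get such a batch into Spark at all, the binaryFile source (Spark 3.0+) is one candidate; the glob and filter below are placeholders:

```python
df = (spark.read.format("binaryFile")
      .option("pathGlobFilter", "*.bin")   # keep only the binary payloads
      .load("/data/incoming/batch_*/"))
# columns: path, modificationTime, length, content (the raw bytes)
```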