I'm trying to store the tweets from my Kafka cluster in Elasticsearch. Initially, I set the output format to 'org.elasticsearch.spark.sql', but it creat…
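For reference, a minimal sketch of that pipeline with Structured Streaming and the elasticsearch-hadoop connector; the broker address, topic, index name, and checkpoint path are all placeholders:

```python
# Sketch: read tweets from Kafka and stream them into Elasticsearch.
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "localhost:9092")  # placeholder broker
       .option("subscribe", "tweets")                        # placeholder topic
       .load())

tweets_df = raw.selectExpr("CAST(value AS STRING) AS tweet")

query = (tweets_df.writeStream
         .format("org.elasticsearch.spark.sql")
         .option("es.nodes", "localhost")                    # placeholder ES host
         .option("es.port", "9200")
         .option("checkpointLocation", "/tmp/es_ckpt")       # required for streaming sinks
         .start("tweets"))                                   # target index (placeholder)
query.awaitTermination()
```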
I have a table with ~5k columns and ~1M rows that looks like this:

```
ID   Col1 Col2 Col3 Col4 Col5 Col6 Col7 Col8 Col9 Col10 Col11
ID1  0    1    0    1    0    2    1    1    2    2     0
ID2  1 …
```
I have multiple JSON files (~10 TB) in an S3 bucket, and I need to organize these files by a date element present in every JSON document. What I think my c…
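One common approach is to let Spark do the layout via partitionBy, sketched below under the assumption that each document carries a top-level date field; the field name event_date and both paths are placeholders:

```python
# Sketch: reorganize JSON data on S3 by a date field; partitionBy writes
# one directory per distinct date value.
df = spark.read.json("s3://source-bucket/raw/")   # placeholder input path

(df.write
   .partitionBy("event_date")                     # assumed date field name
   .mode("overwrite")
   .json("s3://target-bucket/by-date/"))          # placeholder output path
```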
New to Azure Synapse; I'm trying to create a database (managed table) from a Synapse notebook. I also added the Storage Blob Data Contributor role for the Synapse workspace and spec…
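For context, a minimal sketch of creating a database and a managed table from a Synapse Spark notebook; the database and table names and the sample data are placeholders:

```python
# Sketch: create a database, then save a DataFrame as a managed table
# (no explicit path, so Synapse manages the storage location).
spark.sql("CREATE DATABASE IF NOT EXISTS demo_db")   # placeholder database name

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])
df.write.mode("overwrite").saveAsTable("demo_db.demo_table")
```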
When I'm doing spark-submit using this command on Cloudera:

```
time spark-submit \
  --deploy-mode client \
  --conf spark.app.name='XXXxxxxxx' \
  --conf spark.master=l…
```
I'm aware of the existence of this thread: where are the individual Dataproc Spark logs? However, if I SSH into a worker node VM and navigate to the /tmp folder…
I'm using the following code:

```python
events_df = []
for i in df.collect():
    v = generate_event(i)
    events_df.append(v)
events_df = spark.createDataFrame(events_df)
```
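If the concern is that collect() pulls every row to the driver, a sketch of one alternative that keeps the work distributed, assuming generate_event maps a single Row to one record that Spark can infer a schema from:

```python
# Sketch: run generate_event on the executors instead of collecting
# all rows to the driver first.
events_rdd = df.rdd.map(generate_event)
events_df = spark.createDataFrame(events_rdd)
```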
I have a very large CSV file in blob storage, nearly 64 GB in size. I need to do some processing on top of every row and push the data to a DB. What should be th…
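A common pattern for this shape of job is foreachPartition, so each executor opens one DB connection per partition rather than one per row. The sketch below assumes hypothetical helpers get_db_connection and process_row, and the blob path is a placeholder:

```python
# Sketch: read the CSV in parallel, then write to the DB one partition
# at a time.
df = (spark.read
      .option("header", "true")
      .csv("wasbs://container@account.blob.core.windows.net/big.csv"))  # placeholder path

def write_partition(rows):
    conn = get_db_connection()          # hypothetical helper: one connection per partition
    try:
        for row in rows:
            record = process_row(row)   # hypothetical per-row transformation
            conn.insert(record)
    finally:
        conn.close()

df.foreachPartition(write_partition)
```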
Can anyone explain how the multimap_agg function works in SQL, and whether it can be used in Spark SQL?
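Spark SQL has no multimap_agg built-in (it comes from Presto/Trino), but the same result, a map from each key to the array of all values seen for that key, can be sketched with collect_list and map_from_entries; the table and column names below are placeholders:

```python
# Sketch: emulate Presto's multimap_agg(k, v) in Spark SQL by first
# collecting values per key, then folding the (key, values) pairs
# into a single map per group.
result = spark.sql("""
    SELECT id,
           map_from_entries(collect_list(struct(k, vs))) AS multimap
    FROM (
        SELECT id, k, collect_list(v) AS vs
        FROM events                      -- placeholder table
        GROUP BY id, k
    ) t
    GROUP BY id
""")
```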
I have all the supporting libraries in PySpark and I am able to create a dataframe for the parent:

```python
def xmlReader(root, row, filename):
    df = spark.read.format("…
```
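For reference, a minimal sketch of reading XML with the spark-xml package, the usual library behind a format string like the one truncated above; the rowTag value and file path are placeholders:

```python
# Sketch: read XML into a DataFrame with spark-xml; nested child
# elements arrive as struct/array columns that can be exploded later.
df = (spark.read.format("com.databricks.spark.xml")
      .option("rowTag", "parent")        # placeholder row tag
      .load("books.xml"))                # placeholder file
df.printSchema()
```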
I successfully downloaded this connector: com.datastax.spark:spark-cassandra-connector_2.11:2.5.1. And when I try to load the information with this line: data = s…
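In case it helps to compare against the truncated line, a minimal sketch of the usual read call for the Spark Cassandra Connector; the keyspace and table names are placeholders:

```python
# Sketch: read a Cassandra table through the DataFrame API of the
# spark-cassandra-connector.
data = (spark.read
        .format("org.apache.spark.sql.cassandra")
        .options(table="my_table", keyspace="my_keyspace")  # placeholder names
        .load())
```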
Imagine I have the following list of dicts in Python: list = [dict1, dict2, dict3]. I want to parse these dicts and transform them into a dataframe and save it as…
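A minimal sketch of the dataframe step, renaming the variable so it doesn't shadow Python's built-in list; the dict contents are placeholders:

```python
# Sketch: build a Spark DataFrame directly from a list of dicts.
records = [
    {"id": 1, "name": "a"},   # placeholder dicts
    {"id": 2, "name": "b"},
]
df = spark.createDataFrame(records)
df.show()
```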
I am using Spark 3.0.2 with Java 8. I am trying to write data to an S3 path using a Spark job. I am getting the below exception and am not able to tell what caused this…
I am new to Spark and the big-data component HBase. I am trying to write Python code in PySpark and connect to HBase to read data from it. I'm using the following…
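For comparison, one commonly used path is the Hortonworks shc connector, sketched below; the catalog JSON, table name, and column mappings are all placeholders, and the shc-core JAR is assumed to be on the classpath:

```python
# Sketch: read an HBase table into a DataFrame via the shc connector,
# which maps HBase column families/qualifiers through a JSON catalog.
catalog = """{
  "table": {"namespace": "default", "name": "my_table"},
  "rowkey": "key",
  "columns": {
    "key":  {"cf": "rowkey", "col": "key",  "type": "string"},
    "name": {"cf": "cf1",    "col": "name", "type": "string"}
  }
}"""

df = (spark.read
      .options(catalog=catalog)
      .format("org.apache.spark.sql.execution.datasources.hbase")
      .load())
```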
In Python or R, there are ways to slice a DataFrame by index. For example, in pandas: df.iloc[5:10, :]. Is there a similar way in PySpark to slice data based…
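Spark DataFrames carry no positional index, but one workaround is to manufacture one. The sketch below uses zipWithIndex, which assigns stable, gap-free positions; note the row order is whatever Spark currently holds, so an explicit sort is usually wanted first:

```python
# Sketch: emulate df.iloc[5:10, :] by attaching a row index and filtering.
from pyspark.sql import Row

indexed = df.rdd.zipWithIndex().map(
    lambda pair: Row(idx=pair[1], **pair[0].asDict())
)
sliced = (spark.createDataFrame(indexed)
          .filter("idx >= 5 AND idx < 10")
          .drop("idx"))
```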
I need to find a way to delete multiple rows from a Delta table / PySpark dataframe given a list of IDs that identify the rows. As far as I can tell, there isn't a…
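For the Delta table side, the DeltaTable API supports deleting by predicate, so the ID list can become an isin condition; the table path, column name, and IDs below are placeholders:

```python
# Sketch: delete every row whose id appears in a given list
# from a Delta table.
from delta.tables import DeltaTable
from pyspark.sql import functions as F

ids_to_delete = ["id1", "id2", "id3"]                  # placeholder IDs

dt = DeltaTable.forPath(spark, "/mnt/delta/my_table")  # placeholder path
dt.delete(F.col("id").isin(ids_to_delete))
```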
I am running an EMR instance. It was working fine, but suddenly it started giving the below error when I try to access S3 files from a Python Spark script: py4…
I have a PySpark dataframe with two columns:

```python
[Row(zip_code='58542', dma='MIN'),
 Row(zip_code='58701', dma='MIN'),
 Row(zip_code='57632', dma='MIN'),
 Row(zip_…
```
I have an S3 or Azure blob directory structure like the following:

```
parent_dir
  child_dir1
    avro_1
    avro_2
    ...
  child_dir2
    ...
```

There…
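Since the question is cut off, one general note: Spark 3.x can read Avro files across a whole tree like this in a single call, sketched below; the path is a placeholder and spark-avro is assumed to be on the classpath:

```python
# Sketch: read every Avro file under parent_dir, descending into
# the child directories.
df = (spark.read
      .format("avro")
      .option("recursiveFileLookup", "true")  # Spark 3.0+ option
      .load("s3://bucket/parent_dir/"))       # placeholder path
```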