We tried to test the following example code for accessing HBase tables (Spark-1.3.1, HBase-1.1.1, Hadoop-2.7.0): import sys from pyspark import SparkContext
Is it possible to use Spark Structured Streaming to read data from MongoDB with a readStream? For standard use of Structured Streaming, I usually do so: va
I set up a Vagrant environment with Spark 1.5.0 installed. Then I use sbin/start-all.sh to start Spark. Inside the VM I can curl localhost:8080 to get the HTML co
We have a pandas DataFrame that we are using. We have a function we use on retail data which runs on a daily basis, row by row, to calculate the item-to-item differe
I am using Spark 3.1.2 and attempting to use pyspark-pandas. However, when attempting from pyspark import pandas as ps I am getting the following error: ImportEr
I am submitting a Spark job with the following specification (the same program has been run on data sizes ranging from 50GB to 400GB): /usr/hdp/2.6.0.3-8/
I'm in the process of migrating current Databricks Spark notebooks to Jupyter notebooks. Databricks provides the convenient and beautiful display(data_frame) functi
I have an existing Spark dataframe that has columns as such: -------------------- pid | response -------------------- 12 | {"status":"200"} response is a st
I've set up a Docker container that starts a Jupyter notebook using Spark. I've integrated the necessary jars into Spark's directory for being able to access
For SparkSQL on Hive, when I use named_struct in the query, it returns results: SELECT id, collect_set(emp_info) as employee_info FROM ( SELECT t.id,
I have: key value a [1,2,3] b [2,3,4] I want: key value1 value2 value3 a 1 2 3 b 2 3 4 It seems that in scala I can wr
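The reshaping described in the excerpt above can be illustrated with a minimal plain-Python sketch. It assumes every array has the same known length (3 here); the function name `expand_array_column` and its parameters are hypothetical and exist only for this illustration. In PySpark the same idea is usually expressed by indexing the array column once per position, for example `col("value")[i]`, but the sketch below deliberately avoids Spark so that it runs anywhere.

```python
# Plain-Python sketch (not Spark): expand a fixed-length list column
# into one column per element. All names here are hypothetical.
rows = [
    {"key": "a", "value": [1, 2, 3]},
    {"key": "b", "value": [2, 3, 4]},
]


def expand_array_column(rows, source="value", width=3):
    """Return new rows where source[i] becomes column f'{source}{i + 1}'."""
    expanded = []
    for row in rows:
        # Copy every column except the array column itself.
        new_row = {k: v for k, v in row.items() if k != source}
        # Assumption: each array holds at least `width` elements.
        for i in range(width):
            new_row[f"{source}{i + 1}"] = row[source][i]
        expanded.append(new_row)
    return expanded


wide = expand_array_column(rows)
for row in wide:
    print(row)
```

If the arrays can differ in length, the fixed `width` assumption no longer holds and the missing positions would need an explicit default.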
I have a sample application working to read from csv files into a dataframe. The dataframe can be stored to a Hive table in parquet format using the method df.
I tried to run a simple test code in IntelliJ IDEA. Here is my code: import org.apache.spark.sql.functions._ import org.apache.spark.{SparkConf} import org.apa
I am new to Airflow automation. I don't know if it is possible to do this with Apache Airflow (or Luigi etc.) or whether I should just write a long bash file to do this. I
I try to save the result as one "csv" file on Windows Server 2019. I'm using the "Microsoft.Spark" library. An empty folder is created with no "csv" file. The q
In my project I am using spark-Cassandra-connector to read from a Cassandra table and process it further into a JavaRDD, but I am facing an issue while processing C
While executing PySpark code from a script, I get the following error at df.show(). from pyspark.sql.types import StructType,StructField, StringType, IntegerTy
I am trying to transform an entire df to a single vector column, using df_vec = vectorAssembler.transform(df.drop('col200')) I get this error: F
I have a DataFrame and wish to divide it into parts with an equal number of rows. In other words, I want a list of DataFrames where each one is a disjoint subset of the
When running a PySpark script, I get the following error depending on which XML I query: cannot resolve 'explode(...)' due to data type mismatch The pyspark code: fr