I'm facing a memory issue that I'm unable to solve. Any help is highly appreciated. I am new to Spark and PySpark functionalities a
I'm trying to read from Elasticsearch in a Livy job using the elasticsearch-spark jar. When I upload the jar to a Livy client (like the example here) I get this e
I am currently running a Java Spark application in Tomcat and receiving the following exception: Caused by: java.io.IOException: Mkdirs failed to create file:/
I would like to repartition / coalesce my data so that it is saved into one Parquet file per partition. I would also like to use the Spark SQL partitionBy API.
I have a data frame (df). For showing its schema I use: from pyspark.sql.functions import * df1.printSchema() And I get the following result: #root # |-- na
I have a Spark job that has been failing since the upgrade from CDH 5.5.4 (Spark 1.5.0) to CDH 5.13.0 (Spark 1.6.0). The job is running with the
I am trying to install Spark, which requires Java, using !apt-get install openjdk-8-jdk-headless -qq > /dev/null and I get an error after it. E: Failed
In a k8s cluster, how do you configure Zeppelin to run Spark jobs in an existing Spark cluster instead of spinning up a new pod? I've got a k8s cluster up and r
I am using Spark and get an error when I try to enter 'pyspark' in the Windows command prompt. I tried to install PySpark on Windows with this tutorial (h
When I submit the Spark job from the terminal, I get the error below saying that the file does not exist, although I have already placed the config file locally. spa
Is it possible to use a framework to enable dependency injection in a Spark application? Is it possible to use Guice, for instance? If so,
Some details: Spark SQL (version 3.2.1) Driver: Hive JDBC (version 2.3.9) ThriftCLIService: Starting ThriftBinaryCLIService on port 10000 with 5...500 worker t
I am using https://github.com/databricks/spark-csv and trying to write a single CSV file, but it creates a folder instead. I need a Scala function which wil
I'm loading large datasets and then caching them for reference throughout my code. The code looks something like this: val conversations = sqlContext.read .f
I'm trying to run a Spark application using bin/spark-submit. When I reference my application jar inside my local filesystem, it works. However, when I copied m
In the program below, duplicate columns are created when joining two DataFrames in PySpark. >>> spark = SparkSession.builder.appName("Jo
I am using a PySpark test script to read and write files to S3. Here is how I initialize the Spark session: import findspark from pyspark.sql
I am working on a Spark application (Spark 2.0.0 and Scala 2.11.8), and the application works fine within the IntelliJ IDEA environment. I've extracted application a
How do I handle exceptions when reading files? For example, I have a daily job that runs at 8:00 am. It reads files from Azure Data Lake Storage (Gen 2). The
I have a Spark DataFrame with the following schema: |-- col1 : string |-- col2 : string |-- customer: struct | |-- smt: string | |-- attributes: array (null