I am writing unit tests for my spark/scala application. I am using scalamock as well to mock objects, specifically Session / Session Factory. In one of my test
I have a Spark/Scala job in which I do this: 1: Compute a big DataFrame df1 + cache it into memory 2: Use df1 to compute dfA 3: Read raw data into df2 (again,
I am executing a Spark job in a Databricks cluster. I am triggering the job via an Azure Data Factory pipeline and it executes at 15-minute intervals, so after the su
Launching pyspark in client mode. bin/pyspark --master yarn-client --num-executors 60 The import numpy on the shell goes fine but it fails in the kmeans. Someho
Using "spark.sql.warehouse.dir" in the same jupyter session (no databricks) works. But after a kernel restart in jupyter the catalog db and tables aren't re
I'm new to Delta Lake, but I want to create some indexes for fast retrieval for some tables in Delta Lake. Based on the docs, it shows that the closest is b
Does anyone know why I keep getting this error in Jupyter Notebooks? I've been trying to load my TensorFlow model into Apache Spark via SparkFlow but I can't seem to
I can't understand why my code isn't working. The last line is the problem: import findspark findspark.init() from pyspark import SparkConf, SparkContext from p
I have an Apache Beam project which works fine if I run it directly. But if I try to create a jar using maven clean:package it creates an uber jar using maven sha
I'm learning pyspark and trying the code below. Can someone help me understand what is wrong? >>> pairs=data.flatMap(lambda x:x.split(' ')).map(lambda x
I'm using Spark to run a grid search job using spark sklearn package. Here's my config NUM_SLAVES = 14 DRIVER_SPARK_MEMORY=53 # "spark.driver.memory" EXECUTOR_
I am a novice to spark, and I want to transform the source dataframe below (loaded from a JSON file): +--+-----+-----+ |A |count|major| +--+-----+-----+ | a| 1| m
In Win10, in IntelliJ, this path("C:/hive/Orders_[0-9]*.csv") works well when run as a standalone Java Spark job, but it does not work as a Spring Boot Spark job. Seems
After I upgraded Spark to version 3.2.1, a job that runs for about 1 day throws an exception. I configured one driver and 2 executors; each executor is allocated 2g memory
I'm using a shared EMR cluster with Jupyterhub installed. If my cluster is under heavy load, I get an error. How do I increase the timeout for a spark applicati
We tried to test the following example code for accessing HBase tables (Spark-1.3.1, HBase-1.1.1, Hadoop-2.7.0): import sys from pyspark import SparkContext
Is it possible to use spark structured streaming to read data from mongo db with a readStream ? For standard use of structured streaming, I usually do so: va
So I set up a vagrant environment with Spark 1.5.0 installed. Then I use sbin/start-all.sh to start Spark. Inside the VM I can curl localhost:8080 to get the HTML co
We have a pandas dataframe that we are using. We have a function we use in retail data which runs on a daily basis row by row to calculate the item to item differe