How do we do Spark DataFrame testing using JUnit?

We are trying to build an integration test suite using JUnit. Our pipeline (built in Spark using Scala) produces DataFrames as output, and we plan to compare them against an ExpectedOutput passed in through a config/JSON input. We have some internal tools that are integrated with JUnit for coverage and CI/CD, so we need a way to integrate JUnit with our DataFrame comparisons, but we have been unable to find any such example.

Has anyone seen such an implementation that we can refer to?



Solution 1:[1]

You can start a local Spark context in the tests. Make sure you create only one context for the whole test run. In each test, call .collect() on the dataset (keep the samples small) and compare the result against your expected JSON.

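Here is a sketch of the context setup with JUnit 4's @BeforeClass.

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.junit.BeforeClass;

    // JUnit 4 requires @BeforeClass methods to be static, so the context is a static field.
    private static JavaSparkContext ctx;

    @BeforeClass
    public static void init() {
        SparkConf conf = new SparkConf();
        conf.setMaster("local");    // run Spark in-process, no cluster needed
        conf.setAppName("junit");
        ctx = new JavaSparkContext(conf);
    }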
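The comparison step (collect a small sample and check it against the expected JSON) could look roughly like the sketch below. It assumes the actual rows have already been collected as JSON strings, for example with `df.toJSON().collectAsList()`, and that the expected rows were read from your JSON input as one string per row. `RowComparison` and `sameRows` are hypothetical names made up for illustration, not part of Spark or JUnit.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper (not part of Spark or JUnit): compares two sets of
// rows, each row serialized as one JSON string.
final class RowComparison {

    private RowComparison() {}

    // Returns true when both lists contain the same rows, ignoring row order
    // but still counting duplicates. Rows are compared as plain strings, so
    // field order and number formatting must match on both sides.
    static boolean sameRows(List<String> actual, List<String> expected) {
        List<String> a = new ArrayList<>(actual);
        List<String> e = new ArrayList<>(expected);
        Collections.sort(a);
        Collections.sort(e);
        return a.equals(e);
    }

    public static void main(String[] args) {
        // Assumption: in a real test, `actual` would come from
        // df.toJSON().collectAsList() and `expected` from the JSON input.
        List<String> actual = List.of(
                "{\"id\":2,\"name\":\"b\"}",
                "{\"id\":1,\"name\":\"a\"}");
        List<String> expected = List.of(
                "{\"id\":1,\"name\":\"a\"}",
                "{\"id\":2,\"name\":\"b\"}");
        System.out.println(sameRows(actual, expected)); // prints "true"
    }
}
```

Inside a JUnit test, the check would typically be wrapped in an assertion such as `assertTrue(RowComparison.sameRows(actualRows, expectedRows))`. Sorting before comparing is used here because Spark does not guarantee row order unless the DataFrame is explicitly sorted. Comparing raw strings is the simplest option; parsing each row with a JSON library would be more robust if field order can differ.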

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 João Guitana