Data testing framework for data streaming (deequ vs Great Expectations)

I want to introduce data quality testing (empty fields/max-min values/regex/etc...) into my pipeline, which will essentially consume Kafka topics and test the data before it is written to the DB.

I am having a hard time choosing between the deequ and Great Expectations frameworks. Deequ lacks clear documentation but has "anomaly detection", which can compare previous scans to current ones. Great Expectations has very clear documentation and thus less overhead. I think neither of these frameworks is made for data streaming specifically.

Can anyone offer some advice/other framework suggestions?



Solution 1:[1]

As Philipp observed, in most cases batches of some sort are a good way to apply tests to streaming data (even Spark Streaming is effectively using a "mini-batch" system).
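To make the mini-batch idea concrete, here is a minimal, framework-agnostic sketch of the kinds of per-batch checks the question mentions (non-empty fields, min/max bounds, regex). The function, field names, and checks are invented for illustration; this is not the API of Great Expectations or deequ.

```python
import re

def validate_batch(records):
    """Run simple data-quality checks on one mini-batch of records.

    Returns a list of (record_index, check_name) failures.
    """
    # Hypothetical schema: each record has "name", "age", "email".
    email_re = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
    failures = []
    for i, rec in enumerate(records):
        if not rec.get("name"):                       # empty-field check
            failures.append((i, "name_not_empty"))
        age = rec.get("age")
        if age is None or not (0 <= age <= 130):      # min/max check
            failures.append((i, "age_in_range"))
        if not email_re.match(rec.get("email", "")):  # regex check
            failures.append((i, "email_format"))
    return failures

batch = [
    {"name": "Ada", "age": 36, "email": "ada@example.com"},
    {"name": "", "age": 200, "email": "not-an-email"},
]
print(validate_batch(batch))
# [(1, 'name_not_empty'), (1, 'age_in_range'), (1, 'email_format')]
```

In a Kafka consumer loop you would call something like this on each polled batch of messages and decide whether to write, quarantine, or drop failing records.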

That said: if you need to use a streaming algorithm to compute a metric required for your validation (e.g. to maintain running counts over observed data), it is possible to decompose your target metric into a "state" and "update" portion, which can be properties of the "last" and "current" batches (even if those are only one record each). Improved support for that kind of cross-batch metric is actually the area we're most actively working on in Great Expectations now!
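The "state" and "update" decomposition described above can be sketched as follows for a running mean. The class and method names here are invented for illustration and are not Great Expectations' API; the point is only that the metric folds each new batch into a small persistent state.

```python
from dataclasses import dataclass

@dataclass
class MeanState:
    """Mergeable state for a running mean: just a count and a sum."""
    count: int = 0
    total: float = 0.0

    def update(self, batch):
        """Fold one (mini-)batch of values into the running state."""
        self.count += len(batch)
        self.total += sum(batch)
        return self

    @property
    def mean(self):
        return self.total / self.count if self.count else float("nan")

state = MeanState()
state.update([1.0, 2.0, 3.0])  # "last" batch
state.update([4.0])            # "current" batch, even a single record
print(state.count, state.mean)  # 4 2.5
```

Because the state is tiny and additive, it works equally well whether a batch holds thousands of records or exactly one, which is what makes the Batch concept stretch to streaming.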

In that way, I think of the concept of the Batch as baked deeply into the core concepts of what gets validated, yet sufficiently flexible to work in a streaming system.

Disclaimer: I am one of the authors of Great Expectations. (Stack Overflow alerts! :))

Solution 2:[2]

You can mini-batch your data and apply data quality verification to each of these batches individually. Moreover, deequ allows for stateful computation of data quality metrics where, like James already pointed out, metrics are computed on two partitions of data and are then merged. You can find deequ examples of this here.
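To illustrate the merge step described above (metrics computed on two partitions of data and then combined), here is a toy sketch using a (count, min, max) state. The functions are illustrative only and are not deequ's actual API.

```python
def partition_state(values):
    """Compute a mergeable (count, minimum, maximum) state for one partition."""
    return (len(values), min(values), max(values))

def merge(a, b):
    """Combine two partition states without rescanning the raw data."""
    return (a[0] + b[0], min(a[1], b[1]), max(a[2], b[2]))

old = partition_state([3, 7, 5])  # e.g. previously seen data
new = partition_state([2, 9])     # e.g. the latest mini-batch
print(merge(old, new))            # (5, 2, 9)
```

The key property is that `merge(partition_state(a), partition_state(b))` equals `partition_state(a + b)`, so quality metrics stay exact across batches without reprocessing history.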

Is there a specific example that was not covered in deequ's documentation? You can find a basic example of running deequ against a Spark Dataframe here. Also, there are more examples in the same folder, for example for anomaly detection use-cases.
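The anomaly-detection idea mentioned in the question (comparing a previous scan to the current one) can be reduced to a very small comparison; this threshold-based sketch is invented for illustration and is not deequ's anomaly-detection API.

```python
def is_anomalous(previous, current, max_relative_change=0.5):
    """Flag the current metric if it moved too far from the previous scan."""
    if previous == 0:
        return current != 0
    return abs(current - previous) / abs(previous) > max_relative_change

print(is_anomalous(1000, 1040))  # small drift  -> False
print(is_anomalous(1000, 100))   # sharp drop   -> True
```

In practice you would persist the previous scan's metrics (row count, completeness, etc.) and run a comparison like this on every new batch.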

Disclaimer: I am one of the authors of deequ.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution | Source
Solution 1 | James
Solution 2 |