Recalculate historical data using Apache Beam
I have an Apache Beam streaming project that processes data and writes the results to a database. What is the best way to reprocess all historical records, after a bug fix or a change in the processing logic, without a big delay?
Solution 1:[1]
It is quite application-dependent.
For example, if you are using Kafka (and all the historical data is still retained in its topics), a straightforward approach is:
- Stop and relaunch the job without restoring from a savepoint (or, if you want no downtime at all, launch a second job while the existing one keeps running):
  - Use a different Kafka consumer group so you do not interfere with the existing pipeline (see the sketch after this list).
  - Point the output at a new database so its contents are rebuilt from scratch.
  - Scale the job up so it finishes reprocessing as quickly as possible.
  - Switch the old database for the new one atomically.
  - Scale the job back down.
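A minimal sketch of what such a reprocessing job could look like with the Beam Java SDK, assuming a Kafka source and a JDBC sink; the broker address, topic, consumer group, JDBC URL, and table name are placeholders, not values from the question:

```java
import com.google.common.collect.ImmutableMap;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.jdbc.JdbcIO;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Values;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ReprocessHistory {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply("ReadFullTopic", KafkaIO.<String, String>read()
            .withBootstrapServers("kafka:9092")            // placeholder broker
            .withTopic("events")                           // placeholder topic
            .withKeyDeserializer(StringDeserializer.class)
            .withValueDeserializer(StringDeserializer.class)
            // Fresh consumer group so the running pipeline's offsets stay untouched,
            // and start from the earliest offset to cover all historical records.
            .withConsumerConfigUpdates(ImmutableMap.of(
                "group.id", "reprocess-after-fix",         // placeholder group
                "auto.offset.reset", "earliest"))
            .withoutMetadata())
        .apply(Values.create())
        // ... apply the corrected processing logic here ...
        .apply("WriteToNewTable", JdbcIO.<String>write()
            .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(
                "org.postgresql.Driver",
                "jdbc:postgresql://db-host/reprocess_db")) // placeholder database
            .withStatement("INSERT INTO results_v2 (value) VALUES (?)")
            .withPreparedStatementSetter(
                (value, statement) -> statement.setString(1, value)));

    p.run().waitUntilFinish();
  }
}
```

Once this job has caught up, the atomic switch in the second-to-last step typically happens outside Beam, for example by renaming tables or flipping the connection string used by the serving application, after which the reprocessing job can be scaled back down or retired.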
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Gerard Garcia |