Recalculate historical data using Apache Beam

I have an Apache Beam streaming project that calculates data and writes it to a database. What is the best way to reprocess all historical records, without a big delay, after a bug fix or after changing the way the data is processed?



Solution 1:[1]

It is quite application-dependent.

For example, a straightforward approach if you are using Kafka (and all data is in there):

  • Stop and relaunch the job without using a savepoint (or, for no downtime at all, launch a second job while the existing one keeps running), as sketched after this list:
    • Use a different Kafka consumer group to not mess with the existing pipeline
    • Set a new database as output to build its contents from scratch
    • Scale the job up so that it finishes reprocessing as quickly as possible
  • Swap the old database for the new one atomically
  • Scale the job back down
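The idea behind the first bullet is to point the same pipeline code at the topic again, but with its own consumer group and offsets reset to the beginning, and to direct its output at a fresh database. Below is a minimal Beam Python sketch of such a backfill job using Beam's cross-language Kafka connector; the broker address, topic name, group id, and the WriteToNewDatabase sink are placeholder assumptions you would replace with your own, and process_record stands in for the corrected business logic.

```python
import apache_beam as beam
from apache_beam.io.kafka import ReadFromKafka
from apache_beam.options.pipeline_options import PipelineOptions


def process_record(record):
    """Placeholder for the fixed business logic; `record` is a (key, value) pair of bytes."""
    key, value = record
    return key, value  # apply the corrected transformation here


class WriteToNewDatabase(beam.DoFn):
    """Hypothetical sink that writes into the freshly created 'rebuild' database."""

    def process(self, element):
        # Replace with your real database client / connector.
        pass


def run():
    options = PipelineOptions(streaming=True)
    with beam.Pipeline(options=options) as p:
        (
            p
            # Read the topic from the very beginning with a *new* consumer group,
            # so the existing production pipeline keeps its own offsets untouched.
            | "ReadKafka" >> ReadFromKafka(
                consumer_config={
                    "bootstrap.servers": "kafka:9092",  # assumed broker address
                    "group.id": "reprocess-backfill",   # new, dedicated group
                    "auto.offset.reset": "earliest",    # replay all history
                },
                topics=["events"],                      # assumed topic name
            )
            | "Process" >> beam.Map(process_record)
            # Write into the new database, not the one the live pipeline uses.
            | "WriteNewDb" >> beam.ParDo(WriteToNewDatabase())
        )


if __name__ == "__main__":
    run()
```

Once the backfill job has caught up with the head of the topic, the new database can be swapped in for the old one and the job scaled back down or retired.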

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution 1: Gerard Garcia