PostgreSQL: Prune replicated data from outbox table
Problem Statement
To ensure disk usage doesn't grow unnecessarily, I want to be able to delete rows from my outbox table once they have been replicated.
Context
Postgres is at v12
We are using a Kafka source connector to stream changes made to a Postgres table. These changes are insert-only and thus are no longer needed once written to Kafka. The source connector uses logical replication to stream the changes, and the state of the replication can be inspected in pg_replication_slots.
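For reference, a logical slot like the one used here could be created manually as follows. This is a sketch only, since Debezium normally creates its slot itself on startup; the slot name and plugin match the pg_replication_slots output shown below:

-- Creates a logical replication slot named 'debezium' decoding via wal2json.
-- Normally the Kafka source connector does this for you.
SELECT * FROM pg_create_logical_replication_slot('debezium', 'wal2json');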
Looking at pg_replication_slots, you can see the data Postgres stores to track which WAL it still has to retain so that replication can resume for the client.
For example when I run:
select * from pg_replication_slots;
I might see:
 slot_name |  plugin  | slot_type | datoid |   database    | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn
-----------+----------+-----------+--------+---------------+-----------+--------+------------+------+--------------+-------------+---------------------
 debezium  | wal2json | logical   |  26593 | database_name | f         | t      |       7404 |      |        26729 | 0/DCD98E8   | 0/DCD9920
(1 row)
What I'd like to know is whether I can reliably combine that data with PostgreSQL's per-row metadata on the table to select all rows that have already been replicated through that slot.
For example, the following doesn't work as far as I can tell, but would ideally return the rows that have been replicated and are now safe to prune from the table:
select * from outbox where age(xmin) < (select age(catalog_xmin) from pg_replication_slots);
Any guidance would be sweet! Cheers!
Solution 1:[1]
I have been implementing the outbox pattern using Debezium with MySQL, and I delete the outbox record straight after inserting it, as described in https://debezium.io/blog/2019/02/19/reliable-microservices-data-exchange-with-the-outbox-pattern/. The insert is picked up and sent, while the delete is ignored, so outside of the writing transaction there should never be anything in the outbox table. I also pre-generate the primary keys for the entries (which I use as the event ID in Kafka), so I can bulk insert and delete. A sketch of the approach is below.
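A minimal sketch of this insert-then-delete approach, assuming a hypothetical outbox(id uuid, event_type text, payload jsonb) table; the column names and values are illustrative only:

-- Insert the event and delete it in the same transaction. Logical decoding
-- still sees both changes in the WAL: the INSERT becomes the Kafka event,
-- and the DELETE can be dropped via connector configuration.
BEGIN;
INSERT INTO outbox (id, event_type, payload)
VALUES ('7d2f9b9e-0f2a-4c3e-9d1a-5b8e2f6c1a34',  -- pre-generated primary key
        'OrderCreated',
        '{"order_id": 42}');
DELETE FROM outbox
WHERE id = '7d2f9b9e-0f2a-4c3e-9d1a-5b8e2f6c1a34';
COMMIT;

Because the ID is generated by the application rather than by the database, the same technique extends to bulk inserts followed by a single bulk delete.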
Solution 2:[2]
Circling back around to this, I had to think a bit differently about how we could tie the replication's progress to our outbox table. Previously in my question I was trying to glean progress from pg_replication_slots, but in this working example I switched to using pg_stat_replication. This view reports lag for each walsender; note that it has no slot_name column of its own, so it is joined to pg_replication_slots (via the slot's active_pid) to filter down to the slot we care about. For example:
SELECT * FROM outbox
WHERE created_at < (
    SELECT now() - COALESCE(psr.replay_lag, interval '60 seconds') AS stale_time
    FROM pg_stat_replication psr
    JOIN pg_replication_slots prs ON prs.active_pid = psr.pid
    WHERE prs.slot_name = 'outbox_slot'
);
This returns the rows from our outbox table that were created longer ago than the current replay_lag (falling back to 60 seconds when no lag value is reported), i.e. rows the connector should already have consumed.
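To actually prune, the same predicate can drive a DELETE. A minimal sketch, reusing the created_at column and the 'outbox_slot' slot name from the query above:

-- Prune rows older than the observed replay lag for 'outbox_slot'.
DELETE FROM outbox
WHERE created_at < (
    SELECT now() - COALESCE(psr.replay_lag, interval '60 seconds')
    FROM pg_stat_replication psr
    JOIN pg_replication_slots prs ON prs.active_pid = psr.pid
    WHERE prs.slot_name = 'outbox_slot'
);

One useful property of this shape: if the slot has no active walsender, the scalar subquery yields no row, the comparison evaluates to NULL, and the DELETE removes nothing, so it fails safe.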
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution   | Source      |
|------------|-------------|
| Solution 1 | Bennett T   |
| Solution 2 | MattyIceDev |