'Try to understand MVCC

I'm trying to understand MVCC and can't get it. For example. Transaction1 (T1) try to read some data row. In that same time T2 update the same row.

The flow of transaction is

T1 begin -> T2 begin -> T2 commit -> T1 commit

So the first transaction get it snapshot of database and returns to user result on which he is gonna build other calculation. But as I understand, customer get the old data value? As I understand MVCC, after T1 transaction begins, that transaction doesn't know, that some other transaction change data. So if now user doing some calculation after that (without DB involved), he is doing it on wrong data? Or I'm not right and first transaction have some mechanisms to know that row was changed?

Let's now change the transaction flow.

T2beg -> T1beg -> T2com -> T1com

In 2PL user get the newer version of data because of locks (T1 must wait before exclusive lock released). But in case of MVCC it still will be the old data, as I understand the postgresql MVCC model. So I can get stale data in exchange for speed. Am I right? Or I miss something?

Thank you



Solution 1:[1]

Yes, it can happen that you read some old data from the database (that a concurrent transaction has modified), perform a calculation based on this and store “outdated” data in the database.

This is not a problem, because rather than the actual order in which transactions happen, the logical order is more relevant:

If T1 reads some data, then T2 modifies the data, then T1 modifies the database based on what it read, you can say that T1 logically took place before T2, and there is no inconsistency.

You only get an anomaly if T1 and T2 modify the same data: T1 reads data, T2 modifies the data, then T1 modifies the same data based on what it read. This anomaly is called a “lost update”.

Lost updates can only occur with the weakest (which is the default) isolation level, READ COMMITTED.

If you want better isolation from concurrent activity, you need to use at least REPEATABLE READ isolation. Then T1 would receive a serialization error when it tries to update the data, and you would have to repeat the transaction. On the second attempt, it would read the new data, and everything will be consistent.

Solution 2:[2]

The flow of transaction is next T1 begin -> T2 begin -> T2 commit -> T1 commit. So the first transaction get it snapshot of database and returns to user result on which he is gonna build other calculation. But as I understand, customer get the old data value?

Unless T1 and T2 try to update the same row, this could usually be reordered to be the same as: T1 begin; T1 commit; T2 begin; T2 commit;

In other words, any undesirable business outcome that could be achieved by T2 changing the data while T1 was making a decision, could have also occurred by T2 changing the data immediately after T1 made the same decision.

Solution 3:[3]

Each transaction can only see data that is younger or equal to that transaction's ID.

When transaction 1 reads data, it marks the read time stamp of that data to transaction 1.

If transaction 2 tries to read the same data, it checks the read timestamp of the data, if the read timestamp for the data is less than transaction 2, then transaction 2 is aborted because 1 < 2 -- 1 got there before us and they must finish before us.

At commit time, we also check if the read timestamp of the data is less than the committing transaction. If it is, we abort the transaction and restart with a new transaction ID.

We also check if write timestamps are less than our transaction. Younger transaction wins.

There is an edge case where a younger transaction can abort if an older transaction gets ahead of the younger transaction.

I have actually implemented MVCC in Java. (see transaction, runner and mvcc code)

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Laurenz Albe
Solution 2 jjanes
Solution 3