Visible order of operations with acquire/release fence in C++

I have the following program, which uses std::atomic_thread_fence:

#include <atomic>
#include <iostream>
#include <thread>

int data1 = 0;
std::atomic<int> data2 = 0;
std::atomic<int> state;

int main() {
    state.store(0);
    data1 = 0;
    data2 = 0;

    std::thread t1([&]{
        data1 = 1;
        state.store(1, std::memory_order_release);
    });

    std::thread t2([&]{
        auto s = state.load(std::memory_order_relaxed);
        if (s != 1) return;

        std::atomic_thread_fence(std::memory_order_acquire);

        data2.store(data1, std::memory_order_relaxed);

        std::atomic_thread_fence(std::memory_order_release);
        state.store(2, std::memory_order_relaxed);
    });

    std::thread t3([&]{
        auto d = data2.load(std::memory_order_relaxed);
    
        std::atomic_thread_fence(std::memory_order_acquire);

        if (state.load(std::memory_order_relaxed) == 0) {
            std::cout << d;
        }
    });

    t1.join();
    t2.join();
    t3.join();
}

It consists of three threads and one global atomic variable state used for synchronization. The first thread writes to a global non-atomic variable data1 and sets state to 1. The second thread reads state and, if it is equal to 1, assigns data1 to another global non-atomic variable data2; after that, it stores 2 into state. The third thread reads data2 and then checks state.

Q: Will the third thread always print 0? Or is it possible for the third thread to see the update to data2 before the update to state? If so, is using seq_cst memory order the only way to guarantee the ordering?



Solution 1:[1]

I think that t3 can print 1.

I believe the basic issue is that the release fence in t2 is misplaced. It is supposed to be sequenced before the store that is to be "upgraded" to release, so that all earlier loads and stores become visible before the later store does. Here, it has the effect of "upgrading" the state.store(2). But that is not helpful because nobody is trying to use the condition state.load() == 2 to order anything. So the release fence in t2 doesn't synchronize with the acquire fence in t3. Therefore you do not get any stores to happen-before any of the loads in t3, so you get no assurance at all about what values they might return.

The fence really ought to go before data2.store(data1), and then it should work. You would be assured that anyone who observes that store will thereafter observe all prior stores. That would include t1's state.store(1) which is ordered earlier because of the release/acquire pair between t1 and t2.

So if you change t2 to

        auto s = state.load(std::memory_order_relaxed);
        if (s != 1) return;

        std::atomic_thread_fence(std::memory_order_acquire);

        std::atomic_thread_fence(std::memory_order_release); // moved
        data2.store(data1, std::memory_order_relaxed);

        state.store(2, std::memory_order_relaxed);  // irrelevant

then whenever data2.load() in t3 returns 1, the release fence in t2 synchronizes with the acquire fence in t3 (see C++20 atomics.fences p2). The t2 store to data2 only happened if the t2 load of state returned 1, which would ensure that the release store in t1 synchronizes with the acquire fence in t2 (atomics.fences p4). We then have

t1 state.store(1)
    synchronizes with
t2 acquire fence
    sequenced before
t2 release fence
    synchronizes with
t3 acquire fence
    sequenced before
t3 state.load()

so that state.store(1) happens before state.load(), and thus state.load() cannot return 0 in this case. This would ensure the desired ordering without requiring seq_cst.


To imagine how the original code could actually fail, think about something like POWER, where certain sets of cores get special early access to snoop stores from each others' store buffers, before they hit L1 cache and become globally visible. Then an acquire barrier just has to wait until all earlier loads are complete; while a release barrier should drain not only its own store buffer, but also all other store buffers that it has access to.

So suppose core1 and core2 are such a special pair, but core3 is further away and only gets to see stores after they are written to L1 cache. We could have:

core1                  core2          L1 cache     core3
=====                  =====          ========     =====
data1 <- 1
release                               data1 <- 1
state <- 1
(still in store buffer)
                      1 <- state
                      acquire
                      1 <- data1
                      data2 <- 1      data2 <- 1
                                                  1 <- data2
                                                  acquire
                                                  0 <- state
                      release         state <- 1
                      state <- 2      state <- 2

The release barrier in core 2 does cause the store buffer of core 1 to drain and thus write state <- 1 to L1 cache, but by then it is too late.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Nate Eldredge