Do __shfl_xx_sync() intrinsics with mask need an additional __syncwarp()?

Do __shfl_xx_sync() intrinsics in which only some lanes participate need an additional __syncwarp() instruction, or is setting the mask enough?

I cannot provide a minimal working example: the code is long and confidential, and the error only appeared in certain run/build configurations.

The code looks basically like the following:

if (threadIdx.x >= 30) {                        // mask 0xC0000000 = lanes 30 and 31
    temp.x = __shfl_up_sync(0xC0000000, x, 1);
    temp.y = __shfl_up_sync(0xC0000000, y, 1);
}
// __syncwarp();                                // adding this made debug builds run through
__shfl_up_sync(0xffffffff, w, 1, 32);           // all 32 lanes participate

Release builds worked fine; in debug builds, lanes 30 and 31 waited (according to the debugger and the SASS) at a different sync instruction than the other lanes.

When I introduced the __syncwarp(), the debug builds also ran through. But is the problem now definitely fixed?

In the first two shuffle instructions I use a mask indicating that only lanes 30 and 31 participate. What happens if the scheduler executes lanes 0 to 29 first and they reach the second shuffle instruction (in which all lanes participate)? Those lanes then wait for lanes 30 and 31, which in turn arrive at the upper shuffle instructions. Can the two shuffle instructions be distinguished from each other?

If the __syncwarp() is needed: why would it behave differently than the shuffle instruction with mask 0xffffffff itself?

Because it is a different type of synchronization (a shuffle sync instead of a plain sync)? Or did the program only work by accident?

(The __syncwarp() intrinsic is probably useful here anyway for performance reasons, as the threads converge at that point.)

If __syncwarp() is not enough: how can I make sure the kernel does not hang? Is there a generally recommended approach other than __syncwarp()?

I am running this on a Turing RTX 2060 Mobile, and debugging is done with Visual Studio.



Solution 1:

No, you should not need a __syncwarp() here. CUDA moved from e.g. __shfl_up() to __shfl_up_sync() precisely to avoid this. I think the problem is that you are trying to shuffle up data from a thread that is not participating in the call: lane 30 tries to read data from lane 29, so lane 29 would have to participate.

"Threads may only read data from another thread which is actively participating in the __shfl_sync() command. If the target thread is inactive, the retrieved value is undefined."

from the docs. This explanation is still somewhat unsatisfactory, though, since you seem to get a deadlock rather than an undefined value. But maybe that is intended behavior for a debug build?

That being said, I'm not quite sure how to do this elegantly, because including lane 29 in the conditional and in the mask only shifts the problem to lane 29 trying to read from lane 28. In the examples given in the documentation, the intrinsic is always executed by all threads, and the results are then used conditionally.
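
For the snippet from the question, that pattern would look roughly like this (a sketch; the types of x, y and temp are assumptions, since they are not shown in the question):

float up_x = __shfl_up_sync(0xffffffff, x, 1);  // all 32 lanes participate,
float up_y = __shfl_up_sync(0xffffffff, y, 1);  // so every source lane is active

if (threadIdx.x >= 30) {                        // only these lanes consume the results
    temp.x = up_x;
    temp.y = up_y;
}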

My best guess is that you want lane 29 to participate, but with a delta of 0. I have not found anything saying that delta needs to be the same across threads.
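
Sketched out, that guess would look like this (untested, and it rests on the assumption that a per-thread delta is legal; a __shfl_up_sync() with delta 0 would simply return the calling lane's own value):

if (threadIdx.x >= 29) {                        // 0xE0000000 = lanes 29, 30, 31
    unsigned delta = (threadIdx.x >= 30) ? 1 : 0;
    float r = __shfl_up_sync(0xE0000000, x, delta);
    if (threadIdx.x >= 30)
        temp.x = r;                             // lane 29 only serves as a data source
}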

You might also want to use __ballot_sync() to compute the mask, as can be seen in Listing 3 of this blogpost, to avoid bugs from manually specifying a mask that has to be updated whenever the conditional changes.
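
A short sketch of that approach, using the predicate from the question (the rest is illustrative):

bool pred = (threadIdx.x >= 30);
unsigned mask = __ballot_sync(0xffffffff, pred);  // every lane votes, so the mask
                                                  // matches the branch exactly
if (pred) {
    temp.x = __shfl_up_sync(mask, x, 1);
}

Note that this only keeps the mask consistent with the conditional; lane 30 still tries to read from the non-participating lane 29, so by itself it does not fix the original problem.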
