Numba support for CUDA cooperative block synchronization? Python Numba CUDA grid sync

Numba CUDA has syncthreads() to sync all threads within a block. How can I sync all the blocks in a grid without exiting the current kernel?

In C CUDA there's the cooperative groups API to handle this case. I can't find anything like it in the Numba docs.

Why this matters a lot!

This sort of thing comes up in reductions: each block computes a partial result, and then you want, say, the maximum over all the blocks.

Trivially, one could push these into the stream as two separate kernel calls. That guarantees all the per-block computations have finished before the reduction call runs.

But if those two operations are lightweight, the execution time is dominated by kernel launch overhead, not by the operations themselves. And if they sit inside a Python loop, the loop could easily run 1000 times faster if the loop and the two kernel calls could be fused into one kernel:

for u in range(100000):
    Amax = CudaFindArrayMaximum(A)
    CudaDivideArray(A, Amax)
    CudaDoSomethingWithMatrix(A)

Since each of the three lines in the loop is a fast kernel, I'd like to put them, and the loop itself, into one single kernel.

But I can't think of any way to do that without syncing across all the blocks in the grid. Indeed, even the very first step of finding the maximum is tricky in itself for the same reason.



Solution 1:[1]

In CUDA, without the use of cooperative groups, there is no safe or reliable mechanism to do a grid-wide sync (other than using the kernel launch boundary). In fact, providing this capability was one of the motivations behind the introduction of cooperative groups.
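For reference, here is a minimal sketch of that launch-boundary approach in Numba. The kernel names are illustrative, and it assumes non-negative input so a zero-initialized scratch cell is a valid starting maximum:

from numba import cuda
import numpy as np

@cuda.jit
def grid_max(a, out):
    # Each thread folds its strided slice into out[0] with an atomic max
    start = cuda.grid(1)
    stride = cuda.gridsize(1)
    for i in range(start, a.size, stride):
        cuda.atomic.max(out, 0, a[i])

@cuda.jit
def divide_by(a, out):
    start = cuda.grid(1)
    stride = cuda.gridsize(1)
    for i in range(start, a.size, stride):
        a[i] /= out[0]

a = cuda.to_device(np.random.rand(1 << 20).astype(np.float32))
m = cuda.to_device(np.zeros(1, dtype=np.float32))
grid_max[128, 256](a, m)   # launch boundary acts as the grid-wide sync
divide_by[128, 256](a, m)  # never starts before grid_max has fully finished

Launches on the same stream execute in order, so this is safe; the cost, as the question points out, is one launch overhead per kernel per loop iteration.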

At the time this answer was originally written, Numba did not expose cooperative groups functionality, so there was no safe or reliable way to achieve this within Numba's capabilities.

Update: Numba now offers this feature; see Solution 2 below.

Refer to this question for an example of a possible hazard in trying to do this in CUDA without cooperative groups.

Solution 2:[2]

Numba 0.53.1 adds cooperative groups support, so you can sync the entire grid by doing:

g = cuda.cg.this_grid()
g.sync()
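Putting this together with the question's example, here is a minimal sketch of a fused kernel. The names (normalize_by_max, scratch) are illustrative, and it assumes Numba 0.53.1+, a device that supports cooperative launches, and non-negative input so a zero-initialized scratch cell is a valid starting maximum:

from numba import cuda
import numpy as np

@cuda.jit
def normalize_by_max(a, scratch):
    g = cuda.cg.this_grid()
    start = cuda.grid(1)
    stride = cuda.gridsize(1)
    # Phase 1: every block folds its strided slice into one global maximum
    for i in range(start, a.size, stride):
        cuda.atomic.max(scratch, 0, a[i])
    # Grid-wide barrier: no thread continues until all blocks are done
    g.sync()
    # Phase 2: the finished maximum is now safe to read everywhere
    for i in range(start, a.size, stride):
        a[i] /= scratch[0]

a = cuda.to_device(np.random.rand(1 << 20).astype(np.float32))
scratch = cuda.to_device(np.zeros(1, dtype=np.float32))
# A cooperative launch requires the whole grid to be resident at once,
# so keep the block count modest (the exact limit is device-dependent).
normalize_by_max[64, 256](a, scratch)

The question's Python loop can then move inside the kernel, with a g.sync() between phases (and a reset of the scratch cell each iteration), so the whole computation costs a single kernel launch.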

Note that cooperative groups also require the CUDA Device Runtime library, cudadevrt, to be available. For CUDA toolkit packages installed from the conda defaults channel, it is only included from version 10.2 onwards. System-installed toolkits (e.g. from NVIDIA distribution packages or runfiles) all include cudadevrt.

For more details, see here.

Solution 3:[3]

You can communicate between blocks through global memory. Moreover, GPUs read and write their memory in aligned chunks of e.g. 64 bytes, so if each block writes only to one or more such aligned chunks of its own, you will not get conflicts. It is not synchronization, but at least...
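As a sketch of that idea (illustrative code, not from the original answer): each block reduces its slice of the array in shared memory and writes exactly one element of a per-block results array, so no two blocks ever write to the same location:

from numba import cuda, float32
import numpy as np

TPB = 256  # threads per block (power of two for the tree reduction)

@cuda.jit
def per_block_max(a, block_results):
    s = cuda.shared.array(TPB, dtype=float32)
    tid = cuda.threadIdx.x
    i = cuda.grid(1)
    # Load; pad out-of-range threads with an existing element
    s[tid] = a[i] if i < a.size else a[0]
    cuda.syncthreads()
    # Tree reduction within the block
    half = TPB // 2
    while half > 0:
        if tid < half and s[tid + half] > s[tid]:
            s[tid] = s[tid + half]
        cuda.syncthreads()
        half //= 2
    if tid == 0:
        # One slot per block: no write conflicts between blocks
        block_results[cuda.blockIdx.x] = s[0]

a = cuda.to_device(np.random.rand(1 << 20).astype(np.float32))
nblocks = (a.size + TPB - 1) // TPB
partial = cuda.device_array(nblocks, dtype=np.float32)
per_block_max[nblocks, TPB](a, partial)
amax = partial.copy_to_host().max()  # final reduction on the host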

But there is another problem. You can have 10000 blocks while only, say, 30 of them are actually running at any moment :) . A new block is only started once a running one finishes and is retired. Even so, you can in principle organize the work so that the first iteration is handled by blocks 1-1000, the next by blocks 1001-1500, the third by blocks 1501-1750, and so on, with each block checking that its input data are ready and, in the worst case, waiting in a dummy loop...

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution 1:
Solution 2: talonmies
Solution 3: Mikhail M