numba cuda does not produce the correct result with += (GPU reduction needed?)

I am using numba cuda to evaluate a function.

The code simply adds up all the values into one result, but numba cuda gives me a different result from numpy.

numba code

import math
def numba_example(number_of_maximum_loop,gs,ts,bs):
    from numba import cuda
    result = cuda.device_array([3,])

    @cuda.jit(device=True)
    def BesselJ0(x):
        return math.sqrt(2/math.pi/x)

    @cuda.jit
    def cuda_kernel(number_of_maximum_loop,result,gs,ts,bs):
        i = cuda.grid(1)
        if i < number_of_maximum_loop:
            result[0] += BesselJ0(i/100+gs)
            result[1] += BesselJ0(i/100+ts)
            result[2] += BesselJ0(i/100+bs)

    # Configure the blocks
    threadsperblock = 128
    blockspergrid = (number_of_maximum_loop + (threadsperblock - 1)) // threadsperblock

    # Start the kernel
    cuda_kernel[blockspergrid, threadsperblock](number_of_maximum_loop,result,gs,ts,bs)

    return result.copy_to_host()

numba_example(1000,20,20,20) 

output:

array([ 0.17770302,  0.34166728,  0.35132036])

numpy code

import math
def numpy_example(number_of_maximum_loop,gs,ts,bs):
    import numpy as np
    result = np.zeros([3,])

    def BesselJ0(x):
        return math.sqrt(2/math.pi/x)

    for i in range(number_of_maximum_loop):
        result[0] += BesselJ0(i/100+gs)
        result[1] += BesselJ0(i/100+ts)
        result[2] += BesselJ0(i/100+bs)

    return result

numpy_example(1000,20,20,20) 

output:

array([ 160.40546935,  160.40546935,  160.40546935])

I don't know where I am going wrong. I guess I need a reduction, but it seems impossible to do it in a single CUDA kernel.



Solution 1:[1]

Yes, a proper parallel reduction is needed to sum data from multiple GPU threads to a single variable.

Here's one trivial example of how it could be done from a single kernel:

$ cat t23.py
import math
def numba_example(number_of_maximum_loop,gs,ts,bs):
    from numba import cuda
    result = cuda.device_array([3,])

    @cuda.jit(device=True)
    def BesselJ0(x):
        return math.sqrt(2/math.pi/x)

    @cuda.jit
    def cuda_kernel(number_of_maximum_loop,result,gs,ts,bs):
        i = cuda.grid(1)
        if i < number_of_maximum_loop:
            cuda.atomic.add(result, 0, BesselJ0(i/100+gs))
            cuda.atomic.add(result, 1, BesselJ0(i/100+ts))
            cuda.atomic.add(result, 2, BesselJ0(i/100+bs))

    # Configure the blocks
    threadsperblock = 128
    blockspergrid = (number_of_maximum_loop + (threadsperblock - 1)) // threadsperblock

    # Start the kernel
    init = [0.0,0.0,0.0]
    result = cuda.to_device(init)
    cuda_kernel[blockspergrid, threadsperblock](number_of_maximum_loop,result,gs,ts,bs)

    return result.copy_to_host()

print(numba_example(1000,20,20,20))
$ python t23.py
[ 162.04299487  162.04299487  162.04299487]
$

You can also do a proper reduction in numba directly with the reduce decorator as described here, although I'm not sure 3 reductions can be done in a single kernel that way.
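Not code from the original answer, just a minimal sketch of what the @cuda.reduce route could look like: materialize the per-element values with an elementwise kernel, then run one built-in reduction launch per output (consistent with the caveat above that this is not a single kernel). The helper names fill_values and reduce_example are made up for the example.

import math
import numpy as np
from numba import cuda

@cuda.reduce
def sum_reduce(a, b):
    # binary associative op used by numba's built-in reduction
    return a + b

@cuda.jit
def fill_values(values, offset):
    # elementwise: values[i] = BesselJ0(i/100 + offset)
    i = cuda.grid(1)
    if i < values.shape[0]:
        values[i] = math.sqrt(2/math.pi/(i/100+offset))

def reduce_example(number_of_maximum_loop, gs, ts, bs):
    out = np.zeros(3)
    threadsperblock = 128
    blockspergrid = (number_of_maximum_loop + (threadsperblock - 1)) // threadsperblock
    for k, offset in enumerate((gs, ts, bs)):
        vals = cuda.device_array(number_of_maximum_loop, dtype=np.float64)
        fill_values[blockspergrid, threadsperblock](vals, offset)
        out[k] = sum_reduce(vals)   # one reduction launch per output
    return out

print(reduce_example(1000, 20, 20, 20))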

Finally, you could write an ordinary cuda parallel reduction using numba cuda as indicated here. It should not be difficult, I think, to extend that to performing 3 reductions in a single kernel.
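Again not the code the answer points to, just one possible sketch of a hand-written two-stage version under the same launch configuration: each block tree-reduces its three values in shared memory and writes one partial sum per output, and the handful of per-block partials are then summed on the host. The names partial_sums and reduce3 are invented here.

import math
import numpy as np
from numba import cuda, float64

TPB = 128  # threads per block; a power of two, as required by the tree reduction below

@cuda.jit
def partial_sums(n, partials, gs, ts, bs):
    # one shared-memory tile per block, one column per output
    sm = cuda.shared.array(shape=(TPB, 3), dtype=float64)
    tid = cuda.threadIdx.x
    i = cuda.grid(1)

    if i < n:
        sm[tid, 0] = math.sqrt(2/math.pi/(i/100+gs))
        sm[tid, 1] = math.sqrt(2/math.pi/(i/100+ts))
        sm[tid, 2] = math.sqrt(2/math.pi/(i/100+bs))
    else:
        sm[tid, 0] = 0.0
        sm[tid, 1] = 0.0
        sm[tid, 2] = 0.0
    cuda.syncthreads()

    # standard tree reduction within the block
    s = TPB // 2
    while s > 0:
        if tid < s:
            sm[tid, 0] += sm[tid + s, 0]
            sm[tid, 1] += sm[tid + s, 1]
            sm[tid, 2] += sm[tid + s, 2]
        cuda.syncthreads()
        s //= 2

    # thread 0 writes one partial sum per output for this block
    if tid == 0:
        partials[cuda.blockIdx.x, 0] = sm[0, 0]
        partials[cuda.blockIdx.x, 1] = sm[0, 1]
        partials[cuda.blockIdx.x, 2] = sm[0, 2]

def reduce3(n, gs, ts, bs):
    blocks = (n + TPB - 1) // TPB
    partials = cuda.device_array((blocks, 3), dtype=np.float64)
    partial_sums[blocks, TPB](n, partials, gs, ts, bs)
    # second stage: only a few per-block partials remain, so sum them on the host
    return partials.copy_to_host().sum(axis=0)

print(reduce3(1000, 20, 20, 20))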

These 3 different methods will likely have performance differences, of course.

As an aside, if you're wondering about the results discrepancy between my code above and your python code in the question, I can't explain it. When I run your python code I get results matching the numba cuda code in my answer:

$ cat t24.py
import math
def numpy_example(number_of_maximum_loop,gs,ts,bs):
    import numpy as np
    result = np.zeros([3,])

    def BesselJ0(x):
        return math.sqrt(2/math.pi/x)

    for i in range(number_of_maximum_loop):
        result[0] += BesselJ0(i/100+gs)
        result[1] += BesselJ0(i/100+ts)
        result[2] += BesselJ0(i/100+bs)

    return result

print(numpy_example(1000,20,20,20))
$ python t24.py
[ 162.04299487  162.04299487  162.04299487]
$

Solution 2:[2]

Performing multiple (or even one) atomic adds per thread will very likely produce poor performance. You might consider a tiled approach: sum the contributions within a block using a shared array (or within a warp using a warp shuffle), and then do the atomic operations only once per warp or block, as sketched below.
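To make that concrete, here is one possible sketch of the warp-level variant, assuming a numba version that provides cuda.shfl_down_sync (the names warp_reduce_kernel and warp_tiled_example are made up, not from either answer): each warp folds its 32 contributions with shuffles and only lane 0 issues the atomic adds, cutting the number of atomics by a factor of 32.

import math
import numpy as np
from numba import cuda

@cuda.jit
def warp_reduce_kernel(n, result, gs, ts, bs):
    i = cuda.grid(1)
    lane = cuda.threadIdx.x % 32

    v0 = 0.0
    v1 = 0.0
    v2 = 0.0
    if i < n:
        v0 = math.sqrt(2/math.pi/(i/100+gs))
        v1 = math.sqrt(2/math.pi/(i/100+ts))
        v2 = math.sqrt(2/math.pi/(i/100+bs))

    # warp shuffle reduction: each warp folds its 32 values into lane 0
    offset = 16
    while offset > 0:
        v0 += cuda.shfl_down_sync(0xffffffff, v0, offset)
        v1 += cuda.shfl_down_sync(0xffffffff, v1, offset)
        v2 += cuda.shfl_down_sync(0xffffffff, v2, offset)
        offset //= 2

    # one atomic add per warp per output instead of one per thread
    if lane == 0:
        cuda.atomic.add(result, 0, v0)
        cuda.atomic.add(result, 1, v1)
        cuda.atomic.add(result, 2, v2)

def warp_tiled_example(n, gs, ts, bs):
    result = cuda.to_device(np.zeros(3))
    threadsperblock = 128   # multiple of 32, so every warp is fully populated
    blockspergrid = (n + threadsperblock - 1) // threadsperblock
    warp_reduce_kernel[blockspergrid, threadsperblock](n, result, gs, ts, bs)
    return result.copy_to_host()

print(warp_tiled_example(1000, 20, 20, 20))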

By the way, the discrepancy between the numerical values you posted and the ones in Robert Crovella's reply appears to be a Python version issue: on integers, / is floor division in Python 2 but true division in Python 3. Changing "i/100" to "i//100" in the Crovella version and executing it in Python 3 reproduces your values.
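A quick illustration of the division difference in question:

i = 150
print(i / 100)    # Python 3: 1.5 (true division); Python 2 with int operands: 1
print(i // 100)   # 1 in both versions (floor division)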

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution 1: Robert Crovella
Solution 2: dstorti