numba cuda does not produce correct result with += (gpu reduction needed?)
I am using numba cuda to calculate a function.
The code simply adds up all the values into one result, but numba cuda gives me a different result from numpy.
numba code
```python
import math
def numba_example(number_of_maximum_loop,gs,ts,bs):
    from numba import cuda
    result = cuda.device_array([3,])

    @cuda.jit(device=True)
    def BesselJ0(x):
        return math.sqrt(2/math.pi/x)

    @cuda.jit
    def cuda_kernel(number_of_maximum_loop,result,gs,ts,bs):
        i = cuda.grid(1)
        if i < number_of_maximum_loop:
            result[0] += BesselJ0(i/100+gs)
            result[1] += BesselJ0(i/100+ts)
            result[2] += BesselJ0(i/100+bs)

    # Configure the blocks
    threadsperblock = 128
    blockspergrid = (number_of_maximum_loop + (threadsperblock - 1)) // threadsperblock

    # Start the kernel
    cuda_kernel[blockspergrid, threadsperblock](number_of_maximum_loop,result,gs,ts,bs)
    return result.copy_to_host()

numba_example(1000,20,20,20)
```
output:
array([ 0.17770302, 0.34166728, 0.35132036])
numpy code
```python
import math
def numpy_example(number_of_maximum_loop,gs,ts,bs):
    import numpy as np
    result = np.zeros([3,])

    def BesselJ0(x):
        return math.sqrt(2/math.pi/x)

    for i in range(number_of_maximum_loop):
        result[0] += BesselJ0(i/100+gs)
        result[1] += BesselJ0(i/100+ts)
        result[2] += BesselJ0(i/100+bs)
    return result

numpy_example(1000,20,20,20)
```
output:
array([ 160.40546935, 160.40546935, 160.40546935])
I don't know where I went wrong. I guess I might need a reduction, but it seems impossible to finish it within a single cuda kernel.
Solution 1:[1]
Yes, a proper parallel reduction is needed to sum data from multiple GPU threads to a single variable.
Here's one trivial example of how it could be done from a single kernel:
```
$ cat t23.py
import math
def numba_example(number_of_maximum_loop,gs,ts,bs):
    from numba import cuda
    result = cuda.device_array([3,])

    @cuda.jit(device=True)
    def BesselJ0(x):
        return math.sqrt(2/math.pi/x)

    @cuda.jit
    def cuda_kernel(number_of_maximum_loop,result,gs,ts,bs):
        i = cuda.grid(1)
        if i < number_of_maximum_loop:
            cuda.atomic.add(result, 0, BesselJ0(i/100+gs))
            cuda.atomic.add(result, 1, BesselJ0(i/100+ts))
            cuda.atomic.add(result, 2, BesselJ0(i/100+bs))

    # Configure the blocks
    threadsperblock = 128
    blockspergrid = (number_of_maximum_loop + (threadsperblock - 1)) // threadsperblock

    # Start the kernel
    init = [0.0,0.0,0.0]
    result = cuda.to_device(init)
    cuda_kernel[blockspergrid, threadsperblock](number_of_maximum_loop,result,gs,ts,bs)
    return result.copy_to_host()

print(numba_example(1000,20,20,20))
$ python t23.py
[ 162.04299487  162.04299487  162.04299487]
$
```
You can also do a proper reduction directly in numba with the reduce decorator as described here, although I'm not sure 3 reductions can be done in a single kernel that way.
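For illustration only, here is a minimal sketch of the reduce-decorator route (not part of the original answer): the per-element BesselJ0 values are precomputed on the host with NumPy, and each of the three sums is a separate call to the generated reduction; `sum_reduce` and `reduce_example` are made-up names.

```python
import numpy as np
from numba import cuda

# Numba builds a full CUDA reduction kernel from this binary operator.
@cuda.reduce
def sum_reduce(a, b):
    return a + b

def reduce_example(number_of_maximum_loop, gs, ts, bs):
    i = np.arange(number_of_maximum_loop, dtype=np.float64)
    # Per-element BesselJ0 values, computed on the host for simplicity.
    vals = [np.sqrt(2 / np.pi / (i / 100 + s)) for s in (gs, ts, bs)]
    # One reduce call per component, i.e. three separate reductions.
    return np.array([sum_reduce(v) for v in vals])

print(reduce_example(1000, 20, 20, 20))  # roughly [162.043 162.043 162.043]
```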
Finally, you could write an ordinary cuda parallel reduction using numba cuda as indicated here. I don't think it would be difficult to extend that to performing 3 reductions in a single kernel.
These 3 different methods will likely have performance differences, of course.
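Not from the original answer, but as a rough sketch of the third approach, assuming a power-of-two block size: a block-level shared-memory tree reduction that carries all three sums through one kernel and finishes with a single atomic add per block and per component (`TPB`, `reduce3_kernel`, and `run` are names made up for the sketch).

```python
import math
import numpy as np
from numba import cuda, float64

TPB = 128  # threads per block; must be a power of two for this tree reduction

@cuda.jit(device=True)
def BesselJ0(x):
    return math.sqrt(2 / math.pi / x)

@cuda.jit
def reduce3_kernel(n, result, gs, ts, bs):
    # One shared-memory slot per thread and per component.
    sm = cuda.shared.array((TPB, 3), dtype=float64)
    tid = cuda.threadIdx.x
    i = cuda.grid(1)
    if i < n:
        sm[tid, 0] = BesselJ0(i / 100 + gs)
        sm[tid, 1] = BesselJ0(i / 100 + ts)
        sm[tid, 2] = BesselJ0(i / 100 + bs)
    else:
        sm[tid, 0] = 0.0
        sm[tid, 1] = 0.0
        sm[tid, 2] = 0.0
    cuda.syncthreads()
    # Tree reduction within the block.
    s = TPB // 2
    while s > 0:
        if tid < s:
            sm[tid, 0] += sm[tid + s, 0]
            sm[tid, 1] += sm[tid + s, 1]
            sm[tid, 2] += sm[tid + s, 2]
        cuda.syncthreads()
        s //= 2
    # One atomic add per block and per component instead of one per thread.
    if tid == 0:
        cuda.atomic.add(result, 0, sm[0, 0])
        cuda.atomic.add(result, 1, sm[0, 1])
        cuda.atomic.add(result, 2, sm[0, 2])

def run(n, gs, ts, bs):
    result = cuda.to_device(np.zeros(3))
    blockspergrid = (n + TPB - 1) // TPB
    reduce3_kernel[blockspergrid, TPB](n, result, gs, ts, bs)
    return result.copy_to_host()

print(run(1000, 20, 20, 20))  # roughly [162.043 162.043 162.043]
```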
As an aside, if you're wondering about the results discrepancy between my code above and your python code in the question, I can't explain it. When I run your python code I get results matching the numba cuda code in my answer:
```
$ cat t24.py
import math
def numpy_example(number_of_maximum_loop,gs,ts,bs):
    import numpy as np
    result = np.zeros([3,])

    def BesselJ0(x):
        return math.sqrt(2/math.pi/x)

    for i in range(number_of_maximum_loop):
        result[0] += BesselJ0(i/100+gs)
        result[1] += BesselJ0(i/100+ts)
        result[2] += BesselJ0(i/100+bs)
    return result

print(numpy_example(1000,20,20,20))
$ python t24.py
[ 162.04299487  162.04299487  162.04299487]
$
```
Solution 2:[2]
Performing multiple (or even one) atomic adds per thread will very likely hurt performance. You might consider a tiled approach: sum the contributions within a block using a shared array, or within a warp using a warp shuffle, and then perform the atomic operations only once per warp or block.
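To make the warp-level idea concrete, here is a rough sketch, assuming a Numba version that exposes cuda.shfl_down_sync for 64-bit values and a warp size of 32 (`warp_reduce_kernel` is a made-up name): each warp folds its 32 per-lane values with down-shuffles, and only lane 0 of each warp performs the atomic adds.

```python
import math
import numpy as np
from numba import cuda

@cuda.jit(device=True)
def BesselJ0(x):
    return math.sqrt(2 / math.pi / x)

@cuda.jit
def warp_reduce_kernel(n, result, gs, ts, bs):
    i = cuda.grid(1)
    v0 = 0.0
    v1 = 0.0
    v2 = 0.0
    if i < n:
        v0 = BesselJ0(i / 100 + gs)
        v1 = BesselJ0(i / 100 + ts)
        v2 = BesselJ0(i / 100 + bs)
    # Fold the 32 lane values of each warp with down-shuffles.
    offset = 16
    while offset > 0:
        v0 += cuda.shfl_down_sync(0xffffffff, v0, offset)
        v1 += cuda.shfl_down_sync(0xffffffff, v1, offset)
        v2 += cuda.shfl_down_sync(0xffffffff, v2, offset)
        offset //= 2
    # Only lane 0 of each warp touches global memory atomically.
    if cuda.threadIdx.x % 32 == 0:
        cuda.atomic.add(result, 0, v0)
        cuda.atomic.add(result, 1, v1)
        cuda.atomic.add(result, 2, v2)

n = 1000
result = cuda.to_device(np.zeros(3))
threadsperblock = 128
blockspergrid = (n + threadsperblock - 1) // threadsperblock
warp_reduce_kernel[blockspergrid, threadsperblock](n, result, 20, 20, 20)
print(result.copy_to_host())  # roughly [162.043 162.043 162.043]
```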
By the way, the discrepancy between the numerical values you posted and the ones in Robert Crovella's reply appears to be a Python version issue: in Python 2, / on integers performs floor division, while Python 3 makes / true division and reserves // for floor division. Changing "i/100" to "i//100" in the Crovella version and executing it under Python 3 reproduces the 162.04299487 values, so the two sets of numbers simply correspond to the two division behaviours.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Robert Crovella |
| Solution 2 | dstorti |