Changing one part of the code affects another part when compiling with the use_fast_math flag
I have the following kernel:
__global__ void kernel()
{
    float loc{269.0f};
    float s{(356.0f - loc) / 13.05f};
    float a{pow(1.0f - 0.15f * s, 1.0f)};
    float b{pow(1.0f - 0.15f * (356.0f - loc) / 13.05f, 1.0f)};
    printf("%f\n", a);
    printf("%f\n", b);
}
It prints
0.000000
0.000000
But if I change how b is calculated:
__global__ void kernel()
{
    float loc{269.0f};
    float s{(356.0f - loc) / 13.05f};
    float a{pow(1.0f - 0.15f * s, 1.0f)};
    float b{pow(1.0f - 0.15f * ((356.0f - loc) / 13.05f), 1.0f)}; // notice the added parentheses
    printf("%f\n", a);
    printf("%f\n", b);
}
it prints:
nan
nan
Why does adding the parentheses for b change a as well? Why do the parentheses have an effect at all? Looking at the code on godbolt.org, I see that the generated assembly is different, but I don't have enough knowledge to understand what exactly causes this behavior.
This is my project configuration:
set_target_properties(test PROPERTIES
    CXX_STANDARD 14
    CXX_STANDARD_REQUIRED YES
    CXX_EXTENSIONS NO
    CUDA_STANDARD 14
    CUDA_STANDARD_REQUIRED YES
    CUDA_EXTENSIONS NO
    CUDA_SEPARABLE_COMPILATION ON
    CUDA_ARCHITECTURES "61"
)
set(CUDA_FLAGS --use_fast_math)
set(CXX_FLAGS -O0)
Please note the --use_fast_math flag - without it, everything works fine.
My GPU is Quadro P1000.
Cuda compilation tools, release 11.2, V11.2.152.
Solution 1:[1]
Your code is not numerically stable: it contains a catastrophic cancellation. Indeed, 1.0f - 0.15f * s and 1.0f - 0.15f * ((356.0f - loc) / 13.05f) are both nearly equal to 0. The computed value can be off by a few units in the last place (ULP), so depending on the rounding it can come out positive, negative, or 0. When the base value is negative, the result of pow is undefined. The rounding here depends on compiler heuristics, because you explicitly enabled fast-math, which relaxes some IEEE-754 rules such as respecting floating-point associativity and accounting for possible NaN values.
With fast-math, the compiler is free to use approximations, for example a (less accurate) reciprocal instead of a true division, and it actually does that in your case. Consequently, the results are undefined and the compiler is free to produce 0 or NaN depending on its heuristics. I think that having a pow exponent of 1.0f (which is exactly representable) does not save you from this behaviour. (Note that you should use powf for float values, not pow, which is for double values.)
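As an illustration of the workaround side (not something the original code does), here is a minimal self-contained sketch: it prints the nearly cancelled base, then clamps it with fmaxf before calling powf. The kernel name and the main wrapper are made up for the example, and loc is passed as a run-time argument so the compiler cannot constant-fold the whole expression; the clamp is only a patch and assumes the mathematically exact base can never be truly negative.
#include <cstdio>

__global__ void kernel_guarded(float loc)
{
    float s = (356.0f - loc) / 13.05f;
    // Mathematically the base is 0 for loc == 269; with --use_fast_math the
    // approximate division may round it to a tiny value of either sign.
    float base = 1.0f - 0.15f * s;
    printf("base = %e\n", base);
    // Clamp away a spurious negative sign before powf. This only helps when
    // the exact base cannot be negative; it does not remove the cancellation.
    float safe = fmaxf(base, 0.0f);
    printf("powf(safe, 1.0f) = %f\n", powf(safe, 1.0f));
}

int main()
{
    kernel_guarded<<<1, 1>>>(269.0f);
    cudaDeviceSynchronize();
    return 0;
}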
The proper way to solve this issue is to use a numerically stable computation that does not contain a catastrophic cancellation like the one in your code. The actual fix is highly dependent on the overall computational method; sometimes one needs a completely different algorithm (e.g. a QR factorization instead of an LU factorization, or a two-pass variance computation instead of a one-pass one).
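For the specific constants in the question, one cancellation-free rearrangement could look as follows (a sketch, not part of the original answer): since 13.05 / 0.15 = 87 and 356 - 87 = 269, the expression 1 - 0.15 * (356 - loc) / 13.05 is algebraically equal to 0.15 * (loc - 269) / 13.05, which no longer subtracts two values that are both close to 1.
#include <cstdio>

// Algebraic identity used below (exact in real arithmetic):
//   1 - 0.15*(356 - loc)/13.05  ==  0.15*(loc - 269)/13.05
// The right-hand form subtracts loc and 269 directly, which is exact for
// loc anywhere near 269, so the base no longer depends on rounding luck.
__global__ void kernel_stable(float loc)
{
    float base = 0.15f * (loc - 269.0f) / 13.05f;  // exactly 0 for loc == 269.0f
    printf("base = %e, powf(base, 1.0f) = %f\n", base, powf(base, 1.0f));
}

int main()
{
    kernel_stable<<<1, 1>>>(269.0f);
    cudaDeviceSynchronize();
    return 0;
}
For loc values near 269 the subtraction loc - 269.0f is computed exactly (Sterbenz lemma), so the rewritten base carries no cancellation error even with --use_fast_math; whether such a rearrangement is acceptable depends, as noted above, on the surrounding computation.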
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | Jérôme Richard |