'Changing one part of code affects the other part when compiling with use_fast_math flag

I have the following kernel:

__global__ void kernel()
{
  float loc{269.0f};
  float s{(356.0f - loc) / 13.05f};

  float a{pow(1.0f - 0.15f * s, 1.0f)};
  float b{pow(1.0f - 0.15f * (356.0f - loc) / 13.05f, 1.0f)};

  printf("%f\n", a);
  printf("%f\n", b);
}

It prints

0.000000
0.000000

But if I change how b is calculated:

__global__ void kernel()
{
  float loc{269.0f};
  float s{(356.0f - loc) / 13.05f};

  float a{pow(1.0f - 0.15f * s, 1.0f)};
  float b{pow(1.0f - 0.15f * ((356.0f - loc) / 13.05f), 1.0f)}; // notice the added braces

  printf("%f\n", a);
  printf("%f\n", b);
}

it prints:

nan
nan

Why does adding the braces for b change a as well? Why do the braces have an effect at all? Looking at the code on godbolt.org I see that the generated assembly is different, but I don't have enough knowledge to understand what exactly causes this behavior.

This is my project configuration:

set_target_properties(test PROPERTIES
    CXX_STANDARD 14
    CXX_STANDARD_REQUIRED YES
    CXX_EXTENSIONS NO
    CUDA_STANDARD 14
    CUDA_STANDARD_REQUIRED YES
    CUDA_EXTENSIONS NO
    CUDA_SEPARABLE_COMPILATION ON
    CUDA_ARCHITECTURES "61"
)

set(CUDA_FLAGS --use_fast_math)
set(CXX_FLAGS -O0)

Please notice the flag --use_fast_math - without it, everything works fine. My GPU is Quadro P1000. Cuda compilation tools, release 11.2, V11.2.152.



Solution 1:[1]

Your code is not numerically stable: it contains a catastrophic cancellation. Indeed, 1.0f - 0.15f * s and 1.0f - 0.15f * ((356.0f - loc) / 13.05f) are nearly equal to 0. It can be equal to few unit in the last place (ULP). Regarding the rounding, the value can be positive, negative or 0. When the base value is negative, the result of pow is undefined. The thing is the rounding is dependent of the compiler heuristic since you explicitly enabled fast-math that disable some IEEE-754 rules like taking care of the floating-point associativity and the possible presence of NaN values. With fast-math, the compiler is free to use approximations like a reciprocal (less accurate) instead of the basic division. In fact it actually do that in your case. Consequently, results are undefined and the compiler is free to set the output to 0 or NaN regarding its heuristic. I think that having a pow exponent set to 1.0f (which is well represented) does not save you from this behaviour with pow (note you should use powf for float values and not pow which is for double values.

The proper way to solve this issue is to use a stable numerical computation that do not contains any catastrophic cancellation like in your code. The actual solution is very dependent of the overall computational method. Sometime, one need to use a completely different algorithm (eg. QR factorization instead of LU factorization, 2 step variance computation instead of a 1 step one).

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Jérôme Richard