Does unrolling a loop affect the accuracy of the computations within?

Summarized question Does unrolling a loop affect the accuracy of the computations performed within the loop? And if so, why?

Elaboration and background I am writing a compute shader in HLSL for use in a Unity project (2021.2.9f1). Parts of my code include numerical procedures and highly oscillatory functions, meaning that high computational accuracy is essential.

When comparing my results with an equivalent procedure in Python, I noticed deviations on the order of 1e-5. This was concerning, as I did not expect such large errors to result from precision differences, e.g., the float precision of trigonometric or power functions in HLSL.

Ultimately, after much debugging, I now believe the choice of unrolling or not unrolling a loop to be the cause of the deviation. However, I find this strange, as I cannot seem to find any sources indicating that unrolling a loop affects accuracy in addition to the "space–time tradeoff".

For clarification, if I take my Python results as the correct solution, unrolling the loop in HLSL gives me better results than not unrolling it.

Minimal working example Below is an MWE consisting of a C# script for Unity, the corresponding compute shader where the computations are performed, and a screenshot of my console when running in Unity (2021.2.9f1). Forgive the somewhat messy implementation of Newton's method, but I chose to keep it since I believe it might be a cause of this deviation. That is, if simply computing cos(x), there is no difference between the unrolled and non-unrolled versions. Nonetheless, I still fail to understand how the simple addition of [unroll(N)] in the testing kernel changes the result...

// C# for Unity
using UnityEngine;

public class UnrollTest : MonoBehaviour
{

    [SerializeField] ComputeShader CS;
    ComputeBuffer CBUnrolled, CBNotUnrolled;
    readonly int N = 3;

    private void Start()
    {

        CBUnrolled = new ComputeBuffer(N, sizeof(double));
        CBNotUnrolled = new ComputeBuffer(N, sizeof(double));

        CS.SetBuffer(0, "_CBUnrolled", CBUnrolled);
        CS.SetBuffer(0, "_CBNotUnrolled", CBNotUnrolled);

        CS.Dispatch(0, (int)((N + (64 - 1)) / 64), 1, 1);

        double[] ansUnrolled = new double[N];
        double[] ansNotUnrolled = new double[N];

        CBUnrolled.GetData(ansUnrolled);
        CBNotUnrolled.GetData(ansNotUnrolled);

        for (int i = 0; i < N; i++)
        {
            Debug.Log("Unrolled ans = " + ansUnrolled[i] + 
                "  -  Not Unrolled ans = " + ansNotUnrolled[i] +  
                "  --  Difference is: " + (ansUnrolled[i] - ansNotUnrolled[i]));
        }
        CBUnrolled.Release();
        CBNotUnrolled.Release();
    }
}

#pragma kernel CSMain

RWStructuredBuffer<double> _CBUnrolled, _CBNotUnrolled;

// Dummy function for Newtons method
double fDummy(double k, double fnh, double h, double theta)
{
    return fnh * fnh * k * h * cos(theta) * cos(theta) - (double) tanh(k * h);
}

// Derivative of Dummy function above using a central finite difference scheme.
double dfDummy(double k, double fnh, double h, double theta)
{
    return (fDummy(k + (double) 1e-3, fnh, h, theta) - fDummy(k - (double) 1e-3, fnh, h, theta)) / (double) 2e-3;
}

// Function to solve.
double f(double fnh, double h, double theta)
{
    // Solved using Newton's method.
    int max_iter = 50;
    double epsilon = 1e-8;
    double fxn, dfxn;

    // Define initial guess for k, hereby denoted as x.
    double xn = 10.0;

    for (int n = 0; n < max_iter; n++)
    {
        fxn = fDummy(xn, fnh, h, theta);
        
        if (abs(fxn) < epsilon)     // A solution is found.
            return xn;
        
        dfxn = dfDummy(xn, fnh, h, theta);

        if (dfxn == 0.0)    // No solution found.
            return xn;

        xn = xn - fxn / dfxn;
    }

    // No solution found.
    return xn;
}

[numthreads(64,1,1)]
void CSMain(uint3 threadID : SV_DispatchThreadID)
{
    int N = 3;
    
    // ---------------
    double fnh = 0.9, h = 4.53052, theta = -0.161, dtheta = 0.01;   // Example values.
   
    for (int i = 0; i < N; i++)                 // Not being unrolled
    {   
        _CBNotUnrolled[i] = f(fnh, h, theta);
        theta += dtheta;
    }
    
    // ---------------
    fnh = 0.9, h = 4.53052, theta = -0.161, dtheta = 0.01;          // Example values.

    [unroll(N)] for (int j = 0; j < N; j++)     // Being unrolled.
    {
        _CBUnrolled[j] = f(fnh, h, theta);
        theta += dtheta;
    }
}

Image of Unity console when running the above
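
For comparison, a double-precision reference for the shader above might look like the following Python sketch. The asker's actual Python code is not shown in the question, so this is only an assumed port: the names f_dummy, df_dummy and solve are illustrative, and the operation order simply mirrors the HLSL source.

```python
import math

def f_dummy(k, fnh, h, theta):
    # Same expression as fDummy in the shader, evaluated in 64-bit floats
    # and multiplied left to right as in the HLSL source.
    c = math.cos(theta)
    return fnh * fnh * k * h * c * c - math.tanh(k * h)

def df_dummy(k, fnh, h, theta, step=1e-3):
    # Central finite difference, mirroring dfDummy.
    return (f_dummy(k + step, fnh, h, theta)
            - f_dummy(k - step, fnh, h, theta)) / (2.0 * step)

def solve(fnh, h, theta, x0=10.0, max_iter=50, epsilon=1e-8):
    # Newton's method with the same stopping rules as f() in the shader.
    xn = x0
    for _ in range(max_iter):
        fxn = f_dummy(xn, fnh, h, theta)
        if abs(fxn) < epsilon:      # a solution is found
            return xn
        dfxn = df_dummy(xn, fnh, h, theta)
        if dfxn == 0.0:             # cannot take another Newton step
            return xn
        xn -= fxn / dfxn
    return xn                       # no solution within max_iter

# Same example values as the kernel; theta is accumulated as in the loops.
theta, dtheta = -0.161, 0.01
thetas, roots = [], []
for _ in range(3):
    thetas.append(theta)
    roots.append(solve(0.9, 4.53052, theta))
    theta += dtheta

for t, r in zip(thetas, roots):
    print(f"theta = {t:+.3f}  ->  k = {r!r}")
```

Even with identical formulas, the last digits printed here need not match either HLSL variant, since the library implementations of cos and tanh and the order of roundings can differ between platforms.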

Edit After some more testing, the deviation has been narrowed down to the following code, giving a difference of about 1e-17 between the exact same code unrolled vs not unrolled. Despite the small difference, I still consider it a valid example of the issue, as I believe they should be equal.

[numthreads(64, 1, 1)]
void CSMain(uint3 threadID : SV_DispatchThreadID)
{
    if ((int) threadID.x != 1)
        return;
    
    int N = 3;
    double k = 1.0;
    
    // ---------------
    double fnh = 0.9, h = 4.53052, theta = -0.161, dtheta = 0.01; // Example values.
 
    for (int i = 0; i < N; i++)                 // Not being unrolled
    {
        _CBNotUnrolled[i] = (k + (double) 1e-3) * theta - (k - (double) 1e-3) * theta;
        theta += dtheta;
    }
   
    // ---------------
    fnh = 0.9, h = 4.53052, theta = -0.161, dtheta = 0.01; // Example values.
 
    [unroll(N)]
    for (int j = 0; j < N; j++)     // Being unrolled.
    {
        _CBUnrolled[j] = (k + (double) 1e-3) * theta - (k - (double) 1e-3) * theta;
        theta += dtheta;
    }
}

Image of Unity console when running the edited script above

Edit 2 The following is the compiled code for the kernel given in Edit 1. Unfortunately, my experience with assembly language is limited, and I am not capable of spotting if this script shows any errors, or if it is useful to the problem at hand.

**** Platform Direct3D 11:
Compiled code for kernel CSMain
keywords: <none>
binary blob size 648:
//
// Generated by Microsoft (R) D3D Shader Disassembler
//
//
// Note: shader requires additional functionality:
//       Double-precision floating point
//
//
// Input signature:
//
// Name                 Index   Mask Register SysValue  Format   Used
// -------------------- ----- ------ -------- -------- ------- ------
// no Input
//
// Output signature:
//
// Name                 Index   Mask Register SysValue  Format   Used
// -------------------- ----- ------ -------- -------- ------- ------
// no Output
      cs_5_0
      dcl_globalFlags refactoringAllowed | enableDoublePrecisionFloatOps
      dcl_uav_structured u0, 8
      dcl_uav_structured u1, 8
      dcl_input vThreadID.x
      dcl_temps 2
      dcl_thread_group 64, 1, 1
   0: ine r0.x, vThreadID.x, l(1)
   1: if_nz r0.x
   2:   ret 
   3: endif 
   4: dmov r0.xy, d(-0.161000l, 0.000000l)
   5: mov r0.z, l(0)
   6: loop 
   7:   ige r0.w, r0.z, l(3)
   8:   breakc_nz r0.w
   9:   dmul r1.xyzw, r0.xyxy, d(1.001000l, 0.999000l)
  10:   dadd r1.xy, -r1.zwzw, r1.xyxy
  11:   store_structured u1.xy, r0.z, l(0), r1.xyxx
  12:   dadd r0.xy, r0.xyxy, d(0.010000l, 0.000000l)
  13:   iadd r0.z, r0.z, l(1)
  14: endloop 
  15: store_structured u0.xy, l(0), l(0), l(-0.000000,-0.707432,0,0)
  16: store_structured u0.xy, l(1), l(0), l(0.000000,-0.702312,0,0)
  17: store_structured u0.xy, l(2), l(0), l(-918250586112.000000,-0.697192,0,0)
  18: ret 
// Approximately 0 instruction slots used

Edit 3 After reaching out to Microsoft (see https://docs.microsoft.com/en-us/an...nrolling-a-loop-affect-the-accuracy-of-t.html), they stated that the problem has more to do with Unity, because:

"The pragma unroll [(n)] is keil compiler which Unity uses topic"

Solution 1:^[1]

This is driver, hardware, compiler, and Unity dependent.

In essence, the HLSL specification has somewhat looser guarantees for rounding behavior of mathematical operations than regular IEEE-754 floating point.

First, it is implementation-dependent whether operations round up or down.

IEEE-754 requires floating-point operations to produce a result that is the nearest representable value to an infinitely-precise result, known as round-to-nearest-even. Direct3D 10, however, defines a looser requirement: 32-bit floating-point operations produce a result that is within one unit-last-place (1 ULP) of the infinitely-precise result. This means that, for example, hardware is allowed to truncate results to 32-bit rather than perform round-to-nearest-even, as that would result in error of at most one ULP.
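
To make the 1 ULP rule concrete, here is a small Python sketch that rounds the same value to 32 bits in two ways: round-to-nearest-even and truncation toward zero. The helper names f32_nearest and f32_truncate are made up for this illustration, and the code says nothing about what any particular GPU actually does. Both results satisfy the looser Direct3D requirement, yet they are different numbers.

```python
import struct

def f32_nearest(x):
    """Round a 64-bit Python float to the nearest 32-bit float (ties to even)."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def f32_truncate(x):
    """Round toward zero to a 32-bit float (finite, nonzero, in-range x only)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    nearest = struct.unpack("<f", struct.pack("<I", bits))[0]
    if abs(nearest) > abs(x):
        # Floats are stored sign-magnitude, so decrementing the bit pattern
        # moves exactly one ULP toward zero.
        bits -= 1
    return struct.unpack("<f", struct.pack("<I", bits))[0]

x = 0.1
ulp = 2.0 ** -27   # spacing of 32-bit floats in [2**-4, 2**-3), which contains 0.1
near, trunc = f32_nearest(x), f32_truncate(x)

print("nearest  :", repr(near), " error:", abs(near - x))
print("truncated:", repr(trunc), " error:", abs(trunc - x))
print("one ULP apart:", near - trunc == ulp)
```

Two implementations that each stay within 1 ULP can therefore disagree by a full ULP on a single operation, and such differences may compound over an iterative procedure.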

See https://docs.microsoft.com/en-us/windows/win32/direct3d10/d3d10-graphics-programming-guide-resources-float-rules#32-bit-floating-point-rules

Going one step further, the HLSL compiler itself has many fast-math optimizations that can violate IEEE-754 float conformance; see, for example:

D3DCOMPILE_IEEE_STRICTNESS - Forces the IEEE strict compile.
D3DCOMPILE_OPTIMIZATION_LEVEL3 - Directs the compiler to use the highest optimization level. If you set this constant, the compiler produces the best possible code but might take significantly longer to do so. Set this constant for final builds of an application when performance is the most important factor.
D3DCOMPILE_PARTIAL_PRECISION - Directs the compiler to perform all computations with partial precision. If you set this constant, the compiled code might run faster on some hardware.

Source: https://docs.microsoft.com/en-us/windows/win32/direct3dhlsl/d3dcompile-constants

This particularly matters for your scenario, because if optimizations are enabled, the existence of loop unrolling can trigger constant folding optimizations that reduce the computational cost of your code and change the precision of its results (potentially even improving them). Note that when constant folding occurs, the compiler has to decide how to perform rounding, and that might disagree with what your hardware FPUs would do.
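
As a rough illustration of that last point, the Python sketch below evaluates the expression from Edit 1 in two ways. runtime_style rounds every intermediate to a 64-bit double, much like the dmul/dadd sequence in the non-unrolled loop. folded_style is a purely hypothetical constant folder that evaluates the expression exactly and rounds once at the end. How fxc or dxc actually folds constants is not documented here, so treat this only as one way the two paths could diverge.

```python
from fractions import Fraction

def runtime_style(k, theta):
    # Each +, -, * is rounded to a 64-bit double before the next one runs.
    return (k + 1e-3) * theta - (k - 1e-3) * theta

def folded_style(k, theta):
    # Hypothetical folder: exact rational arithmetic, one final rounding.
    k, theta, d = Fraction(k), Fraction(theta), Fraction(1e-3)
    return float((k + d) * theta - (k - d) * theta)

k, theta, dtheta = 1.0, -0.161, 0.01   # values from the Edit 1 kernel
diffs = []
for _ in range(3):
    a, b = runtime_style(k, theta), folded_style(k, theta)
    diffs.append(a - b)
    print(f"theta = {theta:+.3f}  runtime = {a!r}  folded = {b!r}  diff = {a - b:.3e}")
    theta += dtheta
```

Any difference between the two columns is bounded by a few units in the last place of the intermediate products, which is the same order of magnitude as the roughly 1e-17 deviation reported in Edit 1.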

Oh, and note that IEEE-754 does not place constraints on the precision, let alone require implementation, of "additional operations" (e.g. sin, cos, tanh, atan, ln, etc); it purely recommends them.

For a very common case where this goes wrong, see "sin(x) only returns 4 different values for moderately large input on GLSL fragment shader, Intel HD4000": there, sin gets quantized to 4 different values on Intel integrated graphics, but has reasonable precision on other hardware.

Also, note that Unity does not guarantee that a float in shader is actually a 32-bit float; on certain hardware (e.g. mobile), it can even be backed by a 16-bit half or an 11-bit fixed.

High precision: float Highest precision floating point value; generally 32 bits (just like float from regular programming languages).

... One complication of float/half/fixed data type usage is that PC GPUs are always high precision. That is, for all the PC (Windows/Mac/Linux) GPUs, it does not matter whether you write float, half or fixed data types in your shaders. They always compute everything in full 32-bit floating point precision.

The half and fixed types only become relevant when targeting mobile GPUs, where these types primarily exist for power (and sometimes performance) constraints. Keep in mind that you need to test your shaders on mobile to see whether or not you are running into precision/numerical issues.

Even on mobile GPUs, the different precision support varies between GPU families.

Source: https://docs.unity3d.com/Manual/SL-DataTypesAndPrecision.html
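
To get a feel for how much the backing type matters, this Python sketch rounds the same value to 16-bit and 32-bit IEEE formats using the standard struct module. The helper names are illustrative, and Unity's fixed type is not modelled at all.

```python
import struct

def as_float32(x):
    """Round a 64-bit Python float to the nearest 32-bit float."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def as_float16(x):
    """Round a 64-bit Python float to the nearest 16-bit half float."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

x = 0.1
for name, value in (("half ", as_float16(x)), ("float", as_float32(x))):
    print(f"{name}: {value!r}  abs error = {abs(value - x):.3e}")
```

Here the half-precision value is off in the fifth significant digit, while the 32-bit value is off only around the ninth, so a shader that silently runs at reduced precision can deviate by far more than rounding-mode differences alone would explain.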

I don't believe Unity exposes compiler flags to developers; you are at its whim as to what optimizations it passes to dxc/fxc. Given it's primarily used for games, you can bet they enable optimizations.

Source: https://forum.unity.com/threads/possible-to-set-directx-compiler-flags-in-shaders.453790/

Finally, check out "Floating-Point Determinism" by Bruce Dawson if you want an in-depth dive into this topic. I will add that this problem also exists if you want consistent results between languages (since a language can implement math functions in software rather than using hardware intrinsics, e.g. for better precision), when cross-compiling (since different compilers / backends can optimize differently or use different system libraries), or when running managed code across different runtimes (e.g. since a JIT can apply different optimizations).

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
