For loop thrown away by compiler optimization

I was doing an experiment to measure the execution time of a "for loop" on a microcontroller. This "for loop" contains some integer and pointer operations.

Case 1: When I set the compiler optimization flag to "none" (no optimization), assembly code is generated for the loop and I can measure the execution time.

Case 2: When I set the compiler optimization to "speed" (optimized for speed), no assembly code is generated for this loop. It seems like the compiler throws out this "for loop".

/* The basic concept behind this code is data manipulation in an array. Therefore I created an array and then, with the help of loops, tried to manipulate the data. */

int abc[1000];
int n, i;   /* loop counters */

for(n=0; n<1000; n++)
{
    abc[n] = 0xaa;
}
for(n=2; n<1000; n=n+2)
{
    abc[n] = 0xbb;
}
for(n=5; n<1000; n=n+2)
{
    for(i=(n+n); i<1000; i++)
    {
        abc[i] = i;
    }
}

Could anyone explain why the compiler throws out this loop when I set the compiler flag to speed?



Solution 1:[1]

If you don't use abc afterwards, it's possible that the optimizer recognized it (and all writes to it) as "dead" and removed it completely.
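As a quick illustration (a hypothetical sketch, not code from the question): if the array's contents are used afterwards, for example to compute a checksum that gets printed, the stores are no longer dead, so plain dead-store elimination cannot remove them (though a clever compiler may still simplify the loops in other ways).

#include <stdio.h>

/* Hypothetical rework of the test: seeding the array with a runtime
   value (argc) and printing a checksum makes the writes observable,
   so they can no longer be discarded as dead stores. */
int main ( int argc, char **argv )
{
    int abc[1000];
    unsigned int sum = 0;
    int n;

    (void)argv;
    for(n=0; n<1000; n++)
    {
        abc[n] = argc;      /* store now depends on runtime input */
    }
    for(n=0; n<1000; n++)
    {
        sum += abc[n];      /* later read: the stores above are not dead */
    }
    printf("%u\n", sum);    /* observable result */
    return 0;
}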

Solution 2:[2]

The compiler looks at your code and sees that abc is set and never used. Some compilers give you a warning about this. Since abc is never used, the compiler optimizes it out, because what's the point in setting a variable if you never use it?

You could make abc volatile, but that would probably defeat the purpose of your test. Making the variable volatile tells the compiler it can't make any assumptions about the variable's use. With the variable volatile the compiler may not be able to make any optimizations, so the timing could end up the same with and without optimization.
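For example, a minimal sketch (the function name fill is made up for illustration):

/* Marking the array volatile tells the compiler that every store must
   actually be performed, so the fill loop survives optimization. The
   price is a forced memory access per iteration, which distorts timing. */
void fill ( void )
{
    volatile int abc[1000];
    int n;

    for(n=0; n<1000; n++)
    {
        abc[n] = 0xaa;
    }
}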

Solution 3:[3]

Here are a couple of COMPLETE examples, as yours is incomplete to the point that we cannot accurately answer your question.

unsigned int fun ( void )
{
    unsigned int x[8];
    unsigned int ra;
    unsigned int rb;

    for(ra=0;ra<5;ra++) x[ra]=ra;
    rb=0; for(ra=0;ra<5;ra++) rb+=x[ra];
    return(rb);
}

void more_fun ( void )
{
    unsigned int x[8];
    unsigned int ra;

    for(ra=0;ra<5;ra++) x[ra]=ra;
}

and an example of optimized compiled output

00000000 <fun>:
   0:   e3a0000a    mov r0, #10
   4:   e12fff1e    bx  lr

00000008 <more_fun>:
   8:   e12fff1e    bx  lr

Taking the second function first: it is very easy to see that neither ra nor x is used outside the function, and nothing they are doing produces anything of real value. It is dead code with unused variables, so it all goes away and the entire function is optimized to this:

void more_fun ( void )
{
}

as it should be.

Taking this further (and I have taken this kind of thing much further, using randomizers and other algorithms), sometimes the compiler figures it out and sometimes not. In this case it was pretty easy.

So none of the gyrations in the fun() function have any runtime value; it is all dead code. The result does not vary based on an input or anything global; it is math completely contained within the function, with an exact answer that can be pre-computed. So the compiler computes the answer at compile time (0+1+2+3+4 = 10) and removes all of the dead code, basically coming up with the right answer:

unsigned int fun ( void )
{
   return(10);
}
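For contrast, here is a hypothetical variant (not from the original answer) where the answer cannot be pre-computed, because it depends on the caller's input:

/* The loop bound now comes from the argument, so the compiler cannot
   fold the result to a single constant. Note that a good optimizer may
   still replace the loop with the closed form n*(n-1)/2. */
unsigned int fun2 ( unsigned int n )
{
    unsigned int ra;
    unsigned int rb;

    rb=0; for(ra=0;ra<n;ra++) rb+=ra;
    return(rb);
}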

If you are wanting to burn time using loops or perhaps even see how loops are implemented, etc, then you can try a couple of things.

void dummy ( unsigned int );
unsigned int fun ( void )
{
    unsigned int ra;
    volatile unsigned int rb;

    rb=0; for(ra=0;ra<5;ra++) rb+=ra;
    return(rb);
}

void more_fun ( void )
{
    unsigned int ra;
    for(ra=0;ra<5;ra++) dummy(ra);
}

which can give something like (compilers vary)

00000000 <fun>:
   0:   e3a03000    mov r3, #0
   4:   e24dd008    sub sp, sp, #8
   8:   e58d3004    str r3, [sp, #4]
   c:   e59d3004    ldr r3, [sp, #4]
  10:   e58d3004    str r3, [sp, #4]
  14:   e59d3004    ldr r3, [sp, #4]
  18:   e2833001    add r3, r3, #1
  1c:   e58d3004    str r3, [sp, #4]
  20:   e59d3004    ldr r3, [sp, #4]
  24:   e2833002    add r3, r3, #2
  28:   e58d3004    str r3, [sp, #4]
  2c:   e59d3004    ldr r3, [sp, #4]
  30:   e2833003    add r3, r3, #3
  34:   e58d3004    str r3, [sp, #4]
  38:   e59d3004    ldr r3, [sp, #4]
  3c:   e2833004    add r3, r3, #4
  40:   e58d3004    str r3, [sp, #4]
  44:   e59d0004    ldr r0, [sp, #4]
  48:   e28dd008    add sp, sp, #8
  4c:   e12fff1e    bx  lr

00000050 <more_fun>:
  50:   e92d4010    push    {r4, lr}
  54:   e3a04000    mov r4, #0
  58:   e1a00004    mov r0, r4
  5c:   e2844001    add r4, r4, #1
  60:   ebfffffe    bl  0 <dummy>
  64:   e3540005    cmp r4, #5
  68:   1afffffa    bne 58 <more_fun+0x8>
  6c:   e8bd4010    pop {r4, lr}
  70:   e12fff1e    bx  lr

The volatile solution, as you can see, is pretty ugly: it is saying I want you to actually go out and touch RAM every time you use this variable, read it from RAM whenever you want to do anything with it, and write it back after each step. The more_fun() solution does not rely on hoping the compiler honors volatile (why does the compiler honor volatile on a local, dead variable? seems wrong); instead you force the compiler to call an external function, one that is not in the optimization domain and as a result cannot be inlined (which could otherwise expose dead code if, for example, dummy() does not use the input variable). Being such a small loop, the compiler could implement it as a loop or could unroll it; you can probably ask it to try to unroll if it makes sense. Here it implemented it as a loop, as shown above, but it could possibly implement it as a series of calls:

void more_fun ( void )
{
    dummy(0);
    dummy(1);
    dummy(2);
    dummy(3);
    dummy(4);
}
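For the dummy() trick to work, the body of dummy() has to stay outside the optimization domain, for example in a separately compiled file (file name assumed for illustration; note that link-time optimization would widen the domain and could defeat this):

/* dummy.c - compiled separately, so while compiling more_fun() the
   optimizer cannot see this body and cannot remove or inline the calls. */
void dummy ( unsigned int x )
{
    (void)x;    /* intentionally unused; the call itself is the payload */
}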

The beauty of all of this is that with free tools like GNU, although not the best compiler as far as the tightest/fastest code (one size fits all fits no one well), it compiles to object or binary, and it has a disassembler that will disassemble both objects and binaries. So you can play with simple functions like these, examine what the compile options do, and start to understand what dead code is and what it looks like. It costs you nothing but time.
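With a GNU cross toolchain the round trip looks something like this (the arm-none-eabi target triple is an assumption; substitute your own toolchain prefix):

arm-none-eabi-gcc -O2 -c fun.c -o fun.o
arm-none-eabi-objdump -d fun.o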

Most folks that somewhat get this go for the volatile solution. But if you are trying to do some hand optimization, starting slow and building up, volatile is going to mess with your results, not help you: it overworks the processor in an unnatural way compared with not having it there. You could eventually end up with real code, calling other functions inside a loop, whose performance varies greatly with and without the volatile on one of the variables.

Benchmarking, in a word, is bogus anyway; it is far too easy to manipulate the results, even using the exact same source code on the exact same computer with the same compiler. All that matters is that your actual program, performing your actual task, has this performance. Implement it a different way and measure that if possible, add some margin, and decide which is faster and go with it; or, if they are about the same, go with the one that is easier to read and/or maintain.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2 shredder
Solution 3 old_timer