'Why don't compilers optimize trivial wrapper function pointers?

Consider the following code snippet

#include <vector>
#include <cstdlib>

void __attribute__ ((noinline)) calculate1(double& a, int x) { a += x; };
void __attribute__ ((noinline)) calculate2(double& a, int x) { a *= x; };
void wrapper1(double& a, int x) { calculate1(a, x); } 
void wrapper2(double& a, int x) { calculate2(a, x); } 

typedef void (*Func)(double&, int);

int main()
{
    std::vector<std::pair<double, Func>> pairs = {
        std::make_pair(0, (rand() % 2 ? &wrapper1 : &wrapper2)),
        std::make_pair(0, (rand() % 2 ? &wrapper1 : &wrapper2)),
    };

    for (auto& [a, wrapper] : pairs)
        (*wrapper)(a, 5);

    return pairs[0].first + pairs[1].first;
}

With -O3 optimization the latest gcc and clang versions do not optimize the pointers to wrappers to pointers to underlying functions. See assembly here at line 22:

mov     ebp, OFFSET FLAT:wrapper2(double&, int)   # tmp118,

which results later in call + jmp, instead of just call had the compiler put a pointer to the calculate1 instead.

Note that I specifically asked for no-inlined calculate functions to illustrate; doing it without noinline results in another flavour of non-optimization where compiler will generate two identical functions to be called by pointer (so still won't optimize, just in a different fashion).

What am I missing here? Is there any way to guide the compiler short of manually plugging in the correct functions (without wrappers)?

Edit 1. Following suggestions in the comments, here is a disassembly with all functions declared static, with exactly the same result (call + jmp instead of call).

Edit 2. Much simpler example of the same pattern:

#include <vector>
#include <cstdlib>

typedef void (*Func)(double&, int);

static void __attribute__ ((noinline)) calculate(double& a, int x) { a += x; };
static void wrapper(double& a, int x) { calculate(a, x); } 

int main() {
    double a = 5.0;
    Func f;
    if (rand() % 2)
        f = &wrapper; // f = &calculate;
    else
        f = &wrapper;
    f(a, 0); 
    return 0;
}

gcc 8.2 successfully optimizes this code by throwing pointer to wrapper away and storing &calculate directly in its place (https://gcc.godbolt.org/z/nMIBeo). However changing the line as per comment (that is, performing part of the same optimization manually) breaks the magic and results in pointless jmp.



Solution 1:[1]

You seem to be suggesting that &calculate1 should be stored in the vector instead of &wrapper1. In general this is not possible: later code might try to compare the stored pointer against &calculate1 and that must compare false.

I further assume that your suggestion is that the compiler might try to do some static analysis and determine that the function pointers values in the vector are never compared for equality with other function pointers, and in fact that none of the other operations done on the vector elements would produce a change in observable behaviour; and therefore in this exact program it could store &calculate1 instead.

Usually the answer to "why does the compiler not perform some particular optimization" is that nobody has conceived of and implemented that idea. Another common reason is that the static analysis involved is, in the general case, quite difficult and might lead to a slowdown in compilation with no benefit in real programs where the analysis could not be guaranteed to succeed.

Solution 2:[2]

You are making a lot of assumptions here. Firstly, your syntax. The second is that compilers are perfect in the eye of the beholder and catch everything. The reality is that it is easy to find and hand optimize compiler output, it is not difficult to write small functions to trip up a compiler that you are well in tune with or write a decent size application and there will be places where you can hand tune. This is all known and expected. Then opinion comes in where on my machine my blah is faster than blah so it should have made these instructions instead.

gcc is not a great compiler for performance, on some targets it has been getting worse for a number of major revs. It is pretty good at what it does, better than pretty good, it deals with a number of pre processors/languages has a common middle and a number of backends. Some backends get better optimization applied front to back others are just hanging on for the ride.There were a number of other compilers that could produce code that could easily outperform gcc.

These were mostly pay-for compilers. More than an individual would pay out of pocket: used car prices, sometimes recurring annually.

There are things that gcc can optimize that are simply amazing and times that it totally goes in the wrong direction. Same goes for clang, often they do similar jobs with similar output, sometimes do some impressive things sometimes just go off into the weeds. I now find it more fun to manipulate the optimizer to make it do good or bad things rater than worry about why didn't it do what I "think" it should have done on a particular occasion. If I need that code faster I take the compiled output and hand fix it and use it as an assembly function.

You get what you pay for with gcc, if you were to look deep in its bowels you will find it is barely held together with duct tape and bailing wire (llvm is catching up). But for a free tool it does a simply amazing job, it is so widely used that you can get free support just about anywhere. We are sadly well into a time where folks think that because gcc interprets the language in a certain way that is how the language is defined and sadly that is not remotely true. But so many folks don't try other compilers to find out what "implementation defined" really means.

Last and most important, it's open source, if you want to "fix" an optimization then just do it. Keep that fix for yourself, post it, or try to push it upstream.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 M.M
Solution 2 halfer