Why is calling snprintf() so slow?
Our internal program is written in C and makes extensive use of snprintf() in many places. While debugging with perf record/report, I noticed it spends a lot of time on the following:
│ _IO_vfprintf_internal(): ▒
│ mov -0x510(%rbp),%rdx ▒
│ mov %r12,%rsi ▒
│ mov %r15,%rdi ▒
│ → callq *0x38(%rax) ▒
│ cmp %rax,-0x510(%rbp) ▒
│ mov -0x530(%rbp),%r9 ▒
│ ↑ jne 91a ▒
│ mov -0x4d0(%rbp),%esi ▒
│ mov -0x540(%rbp),%ecx ▒
│ mov $0x7fffffff,%eax ▒
│ sub %esi,%eax ▒
│ add %esi,%ecx ▒
│ cltq ▒
│ cmp %rax,-0x510(%rbp) ▒
│ ↑ jbe 252b ▒
│ ↑ jmpq 28f0 ▒
│4a70: xor %eax,%eax ▒
│ or $0xffffffffffffffff,%rcx ▒
│ mov %r12,%rdi ▒
99.52 │ repnz scas %es:(%rdi),%al
This happens regardless of the GCC version or glibc: I've tried GCC 4.8.5 all the way up to GCC 9, with -O2 and -O3, etc. I've even used a sandbox with glibc built from -HEAD.
Edit: I found a test program that demonstrates this problem quite well with perf:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
char dst[20000];
static void test_strcpy(int i) {
    size_t len = sizeof(dst);
    char* src = (char*)malloc(len + 1);
    memset(src, 'a', len);    /* source string fills the whole 20000-byte buffer */
    src[len] = 0;
    while (i--)
        strcpy(dst, src);
    free(src);
}
static void test_strncat(int i) {
    size_t len = sizeof(dst);
    char* src = (char*)malloc(len + 1);
    memset(src, 'a', len);
    src[len] = 0;
    while (i--) {
        dst[0] = 0;    /* empty dst, then append from src (bounded by sizeof(dst) bytes) */
        strncat(dst, src, sizeof(dst));
    }
    free(src);
}
#define BENCH(FN, ARG)                                                      \
    do {                                                                    \
        struct timeval start, end;                                          \
        long sec, usec;                                                     \
        gettimeofday(&start, NULL);                                         \
        FN(ARG);                                                            \
        gettimeofday(&end, NULL);                                           \
        sec = end.tv_sec - start.tv_sec;                                    \
        usec = end.tv_usec - start.tv_usec;                                 \
        if (usec < 0) {                                                     \
            --sec;                                                          \
            usec += 1000000;                                                \
        }                                                                   \
        printf("%13s(%5d): %2ld sec %6ld usec\n", #FN, ARG, sec, usec);     \
    } while (0)
static void test_snprintf(int i) {
    size_t len = sizeof(dst);
    char* src = (char*)malloc(len + 1);
    memset(src, 'a', len);
    src[len] = 0;
    while (i--)
        snprintf(dst, sizeof(dst), "%s", src);    /* "%s" copy, bounded by dst size */
    free(src);
}
int main(int argc, char *argv[]) {
    int i;
    for (i = 0; i < 10; i++) {
        BENCH(test_strcpy, 10);
        BENCH(test_strcpy, 100);
        BENCH(test_strcpy, 1000);
        BENCH(test_strcpy, 10000);
        BENCH(test_strncat, 10);
        BENCH(test_strncat, 100);
        BENCH(test_strncat, 1000);
        BENCH(test_strncat, 10000);
        BENCH(test_snprintf, 10);
        BENCH(test_snprintf, 100);
        BENCH(test_snprintf, 1000);
        BENCH(test_snprintf, 10000);
        printf("\n");
    }
    return 0;
}
It seems as if repnz scasb is the culprit, regardless of the string size. I've read that repnz scasb shouldn't be used for small strings because the setup cost on silicon is quite high.
Solution 1:[1]
GCC bug 88809 is closed, fixed for GCC 10; it affects GCC 9 and earlier. GCC 10 and later should no longer expand strlen inline with slow crap like repnz scasb when compiling glibc. (See Why is this code using strlen heavily 6.5x slower with GCC optimizations enabled? for details of how strlen gets compiled by affected GCC versions. The version you show using repnz scasb looks like the -O1 worst case.)
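If you want to check which expansion your own toolchain picks, a minimal repro is to compile a trivial strlen() caller and inspect the generated assembly. This is just an illustrative sketch (the file name and function are mine, not from the bug report); with an affected GCC at -O1 you may see repnz scasb inlined, while GCC 10+ should emit a call to glibc's optimized strlen:
/* strlen_repro.c: compile with e.g. "gcc -O1 -S strlen_repro.c" and compare
 * the .s output across GCC versions. */
#include <string.h>

size_t length_of(const char *s) {
    return strlen(s);   /* watch whether this becomes "call strlen" or an inline repnz scasb loop */
}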
snprintf is still not fast in general (there's a lot of overhead in parsing the format string and passing varargs through wrapper functions). But with that bugfix, it should hopefully scale to large strings about as well as strcpy, or at least as well as strlen + memcpy, which makes two passes over the data. So with large inputs that lead to it spending most of its time in strlen and/or memcpy, it will at least be doing that efficiently, probably with the time split roughly evenly between strlen and memcpy if it internally uses memcpy instead of a hand-written loop, instead of 99% in strlen like you found with perf record (unless that was only counting sample hits in this function itself, not in calls to memcpy).

Your benchmark test cases with medium to large strings like 1000 and 10k bytes will still spend most of their time on that, not on format-string parsing and varargs / wrapper-function overhead, but for small sizes that overhead will dominate compared to strcpy, which goes straight into copying and finds the length along the way.
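To see why the large-string cases boil down to strlen + memcpy, here is a rough sketch (my own illustration, not glibc's actual code) of what snprintf(dst, size, "%s", src) has to do once the format-string parsing and varargs overhead are stripped away:
#include <string.h>

/* Hypothetical helper, equivalent in effect to snprintf(dst, size, "%s", src)
 * for this one case: one pass to find the length, a second pass to copy. */
static size_t copy_str_bounded(char *dst, size_t size, const char *src) {
    size_t len = strlen(src);                    /* pass 1: find the length */
    if (size != 0) {
        size_t n = (len < size - 1) ? len : size - 1;
        memcpy(dst, src, n);                     /* pass 2: copy the bytes */
        dst[n] = '\0';
    }
    return len;    /* like snprintf: the length it would have written */
}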
As pointed out in comments, your strncat loop is an anti-pattern. Building up a large string that way re-scans the already-copied part every time, giving O(N * M) run time to concatenate N strings of length M. Or in your case, once you fill up the destination, it just costs O(dstsize) time per call to basically memchr it for a terminating zero.
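In its general form the anti-pattern looks like this (the function and names are mine, purely for illustration); each call re-walks everything already appended just to find the terminating zero, so concatenating n pieces costs O(n^2) total work:
#include <string.h>

/* Assumes dst is large enough to hold all the pieces plus the terminator. */
static void join_slow(char *dst, const char *pieces[], size_t n) {
    dst[0] = '\0';
    for (size_t i = 0; i < n; i++)
        strcat(dst, pieces[i]);   /* re-scans all of dst every iteration */
}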
To avoid this, use strlen and memcpy yourself to keep track of the point to append to. Or, if you can use POSIX-2008 functions, use stpcpy, which returns the end of the copy: the place where you want to start copying the next string. Unfortunately this doesn't give an easy way to do bounds checks.

stpcpy should be as fast as strcpy in glibc, I assume using the same hand-written asm. It's been part of glibc since 1992.
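Here is a sketch of the pointer-carrying version, assuming POSIX-2008 stpcpy() is available (again, the names are mine): keep the end-of-string pointer that stpcpy returns so nothing gets re-scanned. As noted above, this does no bounds checking, so the caller has to know dst is big enough:
#define _POSIX_C_SOURCE 200809L   /* for stpcpy() under strict -std= modes */
#include <string.h>

static void join_fast(char *dst, const char *pieces[], size_t n) {
    char *end = dst;
    *end = '\0';
    for (size_t i = 0; i < n; i++)
        end = stpcpy(end, pieces[i]);   /* returns a pointer to the new terminating '\0' */
}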
See also strcpy() return value about the poor API design of ISO C string functions, which dates back very early. (That's separate from the strncpy design that makes it suck for avoiding buffer overflows by not forcing the destination to be zero-terminated.)
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source
---|---
Solution 1 | ecm