Why is calling snprintf() so slow?
Our internal program is written in C and makes extensive use of snprintf() in many places. While debugging with perf record/report, I noticed it spends a lot of time on the following:
│ _IO_vfprintf_internal(): ▒
│ mov -0x510(%rbp),%rdx ▒
│ mov %r12,%rsi ▒
│ mov %r15,%rdi ▒
│ → callq *0x38(%rax) ▒
│ cmp %rax,-0x510(%rbp) ▒
│ mov -0x530(%rbp),%r9 ▒
│ ↑ jne 91a ▒
│ mov -0x4d0(%rbp),%esi ▒
│ mov -0x540(%rbp),%ecx ▒
│ mov $0x7fffffff,%eax ▒
│ sub %esi,%eax ▒
│ add %esi,%ecx ▒
│ cltq ▒
│ cmp %rax,-0x510(%rbp) ▒
│ ↑ jbe 252b ▒
│ ↑ jmpq 28f0 ▒
│4a70: xor %eax,%eax ▒
│ or $0xffffffffffffffff,%rcx ▒
│ mov %r12,%rdi ▒
99.52 │ repnz scas %es:(%rdi),%al
This happens regardless of the GCC version or glibc: I've tried GCC 4.8.5 all the way up to GCC 9, with -O2 and -O3, etc. I've even used a sandbox with glibc built from -HEAD.
Edit: I found a test program that demonstrates this problem quite well with perf:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
char dst[20000];
static void test_strcpy(int i) {
    size_t len = sizeof(dst);
    char* src = (char*)malloc(len + 1);
    memset(src, 'a', len);    /* source string fills the whole 20000-byte buffer */
    src[len] = 0;
    while (i--)
        strcpy(dst, src);
    free(src);
}
static void test_strncat(int i) {
    size_t len = sizeof(dst);
    char* src = (char*)malloc(len + 1);
    memset(src, 'a', len);
    src[len] = 0;
    while (i--) {
        dst[0] = 0;    /* empty dst, then append from src (bounded by sizeof(dst) bytes) */
        strncat(dst, src, sizeof(dst));
    }
    free(src);
}
#define BENCH(FN, ARG)                                                      \
    do {                                                                    \
        struct timeval start, end;                                          \
        long sec, usec;                                                     \
        gettimeofday(&start, NULL);                                         \
        FN(ARG);                                                            \
        gettimeofday(&end, NULL);                                           \
        sec = end.tv_sec - start.tv_sec;                                    \
        usec = end.tv_usec - start.tv_usec;                                 \
        if (usec < 0) {                                                     \
            --sec;                                                          \
            usec += 1000000;                                                \
        }                                                                   \
        printf("%13s(%5d): %2ld sec %6ld usec\n", #FN, ARG, sec, usec);     \
    } while (0)
static void test_snprintf(int i) {
    size_t len = sizeof(dst);
    char* src = (char*)malloc(len + 1);
    memset(src, 'a', len);
    src[len] = 0;
    while (i--)
        snprintf(dst, sizeof(dst), "%s", src);    /* "%s" copy, bounded by dst size */
    free(src);
}
int main(int argc, char *argv[]) {
    int i;
    for (i = 0; i < 10; i++) {
        BENCH(test_strcpy, 10);
        BENCH(test_strcpy, 100);
        BENCH(test_strcpy, 1000);
        BENCH(test_strcpy, 10000);
        BENCH(test_strncat, 10);
        BENCH(test_strncat, 100);
        BENCH(test_strncat, 1000);
        BENCH(test_strncat, 10000);
        BENCH(test_snprintf, 10);
        BENCH(test_snprintf, 100);
        BENCH(test_snprintf, 1000);
        BENCH(test_snprintf, 10000);
        printf("\n");
    }
    return 0;
}
It seems as if repnz scasb is the culprit, regardless of the string size. I've read that repnz scasb shouldn't be used for small strings because the setup cost on silicon is quite high.
Solution 1:[1]
GCC bug 88809 is closed, fixed for GCC 10; it affects GCC 9 and earlier. GCC 10 and later should no longer expand strlen inline with slow crap like repnz scasb when compiling glibc. (See Why is this code using strlen heavily 6.5x slower with GCC optimizations enabled? for details of how strlen gets compiled by affected GCC versions. The version you show using repnz scasb looks like the -O1 worst case.)
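If you want to check which expansion your own toolchain picks, a minimal repro is to compile a trivial strlen() caller and inspect the generated assembly. This is just an illustrative sketch (the file name and function are mine, not from the bug report); with an affected GCC at -O1 you may see repnz scasb inlined, while GCC 10+ should emit a call to glibc's optimized strlen:
/* strlen_repro.c: compile with e.g. "gcc -O1 -S strlen_repro.c" and compare
 * the .s output across GCC versions. */
#include <string.h>

size_t length_of(const char *s) {
    return strlen(s);   /* watch whether this becomes "call strlen" or an inline repnz scasb loop */
}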
snprintf is still not fast in general (there's a lot of overhead in parsing the format string and passing varargs through wrapper functions). But with that bugfix, it should hopefully scale to large strings about as well as strcpy, or at least as well as strlen + memcpy, which makes two passes over the data. So with large inputs that lead to it spending most of its time in strlen and/or memcpy, it will at least be doing that efficiently, probably with the time split roughly evenly between strlen and memcpy if it internally uses memcpy instead of a hand-written loop, instead of 99% in strlen like you found with perf record (unless that was only counting sample hits in this function itself, not in calls to memcpy).

Your benchmark test cases with medium to large strings like 1000 and 10k bytes will still spend most of their time on that, not on format-string parsing and varargs / wrapper-function overhead, but for small sizes that overhead will dominate compared to strcpy, which goes straight into copying and finds the length along the way.
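To see why the large-string cases boil down to strlen + memcpy, here is a rough sketch (my own illustration, not glibc's actual code) of what snprintf(dst, size, "%s", src) has to do once the format-string parsing and varargs overhead are stripped away:
#include <string.h>

/* Hypothetical helper, equivalent in effect to snprintf(dst, size, "%s", src)
 * for this one case: one pass to find the length, a second pass to copy. */
static size_t copy_str_bounded(char *dst, size_t size, const char *src) {
    size_t len = strlen(src);                    /* pass 1: find the length */
    if (size != 0) {
        size_t n = (len < size - 1) ? len : size - 1;
        memcpy(dst, src, n);                     /* pass 2: copy the bytes */
        dst[n] = '\0';
    }
    return len;    /* like snprintf: the length it would have written */
}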
As pointed out in comments, your strncat loop is an anti-pattern. Building up a large string that way re-scans the already-copied part every time, giving O(N * M) run time to concatenate N strings of length M. Or in your case, once you fill up the destination, it just costs O(dstsize) time per call to basically memchr it for a terminating zero.
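In its general form the anti-pattern looks like this (the function and names are mine, purely for illustration); each call re-walks everything already appended just to find the terminating zero, so concatenating n pieces costs O(n^2) total work:
#include <string.h>

/* Assumes dst is large enough to hold all the pieces plus the terminator. */
static void join_slow(char *dst, const char *pieces[], size_t n) {
    dst[0] = '\0';
    for (size_t i = 0; i < n; i++)
        strcat(dst, pieces[i]);   /* re-scans all of dst every iteration */
}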
To avoid this, use strlen and memcpy yourself to keep track of the point to append to. Or, if you can use POSIX-2008 functions, use stpcpy, which returns the end of the copy: the place where you want to start copying the next string. Unfortunately this doesn't give an easy way to do bounds checks.

stpcpy should be as fast as strcpy in glibc, I assume using the same hand-written asm. It's been part of glibc since 1992.
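Here is a sketch of the pointer-carrying version, assuming POSIX-2008 stpcpy() is available (again, the names are mine): keep the end-of-string pointer that stpcpy returns so nothing gets re-scanned. As noted above, this does no bounds checking, so the caller has to know dst is big enough:
#define _POSIX_C_SOURCE 200809L   /* for stpcpy() under strict -std= modes */
#include <string.h>

static void join_fast(char *dst, const char *pieces[], size_t n) {
    char *end = dst;
    *end = '\0';
    for (size_t i = 0; i < n; i++)
        end = stpcpy(end, pieces[i]);   /* returns a pointer to the new terminating '\0' */
}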
See also strcpy() return value about the poor API design of ISO C string functions, which dates back very early. (That's separate from the strncpy design that makes it suck for avoiding buffer overflows by not forcing the destination to be zero-terminated.)
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source
---|---
Solution 1 | ecm