Does gcc initialize long strings to `""` but not short ones?

Note: I know that reading an uninitialized string is undefined behaviour. This question is specifically about the GCC implementation.

I am using GCC version 6.2.1 and I have observed that uninitialized strings of length greater than 100 or so are initialized to "". Reading an uninitialized string is undefined behaviour, so the compiler is free to set it to "" if it wants to, and it seems that GCC is doing this when the string is long enough. Of course I would never rely on this behaviour in production code - I am just curious about where this behaviour comes from in GCC. If it's not in the GCC code somewhere then it's a very strange coincidence that it keeps happening.

If I write the following program

/* string_initialization.c */
#include <stdio.h>

int main()
{
  char short_string[10];
  char long_string[100];
  char long_long_string[1000];

  printf("%s\n", short_string);
  printf("%s\n", long_string);
  printf("%s\n", long_long_string);

  return(0);
}

and compile and run it with GCC, I get:

$ ./string_initialization
�QE�


$

(sometimes the first string is empty as well). This suggests that if a string is long enough, then GCC will initialize it to "", but otherwise it will not always do so.

If I compile the following program with GCC and run it:

#include <stdio.h>

int main()
{
  char long_string[100];
  int i;

  for (i = 0 ; i < 100 ; ++i)
  {
    printf("%d ", long_string[i]);
  }
  printf("\n");

  return(0);
}

then I get

0 0 0 0 0 0 0 0 -1 -75 -16 0 0 0 0 0 -62 0 0 0 0 0 0 0 15 84 -42 -17 -4 127 0 0 14 84 -42 -17 -4 127 0 0 69 109 79 -50 46 127 0 0 1 0 0 0 0 0 0 0 -35 5 64 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 -112 5 64 0 0 0 0 0 80 4 64 0 0 0 0 0 16 85 -42 -17 

so just the start of the string is being initialized to 0, not the whole thing.

I'd like to look into the GCC source code to see what the policy is, but I don't know that code base well enough to know where to look.

Background: My CS student turned in some work in which they declared a string to have length 1000 because "otherwise strange symbols are printed". You can probably guess why. I want to be able to give them a good answer as to why this was going on and why their "fix" worked.

Update: Thanks to those of you who gave useful answers. I've just found out that my computer prints out an empty string if the string is of length 1000, but garbage if the string is of length 960. See pts's answer for a good explanation. Of course, all this is completely system-dependent and is not part of GCC.



Solution 1:[1]

As others have commented before, reading uninitialized data (e.g. elements of short_string) is undefined behavior according to the C standard.

If you are interested in what actually happens when compiling it by GCC and running it on Linux, here are some insights.

main is not the first function which gets run when your program starts. The entry point is usually called _start, and it calls main. What is on the stack in these uninitialized arrays when main is running depends on what has been put there before, i.e. what _start has done before calling main. What _start does depends on GCC and the libc.

To figure out what actually happens, you may want to compile your program with gcc -static -g, and run it in a debugger, something like this:

$ gcc -static -g -o myprog myprog.c
$ gdb ./myprog
(gdb) b _start
(gdb) run
(gdb) s

Instead of s you may want to issue other GDB commands to get the disassembly of _start, and run it instruction-by-instruction.

One possible explanation for why your program reads more 0s from an uninitialized long array than from an uninitialized short one is that the stack was (mostly) all 0s before _start started running. _start then overwrote some bytes of the stack, but the beginning of the long array lies in a part of the stack that _start did not overwrite, so it is still all 0s. Use a debugger to confirm.
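
This explanation can be modelled without invoking undefined behavior. The following is a minimal sketch, not what GCC or the libc actually do: it treats the stack as an ordinary zeroed buffer, lets hypothetical startup code dirty the topmost bytes, and then looks at where the first byte of a local array of a given length would fall. The names model_stack, model_startup and first_byte_of_local, and the sizes STACK_SIZE and DIRTY_DEPTH, are all assumptions chosen for illustration.

```c
#include <stddef.h>
#include <string.h>

#define STACK_SIZE  4096  /* hypothetical size of the modelled stack region */
#define DIRTY_DEPTH 976   /* hypothetical number of bytes used by startup code */

/* The modelled stack; the "top" is the highest address, and it grows downward. */
static unsigned char model_stack[STACK_SIZE];

/* Model a fresh, all-zero stack whose top DIRTY_DEPTH bytes were
   overwritten by code that ran before main. */
void model_startup(void)
{
    memset(model_stack, 0, sizeof model_stack);
    memset(model_stack + STACK_SIZE - DIRTY_DEPTH, 0xAA, DIRTY_DEPTH);
}

/* Return the first byte of a local array of len bytes placed at the top
   of the modelled stack. Caller must pass 1 <= len <= STACK_SIZE. */
unsigned char first_byte_of_local(size_t len)
{
    return model_stack[STACK_SIZE - len];
}
```

With these made-up numbers, after calling model_startup() a 10-byte or 960-byte array starts inside the dirtied region, while a 1000-byte array starts in bytes that are still zero, so printing it with %s would show an empty string. The real depth depends entirely on the system and has to be checked in a debugger.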

You may also be interested in reading data from uninitialized global arrays. These arrays are guaranteed to be initialized to 0 by the C standard, and this is implemented by GCC putting them into the .bss section. See "how about .bss section not zero initialized" for how .bss is initialized.
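
As a minimal sketch of that guarantee, the following checks that a global array and a static local array contain only zeros, even though neither has an initializer. The names global_string, all_zero and static_local_is_zeroed are hypothetical, chosen for illustration.

```c
#include <stddef.h>

/* Static storage duration and no initializer: zero-initialized by the C standard. */
char global_string[1000];

/* Return 1 if every byte of buf[0..len) is zero, 0 otherwise. */
int all_zero(const char *buf, size_t len)
{
    for (size_t i = 0; i < len; ++i) {
        if (buf[i] != 0) {
            return 0;
        }
    }
    return 1;
}

/* A static local array also has static storage duration, so it is zeroed too. */
int static_local_is_zeroed(void)
{
    static char static_string[10];
    return all_zero(static_string, sizeof static_string);
}
```

Calling all_zero(global_string, sizeof global_string) and static_local_is_zeroed() from main should both return 1 on a conforming implementation, regardless of the array length.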

Solution 2:[2]

GCC doesn't initialize those strings, at all, ever. You are just seeing that the stack happens to contain zeros and you are imagining that is some intentional behaviour of the compiler. It isn't.

Compare your results with http://coliru.stacked-crooked.com/a/38f3e70be871af61 which shows that even if the first few bytes of the array happen to be zero the first time the function is called, the bytes are not zero the second time (because I made the stack dirty, and the compiler doesn't initialize the array).

You cannot assume that some undefined behaviour is reliable, repeatable or intentional. That's a very dangerous assumption.
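
If the goal is an array that reliably prints as an empty string, it has to be initialized explicitly; making it longer does not help. A minimal sketch (the function names are hypothetical, chosen for illustration):

```c
#include <stddef.h>
#include <string.h>

/* Length of a short string that was explicitly initialized to "". */
size_t initialized_string_length(void)
{
    char short_string[10] = "";  /* every element is set to 0 */
    return strlen(short_string);
}

/* Count the nonzero bytes in an array initialized with {0}. */
size_t nonzero_bytes_after_init(void)
{
    char buffer[10] = {0};       /* remaining elements are zeroed as well */
    size_t count = 0;
    for (size_t i = 0; i < sizeof buffer; ++i) {
        if (buffer[i] != 0) {
            ++count;
        }
    }
    return count;
}
```

Both functions return 0 no matter what the stack contained beforehand, because the initializer makes the contents defined.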

Solution 3:[3]

The answer is simple: this happens because of undefined behavior caused by reading the values of uninitialized variables.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Community
Solution 2 Jonathan Wakely
Solution 3