Why does ctypes.cast() appear to trigger a memory leak?

Using Python 3.9.9 (on Windows 10), I've been experiencing "Out of memory"-related issues for an application that makes heavy use of ctypes.

I was able to boil these problems down to a simple reproducer, which very quickly triggers a similar MemoryError:

import ctypes

if __name__ == '__main__':
    
    i = 0
    while True:
        print("i = %d" % (i))
        i = i + 1
        barray = bytearray(10485760)
        ubuffer = (ctypes.c_char * len(barray)).from_buffer(barray)
        c_ptr = ctypes.cast(ubuffer, ctypes.POINTER(ctypes.c_char))

On my system, after roughly 350 iterations of the while loop, a MemoryError is triggered:

...
i = 336
i = 337
i = 338
i = 339
i = 340
i = 341
Traceback (most recent call last):
  File "C:\..\crash_reproducer.py", line 9, in <module>
    barray = bytearray(10485760)
MemoryError

Can someone help to explain what is going on here?

Secondly, once I remove the ctypes.cast call, I no longer experience a MemoryError. Any idea why that would be?



Solution 1:[1]

I'm going to answer your question by making several assertions, which I will then support below:

  1. What determines whether your process, as currently written, will stop running is not the operating system it runs on, nor even whether the process is 32-bit or 64-bit, but rather the maximum size the process is allowed to reach before it fails to allocate more memory or is killed.

  2. Even in an environment such as yours where the program crashes, the reason for the crash is not a memory leak but rather that garbage collection has simply not happened early enough to prevent it.

  3. The growth of the process is caused primarily by the fact that, as the program is currently written, the bytearray objects assigned to barray on each pass through the loop are not actually freed until garbage collection occurs, and each bytearray holds a large amount of memory, fixed when the bytearray is created.

  4. The reason garbage collection happens too late on some platforms is that the logic for garbage collection does not take into account the total memory held by the bytearray.

  5. The reason the statement that uses ctypes.cast introduces the crash in your case is that each time that statement runs, it creates a new reference cycle of Python objects, plus a chain of references leading from one of the objects in the cycle to the bytearray. As the program is written, none of the objects in a given cycle can be freed until the cycle is broken, which in turn means that the reference count of the bytearray held via that cycle never drops to 0. I'll show below what the cycle is and what the chain of references leading to the bytearray looks like.

To support assertion (1), it suffices to show that reducing the maximum virtual memory allowed to the process makes it stop prematurely. On my Linux system, a 64-bit process will not stop with your program, but if I reduce the maximum size it will:

$ (ulimit -v 131072; python3 usectypes.py)
i = 0
i = 1
i = 2
i = 3
i = 4
i = 5
i = 6
i = 7
i = 8
i = 9
i = 10
Traceback (most recent call last):
  File "usectypes.py", line 9, in <module>
    barray = bytearray(10485760)
MemoryError

To support assertion (2), it suffices to show that making garbage collection happen earlier prevents the crash. One can do this by adding two lines to the program from the question, running it, and observing that it no longer crashes. I'll leave running the program as an exercise for the reader, but here is what it looks like with the two extra lines:

import ctypes
import gc

if __name__ == '__main__':
    
    i = 0
    while True:
        print("i = %d" % (i))
        i = i + 1
        barray = bytearray(10485760)
        ubuffer = (ctypes.c_char * len(barray)).from_buffer(barray)
        c_ptr = ctypes.cast(ubuffer, ctypes.POINTER(ctypes.c_char))
        gc.collect()

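To see the accumulation directly rather than waiting for a crash, one can track allocated bytes with tracemalloc before and after a manual collection. This is a sketch, not part of the original analysis; exact byte counts will vary by platform and Python version, and gc is temporarily disabled so that an automatic collection cannot interfere with the measurement:

```python
import ctypes
import gc
import tracemalloc

gc.disable()          # keep automatic collection from interfering with the demo
tracemalloc.start()

for i in range(5):
    barray = bytearray(10 * 1024 * 1024)
    ubuffer = (ctypes.c_char * len(barray)).from_buffer(barray)
    c_ptr = ctypes.cast(ubuffer, ctypes.POINTER(ctypes.c_char))

# All five 10 MiB buffers are still alive: the cycles keep them reachable
# even though barray, ubuffer and c_ptr were rebound four times.
before, _ = tracemalloc.get_traced_memory()

gc.collect()          # breaks the cycles from the first four iterations
gc.enable()

# Only the buffer from the last iteration (still bound to barray) remains.
after, _ = tracemalloc.get_traced_memory()

print(f"before collect: {before:,} bytes; after collect: {after:,} bytes")
```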
To support assertion (3), one can (as an exercise for the reader) decrease the constant on the line that creates the bytearray and observe that the program then runs even as a 32-bit Windows process, or increase that constant and observe that the program terminates prematurely even as a 64-bit Linux process. Here is the line in question:

        barray = bytearray(10485760)

So, for example, you should be able to make your program not run out of memory as a 32-bit Windows process by changing the line to:

        barray = bytearray(1048576)

Similarly, I made the program, running in a 64-bit Python process on my Linux system, terminate on the 15th pass through the loop, even without using ulimit to artificially reduce the virtual memory allowed to it, by changing that line as shown:

        barray = bytearray(1048576000)

The above doesn't fully prove assertion (4), in that it doesn't actually look into the Python source to show that the garbage collection logic doesn't account at all for the size taken by the bytearray. However, it at least strongly suggests that the logic doesn't account sufficiently for that size, because one can take the Linux case and make it stop working only after multiple passes through the loop. If the constant used above had been too large for even a single allocation, the program would have failed the first time through the loop. The code for CPython is open source, and one can verify this assertion further by reading it, but I won't show that here.
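One quick way to see that collection scheduling is count-based rather than byte-based is to inspect the generational thresholds, which count gc-tracked container allocations, not bytes. A small sketch (the default values shown are typical for CPython but may differ between versions):

```python
import gc

# The three thresholds count allocations minus deallocations of
# gc-tracked container objects; the sizes of buffers they hold,
# such as a bytearray's 10 MiB payload, play no role at all.
print(gc.get_threshold())   # typically (700, 10, 10) in CPython

# Lowering the generation-0 threshold is another way, besides calling
# gc.collect() explicitly, to make cyclic garbage get collected sooner.
gc.set_threshold(50, 10, 10)
```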

To show assertion (5) I ran the original program for several minutes on Linux, where it didn't crash for me, gathered a live core of the process using gcore and analyzed the core with the open source tool chap.

I started it in the background and ignored the output, because there are ways to recover the value of i from the core if one needs it, and I gave it more than 3 minutes to run:

$ python3 usectypes.py >/dev/null &
[1] 658100
$ sleep 180

To get a suitable core, I set coredump_filter for that process so that I'd get all the sections I needed in the core then created the core using gcore.

$ echo 0x37 >/proc/658100/coredump_filter
$ sudo gcore 658100
[sudo] password for tim: 
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007ffb381f8a51 in __memset_avx2_erms () from /lib/x86_64-linux-gnu/libc.so.6
warning: target file /proc/658100/cmdline contained unexpected null characters
warning: Memory read failed for corefile section, 4096 bytes at 0xffffffffff600000.
Saved corefile core.658100
[Inferior 1 (process 658100) detached]
$ 

After I had the core, I started chap and used two commands to show that, out of the 3,800,186,880 bytes of memory used by the process, 3,764,390,712 bytes were used by allocations of size 0xa00008. Since 0xa00008 is 10 MiB + 8, these are almost certainly associated with the bytearray instances, supporting the theory that they are not freed until garbage collection happens.

$ chap core.658100*
chap> count writable
27 writable ranges use 0xe2824000 (3,800,186,880) bytes.
chap> summarize used /minsize a00000
Unrecognized allocations have 359 instances taking 0xe0600b38(3,764,390,712) bytes.
   Unrecognized allocations of size 0xa00008 have 359 instances taking 0xe0600b38(3,764,390,712) bytes.
359 allocations use 0xe0600b38 (3,764,390,712) bytes.

I used the following command to sample those large allocations:

chap> describe used /minsize a00000 /geometricSample 100
Anchored allocation at 212d4b0 of size a00008

Anchored allocation at 3ff2dae0 of size a00008

2 allocations use 0x1400010 (20,971,536) bytes.

I picked one of these and looked at the allocations that referenced it, confirming that it was associated with a bytearray:

chap> describe exactincoming 212d4b0
Anchored allocation at 7ffb376addb0 of size c0
This allocation matches pattern ContainerPythonObject.
This has a PyGC_Head at the start so the real PyObject is at offset 0x10.
This has reference count 1 and python type 0x904080 (memoryview)

Anchored allocation at 7ffb376dc930 of size 40
This allocation matches pattern SimplePythonObject.
This has reference count 1 and python type 0x9005a0 (bytearray)

Anchored allocation at 7ffb376e32b0 of size 80
This allocation matches pattern ContainerPythonObject.
This has a PyGC_Head at the start so the real PyObject is at offset 0x10.
This has reference count 1 and python type 0x904220 (managedbuffer)

Anchored allocation at 7ffb376e3830 of size 80
This allocation matches pattern ContainerPythonObject.
This has a PyGC_Head at the start so the real PyObject is at offset 0x10.
This has reference count 1 and python type 0x211e830 (c_char_Array_10485760)

4 allocations use 0x200 (512) bytes.

One can look at [bytearrayobject.c in the source for CPython](https://github.com/python/cpython/blob/main/Objects/bytearrayobject.c) to see that the bytearray object regards itself as the sole owner of the big buffer that is the target of the ob_bytes field.

static void
bytearray_dealloc(PyByteArrayObject *self)
{
    if (self->ob_exports > 0) {
        PyErr_SetString(PyExc_SystemError,
                        "deallocated bytearray object has exported buffers");
        PyErr_Print();
    }
    if (self->ob_bytes != 0) {
        PyObject_Free(self->ob_bytes);
    }
    Py_TYPE(self)->tp_free((PyObject *)self);
}

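This ownership can also be seen from the Python side, since the buffer is accounted to the bytearray object itself. A quick check (not part of the original core analysis):

```python
import sys

ba = bytearray(10 * 1024 * 1024)
# getsizeof includes the ~10 MiB heap buffer the bytearray owns
# (the ob_bytes target), not just the small object header.
print(sys.getsizeof(ba))
```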
This means that to understand why the big buffers are being kept in memory, we need to understand why the corresponding bytearray objects are still in memory. One way to do that from chap is with a command like this:

chap> describe allocation 7ffb376dc930 /extend %SimplePythonObject<- /extend %ContainerPythonObject<- /extend %PyDictKeysObject<- /skipUnfavoredReferences true /commentExtensions true
Anchored allocation at 7ffb376dc930 of size 40
This allocation matches pattern SimplePythonObject.
This has reference count 1 and python type 0x9005a0 (bytearray)

# Allocation at 0x7ffb376dc930 is referenced by allocation at 0x7ffb376addb0.
Anchored allocation at 7ffb376addb0 of size c0
This allocation matches pattern ContainerPythonObject.
This has a PyGC_Head at the start so the real PyObject is at offset 0x10.
This has reference count 1 and python type 0x904080 (memoryview)

# Allocation at 0x7ffb376addb0 is referenced by allocation at 0x7ffb37722710.
Anchored allocation at 7ffb37722710 of size b0
This allocation matches pattern PyDictKeysObject.

# Allocation at 0x7ffb37722710 is referenced by allocation at 0x7ffb376dca30.
Anchored allocation at 7ffb376dca30 of size 40
This allocation matches pattern ContainerPythonObject.
This has a PyGC_Head at the start so the real PyObject is at offset 0x10.
This has reference count 1 and python type 0x90bf00 (dict)

# Allocation at 0x7ffb376dca30 is referenced by allocation at 0x7ffb376e3830.
Anchored allocation at 7ffb376e3830 of size 80
This allocation matches pattern ContainerPythonObject.
This has a PyGC_Head at the start so the real PyObject is at offset 0x10.
This has reference count 1 and python type 0x211e830 (c_char_Array_10485760)

# Allocation at 0x7ffb376e3830 is referenced by allocation at 0x7ffb37722710.
# Allocation at 0x7ffb37722710 was already visited.

# Allocation at 0x7ffb376dc930 is referenced by allocation at 0x7ffb376e32b0.
Anchored allocation at 7ffb376e32b0 of size 80
This allocation matches pattern ContainerPythonObject.
This has a PyGC_Head at the start so the real PyObject is at offset 0x10.
This has reference count 1 and python type 0x904220 (managedbuffer)

# Allocation at 0x7ffb376e32b0 is referenced by allocation at 0x7ffb376addb0.
# Allocation at 0x7ffb376addb0 was already visited.

6 allocations use 0x2f0 (752) bytes.

The above shows that a cycle holds the bytearray: the bytearray is held by a memoryview, which is held by a %PyDictKeysObject, which is held by a dict, which is held by a c_char_Array_10485760, which is cyclically referenced by the same %PyDictKeysObject. This is mostly unsurprising, because we were expecting a cycle, but one interesting detail is that the c_char_Array_10485760 type is associated with ubuffer rather than c_ptr. So even though the assignment to c_ptr was apparently necessary to create the cycle, the target of c_ptr is not actually part of it.

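The same cycle can be observed directly from Python via the `_objects` attribute that ctypes uses for keep-alive bookkeeping. Note that `_objects` is an internal implementation detail rather than a public API; this is a sketch of behavior observed on CPython 3.9:

```python
import ctypes

barray = bytearray(16)
ubuffer = (ctypes.c_char * len(barray)).from_buffer(barray)

# from_buffer stores a memoryview over barray in ubuffer's
# keep-alive dict, matching the chap reference chain above.
print([type(v).__name__ for v in ubuffer._objects.values()])

c_ptr = ctypes.cast(ubuffer, ctypes.POINTER(ctypes.c_char))

# cast shares that dict with the new pointer *and* inserts ubuffer
# into its own keep-alive dict, creating the reference cycle.
print(c_ptr._objects is ubuffer._objects)
print(any(v is ubuffer for v in ubuffer._objects.values()))
```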
To verify this, we need to see what the statement that assigns c_ptr actually does. One way to do this, without reading the ctypes code, is to gather a core before and after the statement and observe that the cycle was absent before it and present after it.

To do this we can modify the original program slightly to sleep before and after the assignment to c_ptr as shown:

import ctypes
import time

if __name__ == '__main__':
    
    i = 0
    while True:
        print("i = %d" % (i))
        i = i + 1
        barray = bytearray(10485760)
        ubuffer = (ctypes.c_char * len(barray)).from_buffer(barray)
        print("sleep after assign ubuffer")
        time.sleep(300)
        c_ptr = ctypes.cast(ubuffer, ctypes.POINTER(ctypes.c_char))
        print("sleep after assign c_ptr")
        time.sleep(300)

We can run the slightly modified program roughly as before, remembering to set the coredump_filter for it so that we get sufficient information in the core:

$ python3 usectypeswithsleeps.py &
[1] 76026
$ i = 0
sleep after assign ubuffer
echo 0x37 >/proc/76026/coredump_filter
$ sudo gcore 76026
[sudo] password for tim: 
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007fd4131a0faa in select () from /lib/x86_64-linux-gnu/libc.so.6
warning: target file /proc/76026/cmdline contained unexpected null characters
warning: Memory read failed for corefile section, 4096 bytes at 0xffffffffff600000.
Saved corefile core.76026
[Inferior 1 (process 76026) detached]
$ mv core.76026 core.76026_before_assign_c_ptr
$ sleep after assign c_ptr
$ sudo gcore 76026
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007fd4131a0faa in select () from /lib/x86_64-linux-gnu/libc.so.6
warning: target file /proc/76026/cmdline contained unexpected null characters
warning: Memory read failed for corefile section, 4096 bytes at 0xffffffffff600000.
Saved corefile core.76026
[Inferior 1 (process 76026) detached]
$ mv core.76026 core.76026_after_assign_c_ptr

Now it just remains to look at the two cores and see that the cycle did not exist in the first but did in the second.

Looking at the core from before the assignment, there is only one large buffer because we gathered the core before the first assignment of c_ptr and as expected it is referenced by a bytearray:

chap> describe used ? /minsize a00000 /extend ?@0<-%SimplePythonObject
Anchored allocation at 7fd411cc7010 of size a00ff0

Anchored allocation at 7fd412755cb0 of size 40
This allocation matches pattern SimplePythonObject.
This has reference count 2 and python type 0x9005a0 (bytearray)

2 allocations use 0xa01030 (10,489,904) bytes.

The bytearray is referenced in a few more places than the one we looked at in the larger core. For example, in this case the bytearray is currently the target of barray, which explains why it is referenced by the %PyDictKeysObject associated with the main function.

chap> describe incoming 7fd412755cb0 /skipUnfavoredReferences true
Anchored allocation at 2972e20 of size 248
This allocation matches pattern PyDictKeysObject.
"__name__" : "__main__"
"__file__" : "usectypeswithsleeps.py"

Anchored allocation at 7fd412751440 of size 50
This allocation matches pattern ContainerPythonObject.
This has a PyGC_Head at the start so the real PyObject is at offset 0x10.
This has reference count 0 and python type 0x90c0a0 (tuple)

Anchored allocation at 7fd4127a0630 of size 80
This allocation matches pattern ContainerPythonObject.
This has a PyGC_Head at the start so the real PyObject is at offset 0x10.
This has reference count 1 and python type 0x904220 (managedbuffer)

Anchored allocation at 7fd4127bcf30 of size c0
This allocation matches pattern ContainerPythonObject.
This has a PyGC_Head at the start so the real PyObject is at offset 0x10.
This has reference count 1 and python type 0x904080 (memoryview)

4 allocations use 0x3d8 (984) bytes.

As seen above, the bytearray is referenced by a memoryview, as expected and, as seen below, that memoryview is held by a %PyDictKeysObject, which is held by a dict.

chap> describe incoming 7fd4127bcf30 /skipUnfavoredReferences true
Anchored allocation at 7fd412765a80 of size b0
This allocation matches pattern PyDictKeysObject.

1 allocations use 0xb0 (176) bytes.
chap> describe incoming 7fd412765a80 /skipUnfavoredReferences true
Anchored allocation at 7fd41287edb0 of size 40
This allocation matches pattern ContainerPythonObject.
This has a PyGC_Head at the start so the real PyObject is at offset 0x10.
This has reference count 1 and python type 0x90bf00 (dict)

1 allocations use 0x40 (64) bytes.
chap> describe incoming 7fd41287edb0 /skipUnfavoredReferences true
Anchored allocation at 7fd41273ad30 of size 40
This allocation matches pattern ContainerPythonObject.
This has a PyGC_Head at the start so the real PyObject is at offset 0x10.
This has reference count 0 and python type 0x90c0a0 (tuple)

Anchored allocation at 7fd4127a0e30 of size 80
This allocation matches pattern ContainerPythonObject.
This has a PyGC_Head at the start so the real PyObject is at offset 0x10.
This has reference count 1 and python type 0x2968980 (c_char_Array_10485760)

2 allocations use 0xc0 (192) bytes.

However, at this point, when we look at the incoming references to the c_char_Array_10485760, we can see that it is referenced only by a local variable for the main function, meaning that it is not yet involved in a cycle.

chap> describe incoming 7fd4127a0e30 /skipUnfavoredReferences true
Anchored allocation at 2972e20 of size 248
This allocation matches pattern PyDictKeysObject.
"__name__" : "__main__"
"__file__" : "usectypeswithsleeps.py"

1 allocations use 0x248 (584) bytes.

When we look at the core from after the assignment to c_ptr, we can see that, in addition to that local variable reference, it has some others, including one from the %PyDictKeysObject at 0x7fd412765a80. Following that back, we can see that the c_char_Array_10485760 is now in a cycle:

chap> describe incoming 7fd4127a0e30 /skipUnfavoredReferences true
Anchored allocation at 2972e20 of size 248
This allocation matches pattern PyDictKeysObject.
"__name__" : "__main__"
"__file__" : "usectypeswithsleeps.py"

Anchored allocation at 7fd41273ad30 of size 40
This allocation matches pattern ContainerPythonObject.
This has a PyGC_Head at the start so the real PyObject is at offset 0x10.
This has reference count 0 and python type 0x90c0a0 (tuple)

Anchored allocation at 7fd412765a80 of size b0
This allocation matches pattern PyDictKeysObject.

Anchored allocation at 7fd4127c49e0 of size 1f0
This allocation matches pattern ContainerPythonObject.
This has a PyGC_Head at the start so the real PyObject is at offset 0x10.
This has reference count 0 and python type 0x906420 (frame)

4 allocations use 0x528 (1,320) bytes.
chap> describe incoming 7fd412765a80 /skipUnfavoredReferences true
Anchored allocation at 7fd41287edb0 of size 40
This allocation matches pattern ContainerPythonObject.
This has a PyGC_Head at the start so the real PyObject is at offset 0x10.
This has reference count 2 and python type 0x90bf00 (dict)

1 allocations use 0x40 (64) bytes.
chap> describe incoming 7fd41287edb0 /skipUnfavoredReferences true
Anchored allocation at 7fd4127a0e30 of size 80
This allocation matches pattern ContainerPythonObject.
This has a PyGC_Head at the start so the real PyObject is at offset 0x10.
This has reference count 2 and python type 0x2968980 (c_char_Array_10485760)

Anchored allocation at 7fd4127a0eb0 of size 80
This allocation matches pattern ContainerPythonObject.
This has a PyGC_Head at the start so the real PyObject is at offset 0x10.
This has reference count 1 and python type 0x2977010 (LP_c_char)

2 allocations use 0x100 (256) bytes.

So assertion 5 is now proven. One thing that sheds a bit of light on why the statement that assigned c_ptr was involved is that, just after the statement, the LP_c_char at 0x7fd4127a0eb0 (the target of c_ptr) was referencing the same dict as the c_char_Array_10485760. To me it seems like a mild bug that the cycle stays present even when c_ptr is reassigned and the LP_c_char that was previously its target is freed, but at least for now you can work around this by calling gc.collect() from time to time.
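As a further workaround sketch (relying on the same internal ctypes behavior analyzed above, so verify it on your Python version): casting from the integer address instead of from the ctypes object takes a different path that skips the keep-alive bookkeeping, so no cycle is created. The trade-off is that the resulting pointer no longer keeps the underlying buffer alive, so barray must outlive any use of the pointer:

```python
import ctypes

barray = bytearray(b"hello world")
ubuffer = (ctypes.c_char * len(barray)).from_buffer(barray)

# Casting from an int address does not insert ubuffer into its own
# keep-alive dict, so no reference cycle is created.
# CAUTION: c_ptr does NOT keep barray alive; barray must outlive c_ptr.
c_ptr = ctypes.cast(ctypes.addressof(ubuffer), ctypes.POINTER(ctypes.c_char))

print(c_ptr[0:5])
```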

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Tim Boddy