Occasional deadlock in multiprocessing.Pool
I have N independent tasks that are executed in a multiprocessing.Pool of size os.cpu_count() (8 in my case), with maxtasksperchild=1 (i.e. a fresh worker process is created for each task). The main script can be simplified to:
import os
import subprocess as sp
import multiprocessing as mp

def do_work(task: dict) -> dict:
    res = {}
    # ... work ...
    for i in range(5):
        # cmd is built from the task (elided here)
        out = sp.run(cmd, stdout=sp.PIPE, stderr=sp.PIPE, check=False, timeout=60)
        res[i] = out.stdout.decode('utf-8')
    # ... some more work ...
    return res

if __name__ == '__main__':
    tasks = load_tasks_from_file(...)  # list of dicts
    logger = mp.get_logger()
    results = []
    with mp.Pool(processes=os.cpu_count(), maxtasksperchild=1) as pool:
        for i, res in enumerate(pool.imap_unordered(do_work, tasks), start=1):
            results.append(res)
            logger.info('PROGRESS: %3d/%3d', i, len(tasks))
    dump_results_to_file(results)
The pool sometimes gets stuck. The traceback I get after a KeyboardInterrupt is here. It indicates that the pool won't fetch new tasks and/or that worker processes are stuck in a queue / pipe recv() call. I was unable to reproduce this deterministically while varying the configuration of my experiments; if I run the same code again, there is a chance it finishes gracefully.
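To get a Python-level view of where the stuck workers are (complementing the strace/gdb output below), one option is to register a faulthandler dump in each worker through the pool's initializer and signal the hung PID from a shell. This is a sketch of that idea, not part of the original script:

import faulthandler
import signal

def init_worker():
    # On SIGUSR1, dump the Python traceback of all threads in this worker
    # to stderr; trigger it from a shell with: kill -USR1 <worker pid>
    faulthandler.register(signal.SIGUSR1, all_threads=True)

# passed to the pool when it is created, e.g.:
# mp.Pool(processes=os.cpu_count(), maxtasksperchild=1, initializer=init_worker)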
Further observations:
- Python 3.7.9 on x64 Linux
- the start method for multiprocessing is fork (using spawn does not solve the issue; see the sketch after this list for how the start method is selected)
- strace reveals that the processes are stuck in a futex wait; gdb's backtrace also shows do_futex_wait.constprop
- disabling logging / explicit flushing does not help
- there is no bug in how a task is defined (i.e. they are all loadable)
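For reference, a sketch of how the start method can be selected when creating the pool (not the exact experiment code); get_context keeps the choice local to one pool instead of changing the process-wide default:

import multiprocessing as mp

if __name__ == '__main__':
    # Option 1: set the default start method once, before any pool exists
    # mp.set_start_method('spawn')

    # Option 2: request a context with the desired start method
    ctx = mp.get_context('spawn')   # or 'fork'
    with ctx.Pool(processes=8, maxtasksperchild=1) as pool:
        pass  # submit tasks as in the original script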
Update: It seems that the deadlock occurs even with a pool of size 1. strace reports that the process is blocked trying to acquire a lock located at 0x564c5dbcd000:
futex(0x564c5dbcd000, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY
and gdb confirms:
(gdb) bt
#0 0x00007fcb16f5d014 in do_futex_wait.constprop () from /usr/lib/libpthread.so.0
#1 0x00007fcb16f5d118 in __new_sem_wait_slow.constprop.0 () from /usr/lib/libpthread.so.0
#2 0x0000564c5cec4ad9 in PyThread_acquire_lock_timed (lock=0x564c5dbcd000, microseconds=-1, intr_flag=0)
at /tmp/build/80754af9/python_1598874792229/work/Python/thread_pthread.h:372
#3 0x0000564c5ce4d9e2 in _enter_buffered_busy (self=self@entry=0x7fcafe1e7e90)
at /tmp/build/80754af9/python_1598874792229/work/Modules/_io/bufferedio.c:282
#4 0x0000564c5cf50a7e in _io_BufferedWriter_write_impl.isra.2 (self=0x7fcafe1e7e90)
at /tmp/build/80754af9/python_1598874792229/work/Modules/_io/bufferedio.c:1929
#5 _io_BufferedWriter_write (self=0x7fcafe1e7e90, arg=<optimized out>)
at /tmp/build/80754af9/python_1598874792229/work/Modules/_io/clinic/bufferedio.c.h:396
Solution 1:
The deadlock occurred due to high memory usage in the workers: the OOM killer was triggered and abruptly terminated worker subprocesses, leaving the pool in an inconsistent state.
This script reproduces my original problem.
For the time being I am considering switching to a concurrent.futures.ProcessPoolExecutor, which raises a BrokenProcessPool exception when a worker terminates abruptly.
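A minimal sketch of what that switch could look like, reusing do_work, load_tasks_from_file and dump_results_to_file from the script above (BrokenProcessPool lives in concurrent.futures.process):

import os
from concurrent.futures import ProcessPoolExecutor
from concurrent.futures.process import BrokenProcessPool

if __name__ == '__main__':
    tasks = load_tasks_from_file(...)   # as in the original script
    results = []
    try:
        # note: no maxtasksperchild equivalent is used here
        with ProcessPoolExecutor(max_workers=os.cpu_count()) as ex:
            for i, res in enumerate(ex.map(do_work, tasks), start=1):
                results.append(res)
                print(f'PROGRESS: {i:3d}/{len(tasks):3d}')
    except BrokenProcessPool:
        # a worker died abruptly (e.g. killed by the OOM killer);
        # the executor fails loudly instead of deadlocking
        raise
    dump_results_to_file(results)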
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow