Why are SCHED_FIFO threads assigned to the same physical CPU even though idle CPUs are available?
While debugging a performance issue in an app I'm working on, I found weird behaviour in the kernel scheduler. It seems that busy SCHED_FIFO tasks tend to be scheduled on logical cores of the same physical CPU even though there are idle physical CPUs in the system.
PID   USER  PR  NI VIRT  RES SHR  S %CPU %MEM TIME+    P  COMMAND
8624 root -81 0 97.0g 49g 326m R 100 52.7 48:13.06 26 Worker0 <-- CPU 6 and 26
8629 root -81 0 97.0g 49g 326m R 100 52.7 44:56.26 6 Worker5 <-- the same physical core
8625 root -81 0 97.0g 49g 326m R 82 52.7 58:20.65 23 Worker1
8627 root -81 0 97.0g 49g 326m R 67 52.7 55:28.86 27 Worker3
8626 root -81 0 97.0g 49g 326m R 67 52.7 46:04.55 32 Worker2
8628 root -81 0 97.0g 49g 326m R 59 52.7 44:23.11 5 Worker4
Initially the threads shuffle between cores, but at some point the most CPU-intensive threads end up locked on the same physical core and don't seem to move from there. No CPU affinity is set for the Worker threads.
I tried to reproduce it with a synthetic load by running 12 instances of:
chrt -f 10 yes > /dev/null &
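For reference, a minimal way to launch all 12 instances at once (assuming a bash shell and permission to set real-time priorities, e.g. running as root):

# start 12 SCHED_FIFO busy loops at priority 10
for i in $(seq 12); do
    chrt -f 10 yes > /dev/null &
done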
And here is what I got:
PID   USER  PR  NI VIRT RES SHR S %CPU %MEM TIME+   P  COMMAND
25668 root -11 0 2876 752 656 R 100 0.0 0:17.86 20 yes
25663 root -11 0 2876 744 656 R 100 0.0 0:19.10 25 yes
25664 root -11 0 2876 752 656 R 100 0.0 0:18.79 6 yes
25665 root -11 0 2876 804 716 R 100 0.0 0:18.54 7 yes
25666 root -11 0 2876 748 656 R 100 0.0 0:18.31 8 yes
25667 root -11 0 2876 812 720 R 100 0.0 0:18.08 29 yes <--- core9
25669 root -11 0 2876 744 656 R 100 0.0 0:17.62 9 yes <--- core9
25670 root -11 0 2876 808 720 R 100 0.0 0:17.37 2 yes
25671 root -11 0 2876 748 656 R 100 0.0 0:17.15 23 yes <--- core3
25672 root -11 0 2876 804 712 R 100 0.0 0:16.94 4 yes
25674 root -11 0 2876 748 656 R 100 0.0 0:16.35 3 yes <--- core3
25673 root -11 0 2876 812 716 R 100 0.0 0:16.68 1 yes
This is a server with 20 physical cores, so there are 8 physical cores left idle, yet threads are still scheduled onto the same physical core. This is reproducible and persistent. It doesn't seem to happen for non-SCHED_FIFO threads, and it started after migrating past kernel 4.19.
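For reference, the hyperthread sibling pairs can be checked from userspace; a quick sketch (the CPU number below is just an example taken from the output above):

# list which physical core and socket each logical CPU belongs to
lscpu -e=CPU,CORE,SOCKET

# or ask sysfs directly which logical CPUs share a core with CPU 9
cat /sys/devices/system/cpu/cpu9/topology/thread_siblings_list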
Is this correct behaviour for SCHED_FIFO threads? Is there any flag or config option that can change this scheduler behaviour?
Solution 1:[1]
If I'm understanding correctly, you're trying to use SCHED_FIFO with hyperthreading ("HT") enabled, which results in multiple logical processors per physical core. My understanding is that HT-awareness within the Linux kernel comes mainly through the load balancing and scheduler domains within CFS (the default scheduler these days). See https://stackoverflow.com/a/29587579/2530418 for more info.
Using SCHED_FIFO or SCHED_RR would then essentially bypass HT handling, since RT scheduling doesn't really go through CFS.
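To see which scheduler domains (SMT, MC, NUMA, ...) the kernel has actually built, you can inspect them from userspace; a rough sketch, assuming CONFIG_SCHED_DEBUG is enabled (the location moved from procfs to debugfs in newer kernels):

# older kernels (roughly pre-5.13)
grep . /proc/sys/kernel/sched_domain/cpu0/domain*/name

# newer kernels, with debugfs mounted
grep . /sys/kernel/debug/sched/domains/cpu0/domain*/name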
My approach to dealing with this in the past has been to disable hyperthreading. For cases where you actually need real-time behavior, this is usually the right latency/performance tradeoff to make anyway (see https://rt.wiki.kernel.org/index.php/HOWTO:_Build_an_RT-application#Hyper_threading). Whether this is appropriate really depends on what problem you're trying to solve.
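If you go that route, hyperthreading can also be turned off from software rather than in the BIOS; a sketch, assuming an x86 machine and a kernel new enough to have the SMT control knob (it appeared around 4.19):

# disable SMT at runtime; the sibling logical CPUs go offline
echo off | sudo tee /sys/devices/system/cpu/smt/control

# check the result
cat /sys/devices/system/cpu/smt/active

Alternatively, booting with the nosmt kernel parameter disables it persistently.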
Aside: I suspect that if you actually need SCHED_FIFO behavior then disabling HT is what you'll want to do, but it's also common for people to think they need SCHED_FIFO when it's the wrong tool for the job. My suspicion is that there may be a better option than SCHED_FIFO since you're describing running on a conventional server rather than an embedded system, but that's an over-generalizing guess. Hard to say without more specifics about the issue.
Solution 2:[2]
The problem was caused by this particular change: https://lkml.iu.edu/hypermail/linux/kernel/1806.0/04887.html
The per-CPU-core watchdog threads were removed. They used to run at the highest real-time priority:
watchdog_set_prio(SCHED_FIFO, MAX_RT_PRIO - 1);
Before that change, they ran every 4 seconds, and because they had the absolute highest priority, they caused periodic rescheduling. With them gone, there is nothing left that can preempt SCHED_FIFO threads and migrate them to a "better" core. So this was all just a side effect of the watchdog implementation. In general, there is no mechanism in the kernel that rebalances runaway RT threads.
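Since the kernel won't move them for you, one common workaround (a sketch, not part of the original answer; the PIDs and CPU numbers are just the ones from the question's top output) is to pin each RT worker to its own physical core explicitly:

# move Worker0 (PID 8624) and Worker5 (PID 8629) onto different physical cores,
# e.g. logical CPUs 6 and 7 (assuming those two are not hyperthread siblings)
sudo taskset -cp 6 8624
sudo taskset -cp 7 8629

# or start a worker already pinned and at SCHED_FIFO priority 10
# (./worker0 is a placeholder binary)
sudo taskset -c 6 chrt -f 10 ./worker0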
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | cha5on |
| Solution 2 | |