Why are SCHED_FIFO threads assigned to the same physical CPU even though idle CPUs are available?
While debugging a performance issue in an app I'm working on, I found weird behaviour in the kernel scheduler. It seems that busy SCHED_FIFO tasks tend to be scheduled on logical cores of the same physical CPU even though there are idle physical CPUs in the system.
PID   USER  PR  NI VIRT  RES SHR  S %CPU %MEM TIME+    P  COMMAND
8624 root -81 0 97.0g 49g 326m R 100 52.7 48:13.06 26 Worker0 <-- CPU 6 and 26
8629 root -81 0 97.0g 49g 326m R 100 52.7 44:56.26 6 Worker5 <-- the same physical core
8625 root -81 0 97.0g 49g 326m R 82 52.7 58:20.65 23 Worker1
8627 root -81 0 97.0g 49g 326m R 67 52.7 55:28.86 27 Worker3
8626 root -81 0 97.0g 49g 326m R 67 52.7 46:04.55 32 Worker2
8628 root -81 0 97.0g 49g 326m R 59 52.7 44:23.11 5 Worker4
Initially the threads shuffle between cores, but at some point the most CPU-intensive threads end up locked on the same physical core and don't seem to move from there. No CPU affinity is set for the Worker threads.
I tried to reproduce it with a synthetic load by running 12 instances of:
chrt -f 10 yes > /dev/null &
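For reference, a minimal way to launch all 12 instances at once (assuming a bash shell and permission to set real-time priorities, e.g. running as root):

# start 12 SCHED_FIFO busy loops at priority 10
for i in $(seq 12); do
    chrt -f 10 yes > /dev/null &
done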
And here is what I got:
PID   USER  PR  NI VIRT RES SHR S %CPU %MEM TIME+   P  COMMAND
25668 root -11 0 2876 752 656 R 100 0.0 0:17.86 20 yes
25663 root -11 0 2876 744 656 R 100 0.0 0:19.10 25 yes
25664 root -11 0 2876 752 656 R 100 0.0 0:18.79 6 yes
25665 root -11 0 2876 804 716 R 100 0.0 0:18.54 7 yes
25666 root -11 0 2876 748 656 R 100 0.0 0:18.31 8 yes
25667 root -11 0 2876 812 720 R 100 0.0 0:18.08 29 yes <--- core9
25669 root -11 0 2876 744 656 R 100 0.0 0:17.62 9 yes <--- core9
25670 root -11 0 2876 808 720 R 100 0.0 0:17.37 2 yes
25671 root -11 0 2876 748 656 R 100 0.0 0:17.15 23 yes <--- core3
25672 root -11 0 2876 804 712 R 100 0.0 0:16.94 4 yes
25674 root -11 0 2876 748 656 R 100 0.0 0:16.35 3 yes <--- core3
25673 root -11 0 2876 812 716 R 100 0.0 0:16.68 1 yes
This is a server with 20 physical cores, so there are 8 physical cores left idle, yet threads are still scheduled onto the same physical core. This is reproducible and persistent. It doesn't seem to happen for non-SCHED_FIFO threads, and it started after migrating past kernel 4.19.
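For reference, the hyperthread sibling pairs can be checked from userspace; a quick sketch (the CPU number below is just an example taken from the output above):

# list which physical core and socket each logical CPU belongs to
lscpu -e=CPU,CORE,SOCKET

# or ask sysfs directly which logical CPUs share a core with CPU 9
cat /sys/devices/system/cpu/cpu9/topology/thread_siblings_list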
Is this correct behaviour for SCHED_FIFO threads? Is there any flag or config option that can change this scheduler behaviour?
Solution 1:[1]
If I'm understanding correctly, you're trying to use SCHED_FIFO with hyperthreading ("HT") enabled, which results in multiple logical processors per physical core. My understanding is that HT-awareness within the Linux kernel comes mainly through the load balancing and scheduler domains within CFS (the default scheduler these days). See https://stackoverflow.com/a/29587579/2530418 for more info.
Using SCHED_FIFO or SCHED_RR would then essentially bypass HT handling, since RT scheduling doesn't really go through CFS.
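To see which scheduler domains (SMT, MC, NUMA, ...) the kernel has actually built, you can inspect them from userspace; a rough sketch, assuming CONFIG_SCHED_DEBUG is enabled (the location moved from procfs to debugfs in newer kernels):

# older kernels (roughly pre-5.13)
grep . /proc/sys/kernel/sched_domain/cpu0/domain*/name

# newer kernels, with debugfs mounted
grep . /sys/kernel/debug/sched/domains/cpu0/domain*/name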
My approach to dealing with this in the past has been to disable hyperthreading. For cases where you actually need real-time behavior, this is usually the right latency/performance tradeoff to make anyway (see https://rt.wiki.kernel.org/index.php/HOWTO:_Build_an_RT-application#Hyper_threading). Whether this is appropriate really depends on what problem you're trying to solve.
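If you go that route, hyperthreading can also be turned off from software rather than in the BIOS; a sketch, assuming an x86 machine and a kernel new enough to have the SMT control knob (it appeared around 4.19):

# disable SMT at runtime; the sibling logical CPUs go offline
echo off | sudo tee /sys/devices/system/cpu/smt/control

# check the result
cat /sys/devices/system/cpu/smt/active

Alternatively, booting with the nosmt kernel parameter disables it persistently.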
Aside: I suspect that if you actually need SCHED_FIFO behavior then disabling HT is what you'll want to do, but it's also common for people to think they need SCHED_FIFO when it's the wrong tool for the job. My suspicion is that there may be a better option than SCHED_FIFO since you're describing running on a conventional server rather than an embedded system, but that's an over-generalizing guess. Hard to say without more specifics about the issue.
Solution 2:[2]
The problem was caused by this particular change: https://lkml.iu.edu/hypermail/linux/kernel/1806.0/04887.html
The per-CPU-core watchdog threads were removed. They used to run at the highest real-time priority:
watchdog_set_prio(SCHED_FIFO, MAX_RT_PRIO - 1);
Before that change, they ran every 4 seconds, and because they had the absolute highest priority, they caused periodic rescheduling. With them gone, there is nothing left that can preempt SCHED_FIFO threads and migrate them to a "better" core. So this was all just a side effect of the watchdog implementation. In general, there is no mechanism in the kernel that rebalances runaway RT threads.
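Since the kernel won't move them for you, one common workaround (a sketch, not part of the original answer; the PIDs and CPU numbers are just the ones from the question's top output) is to pin each RT worker to its own physical core explicitly:

# move Worker0 (PID 8624) and Worker5 (PID 8629) onto different physical cores,
# e.g. logical CPUs 6 and 7 (assuming those two are not hyperthread siblings)
sudo taskset -cp 6 8624
sudo taskset -cp 7 8629

# or start a worker already pinned and at SCHED_FIFO priority 10
# (./worker0 is a placeholder binary)
sudo taskset -c 6 chrt -f 10 ./worker0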
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | cha5on |
| Solution 2 | |