CPU load average rule for 5 minutes

We are using Prometheus with Grafana and want to set up an alert on the 5-minute CPU load average.

We have 60 servers with different numbers of CPU cores: some machines have 1 core, others have 2, 6, 8, and so on.

The rule below fires on the 5-minute load average, but it does not take into account whether a machine is single-core or multi-core.

- name: alerting_rules
  rules:
    - alert: LoadAverage5m
      expr: node_load5 >= 0.75
      labels:
        severity: major
      annotations:
        summary: "Instance {{ $labels.instance }} - high load average"
        description: "{{ $labels.instance }} (measured by {{ $labels.job }}) has high load average ({{ $value }}) over 5 minutes."

I also tried the rule below, but it does not work either:

- alert: LoadAverage5minutes
  expr: node_load5/count(node_cpu{mode="idle"}) without (cpu,mode) >= 0.95
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Load average is high for 5 minutes (instance {{ $labels.instance }})"
    description: "Load is high \n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

Could you please tell me what changes my rule needs in order to work?

Thanks.



Solution 1:[1]

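The following expression should work. It divides the 5-minute load average by the number of CPU cores on each instance, so the 0.95 threshold means the same thing on single-core and multi-core machines. It also uses node_cpu_seconds_total, which is the name node_exporter has used for the node_cpu metric since version 0.16: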

expr: node_load5 / count by (instance, job) (node_cpu_seconds_total{mode="idle"}) >= 0.95
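
For context, here is a minimal sketch of how that expression could sit in a complete rule file. The group name, alert name, `for` duration, and severity are carried over from the question; the `groups:` wrapper and the wording of the annotations are assumptions, so adjust them to your own setup.

```yaml
groups:
  - name: alerting_rules          # group name taken from the question
    rules:
      - alert: LoadAverage5minutes
        # 5-minute load average divided by the number of CPU cores.
        # Counting the "idle" series yields one sample per core per instance.
        expr: node_load5 / count by (instance, job) (node_cpu_seconds_total{mode="idle"}) >= 0.95
        for: 5m                   # condition must hold for 5 minutes before firing
        labels:
          severity: warning
        annotations:
          summary: "Load average is high for 5 minutes (instance {{ $labels.instance }})"
          description: "Load per core is {{ $value }} on {{ $labels.instance }}"
```

Because both sides of the division carry only the `instance` and `job` labels, the series match one-to-one and `{{ $value }}` is the load per core rather than the raw load.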

Solution 2:[2]

The following query alerts when the average CPU usage over the last 5 minutes exceeds 95% on a particular instance:

avg(
  sum(
    rate(node_cpu_seconds_total{mode!="idle"}[5m])
  ) without (mode)
) without (cpu) > 0.95

Some applications cannot scale to multiple CPU cores. The query above will not notice such an application if the instance has more than one CPU core. For example, if an application can use only a single CPU core and runs on an instance with two CPU cores, the query above will not trigger, because the average CPU usage does not exceed 50%. For such cases, the following alerting query is recommended:

max(
  sum(
    rate(node_cpu_seconds_total{mode!="idle"}[5m])
  ) without (mode)
) without (cpu) > 0.95

This query alerts when at least one CPU core on a particular instance has been more than 95% busy over the last 5 minutes.
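
As a rough sketch, the two queries could be wrapped in alerting rules like this. The group name, alert names, severities, `for` duration, and annotation text are all hypothetical and are not part of the original answer; only the expressions come from it.

```yaml
groups:
  - name: cpu_usage_rules             # hypothetical group name
    rules:
      # Fires when CPU usage averaged over all cores exceeds 95%.
      - alert: CpuUsageHighAverage    # hypothetical alert name
        expr: |
          avg(
            sum(
              rate(node_cpu_seconds_total{mode!="idle"}[5m])
            ) without (mode)
          ) without (cpu) > 0.95
        for: 5m                       # assumed duration
        labels:
          severity: warning           # assumed severity
        annotations:
          summary: "Average CPU usage is above 95% (instance {{ $labels.instance }})"

      # Fires when any single core exceeds 95%, which also catches
      # single-threaded applications on multi-core machines.
      - alert: CpuUsageHighSingleCore # hypothetical alert name
        expr: |
          max(
            sum(
              rate(node_cpu_seconds_total{mode!="idle"}[5m])
            ) without (mode)
          ) without (cpu) > 0.95
        for: 5m                       # assumed duration
        labels:
          severity: warning           # assumed severity
        annotations:
          summary: "At least one CPU core is above 95% (instance {{ $labels.instance }})"
```

The only difference between the two expressions is the outer aggregation: `avg` reports usage across all cores, while `max` reports the busiest core.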

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Marcelo Ávila de Oliveira
Solution 2 valyala