'How to calculate containers' cpu usage in kubernetes with prometheus as monitoring?

I want to calculate the cpu usage of all pods in a kubernetes cluster. I found two metrics in prometheus may be useful:

container_cpu_usage_seconds_total: Cumulative cpu time consumed per cpu in seconds.
process_cpu_seconds_total: Total user and system CPU time spent in seconds.

Cpu Usage of all pods = increment per second of sum(container_cpu_usage_seconds_total{id="/"})/increment per second of sum(process_cpu_seconds_total)

However, I found every second's increment of container_cpu_usage{id="/"} larger than the increment of sum(process_cpu_seconds_total). So the usage may be larger than 1...



Solution 1:[1]

This I'm using to get CPU usage at cluster level:

sum (rate (container_cpu_usage_seconds_total{id="/"}[1m])) / sum (machine_cpu_cores) * 100

I also track the CPU usage for each pod.

sum (rate (container_cpu_usage_seconds_total{image!=""}[1m])) by (pod_name)

I have a complete kubernetes-prometheus solution on GitHub, maybe can help you with more metrics: https://github.com/camilb/prometheus-kubernetes

enter image description here

enter image description here

Solution 2:[2]

I created my own prometheus exporter (https://github.com/google-cloud-tools/kube-eagle), primarily to get a better overview of my resource utilization on a per node basis. But it also offers a more intuitive way monitoring your CPU and RAM resources. The query to get the cluster wide CPU usage would look like this:

sum(eagle_pod_container_resource_usage_cpu_cores)

But you can also easily get the CPU usage by namespace, node or nodepool.

Solution 3:[3]

The following query returns per-container average number of CPUs used during the last 5 minutes:

rate(container_cpu_usage_seconds_total{container!~"POD|"}[5m])

The lookbehind window in square brackets (5m in the case above) can be changed to the needed value. See possible time duration values here.

The container!~"POD|" filter removes metrics related to cgroups hierarchy (see this answer for more details) and metrics for e.g. pause containers (see these docs).

Since each pod can contain multiple containers, then the following query can be used for returning per-pod average number of CPUs used during the last 5 minutes:

sum(
  rate(container_cpu_usage_seconds_total{container!~"POD|"}[5m])
) by (namespace,pod)

Solution 4:[4]

Well you can use below query as well:

avg (rate (container_cpu_usage_seconds_total{id="/"}[1m]))

Solution 5:[5]

I prefer to use this metric per doc

sum(rate(container_cpu_usage_seconds_total{name!~".*prometheus.*", image!="", container_name!="POD"}[5m])) by (pod_name, container_name) /
sum(container_spec_cpu_quota{name!~".*prometheus.*", image!="", container_name!="POD"}/container_spec_cpu_period{name!~".*prometheus.*", image!="", container_name!="POD"}) by (pod_name, container_name)

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2 kentor
Solution 3
Solution 4 slm
Solution 5 zangw