'How to sum prometheus counters when k8s pods restart

I'm running Prometheus in a kubernetes cluster. All is running find and my UI pods are counting visitors.

enter image description here

Please ignore the title, what you see here is the query at the bottom of the image. It's a counter. The gaps in the graph are due to pods restarting. I have two pods running simultaneously!

Now suppose I would like to count the total of visitors, so I need to sum over all the pods

enter image description here

This is what I expect considering the first image, right?

However, I don't want the graph to drop when a pod restarts. I would like to have something cumulative over a specified amount of time (somehow ignoring pods restarting). Hope this makes any sense. Any suggestions?

UPDATE

Below is suggested to do the following

enter image description here

Its a bit hard to see because I've plotted everything there, but the suggested answer sum(rate(NumberOfVisitors[1h])) * 3600 is the continues green line there. What I don't understand now is the value of 3 it has? Also why does the value increase after 21:55, because I can see some values before that.

As the approach seems to be ok, I noticed that the actual increase is actually 3, going from 1 to 4. In the graph below I've used just one time series to reduce noise

enter image description here



Solution 1:[1]

Rate, then sum, then multiply by the time range in seconds. That will handle rollovers on counters too.

Solution 2:[2]

Prometheus doesn't provide the ability to sum counters, which may be reset. Additionally, the increase() function in Prometheus has some issues, which may prevent from using it for querying counter increase over the specified time range:

  • It may return fractional values over integer counters because of extrapolation. See this issue for details.
  • It may miss counter increase between raw sample just before the lookbehind window in square brackets and the first raw sample inside the lookbehind window. For example, increase(NumberOfVisitors[1m]) at timestamp t may miss the counter increase between the last raw sample just before the t-1m time and the first raw sample at (t-1m ... t] time range. See more details here and here.
  • It may miss the increase for the first raw sample in a time series. For example, if the NumberOfVisitors counter is increased to 10 just before the first scrape of this counter by Prometheus, then increase() over the time range with the first sample would under-count the counter increase by 10.

Prometheus developers are going to fix these issues - see this design doc. In the mean time it is possible to use VictoriaMetrics - its' increase() function is free from these issues.

Returning to the original question - the sum of multiple counters, which may be reset, can be returned with the following MetricsQL query in VictoriaMetrics:

running_sum(sum(increase(NumberOfVisitor)))

It uses the following functions:

  • increase() for calculating increase per each counter between adjacent points on the graph.
  • sum() for summing the calculated increases per each point on the graph.
  • running_sum() for calculating the running sum over per-point increases on the graph.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 coderanger
Solution 2 valyala