Prometheus increase not handling process restarts

I am trying to understand how Prometheus' increase() query function behaves across process restarts.

When there is a process restart within a 2m interval and I query:

sum(increase(my_metric_total[2m])) 

I get a value less than expected.

For example, in a simple experiment I mock:

  • 3 lcm_restarts
  • 1 process restart
  • 2 lcm_restarts

All within a 2 minute interval.

Upon querying:

sum(increase(lcm_restarts[2m])) 

I receive a value of ~4.5 when I am expecting 5.

[Image: lcm_restarts graph]

[Image: sum(increase(lcm_restarts[2m])) result]

Could someone please explain?



Solution 1:[1]

Pretty concise and well-prepared first question here. Please keep this spirit!

When working with counters, functions such as rate(), irate() and increase() adjust for counter resets caused by restarts. Contrary to what the name suggests, increase() does not calculate the absolute increase in the given time frame; it is another way of writing rate(metric[interval]) * number_of_seconds_in_interval. The rate() function takes the first and the last measurement in the window, calculates the per-second increase between them, and extrapolates it to the full window. This is why you may observe non-integer increases even if the counter only ever increases by whole numbers: the measurements are almost never exactly at the start and end of the interval.
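
To make the reset adjustment and the extrapolation concrete, here is a minimal Python sketch. It is a simplified model and not Prometheus' actual code: the function name counter_increase, the 15-second scrape interval and the sample values are assumptions for illustration, and the real implementation additionally limits extrapolation when samples are far from the window edges or the counter is close to zero.

```python
def counter_increase(samples, window_start, window_end):
    """Simplified model of increase() over one series.

    samples: time-sorted (timestamp_seconds, value) pairs inside the window.
    """
    if len(samples) < 2:
        return None  # an increase needs at least two samples

    (t_first, v_first), (t_last, v_last) = samples[0], samples[-1]

    # Counter-reset adjustment: whenever the value drops, assume the
    # counter restarted from zero and add back the value seen before.
    delta = v_last - v_first
    prev = v_first
    for _, value in samples[1:]:
        if value < prev:
            delta += prev
        prev = value

    # Extrapolation: scale the increase seen between the first and last
    # sample up to the full window length.
    sampled_span = t_last - t_first
    return delta * (window_end - window_start) / sampled_span


# Hypothetical scrapes every 15 s: three increments, a restart, two increments.
samples = [(10, 7), (25, 8), (40, 9), (55, 10), (70, 10), (85, 0), (100, 1), (115, 2)]
result = counter_increase(samples, window_start=0, window_end=120)
print(round(result, 3))  # 5.714
```

The reset-adjusted increase here is exactly 5, but the samples only cover 105 of the 120 seconds, so scaling to the full window gives a fractional result.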

For more details about this, please have a look at the Prometheus docs for the increase() function. There are also some good hints on what to do and what not to do when working with counters on the Robust Perception blog.

Looking at your label dimensions, I also think that counter resets don't apply to your constructed example. The label called reason changed across the restart, which created a second time series instead of continuing the existing one. So you are effectively summing the increases of two different time series, each of which is extrapolated on its own.
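
The effect of the label change can be sketched in Python as follows. This is an illustration under assumptions, not the asker's real data: the values "startup" and "crash" for the reason label and the sample values are hypothetical, and extrapolation is left out so that only the series split is visible.

```python
from collections import defaultdict


def reset_aware_increase(values):
    """Raw increase of one series, treating any drop as a restart from zero."""
    total = 0
    for prev, cur in zip(values, values[1:]):
        total += cur - prev if cur >= prev else cur
    return total


def sum_of_increases(samples):
    """Group samples into series by their label set, then sum per-series increases."""
    series = defaultdict(list)
    for labels, value in samples:
        series[tuple(sorted(labels.items()))].append(value)
    return sum(reset_aware_increase(values) for values in series.values())


# Same label set before and after the restart: one series with a counter reset.
one_series = [({"reason": "startup"}, v) for v in (1, 2, 3, 1, 2)]

# The label changes at the restart: two independent series, no reset involved.
split_series = ([({"reason": "startup"}, v) for v in (1, 2, 3)]
                + [({"reason": "crash"}, v) for v in (1, 2)])

print(sum_of_increases(one_series))    # 4
print(sum_of_increases(split_series))  # 3
```

In both cases the first sample of a series contributes nothing, so the raw total stays below the five events that actually happened; with split series this happens once per series.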

So basically there isn't really anything wrong with what you are doing; you just shouldn't rely on getting highly precise numbers out of Prometheus for your use case.

Solution 2:[2]

Prometheus may return unexpected results from the increase() function for the following reasons:

  • Prometheus may return fractional results from increase() over integer counter because of extrapolation. See this issue for details.
  • Prometheus may return lower than expected results from increase(m[d]) because it doesn't take into account possible counter increase between the last raw sample just before the specified lookbehind window [d] and the first raw sample inside the lookbehind window [d]. See this article and this comment for details.
  • Prometheus skips the increase for the first sample in a time series. For example, increase() over the following series of samples would return 1 instead of 11: 10 11 11. See these docs for details.
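
The last two points in the list above can be reproduced with a small Python sketch. It is a simplified model that ignores extrapolation; the function name raw_increase and the timestamps are assumptions for illustration, and only the sample values 10 11 11 come from the example above.

```python
def raw_increase(samples, window_start, window_end):
    """Increase computed only from the samples that fall inside the window."""
    inside = [v for t, v in samples if window_start <= t <= window_end]
    if len(inside) < 2:
        return None  # nothing to compare against
    total = 0
    for prev, cur in zip(inside, inside[1:]):
        # A drop is treated as a counter reset: count the new value from zero.
        total += cur - prev if cur >= prev else cur
    return total


# A brand-new series whose first scraped value is already 10.
new_series = [(0, 10), (15, 11), (30, 11)]
first_sample_case = raw_increase(new_series, 0, 30)
print(first_sample_case)  # 1, the initial jump from 0 to 10 is not counted

# The counter goes from 10 to 12, but the sample at t=0 lies outside the window.
longer_series = [(0, 10), (15, 11), (30, 11), (45, 12)]
boundary_case = raw_increase(longer_series, 10, 50)
print(boundary_case)  # 1, the step from 10 to 11 across the window edge is lost
```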

These issues are going to be fixed according to this design doc. In the meantime, it is possible to use other Prometheus-like systems such as VictoriaMetrics, which are free from these issues.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution      Source
Solution 1    Andreas Jägle
Solution 2    valyala