Prometheus alerting rule not detecting first-time metric increase
I have one counter metric, error_in_execution. Whenever the error occurs, counter.inc(); is called.
I have the following alert expression, which should trigger when the counter increases.
expr: increase(error_in_execution[5m]) > 0
for: 5m
Now the issue is that when the metric does not exist yet and an error occurs for the first time, the counter value goes to 1. This is not detected by the alert expression, so the alert does not trigger. Only when the counter increases to 2 does the alert trigger.
The following example should make this easier to understand.
Time 0:
Prometheus: error_in_execution --> No metric exists.
Alert: increase(error_in_execution[5m]) > 0 --> Not triggered
Time 1: Error occurs [error_in_execution.inc()]
Prometheus: error_in_execution --> 1
Alert: increase(error_in_execution[5m]) > 0 --> Still not triggered <<<<<< It should be triggered. (Please help here)
Time 2: Error occurs [error_in_execution.inc()]
Prometheus: error_in_execution --> 2
Alert: increase(error_in_execution[5m]) > 0 --> Alert triggered.
Solution 1:[1]
This is "normal" behaviour. If the metric did not exist before and is then initialized with the value 1, that first value is not taken into account by functions like increase() or rate().
To catch the very first error, you need to make sure that the metric exists from the moment your application starts, with the initial value 0. Then the first increment will trigger your alert.
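The effect can be illustrated with a minimal Python sketch. It is only a simplified stand-in for increase(): the helper name simple_increase is hypothetical, and unlike the real Prometheus function it does no extrapolation and ignores counter resets. It only models the fact that at least two samples are needed inside the window before any result is produced.

```python
from typing import Optional, Sequence


def simple_increase(samples: Sequence[Optional[float]]) -> Optional[float]:
    """Simplified model of increase(): last sample minus first sample.

    None marks a scrape at which the series did not exist yet.
    Assumption: no extrapolation and no counter-reset handling.
    """
    present = [s for s in samples if s is not None]
    if len(present) < 2:
        # Fewer than two samples in the window: no result at all,
        # so a comparison such as `> 0` can never match.
        return None
    return present[-1] - present[0]


# Counter created lazily on the first error: only one sample in the window.
lazy = [None, None, 1]
# Counter exported with value 0 from application start, then incremented once.
eager = [0, 0, 1]

print(simple_increase(lazy))   # None
print(simple_increase(eager))  # 1
```

With the lazily created counter the window yields no value, so the alert stays silent until a second sample arrives; with the counter initialized to 0 the first increment is visible immediately.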
Solution 2:[2]
I think I found a workaround for this.
For counters that existed before t, increase(_metric_[t]) is equivalent to _metric_ - _metric_ offset t. (It is not exactly equivalent, but that is a different issue.)
For counters that did not exist before t, the increase is simply the metric's value: _metric_ - 0 = _metric_.
We can find out whether a metric existed at point t by querying _metric_ offset t. And we can use that as a WHERE NOT EXISTS filter by means of the unless operator.
Putting it together, we get the following query:
( _metric_ unless _metric_ offset 1d ) or ( _metric_ - _metric_ offset 1d )
^------------new counters------------^    ^-------existing counters-------^
Example
One event happens in each time frame, and we want to measure the increase over 2 time frames.
Expected:
- none for each query frame before the first occurrence
- one for the query frame on first occurrence
- 2 for each query frame beyond the first occurrence
                          t0  t1  t2  t3  t4  t5
_metric_                  -   -   1   2   3   4
_metric_ offset 2t        -   -   -   -   1   2
__ unless __ offset 2t    -   -   1   2   -   -
__ <minus> __ offset 2t   -   -   -   -   2   2
================================================
() or ()                  -   -   1   2   2   2
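The table above can be reproduced with a small Python sketch. This is only a model of the idea, not PromQL itself: the helpers offset, unless, minus and or_ are hypothetical names, each series is a plain list with one value per time step (None meaning "no sample"), and label matching between series is ignored.

```python
from typing import List, Optional

Series = List[Optional[int]]  # one value per time step, None = no sample


def offset(series: Series, steps: int) -> Series:
    """Value the series had `steps` time steps earlier."""
    return [None] * steps + series[: len(series) - steps]


def unless(left: Series, right: Series) -> Series:
    """Keep left values only where right has no sample."""
    return [l if r is None else None for l, r in zip(left, right)]


def minus(left: Series, right: Series) -> Series:
    """Subtract only where both sides have a sample."""
    return [
        l - r if l is not None and r is not None else None
        for l, r in zip(left, right)
    ]


def or_(left: Series, right: Series) -> Series:
    """Take the left value if present, otherwise the right one."""
    return [l if l is not None else r for l, r in zip(left, right)]


metric: Series = [None, None, 1, 2, 3, 4]  # t0 .. t5
shifted = offset(metric, 2)                # models `_metric_ offset 2t`

new_counters = unless(metric, shifted)
existing_counters = minus(metric, shifted)
result = or_(new_counters, existing_counters)

print(new_counters)       # [None, None, 1, 2, None, None]
print(existing_counters)  # [None, None, None, None, 2, 2]
print(result)             # [None, None, 1, 2, 2, 2]
```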
Grafana example graph
total is the raw counter value, increase is the result of the query. It is still split into two series because the metric name is dropped by the - operation, but not by unless. But summing them up again works well, and is something you will probably do anyway.
Grafana graph with sum
It's really a shame that Prometheus makes this so hard for everyone who uses it for more than displaying CPU temperature. This is one of those instances where my pride at having found a solution is only surpassed by my exasperation that it was necessary in the first place.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | Jens Baitinger |
Solution 2 | lazySaur |