Silence Prometheus alerts based on label value / Ignore alerts from label

tl;dr

I have a label in Prometheus called "ignore" with value "yes":

metric_test{label1="label1",ignore="yes"} 1

I want to disable alerts for any metric carrying this label, without manually editing 500+ alert rules. The suppressed alerts should also not appear in the Prometheus web UI.

Is there a solution that does this natively?


I have various machines and services whose metrics are collected with exporters such as kubernetes_exporter or node_exporter.

I have an alert, Uptime, which fires when a machine goes down.

The alert applies to all the machines and looks like this:

- alert: Uptime
  expr: up{} == 0
  for: 2m
  labels:
    severity: critical

There are some machines that I don't care about. They are constantly turned off at unplanned times and aren't generally important, so I'd like to exclude them from the above alert.

What I found works is modifying the above expression with a matcher that excludes the ignored machines:

- alert: Uptime
  expr: up{ignore!="yes"} == 0
  for: 2m
  labels:
    severity: critical

So, say I have a Kubernetes namespace I don't care about: I can add namespace!="test" to the selector, and Prometheus keeps collecting the metrics BUT without firing alerts for them.

I found the same filter can also be applied by appending and up{ignore!="yes"} to the end of the expr, leaving the rest of the expression intact.

Awesome!
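A sketch of the appended form in a full rule file, under some assumptions: the group name, the second alert HighLoad, its threshold, and matching on the instance label are all hypothetical, and the ignore label is assumed to be attached to the target so that it shows up on up. For up itself a plain and works, because both sides have the same label set. For an alert on a different metric the label sets usually differ, so the match is restricted with on(...).

```yaml
groups:
  - name: example                      # hypothetical group name
    rules:
      - alert: Uptime
        # Both sides select `up`, so the default match on all labels works.
        expr: up == 0 and up{ignore!="yes"}
        for: 2m
        labels:
          severity: critical
      - alert: HighLoad                # hypothetical alert on another metric
        # Keep only series whose instance also has an `up` series
        # that is NOT labelled ignore="yes".
        expr: node_load1 > 5 and on(instance) up{ignore!="yes"}
        for: 5m
        labels:
          severity: warning
```

Either way, the clause still has to be added to every rule, which is the problem described next.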

This approach is discussed in "Disable alerting for a specific hosts, while alerting for all the others".

But there is a problem with this approach: the more ignore rules are added, the more conditions you have to add to each alert. You can, of course, group things together as discussed above, for example with a label enableAlert that is "true" on normal machines, so that every machine with enableAlert="false" gets ignored. This works, but it still requires manual work, including editing the expression of every single alert.
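For illustration, one way to attach such a label is in the scrape configuration, via the labels field of static_configs, so that every series scraped from those targets (including up) carries it. The job name, target addresses and label values below are hypothetical. Each alert would then still need a matcher such as up{enableAlert!="false"} == 0, which is the per-alert manual configuration mentioned above.

```yaml
scrape_configs:
  - job_name: node                     # hypothetical job name
    static_configs:
      # Important machines: alerts stay enabled.
      - targets: ["db-1:9100", "web-1:9100"]
        labels:
          enableAlert: "true"
      # Unimportant machines that are often switched off.
      - targets: ["lab-pc-1:9100"]
        labels:
          enableAlert: "false"
```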

So, let's see other possible solutions:

Relabeling

As discussed in "Prometheus config to ignore scraping of metrics for a specific namespace in Kubernetes", one can drop scrape targets (and therefore all of their metrics) when a certain label has a given value, for example a Kubernetes namespace.

See an example from the above discussion:

  relabel_configs:
  # This will ignore scraping targets from 'ignored_namespace_1', 
  # 'ignored_namespace_2', and 'ignored_namespace_N'.
  - source_labels: [__meta_kubernetes_namespace]
    action: drop
    regex: ignored_namespace_1|ignored_namespace_2|ignored_namespace_N

The problem with this approach is that those metrics are never collected at all, which is why it is not a feasible solution for my use case.

There is another similar thread here, and the official docs are here.

null receiver

This solution is discussed in "How to silence Prometheus Alertmanager using config files?". Essentially, it involves defining a receiver named "null" in Alertmanager and routing the unwanted alerts to it, so they are delivered nowhere.

Effectively, this acts as a permanent Alertmanager silence.
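A minimal sketch of such an Alertmanager routing tree, assuming Alertmanager 0.22 or later (for the matchers syntax). The default receiver and its webhook URL are hypothetical placeholders. Because continue defaults to false, an alert that matches the child route stops there and is never handed to the default receiver.

```yaml
route:
  receiver: default                    # hypothetical default receiver
  routes:
    # Alerts labelled ignore="yes" are routed to "null" and dropped.
    - matchers:
        - ignore="yes"
      receiver: "null"
receivers:
  - name: default
    webhook_configs:
      - url: "http://alert-hook.example.internal/"   # hypothetical endpoint
  # A receiver with no integrations: alerts routed here are not sent anywhere.
  - name: "null"
```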

Similar solutions are defining an infinite silence in Alertmanager or using inhibition rules (also discussed in the thread above; an example is at https://github.com/prometheus/alertmanager/blob/main/doc/examples/simple.yml ).
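An inhibition-based variant might look roughly like the sketch below. This is an assumption-heavy illustration and is not taken from the linked thread: it presumes a hypothetical, always-firing alert named IgnoreMarker (for example a Prometheus rule with expr: vector(1)) that acts as the source, so that every alert labelled ignore="yes" is muted while the marker fires. No equal list is given, so the source does not have to share any labels with the targets. The marker alert itself would still need to be routed somewhere harmless, such as the null receiver.

```yaml
inhibit_rules:
  # While the hypothetical IgnoreMarker alert is firing,
  # mute every alert that carries ignore="yes".
  - source_matchers:
      - alertname="IgnoreMarker"
    target_matchers:
      - ignore="yes"
```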

The disadvantage of Alertmanager-based solutions is that the alerts still appear as firing in Prometheus. This is a major no-go for me: I frequently check the web UI to see if everything is working, and a long list of firing alerts that are never sent would make it hard to work with.

Custom tool, prometheus-alert-overrider

While I couldn't find other solutions, this problem has been discussed in this post, in which the author shows a different way to silence alerts in a dev Kubernetes environment.

They wrote a custom preprocessor that automatically adds the ignore conditions, i.e. the original approach shown at the start of this question, driven by two extra keys, enabled and override, in the alert rule definitions. This allows them to write something like:

- alert: DisableKubeDev
  override: ["K8S.*"]
  enabled: false
  expr: '{kubernetes_cluster="kube-dev"}'

This disables all the alerts whose names start with K8S on the kube-dev cluster, in a much cleaner way than I could previously do.

The disadvantage of this solution is that it requires maintaining another project, with its own dependencies and updates.

Right now this seems like the best solution, but are there any native solutions one could use directly in Prometheus, or other alternatives? I couldn't find anything in the official docs.

The tool seems very simple and compact, but it still makes sense to ask whether native or alternative solutions could offer more simplicity or a different approach.

Thanks!



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
