Alerts firing on Prometheus but not on Alertmanager
I can't figure out why Alertmanager is not getting alerts from Prometheus, and I would appreciate some swift assistance with this. I'm fairly new to Prometheus and Alertmanager. I am using a webhook for MS Teams to push the notifications from Alertmanager.
Alertmanager.yml

global:
  resolve_timeout: 5m

route:
  group_by: ['critical','severity']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'alert_channel'

receivers:
  - name: 'alert_channel'
    webhook_configs:
      - url: 'http://localhost:2000/alert_channel'
        send_resolved: true
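For reference, the file can be syntax-checked with amtool before reloading Alertmanager (this assumes amtool is installed alongside Alertmanager and the file lives at /etc/alertmanager/alertmanager.yml, as in the unit file below):

amtool check-config /etc/alertmanager/alertmanager.yml
# prints the receivers and templates it found, or the parse error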
prometheus.yml (just a part of it)
# my global config
global:
  scrape_interval: 15s     # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - localhost:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"
  - alert_rules.yml

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'kafka'
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'
    static_configs:
      - targets: ['localhost:8080']
        labels:
          service: 'Kafka'
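A quick way to confirm that Prometheus parses this file and has actually registered the Alertmanager endpoint (the config path is an assumption based on a default install):

promtool check config /etc/prometheus/prometheus.yml
# after a reload, localhost:9093 should show up under activeAlertmanagers
curl -s http://localhost:9090/api/v1/alertmanagers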
alertmanager.service
[Unit]
Description=Prometheus Alert Manager
Wants=network-online.target
After=network-online.target

[Service]
Type=simple
User=alertmanager
Group=alertmanager
ExecStart=/usr/local/bin/alertmanager \
    --config.file=/etc/alertmanager/alertmanager.yml \
    --storage.path=/data/alertmanager \
    --web.listen-address=127.0.0.1:9093
Restart=always

[Install]
WantedBy=multi-user.target
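The unit binds Alertmanager to 127.0.0.1:9093, which matches the localhost:9093 target in prometheus.yml as long as both processes run on the same host. Whether the service is actually up and answering on that address can be checked with, for example:

systemctl status alertmanager
journalctl -u alertmanager --since "1 hour ago"
curl -s http://127.0.0.1:9093/api/v2/status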
alert_rules.yml

groups:
  - name: alert_rules
    rules:
      - alert: ServiceDown
        expr: up == 0
        for: 1m
        labels:
          severity: "critical"
        annotations:
          summary: "Service {{ $labels.service }} down!"
          description: "{{ $labels.service }} of job {{ $labels.job }} has been down for more than 1 minute."

      - alert: HostOutOfMemory
        expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 25
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Host out of memory (instance {{ $labels.instance }})"
          description: "Node memory is filling up (< 25% left)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"

      - alert: HostOutOfDiskSpace
        expr: (node_filesystem_avail_bytes{mountpoint="/"} * 100) / node_filesystem_size_bytes{mountpoint="/"} < 40
        for: 1s
        labels:
          severity: warning
        annotations:
          summary: "Host out of disk space (instance {{ $labels.instance }})"
          description: "Disk is almost full (< 40% left)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
But I don't see those alerts in Alertmanager.
I'm out of ideas at this point. Please, I need help; I've been stuck on this since last week.
Solution 1
You have a mistake in your Alertmanager configuration: group_by expects a collection of label names, and from what I am seeing, critical is a label value, not a label name. So simply remove critical and you should be good to go.
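For example, a minimal sketch of the corrected route, keeping the receiver and timings from your file and only changing the group_by line:

route:
  group_by: ['severity']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'alert_channel'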
Also check out this blog post, it is quite helpful: https://www.robustperception.io/whats-the-difference-between-group_interval-group_wait-and-repeat_interval
Edit 1
If you want the receiver alert_channel to only receive alerts that have the severity critical, you have to create a sub-route with a match attribute. Something along these lines:
route:
  group_by: ['...'] # good if very low volume
  group_wait: 15s
  group_interval: 5m
  repeat_interval: 1h
  routes:
    - match:
        severity: critical
      receiver: alert_channel
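You can dry-run the routing tree with amtool to see which receiver a given label set would end up at (assuming amtool is available and pointed at the same config file):

amtool config routes test --config.file=/etc/alertmanager/alertmanager.yml severity=critical
# should print: alert_channel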
Edit 2
If that does not work either, try this:
route:
  group_by: ['...']
  group_wait: 15s
  group_interval: 5m
  repeat_interval: 1h
  receiver: alert_channel
This should work. Also check your Prometheus logs and see if you find any hints there.
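To narrow down whether the problem sits between Prometheus and Alertmanager or between Alertmanager and the MS Teams webhook, you can also push a hand-crafted alert straight into Alertmanager; the label values below are just examples:

amtool alert add alertname=ManualTest severity=critical service=Kafka --alertmanager.url=http://127.0.0.1:9093
# if this one reaches MS Teams, the webhook side works and the issue is on the Prometheus side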