Alertmanager not sending out alerts to Slack
I configured Alertmanager with Prometheus, and according to the Prometheus interface the alerts are firing. However, no Slack message shows up, and I am wondering whether ufw needs to be configured or whether there is some other config I missed.
The alertmanager service is running, and Prometheus shows the alerts as "firing".
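A quick way to split the problem is to test the Slack webhook and the Prometheus-to-Alertmanager hop separately. This is only a sketch, assuming the default ports from the configs below and the real (here redacted) webhook URL:
# Test the Slack webhook directly; a plain "ok" response means
# Slack is reachable and the URL is valid, so ufw is not blocking outbound traffic.
curl -X POST -H 'Content-Type: application/json' \
  --data '{"text": "test from curl"}' \
  'https://hooks.slack.com/services/my_id_removed...'
# Ask Alertmanager which alerts it has actually received from Prometheus.
curl -s http://localhost:9093/api/v2/alerts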
Here are my config files:
alertmanager.yml:
global:
  slack_api_url: 'https://hooks.slack.com/services/my_id_removed...'

route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  receiver: 'slack_general'

receivers:
  #- name: 'web.hook'
  #  webhook_configs:
  #    - url: 'http://127.0.0.1:5001/'
  - name: slack_general
    slack_configs:
      - channel: '#alerts'
        send_resolved: true

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
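The routing above can be checked without restarting anything, using amtool (which ships next to the alertmanager binary; the /opt/alertmanager path is taken from the systemd unit shown further down):
/opt/alertmanager/amtool check-config /opt/alertmanager/alertmanager.yml
# Show which receiver a given label set would be routed to:
/opt/alertmanager/amtool config routes test --config.file=/opt/alertmanager/alertmanager.yml severity=warning
With this config, every label set should resolve to slack_general, since it is the only receiver on the default route.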
prometheus.yml:
global:
  scrape_interval: 10s
  evaluation_interval: 15s # Evaluate rules every 15s. Default is 1m.

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']

rule_files:
  - rules.yml

scrape_configs:
  - job_name: 'prometheus'
    scrape_interval: 5s
    static_configs:
      - targets: ['localhost:9090', 'localhost:9104']
  - job_name: 'node_exporter_metrics'
    scrape_interval: 5s
    static_configs:
      - targets: ['leo:9100', 'dog:9100']
  - job_name: 'alertmanager'
    static_configs:
      - targets: ['localhost:9093']
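This file, together with every file listed under rule_files, can be validated with promtool, which ships with Prometheus. A sketch, assuming promtool is on the PATH:
promtool check config prometheus.yml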
alerts.yml:
groups:
  - name: test
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 1m
      - alert: HostHighCpuLoad
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[2m])) * 100) > 80
        for: 0m
        labels:
          severity: warning
        annotations:
          summary: Host high CPU load (instance {{ $labels.instance }})
          description: "CPU load is > 80%\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
      - alert: HostOutOfMemory
        expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 10
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: Host out of memory (instance {{ $labels.instance }})
          description: "Node memory is filling up (< 10% left)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
      # Please add ignored mountpoints in node_exporter parameters like
      # "--collector.filesystem.ignored-mount-points=^/(sys|proc|dev|run)($|/)".
      # Same rule using "node_filesystem_free_bytes" will fire when disk fills up for non-root users.
      - alert: HostOutOfDiskSpace
        expr: (node_filesystem_avail_bytes * 100) / node_filesystem_size_bytes < 8 and ON (instance, device, mountpoint) node_filesystem_readonly == 0
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: Host out of disk space (instance {{ $labels.instance }})
          description: "Disk is almost full (< 8% left)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
sudo systemctl status alertmanager.service
● alertmanager.service - Alertmanager for prometheus
     Loaded: loaded (/etc/systemd/system/alertmanager.service; enabled; vendor preset: enabled)
     Active: active (running) since Sat 2022-03-05 22:02:31 CET; 6min ago
   Main PID: 9398 (alertmanager)
      Tasks: 30 (limit: 154409)
     Memory: 21.8M
     CGroup: /system.slice/alertmanager.service
             └─9398 /opt/alertmanager/alertmanager --config.file=/opt/alertmanager/alertmanager.yml --storage.path=/opt/alertmanager/data
Mar 05 22:02:31 leo alertmanager[9398]: level=info ts=2022-03-05T21:02:31.094Z caller=main.go:225 msg="Starting Alertmanager" version="(version=0.23.0, branch=HEAD, revision=61046b17771a57cfd4>
Mar 05 22:02:31 leo alertmanager[9398]: level=info ts=2022-03-05T21:02:31.094Z caller=main.go:226 build_context="(go=go1.16.7, user=root@e21a959be8d2, date=20210825-10:48:55)"
Mar 05 22:02:31 leo alertmanager[9398]: level=info ts=2022-03-05T21:02:31.098Z caller=cluster.go:184 component=cluster msg="setting advertise address explicitly" addr=192.168.0.2 port=9094
Mar 05 22:02:31 leo alertmanager[9398]: level=info ts=2022-03-05T21:02:31.099Z caller=cluster.go:671 component=cluster msg="Waiting for gossip to settle..." interval=2s
Mar 05 22:02:31 leo alertmanager[9398]: level=info ts=2022-03-05T21:02:31.127Z caller=coordinator.go:113 component=configuration msg="Loading configuration file" file=/opt/alertmanager/alertma>
Mar 05 22:02:31 leo alertmanager[9398]: level=info ts=2022-03-05T21:02:31.128Z caller=coordinator.go:126 component=configuration msg="Completed loading of configuration file" file=/opt/alertma>
Mar 05 22:02:31 leo alertmanager[9398]: level=info ts=2022-03-05T21:02:31.131Z caller=main.go:518 msg=Listening address=:9093
Mar 05 22:02:31 leo alertmanager[9398]: level=info ts=2022-03-05T21:02:31.131Z caller=tls_config.go:191 msg="TLS is disabled." http2=false
Mar 05 22:02:33 leo alertmanager[9398]: level=info ts=2022-03-05T21:02:33.099Z caller=cluster.go:696 component=cluster msg="gossip not settled" polls=0 before=0 now=1 elapsed=2.000022377s
Mar 05 22:02:41 leo alertmanager[9398]: level=info ts=2022-03-05T21:02:41.101Z caller=cluster.go:688 component=cluster msg="gossip settled; proceeding" elapsed=10.002298352s
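The status output above only shows startup messages; a failed Slack notification would appear as an error-level log line from the notify component. One way to check, assuming the unit name from above:
journalctl -u alertmanager.service --since "1 hour ago" | grep -iE 'error|notify'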
Solution 1:[1]
I believe I have found the reason the alerts came through sometimes and sometimes not: Alertmanager failed to autostart cleanly after a reboot because it did not wait for the network (and therefore Prometheus) to be available.
This fixed it:
sudo nano /etc/systemd/system/alertmanager.service
Add "wants" and "after":
[Unit]
Description=Alertmanager for prometheus
Wants=network-online.target
After=network-online.target
Reboot.
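Alternatively, the change can be picked up without a full reboot; a sketch of the usual systemd workflow:
sudo systemctl daemon-reload
sudo systemctl restart alertmanager.service
sudo systemctl status alertmanager.service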
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | merlin |