Prometheus 监控方案学习笔记(五):告警通知组件的安装和配置(含 Prometheus Alert 和 Prometheus Webhook Dingtalk)
                            huty
                            2022年12月01日  ·  阅读 4,200
                        
                    Prometheus Alert 的安装和配置
Docker Compose 方式
- 创建相关目录
 
mkdir -p /apps/prometheus-alert
- 编辑 docker-compose.yml 文件
 
vim /apps/prometheus-alert/docker-compose.yml
version: "3"
services:
  prometheus-alert:
    image: feiyu563/prometheus-alert:v4.8.2
    container_name: prometheus-prometheus-alert
    hostname: prometheus-alert
    restart: always
    ports:
      - 8080:8080
    volumes:
      - /etc/localtime:/etc/localtime:ro
      - /apps/prometheus-alert/conf:/app/conf
      - /apps/prometheus-alert/db:/app/db
      - /apps/prometheus-alert/logs:/app/logs
networks:
  default:
    external:
      name: prometheus
- 编辑 app.conf 文件
Prometheus Alert 的配置文件,模板参考:https://github.com/feiyu563/PrometheusAlert/blob/master/conf/app-example.conf 
vim /apps/prometheus-alert/conf/app.conf
文件内容略,具体内容见上方连接
- 配置 docker 网段 prometheus
检查是否存在 prometheus 网段: 
docker network list
若不存在,则创建:
docker network create prometheus --subnet 10.21.22.0/24
- 启动 Prometheus Alert 容器
 
cd /apps/prometheus-alert
docker-compose up -d
- 检查 Prometheus Alert 容器状态、查看 Prometheus Alert 容器日志
 
cd /apps/prometheus-alert
docker-compose ps
docker-compose logs -f
- 配置 Alertmanager
编辑 Alertmanager 的alertmanager.yml配置文件,添加receivers,参考文档:https://github.com/feiyu563/PrometheusAlert/blob/master/doc/readme/system.md 
vim /apps/alertmanager/conf/alertmanager.yml
global:
  resolve_timeout: 5m
route:
  group_by: ['instance']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'alert-dd'
  routes:
  - receiver: 'alert-dx'
    continue: true
    match:
      severity: emergency
  - receiver: 'alert-dd'
    continue: true
    match_re:
      severity: warning|critical
receivers:
- name: 'alert-dd'
  webhook_configs:
  - url: 'http://127.0.0.1:8080/prometheusalert?type=dd&tpl=prometheus-dd-my&ddurl=https://oapi.dingtalk.com/robot/send?access_token=xxx'
- name: 'alert-dx'
  webhook_configs:
  - url: 'http://127.0.0.1:8080/prometheusalert?type=txdx&tpl=prometheus-dx-my&phone=xxx'
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['instance']
注意:
type: 为通知类型,dd为钉钉通知、txdx为腾讯云短信、alydx为阿里云短信,其他见官方文档( https://feiyu563.gitbook.io/prometheusalert/base-install/base-restful#url-can-shu-jie-shi )tpl: 为通知模板名称access_token: 为钉钉机器人的 AccessTokenphone: 为手机号,多个手机号之间用,分割
- 重启 Alertmanager
 
cd /apps/alertmanager
docker-compose restart
- 新增自定义模板
登录 Prometheus Alert ,新增自定义模板文件 
- 钉钉告警模板
模板名称:prometheus-dd-my
模板类型:钉钉
模板用途:Prometheus
模板内容: 
{{ $var := .externalURL}}{{ range $k,$v:=.alerts }}
{{if eq $v.status "resolved"}}
# Prometheus 恢复信息
## {{$v.labels.alertname}}
### {{$v.annotations.description}}
========== 以下为详细信息 ========== <br>
**告警级别:** {{$v.labels.severity}} <br>
**告警状态:** {{$v.status}} <br>
**告警类型:** {{$v.labels.alertname}} <br>
**告警主题:** {{$v.annotations.summary}} <br>
**故障资源:** {{$v.labels.instance}} <br>
**故障时间:** {{TimeFormat $v.startsAt "2006-01-02 15:04:05"}} <br>
**恢复时间:** {{TimeFormat $v.endsAt "2006-01-02 15:04:05"}} <br>
**资源标签:** <br>
- **name:** {{$v.labels.name}} <br>
{{else}}
# **Prometheus 告警信息**
## **{{$v.labels.alertname}}**
### {{$v.annotations.description}}
========== 以下为详细信息 ========== <br>
**告警级别:** {{$v.labels.severity}} <br>
**告警状态:** {{$v.status}} <br>
**告警类型:** {{$v.labels.alertname}} <br>
**告警主题:** {{$v.annotations.summary}} <br>
**故障资源:** {{$v.labels.instance}} <br>
**故障时间:** {{TimeFormat $v.startsAt "2006-01-02 15:04:05"}} <br>
**资源标签:** <br>
- **name:** {{$v.labels.name}} <br>
{{end}}
{{ end }}
{{ $urimsg:=""}}{{ range $key,$value:=.commonLabels }}{{$urimsg =  print $urimsg $key "%3D%22" $value "%22%2C" }}{{end}}
- 短信告警模板(以 腾讯云 为例,使用其他云供应商时修改 
模版类型即可)
模板名称:prometheus-dx-my
模板类型:腾讯云短信
模板用途:Prometheus
模板内容: 
{{ range $k,$v:=.alerts }}{{if eq $v.status "resolved"}}
[Prometheus恢复信息]
{{$v.labels.alertname}}
故障描述: {{$v.annotations.description}}
故障级别: {{$v.labels.severity}}
故障资源: {{$v.labels.instance}}
资源标签: 
- name:   {{$v.labels.name}}
{{else}}
[Prometheus告警信息]
{{$v.labels.alertname}}
故障描述: {{$v.annotations.description}}
故障级别: {{$v.labels.severity}}
故障资源: {{$v.labels.instance}}
资源标签: 
- name:   {{$v.labels.name}}
{{end}}
{{ end }}
注意: 在阿里云新增短信模板时,需要包含 ${code} 变量
10. 制造告警,验证告警消息是否发送
示例消息如下:
Prometheus 告警信息
主机连接失败
主机 127.0.0.1:9100 连接失败!
========== 以下为详细信息 ==========
告警级别: critical
告警状态: firing
告警类型: 主机连接失败
告警主题: 主机 (127.0.0.1:9100) 连接失败
故障资源: 127.0.0.1:9100
故障时间: 2022-10-12 06:10:34
资源标签:
name: 127.0.0.1
Prometheus 恢复信息
主机连接失败
主机 127.0.0.1:9100 连接失败!
========== 以下为详细信息 ==========
告警级别: critical
告警状态: resolved
告警类型: 主机连接失败
告警主题: 主机 (172.19.16.17:9100) 连接失败
故障资源: 127.0.0.1:9100
故障时间: 2022-10-12 06:10:34 
恢复时间: 2022-10-12 06:10:49 
资源标签: 
name: 127.0.0.1
Prometheus Webhook Dingtalk 的安装和配置
Docker Compose 方式
- 目录准备
 
mkdir -pv /apps/prometheus-webhook-dingtalk/{conf,template}
- 编辑 docker-compose.yml 文件
 
vim /apps/prometheus-webhook-dingtalk/docker-compose.yml
version: "3"
services:
  prometheus-webhook-dingtalk:
    image: timonwong/prometheus-webhook-dingtalk:v2.1.0
    container_name: prometheus-prometheus-webhook-dingtalk
    hostname: prometheus-webhook-dingtalk
    restart: always
    volumes:
      - /apps/prometheus-webhook-dingtalk/template/template.tmpl:/etc/prometheus-webhook-dingtalk/templates/legacy/template.tmpl
      - /apps/prometheus-webhook-dingtalk/conf/config.yml:/etc/prometheus-webhook-dingtalk/config.yml
    ports:
      - 8060:8060
    environment:
      - config.file=/etc/prometheus-webhook-dingtalk/config.yml
networks:
  default:
    external:
      name: prometheus
- 编辑配置文件( config.yml )
参考:https://github.com/timonwong/prometheus-webhook-dingtalk 中的 Configuration 部分 
vim /apps/prometheus-webhook-dingtalk/conf/config.yml
templates:
  - /etc/prometheus-webhook-dingtalk/templates/legacy/template.tmpl
targets:
  webhook1:
    url: https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxxxxxx
    secret: SEC000000000000000000000
  webhook2:
    url: https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxxxxxx
  webhook_legacy:
    url: https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxxxxxx
    message:
      title: '{{ template "legacy.title" . }}'
      text: '{{ template "email.to.message" . }}'
  webhook_mention_all:
    url: https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxxxxxx
    mention:
      all: true
  webhook_mention_users:
    url: https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxxxxxx
    mention:
      mobiles: ['156xxxx8827', '189xxxx8325']
- 编辑消息模板文件( template.tmpl )
参考:https://github.com/timonwong/prometheus-webhook-dingtalk/blob/main/template/default.tmpl 
vim /apps/prometheus-webhook-dingtalk/template/template.tmpl
{{ define "email.to.message" }}
{{- if gt (len .Alerts.Firing) 0 -}}
{{- range $index, $alert := .Alerts -}}
=========  **监控告警** =========  
**告警程序:**     Alertmanager   
**告警类型:**    {{ $alert.Labels.alertname }}   
**告警级别:**    {{ $alert.Labels.severity }} 级   
**告警状态:**    {{ .Status }}   
**故障主机:**    {{ $alert.Labels.instance }} {{ $alert.Labels.device }}   
**告警主题:**    {{ .Annotations.summary }}   
**告警详情:**    {{ $alert.Annotations.message }}{{ $alert.Annotations.description}}   
**主机标签:**    {{ range .Labels.SortedPairs  }}  </br>  [ {{ .Name }}: {{ .Value | markdown | html }} ]   
{{- end }} </br>
**故障时间:**    {{ ($alert.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}  
========= = end =  =========  
{{- end }}
{{- end }}
{{- if gt (len .Alerts.Resolved) 0 -}}
{{- range $index, $alert := .Alerts -}}
========= 告警恢复 =========  
**告警程序:**     Alertmanager   
**告警类型:**    {{ .Labels.alertname }}
**告警级别:**    {{ $alert.Labels.severity }} 级
**告警状态:**    {{   .Status }}
**告警主机:**    {{ .Labels.instance }}
**告警主题:**    {{ $alert.Annotations.summary }}  
**告警详情:**    {{ $alert.Annotations.message }}{{ $alert.Annotations.description}}  
**故障时间:**    {{ ($alert.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}  
**恢复时间:**    {{ ($alert.EndsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}  
========= = **end** =  =========
{{- end }}
{{- end }}
{{- end }}
- 配置 Alertmanager
编辑 Alertmanager 的alertmanager.yml配置文件(需要将webhook_configs的url修改为 pPrometheus Webhook Dingtalk 的地址) 
vim /apps/prometheus-webhook-dingtalk/conf/alertmanager.yml
global:
  resolve_timeout: 5m
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'dingtalk'
receivers:
- name: 'dingtalk'
  webhook_configs:
  - url: 'http://127.0.0.1:8060/dingtalk/webhook_legacy/send'
    send_resolved: true
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: '*'
    equal: ['instance']
- 创建 docker 网段 prometheus
检查是否存在 prometheus 网段: 
docker network list
若不存在,则创建:
docker network create prometheus --subnet 10.21.22.0/24
- 启动 Alertmanager 容器
 
cd /apps/prometheus-webhook-dingtalk
docker-compose up -d
- 查看 Alertmanager 容器状态、查看 Alertmanager 容器日志
 
cd /apps/prometheus-webhook-dingtalk
docker-compose ps
docker-compose logs -f
- 验证告警发送是否正常
制造告警,验证钉钉消息是否发送正常,示例如下: 
=========  监控告警 =========  
告警程序:     Alertmanager
告警类型:    主机连接失败
告警级别:    warning 级
告警状态:    firing
故障主机:    127.0.0.1:9100
告警主题:    主机 (127.0.0.1:9100) 连接失败
告警详情:    主机 127.0.0.1:9100 连接失败!
主机标签:       [name: 127.0.0.1] 
故障时间:    2022-09-28 13:22:02
========= = end =  =========
========= 告警恢复 =========
告警程序:     Alertmanager
告警类型:    主机连接失败
告警级别:    warning 级
告警状态:    resolved
告警主机:    127.0.0.1:9100
告警主题:    主机 (127.0.0.1:9100) 连接失败
告警详情:    主机 127.0.0.1:9100 连接失败!
故障时间:    2022-09-28 13:22:02
恢复时间:    2022-09-28 13:23:02  
========= = end =  =========
                分类:
                                Prometheus 监控体系
                    
                    标签:
                                Prometheus
                    
                
评论已关闭