DM v5.4.0: how to write the Prometheus replication lag alert (dm_worker.rules.yml) so it fires once lag exceeds a threshold for a given number of seconds

【TiDB environment】Production
【TiDB version】v5.2.2
【DM version】v5.4.0
【Problem encountered】
In DM v2.0.7 we could get real-time replication lag alerts by writing the following rules in dm_worker.rules.yml:

  - alert: DM_worker_replication_lag_warning
    expr: dm_syncer_replication_lag{job="dm_worker"} >= 10
    for: 1m
    labels:
      level: warning
    annotations:
      summary: DM replication lag has exceeded 10s for more than 1 minute
      description: 'task: {{ $labels.task }}, Lag: {{ $value }}'

  - alert: DM_worker_replication_lag_critical
    expr: dm_syncer_replication_lag{job="dm_worker"} >= 30
    labels:
      level: critical
    annotations:
      summary: DM replication lag more than 30s
      description: 'task: {{ $labels.task }}, Lag: {{ $value }}'

But I found that the rules above no longer work in DM v5.4.0.
After checking the documentation and the Grafana panels, it appears the metric has been renamed to dm_syncer_replication_lag_gauge:
"expr": "dm_syncer_replication_lag_gauge{instance=~\"$instance\",task=~\"$task\"}"


https://docs.pingcap.com/tidb/stable/monitor-a-dm-cluster
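
To confirm the renamed metric is actually being scraped, you can run an instant query against the Prometheus HTTP API first (a minimal sketch; [Prometheus_IP] is a placeholder for your own address):

  # Instant query; an empty "result" array means the metric is not being scraped.
  curl -s 'http://[Prometheus_IP]:9090/api/v1/query' \
    --data-urlencode 'query=dm_syncer_replication_lag_gauge{job="dm_worker"}'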

So I tried rewriting the rule as follows:

  - alert: DM_worker_replication_lag_test
    expr: dm_syncer_replication_lag_gauge{job="dm_worker"} > 5
    labels:
      level: critical
    annotations:
      summary: DM replication lag more than 5s
      description: 'cluster: xxx-dmcluster, task: {{ $labels.task }}, Lag: {{ $value }}'
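
Before reloading, the rule file can be syntax-checked with promtool (a sketch, assuming the fragment above sits under the usual groups:/rules: hierarchy of the shipped dm_worker.rules.yml):

  # Reports YAML errors and invalid PromQL; exits non-zero on failure.
  promtool check rules dm_worker.rules.yml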

But after reloading Prometheus, I deliberately ran pause-task to make the task fall behind, and the alert still would not fire at
http://[Prometheus_IP]:9090/alerts
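
Besides the web UI, the loaded rules and their evaluation state can be checked via the HTTP API (a minimal sketch; [Prometheus_IP] is a placeholder):

  # Lists every loaded alerting rule with its state (inactive/pending/firing)
  # and the last evaluation error, if any.
  curl -s 'http://[Prometheus_IP]:9090/api/v1/rules?type=alert'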

In addition, the DM v5.4.0 OpenAPI does not provide a corresponding endpoint for querying replication lag either:
https://docs.pingcap.com/zh/tidb/stable/dm-open-api
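
In the meantime, a spot check is possible with dmctl's query-status, which reports binlog positions rather than a lag-in-seconds figure (a minimal sketch; [DM_master_IP] and the task name are placeholders):

  # Compare masterBinlog and syncerBinlog in the output to gauge the lag.
  tiup dmctl --master-addr [DM_master_IP]:8261 query-status <task-name>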

If any consultant here has worked this out, please don't hesitate to share. Thanks!!!

【Symptoms and impact】

Unable to receive timely alerts when replication lag exceeds a given number of seconds.


After spending a whole afternoon on it, I finally got it working. Here is the solution.
dm_worker.rules.yml

  - alert: DM_worker_replication_lag_warning
    expr: dm_syncer_replication_lag_gauge{job="dm_worker"} > 10
    for: 1m
    labels:
      level: warning
    annotations:
      summary: DM replication lag has exceeded 10s for more than 1 minute
      description: 'task: {{ $labels.task }}, Lag: {{ $value }}'

  - alert: DM_worker_replication_lag_critical
    expr: dm_syncer_replication_lag_gauge{job="dm_worker"} > 30
    labels:
      level: critical
    annotations:
      summary: DM replication lag more than 30s
      description: 'cluster: xxxxx-dmcluster, task: {{ $labels.task }}, Lag: {{ $value }}'
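
To sanity-check the threshold without waiting on the alert pipeline, the same expression can be run as an instant query; any series returned means the condition currently holds (a minimal sketch; [Prometheus_IP] is a placeholder):

  # Returns only the series whose lag currently exceeds 10s.
  curl -s 'http://[Prometheus_IP]:9090/api/v1/query' \
    --data-urlencode 'query=dm_syncer_replication_lag_gauge{job="dm_worker"} > 10'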

alertmanager.yml
The content under the routes key below needs to be added:

global:
  # If no further notifications arrive for an alert, mark it as resolved after this timeout
  resolve_timeout: 5m
  # The Slack webhook URL.
  slack_api_url: 'https://hooks.slack.com/services/T43QNF23S/B02RNU7CJEL/xxxxxxxxxxxxx'

route:
  # A default receiver
  receiver: "db-alert-slack"

  # The labels by which incoming alerts are grouped together. For example,
  # multiple alerts coming in for cluster=A and alertname=LatencyHigh would
  # be batched into a single group.
  group_by: ["env", "instance", "alertname", "type", "group", "job"]

  # When a new group of alerts is created by an incoming alert, wait at
  # least 'group_wait' to send the initial notification.
  # This way ensures that you get multiple alerts for the same group that start
  # firing shortly after another are batched together on the first
  # notification.
  group_wait: 30s

  # When the first notification was sent, wait 'group_interval' to send a batch
  # of new alerts that started firing for that group.
  group_interval: 3m

  # If an alert has successfully been sent, wait 'repeat_interval' to
  # resend them.
  repeat_interval: 3m

  routes:
    # replication
    - match_re:
        alertname: DM_worker_replication_lag_warning|DM_worker_replication_lag_critical
      group_by: [alertname, task]

receivers:
  - name: 'db-alert-slack'
    slack_configs:
    - channel: '#xxxxx-dm-alert'
      title:   '{{ .CommonLabels.alertname }}'
      text:    '{{ .CommonAnnotations.summary }}  {{ .CommonAnnotations.description }}'
      send_resolved: true
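
Before reloading Alertmanager, the merged configuration can be validated with amtool, which ships with Alertmanager (a sketch):

  # Validates alertmanager.yml, including the route tree and receivers.
  amtool check-config alertmanager.yml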