alerts: inverse grouping in vmagent alerts (#5429)

Aggregations with by() have one sideeffect, that any custom labels you add to hosts are dropped too which can be used for alerts routing.

Therefore, some good practice could be to use without() instead, with labels, like without(path) , or without(url) to get same aggregations but with any external labels left intact.

(cherry picked from commit 75196d7234)
This commit is contained in:
7840vz 2023-12-11 17:01:29 +03:00 committed by hagen1778
parent 15a7542ef8
commit 231f6c7fb2
No known key found for this signature in database
GPG key ID: 3BF75F3741CA9640

View file

@ -10,7 +10,7 @@ groups:
concurrency: 2
rules:
- alert: PersistentQueueIsDroppingData
expr: sum(increase(vm_persistentqueue_bytes_dropped_total[5m])) by (job, instance) > 0
expr: sum(increase(vm_persistentqueue_bytes_dropped_total[5m])) without (path) > 0
for: 10m
labels:
severity: critical
@ -21,7 +21,7 @@ groups:
on instance {{ $labels.instance }} for the last 10m."
- alert: RejectedRemoteWriteDataBlocksAreDropped
expr: sum(increase(vmagent_remotewrite_packets_dropped_total[5m])) by (job, instance) > 0
expr: sum(increase(vmagent_remotewrite_packets_dropped_total[5m])) without (url) > 0
for: 15m
labels:
severity: warning
@ -31,7 +31,7 @@ groups:
remote-write server data blocks. Check the logs to find the reason for rejects."
- alert: TooManyScrapeErrors
expr: sum(increase(vm_promscrape_scrapes_failed_total[5m])) by (job, instance) > 0
expr: increase(vm_promscrape_scrapes_failed_total[5m]) > 0
for: 15m
labels:
severity: warning
@ -41,9 +41,9 @@ groups:
- alert: TooManyWriteErrors
expr: |
(sum(increase(vm_ingestserver_request_errors_total[5m])) by (job, instance)
(sum(increase(vm_ingestserver_request_errors_total[5m])) without (name,net,type)
+
sum(increase(vmagent_http_request_errors_total[5m])) by (job, instance)) > 0
sum(increase(vmagent_http_request_errors_total[5m])) without (path,protocol)) > 0
for: 15m
labels:
severity: warning
@ -52,7 +52,7 @@ groups:
summary: "Job \"{{ $labels.job }}\" on instance {{ $labels.instance }} responds with errors to write requests for last 15m."
- alert: TooManyRemoteWriteErrors
expr: sum(rate(vmagent_remotewrite_retries_count_total[5m])) by(job, instance, url) > 0
expr: rate(vmagent_remotewrite_retries_count_total[5m]) > 0
for: 15m
labels:
severity: warning
@ -65,9 +65,9 @@ groups:
- alert: RemoteWriteConnectionIsSaturated
expr: |
(
sum(rate(vmagent_remotewrite_send_duration_seconds_total[5m])) by(job, instance, url)
rate(vmagent_remotewrite_send_duration_seconds_total[5m])
/
max(vmagent_remotewrite_queues) by(job, instance, url)
vmagent_remotewrite_queues
) > 0.9
for: 15m
labels: