alerts: simplify aggregation of alerting rules

This is a follow-up to 75196d7234.

It updates some of the alerting rules to remove unnecessary aggregations.
It keeps aggregations in expressions that use multiple time series
filters, to make sure their labels match.
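
As a rough illustration of the distinction above (a sketch, not part of the commit itself): a rule with a single time series filter can drop the aggregation and keep all of its labels, while a rule that compares two filters keeps `sum()` on both sides so that the two results share the same label set. The group name `example` and the rule name `ExampleChurnCheck` are hypothetical; the expressions are copied from the rules changed below.

```yaml
# Illustrative fragment only; `example` and `ExampleChurnCheck` are hypothetical names.
groups:
  - name: example
    rules:
      # One time series filter: no aggregation is needed, and every label
      # of the matched series stays on the resulting alert.
      - alert: RowsRejectedOnIngestion
        expr: rate(vm_rows_ignored_total[5m]) > 0

      # Two time series filters compared with `>`: both sides stay wrapped
      # in sum(), so each collapses to the same (empty) label set and the
      # comparison can match them.
      - alert: ExampleChurnCheck
        expr: |
          sum(increase(vm_new_timeseries_created_total[24h]))
          >
          (sum(vm_cache_entries{type="storage/hour_metric_ids"}) * 3)
```

Rules that switch from `by (...)` to `without (...)` sit in between: they still aggregate, but only drop the listed labels and keep the rest.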

Signed-off-by: hagen1778 <roman@victoriametrics.com>
hagen1778 2023-12-11 15:17:30 +01:00
parent 75196d7234
commit 8fb68152e6
4 changed files with 13 additions and 13 deletions


@@ -81,7 +81,7 @@ groups:
             Possible reasons for errors are misconfiguration, overload, network blips or unreachable components."
       - alert: RowsRejectedOnIngestion
-        expr: sum(rate(vm_rows_ignored_total[5m])) by (instance, reason) > 0
+        expr: rate(vm_rows_ignored_total[5m]) > 0
         for: 15m
         labels:
           severity: warning
@@ -113,7 +113,7 @@ groups:
         expr: |
           sum(increase(vm_new_timeseries_created_total[24h]))
           >
-          (sum(vm_cache_entries{type="storage/hour_metric_ids"})* 3)
+          (sum(vm_cache_entries{type="storage/hour_metric_ids"}) * 3)
         for: 15m
         labels:
           severity: warning
@@ -155,7 +155,7 @@ groups:
             Consider to increase the limit as fast as possible."
       - alert: LabelsLimitExceededOnIngestion
-        expr: sum(increase(vm_metrics_with_dropped_labels_total[5m])) by (instance) > 0
+        expr: increase(vm_metrics_with_dropped_labels_total[5m]) > 0
         for: 15m
         labels:
           severity: warning


@@ -55,7 +55,7 @@ groups:
             Consider to either increase available CPU resources or decrease the load on the process."
       - alert: TooManyLogs
-        expr: sum(increase(vm_log_messages_total{level="error"}[5m])) by (job, instance) > 0
+        expr: sum(increase(vm_log_messages_total{level="error"}[5m])) without (app_version, location) > 0
         for: 15m
         labels:
           severity: warning
@@ -65,7 +65,7 @@ groups:
             Worth to check logs for specific error messages."
       - alert: TooManyTSIDMisses
-        expr: sum(rate(vm_missing_tsids_for_metric_id_total[5m])) by (job, instance) > 0
+        expr: rate(vm_missing_tsids_for_metric_id_total[5m]) > 0
         for: 10m
         labels:
           severity: critical


@@ -18,7 +18,7 @@ groups:
             Check vmalert's logs for detailed error message."
       - alert: AlertingRulesError
-        expr: sum(increase(vmalert_alerting_rules_errors_total[5m])) by(job, instance, group, file) > 0
+        expr: sum(increase(vmalert_alerting_rules_errors_total[5m])) without(alertname, id) > 0
         for: 5m
         labels:
           severity: warning
@@ -29,7 +29,7 @@ groups:
             Check vmalert's logs for detailed error message."
       - alert: RecordingRulesError
-        expr: sum(increase(vmalert_recording_rules_errors_total[5m])) by(job, instance, group, file) > 0
+        expr: sum(increase(vmalert_recording_rules_errors_total[5m])) without(recording, id) > 0
         for: 5m
         labels:
           severity: warning
@@ -40,7 +40,7 @@ groups:
             Check vmalert's logs for detailed error message."
       - alert: RecordingRulesNoData
-        expr: sum(vmalert_recording_rules_last_evaluation_samples) by(job, group, recording, file) < 1
+        expr: sum(vmalert_recording_rules_last_evaluation_samples) without(recording, id) < 1
         for: 30m
         labels:
           severity: info
@@ -52,7 +52,7 @@ groups:
             or incorrect query expression."
       - alert: TooManyMissedIterations
-        expr: sum(increase(vmalert_iteration_missed_total[5m])) by(job, instance, group, file) > 0
+        expr: increase(vmalert_iteration_missed_total[5m]) > 0
         for: 15m
         labels:
           severity: warning
@@ -65,7 +65,7 @@ groups:
             If rule expressions are taking longer than expected, please see https://docs.victoriametrics.com/Troubleshooting.html#slow-queries."
       - alert: RemoteWriteErrors
-        expr: sum(increase(vmalert_remotewrite_errors_total[5m])) by(job, instance) > 0
+        expr: increase(vmalert_remotewrite_errors_total[5m]) > 0
         for: 15m
         labels:
           severity: warning
@@ -75,7 +75,7 @@ groups:
             or recording rules to the configured remote write URL. Check vmalert's logs for detailed error message."
       - alert: AlertmanagerErrors
-        expr: sum(increase(vmalert_alerts_send_errors_total[5m])) by(job, instance, addr) > 0
+        expr: increase(vmalert_alerts_send_errors_total[5m]) > 0
         for: 15m
         labels:
           severity: warning


@@ -61,7 +61,7 @@ groups:
             Please verify if clients are sending correct requests."
       - alert: RowsRejectedOnIngestion
-        expr: sum(rate(vm_rows_ignored_total[5m])) by (instance, reason) > 0
+        expr: rate(vm_rows_ignored_total[5m]) > 0
         for: 15m
         labels:
           severity: warning
@@ -124,7 +124,7 @@ groups:
             See also https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3976#issuecomment-1476883183"
       - alert: LabelsLimitExceededOnIngestion
-        expr: sum(increase(vm_metrics_with_dropped_labels_total[5m])) by (instance) > 0
+        expr: increase(vm_metrics_with_dropped_labels_total[5m]) > 0
         for: 15m
         labels:
           severity: warning