# VictoriaMetrics/deployment/docker/alerts.yml

# This file contains the default list of alerts for a VictoriaMetrics single server.
# The alerts below are just recommendations and may require some updates
# and threshold calibration for each specific setup.
groups:
  # Alerts group for VM single assumes that Grafana dashboard
  # https://grafana.com/grafana/dashboards/10229 is installed.
  # Please update the `dashboard` annotation according to your setup.
  - name: vmsingle
    interval: 30s
    concurrency: 2
    rules:
      - alert: DiskRunsOutOfSpaceIn3Days
        expr: |
          vm_free_disk_space_bytes / ignoring(path)
          (
            (
              rate(vm_rows_added_to_storage_total[1d]) -
              ignoring(type) rate(vm_deduplicated_samples_total{type="merge"}[1d])
            )
            * scalar(
              sum(vm_data_size_bytes{type!="indexdb"}) /
              sum(vm_rows{type!="indexdb"})
            )
          ) < 3 * 24 * 3600 > 0
        for: 30m
        labels:
          severity: critical
        annotations:
          dashboard: "http://localhost:3000/d/wNf0q_kZk?viewPanel=73&var-instance={{ $labels.instance }}"
          summary: "Instance {{ $labels.instance }} will run out of disk space soon"
          description: "Taking into account the current ingestion rate, free disk space will last only
            for {{ $value | humanizeDuration }} on instance {{ $labels.instance }}.\n
            Consider limiting the ingestion rate, decreasing retention, or scaling up the disk space if possible."
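      # Worked example for the expression above, using hypothetical figures
      # (illustrative assumptions, not measurements from any real setup):
      #   free disk space                   = 100e9 bytes
      #   rows added minus deduplicated     = 500000 rows/s (averaged over 1d)
      #   data size / rows, without indexdb = 1 byte per row
      #   estimated time left               = 100e9 / (500000 * 1) = 200000 s, about 2.3 days
      # 200000 is below 3 * 24 * 3600 = 259200, so the condition would hold
      # and the alert would fire once it has held for 30m.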

      - alert: DiskRunsOutOfSpace
        expr: |
          sum(vm_data_size_bytes) by(instance) /
          (
           sum(vm_free_disk_space_bytes) by(instance) +
           sum(vm_data_size_bytes) by(instance)
          ) > 0.8
        for: 30m
        labels:
          severity: critical
        annotations:
          dashboard: "http://localhost:3000/d/wNf0q_kZk?viewPanel=53&var-instance={{ $labels.instance }}"
          summary: "Instance {{ $labels.instance }} will run out of disk space soon"
          description: "Disk utilisation on instance {{ $labels.instance }} is more than 80%.\n
            Having less than 20% of free disk space could cripple merge processes and overall performance.
            Consider limiting the ingestion rate, decreasing retention, or scaling up the disk space if possible."

      - alert: RequestErrorsToAPI
        expr: increase(vm_http_request_errors_total[5m]) > 0
        for: 15m
        labels:
          severity: warning
          show_at: dashboard
        annotations:
          dashboard: "http://localhost:3000/d/wNf0q_kZk?viewPanel=35&var-instance={{ $labels.instance }}"
          summary: "Too many errors served for path {{ $labels.path }} (instance {{ $labels.instance }})"
          description: "Requests to path {{ $labels.path }} are receiving errors.
            Please verify if clients are sending correct requests."

      - alert: ConcurrentFlushesHitTheLimit
        expr: avg_over_time(vm_concurrent_addrows_current[1m]) >= vm_concurrent_addrows_capacity
        for: 15m
        labels:
          severity: warning
          show_at: dashboard
        annotations:
          dashboard: "http://localhost:3000/d/wNf0q_kZk?viewPanel=59&var-instance={{ $labels.instance }}"
          summary: "VictoriaMetrics on instance {{ $labels.instance }} is constantly hitting concurrent flushes limit"
          description: "The limit of concurrent flushes on instance {{ $labels.instance }} is equal to the number of CPUs.\n
            When VictoriaMetrics constantly hits the limit, it means that storage is overloaded and requires more CPU."

      - alert: RowsRejectedOnIngestion
        expr: sum(rate(vm_rows_ignored_total[5m])) by (instance, reason) > 0
        for: 15m
        labels:
          severity: warning
        annotations:
          dashboard: "http://localhost:3000/d/wNf0q_kZk?viewPanel=58&var-instance={{ $labels.instance }}"
          summary: "Some rows are rejected on \"{{ $labels.instance }}\" on ingestion attempt"
          description: "VM is refusing to ingest rows on \"{{ $labels.instance }}\" due to the
            following reason: \"{{ $labels.reason }}\""

      - alert: TooHighChurnRate
        expr: |
          (
            sum(rate(vm_new_timeseries_created_total[5m])) by(instance)
            /
            sum(rate(vm_rows_inserted_total[5m])) by (instance)
          ) > 0.1
        for: 15m
        labels:
          severity: warning
        annotations:
          dashboard: "http://localhost:3000/d/wNf0q_kZk?viewPanel=66&var-instance={{ $labels.instance }}"
          summary: "Churn rate is more than 10% on \"{{ $labels.instance }}\" for the last 15m"
          description: "VM constantly creates new time series on \"{{ $labels.instance }}\".\n
            This effect is known as Churn Rate.\n
            High Churn Rate is tightly connected with database performance and may
            result in unexpected OOMs or slow queries."
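      # Worked example for the expression above, using hypothetical figures:
      # 2000 new series/s created while 10000 rows/s are inserted gives
      # 2000 / 10000 = 0.2, which is above the 0.1 threshold, so the alert
      # would fire once the condition has held for 15m.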

      - alert: TooHighChurnRate24h
        expr: |
          sum(increase(vm_new_timeseries_created_total[24h])) by(instance)
          >
          (sum(vm_cache_entries{type="storage/hour_metric_ids"}) by(instance) * 3)
        for: 15m
        labels:
          severity: warning
        annotations:
          dashboard: "http://localhost:3000/d/wNf0q_kZk?viewPanel=66&var-instance={{ $labels.instance }}"
          summary: "Too high number of new series on \"{{ $labels.instance }}\" created over last 24h"
          description: "The number of new time series created over the last 24h is 3x higher than
            the current number of active series on \"{{ $labels.instance }}\".\n
            This effect is known as Churn Rate.\n
            High Churn Rate is tightly connected with database performance and may
            result in unexpected OOMs or slow queries."

      - alert: TooHighSlowInsertsRate
        expr: |
          (
            sum(rate(vm_slow_row_inserts_total[5m])) by(instance)
            /
            sum(rate(vm_rows_inserted_total[5m])) by (instance)
          ) > 0.05
        for: 15m
        labels:
          severity: warning
        annotations:
          dashboard: "http://localhost:3000/d/wNf0q_kZk?viewPanel=68&var-instance={{ $labels.instance }}"
          summary: "Percentage of slow inserts is more than 5% on \"{{ $labels.instance }}\" for the last 15m"
          description: "High rate of slow inserts on \"{{ $labels.instance }}\" may be a sign of resource exhaustion
            for the current load. It is likely more RAM is needed for optimal handling of the current number of active time series."

      - alert: LabelsLimitExceededOnIngestion
        expr: sum(increase(vm_metrics_with_dropped_labels_total[5m])) by (instance) > 0
        for: 15m
        labels:
          severity: warning
        annotations:
          dashboard: "http://localhost:3000/d/wNf0q_kZk?viewPanel=74&var-instance={{ $labels.instance }}"
          summary: "Metrics ingested into ({{ $labels.instance }}) are exceeding the labels limit"
          description: "VictoriaMetrics limits the number of labels per metric with the `-maxLabelsPerTimeseries` command-line flag.\n
           This prevents ingestion of metrics with too many labels. Please verify that `-maxLabelsPerTimeseries` is configured
           correctly or that clients which send these metrics aren't misbehaving."