VictoriaMetrics/deployment/docker/alerts-health.yml

# This file contains the default list of alerts for various VM components.
# The following alerts are recommended for any VM installation.
# They are only recommendations and may require updates
# and threshold calibration for each specific setup.
groups:
  - name: vm-health
    # Note the `job` filter and update it according to your setup.
    rules:
      - alert: TooManyRestarts
        expr: changes(process_start_time_seconds{job=~".*(victoriametrics|vmselect|vminsert|vmstorage|vmagent|vmalert|vmsingle|vmalertmanager|vmauth).*"}[15m]) > 2
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.job }} too many restarts (instance {{ $labels.instance }})"
          description: "Job {{ $labels.job }} (instance {{ $labels.instance }}) has restarted more than twice in the last 15 minutes.
            It might be crashlooping."

      - alert: ServiceDown
        expr: up{job=~".*(victoriametrics|vmselect|vminsert|vmstorage|vmagent|vmalert|vmsingle|vmalertmanager|vmauth).*"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.job }} is down on {{ $labels.instance }}"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 2 minutes."
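        # A sketch of narrowing the `job` filter used by the two rules above.
        # Label regex matching is fully anchored, which is why the default filter
        # wraps the component names in `.*`. The `prod-` prefix below is a
        # hypothetical naming scheme, not something this file defines;
        # substitute the job names from your own scrape configuration.
        #
        # expr: up{job=~"prod-(vmselect|vminsert|vmstorage)"} == 0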

      - alert: ProcessNearFDLimits
        expr: (process_max_fds - process_open_fds) < 100
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Number of free file descriptors is less than 100 for \"{{ $labels.job }}\"(\"{{ $labels.instance }}\") for the last 5m"
          description: "Exhausting the OS file descriptor limit can cause severe degradation of the process.
            Consider increasing the limit as soon as possible."
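        # A possible ratio-based variant of the check above, shown only as a sketch.
        # The rule name, the 0.9 threshold and the `warning` severity are assumptions
        # for illustration; a ratio may suit processes with a very large FD limit
        # better than a fixed count of 100 free descriptors.
        #
        # - alert: ProcessNearFDLimitsRatio
        #   expr: (process_open_fds / process_max_fds) > 0.9
        #   for: 5m
        #   labels:
        #     severity: warning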

      - alert: TooHighMemoryUsage
        expr: (min_over_time(process_resident_memory_anon_bytes[10m]) / vm_available_memory_bytes) > 0.8
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "More than 80% of memory is used by \"{{ $labels.job }}\"(\"{{ $labels.instance }}\")"
          description: "Too high memory usage may result in multiple issues such as OOMs or degraded performance.
            Consider either increasing available memory or decreasing the load on the process."

      - alert: TooHighCPUUsage
        expr: rate(process_cpu_seconds_total[5m]) / process_cpu_cores_available > 0.9
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "More than 90% of CPU is used by \"{{ $labels.job }}\"(\"{{ $labels.instance }}\") during the last 5m"
          description: "Too high CPU usage may be a sign of insufficient resources and may make the process unstable.
            Consider either increasing available CPU resources or decreasing the load on the process."

      - alert: TooManyLogs
        expr: sum(increase(vm_log_messages_total{level="error"}[5m])) without (app_version, location) > 0
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Too many logs printed for job \"{{ $labels.job }}\" ({{ $labels.instance }})"
          description: "Logging rate for job \"{{ $labels.job }}\" ({{ $labels.instance }}) is {{ $value }} for the last 15m.\n
            It is worth checking the logs for specific error messages."

      - alert: TooManyTSIDMisses
        expr: rate(vm_missing_tsids_for_metric_id_total[5m]) > 0
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Too many TSID misses for job \"{{ $labels.job }}\" ({{ $labels.instance }})"
          description: "The rate of TSID misses during query lookups is too high for \"{{ $labels.job }}\" ({{ $labels.instance }}).\n
            Make sure you're running VictoriaMetrics v1.85.3 or higher.\n
            Related issue: https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3502"

      - alert: ConcurrentInsertsHitTheLimit
        expr: avg_over_time(vm_concurrent_insert_current[1m]) >= vm_concurrent_insert_capacity
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.job }} on instance {{ $labels.instance }} is constantly hitting concurrent inserts limit"
          description: "The limit of concurrent inserts on instance {{ $labels.instance }} depends on the number of CPUs.\n
            Usually, when a component constantly hits the limit, it is likely overloaded and requires more CPU.
            In some cases, for components like vmagent or vminsert, the alert might trigger if there are too many clients
            making write attempts. If vmagent's or vminsert's CPU usage and network saturation are at normal levels, then
            it might be worth adjusting the `-maxConcurrentInserts` command-line flag."
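        # A sketch of raising the limit mentioned above in a docker-compose service
        # definition. The service name and the value 32 are assumptions made for
        # illustration; choose a value based on available CPU and observed load.
        #
        # services:
        #   vmagent:
        #     command:
        #       - "-maxConcurrentInserts=32"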

      - alert: IndexDBRecordsDrop
        expr: increase(vm_indexdb_items_dropped_total[5m]) > 0
        labels:
          severity: critical
        annotations:
          summary: "IndexDB skipped registering items during data ingestion with reason={{ $labels.reason }}."
          description: "VictoriaMetrics may skip registering new time series during ingestion if they fail validation.
            For example, `reason=too_long_item` means that the time series exceeds the 64KB limit. Please reduce the number
            of labels or the length of label values for such series, or enforce these limits via the `-maxLabelsPerTimeseries` and
            `-maxLabelValueLen` command-line flags."
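        # A sketch of enforcing the limits mentioned above in a docker-compose
        # service definition. The service name and both values are assumptions
        # for illustration; check which components in your version accept these
        # flags and pick limits that fit your data.
        #
        # services:
        #   victoriametrics:
        #     command:
        #       - "-maxLabelsPerTimeseries=30"
        #       - "-maxLabelValueLen=1024"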