# VictoriaMetrics/deployment/docker/alerts-cluster.yml

# This file contains the default list of alerts for a VictoriaMetrics cluster.
# The alerts below are just recommendations and may require some updates
# and threshold calibration for each specific setup.
groups:
  # The alerts group for the VM cluster assumes that the Grafana dashboard
  # https://grafana.com/grafana/dashboards/11176 is installed.
  # Please update the `dashboard` annotation according to your setup.
  - name: vmcluster
    interval: 30s
    concurrency: 2
    rules:
      - alert: DiskRunsOutOfSpaceIn3Days
        expr: |
          vm_free_disk_space_bytes / ignoring(path)
          (
             (
              rate(vm_rows_added_to_storage_total[1d]) -
              ignoring(type) rate(vm_deduplicated_samples_total{type="merge"}[1d])
             )
            * scalar(
              sum(vm_data_size_bytes{type!~"indexdb.*"}) /
              sum(vm_rows{type!~"indexdb.*"})
             )
          ) < 3 * 24 * 3600 > 0
        for: 30m
        labels:
          severity: critical
        annotations:
          dashboard: "http://localhost:3000/d/oS7Bi_0Wz?viewPanel=113&var-instance={{ $labels.instance }}"
          summary: "Instance {{ $labels.instance }} will run out of disk space in 3 days"
          description: "Taking into account the current ingestion rate, free disk space will be enough only
              for {{ $value | humanizeDuration }} on instance {{ $labels.instance }}.\n
              Consider limiting the ingestion rate, decreasing retention or scaling up the disk space if possible."
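      # Worked example for the DiskRunsOutOfSpaceIn3Days expression above.
      # All numbers are illustrative assumptions, not measurements from a real setup:
      #   free disk space (vm_free_disk_space_bytes)         = 20e9 bytes
      #   net ingestion rate (rows added minus deduplicated) = 100000 rows/s
      #   average row size (data size / rows, excl. indexdb) = 1 byte/row
      # The expression yields the estimated time until the disk is full:
      #   20e9 / (100000 * 1) = 200000 seconds, or roughly 2.3 days.
      # 200000 is below the threshold 3 * 24 * 3600 = 259200 and above 0,
      # so the alert would fire once the condition holds for 30m.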

      - alert: DiskRunsOutOfSpace
        expr: |
          sum(vm_data_size_bytes) by(instance) /
          (
           sum(vm_free_disk_space_bytes) by(instance) +
           sum(vm_data_size_bytes) by(instance)
          ) > 0.8
        for: 30m
        labels:
          severity: critical
        annotations:
          dashboard: "http://localhost:3000/d/oS7Bi_0Wz?viewPanel=200&var-instance={{ $labels.instance }}"
          summary: "Instance {{ $labels.instance }} will run out of disk space soon"
          description: "Disk utilisation on instance {{ $labels.instance }} is more than 80%.\n
            Having less than 20% of free disk space could cripple merge processes and overall performance.
            Consider limiting the ingestion rate, decreasing retention or scaling up the disk space if possible."

      - alert: RequestErrorsToAPI
        expr: increase(vm_http_request_errors_total[5m]) > 0
        for: 15m
        labels:
          severity: warning
          show_at: dashboard
        annotations:
          dashboard: "http://localhost:3000/d/oS7Bi_0Wz?viewPanel=52&var-instance={{ $labels.instance }}"
          summary: "Too many errors served for {{ $labels.job }} path {{ $labels.path }} (instance {{ $labels.instance }})"
          description: "Requests to path {{ $labels.path }} are receiving errors.
            Please verify if clients are sending correct requests."

      - alert: RPCErrors
        expr: |
          (
           sum(increase(vm_rpc_connection_errors_total[5m])) by(job, instance)
           +
           sum(increase(vm_rpc_dial_errors_total[5m])) by(job, instance)
           +
           sum(increase(vm_rpc_handshake_errors_total[5m])) by(job, instance)
          ) > 0
        for: 15m
        labels:
          severity: warning
          show_at: dashboard
        annotations:
          dashboard: "http://localhost:3000/d/oS7Bi_0Wz?viewPanel=44&var-instance={{ $labels.instance }}"
          summary: "Too many RPC errors for {{ $labels.job }} (instance {{ $labels.instance }})"
          description: "RPC errors are interconnection errors between cluster components.\n
            Possible reasons for errors are misconfiguration, overload, network blips or unreachable components."

      - alert: ConcurrentFlushesHitTheLimit
        expr: avg_over_time(vm_concurrent_addrows_current[1m]) >= vm_concurrent_addrows_capacity
        for: 15m
        labels:
          severity: warning
          show_at: dashboard
        annotations:
          dashboard: "http://localhost:3000/d/oS7Bi_0Wz?viewPanel=133&var-instance={{ $labels.instance }}"
          summary: "vmstorage on instance {{ $labels.instance }} is constantly hitting the concurrent flushes limit"
          description: "The limit of concurrent flushes on instance {{ $labels.instance }} is equal to the number of CPUs.\n
            When vmstorage constantly hits the limit, it means that storage is overloaded and requires more CPU."

      - alert: RowsRejectedOnIngestion
        expr: sum(rate(vm_rows_ignored_total[5m])) by (instance, reason) > 0
        for: 15m
        labels:
          severity: warning
        annotations:
          dashboard: "http://localhost:3000/d/oS7Bi_0Wz?viewPanel=135&var-instance={{ $labels.instance }}"
          summary: "Some rows are rejected on \"{{ $labels.instance }}\" on ingestion attempt"
          description: "VM is refusing to ingest rows on \"{{ $labels.instance }}\" due to the
            following reason: \"{{ $labels.reason }}\""

      - alert: TooHighChurnRate
        expr: |
          (
             sum(rate(vm_new_timeseries_created_total[5m]))
             /
             sum(rate(vm_rows_inserted_total[5m]))
           ) > 0.1
        for: 15m
        labels:
          severity: warning
        annotations:
          dashboard: "http://localhost:3000/d/oS7Bi_0Wz?viewPanel=102"
          summary: "Churn rate is more than 10% for the last 15m"
          description: "VM constantly creates new time series.\n
            This effect is known as Churn Rate.\n
            High Churn Rate is tightly connected with database performance and may
            result in unexpected OOMs or slow queries."

      - alert: TooHighChurnRate24h
        expr: |
          sum(increase(vm_new_timeseries_created_total[24h]))
          >
          (sum(vm_cache_entries{type="storage/hour_metric_ids"})* 3)
        for: 15m
        labels:
          severity: warning
        annotations:
          dashboard: "http://localhost:3000/d/oS7Bi_0Wz?viewPanel=102"
          summary: "Too many new series created over the last 24h"
          description: "The number of new time series created over the last 24h is more than 3 times higher than
            the current number of active series.\n
            This effect is known as Churn Rate.\n
            High Churn Rate is tightly connected with database performance and may
            result in unexpected OOMs or slow queries."
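      # Worked example for TooHighChurnRate24h (illustrative numbers only):
      #   new series created over the last 24h         = 3500000
      #   entries in the storage/hour_metric_ids cache = 1000000 (used above as the active series count)
      #   threshold: 1000000 * 3 = 3000000
      # 3500000 > 3000000, so the alert would fire once the condition holds for 15m.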

      - alert: TooHighSlowInsertsRate
        expr: |
          (
             sum(rate(vm_slow_row_inserts_total[5m]))
             /
             sum(rate(vm_rows_inserted_total[5m]))
           ) > 0.05
        for: 15m
        labels:
          severity: warning
        annotations:
          dashboard: "http://localhost:3000/d/oS7Bi_0Wz?viewPanel=108"
          summary: "Percentage of slow inserts is more than 5% for the last 15m"
          description: "High rate of slow inserts may be a sign of resource exhaustion
            for the current load. It is likely more RAM is needed for optimal handling of the current number of active time series."

      - alert: ProcessNearFDLimits
        expr: (process_max_fds - process_open_fds) < 100
        for: 5m
        labels:
          severity: critical
        annotations:
          dashboard: "http://localhost:3000/d/oS7Bi_0Wz?viewPanel=117&var-instance={{ $labels.instance }}"
          summary: "Number of free file descriptors is less than 100 for \"{{ $labels.job }}\"(\"{{ $labels.instance }}\") for the last 5m"
          description: "Exhausting the OS file descriptor limit can cause severe degradation of the process.
            Consider increasing the limit as soon as possible."

      - alert: LabelsLimitExceededOnIngestion
        expr: sum(increase(vm_metrics_with_dropped_labels_total[5m])) by (instance) > 0
        for: 15m
        labels:
          severity: warning
        annotations:
          dashboard: "http://localhost:3000/d/oS7Bi_0Wz?viewPanel=116&var-instance={{ $labels.instance }}"
          summary: "Metrics ingested to vminsert on {{ $labels.instance }} are exceeding labels limit"
          description: "VictoriaMetrics limits the number of labels per metric with the `-maxLabelsPerTimeseries` command-line flag.\n
           This prevents ingestion of metrics with too many labels. Please verify that `-maxLabelsPerTimeseries` is configured
           correctly or that clients which send these metrics aren't misbehaving."

      - alert: VminsertVmstorageConnectionIsSaturated
        expr: rate(vm_rpc_send_duration_seconds_total[5m]) > 0.9
        for: 15m
        labels:
          severity: warning
          show_at: dashboard
        annotations:
          dashboard: "http://localhost:3000/d/oS7Bi_0Wz?viewPanel=139&var-instance={{ $labels.instance }}"
          summary: "Connection between vminsert on {{ $labels.instance }} and vmstorage on {{ $labels.addr }} is saturated"
          description: "The connection between vminsert (instance {{ $labels.instance }}) and vmstorage (instance {{ $labels.addr }})
            is saturated by more than 90% and vminsert won't be able to keep up.\n
            This usually means that more vminsert or vmstorage nodes must be added to the cluster in order to increase
            the total number of vminsert -> vmstorage links."
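      # How to read the VminsertVmstorageConnectionIsSaturated threshold
      # (a sketch with illustrative numbers, assuming the counter accumulates
      # the seconds spent sending data over a single vminsert -> vmstorage connection):
      #   rate(...[5m]) is the counter's increase divided by the 300s window.
      #   If the counter grows by 285 seconds in 5 minutes, then 285 / 300 = 0.95,
      #   i.e. the connection was busy sending about 95% of the time.
      # 0.95 > 0.9, so the alert would fire once the condition holds for 15m.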