Mirror of https://github.com/VictoriaMetrics/VictoriaMetrics.git, synced 2024-11-21 14:44:00 +00:00
deployment/docker: move cluster compose env to master branch (#3130)
* deployment/docker: move cluster compose env to master branch

  The change is supposed to simplify maintenance of the single/cluster docker-compose envs, alerts and dashboards. It is also supposed to reduce confusion for users when looking for cluster-related alerts/configs.

  Signed-off-by: hagen1778 <roman@victoriametrics.com>

* deployment/docker: move cluster compose env to master branch

  Review updates.

  Signed-off-by: hagen1778 <roman@victoriametrics.com>

Signed-off-by: hagen1778 <roman@victoriametrics.com>
parent 6a794ec5d5 · commit 5714a68ac6
12 changed files with 618 additions and 192 deletions
Makefile
@@ -177,3 +177,16 @@ package-via-docker-386:

 remove-docker-images:
 	docker image ls --format '{{.Repository}}\t{{.ID}}' | awk '{print $$2}' | xargs docker image rm -f

+docker-single-up:
+	docker-compose -f deployment/docker/docker-compose.yml up
+
+docker-single-down:
+	docker-compose -f deployment/docker/docker-compose.yml down -v
+
+docker-cluster-up:
+	docker-compose -f deployment/docker/docker-compose-cluster.yml up
+
+docker-cluster-down:
+	docker-compose -f deployment/docker/docker-compose-cluster.yml down -v
deployment/docker/README.md
@@ -1,12 +1,33 @@
 # Docker compose environment for VictoriaMetrics

-To spin-up VictoriaMetrics, vmagent, vmalert, Alertmanager and Grafana run the following command:
-
-`docker-compose up`
-
-For clustered version check [docker compose in cluster branch](https://github.com/VictoriaMetrics/VictoriaMetrics/tree/cluster/deployment/docker).
-
-## VictoriaMetrics
+Docker compose environment for VictoriaMetrics includes VictoriaMetrics components,
+[Alertmanager](https://prometheus.io/docs/alerting/latest/alertmanager/)
+and [Grafana](https://grafana.com/).
+
+To start the docker-compose environment, ensure you have Docker installed and running, and that you have access to the Internet.
+All commands should be executed from the root directory of this repo.
+
+To spin up the environment for single-server VictoriaMetrics, run the following command:
+```
+make docker-single-up
+```
+
+To shut down the docker compose environment for the single server, run the following command:
+```
+make docker-single-down
+```
+
+For the cluster version, the command is the following:
+```
+make docker-cluster-up
+```
+
+To shut down the docker compose environment for the cluster version, run the following command:
+```
+make docker-cluster-down
+```
+
+## VictoriaMetrics single server

 VictoriaMetrics will be accessible on the following ports:

@@ -14,6 +35,40 @@ VictoriaMetrics will be accessible on the following ports:
 * `--opentsdbListenAddr=:4242`
 * `--httpListenAddr=:8428`

+The communication scheme between components is the following:
+* [vmagent](#vmagent) sends scraped metrics to VictoriaMetrics;
+* [grafana](#grafana) is configured with a datasource pointing to VictoriaMetrics;
+* [vmalert](#vmalert) is configured to query VictoriaMetrics and send alerts state
+  and recording rules back to it;
+* [alertmanager](#alertmanager) is configured to receive notifications from vmalert.
+
+To access `vmalert` use link [http://localhost:8428/vmalert](http://localhost:8428/vmalert/).
+
+To access [vmui](https://docs.victoriametrics.com/Single-server-VictoriaMetrics.html#vmui)
+use link [http://localhost:8428/vmui](http://localhost:8428/vmui).
+
+## VictoriaMetrics cluster
+
+VictoriaMetrics cluster environment consists of the vminsert, vmstorage and vmselect components.
+vmselect has exposed port `:8481`, vminsert has exposed port `:8480`, and the rest of the
+components are available only inside the environment.
+
+The communication scheme between components is the following:
+* [vmagent](#vmagent) sends scraped metrics to vminsert;
+* vminsert forwards data to vmstorage;
+* vmselect is connected to vmstorage for querying data;
+* [grafana](#grafana) is configured with a datasource pointing to vmselect;
+* [vmalert](#vmalert) is configured to query vmselect and send alerts state
+  and recording rules to vminsert;
+* [alertmanager](#alertmanager) is configured to receive notifications from vmalert.
+
+To access `vmalert` via `vmselect`
+use link [http://localhost:8481/select/0/prometheus/vmalert](http://localhost:8481/select/0/prometheus/vmalert/).
+
+To access [vmui](https://docs.victoriametrics.com/Single-server-VictoriaMetrics.html#vmui)
+use link [http://localhost:8481/select/0/prometheus/vmui](http://localhost:8481/select/0/prometheus/vmui).

 ## vmagent

 vmagent is used for scraping and pushing timeseries to

@@ -48,6 +103,11 @@ Default credential:

 Grafana is provisioned by default with following entities:

-* VictoriaMetrics datasource
-* Prometheus datasource
-* VictoriaMetrics overview dashboard
+* `VictoriaMetrics` datasource
+* `VictoriaMetrics - cluster` datasource
+* `VictoriaMetrics overview` dashboard
+* `VictoriaMetrics - cluster` dashboard
+* `VictoriaMetrics - vmagent` dashboard
+* `VictoriaMetrics - vmalert` dashboard
+
+Remember to pick the `VictoriaMetrics - cluster` datasource when viewing the `VictoriaMetrics - cluster` dashboard.
deployment/docker/alerts-cluster.yml (new file, 199 lines)
@@ -0,0 +1,199 @@
# File contains default list of alerts for VictoriaMetrics cluster.
# The alerts below are just recommendations and may require some updates
# and threshold calibration according to every specific setup.
groups:
  # Alerts group for VM cluster assumes that Grafana dashboard
  # https://grafana.com/grafana/dashboards/11176 is installed.
  # Please, update the `dashboard` annotation according to your setup.
  - name: vmcluster
    interval: 30s
    concurrency: 2
    rules:
      - alert: DiskRunsOutOfSpaceIn3Days
        expr: |
          vm_free_disk_space_bytes / ignoring(path)
          (
            (
              rate(vm_rows_added_to_storage_total[1d]) -
              ignoring(type) rate(vm_deduplicated_samples_total{type="merge"}[1d])
            )
            * scalar(
              sum(vm_data_size_bytes{type!="indexdb"}) /
              sum(vm_rows{type!="indexdb"})
            )
          ) < 3 * 24 * 3600 > 0
        for: 30m
        labels:
          severity: critical
        annotations:
          dashboard: "http://localhost:3000/d/oS7Bi_0Wz?viewPanel=113&var-instance={{ $labels.instance }}"
          summary: "Instance {{ $labels.instance }} will run out of disk space in 3 days"
          description: "Taking into account the current ingestion rate, free disk space will be enough only
            for {{ $value | humanizeDuration }} on instance {{ $labels.instance }}.\n
            Consider limiting the ingestion rate, decreasing retention or scaling up the disk space if possible."

      - alert: DiskRunsOutOfSpace
        expr: |
          sum(vm_data_size_bytes) by(instance) /
          (
            sum(vm_free_disk_space_bytes) by(instance) +
            sum(vm_data_size_bytes) by(instance)
          ) > 0.8
        for: 30m
        labels:
          severity: critical
        annotations:
          dashboard: "http://localhost:3000/d/oS7Bi_0Wz?viewPanel=110&var-instance={{ $labels.instance }}"
          summary: "Instance {{ $labels.instance }} will run out of disk space soon"
          description: "Disk utilisation on instance {{ $labels.instance }} is more than 80%.\n
            Having less than 20% of free disk space could cripple merge processes and overall performance.
            Consider limiting the ingestion rate, decreasing retention or scaling up the disk space if possible."

      - alert: RequestErrorsToAPI
        expr: increase(vm_http_request_errors_total[5m]) > 0
        for: 15m
        labels:
          severity: warning
        annotations:
          dashboard: "http://localhost:3000/d/oS7Bi_0Wz?viewPanel=52&var-instance={{ $labels.instance }}"
          summary: "Too many errors served for {{ $labels.job }} path {{ $labels.path }} (instance {{ $labels.instance }})"
          description: "Requests to path {{ $labels.path }} are receiving errors.
            Please verify if clients are sending correct requests."

      - alert: RPCErrors
        expr: |
          (
            sum(increase(vm_rpc_connection_errors_total[5m])) by(job, instance)
            +
            sum(increase(vm_rpc_dial_errors_total[5m])) by(job, instance)
            +
            sum(increase(vm_rpc_handshake_errors_total[5m])) by(job, instance)
          ) > 0
        for: 15m
        labels:
          severity: warning
        annotations:
          dashboard: "http://localhost:3000/d/oS7Bi_0Wz?viewPanel=44&var-instance={{ $labels.instance }}"
          summary: "Too many RPC errors for {{ $labels.job }} (instance {{ $labels.instance }})"
          description: "RPC errors are interconnection errors between cluster components.\n
            Possible reasons for errors are misconfiguration, overload, network blips or unreachable components."

      - alert: ConcurrentFlushesHitTheLimit
        expr: avg_over_time(vm_concurrent_addrows_current[1m]) >= vm_concurrent_addrows_capacity
        for: 15m
        labels:
          severity: warning
        annotations:
          dashboard: "http://localhost:3000/d/oS7Bi_0Wz?viewPanel=133&var-instance={{ $labels.instance }}"
          summary: "vmstorage on instance {{ $labels.instance }} is constantly hitting concurrent flushes limit"
          description: "The limit of concurrent flushes on instance {{ $labels.instance }} is equal to the number of CPUs.\n
            When vmstorage constantly hits the limit, it means that storage is overloaded and requires more CPU."

      - alert: TooManyLogs
        expr: sum(increase(vm_log_messages_total{level="error"}[5m])) by (job, instance) > 0
        for: 15m
        labels:
          severity: warning
        annotations:
          dashboard: "http://localhost:3000/d/oS7Bi_0Wz?viewPanel=104&var-instance={{ $labels.instance }}"
          summary: "Too many logs printed for job \"{{ $labels.job }}\" ({{ $labels.instance }})"
          description: "Logging rate for job \"{{ $labels.job }}\" ({{ $labels.instance }}) is {{ $value }} for the last 15m.\n
            It is worth checking the logs for specific error messages."

      - alert: RowsRejectedOnIngestion
        expr: sum(rate(vm_rows_ignored_total[5m])) by (instance, reason) > 0
        for: 15m
        labels:
          severity: warning
        annotations:
          dashboard: "http://localhost:3000/d/oS7Bi_0Wz?viewPanel=135&var-instance={{ $labels.instance }}"
          summary: "Some rows are rejected on \"{{ $labels.instance }}\" on ingestion attempt"
          description: "VM is rejecting rows on \"{{ $labels.instance }}\" due to the
            following reason: \"{{ $labels.reason }}\""

      - alert: TooHighChurnRate
        expr: |
          (
            sum(rate(vm_new_timeseries_created_total[5m]))
            /
            sum(rate(vm_rows_inserted_total[5m]))
          ) > 0.1
        for: 15m
        labels:
          severity: warning
        annotations:
          dashboard: "http://localhost:3000/d/oS7Bi_0Wz?viewPanel=102"
          summary: "Churn rate is more than 10% for the last 15m"
          description: "VM constantly creates new time series.\n
            This effect is known as Churn Rate.\n
            High Churn Rate is tightly connected with database performance and may
            result in unexpected OOMs or slow queries."

      - alert: TooHighChurnRate24h
        expr: |
          sum(increase(vm_new_timeseries_created_total[24h]))
          >
          (sum(vm_cache_entries{type="storage/hour_metric_ids"}) * 3)
        for: 15m
        labels:
          severity: warning
        annotations:
          dashboard: "http://localhost:3000/d/oS7Bi_0Wz?viewPanel=102"
          summary: "Too high number of new series created over last 24h"
          description: "The number of new time series created over the last 24h is 3x higher than
            the current number of active series.\n
            This effect is known as Churn Rate.\n
            High Churn Rate is tightly connected with database performance and may
            result in unexpected OOMs or slow queries."

      - alert: TooHighSlowInsertsRate
        expr: |
          (
            sum(rate(vm_slow_row_inserts_total[5m]))
            /
            sum(rate(vm_rows_inserted_total[5m]))
          ) > 0.05
        for: 15m
        labels:
          severity: warning
        annotations:
          dashboard: "http://localhost:3000/d/oS7Bi_0Wz?viewPanel=108"
          summary: "Percentage of slow inserts is more than 5% for the last 15m"
          description: "A high rate of slow inserts may be a sign of resource exhaustion
            for the current load. It is likely more RAM is needed for optimal handling of the current number of active time series."

      - alert: ProcessNearFDLimits
        expr: (process_max_fds - process_open_fds) < 100
        for: 5m
        labels:
          severity: critical
        annotations:
          dashboard: "http://localhost:3000/d/oS7Bi_0Wz?viewPanel=117&var-instance={{ $labels.instance }}"
          summary: "Number of free file descriptors is less than 100 for \"{{ $labels.job }}\"(\"{{ $labels.instance }}\") for the last 5m"
          description: "Exhausting the OS file descriptors limit can cause severe degradation of the process.
            Consider increasing the limit as fast as possible."

      - alert: LabelsLimitExceededOnIngestion
        expr: sum(increase(vm_metrics_with_dropped_labels_total[5m])) by (instance) > 0
        for: 15m
        labels:
          severity: warning
        annotations:
          dashboard: "http://localhost:3000/d/oS7Bi_0Wz?viewPanel=116&var-instance={{ $labels.instance }}"
          summary: "Metrics ingested to vminsert on {{ $labels.instance }} are exceeding labels limit"
          description: "VictoriaMetrics limits the number of labels per metric with the `-maxLabelsPerTimeseries` command-line flag.\n
            This prevents ingestion of metrics with too many labels. Please verify that `-maxLabelsPerTimeseries` is configured
            correctly or that clients which send these metrics aren't misbehaving."

      - alert: VminsertVmstorageConnectionIsSaturated
        expr: rate(vm_rpc_send_duration_seconds_total[5m]) > 0.9
        for: 15m
        labels:
          severity: warning
        annotations:
          dashboard: "http://localhost:3000/d/oS7Bi_0Wz?viewPanel=139&var-instance={{ $labels.instance }}"
          summary: "Connection between vminsert on {{ $labels.instance }} and vmstorage on {{ $labels.addr }} is saturated"
          description: "The connection between vminsert (instance {{ $labels.instance }}) and vmstorage (instance {{ $labels.addr }})
            is saturated by more than 90% and vminsert won't be able to keep up.\n
            This usually means that more vminsert or vmstorage nodes must be added to the cluster in order to increase
            the total number of vminsert -> vmstorage links."
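Rule files like the one above can be checked before they are mounted into the compose environment. A minimal sketch, assuming vmalert's `-dryRun` mode for validating rule files without starting the service:

```console
# validate the rules file without starting vmalert (assumes the -dryRun flag)
vmalert -dryRun -rule=deployment/docker/alerts-cluster.yml
```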
deployment/docker/alerts-health.yml (new file, 54 lines)
@@ -0,0 +1,54 @@
# File contains default list of alerts for VM components.
# The alerts below are just recommendations and may require some updates
# and threshold calibration according to every specific setup.
groups:
  - name: vm-health
    # note the `job` filter and update accordingly to your setup
    rules:
      - alert: TooManyRestarts
        expr: changes(process_start_time_seconds{job=~"victoriametrics|vmselect|vminsert|vmstorage|vmagent|vmalert"}[15m]) > 2
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.job }} too many restarts (instance {{ $labels.instance }})"
          description: "Job {{ $labels.job }} (instance {{ $labels.instance }}) has restarted more than twice in the last 15 minutes.
            It might be crashlooping."

      - alert: ServiceDown
        expr: up{job=~"victoriametrics|vmselect|vminsert|vmstorage|vmagent|vmalert"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.job }} is down on {{ $labels.instance }}"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 2 minutes."

      - alert: ProcessNearFDLimits
        expr: (process_max_fds - process_open_fds) < 100
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Number of free file descriptors is less than 100 for \"{{ $labels.job }}\"(\"{{ $labels.instance }}\") for the last 5m"
          description: "Exhausting the OS file descriptors limit can cause severe degradation of the process.
            Consider increasing the limit as fast as possible."

      - alert: TooHighMemoryUsage
        expr: (process_resident_memory_anon_bytes / vm_available_memory_bytes) > 0.9
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "More than 90% of memory is used by \"{{ $labels.job }}\"(\"{{ $labels.instance }}\") during the last 5m"
          description: "Too high memory usage may result in multiple issues such as OOMs or degraded performance.
            Consider either increasing available memory or decreasing the load on the process."

      - alert: TooHighCPUUsage
        expr: rate(process_cpu_seconds_total[5m]) / process_cpu_cores_available > 0.9
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "More than 90% of CPU is used by \"{{ $labels.job }}\"(\"{{ $labels.instance }}\") during the last 5m"
          description: "Too high CPU usage may be a sign of insufficient resources and may make the process unstable.
            Consider either increasing available CPU resources or decreasing the load on the process."
deployment/docker/alerts-vmagent.yml (new file, 122 lines)
@@ -0,0 +1,122 @@
# File contains default list of alerts for vmagent service.
# The alerts below are just recommendations and may require some updates
# and threshold calibration according to every specific setup.
groups:
  # Alerts group for vmagent assumes that Grafana dashboard
  # https://grafana.com/grafana/dashboards/12683 is installed.
  # Pls update the `dashboard` annotation according to your setup.
  - name: vmagent
    interval: 30s
    concurrency: 2
    rules:
      - alert: PersistentQueueIsDroppingData
        expr: sum(increase(vm_persistentqueue_bytes_dropped_total[5m])) by (job, instance) > 0
        for: 10m
        labels:
          severity: critical
        annotations:
          dashboard: "http://localhost:3000/d/G7Z9GzMGz?viewPanel=49&var-instance={{ $labels.instance }}"
          summary: "Instance {{ $labels.instance }} is dropping data from persistent queue"
          description: "Vmagent dropped {{ $value | humanize1024 }} from persistent queue
            on instance {{ $labels.instance }} for the last 10m."

      - alert: RejectedRemoteWriteDataBlocksAreDropped
        expr: sum(increase(vmagent_remotewrite_packets_dropped_total[5m])) by (job, instance) > 0
        for: 15m
        labels:
          severity: warning
        annotations:
          dashboard: "http://localhost:3000/d/G7Z9GzMGz?viewPanel=79&var-instance={{ $labels.instance }}"
          summary: "Job \"{{ $labels.job }}\" on instance {{ $labels.instance }} drops data blocks rejected by the
            remote-write server. Check the logs to find the reason for the rejects."

      - alert: TooManyScrapeErrors
        expr: sum(increase(vm_promscrape_scrapes_failed_total[5m])) by (job, instance) > 0
        for: 15m
        labels:
          severity: warning
        annotations:
          dashboard: "http://localhost:3000/d/G7Z9GzMGz?viewPanel=31&var-instance={{ $labels.instance }}"
          summary: "Job \"{{ $labels.job }}\" on instance {{ $labels.instance }} fails to scrape targets for the last 15m"

      - alert: TooManyWriteErrors
        expr: |
          (sum(increase(vm_ingestserver_request_errors_total[5m])) by (job, instance)
          +
          sum(increase(vmagent_http_request_errors_total[5m])) by (job, instance)) > 0
        for: 15m
        labels:
          severity: warning
        annotations:
          dashboard: "http://localhost:3000/d/G7Z9GzMGz?viewPanel=77&var-instance={{ $labels.instance }}"
          summary: "Job \"{{ $labels.job }}\" on instance {{ $labels.instance }} responds with errors to write requests for the last 15m."

      - alert: TooManyRemoteWriteErrors
        expr: sum(rate(vmagent_remotewrite_retries_count_total[5m])) by(job, instance, url) > 0
        for: 15m
        labels:
          severity: warning
        annotations:
          dashboard: "http://localhost:3000/d/G7Z9GzMGz?viewPanel=61&var-instance={{ $labels.instance }}"
          summary: "Job \"{{ $labels.job }}\" on instance {{ $labels.instance }} fails to push to remote storage"
          description: "Vmagent fails to push data via remote write protocol to destination \"{{ $labels.url }}\"\n
            Ensure that the destination is up and reachable."

      - alert: RemoteWriteConnectionIsSaturated
        expr: |
          sum(rate(vmagent_remotewrite_send_duration_seconds_total[5m])) by(job, instance, url)
          > 0.9 * max(vmagent_remotewrite_queues) by(job, instance, url)
        for: 15m
        labels:
          severity: warning
        annotations:
          dashboard: "http://localhost:3000/d/G7Z9GzMGz?viewPanel=84&var-instance={{ $labels.instance }}"
          summary: "Remote write connection from \"{{ $labels.job }}\" (instance {{ $labels.instance }}) to {{ $labels.url }} is saturated"
          description: "The remote write connection between vmagent \"{{ $labels.job }}\" (instance {{ $labels.instance }}) and destination \"{{ $labels.url }}\"
            is saturated by more than 90% and vmagent won't be able to keep up.\n
            This usually means that the `-remoteWrite.queues` command-line flag must be increased in order to increase
            the number of connections per remote storage."

      - alert: PersistentQueueForWritesIsSaturated
        expr: rate(vm_persistentqueue_write_duration_seconds_total[5m]) > 0.9
        for: 15m
        labels:
          severity: warning
        annotations:
          dashboard: "http://localhost:3000/d/G7Z9GzMGz?viewPanel=98&var-instance={{ $labels.instance }}"
          summary: "Persistent queue writes for instance {{ $labels.instance }} are saturated"
          description: "Persistent queue writes for vmagent \"{{ $labels.job }}\" (instance {{ $labels.instance }})
            are saturated by more than 90% and vmagent won't be able to keep up with flushing data to disk.
            In this case, consider decreasing the load on the vmagent or improving the disk throughput."

      - alert: PersistentQueueForReadsIsSaturated
        expr: rate(vm_persistentqueue_read_duration_seconds_total[5m]) > 0.9
        for: 15m
        labels:
          severity: warning
        annotations:
          dashboard: "http://localhost:3000/d/G7Z9GzMGz?viewPanel=99&var-instance={{ $labels.instance }}"
          summary: "Persistent queue reads for instance {{ $labels.instance }} are saturated"
          description: "Persistent queue reads for vmagent \"{{ $labels.job }}\" (instance {{ $labels.instance }})
            are saturated by more than 90% and vmagent won't be able to keep up with reading data from the disk.
            In this case, consider decreasing the load on the vmagent or improving the disk throughput."

      - alert: SeriesLimitHourReached
        expr: (vmagent_hourly_series_limit_current_series / vmagent_hourly_series_limit_max_series) > 0.9
        labels:
          severity: critical
        annotations:
          dashboard: "http://localhost:3000/d/G7Z9GzMGz?viewPanel=88&var-instance={{ $labels.instance }}"
          summary: "Instance {{ $labels.instance }} reached 90% of the limit"
          description: "The max series limit set via the -remoteWrite.maxHourlySeries flag is close to being reached.
            Samples for new time series will then be dropped instead of being sent to remote storage systems."

      - alert: SeriesLimitDayReached
        expr: (vmagent_daily_series_limit_current_series / vmagent_daily_series_limit_max_series) > 0.9
        labels:
          severity: critical
        annotations:
          dashboard: "http://localhost:3000/d/G7Z9GzMGz?viewPanel=90&var-instance={{ $labels.instance }}"
          summary: "Instance {{ $labels.instance }} reached 90% of the limit"
          description: "The max series limit set via the -remoteWrite.maxDailySeries flag is close to being reached.
            Samples for new time series will then be dropped instead of being sent to remote storage systems."
deployment/docker/alerts.yml
@@ -1,60 +1,7 @@
-# File contains default list of alerts for vm-single and vmagent services.
+# File contains default list of alerts for VictoriaMetrics single server.
 # The alerts below are just recommendations and may require some updates
 # and threshold calibration according to every specific setup.
 groups:
-  - name: vm-health
-    # note the `job` filter and update accordingly to your setup
-    rules:
-      - alert: TooManyRestarts
-        expr: changes(process_start_time_seconds{job=~"victoriametrics|vmagent|vmalert"}[15m]) > 2
-        labels:
-          severity: critical
-        annotations:
-          summary: "{{ $labels.job }} too many restarts (instance {{ $labels.instance }})"
-          description: "Job {{ $labels.job }} has restarted more than twice in the last 15 minutes.
-            It might be crashlooping."
-
-      - alert: ServiceDown
-        expr: up{job=~"victoriametrics|vmagent|vmalert"} == 0
-        for: 2m
-        labels:
-          severity: critical
-        annotations:
-          summary: "Service {{ $labels.job }} is down on {{ $labels.instance }}"
-          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 2 minutes."
-
-      - alert: ProcessNearFDLimits
-        expr: (process_max_fds - process_open_fds) < 100
-        for: 5m
-        labels:
-          severity: critical
-        annotations:
-          summary: "Number of free file descriptors is less than 100 for \"{{ $labels.job }}\"(\"{{ $labels.instance }}\") for the last 5m"
-          description: "Exhausting OS file descriptors limit can cause severe degradation of the process.
-            Consider to increase the limit as fast as possible."
-
-      - alert: TooHighMemoryUsage
-        expr: (process_resident_memory_anon_bytes / vm_available_memory_bytes) > 0.9
-        for: 5m
-        labels:
-          severity: critical
-        annotations:
-          summary: "It is more than 90% of memory used by \"{{ $labels.job }}\"(\"{{ $labels.instance }}\") during the last 5m"
-          description: "Too high memory usage may result into multiple issues such as OOMs or degraded performance.
-            Consider to either increase available memory or decrease the load on the process."
-
-      - alert: TooHighCPUUsage
-        expr: rate(process_cpu_seconds_total[5m]) / process_cpu_cores_available > 0.9
-        for: 5m
-        labels:
-          severity: critical
-        annotations:
-          summary: "More than 90% of CPU is used by \"{{ $labels.job }}\"(\"{{ $labels.instance }}\") during the last 5m"
-          description: "Too high CPU usage may be a sign of insufficient resources and make process unstable.
-            Consider to either increase available CPU resources or decrease the load on the process."
-
   # Alerts group for VM single assumes that Grafana dashboard
   # https://grafana.com/grafana/dashboards/10229 is installed.
   # Pls update the `dashboard` annotation according to your setup.

@@ -207,123 +154,4 @@ groups:
           summary: "Metrics ingested in ({{ $labels.instance }}) are exceeding labels limit"
           description: "VictoriaMetrics limits the number of labels per each metric with `-maxLabelsPerTimeseries` command-line flag.\n
             This prevents from ingesting metrics with too many labels. Please verify that `-maxLabelsPerTimeseries` is configured
             correctly or that clients which send these metrics aren't misbehaving."
-
-  # Alerts group for vmagent assumes that Grafana dashboard
-  # https://grafana.com/grafana/dashboards/12683 is installed.
-  # Pls update the `dashboard` annotation according to your setup.
-  - name: vmagent
-    interval: 30s
-    concurrency: 2
-    rules:
-      - alert: PersistentQueueIsDroppingData
-        expr: sum(increase(vm_persistentqueue_bytes_dropped_total[5m])) by (job, instance) > 0
-        for: 10m
-        labels:
-          severity: critical
-        annotations:
-          dashboard: "http://localhost:3000/d/G7Z9GzMGz?viewPanel=49&var-instance={{ $labels.instance }}"
-          summary: "Instance {{ $labels.instance }} is dropping data from persistent queue"
-          description: "Vmagent dropped {{ $value | humanize1024 }} from persistent queue
-            on instance {{ $labels.instance }} for the last 10m."
-
-      - alert: RejectedRemoteWriteDataBlocksAreDropped
-        expr: sum(increase(vmagent_remotewrite_packets_dropped_total[5m])) by (job, instance) > 0
-        for: 15m
-        labels:
-          severity: warning
-        annotations:
-          dashboard: "http://localhost:3000/d/G7Z9GzMGz?viewPanel=79&var-instance={{ $labels.instance }}"
-          summary: "Job \"{{ $labels.job }}\" on instance {{ $labels.instance }} drops the rejected by
-            remote-write server data blocks. Check the logs to find the reason for rejects."
-
-      - alert: TooManyScrapeErrors
-        expr: sum(increase(vm_promscrape_scrapes_failed_total[5m])) by (job, instance) > 0
-        for: 15m
-        labels:
-          severity: warning
-        annotations:
-          dashboard: "http://localhost:3000/d/G7Z9GzMGz?viewPanel=31&var-instance={{ $labels.instance }}"
-          summary: "Job \"{{ $labels.job }}\" on instance {{ $labels.instance }} fails to scrape targets for last 15m"
-
-      - alert: TooManyWriteErrors
-        expr: |
-          (sum(increase(vm_ingestserver_request_errors_total[5m])) by (job, instance)
-          +
-          sum(increase(vmagent_http_request_errors_total[5m])) by (job, instance)) > 0
-        for: 15m
-        labels:
-          severity: warning
-        annotations:
-          dashboard: "http://localhost:3000/d/G7Z9GzMGz?viewPanel=77&var-instance={{ $labels.instance }}"
-          summary: "Job \"{{ $labels.job }}\" on instance {{ $labels.instance }} responds with errors to write requests for last 15m."
-
-      - alert: TooManyRemoteWriteErrors
-        expr: sum(rate(vmagent_remotewrite_retries_count_total[5m])) by(job, instance, url) > 0
-        for: 15m
-        labels:
-          severity: warning
-        annotations:
-          dashboard: "http://localhost:3000/d/G7Z9GzMGz?viewPanel=61&var-instance={{ $labels.instance }}"
-          summary: "Job \"{{ $labels.job }}\" on instance {{ $labels.instance }} fails to push to remote storage"
-          description: "Vmagent fails to push data via remote write protocol to destination \"{{ $labels.url }}\"\n
-            Ensure that destination is up and reachable."
-
-      - alert: RemoteWriteConnectionIsSaturated
-        expr: |
-          sum(rate(vmagent_remotewrite_send_duration_seconds_total[5m])) by(job, instance, url)
-          > 0.9 * max(vmagent_remotewrite_queues) by(job, instance, url)
-        for: 15m
-        labels:
-          severity: warning
-        annotations:
-          dashboard: "http://localhost:3000/d/G7Z9GzMGz?viewPanel=84&var-instance={{ $labels.instance }}"
-          summary: "Remote write connection from \"{{ $labels.job }}\" (instance {{ $labels.instance }}) to {{ $labels.url }} is saturated"
-          description: "The remote write connection between vmagent \"{{ $labels.job }}\" (instance {{ $labels.instance }}) and destination \"{{ $labels.url }}\"
-            is saturated by more than 90% and vmagent won't be able to keep up.\n
-            This usually means that `-remoteWrite.queues` command-line flag must be increased in order to increase
-            the number of connections per each remote storage."
-
-      - alert: PersistentQueueForWritesIsSaturated
-        expr: rate(vm_persistentqueue_write_duration_seconds_total[5m]) > 0.9
-        for: 15m
-        labels:
-          severity: warning
-        annotations:
-          dashboard: "http://localhost:3000/d/G7Z9GzMGz?viewPanel=98&var-instance={{ $labels.instance }}"
-          summary: "Persistent queue writes for instance {{ $labels.instance }} are saturated"
-          description: "Persistent queue writes for vmagent \"{{ $labels.job }}\" (instance {{ $labels.instance }})
-            are saturated by more than 90% and vmagent won't be able to keep up with flushing data on disk.
-            In this case, consider to decrease load on the vmagent or improve the disk throughput."
-
-      - alert: PersistentQueueForReadsIsSaturated
-        expr: rate(vm_persistentqueue_read_duration_seconds_total[5m]) > 0.9
-        for: 15m
-        labels:
-          severity: warning
-        annotations:
-          dashboard: "http://localhost:3000/d/G7Z9GzMGz?viewPanel=99&var-instance={{ $labels.instance }}"
-          summary: "Persistent queue reads for instance {{ $labels.instance }} are saturated"
-          description: "Persistent queue reads for vmagent \"{{ $labels.job }}\" (instance {{ $labels.instance }})
-            are saturated by more than 90% and vmagent won't be able to keep up with reading data from the disk.
-            In this case, consider to decrease load on the vmagent or improve the disk throughput."
-
-      - alert: SeriesLimitHourReached
-        expr: (vmagent_hourly_series_limit_current_series / vmagent_hourly_series_limit_max_series) > 0.9
-        labels:
-          severity: critical
-        annotations:
-          dashboard: "http://localhost:3000/d/G7Z9GzMGz?viewPanel=88&var-instance={{ $labels.instance }}"
-          summary: "Instance {{ $labels.instance }} reached 90% of the limit"
-          description: "Max series limit set via -remoteWrite.maxHourlySeries flag is close to reaching the max value.
-            Then samples for new time series will be dropped instead of sending them to remote storage systems."
-
-      - alert: SeriesLimitDayReached
-        expr: (vmagent_daily_series_limit_current_series / vmagent_daily_series_limit_max_series) > 0.9
-        labels:
-          severity: critical
-        annotations:
-          dashboard: "http://localhost:3000/d/G7Z9GzMGz?viewPanel=90&var-instance={{ $labels.instance }}"
-          summary: "Instance {{ $labels.instance }} reached 90% of the limit"
-          description: "Max series limit set via -remoteWrite.maxDailySeries flag is close to reaching the max value.
-            Then samples for new time series will be dropped instead of sending them to remote storage systems."
deployment/docker/docker-compose-cluster.yml (new file, 121 lines)
@@ -0,0 +1,121 @@
version: '3.5'
services:
  vmagent:
    container_name: vmagent
    image: victoriametrics/vmagent:latest
    depends_on:
      - "vminsert"
    ports:
      - 8429:8429
    volumes:
      - vmagentdata:/vmagentdata
      - ./prometheus-cluster.yml:/etc/prometheus/prometheus.yml
    command:
      - '--promscrape.config=/etc/prometheus/prometheus.yml'
      - '--remoteWrite.url=http://vminsert:8480/insert/0/prometheus/'
    restart: always

  grafana:
    container_name: grafana
    image: grafana/grafana:9.1.0
    depends_on:
      - "vmselect"
    ports:
      - 3000:3000
    restart: always
    volumes:
      - grafanadata:/var/lib/grafana
      - ./provisioning/:/etc/grafana/provisioning/
      - ./../../dashboards/victoriametrics-cluster.json:/var/lib/grafana/dashboards/vm.json
      - ./../../dashboards/vmagent.json:/var/lib/grafana/dashboards/vmagent.json
      - ./../../dashboards/vmalert.json:/var/lib/grafana/dashboards/vmalert.json

  vmstorage-1:
    container_name: vmstorage-1
    image: victoriametrics/vmstorage:latest
    ports:
      - 8482
      - 8400
      - 8401
    volumes:
      - strgdata-1:/storage
    command:
      - '--storageDataPath=/storage'
    restart: always

  vmstorage-2:
    container_name: vmstorage-2
    image: victoriametrics/vmstorage:latest
    ports:
      - 8482
      - 8400
      - 8401
    volumes:
      - strgdata-2:/storage
    command:
      - '--storageDataPath=/storage'
    restart: always

  vminsert:
    container_name: vminsert
    image: victoriametrics/vminsert:latest
    depends_on:
      - "vmstorage-1"
      - "vmstorage-2"
    command:
      - '--storageNode=vmstorage-1:8400'
      - '--storageNode=vmstorage-2:8400'
    ports:
      - 8480:8480
    restart: always

  vmselect:
    container_name: vmselect
    image: victoriametrics/vmselect:latest
    depends_on:
      - "vmstorage-1"
      - "vmstorage-2"
    command:
      - '--storageNode=vmstorage-1:8401'
      - '--storageNode=vmstorage-2:8401'
      - '--vmalert.proxyURL=http://vmalert:8880'
    ports:
      - 8481:8481
    restart: always

  vmalert:
    container_name: vmalert
    image: victoriametrics/vmalert:latest
    depends_on:
      - "vmselect"
    ports:
      - 8880:8880
    volumes:
      - ./alerts-cluster.yml:/etc/alerts/alerts.yml
      - ./alerts-health.yml:/etc/alerts/alerts-health.yml
      - ./alerts-vmagent.yml:/etc/alerts/alerts-vmagent.yml
    command:
      - '--datasource.url=http://vmselect:8481/select/0/prometheus'
      - '--remoteRead.url=http://vmselect:8481/select/0/prometheus'
      - '--remoteWrite.url=http://vminsert:8480/insert/0/prometheus'
      - '--notifier.url=http://alertmanager:9093/'
      - '--rule=/etc/alerts/*.yml'
      # display source of alerts in grafana
      - '-external.url=http://127.0.0.1:3000' # grafana outside container
      # when copy-pasting the line below, be aware of '$$' for escaping in '$expr'
      - '--external.alert.source=explore?orgId=1&left=["now-1h","now","VictoriaMetrics",{"expr":"{{$$expr|quotesEscape|crlfEscape|queryEscape}}"},{"mode":"Metrics"},{"ui":[true,true,true,"none"]}]'
    restart: always

  alertmanager:
    container_name: alertmanager
    image: prom/alertmanager:v0.24.0
    volumes:
      - ./alertmanager.yml:/config/alertmanager.yml
    command:
      - '--config.file=/config/alertmanager.yml'
    ports:
      - 9093:9093
    restart: always

volumes:
  vmagentdata: {}
  strgdata-1: {}
  strgdata-2: {}
  grafanadata: {}
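After `make docker-cluster-up` brings this file up, the externally mapped ports can be used for a quick health check. A minimal sketch, assuming the port mappings above and the standard `/metrics` endpoint exposed by every VictoriaMetrics component:

```console
docker-compose -f deployment/docker/docker-compose-cluster.yml ps  # all services should be "Up"
curl -s http://localhost:8480/metrics | head  # vminsert answers on its mapped port
curl -s http://localhost:8481/metrics | head  # vmselect answers on its mapped port
```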
deployment/docker/docker-compose.yml
@@ -2,7 +2,7 @@ version: "3.5"
 services:
   vmagent:
     container_name: vmagent
-    image: victoriametrics/vmagent:v1.80.0
+    image: victoriametrics/vmagent:latest
     depends_on:
       - "victoriametrics"
     ports:
@@ -18,7 +18,7 @@ services:
     restart: always
   victoriametrics:
     container_name: victoriametrics
-    image: victoriametrics/victoria-metrics:v1.80.0
+    image: victoriametrics/victoria-metrics:latest
     ports:
       - 8428:8428
       - 8089:8089
@@ -56,7 +56,7 @@ services:
     restart: always
   vmalert:
     container_name: vmalert
-    image: victoriametrics/vmalert:v1.80.0
+    image: victoriametrics/vmalert:latest
    depends_on:
       - "victoriametrics"
       - "alertmanager"
@@ -64,6 +64,8 @@ services:
       - 8880:8880
     volumes:
       - ./alerts.yml:/etc/alerts/alerts.yml
+      - ./alerts-health.yml:/etc/alerts/alerts-health.yml
+      - ./alerts-vmagent.yml:/etc/alerts/alerts-vmagent.yml
     command:
       - "--datasource.url=http://victoriametrics:8428/"
       - "--remoteRead.url=http://victoriametrics:8428/"
@@ -72,7 +74,8 @@ services:
       - "--rule=/etc/alerts/*.yml"
       # display source of alerts in grafana
       - "--external.url=http://127.0.0.1:3000" # grafana outside container
-      - '--external.alert.source=explore?orgId=1&left=["now-1h","now","VictoriaMetrics",{"expr":"{{$$expr|quotesEscape|crlfEscape|queryEscape}}"},{"mode":"Metrics"},{"ui":[true,true,true,"none"]}]' ## when copypaste the line be aware of '$$' for escaping in '$expr'
+      # when copy-pasting the line below, be aware of '$$' for escaping in '$expr'
+      - '--external.alert.source=explore?orgId=1&left=["now-1h","now","VictoriaMetrics",{"expr":"{{$$expr|quotesEscape|crlfEscape|queryEscape}}"},{"mode":"Metrics"},{"ui":[true,true,true,"none"]}]'
     networks:
       - vm_net
     restart: always
deployment/docker/prometheus-cluster.yml (new file, 19 lines)
@@ -0,0 +1,19 @@
global:
  scrape_interval: 10s

scrape_configs:
  - job_name: 'vmagent'
    static_configs:
      - targets: ['vmagent:8429']
  - job_name: 'vmalert'
    static_configs:
      - targets: ['vmalert:8880']
  - job_name: 'vminsert'
    static_configs:
      - targets: ['vminsert:8480']
  - job_name: 'vmselect'
    static_configs:
      - targets: ['vmselect:8481']
  - job_name: 'vmstorage'
    static_configs:
      - targets: ['vmstorage-1:8482', 'vmstorage-2:8482']
@@ -6,3 +6,9 @@ datasources:
     access: proxy
     url: http://victoriametrics:8428
     isDefault: true
+
+  - name: VictoriaMetrics - cluster
+    type: prometheus
+    access: proxy
+    url: http://vmselect:8481/select/0/prometheus
+    isDefault: false
docs/Cluster-VictoriaMetrics.md
@@ -144,7 +144,7 @@ Ports may be altered by setting `-httpListenAddr` on the corresponding nodes.
 It is recommended setting up [monitoring](#monitoring) for the cluster.

 The following tools can simplify cluster setup:
-- [An example docker-compose config for VictoriaMetrics cluster](https://github.com/VictoriaMetrics/VictoriaMetrics/blob/cluster/deployment/docker/docker-compose.yml)
+- [An example docker-compose config for VictoriaMetrics cluster](https://github.com/VictoriaMetrics/VictoriaMetrics/blob/master/deployment/docker/docker-compose-cluster.yml)
 - [Helm charts for VictoriaMetrics](https://github.com/VictoriaMetrics/helm-charts)
 - [Kubernetes operator for VictoriaMetrics](https://github.com/VictoriaMetrics/operator)
docs/Quick-Start.md
@@ -58,21 +58,22 @@ There is also [VictoriaMetrics cluster](https://docs.victoriametrics.com/Cluster
 ### Starting VM-Cluster via Docker

 The following commands clone the latest available
-[VictoriaMetrics cluster repository](https://github.com/VictoriaMetrics/VictoriaMetrics/tree/cluster)
-and start the docker container via 'docker-compose'. Further customization is possible by editing
-the [docker-compose.yaml](https://github.com/VictoriaMetrics/VictoriaMetrics/blob/cluster/deployment/docker/docker-compose.yml)
+[VictoriaMetrics repository](https://github.com/VictoriaMetrics/VictoriaMetrics)
+and start the docker containers via 'make docker-cluster-up'. Further customization is possible by editing
+the [docker-compose-cluster.yml](https://github.com/VictoriaMetrics/VictoriaMetrics/blob/master/deployment/docker/docker-compose-cluster.yml)
 file.

 <div class="with-copy" markdown="1">

 ```console
-git clone https://github.com/VictoriaMetrics/VictoriaMetrics --branch cluster &&
-cd VictoriaMetrics/deployment/docker &&
-docker-compose up
+git clone https://github.com/VictoriaMetrics/VictoriaMetrics &&
+cd VictoriaMetrics &&
+make docker-cluster-up
 ```

 </div>

+See more details [here](https://github.com/VictoriaMetrics/VictoriaMetrics/tree/master/deployment/docker#readme).
+
 * [Cluster setup](https://docs.victoriametrics.com/Cluster-VictoriaMetrics.html#cluster-setup)

 ## Write data
Reference in a new issue