From 74338ef3009b1129ac6117283c10c5730b0f27a2 Mon Sep 17 00:00:00 2001
From: Aliaksandr Valialkin
Date: Thu, 30 Jun 2022 14:35:28 +0300
Subject: [PATCH] docs/Troubleshooting.md: add troubleshooting guide for cluster instability

---
 docs/Troubleshooting.md | 44 +++++++++++++++++++++++++++++++++++++++--
 1 file changed, 42 insertions(+), 2 deletions(-)

diff --git a/docs/Troubleshooting.md b/docs/Troubleshooting.md
index 46e15a0fe8..d021833cf9 100644
--- a/docs/Troubleshooting.md
+++ b/docs/Troubleshooting.md
@@ -10,7 +10,7 @@ This document contains troubleshooting guides for most common issues when workin
 - [Slow data ingestion](#slow-data-ingestion)
 - [Slow queries](#slow-queries)
 - [Out of memory errors](#out-of-memory-errors)
-
+- [Cluster instability](#cluster-instability)
 
 ## Unexpected query results
 
@@ -31,7 +31,7 @@ If you see unexpected or unreliable query results from VictoriaMetrics, then try
 
 2. If the simplest query continues returning unexpected / unreliable results, then export raw samples for this query
    via [/api/v1/export](https://docs.victoriametrics.com/#how-to-export-data-in-json-line-format)
-   on the given '[start..end]' time range and check whether they are expected:
+   on the given `[start..end]` time range and check whether they are expected:
 
    ```console
    curl http://victoriametrics:8428/api/v1/export -d 'match[]=http_requests_total' -d 'start=...' -d 'end=...'
@@ -224,3 +224,43 @@ There are the following most common sources of out of memory (aka OOM) crashes i
    under the current workload, then it is recommended migrating to a host with bigger amounts of memory
    in order to protect from possible OOM crashes on workload spikes. It is recommended to have at least 30%
    of free memory for graceful handling of possible workload spikes.
+
+
+## Cluster instability
+
+A VictoriaMetrics cluster may become unstable if there are not enough free resources (CPU, RAM, disk IO, network bandwidth)
+for processing the current workload.
+
+The most common sources of cluster instability are:
+
+- Workload spike. For example, if the number of active time series increases by 2x while
+  the cluster does not have enough free resources for processing the increased workload,
+  then it may become unstable.
+
+- Various maintenance tasks such as rolling upgrades or rolling restarts during configuration changes.
+  For example, if a cluster contains `N=3` `vmstorage` nodes and they are restarted one-by-one (aka rolling restart),
+  then the cluster will have only `N-1=2` healthy `vmstorage` nodes during the rolling restart.
+  This means that the load on healthy `vmstorage` nodes increases by at least `100%/(N-1)=50%`
+  compared to the load before the rolling restart. E.g. they need to process 50% more incoming
+  data and return 50% more data during queries. In reality the load on the remaining `vmstorage`
+  nodes increases even more because they need to register new time series, which were re-routed
+  from the temporarily unavailable `vmstorage` node. If `vmstorage` nodes had less than 50%
+  of free resources (CPU, RAM, disk IO) before the rolling restart, then the rolling restart
+  can lead to cluster overload and instability for both data ingestion and querying.
+
+  As you can see, the workload increase during a rolling restart can be reduced by increasing
+  the number of `vmstorage` nodes in the cluster. For example, if a VictoriaMetrics cluster contains
+  `N=11` `vmstorage` nodes, then the workload increase during a rolling restart of `vmstorage` nodes
+  would be `100%/(N-1)=10%`. So it is recommended to have at least 8 `vmstorage` nodes in the cluster.
+  The recommended number of `vmstorage` nodes should be multiplied by `-replicationFactor` if replication is enabled -
+  see [replication and data safety docs](https://docs.victoriametrics.com/Cluster-VictoriaMetrics.html#replication-and-data-safety)
+  for details.
+
+The obvious solution against VictoriaMetrics cluster instability is to make sure cluster components
+have enough free resources for gracefully processing the increased workload.
+See [capacity planning docs](https://docs.victoriametrics.com/Cluster-VictoriaMetrics.html#capacity-planning)
+and [cluster resizing and scalability docs](https://docs.victoriametrics.com/Cluster-VictoriaMetrics.html#cluster-resizing-and-scalability)
+for details.
+
+VictoriaMetrics provides various configuration settings, which can be used for limiting unexpected workload spikes.
+See [these docs](https://docs.victoriametrics.com/Cluster-VictoriaMetrics.html#resource-usage-limits) for details.
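
The rolling-restart arithmetic in the patch above (`100%/(N-1)` extra load on each healthy `vmstorage` node while one node is down) can be checked with a small sketch. This is only an illustration under stated assumptions: the function names `load_increase_percent` and `min_nodes_for_increase` are hypothetical and are not part of VictoriaMetrics, and the result is a lower bound, since the patch notes that real load grows further when re-routed time series have to be registered.

```python
import math


def load_increase_percent(total_nodes: int) -> float:
    """Lower bound on the extra load per healthy node, in percent,
    while one of `total_nodes` vmstorage nodes is unavailable.

    Implements the 100%/(N-1) formula from the patch text.
    """
    if total_nodes < 2:
        raise ValueError("at least 2 nodes are needed to survive a restart of one")
    return 100.0 / (total_nodes - 1)


def min_nodes_for_increase(max_percent: float) -> int:
    """Smallest N for which 100%/(N-1) does not exceed `max_percent`.

    Hypothetical helper: it only inverts the formula above and ignores
    replication and the cost of registering re-routed time series.
    """
    if max_percent <= 0:
        raise ValueError("max_percent must be positive")
    return math.ceil(100.0 / max_percent) + 1


if __name__ == "__main__":
    # Print the lower-bound load increase for a few cluster sizes.
    for n in (3, 8, 11):
        print(f"N={n}: at least {load_increase_percent(n):.1f}% more load per healthy node")
```

For `N=3` and `N=11` this reproduces the 50% and 10% figures quoted in the patch. If replication is enabled, the patch says the recommended node count should additionally be multiplied by `-replicationFactor`, which this sketch does not model.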