From 34d764874ff7bed298a189aa666421cf20ef9818 Mon Sep 17 00:00:00 2001
From: Aliaksandr Valialkin <valyala@victoriametrics.com>
Date: Sat, 20 Aug 2022 09:39:07 +0300
Subject: [PATCH] docs/Cluster-VictoriaMetrics.md: clarify required conditions
 for cluster availability

---
 README.md                       | 47 ++++++++++++++++++++++++++++++---
 docs/Cluster-VictoriaMetrics.md | 47 ++++++++++++++++++++++++++++++---
 2 files changed, 86 insertions(+), 8 deletions(-)

diff --git a/README.md b/README.md
index 6ffe1b9a7..74f448b9f 100644
--- a/README.md
+++ b/README.md
@@ -301,11 +301,50 @@ the update process. See [cluster availability](#cluster-availability) section fo
 
 ## Cluster availability
 
-- HTTP load balancer must stop routing requests to unavailable `vminsert` and `vmselect` nodes.
-- The cluster remains available if at least a single `vmstorage` node exists:
+VictoriaMetrics cluster architecture prioritizes availability over data consistency.
+This means that the cluster remains available for data ingestion and data querying
+if some of its components are temporarily unavailable.
 
-  - `vminsert` re-routes incoming data from unavailable `vmstorage` nodes to healthy `vmstorage` nodes
-  - `vmselect` continues serving partial responses if at least a single `vmstorage` node is available. If consistency over availability is preferred, then either pass `-search.denyPartialResponse` command-line flag to `vmselect` or pass `deny_partial_response=1` query arg in requests to `vmselect`.
+VictoriaMetrics cluster remains available if the following conditions are met:
+
+- HTTP load balancer must stop routing requests to unavailable `vminsert` and `vmselect` nodes.
+
+- At least a single `vminsert` node must remain available in the cluster for processing data ingestion workload.
+  The remaining active `vminsert` nodes must have enough compute capacity (CPU, RAM, network bandwidth)
+  for handling the current data ingestion workload.
+  If the remaining active `vminsert` nodes don't have enough resources for processing the data ingestion workload,
+  then arbitrary delays may occur during data ingestion.
+  See [capacity planning](#capacity-planning) and [cluster resizing](#cluster-resizing-and-scalability) docs for more details.
+
+- At least a single `vmselect` node must remain available in the cluster for processing query workload.
+  The remaining active `vmselect` nodes must have enough compute capacity (CPU, RAM, network bandwidth, disk IO)
+  for handling the current query workload.
+  If the remaining active `vmselect` nodes don't have enough resources for processing the query workload,
+  then arbitrary failures and delays may occur during query processing.
+  See [capacity planning](#capacity-planning) and [cluster resizing](#cluster-resizing-and-scalability) docs for more details.
+
+- At least a single `vmstorage` node must remain available in the cluster for accepting newly ingested data
+  and for processing incoming queries. The remaining active `vmstorage` nodes must have enough compute capacity
+  (CPU, RAM, network bandwidth, disk IO, free disk space) for handling the current workload.
+  If the remaining active `vmstorage` nodes don't have enough resources for processing the current workload,
+  then arbitrary failures and delays may occur during data ingestion and query processing.
+  See [capacity planning](#capacity-planning) and [cluster resizing](#cluster-resizing-and-scalability) docs for more details.
+
+The cluster works in the following way when some `vmstorage` nodes are unavailable:
+
+- `vminsert` re-routes newly ingested data from unavailable `vmstorage` nodes to remaining healthy `vmstorage` nodes.
+  This guarantees that the newly ingested data is properly saved if the healthy `vmstorage` nodes have enough CPU, RAM, disk IO and network bandwidth
+  for processing the increased data ingestion workload.
+  `vminsert` spreads the additional data evenly among the healthy `vmstorage` nodes in order to distribute
+  the increased load evenly across these nodes.
+
+- `vmselect` continues serving queries if at least a single `vmstorage` node is available.
+  It marks responses as partial for queries served from the remaining healthy `vmstorage` nodes,
+  since such responses may miss historical data stored on the temporarily unavailable `vmstorage` nodes.
+  Every partial JSON response contains the `"isPartial": true` field.
+  If you prefer consistency over availability, then run `vmselect` nodes with `-search.denyPartialResponse` command-line flag.
+  In this case `vmselect` returns an error if at least a single `vmstorage` node is unavailable.
+  Another option is to pass the `deny_partial_response=1` query arg in requests to `vmselect` nodes.
 
 `vmselect` doesn't serve partial responses for API handlers returning raw datapoints - [`/api/v1/export*` endpoints](https://docs.victoriametrics.com/#how-to-export-time-series), since users usually expect this data is always complete.
 
diff --git a/docs/Cluster-VictoriaMetrics.md b/docs/Cluster-VictoriaMetrics.md
index f7286f355..08bc75688 100644
--- a/docs/Cluster-VictoriaMetrics.md
+++ b/docs/Cluster-VictoriaMetrics.md
@@ -305,11 +305,50 @@ the update process. See [cluster availability](#cluster-availability) section fo
 
 ## Cluster availability
 
-- HTTP load balancer must stop routing requests to unavailable `vminsert` and `vmselect` nodes.
-- The cluster remains available if at least a single `vmstorage` node exists:
+VictoriaMetrics cluster architecture prioritizes availability over data consistency.
+This means that the cluster remains available for data ingestion and data querying
+if some of its components are temporarily unavailable.
 
-  - `vminsert` re-routes incoming data from unavailable `vmstorage` nodes to healthy `vmstorage` nodes
-  - `vmselect` continues serving partial responses if at least a single `vmstorage` node is available. If consistency over availability is preferred, then either pass `-search.denyPartialResponse` command-line flag to `vmselect` or pass `deny_partial_response=1` query arg in requests to `vmselect`.
+VictoriaMetrics cluster remains available if the following conditions are met:
+
+- HTTP load balancer must stop routing requests to unavailable `vminsert` and `vmselect` nodes.
+
+- At least a single `vminsert` node must remain available in the cluster for processing data ingestion workload.
+  The remaining active `vminsert` nodes must have enough compute capacity (CPU, RAM, network bandwidth)
+  for handling the current data ingestion workload.
+  If the remaining active `vminsert` nodes don't have enough resources for processing the data ingestion workload,
+  then arbitrary delays may occur during data ingestion.
+  See [capacity planning](#capacity-planning) and [cluster resizing](#cluster-resizing-and-scalability) docs for more details.
+
+- At least a single `vmselect` node must remain available in the cluster for processing query workload.
+  The remaining active `vmselect` nodes must have enough compute capacity (CPU, RAM, network bandwidth, disk IO)
+  for handling the current query workload.
+  If the remaining active `vmselect` nodes don't have enough resources for processing the query workload,
+  then arbitrary failures and delays may occur during query processing.
+  See [capacity planning](#capacity-planning) and [cluster resizing](#cluster-resizing-and-scalability) docs for more details.
+
+- At least a single `vmstorage` node must remain available in the cluster for accepting newly ingested data
+  and for processing incoming queries. The remaining active `vmstorage` nodes must have enough compute capacity
+  (CPU, RAM, network bandwidth, disk IO, free disk space) for handling the current workload.
+  If the remaining active `vmstorage` nodes don't have enough resources for processing the current workload,
+  then arbitrary failures and delays may occur during data ingestion and query processing.
+  See [capacity planning](#capacity-planning) and [cluster resizing](#cluster-resizing-and-scalability) docs for more details.
+
+The cluster works in the following way when some `vmstorage` nodes are unavailable:
+
+- `vminsert` re-routes newly ingested data from unavailable `vmstorage` nodes to remaining healthy `vmstorage` nodes.
+  This guarantees that the newly ingested data is properly saved if the healthy `vmstorage` nodes have enough CPU, RAM, disk IO and network bandwidth
+  for processing the increased data ingestion workload.
+  `vminsert` spreads the additional data evenly among the healthy `vmstorage` nodes in order to distribute
+  the increased load evenly across these nodes.
+
+- `vmselect` continues serving queries if at least a single `vmstorage` node is available.
+  It marks responses as partial for queries served from the remaining healthy `vmstorage` nodes,
+  since such responses may miss historical data stored on the temporarily unavailable `vmstorage` nodes.
+  Every partial JSON response contains the `"isPartial": true` field.
+  If you prefer consistency over availability, then run `vmselect` nodes with `-search.denyPartialResponse` command-line flag.
+  In this case `vmselect` returns an error if at least a single `vmstorage` node is unavailable.
+  Another option is to pass the `deny_partial_response=1` query arg in requests to `vmselect` nodes.
 
 `vmselect` doesn't serve partial responses for API handlers returning raw datapoints - [`/api/v1/export*` endpoints](https://docs.victoriametrics.com/#how-to-export-time-series), since users usually expect this data is always complete.