mirror of
https://github.com/VictoriaMetrics/VictoriaMetrics.git
synced 2025-03-11 15:34:56 +00:00
docs/Cluster-VictoriaMetrics.md: document -disableReroutingOnUnavailable command-line flag
This is a follow-up for 88f0d1572e
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/5713
parent
88f0d1572e
commit
63a43331a3
4 changed files with 42 additions and 25 deletions
39
README.md
|
@ -475,20 +475,6 @@ This strategy allows upgrading the cluster without downtime if the following con
|
|||
If at least a single condition isn't met, then the rolling restart may result in cluster unavailability
|
||||
during the config update / version upgrade. In this case the following strategy is recommended.
|
||||
|
||||
#### Improving re-routing performance during restart
|
||||
|
||||
`vmstorage` nodes may experience increased usage for CPU, RAM and disk IO during
|
||||
[rolling restarts](https://docs.victoriametrics.com/Cluster-VictoriaMetrics.html#no-downtime-strategy),
|
||||
since they need to process higher load when some of `vmstorage` nodes are temporarily unavailable in the cluster.
|
||||
It is possible to reduce resource usage spikes by running more `vminsert` nodes and by passing bigger values
|
||||
to `-storage.vminsertConnsShutdownDuration` (available from [v1.95.0](https://docs.victoriametrics.com/CHANGELOG.html#v1950))
|
||||
command-line flag at `vmstorage` nodes.
|
||||
In this case `vmstorage` increases the interval between gradual closing of `vminsert` connections during graceful shutdown.
|
||||
This reduces data ingestion slowdown during rollout restarts.
|
||||
|
||||
Make sure that the `-storage.vminsertConnsShutdownDuration` is smaller than the graceful shutdown timeout configured at the system which manages `vmstorage`
|
||||
(e.g. Docker, Kubernetes, systemd, etc.). Otherwise the system may kill `vmstorage` node before it finishes gradual closing of `vminsert` connections.
|
||||
|
||||
### Minimum downtime strategy
|
||||
|
||||
1. Gracefully stop all the `vminsert` and `vmselect` nodes in parallel.
|
||||
|
@ -507,6 +493,27 @@ The `minimum downtime` strategy has the following benefits comparing to `no down
|
|||
- It allows minimizing the duration of config update / version upgrade for clusters with big number of nodes
|
||||
or for clusters with big `vmstorage` nodes, which may take long time for graceful restart.
|
||||
|
||||
## Improving re-routing performance during restart
|
||||
|
||||
`vmstorage` nodes may experience increased usage for CPU, RAM and disk IO during
|
||||
[rolling restarts](https://docs.victoriametrics.com/Cluster-VictoriaMetrics.html#no-downtime-strategy),
|
||||
since they need to process higher load when some of `vmstorage` nodes are temporarily unavailable in the cluster.
|
||||
|
||||
The following approaches can be used for reducing resource usage at `vmstorage` nodes during rolling restart:
|
||||
|
||||
- To pass `-disableReroutingOnUnavailable` command-line flag to `vminsert` nodes, so they pause data ingestion when `vmstorage` nodes are restarted
|
||||
instead of re-routing the ingested data to other available `vmstorage` nodes.
|
||||
|
||||
- To pass bigger values to `-storage.vminsertConnsShutdownDuration` (available from [v1.95.0](https://docs.victoriametrics.com/CHANGELOG.html#v1950))
|
||||
command-line flag at `vmstorage` nodes. In this case `vmstorage` increases the interval between gradual closing of `vminsert` connections during graceful shutdown.
|
||||
This reduces data ingestion slowdown during rollout restarts.
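
The difference between re-routing and pausing can be illustrated with a toy model. This is a minimal sketch, not the actual `vminsert` code: the `node` type and the `pickNode` function are hypothetical names introduced only for illustration, and the ring-order scan is an assumption about how a fallback node might be chosen.

```go
package main

import "fmt"

// node is a hypothetical stand-in for a vmstorage connection.
type node struct {
	name      string
	available bool
}

// pickNode returns the index of the node that should receive a row whose
// preferred destination is nodes[preferred]. The second result is false
// when the caller should pause and retry later instead of sending the row.
func pickNode(nodes []node, preferred int, disableReroutingOnUnavailable bool) (int, bool) {
	if nodes[preferred].available {
		return preferred, true
	}
	if disableReroutingOnUnavailable {
		// Do not shift the load to the remaining nodes; wait until the
		// preferred node is back.
		return -1, false
	}
	// Re-route to the next available node, scanning in ring order.
	for i := 1; i < len(nodes); i++ {
		idx := (preferred + i) % len(nodes)
		if nodes[idx].available {
			return idx, true
		}
	}
	return -1, false
}

func main() {
	nodes := []node{
		{name: "vmstorage-0", available: true},
		{name: "vmstorage-1", available: false}, // being restarted
		{name: "vmstorage-2", available: true},
	}

	idx, ok := pickNode(nodes, 1, false)
	fmt.Println(idx, ok) // re-routing enabled: the row goes to another node

	idx, ok = pickNode(nodes, 1, true)
	fmt.Println(idx, ok) // re-routing disabled: ingestion pauses for this row
}
```

With re-routing disabled, the remaining nodes do not receive the restarted node's share of data, which is why resource usage spikes are smaller; the cost is that ingestion stalls until the node returns.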
|
||||
|
||||
Make sure that the `-storage.vminsertConnsShutdownDuration` is smaller than the graceful shutdown timeout configured at the system which manages `vmstorage`
|
||||
(e.g. Docker, Kubernetes, systemd, etc.). Otherwise the system may kill `vmstorage` node before it finishes gradual closing of `vminsert` connections.
|
||||
|
||||
See also [minimum downtime strategy](#minimum-downtime-strategy).
|
||||
|
||||
|
||||
## Cluster availability
|
||||
|
||||
VictoriaMetrics cluster architecture prioritizes availability over data consistency.
|
||||
|
@ -956,7 +963,9 @@ Below is the output for `/path/to/vminsert -help`:
|
|||
-denyQueryTracing
|
||||
Whether to disable the ability to trace queries. See https://docs.victoriametrics.com/#query-tracing
|
||||
-disableRerouting
|
||||
Whether to disable re-routing when some of vmstorage nodes accept incoming data at slower speed compared to other storage nodes. Disabled re-routing limits the ingestion rate by the slowest vmstorage node. On the other side, disabled re-routing minimizes the number of active time series in the cluster during rolling restarts and during spikes in series churn rate. See also -dropSamplesOnOverload (default true)
|
||||
Whether to disable re-routing when some of vmstorage nodes accept incoming data at slower speed compared to other storage nodes. Disabled re-routing limits the ingestion rate by the slowest vmstorage node. On the other side, disabled re-routing minimizes the number of active time series in the cluster during rolling restarts and during spikes in series churn rate. See also -disableReroutingOnUnavailable and -dropSamplesOnOverload (default true)
|
||||
-disableReroutingOnUnavailable
|
||||
Whether to disable re-routing when some of vmstorage nodes are unavailable. Disabled re-routing stops ingestion when some storage nodes are unavailable. On the other side, disabled re-routing minimizes the number of active time series in the cluster during rolling restarts and during spikes in series churn rate. See also -disableRerouting
|
||||
-dropSamplesOnOverload
|
||||
Whether to drop incoming samples if the destination vmstorage node is overloaded and/or unavailable. This prioritizes cluster availability over consistency, e.g. the cluster continues accepting all the ingested samples, but some of them may be dropped if vmstorage nodes are temporarily unavailable and/or overloaded. The drop of samples happens before the replication, so it's not recommended to use this flag with -replicationFactor enabled.
|
||||
-enableTCP6
|
||||
|
|
|
@ -30,7 +30,7 @@ var (
|
|||
replicationFactor = flag.Int("replicationFactor", 1, "Replication factor for the ingested data, i.e. how many copies to make among distinct -storageNode instances. "+
|
||||
"Note that vmselect must run with -dedup.minScrapeInterval=1ms for data de-duplication when replicationFactor is greater than 1. "+
|
||||
"Higher values for -dedup.minScrapeInterval at vmselect is OK")
|
||||
disableRerouting = flag.Bool("disableRerouting", true, "Whether to disable re-routing when some of vmstorage nodes accept incoming data at slower speed compared to other storage nodes. Disabled re-routing limits the ingestion rate by the slowest vmstorage node. On the other side, disabled re-routing minimizes the number of active time series in the cluster during rolling restarts and during spikes in series churn rate. See also -dropSamplesOnOverload")
|
||||
disableRerouting = flag.Bool("disableRerouting", true, "Whether to disable re-routing when some of vmstorage nodes accept incoming data at slower speed compared to other storage nodes. Disabled re-routing limits the ingestion rate by the slowest vmstorage node. On the other side, disabled re-routing minimizes the number of active time series in the cluster during rolling restarts and during spikes in series churn rate. See also -disableReroutingOnUnavailable and -dropSamplesOnOverload")
|
||||
dropSamplesOnOverload = flag.Bool("dropSamplesOnOverload", false, "Whether to drop incoming samples if the destination vmstorage node is overloaded and/or unavailable. This prioritizes cluster availability over consistency, e.g. the cluster continues accepting all the ingested samples, but some of them may be dropped if vmstorage nodes are temporarily unavailable and/or overloaded. The drop of samples happens before the replication, so it's not recommended to use this flag with -replicationFactor enabled.")
|
||||
vmstorageDialTimeout = flag.Duration("vmstorageDialTimeout", 3*time.Second, "Timeout for establishing RPC connections from vminsert to vmstorage. "+
|
||||
"See also -vmstorageUserTimeout")
|
||||
|
@ -41,7 +41,8 @@ var (
|
|||
disableReroutingOnUnavailable = flag.Bool("disableReroutingOnUnavailable", false, "Whether to disable re-routing when some of vmstorage nodes are unavailable. "+
|
||||
"Disabled re-routing stops ingestion when some storage nodes are unavailable. "+
|
||||
"On the other side, disabled re-routing minimizes the number of active time series in the cluster "+
|
||||
"during rolling restarts and during spikes in series churn rate.")
|
||||
"during rolling restarts and during spikes in series churn rate. "+
|
||||
"See also -disableRerouting")
|
||||
)
|
||||
|
||||
var errStorageReadOnly = errors.New("storage node is read only")
|
||||
|
|
|
@ -36,6 +36,7 @@ Released at 2024-02-01
|
|||
All these fixes are also included in [the latest community release](https://github.com/VictoriaMetrics/VictoriaMetrics/releases/latest).
|
||||
The v1.97.x line will be supported for at least 12 months since [v1.97.0](https://docs.victoriametrics.com/CHANGELOG.html#v1970) release**
|
||||
|
||||
* FEATURE: [VictoriaMetrics cluster](https://docs.victoriametrics.com/Cluster-VictoriaMetrics.html): add `-disableReroutingOnUnavailable` command-line flag, which can be used for reducing resource usage spikes at `vmstorage` nodes during rolling restart. Thanks to @Muxa1L for [the pull request](https://github.com/VictoriaMetrics/VictoriaMetrics/pull/5713).
|
||||
* FEATURE: [vmagent](https://docs.victoriametrics.com/vmagent.html): add support for `username_file` option at `basic_auth` section in [`scrape_configs`](https://docs.victoriametrics.com/sd_configs/#http-api-client-options). See [this feature request](https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5511). Thanks to @wasim-nihal for [the initial implementation](https://github.com/VictoriaMetrics/VictoriaMetrics/pull/5720).
|
||||
* FEATURE: [dashboards/vmagent](https://grafana.com/grafana/dashboards/12683): add `Targets scraped/s` stat panel showing the number of targets scraped by the vmagent per-second.
|
||||
* FEATURE: [dashboards/all](https://grafana.com/orgs/victoriametrics): add new panel `CPU spent on GC`. It should help identifying cases when too much CPU is spent on garbage collection, and advice users on how this can be addressed.
|
||||
|
|
|
@ -509,14 +509,18 @@ The `minimum downtime` strategy has the following benefits comparing to `no down
|
|||
`vmstorage` nodes may experience increased usage for CPU, RAM and disk IO during
|
||||
[rolling restarts](https://docs.victoriametrics.com/Cluster-VictoriaMetrics.html#no-downtime-strategy),
|
||||
since they need to process higher load when some of `vmstorage` nodes are temporarily unavailable in the cluster.
|
||||
It is possible to reduce resource usage spikes by running more `vminsert` nodes and by passing bigger values
|
||||
to `-storage.vminsertConnsShutdownDuration` (available from [v1.95.0](https://docs.victoriametrics.com/CHANGELOG.html#v1950))
|
||||
command-line flag at `vmstorage` nodes.
|
||||
In this case `vmstorage` increases the interval between gradual closing of `vminsert` connections during graceful shutdown.
|
||||
This reduces data ingestion slowdown during rollout restarts.
|
||||
|
||||
Make sure that the `-storage.vminsertConnsShutdownDuration` is smaller than the graceful shutdown timeout configured at the system which manages `vmstorage`
|
||||
(e.g. Docker, Kubernetes, systemd, etc.). Otherwise the system may kill `vmstorage` node before it finishes gradual closing of `vminsert` connections.
|
||||
The following approaches can be used for reducing resource usage at `vmstorage` nodes during rolling restart:
|
||||
|
||||
- To pass `-disableReroutingOnUnavailable` command-line flag to `vminsert` nodes, so they pause data ingestion when `vmstorage` nodes are restarted
|
||||
instead of re-routing the ingested data to other available `vmstorage` nodes.
|
||||
|
||||
- To pass bigger values to `-storage.vminsertConnsShutdownDuration` (available from [v1.95.0](https://docs.victoriametrics.com/CHANGELOG.html#v1950))
|
||||
command-line flag at `vmstorage` nodes. In this case `vmstorage` increases the interval between gradual closing of `vminsert` connections during graceful shutdown.
|
||||
This reduces data ingestion slowdown during rollout restarts.
|
||||
|
||||
Make sure that the `-storage.vminsertConnsShutdownDuration` is smaller than the graceful shutdown timeout configured at the system which manages `vmstorage`
|
||||
(e.g. Docker, Kubernetes, systemd, etc.). Otherwise the system may kill `vmstorage` node before it finishes gradual closing of `vminsert` connections.
|
||||
|
||||
See also [minimum downtime strategy](#minimum-downtime-strategy).
|
||||
|
||||
|
@ -970,7 +974,9 @@ Below is the output for `/path/to/vminsert -help`:
|
|||
-denyQueryTracing
|
||||
Whether to disable the ability to trace queries. See https://docs.victoriametrics.com/#query-tracing
|
||||
-disableRerouting
|
||||
Whether to disable re-routing when some of vmstorage nodes accept incoming data at slower speed compared to other storage nodes. Disabled re-routing limits the ingestion rate by the slowest vmstorage node. On the other side, disabled re-routing minimizes the number of active time series in the cluster during rolling restarts and during spikes in series churn rate. See also -dropSamplesOnOverload (default true)
|
||||
Whether to disable re-routing when some of vmstorage nodes accept incoming data at slower speed compared to other storage nodes. Disabled re-routing limits the ingestion rate by the slowest vmstorage node. On the other side, disabled re-routing minimizes the number of active time series in the cluster during rolling restarts and during spikes in series churn rate. See also -disableReroutingOnUnavailable and -dropSamplesOnOverload (default true)
|
||||
-disableReroutingOnUnavailable
|
||||
Whether to disable re-routing when some of vmstorage nodes are unavailable. Disabled re-routing stops ingestion when some storage nodes are unavailable. On the other side, disabled re-routing minimizes the number of active time series in the cluster during rolling restarts and during spikes in series churn rate. See also -disableRerouting
|
||||
-dropSamplesOnOverload
|
||||
Whether to drop incoming samples if the destination vmstorage node is overloaded and/or unavailable. This prioritizes cluster availability over consistency, e.g. the cluster continues accepting all the ingested samples, but some of them may be dropped if vmstorage nodes are temporarily unavailable and/or overloaded. The drop of samples happens before the replication, so it's not recommended to use this flag with -replicationFactor enabled.
|
||||
-enableTCP6
|
||||
|
|