docs: update Troubleshooting guide (#2809)
Follow-up after a4d9388ecb
Signed-off-by: hagen1778 <roman@victoriametrics.com>
parent 78da4b5f5e
commit 51209c114a
1 changed file with 54 additions and 31 deletions
@@ -20,13 +20,15 @@ If you see unexpected or unreliable query results from VictoriaMetrics, then try
`sum(rate(http_requests_total[5m])) by (job)`, then check whether the following queries return
expected results:

- Remove the outer `sum` and execute `rate(http_requests_total[5m])`,
  since aggregations could hide missing series, gaps in data or anomalies in existing series.
  If this query returns too many time series, then try adding more specific label filters to it.
  For example, if you see that the original query returns unexpected results for `job="foo"`,
  then use the `rate(http_requests_total{job="foo"}[5m])` query.
  If this isn't enough, then continue adding more specific label filters, so the resulting query returns
  a manageable number of time series.

- Remove the outer `rate` and execute `http_requests_total`. Additional label filters may be added here
  in order to reduce the number of returned series.

Sometimes the query may be improperly constructed, so it returns unexpected results.
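A simplified query can also be checked directly via the Prometheus-compatible `/api/v1/query` endpoint. The following is a minimal sketch assuming a single-node VictoriaMetrics instance listening on localhost:8428; adjust the address, the metric name and the label filters to your setup:

```console
curl http://localhost:8428/api/v1/query -d 'query=rate(http_requests_total{job="foo"}[5m])'
```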
@@ -34,8 +36,8 @@ If you see unexpected or unreliable query results from VictoriaMetrics, then try
   especially [subqueries](https://docs.victoriametrics.com/MetricsQL.html#subqueries)
   and [rollup functions](https://docs.victoriametrics.com/MetricsQL.html#rollup-functions) sections.

2. If the simplest query continues returning unexpected / unreliable results, then try verifying correctness
   of raw unprocessed samples for this query via [/api/v1/export](https://docs.victoriametrics.com/#how-to-export-data-in-json-line-format)
   on the given `[start..end]` time range and check whether they are expected:

   ```console
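   # A minimal illustration - assumes single-node VictoriaMetrics on localhost:8428;
   # adjust the host, the series selector and the start/end timestamps to your case.
   curl http://localhost:8428/api/v1/export \
       -d 'match[]=http_requests_total{job="foo"}' \
       -d 'start=2022-06-20T00:00:00Z' \
       -d 'end=2022-06-21T00:00:00Z'
   ```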
@@ -165,24 +167,26 @@ There are the following most common reasons for slow data ingestion in Victoria
   are located in the same network with short network latency between them.
   `vminsert` packs incoming data into in-memory packets and sends them to `vmstorage` one-by-one.
   It waits until `vmstorage` returns an `ack` response before sending the next packet.
   If the network latency between `vminsert` and `vmstorage` is high (for example, if they run in different datacenters),
   then this may become a limiting factor for data ingestion speed (see the latency check after this list).

   The [official Grafana dashboard for cluster version of VictoriaMetrics](https://docs.victoriametrics.com/Cluster-VictoriaMetrics.html#monitoring)
   contains the `connection saturation` graph for `vminsert` components. If this graph reaches 100% (1s),
   then it is likely you have issues with network latency between `vminsert` and `vmstorage`.
   Another possible reason for 100% connection saturation between `vminsert` and `vmstorage`
   is resource shortage at `vmstorage` nodes. In this case you need to increase the amount
   of available resources (CPU, RAM, disk IO) at `vmstorage` nodes or to add more `vmstorage` nodes to the cluster.

5. Noisy neighbor. Make sure VictoriaMetrics components run in environments without other resource-hungry apps.
   Such apps may steal RAM, CPU, disk IO and network bandwidth needed by VictoriaMetrics components.
   Issues like this are very hard to catch via the [official Grafana dashboard for cluster version of VictoriaMetrics](https://docs.victoriametrics.com/Cluster-VictoriaMetrics.html#monitoring),
   and proper diagnosis would require checking resource usage on the instances where VictoriaMetrics runs.
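To verify the network latency assumption above, the round-trip time can be measured from the host where `vminsert` runs (a rough sketch; `vmstorage-1` is a placeholder for a real `vmstorage` address):

```console
# run from the vminsert host; replace vmstorage-1 with the address of a vmstorage node
ping -c 10 vmstorage-1
```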
## Slow queries

Some queries may take more time and resources (CPU, RAM, network bandwidth) than others.
VictoriaMetrics logs slow queries if their execution time exceeds the duration passed
to the `-search.logSlowQueryDuration` command-line flag (5s by default).
VictoriaMetrics also provides the `/api/v1/status/top_queries` endpoint, which returns
queries that took the most time to execute.
See [these docs](https://docs.victoriametrics.com/#prometheus-querying-api-enhancements) for details.
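For example, assuming a single-node VictoriaMetrics instance on localhost:8428, the heaviest and the most frequently executed queries can be inspected as follows (the `topN` and `maxLifetime` args shown here are optional and illustrative):

```console
curl 'http://localhost:8428/api/v1/status/top_queries?topN=10&maxLifetime=30m'
```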
@@ -191,7 +195,11 @@ The following solutions exist for slow queries:

- Adding more CPU and memory to VictoriaMetrics, so it may perform the slow query faster.
  If you use cluster version of VictoriaMetrics, then migrating `vmselect` nodes to machines
  with more CPU and RAM should help improving speed for slow queries. Remember that query performance
  is always limited by the resources of the single `vmselect` node that processes the query. For example,
  if 2 vCPUs on `vmselect` aren't enough to process the query fast enough, then prefer vertical scaling
  of `vmselect` nodes over horizontal scaling.
  If the `Concurrent selects` panel on the [official Grafana dashboard for cluster version of VictoriaMetrics](https://docs.victoriametrics.com/Cluster-VictoriaMetrics.html#monitoring)
  is close to the limit, then prefer horizontal scaling of `vmselect` nodes.
  Sometimes adding more `vmstorage` nodes also can help improving the speed for slow queries.

- Rewriting slow queries, so they become faster. Unfortunately it is hard determining
@@ -209,28 +217,27 @@ The following solutions exist for slow queries:
There are the following most common sources of out of memory (aka OOM) crashes in VictoriaMetrics:

1. Improper command-line flag values. Inspect command-line flags passed to VictoriaMetrics components.
   If you don't clearly understand the purpose or the effect of some flags, then remove them
   from the list of flags passed to VictoriaMetrics components. Improper command-line flag values
   may lead to increased memory and CPU usage. The increased memory usage increases the chances of OOM crashes.
   VictoriaMetrics is optimized for running with default flag values (i.e. when they aren't set explicitly).

   For example, it isn't recommended to tune cache sizes in VictoriaMetrics, since this frequently leads to OOM crashes.
   [These docs](https://docs.victoriametrics.com/#cache-tuning) refer to command-line flags, which aren't
   recommended to tune. If you see that VictoriaMetrics needs bigger cache sizes for the current workload,
   then it is better to migrate to a host with more memory instead of trying to tune cache sizes manually.

2. Unexpected heavy queries. A query is considered heavy if it needs to select and process millions of unique time series.
   Such a query may lead to OOM, since VictoriaMetrics needs to keep some per-series data in memory.
   VictoriaMetrics provides [various settings](https://docs.victoriametrics.com/#resource-usage-limits),
   which can help limit resource usage (see the sketch after this list).
   For more context, see [How to optimize PromQL and MetricsQL queries](https://valyala.medium.com/how-to-optimize-promql-and-metricsql-queries-85a1b75bf986).
   VictoriaMetrics also provides a [query tracer](https://docs.victoriametrics.com/#query-tracing)
   to help identify the source of a heavy query.

3. Lack of free memory for processing workload spikes. If VictoriaMetrics components use almost all the available memory
   under the current workload, then it is recommended migrating to a host with bigger amounts of memory.
   This would protect from possible OOM crashes on workload spikes. It is recommended to have at least 50%
   of free memory for graceful handling of possible workload spikes.
   See [capacity planning for single-node VictoriaMetrics](https://docs.victoriametrics.com/#capacity-planning)
   and [capacity planning for cluster version of VictoriaMetrics](https://docs.victoriametrics.com/Cluster-VictoriaMetrics.html#capacity-planning).
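As referenced in item 2 above, a minimal sketch of such resource usage limits for a single-node VictoriaMetrics binary could look like this (the flag values are illustrative only and must be tuned to the actual workload; see the resource usage limits docs for the full list of flags):

```console
# limit the number of unique series a single query may select,
# the maximum query duration and the number of concurrently executed queries
/path/to/victoria-metrics-prod \
    -search.maxUniqueTimeseries=300000 \
    -search.maxQueryDuration=30s \
    -search.maxConcurrentRequests=8
```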
@@ -254,22 +261,38 @@ The most common sources of cluster instability are:
then the cluster will have only `N-1=2` healthy `vmstorage` nodes during the rolling restart.
This means that the load on healthy `vmstorage` nodes increases by at least `100%/(N-1)=50%`
compared to the load before the rolling restart, e.g. they need to process 50% more incoming
data and to return 50% more data during queries. In reality, the load on the remaining `vmstorage`
nodes increases even more, because they need to register new time series, which were re-routed
from the temporarily unavailable `vmstorage` node. If `vmstorage` nodes had less than 50%
of free resources (CPU, RAM, disk IO) before the rolling restart, then it
can lead to cluster overload and instability for both data ingestion and querying.

The workload increase during rolling restart can be reduced by increasing
the number of `vmstorage` nodes in the cluster. For example, if a VictoriaMetrics cluster contains
`N=11` `vmstorage` nodes, then the workload increase during rolling restart of `vmstorage` nodes
would be `100%/(N-1)=10%`. It is recommended to have at least 8 `vmstorage` nodes in the cluster.
The recommended number of `vmstorage` nodes should be multiplied by `-replicationFactor` if replication is enabled
(e.g. with `-replicationFactor=2` it is recommended to have at least `8*2=16` `vmstorage` nodes) -
see [replication and data safety docs](https://docs.victoriametrics.com/Cluster-VictoriaMetrics.html#replication-and-data-safety)
for details.

The obvious solution against VictoriaMetrics cluster instability is to make sure cluster components
have enough free resources for graceful processing of the increased workload.
See [capacity planning docs](https://docs.victoriametrics.com/Cluster-VictoriaMetrics.html#capacity-planning)
and [cluster resizing and scalability docs](https://docs.victoriametrics.com/Cluster-VictoriaMetrics.html#cluster-resizing-and-scalability)
for details.
## Monitoring

Having proper [monitoring](https://docs.victoriametrics.com/Single-server-VictoriaMetrics.html#monitoring)
would help identify and prevent most of the issues listed above.

[Grafana dashboards](https://grafana.com/orgs/victoriametrics/dashboards) contain panels reflecting the
health state, resource usage and other specific metrics for VictoriaMetrics components.
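These dashboards are built on top of the metrics that every VictoriaMetrics component exposes in Prometheus text format at the `/metrics` path. A quick sanity check (a sketch for a single-node instance on localhost:8428; cluster components listen on their own ports) could be:

```console
# count the VictoriaMetrics-specific metrics exposed by the instance
curl -s http://localhost:8428/metrics | grep -c '^vm_'
```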
Alerting rules for the [single-node](https://github.com/VictoriaMetrics/VictoriaMetrics/blob/master/deployment/docker/alerts.yml)
and [cluster](https://github.com/VictoriaMetrics/VictoriaMetrics/blob/cluster/deployment/docker/alerts.yml) versions
of VictoriaMetrics will notify about issues with VictoriaMetrics components and provide recommendations on how to solve them.

Internally, we heavily rely both on dashboards and alerts, and constantly improve them.
It is important to stay up to date with such changes.