mirror of https://github.com/VictoriaMetrics/VictoriaMetrics.git synced 2024-11-21 14:44:00 +00:00

alerts: add docs section for the full list of alerting rules

The change also includes update of all references in other docs
to the alerting rules.

Signed-off-by: hagen1778 <roman@victoriametrics.com>

2023-08-03 10:46:25 +02:00

Troubleshooting

This document contains troubleshooting guides for the most common issues when working with VictoriaMetrics:

General troubleshooting checklist
Unexpected query results
Slow data ingestion
Slow queries
Out of memory errors
Cluster instability
Monitoring

General troubleshooting checklist

If you run into an issue or have a question about VictoriaMetrics components, then please follow these steps in order to quickly find the solution:

Check the version of the VictoriaMetrics component that needs troubleshooting and compare it to the latest available version. If your version is older than the latest available version, then there is a high chance that the issue is already resolved in a newer version. Carefully read the changelog between your version and the latest version and check whether the issue is already fixed there.

If the issue is already fixed in newer versions, then upgrade to the newer version and verify whether the issue is fixed:
- How to upgrade single-node VictoriaMetrics
- How to upgrade VictoriaMetrics cluster
The upgrade procedure for other VictoriaMetrics components is as simple as gracefully stopping the component by sending it a SIGINT signal and starting the new version of the component.

In rare cases there may be breaking changes between different versions of VictoriaMetrics components. These cases are documented in the changelog, so please read the changelog before the upgrade.
Inspect the command-line flags passed to VictoriaMetrics components and remove flags that have unclear outcomes for your workload. VictoriaMetrics components are designed to work optimally with the default command-line flag values (i.e. when these flags aren't set explicitly). It is recommended to remove flags with unclear outcomes, since they may result in unexpected issues.
Check the logs of VictoriaMetrics components. They may contain useful information about the cause of the issue and how to fix it. If the log message doesn't have enough useful information for troubleshooting, then search for the log message in Google. There is a high chance that the issue is already reported somewhere (docs, StackOverflow, GitHub issues, etc.) and the solution is already documented there.
If VictoriaMetrics logs have no relevant information, then try searching for the issue in Google via multiple keywords and phrases specific to the issue. There is a high chance that the issue and the solution are already documented somewhere.
Try searching for the issue at VictoriaMetrics GitHub. The signal-to-noise ratio of search results there is much lower than in Google, but sometimes it may help find relevant information about the issue when Google fails to find it. If you located the relevant GitHub issue, but it lacks some information on how to diagnose or troubleshoot it, then please provide this information in comments to the issue. This increases the chances that it will be resolved soon.
Try searching for information about the issue in VictoriaMetrics source code. GitHub code search may not work well in some cases, so it is recommended to check out the VictoriaMetrics source code and perform a local search in the checked-out code. Note that the source code for VictoriaMetrics cluster is located in the cluster branch.
Try searching for information about the issue in the history of VictoriaMetrics Slack chat. There is a non-zero chance that somebody has already run into the same issue and documented the solution in Slack.
If the steps above didn't help find the solution to the issue, then please file a new issue and provide as many details as possible on how to reproduce it.

After that you can post the link to the issue in VictoriaMetrics Slack chat, so the VictoriaMetrics community can help find the solution to the issue. It is better to file the issue at VictoriaMetrics GitHub before posting your question to VictoriaMetrics Slack chat, since GitHub issues are indexed by Google, while Slack messages aren't. This simplifies searching for the solution to the issue for future VictoriaMetrics users.
Pro tip 1: if you see that VictoriaMetrics docs contain incomplete or incorrect information, then please create a pull request with the relevant changes. This will help VictoriaMetrics community.

All the docs published at https://docs.victoriametrics.com are located in the docs folder inside VictoriaMetrics repository.
Pro tip 2: please provide links to existing docs / GitHub issues / StackOverflow questions instead of copy-pasting the information from these sources when asking or answering questions from the VictoriaMetrics community. If the linked resources don't have enough information, then it is better to post the missing information in the web resource before providing links to this information in Slack chat. This will simplify searching for this information in the future for VictoriaMetrics users via Google and ChatGPT :)
Pro tip 3: if you are answering somebody's question about VictoriaMetrics components at GitHub issues / Slack chat / StackOverflow, then the best answer is a direct link to the information regarding the question. The next best answer is a concise message with multiple links to the relevant information. The worst answer is a message with misleading or completely wrong information.
Pro tip 4: if you can fix the issue yourself, then please do it and provide the corresponding pull request! We are glad to get pull requests from the VictoriaMetrics community.

Unexpected query results

If you see unexpected or unreliable query results from VictoriaMetrics, then try the following steps:

Check whether simplified queries return unexpected results. For example, if the query looks like sum(rate(http_requests_total[5m])) by (job), then check whether the following queries return expected results:
- Remove the outer sum and execute rate(http_requests_total[5m]), since aggregations could hide some missing series, gaps in data or anomalies in existing series. If this query returns too many time series, then try adding more specific label filters to it. For example, if you see that the original query returns unexpected results for job="foo", then use the rate(http_requests_total{job="foo"}[5m]) query. If this isn't enough, then continue adding more specific label filters, so the resulting query returns a manageable number of time series.
- Remove the outer rate and execute http_requests_total. Additional label filters may be added here in order to reduce the number of returned series.
Sometimes the query may be improperly constructed, so it returns unexpected results. It is recommended to read and understand the MetricsQL docs, especially the subqueries and rollup functions sections.
If the simplest query continues returning unexpected / unreliable results, then try verifying correctness of raw unprocessed samples for this query via /api/v1/export on the given [start..end] time range and check whether they are expected:
```
# single-node
curl http://victoriametrics:8428/api/v1/export -d 'match[]=http_requests_total' -d 'start=...' -d 'end=...'

# cluster
curl http://<vmselect>:8481/select/<tenantID>/prometheus/api/v1/export -d 'match[]=http_requests_total' -d 'start=...' -d 'end=...'
```
Note that responses returned from /api/v1/query and from /api/v1/query_range contain evaluated data instead of raw samples stored in VictoriaMetrics. See these docs for details.

If you migrate from InfluxDB, then pass -search.setLookbackToStep command-line flag to single-node VictoriaMetrics or to vmselect in VictoriaMetrics cluster. See also how to migrate from InfluxDB to VictoriaMetrics.
Sometimes response caching may lead to unexpected results when samples with older timestamps are ingested into VictoriaMetrics (aka backfilling). Try disabling response cache and see whether this helps. This can be done in the following ways:
- By passing -search.disableCache command-line flag to a single-node VictoriaMetrics or to all the vmselect components if cluster version of VictoriaMetrics is used.
- By passing nocache=1 query arg to every request to /api/v1/query and /api/v1/query_range. If you use Grafana, then this query arg can be specified in Custom Query Parameters field at Prometheus datasource settings - see these docs for details.
If you use cluster version of VictoriaMetrics, then it may return partial responses by default when some of vmstorage nodes are temporarily unavailable - see cluster availability docs for details. If you want to prioritize query consistency over cluster availability, then you can pass -search.denyPartialResponse command-line flag to all the vmselect nodes. In this case VictoriaMetrics returns an error during querying if at least a single vmstorage node is unavailable. Another option is to pass deny_partial_response=1 query arg to /api/v1/query and /api/v1/query_range. If you use Grafana, then this query arg can be specified in Custom Query Parameters field at Prometheus datasource settings - see these docs for details.
If you pass the -replicationFactor command-line flag to vmselect, then it is recommended to remove this flag from vmselect, since it may lead to incomplete responses when vmstorage nodes contain fewer than -replicationFactor copies of the requested data.
If you observe gaps when plotting time series, then try simplifying your query as described in the first steps of this list and continue with the remaining steps. If the problem remains, then it is likely caused by irregular intervals between collected samples (network delays or unavailable targets on scrapes, irregular pushes, irregular timestamps). VictoriaMetrics automatically fills the gaps based on the median interval between data samples. This may work incorrectly for irregular data, since the median will be skewed. In this case it is recommended to switch to a static interval for gap filling by setting the -search.minStalenessInterval=5m command-line flag (5m is the static interval used by Prometheus).
Try upgrading to the latest available version of VictoriaMetrics and verifying whether the issue is fixed there.
Try executing the query with trace=1 query arg. This enables query tracing, which may contain useful information on why the query returns unexpected data. See query tracing docs for details.
Inspect the command-line flags passed to VictoriaMetrics components. If you don't clearly understand the purpose or the effect of some flags, then remove them from the list of flags passed to VictoriaMetrics components, because some command-line flags may change query results in unexpected ways when set to improper values. VictoriaMetrics is optimized for running with default flag values (i.e. when they aren't set explicitly).
If the steps above didn't help identify the root cause of unexpected query results, then file a bug report with details on how to reproduce the issue.
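
The query args mentioned above (nocache=1, deny_partial_response=1, trace=1) can be combined in a single request while debugging. The following is a minimal Python sketch, not an official client: the helper names debug_query_url and sample_intervals_ms are hypothetical, and the base URL is the single-node address used in the curl examples above. The second helper assumes the JSON line format returned by /api/v1/export, where each line holds a metric object plus parallel values and timestamps arrays with timestamps in milliseconds. It only illustrates how irregular sample intervals can be spotted in raw samples and doesn't reproduce the gap-filling logic inside VictoriaMetrics.

```python
import json
from statistics import median
from urllib.parse import urlencode


def debug_query_url(base_url, query, nocache=True, deny_partial=False, trace=False):
    """Build an /api/v1/query URL with the debugging query args from this section.

    For the cluster version, base_url would have to include the
    /select/<tenantID>/prometheus prefix of a vmselect node.
    """
    params = {"query": query}
    if nocache:
        params["nocache"] = "1"  # bypass the response cache
    if deny_partial:
        params["deny_partial_response"] = "1"  # fail instead of returning partial data
    if trace:
        params["trace"] = "1"  # enable query tracing
    return base_url.rstrip("/") + "/api/v1/query?" + urlencode(params)


def sample_intervals_ms(export_line):
    """Return (median, max) interval in milliseconds between raw samples
    for one series line taken from /api/v1/export output."""
    series = json.loads(export_line)
    ts = series["timestamps"]
    intervals = [b - a for a, b in zip(ts, ts[1:])]
    if not intervals:
        return None  # fewer than two samples, nothing to compare
    return median(intervals), max(intervals)


if __name__ == "__main__":
    print(debug_query_url("http://victoriametrics:8428", "http_requests_total", trace=True))
    # Made-up series: three scrapes 15s apart, followed by a 60s gap.
    line = ('{"metric":{"__name__":"http_requests_total"},'
            '"values":[1,2,3,4],"timestamps":[0,15000,30000,90000]}')
    print(sample_intervals_ms(line))
```

A maximum interval that is much bigger than the median interval hints at irregular collection, which is the case where a static -search.minStalenessInterval may help.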

Slow data ingestion

The most common reasons for slow data ingestion in VictoriaMetrics are:

Memory shortage for the given amounts of active time series.

VictoriaMetrics (or vmstorage in cluster version of VictoriaMetrics) maintains an in-memory cache for quick search for internal series ids per each incoming metric. This cache is named storage/tsid. VictoriaMetrics automatically determines the maximum size for this cache depending on the available memory on the host where VictoriaMetrics (or vmstorage) runs. If the cache size isn't enough for holding all the entries for active time series, then VictoriaMetrics locates the needed data on disk, unpacks it, re-constructs the missing entry and puts it into the cache. This takes additional CPU time and disk read IO.

The official Grafana dashboards for VictoriaMetrics contain a Slow inserts graph, which shows the cache miss percentage for the storage/tsid cache during data ingestion. If the Slow inserts graph shows values greater than 5% for more than 10 minutes, then it is likely that the current number of active time series doesn't fit into the storage/tsid cache.

The following solutions exist for this issue:
- Increase the available memory on the host where VictoriaMetrics runs until the slow inserts percentage drops below 5%. If you run VictoriaMetrics cluster, then you need to increase the total available memory at vmstorage nodes. This can be done in two ways: either increase the available memory on each existing vmstorage node or add more vmstorage nodes to the cluster.
- Reduce the number of active time series. The official Grafana dashboards for VictoriaMetrics contain a graph showing the number of active time series. Recent versions of VictoriaMetrics provide a cardinality explorer, which can help determine and fix the source of high cardinality.
High churn rate, i.e. when old time series are substituted with new time series at a high rate. When VictoriaMetrics encounters a sample for a new time series, it needs to register the time series in the internal index (aka indexdb), so it can be quickly located on subsequent select queries. Registering a new time series in the internal index is an order of magnitude slower than adding a new sample to an already registered time series. So VictoriaMetrics may work slower than expected under high churn rate.

The official Grafana dashboards for VictoriaMetrics provide a Churn rate graph, which shows the average number of new time series registered during the last 24 hours. If this number exceeds the number of active time series, then you need to identify and fix the source of high churn rate. The most common source of high churn rate is a label that frequently changes its value. Try avoiding such labels. The cardinality explorer can help identify such labels.
Resource shortage. The official Grafana dashboards for VictoriaMetrics contain resource usage graphs, which show memory usage, CPU usage, disk IO usage and free disk size. Make sure VictoriaMetrics has enough free resources for graceful handling of potential spikes in workload according to the following recommendations:
- 50% of free CPU
- 50% of free memory
- 20% of free disk space
If VictoriaMetrics components have less free resources, then this may lead to significant performance degradation after the workload increases slightly. For example:
- If the percentage of free CPU is close to 0, then VictoriaMetrics may experience arbitrarily long delays during data ingestion when it cannot keep up with a slightly increased data ingestion rate.
- If the percentage of free memory reaches 0, then the operating system where VictoriaMetrics components run may not have enough memory for the page cache. VictoriaMetrics relies on the page cache for quick queries over recently ingested data. If the operating system doesn't have enough free memory for the page cache, then it needs to re-read the requested data from disk. This may significantly increase disk read IO and slow down both queries and data ingestion.
- If free disk space is lower than 20%, then VictoriaMetrics is unable to perform optimal background merge of the incoming data. This leads to an increased number of data files on disk, which, in turn, slows down both data ingestion and querying. See these docs for details.
If you run the cluster version of VictoriaMetrics, then make sure vminsert and vmstorage components are located in the same network with small network latency between them. vminsert packs incoming data into batch packets and sends them to vmstorage one-by-one. It waits until vmstorage returns an ack response before sending the next packet. If the network latency between vminsert and vmstorage is high (for example, if they run in different datacenters), then this may become the limiting factor for data ingestion speed.

The official Grafana dashboard for the cluster version of VictoriaMetrics contains a connection saturation graph for vminsert components. If this graph reaches 100% (1s), then it is likely you have issues with network latency between vminsert and vmstorage. Another possible reason for 100% connection saturation between vminsert and vmstorage is resource shortage at vmstorage nodes. In this case you need to increase the amount of available resources (CPU, RAM, disk IO) at vmstorage nodes or to add more vmstorage nodes to the cluster.
Noisy neighbor. Make sure VictoriaMetrics components run in an environment without other resource-hungry apps. Such apps may steal RAM, CPU, disk IO and network bandwidth needed by VictoriaMetrics components. Issues like this are very hard to catch via the official Grafana dashboard for the cluster version of VictoriaMetrics, and proper diagnosis requires checking resource usage on the instances where VictoriaMetrics runs.
If you see the TooHighSlowInsertsRate alert when single-node VictoriaMetrics or vmstorage has enough free CPU and RAM, then increase the -cacheExpireDuration command-line flag at single-node VictoriaMetrics or at vmstorage to a value that exceeds the interval between ingested samples for the same time series (aka scrape_interval). See this comment for more details.
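
The rules of thumb from this section can be written down as simple checks. The Python sketch below is only an illustration: the function names and the way the inputs are collected are hypothetical (in practice the values come from the official Grafana dashboards), while the thresholds are the ones stated above: more than 5% of slow inserts for more than 10 minutes, more new series during the last 24 hours than active series, and at least 50% of free CPU, 50% of free memory and 20% of free disk space.

```python
def slow_inserts_too_high(percent_per_minute, threshold=5.0, minutes=10):
    """True if the slow inserts percentage stays above the threshold
    for more than `minutes` consecutive one-minute readings."""
    run = 0
    for value in percent_per_minute:
        run = run + 1 if value > threshold else 0
        if run > minutes:
            return True
    return False


def churn_rate_too_high(new_series_24h, active_series):
    """True if more series were registered over 24h than are currently active."""
    return new_series_24h > active_series


def missing_headroom(free_cpu_pct, free_mem_pct, free_disk_pct):
    """Return the names of resources with less free share than recommended."""
    recommended = {"cpu": 50, "memory": 50, "disk": 20}
    actual = {"cpu": free_cpu_pct, "memory": free_mem_pct, "disk": free_disk_pct}
    return [name for name, minimum in recommended.items() if actual[name] < minimum]


if __name__ == "__main__":
    # Made-up readings, for illustration only.
    print(slow_inserts_too_high([7.5] * 12))
    print(churn_rate_too_high(new_series_24h=3_000_000, active_series=1_000_000))
    print(missing_headroom(free_cpu_pct=60, free_mem_pct=40, free_disk_pct=10))
```

The checks are deliberately independent of each other, since any one of them can explain slow ingestion by itself.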

Slow queries

Some queries may take more time and resources (CPU, RAM, network bandwidth) than others. VictoriaMetrics logs slow queries if their execution time exceeds the duration passed to -search.logSlowQueryDuration command-line flag (5s by default). VictoriaMetrics also provides /api/v1/status/top_queries endpoint, which returns queries that took the most time to execute. See these docs for details.

The following solutions exist for slow queries:

Adding more CPU and memory to VictoriaMetrics, so it can execute the slow query faster. If you use the cluster version of VictoriaMetrics, then migrating vmselect nodes to machines with more CPU and RAM should help improve the speed of slow queries. Query performance is always limited by the resources of the single vmselect node that processes the query. For example, if 2 vCPU cores on vmselect aren't enough to process the query fast enough, then migrating vmselect to a machine with 4 vCPU cores should increase heavy query performance by up to 2x. If the line on the Concurrent select graph from the official Grafana dashboard for VictoriaMetrics is close to the limit, then prefer adding more vmselect nodes to the cluster. Sometimes adding more vmstorage nodes can also help improve the speed of slow queries.
Rewriting slow queries, so they become faster. Unfortunately it is hard to determine whether a given query is slow by just looking at it. VictoriaMetrics provides a query tracing feature, which can help determine the source of a slow query. See also this article, which explains how to determine and optimize slow queries.

In practice many slow queries are caused by improper use of subqueries. It is recommended to avoid subqueries if you don't clearly understand how they work. It is easy to create a subquery without knowing about it. For example, rate(sum(some_metric)) is implicitly transformed into the following subquery according to implicit conversion rules for MetricsQL queries:
```
rate(
  sum(
    default_rollup(some_metric[1i])
  )[1i:1i]
)
```
It is likely this query won't return the expected results. sum(rate(some_metric)) must be used instead. See this article for more details.

Out of memory errors

There are the following most common sources of out of memory (aka OOM) crashes in VictoriaMetrics:

Improper command-line flag values. Inspect the command-line flags passed to VictoriaMetrics components. If you don't clearly understand the purpose or the effect of some flags, then remove them from the list of flags passed to VictoriaMetrics components. Improper command-line flag values may lead to increased memory and CPU usage. The increased memory usage increases the chances of OOM crashes. VictoriaMetrics is optimized for running with default flag values (i.e. when they aren't set explicitly).

For example, it isn't recommended to tune cache sizes in VictoriaMetrics, since this frequently leads to OOM crashes. These docs list the command-line flags that aren't recommended to tune. If you see that VictoriaMetrics needs bigger cache sizes for the current workload, then it is better to migrate to a host with more memory instead of trying to tune cache sizes manually.
Unexpected heavy queries. A query is considered heavy if it needs to select and process millions of unique time series. Such a query may lead to an OOM crash, since VictoriaMetrics needs to keep some per-series data in memory. VictoriaMetrics provides various settings, which can help limit resource usage. For more context, see How to optimize PromQL and MetricsQL queries. VictoriaMetrics also provides a query tracer to help identify the source of a heavy query.
Lack of free memory for processing workload spikes. If VictoriaMetrics components use almost all the available memory under the current workload, then it is recommended to migrate to a host with more memory. This protects from possible OOM crashes on workload spikes. It is recommended to have at least 50% of free memory for graceful handling of possible workload spikes. See capacity planning for single-node VictoriaMetrics and capacity planning for cluster version of VictoriaMetrics.

Cluster instability

VictoriaMetrics cluster may become unstable if there aren't enough free resources (CPU, RAM, disk IO, network bandwidth) for processing the current workload.

The most common sources of cluster instability are:

Workload spikes. For example, if the number of active time series increases by 2x while the cluster doesn't have enough free resources for processing the increased workload, then it may become unstable. VictoriaMetrics provides various configuration settings, which can be used for limiting unexpected workload spikes. See these docs for details.
Various maintenance tasks such as rolling upgrades or rolling restarts during configuration changes. For example, if a cluster contains N=3 vmstorage nodes and they are restarted one-by-one (aka rolling restart), then the cluster will have only N-1=2 healthy vmstorage nodes during the rolling restart. This means that the load on healthy vmstorage nodes increases by at least 100%/(N-1)=50% compared to the load before the rolling restart, i.e. they need to process 50% more incoming data and to return 50% more data during queries. In reality, the load on the remaining vmstorage nodes increases even more, because they need to register new time series, which were re-routed from the temporarily unavailable vmstorage node. If vmstorage nodes had less than 50% of free resources (CPU, RAM, disk IO) before the rolling restart, then it can lead to cluster overload and instability for both data ingestion and querying.

The workload increase during rolling restart can be reduced by increasing the number of vmstorage nodes in the cluster. For example, if VictoriaMetrics cluster contains N=11 vmstorage nodes, then the workload increase during rolling restart of vmstorage nodes would be 100%/(N-1)=10%. It is recommended to have at least 8 vmstorage nodes in the cluster. The recommended number of vmstorage nodes should be multiplied by -replicationFactor if replication is enabled - see replication and data safety docs for details.
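
The arithmetic above can be checked with a short calculation. This is a minimal Python sketch under the same simplifying assumption as the text, namely that the load is spread evenly across vmstorage nodes and only one node is unavailable at a time. The function names are hypothetical. The extra cost of registering re-routed time series is not modeled, so the real increase is higher.

```python
def load_increase_percent(n_nodes):
    """Minimum extra load, in percent, on every remaining vmstorage node
    while one of n_nodes is unavailable: 100% / (N - 1)."""
    if n_nodes < 2:
        raise ValueError("at least 2 vmstorage nodes are needed for a rolling restart")
    return 100.0 / (n_nodes - 1)


def recommended_vmstorage_nodes(replication_factor=1, base_nodes=8):
    """Recommended minimum of 8 nodes multiplied by -replicationFactor."""
    return base_nodes * replication_factor


if __name__ == "__main__":
    for n in (3, 8, 11):
        print(f"N={n}: +{load_increase_percent(n):.1f}% load per remaining node")
    print(recommended_vmstorage_nodes(replication_factor=2))
```

With N=3 the function returns 50, and with N=11 it returns 10, matching the examples in the text.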

The obvious remedy for VictoriaMetrics cluster instability is to make sure cluster components have enough free resources for graceful processing of the increased workload. See capacity planning docs and cluster resizing and scalability docs for details.

Monitoring

Having proper monitoring would help identify and prevent most of the issues listed above.

Grafana dashboards contain panels reflecting the health state, resource usage and other specific metrics for VictoriaMetrics components.

The list of recommended alerting rules for VictoriaMetrics components will notify about issues and provide recommendations for how to solve them.

Internally, we heavily rely both on dashboards and alerts, and constantly improve them. It is important to stay up to date with such changes.
