From a645a95bd6a42701d937b19a011619b640469538 Mon Sep 17 00:00:00 2001 From: Roman Khavronenko Date: Mon, 13 Feb 2023 17:29:30 +0100 Subject: [PATCH] docs: improve troubleshooting docs for vmalert (#3812) Signed-off-by: hagen1778 --- app/vmalert/README.md | 45 ++++++++++++++++++++++++++++--------------- docs/keyConcepts.md | 24 +++++++++++++++++++++++ docs/vmalert.md | 45 ++++++++++++++++++++++++++++--------------- 3 files changed, 82 insertions(+), 32 deletions(-) diff --git a/app/vmalert/README.md b/app/vmalert/README.md index b2db420f95..3aee5da5bc 100644 --- a/app/vmalert/README.md +++ b/app/vmalert/README.md @@ -725,31 +725,42 @@ a review to the dashboard. ## Troubleshooting -vmalert executes configured rules within certain intervals. It is expected that at the moment when rule is executed, -the data is already present in configured `-datasource.url`: +### Data delay + +Data delay is one of the most common issues with rules execution. +vmalert executes configured rules within certain intervals at specifics timestamps. +It expects that the data is already present in configured `-datasource.url` at the moment of time when rule is executed: vmalert expected evaluation Usually, troubles start to appear when data in `-datasource.url` is delayed or absent. In such cases, evaluations -may get empty response from datasource and produce empty recording rules or reset alerts state: +may get empty response from the datasource and produce empty recording rules or reset alerts state: vmalert evaluation when data is delayed -By default, recently written samples to VictoriaMetrics aren't visible for queries for up to 30s. -This behavior is controlled by `-search.latencyOffset` command-line flag and the `latency_offset` query ag at `vmselect`. -Usually, this results into a 30s shift for recording rules results. -Note that too small value passed to `-search.latencyOffset` or to `latency_offest` query arg may lead to incomplete query results. +Try the following recommendations to reduce the chance of hitting the data delay issue: -Try the following recommendations in such cases: - -* Always configure group's `evaluationInterval` to be bigger or equal to `scrape_interval` at which metrics -are delivered to the datasource; +* Always configure group's `evaluationInterval` to be bigger or at least equal to +[time series resolution](https://docs.victoriametrics.com/keyConcepts.html#time-series-resolution); +* Ensure that `[duration]` value is at least twice bigger than +[time series resolution](https://docs.victoriametrics.com/keyConcepts.html#time-series-resolution). For example, +if expression is `rate(my_metric[2m]) > 0` then ensure that `my_metric` resolution is at least `1m` or better `30s`. +If you use VictoriaMetrics as datasource, `[duration]` can be omitted and VictoriaMetrics will adjust it automatically. * If you know in advance, that data in datasource is delayed - try changing vmalert's `-datasource.lookback` -command-line flag to add a time shift for evaluations; -* If time intervals between datapoints in datasource are irregular or `>=5min` - try changing vmalert's -`-datasource.queryStep` command-line flag to specify how far search query can lookback for the recent datapoint. -The recommendation is to have the step at least two times bigger than `scrape_interval`, since -there are no guarantees that scrape will not fail. +command-line flag to add a time shift for evaluations. Or extend `[duration]` to tolerate the delay. +For example, `max_over_time(errors_total[10m]) > 0` will be active even if there is no data in datasource for last `9m`. +* If [time series resolution](https://docs.victoriametrics.com/keyConcepts.html#time-series-resolution) +in datasource is inconsistent or `>=5min` - try changing vmalert's `-datasource.queryStep` command-line flag to specify +how far search query can lookback for the recent datapoint. The recommendation is to have the step +at least two times bigger than the resolution. + +> Please note, data delay is inevitable in distributed systems. And it is better to account for it instead of ignoring. + +By default, recently written samples to VictoriaMetrics aren't visible for queries for up to 30s +(see `-search.latencyOffset` command-line flag at vmselect). Such delay is needed to eliminate risk of incomplete +data on the moment of querying, since metrics collectors won't be able to deliver the data in time. + +### Alerts state Sometimes, it is not clear why some specific alert fired or didn't fire. It is very important to remember, that alerts with `for: 0` fire immediately when their expression becomes true. And alerts with `for > 0` will fire only @@ -772,6 +783,8 @@ HTTP request sent by vmalert to the `-datasource.url` during evaluation. If spec no samples returned and curl command returns data - then it is very likely there was no data in datasource on the moment when rule was evaluated. +### Debug mode + vmalert allows configuring more detailed logging for specific alerting rule. Just set `debug: true` in rule's configuration and vmalert will start printing additional log messages: ```terminal diff --git a/docs/keyConcepts.md b/docs/keyConcepts.md index 3673ef6798..d4d13aabc0 100644 --- a/docs/keyConcepts.md +++ b/docs/keyConcepts.md @@ -79,6 +79,30 @@ requests_total{path="/", code="200"} 123 4567890 - The `4567890` is an optional timestamp for the sample. If it is missing, then the current timestamp is used when storing the sample in VictoriaMetrics. +#### Time series resolution + +Resolution is the minimum interval between [raw samples](https://docs.victoriametrics.com/keyConcepts.html#raw-samples) +of the [time series](https://docs.victoriametrics.com/keyConcepts.html#time-series). Consider the following example: +``` +---------------------------------------------------------------------- +|