diff --git a/app/vmalert/README.md b/app/vmalert/README.md index b2db420f9..3aee5da5b 100644 --- a/app/vmalert/README.md +++ b/app/vmalert/README.md @@ -725,31 +725,42 @@ a review to the dashboard. ## Troubleshooting -vmalert executes configured rules within certain intervals. It is expected that at the moment when rule is executed, -the data is already present in configured `-datasource.url`: +### Data delay + +Data delay is one of the most common issues with rules execution. +vmalert executes configured rules within certain intervals at specifics timestamps. +It expects that the data is already present in configured `-datasource.url` at the moment of time when rule is executed: <img alt="vmalert expected evaluation" src="vmalert_ts_normal.gif"> Usually, troubles start to appear when data in `-datasource.url` is delayed or absent. In such cases, evaluations -may get empty response from datasource and produce empty recording rules or reset alerts state: +may get empty response from the datasource and produce empty recording rules or reset alerts state: <img alt="vmalert evaluation when data is delayed" src="vmalert_ts_data_delay.gif"> -By default, recently written samples to VictoriaMetrics aren't visible for queries for up to 30s. -This behavior is controlled by `-search.latencyOffset` command-line flag and the `latency_offset` query ag at `vmselect`. -Usually, this results into a 30s shift for recording rules results. -Note that too small value passed to `-search.latencyOffset` or to `latency_offest` query arg may lead to incomplete query results. +Try the following recommendations to reduce the chance of hitting the data delay issue: -Try the following recommendations in such cases: - -* Always configure group's `evaluationInterval` to be bigger or equal to `scrape_interval` at which metrics -are delivered to the datasource; +* Always configure group's `evaluationInterval` to be bigger or at least equal to +[time series resolution](https://docs.victoriametrics.com/keyConcepts.html#time-series-resolution); +* Ensure that `[duration]` value is at least twice bigger than +[time series resolution](https://docs.victoriametrics.com/keyConcepts.html#time-series-resolution). For example, +if expression is `rate(my_metric[2m]) > 0` then ensure that `my_metric` resolution is at least `1m` or better `30s`. +If you use VictoriaMetrics as datasource, `[duration]` can be omitted and VictoriaMetrics will adjust it automatically. * If you know in advance, that data in datasource is delayed - try changing vmalert's `-datasource.lookback` -command-line flag to add a time shift for evaluations; -* If time intervals between datapoints in datasource are irregular or `>=5min` - try changing vmalert's -`-datasource.queryStep` command-line flag to specify how far search query can lookback for the recent datapoint. -The recommendation is to have the step at least two times bigger than `scrape_interval`, since -there are no guarantees that scrape will not fail. +command-line flag to add a time shift for evaluations. Or extend `[duration]` to tolerate the delay. +For example, `max_over_time(errors_total[10m]) > 0` will be active even if there is no data in datasource for last `9m`. +* If [time series resolution](https://docs.victoriametrics.com/keyConcepts.html#time-series-resolution) +in datasource is inconsistent or `>=5min` - try changing vmalert's `-datasource.queryStep` command-line flag to specify +how far search query can lookback for the recent datapoint. The recommendation is to have the step +at least two times bigger than the resolution. + +> Please note, data delay is inevitable in distributed systems. And it is better to account for it instead of ignoring. + +By default, recently written samples to VictoriaMetrics aren't visible for queries for up to 30s +(see `-search.latencyOffset` command-line flag at vmselect). Such delay is needed to eliminate risk of incomplete +data on the moment of querying, since metrics collectors won't be able to deliver the data in time. + +### Alerts state Sometimes, it is not clear why some specific alert fired or didn't fire. It is very important to remember, that alerts with `for: 0` fire immediately when their expression becomes true. And alerts with `for > 0` will fire only @@ -772,6 +783,8 @@ HTTP request sent by vmalert to the `-datasource.url` during evaluation. If spec no samples returned and curl command returns data - then it is very likely there was no data in datasource on the moment when rule was evaluated. +### Debug mode + vmalert allows configuring more detailed logging for specific alerting rule. Just set `debug: true` in rule's configuration and vmalert will start printing additional log messages: ```terminal diff --git a/docs/keyConcepts.md b/docs/keyConcepts.md index 3673ef679..d4d13aabc 100644 --- a/docs/keyConcepts.md +++ b/docs/keyConcepts.md @@ -79,6 +79,30 @@ requests_total{path="/", code="200"} 123 4567890 - The `4567890` is an optional timestamp for the sample. If it is missing, then the current timestamp is used when storing the sample in VictoriaMetrics. +#### Time series resolution + +Resolution is the minimum interval between [raw samples](https://docs.victoriametrics.com/keyConcepts.html#raw-samples) +of the [time series](https://docs.victoriametrics.com/keyConcepts.html#time-series). Consider the following example: +``` +---------------------------------------------------------------------- +| <time series> | <value> | <timestamp> | +| requests_total{path="/health", code="200"} | 1 | 1676297640 | +| requests_total{path="/health", code="200"} | 2 | 1676297670 | +| requests_total{path="/health", code="200"} | 3 | 1676297700 | +| requests_total{path="/health", code="200"} | 4 | 1676297730 | +.... +``` +Here we have a time series `requests_total{path="/health", code="200"}` which has a value update each `30s`. +This means, its resolution is also a `30s`. + +> In terms of [pull model](https://docs.victoriametrics.com/keyConcepts.html#pull-model), resolution is equal +> to `scrape_interval` and is controlled by the monitoring system (server). +> For [push model](https://docs.victoriametrics.com/keyConcepts.html#push-model), resolution is an interval between +> samples timestamps and is controlled by a client (metrics collector). + +Try to keep time series resolution consistent, since some [MetricsQL](#metricsql) functions may expect it to be so. + + ### Types of metrics Internally, VictoriaMetrics does not have the notion of a metric type. The concept of a metric diff --git a/docs/vmalert.md b/docs/vmalert.md index 7f2ba38ff..30142f80a 100644 --- a/docs/vmalert.md +++ b/docs/vmalert.md @@ -729,31 +729,42 @@ a review to the dashboard. ## Troubleshooting -vmalert executes configured rules within certain intervals. It is expected that at the moment when rule is executed, -the data is already present in configured `-datasource.url`: +### Data delay + +Data delay is one of the most common issues with rules execution. +vmalert executes configured rules within certain intervals at specifics timestamps. +It expects that the data is already present in configured `-datasource.url` at the moment of time when rule is executed: <img alt="vmalert expected evaluation" src="vmalert_ts_normal.gif"> Usually, troubles start to appear when data in `-datasource.url` is delayed or absent. In such cases, evaluations -may get empty response from datasource and produce empty recording rules or reset alerts state: +may get empty response from the datasource and produce empty recording rules or reset alerts state: <img alt="vmalert evaluation when data is delayed" src="vmalert_ts_data_delay.gif"> -By default, recently written samples to VictoriaMetrics aren't visible for queries for up to 30s. -This behavior is controlled by `-search.latencyOffset` command-line flag and the `latency_offset` query ag at `vmselect`. -Usually, this results into a 30s shift for recording rules results. -Note that too small value passed to `-search.latencyOffset` or to `latency_offest` query arg may lead to incomplete query results. +Try the following recommendations to reduce the chance of hitting the data delay issue: -Try the following recommendations in such cases: - -* Always configure group's `evaluationInterval` to be bigger or equal to `scrape_interval` at which metrics -are delivered to the datasource; +* Always configure group's `evaluationInterval` to be bigger or at least equal to +[time series resolution](https://docs.victoriametrics.com/keyConcepts.html#time-series-resolution); +* Ensure that `[duration]` value is at least twice bigger than +[time series resolution](https://docs.victoriametrics.com/keyConcepts.html#time-series-resolution). For example, +if expression is `rate(my_metric[2m]) > 0` then ensure that `my_metric` resolution is at least `1m` or better `30s`. +If you use VictoriaMetrics as datasource, `[duration]` can be omitted and VictoriaMetrics will adjust it automatically. * If you know in advance, that data in datasource is delayed - try changing vmalert's `-datasource.lookback` -command-line flag to add a time shift for evaluations; -* If time intervals between datapoints in datasource are irregular or `>=5min` - try changing vmalert's -`-datasource.queryStep` command-line flag to specify how far search query can lookback for the recent datapoint. -The recommendation is to have the step at least two times bigger than `scrape_interval`, since -there are no guarantees that scrape will not fail. +command-line flag to add a time shift for evaluations. Or extend `[duration]` to tolerate the delay. +For example, `max_over_time(errors_total[10m]) > 0` will be active even if there is no data in datasource for last `9m`. +* If [time series resolution](https://docs.victoriametrics.com/keyConcepts.html#time-series-resolution) +in datasource is inconsistent or `>=5min` - try changing vmalert's `-datasource.queryStep` command-line flag to specify +how far search query can lookback for the recent datapoint. The recommendation is to have the step +at least two times bigger than the resolution. + +> Please note, data delay is inevitable in distributed systems. And it is better to account for it instead of ignoring. + +By default, recently written samples to VictoriaMetrics aren't visible for queries for up to 30s +(see `-search.latencyOffset` command-line flag at vmselect). Such delay is needed to eliminate risk of incomplete +data on the moment of querying, since metrics collectors won't be able to deliver the data in time. + +### Alerts state Sometimes, it is not clear why some specific alert fired or didn't fire. It is very important to remember, that alerts with `for: 0` fire immediately when their expression becomes true. And alerts with `for > 0` will fire only @@ -776,6 +787,8 @@ HTTP request sent by vmalert to the `-datasource.url` during evaluation. If spec no samples returned and curl command returns data - then it is very likely there was no data in datasource on the moment when rule was evaluated. +### Debug mode + vmalert allows configuring more detailed logging for specific alerting rule. Just set `debug: true` in rule's configuration and vmalert will start printing additional log messages: ```terminal