vmalert: add -rule.evalDelay flag and eval_delay as group attribute (#5185)

Also mark `-datasource.lookback` as will be deprecated, see https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5155.

(cherry picked from commit c9375cac5e)
This commit is contained in:
Hui Wang 2023-10-25 17:54:18 +08:00 committed by hagen1778
parent d9118cdaab
commit 86d861ec55
No known key found for this signature in database
GPG key ID: 3BF75F3741CA9640
7 changed files with 93 additions and 31 deletions

View file

@ -119,6 +119,12 @@ name: <string>
# `eval_offset` can't be bigger than `interval`.
[ eval_offset: <duration> ]
# Optional
# Adjust the `time` parameter of group evaluation requests to compensate intentional query delay from datasource.
# By default, use flag `-rule.evalDelay` equal to `-search.latencyOffset` (a cmd-line flag configured for VictoriaMetrics single-node or vmselect). But if group has `latency_offset` param which value differs from `-search.latencyOffset`, set `eval_delay` equal to `latency_offset`.
# See https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5155.
[ eval_delay: <duration> ]
# Limit the number of alerts an alerting rule and series a recording
# rule can produce. 0 is no limit.
[ limit: <int> | default = 0 ]
@ -794,9 +800,7 @@ Try the following recommendations to reduce the chance of hitting the data delay
[time series resolution](https://docs.victoriametrics.com/keyConcepts.html#time-series-resolution). For example,
if expression is `rate(my_metric[2m]) > 0` then ensure that `my_metric` resolution is at least `1m` or better `30s`.
If you use VictoriaMetrics as datasource, `[duration]` can be omitted and VictoriaMetrics will adjust it automatically.
* If you know in advance, that data in datasource is delayed - try changing vmalerts `-datasource.lookback`
command-line flag to add a time shift for evaluations. Or extend `[duration]` to tolerate the delay.
For example, `max_over_time(errors_total[10m]) > 0` will be active even if there is no data in datasource for last `9m`.
* Extend `[duration]` in expr to help tolerate the delay. For example, `max_over_time(errors_total[10m]) > 0` will be active even if there is no data in datasource for last `9m`.
* If [time series resolution](https://docs.victoriametrics.com/keyConcepts.html#time-series-resolution)
in datasource is inconsistent or `>=5min` - try changing vmalerts `-datasource.queryStep` command-line flag to specify
how far search query can lookback for the recent datapoint. The recommendation is to have the step
@ -805,8 +809,10 @@ at least two times bigger than the resolution.
> Please note, data delay is inevitable in distributed systems. And it is better to account for it instead of ignoring.
By default, recently written samples to VictoriaMetrics aren't visible for queries for up to 30s
(see `-search.latencyOffset` command-line flag at vmselect). Such delay is needed to eliminate risk of incomplete
(see `-search.latencyOffset` command-line flag at vmselect, and it can be overridden by adding `latency_offset` to group's params). Such delay is needed to eliminate risk of incomplete
data on the moment of querying, since metrics collectors won't be able to deliver the data in time.
To compensate the latency in timestamps for produced evaluation results, `-rule.evalDelay` is also set to 30s by default.
If you changed the `-search.latencyOffset`(cmd-line flag configured for VictoriaMetrics single-node or vmselect) value and observed a delay in timestamps for produced evaluation results, try changing `-rule.evalDelay` equal to `-search.latencyOffset`.
### Alerts state
@ -1291,6 +1297,9 @@ The shortlist of configuration flags is the following:
Limits the maximum duration for automatic alert expiration, which by default is 4 times evaluationInterval of the parent group.
-rule.resendDelay duration
Minimum amount of time to wait before resending an alert to notifier
-rule.evalDelay duration
Adjust the `time` parameter of rule evaluation requests to compensate intentional query delay from datasource. Normally should equal to `-search.latencyOffset`(a cmd-line flag configured for VictoriaMetrics single-node or vmselect). (default 30s)
See more details [here](https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5155).
-rule.templates array
Path or glob pattern to location with go template definitions
for rules annotations templating. Flag can be specified multiple times.

View file

@ -19,11 +19,14 @@ import (
// Group contains list of Rules grouped into
// entity with one name and evaluation interval
type Group struct {
Type Type `yaml:"type,omitempty"`
File string
Name string `yaml:"name"`
Interval *promutils.Duration `yaml:"interval,omitempty"`
EvalOffset *promutils.Duration `yaml:"eval_offset,omitempty"`
Type Type `yaml:"type,omitempty"`
File string
Name string `yaml:"name"`
Interval *promutils.Duration `yaml:"interval,omitempty"`
EvalOffset *promutils.Duration `yaml:"eval_offset,omitempty"`
// EvalDelay will adjust the `time` parameter of rule evaluation requests to compensate intentional query delay from datasource.
// see https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5155
EvalDelay *promutils.Duration `yaml:"eval_delay,omitempty"`
Limit int `yaml:"limit,omitempty"`
Rules []Rule `yaml:"rules"`
Concurrency int `yaml:"concurrency"`

View file

@ -43,7 +43,8 @@ var (
oauth2TokenURL = flag.String("datasource.oauth2.tokenUrl", "", "Optional OAuth2 tokenURL to use for -datasource.url.")
oauth2Scopes = flag.String("datasource.oauth2.scopes", "", "Optional OAuth2 scopes to use for -datasource.url. Scopes must be delimited by ';'")
lookBack = flag.Duration("datasource.lookback", 0, `Lookback defines how far into the past to look when evaluating queries. For example, if the datasource.lookback=5m then param "time" with value now()-5m will be added to every query.`)
lookBack = flag.Duration("datasource.lookback", 0, `Will be deprecated soon, please adjust "-search.latencyOffset" at datasource side or specify "latency_offset" in rule group's params.`+
`Lookback defines how far into the past to look when evaluating queries. For example, if the datasource.lookback=5m then param "time" with value now()-5m will be added to every query.`)
queryStep = flag.Duration("datasource.queryStep", 5*time.Minute, "How far a value can fallback to when evaluating queries. "+
"For example, if -datasource.queryStep=15s then param \"step\" with value \"15s\" will be added to every query. "+
"If set to 0, rule's evaluation interval will be used instead.")
@ -85,6 +86,9 @@ func Init(extraParams url.Values) (QuerierBuilder, error) {
if !*queryTimeAlignment {
logger.Warnf("flag `datasource.queryTimeAlignment` is deprecated and will be removed in next releases, please use `eval_alignment` in rule group instead")
}
if *lookBack != 0 {
logger.Warnf("flag `datasource.lookback` will be deprecated soon, please adjust `-search.latencyOffset`(a cmd-line flag configured for VictoriaMetrics single-node or vmselect) or specify `latency_offset` in rule group's params. See https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5155.")
}
tr, err := utils.Transport(*addr, *tlsCertFile, *tlsKeyFile, *tlsCAFile, *tlsServerName, *tlsInsecureSkipVerify)
if err != nil {

View file

@ -32,6 +32,8 @@ var (
resendDelay = flag.Duration("rule.resendDelay", 0, "MiniMum amount of time to wait before resending an alert to notifier")
maxResolveDuration = flag.Duration("rule.maxResolveDuration", 0, "Limits the maxiMum duration for automatic alert expiration, "+
"which by default is 4 times evaluationInterval of the parent ")
evalDelay = flag.Duration("rule.evalDelay", 30*time.Second, "Adjust the `time` parameter of rule evaluation requests to compensate intentional data query delay from datasource."+
"Normally should equal to `-search.latencyOffset`(a cmd-line flag configured for VictoriaMetrics single-node or vmselect)")
disableAlertGroupLabel = flag.Bool("disableAlertgroupLabel", false, "Whether to disable adding group's Name as label to generated alerts and time series.")
remoteReadLookBack = flag.Duration("remoteRead.lookback", time.Hour, "Lookback defines how far to look into past for alerts timeseries."+
" For example, if lookback=1h then range from now() to now()-1h will be scanned.")
@ -39,13 +41,16 @@ var (
// Group is an entity for grouping rules
type Group struct {
mu sync.RWMutex
Name string
File string
Rules []Rule
Type config.Type
Interval time.Duration
EvalOffset *time.Duration
mu sync.RWMutex
Name string
File string
Rules []Rule
Type config.Type
Interval time.Duration
EvalOffset *time.Duration
// EvalDelay will adjust the `time` parameter of rule evaluation requests to compensate intentional query delay from datasource.
// see https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5155
EvalDelay *time.Duration
Limit int
Concurrency int
Checksum string
@ -139,6 +144,9 @@ func NewGroup(cfg config.Group, qb datasource.QuerierBuilder, defaultInterval ti
if cfg.EvalOffset != nil {
g.EvalOffset = &cfg.EvalOffset.D
}
if cfg.EvalDelay != nil {
g.EvalDelay = &cfg.EvalDelay.D
}
for _, h := range cfg.Headers {
g.Headers[h.Key] = h.Value
}
@ -581,17 +589,24 @@ func (g *Group) adjustReqTimestamp(timestamp time.Time) time.Time {
// to 10:30, to the previous evaluationInterval.
return ts.Add(-g.Interval)
}
// EvalOffset shouldn't interfere with evalAlignment,
// so we return it immediately
// when `eval_offset` is using, ts shouldn't be effect by `eval_alignment` and `eval_delay`
// since it should be always aligned.
return ts
}
if g.evalAlignment == nil || *g.evalAlignment {
// align query time with interval to get similar result with grafana when plotting time series.
// see https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5049
// and https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1232
return timestamp.Truncate(g.Interval)
return timestamp.Truncate(g.Interval).Add(-g.getEvalDelay())
}
return timestamp
return timestamp.Add(-g.getEvalDelay())
}
func (g *Group) getEvalDelay() time.Duration {
if g.EvalDelay != nil {
return *g.EvalDelay
}
return *evalDelay
}
// executor contains group's notify and rw configs

View file

@ -628,6 +628,7 @@ func TestGroupStartDelay(t *testing.T) {
func TestGetPrometheusReqTimestamp(t *testing.T) {
offset := 30 * time.Minute
evalDelay := 1 * time.Minute
disableAlign := false
testCases := []struct {
name string
@ -635,24 +636,24 @@ func TestGetPrometheusReqTimestamp(t *testing.T) {
originTS, expTS string
}{
{
"with query align",
"with query align + default evalDelay",
&Group{
Interval: time.Hour,
},
"2023-08-28T11:11:00+00:00",
"2023-08-28T11:00:00+00:00",
"2023-08-28T10:59:30+00:00",
},
{
"without query align",
"without query align + default evalDelay",
&Group{
Interval: time.Hour,
evalAlignment: &disableAlign,
},
"2023-08-28T11:11:00+00:00",
"2023-08-28T11:11:00+00:00",
"2023-08-28T11:10:30+00:00",
},
{
"with eval_offset, find previous offset point",
"with eval_offset, find previous offset point + default evalDelay",
&Group{
EvalOffset: &offset,
Interval: time.Hour,
@ -661,7 +662,7 @@ func TestGetPrometheusReqTimestamp(t *testing.T) {
"2023-08-28T10:30:00+00:00",
},
{
"with eval_offset",
"with eval_offset + default evalDelay",
&Group{
EvalOffset: &offset,
Interval: time.Hour,
@ -669,6 +670,25 @@ func TestGetPrometheusReqTimestamp(t *testing.T) {
"2023-08-28T11:41:00+00:00",
"2023-08-28T11:30:00+00:00",
},
{
"with eval_delay",
&Group{
EvalDelay: &evalDelay,
Interval: time.Hour,
},
"2023-08-28T11:41:00+00:00",
"2023-08-28T10:59:00+00:00",
},
{
"disable alignment with eval_delay",
&Group{
EvalDelay: &evalDelay,
Interval: time.Hour,
evalAlignment: &disableAlign,
},
"2023-08-28T11:41:00+00:00",
"2023-08-28T11:40:00+00:00",
},
}
for _, tc := range testCases {
originT, _ := time.Parse(time.RFC3339, tc.originTS)

View file

@ -29,6 +29,7 @@ The sandbox cluster installation is running under the constant load generated by
## tip
**vmalert's cmd-line flag `datasource.queryTimeAlignment` was deprecated and will have no effect anymore. It will be completely removed in next releases. See [this issue](https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5049) and more detailed changes below.**
**vmalert's cmd-line flag `datasource.lookback` will be deprecated soon, please use `-search.latencyOffset` in datasource or override it by adding `latency_offset` in rule group's params. It will have no effect in next release and be removed in future releases. See [this issue](https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5155).**
* SECURITY: upgrade Go builder from Go1.21.1 to Go1.21.3. See [the list of issues addressed in Go1.21.2](https://github.com/golang/go/issues?q=milestone%3AGo1.21.2+label%3ACherryPickApproved) and [the list of issues addressed in Go1.21.3](https://github.com/golang/go/issues?q=milestone%3AGo1.21.3+label%3ACherryPickApproved).
@ -37,6 +38,7 @@ The sandbox cluster installation is running under the constant load generated by
* FEATURE: [vmalert](https://docs.victoriametrics.com/vmalert.html): add `eval_alignment` attribute for [Groups](https://docs.victoriametrics.com/vmalert.html#groups), it will align group query requests timestamp with interval like `datasource.queryTimeAlignment` did.
This also means that `datasource.queryTimeAlignment` command-line flag becomes deprecated now and will have no effect if configured. If `datasource.queryTimeAlignment` was set to `false` before, then `eval_alignment` has to be set to `false` explicitly under group.
See [this issue](https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5049).
* FEATURE: [vmalert](https://docs.victoriametrics.com/vmalert.html): add `-rule.evalDelay` flag and `eval_delay` attribute for [Groups](https://docs.victoriametrics.com/vmalert.html#groups). The new flag and param can be used to adjust the `time` parameter for rule evaluation requests to match intentional query delay from the datasource. See [this issue](https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5155).
* FEATURE: [vmagent](https://docs.victoriametrics.com/vmagent.html): support data ingestion from [NewRelic infrastructure agent](https://docs.newrelic.com/docs/infrastructure/install-infrastructure-agent). See [these docs](https://docs.victoriametrics.com/Single-server-VictoriaMetrics.html#how-to-send-data-from-newrelic-agent), [this feature request](https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3520) and [this pull request](https://github.com/VictoriaMetrics/VictoriaMetrics/pull/4712).
* FEATURE: [vmbackup](https://docs.victoriametrics.com/vmbackup.html): add `-filestream.disableFadvise` command-line flag, which can be used for disabling `fadvise` syscall during backup upload to the remote storage. By default `vmbackup` uses `fadvise` syscall in order to prevent from eviction of recently accessed data from the [OS page cache](https://en.wikipedia.org/wiki/Page_cache) when backing up large files. Sometimes the `fadvise` syscall may take significant amounts of CPU when the backup is performed with large value of `-concurrency` command-line flag on systems with big number of CPU cores. In this case it is better to manually disable `fadvise` syscall by passing `-filestream.disableFadvise` command-line flag to `vmbackup`. See [this pull request](https://github.com/VictoriaMetrics/VictoriaMetrics/pull/5120) for details.
* FEATURE: [vmbackup](https://docs.victoriametrics.com/vmbackup.html): add `-deleteAllObjectVersions` command-line flag, which can be used for forcing removal of all object versions in remote object storage. See [this](https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5121) issue and [these docs](https://docs.victoriametrics.com/vmbackup.html#permanent-deletion-of-objects-in-s3-compatible-storages) for the details.

View file

@ -130,6 +130,12 @@ name: <string>
# `eval_offset` can't be bigger than `interval`.
[ eval_offset: <duration> ]
# Optional
# Adjust the `time` parameter of group evaluation requests to compensate intentional query delay from datasource.
# By default, use flag `-rule.evalDelay` equal to `-search.latencyOffset` (a cmd-line flag configured for VictoriaMetrics single-node or vmselect). But if group has `latency_offset` param which value differs from `-search.latencyOffset`, set `eval_delay` equal to `latency_offset`.
# See https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5155.
[ eval_delay: <duration> ]
# Limit the number of alerts an alerting rule and series a recording
# rule can produce. 0 is no limit.
[ limit: <int> | default = 0 ]
@ -805,9 +811,7 @@ Try the following recommendations to reduce the chance of hitting the data delay
[time series resolution](https://docs.victoriametrics.com/keyConcepts.html#time-series-resolution). For example,
if expression is `rate(my_metric[2m]) > 0` then ensure that `my_metric` resolution is at least `1m` or better `30s`.
If you use VictoriaMetrics as datasource, `[duration]` can be omitted and VictoriaMetrics will adjust it automatically.
* If you know in advance, that data in datasource is delayed - try changing vmalerts `-datasource.lookback`
command-line flag to add a time shift for evaluations. Or extend `[duration]` to tolerate the delay.
For example, `max_over_time(errors_total[10m]) > 0` will be active even if there is no data in datasource for last `9m`.
* Extend `[duration]` in expr to help tolerate the delay. For example, `max_over_time(errors_total[10m]) > 0` will be active even if there is no data in datasource for last `9m`.
* If [time series resolution](https://docs.victoriametrics.com/keyConcepts.html#time-series-resolution)
in datasource is inconsistent or `>=5min` - try changing vmalerts `-datasource.queryStep` command-line flag to specify
how far search query can lookback for the recent datapoint. The recommendation is to have the step
@ -816,8 +820,10 @@ at least two times bigger than the resolution.
> Please note, data delay is inevitable in distributed systems. And it is better to account for it instead of ignoring.
By default, recently written samples to VictoriaMetrics aren't visible for queries for up to 30s
(see `-search.latencyOffset` command-line flag at vmselect). Such delay is needed to eliminate risk of incomplete
(see `-search.latencyOffset` command-line flag at vmselect, and it can be overridden by adding `latency_offset` to group's params). Such delay is needed to eliminate risk of incomplete
data on the moment of querying, since metrics collectors won't be able to deliver the data in time.
To compensate the latency in timestamps for produced evaluation results, `-rule.evalDelay` is also set to 30s by default.
If you changed the `-search.latencyOffset`(cmd-line flag configured for VictoriaMetrics single-node or vmselect) value and observed a delay in timestamps for produced evaluation results, try changing `-rule.evalDelay` equal to `-search.latencyOffset`.
### Alerts state
@ -1302,6 +1308,9 @@ The shortlist of configuration flags is the following:
Limits the maximum duration for automatic alert expiration, which by default is 4 times evaluationInterval of the parent group.
-rule.resendDelay duration
Minimum amount of time to wait before resending an alert to notifier
-rule.evalDelay duration
Adjust the `time` parameter of rule evaluation requests to compensate intentional query delay from datasource. Normally should equal to `-search.latencyOffset`(a cmd-line flag configured for VictoriaMetrics single-node or vmselect). (default 30s)
See more details [here](https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5155).
-rule.templates array
Path or glob pattern to location with go template definitions
for rules annotations templating. Flag can be specified multiple times.