vmalert: allow configuring the default number of stored rule's update states (#3556)

Allow configuring the default number of stored rule's update states in memory
via the global `-rule.updateEntriesLimit` command-line flag or per rule via the rule's
`update_entries_limit` configuration param.

Signed-off-by: hagen1778 <roman@victoriametrics.com>
Roman Khavronenko 2022-12-29 12:36:44 +01:00 committed by Aliaksandr Valialkin
parent f27bb19213
commit 5cf2998af8
GPG key ID: A72BEC6CD3D0DED1
19 changed files with 144 additions and 33 deletions

@ -191,6 +191,11 @@ expr: <string>
# Is applicable to alerting rules only. # Is applicable to alerting rules only.
[ debug: <bool> | default = false ] [ debug: <bool> | default = false ]
# Defines the number of rule's update entries stored in memory
# and available for view on rule's Details page.
# Overrides the `-rule.updateEntriesLimit` value for this specific rule.
[ update_entries_limit: <integer> | default 0 ]
# Labels to add or overwrite for each alert. # Labels to add or overwrite for each alert.
labels: labels:
[ <labelname>: <tmpl_string> ] [ <labelname>: <tmpl_string> ]
@ -319,6 +324,12 @@ expr: <string>
# Labels to add or overwrite before storing the result. # Labels to add or overwrite before storing the result.
labels: labels:
[ <labelname>: <labelvalue> ] [ <labelname>: <labelvalue> ]
# Defines the number of rule's update entries stored in memory
# and available for view on rule's Details page.
# Overrides the `-rule.updateEntriesLimit` value for this specific rule.
[ update_entries_limit: <integer> | default 0 ]
``` ```
For recording rules to work `-remoteWrite.url` must be specified. For recording rules to work `-remoteWrite.url` must be specified.
@ -695,7 +706,7 @@ may get empty response from datasource and produce empty recording rules or rese
<img alt="vmalert evaluation when data is delayed" src="vmalert_ts_data_delay.gif"> <img alt="vmalert evaluation when data is delayed" src="vmalert_ts_data_delay.gif">
By default recently written samples to VictoriaMetrics aren't visible for queries for up to 30s. By default, recently written samples to VictoriaMetrics aren't visible for queries for up to 30s.
This behavior is controlled by `-search.latencyOffset` command-line flag and the `latency_offset` query arg at `vmselect`. This behavior is controlled by `-search.latencyOffset` command-line flag and the `latency_offset` query arg at `vmselect`.
Usually, this results in a 30s shift for recording rules results. Usually, this results in a 30s shift for recording rules results.
Note that too small value passed to `-search.latencyOffset` or to `latency_offset` query arg may lead to incomplete query results. Note that too small value passed to `-search.latencyOffset` or to `latency_offset` query arg may lead to incomplete query results.
@ -721,8 +732,9 @@ If `-remoteWrite.url` command-line flag is configured, vmalert will persist aler
[vmui](https://docs.victoriametrics.com/Single-server-VictoriaMetrics.html#vmui) or Grafana to track how alerts state [vmui](https://docs.victoriametrics.com/Single-server-VictoriaMetrics.html#vmui) or Grafana to track how alerts state
changed in time. changed in time.
vmalert also stores last N state updates for each rule. To check updates, click on `Details` link next to rule's name vmalert stores last `-rule.updateEntriesLimit` (or `update_entries_limit` [per-rule config](https://docs.victoriametrics.com/vmalert.html#alerting-rules))
on `/vmalert/groups` page and check the `Last updates` section: state updates for each rule. To check updates, click on `Details` link next to rule's name on `/vmalert/groups` page
and check the `Last updates` section:
<img alt="vmalert state" src="vmalert_state.png"> <img alt="vmalert state" src="vmalert_state.png">
@ -731,7 +743,7 @@ HTTP request sent by vmalert to the `-datasource.url` during evaluation. If spec
no samples returned and curl command returns data - then it is very likely there was no data in datasource on the no samples returned and curl command returns data - then it is very likely there was no data in datasource on the
moment when rule was evaluated. moment when rule was evaluated.
vmalert also alows configuring more detailed logging for specific rule. Just set `debug: true` in rule's configuration vmalert allows configuring more detailed logging for specific alerting rule. Just set `debug: true` in rule's configuration
and vmalert will start printing additional log messages: and vmalert will start printing additional log messages:
```terminal ```terminal
2022-09-15T13:35:41.155Z DEBUG rule "TestGroup":"Conns" (2601299393013563564) at 2022-09-15T15:35:41+02:00: query returned 0 samples (elapsed: 5.896041ms) 2022-09-15T13:35:41.155Z DEBUG rule "TestGroup":"Conns" (2601299393013563564) at 2022-09-15T15:35:41+02:00: query returned 0 samples (elapsed: 5.896041ms)
@ -890,6 +902,8 @@ The shortlist of configuration flags is the following:
Per-second limit on the number of ERROR messages. If more than the given number of errors are emitted per second, the remaining errors are suppressed. Zero values disable the rate limit Per-second limit on the number of ERROR messages. If more than the given number of errors are emitted per second, the remaining errors are suppressed. Zero values disable the rate limit
-loggerFormat string -loggerFormat string
Format for logs. Possible values: default, json (default "default") Format for logs. Possible values: default, json (default "default")
-loggerJSONFields string
Allows renaming fields in JSON formatted logs. Example: "ts:timestamp,msg:message" renames "ts" to "timestamp" and "msg" to "message". Supported fields: ts, level, caller, msg
-loggerLevel string -loggerLevel string
Minimum level of errors to log. Possible values: INFO, WARN, ERROR, FATAL, PANIC (default "INFO") Minimum level of errors to log. Possible values: INFO, WARN, ERROR, FATAL, PANIC (default "INFO")
-loggerOutput string -loggerOutput string
@ -1092,6 +1106,8 @@ The shortlist of configuration flags is the following:
Interval for checking for changes in '-rule' files. By default the checking is disabled. Send SIGHUP signal in order to force config check for changes. DEPRECATED - see '-configCheckInterval' instead Interval for checking for changes in '-rule' files. By default the checking is disabled. Send SIGHUP signal in order to force config check for changes. DEPRECATED - see '-configCheckInterval' instead
-rule.maxResolveDuration duration -rule.maxResolveDuration duration
Limits the maximum duration for automatic alert expiration, which is by default equal to 3 evaluation intervals of the parent group. Limits the maximum duration for automatic alert expiration, which is by default equal to 3 evaluation intervals of the parent group.
-rule.updateEntriesLimit int
Defines the max number of rule's state updates. (default 20)
-rule.resendDelay duration -rule.resendDelay duration
Minimum amount of time to wait before resending an alert to notifier Minimum amount of time to wait before resending an alert to notifier
-rule.templates array -rule.templates array

@ -74,10 +74,15 @@ func newAlertingRule(qb datasource.QuerierBuilder, group *Group, cfg config.Rule
Debug: cfg.Debug, Debug: cfg.Debug,
}), }),
alerts: make(map[uint64]*notifier.Alert), alerts: make(map[uint64]*notifier.Alert),
state: newRuleState(),
metrics: &alertingRuleMetrics{}, metrics: &alertingRuleMetrics{},
} }
if cfg.UpdateEntriesLimit != nil {
ar.state = newRuleState(*cfg.UpdateEntriesLimit)
} else {
ar.state = newRuleState(*ruleUpdateEntriesLimit)
}
labels := fmt.Sprintf(`alertname=%q, group=%q, id="%d"`, ar.Name, group.Name, ar.ID()) labels := fmt.Sprintf(`alertname=%q, group=%q, id="%d"`, ar.Name, group.Name, ar.ID())
ar.metrics.pending = utils.GetOrCreateGauge(fmt.Sprintf(`vmalert_alerts_pending{%s}`, labels), ar.metrics.pending = utils.GetOrCreateGauge(fmt.Sprintf(`vmalert_alerts_pending{%s}`, labels),
func() float64 { func() float64 {
@ -491,6 +496,7 @@ func (ar *AlertingRule) ToAPI() APIRule {
State: "inactive", State: "inactive",
Alerts: ar.AlertsToAPI(), Alerts: ar.AlertsToAPI(),
LastSamples: lastState.samples, LastSamples: lastState.samples,
MaxUpdates: ar.state.size(),
Updates: ar.state.getAll(), Updates: ar.state.getAll(),
// encode as strings to avoid rounding in JSON // encode as strings to avoid rounding in JSON
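
The choice between the per-rule value and the global flag in `newAlertingRule` can be illustrated in isolation. The sketch below is not code from this commit: `resolveEntriesLimit` is a hypothetical helper, `perRule` stands in for `cfg.UpdateEntriesLimit`, and `global` stands in for the value of `-rule.updateEntriesLimit`.

```go
package main

import "fmt"

// resolveEntriesLimit is a hypothetical helper, not part of the commit.
// perRule mirrors config.Rule.UpdateEntriesLimit, which is nil when the
// update_entries_limit key is absent from the rule's YAML.
// global mirrors the value of the -rule.updateEntriesLimit flag.
func resolveEntriesLimit(perRule *int, global int) int {
	if perRule != nil {
		return *perRule
	}
	return global
}

func main() {
	override := 40
	disabled := -1
	// No per-rule value: the global limit applies.
	fmt.Println(resolveEntriesLimit(nil, 20))
	// A per-rule value takes precedence over the global limit.
	fmt.Println(resolveEntriesLimit(&override, 20))
	// Values below 1 are passed through; newRuleState treats them as "disabled".
	fmt.Println(resolveEntriesLimit(&disabled, 20))
}
```

Using a pointer for the per-rule field is what lets an explicit `0` or `-1` be told apart from an omitted key.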

@ -709,7 +709,6 @@ func TestAlertingRule_Template(t *testing.T) {
"summary": `{{ $labels.alertname }}: Too high connection number for "{{ $labels.instance }}"`, "summary": `{{ $labels.alertname }}: Too high connection number for "{{ $labels.instance }}"`,
}, },
alerts: make(map[uint64]*notifier.Alert), alerts: make(map[uint64]*notifier.Alert),
state: newRuleState(),
}, },
[]datasource.Metric{ []datasource.Metric{
metricWithValueAndLabels(t, 1, "instance", "foo"), metricWithValueAndLabels(t, 1, "instance", "foo"),
@ -749,7 +748,6 @@ func TestAlertingRule_Template(t *testing.T) {
"description": `{{ $labels.alertname}}: It is {{ $value }} connections for "{{ $labels.instance }}"`, "description": `{{ $labels.alertname}}: It is {{ $value }} connections for "{{ $labels.instance }}"`,
}, },
alerts: make(map[uint64]*notifier.Alert), alerts: make(map[uint64]*notifier.Alert),
state: newRuleState(),
}, },
[]datasource.Metric{ []datasource.Metric{
metricWithValueAndLabels(t, 2, "__name__", "first", "instance", "foo", alertNameLabel, "override"), metricWithValueAndLabels(t, 2, "__name__", "first", "instance", "foo", alertNameLabel, "override"),
@ -789,7 +787,6 @@ func TestAlertingRule_Template(t *testing.T) {
"summary": `Alert "{{ $labels.alertname }}({{ $labels.alertgroup }})" for instance {{ $labels.instance }}`, "summary": `Alert "{{ $labels.alertname }}({{ $labels.alertgroup }})" for instance {{ $labels.instance }}`,
}, },
alerts: make(map[uint64]*notifier.Alert), alerts: make(map[uint64]*notifier.Alert),
state: newRuleState(),
}, },
[]datasource.Metric{ []datasource.Metric{
metricWithValueAndLabels(t, 1, metricWithValueAndLabels(t, 1,
@ -820,6 +817,7 @@ func TestAlertingRule_Template(t *testing.T) {
fq := &fakeQuerier{} fq := &fakeQuerier{}
tc.rule.GroupID = fakeGroup.ID() tc.rule.GroupID = fakeGroup.ID()
tc.rule.q = fq tc.rule.q = fq
tc.rule.state = newRuleState(10)
fq.add(tc.metrics...) fq.add(tc.metrics...)
if _, err := tc.rule.Exec(context.TODO(), time.Now(), 0); err != nil { if _, err := tc.rule.Exec(context.TODO(), time.Now(), 0); err != nil {
t.Fatalf("unexpected err: %s", err) t.Fatalf("unexpected err: %s", err)
@ -936,6 +934,6 @@ func newTestAlertingRule(name string, waitFor time.Duration) *AlertingRule {
For: waitFor, For: waitFor,
EvalInterval: waitFor, EvalInterval: waitFor,
alerts: make(map[uint64]*notifier.Alert), alerts: make(map[uint64]*notifier.Alert),
state: newRuleState(), state: newRuleState(10),
} }
} }

@ -114,6 +114,9 @@ type Rule struct {
Labels map[string]string `yaml:"labels,omitempty"` Labels map[string]string `yaml:"labels,omitempty"`
Annotations map[string]string `yaml:"annotations,omitempty"` Annotations map[string]string `yaml:"annotations,omitempty"`
Debug bool `yaml:"debug,omitempty"` Debug bool `yaml:"debug,omitempty"`
// UpdateEntriesLimit defines max number of rule's state updates stored in memory.
// Overrides `-rule.updateEntriesLimit`.
UpdateEntriesLimit *int `yaml:"update_entries_limit,omitempty"`
// Catches all undefined fields and must be empty after parsing. // Catches all undefined fields and must be empty after parsing.
XXX map[string]interface{} `yaml:",inline"` XXX map[string]interface{} `yaml:",inline"`

@ -550,6 +550,20 @@ rules:
- alert: foo - alert: foo
expr: sum by(job) (up == 1) expr: sum by(job) (up == 1)
debug: true debug: true
`)
})
t.Run("`update_entries_limit` change", func(t *testing.T) {
f(t, `
name: TestGroup
rules:
- alert: foo
expr: sum by(job) (up == 1)
`, `
name: TestGroup
rules:
- alert: foo
expr: sum by(job) (up == 1)
update_entries_limit: 33
`) `)
}) })
} }

@ -12,6 +12,7 @@ groups:
expr: vm_tcplistener_conns > 0 expr: vm_tcplistener_conns > 0
for: 3m for: 3m
debug: true debug: true
update_entries_limit: 40
annotations: annotations:
labels: "Available labels: {{ $labels }}" labels: "Available labels: {{ $labels }}"
summary: Too high connection number for {{ $labels.instance }} summary: Too high connection number for {{ $labels.instance }}
@ -20,6 +21,7 @@ groups:
{{ end }} {{ end }}
description: "It is {{ $value }} connections for {{$labels.instance}}" description: "It is {{ $value }} connections for {{$labels.instance}}"
- alert: ExampleAlertAlwaysFiring - alert: ExampleAlertAlwaysFiring
update_entries_limit: -1
expr: sum by(job) expr: sum by(job)
(up == 1) (up == 1)
labels: labels:

@ -7,6 +7,7 @@ groups:
- alert: Conns - alert: Conns
expr: filterSeries(sumSeries(host.receiver.interface.cons),'last','>', 500) expr: filterSeries(sumSeries(host.receiver.interface.cons),'last','>', 500)
for: 3m for: 3m
annotations: annotations:
summary: Too high connection number for {{$labels.instance}} summary: Too high connection number for {{$labels.instance}}
description: "It is {{ $value }} connections for {{$labels.instance}}" description: "It is {{ $value }} connections for {{$labels.instance}}"

@ -460,7 +460,7 @@ func TestFaultyRW(t *testing.T) {
r := &RecordingRule{ r := &RecordingRule{
Name: "test", Name: "test",
state: newRuleState(), state: newRuleState(10),
q: fq, q: fq,
} }

@ -56,7 +56,9 @@ absolute path to all .tpl files in root.`)
validateExpressions = flag.Bool("rule.validateExpressions", true, "Whether to validate rules expressions via MetricsQL engine") validateExpressions = flag.Bool("rule.validateExpressions", true, "Whether to validate rules expressions via MetricsQL engine")
maxResolveDuration = flag.Duration("rule.maxResolveDuration", 0, "Limits the maximum duration for automatic alert expiration, "+ maxResolveDuration = flag.Duration("rule.maxResolveDuration", 0, "Limits the maximum duration for automatic alert expiration, "+
"which is by default equal to 3 evaluation intervals of the parent group.") "which is by default equal to 3 evaluation intervals of the parent group.")
resendDelay = flag.Duration("rule.resendDelay", 0, "Minimum amount of time to wait before resending an alert to notifier") resendDelay = flag.Duration("rule.resendDelay", 0, "Minimum amount of time to wait before resending an alert to notifier")
ruleUpdateEntriesLimit = flag.Int("rule.updateEntriesLimit", 20, "Defines the max number of rule's state updates stored in-memory. "+
"Rule's updates are available on rule's Details page and are used for debugging purposes. The number of stored updates can be overridden per rule via update_entries_limit param.")
externalURL = flag.String("external.url", "", "External URL is used as alert's source for sent alerts to the notifier") externalURL = flag.String("external.url", "", "External URL is used as alert's source for sent alerts to the notifier")
externalAlertSource = flag.String("external.alert.source", "", `External Alert Source allows to override the Source link for alerts sent to AlertManager `+ externalAlertSource = flag.String("external.alert.source", "", `External Alert Source allows to override the Source link for alerts sent to AlertManager `+
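
As a rough illustration of how the new flag behaves on the command line, the sketch below registers a flag with the same name and default on a private `flag.FlagSet`. The `parseLimit` helper and the `vmalert-sketch` set name are assumptions made for the example; vmalert itself registers the flag on the global flag set as shown in the hunk above.

```go
package main

import (
	"flag"
	"fmt"
)

// parseLimit is a hypothetical helper. It registers a flag with the same
// name and default as the commit does, but on a private FlagSet so that
// the sketch does not touch the process-wide flag state.
func parseLimit(args []string) (int, error) {
	fs := flag.NewFlagSet("vmalert-sketch", flag.ContinueOnError)
	limit := fs.Int("rule.updateEntriesLimit", 20,
		"max number of rule's state updates stored in memory")
	if err := fs.Parse(args); err != nil {
		return 0, err
	}
	return *limit, nil
}

func main() {
	def, err := parseLimit(nil)
	if err != nil {
		panic(err)
	}
	custom, err := parseLimit([]string{"-rule.updateEntriesLimit=5"})
	if err != nil {
		panic(err)
	}
	fmt.Println(def, custom)
}
```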

@ -58,7 +58,6 @@ func newRecordingRule(qb datasource.QuerierBuilder, group *Group, cfg config.Rul
Labels: cfg.Labels, Labels: cfg.Labels,
GroupID: group.ID(), GroupID: group.ID(),
metrics: &recordingRuleMetrics{}, metrics: &recordingRuleMetrics{},
state: newRuleState(),
q: qb.BuildWithParams(datasource.QuerierParams{ q: qb.BuildWithParams(datasource.QuerierParams{
DataSourceType: group.Type.String(), DataSourceType: group.Type.String(),
EvaluationInterval: group.Interval, EvaluationInterval: group.Interval,
@ -67,6 +66,12 @@ func newRecordingRule(qb datasource.QuerierBuilder, group *Group, cfg config.Rul
}), }),
} }
if cfg.UpdateEntriesLimit != nil {
rr.state = newRuleState(*cfg.UpdateEntriesLimit)
} else {
rr.state = newRuleState(*ruleUpdateEntriesLimit)
}
labels := fmt.Sprintf(`recording=%q, group=%q, id="%d"`, rr.Name, group.Name, rr.ID()) labels := fmt.Sprintf(`recording=%q, group=%q, id="%d"`, rr.Name, group.Name, rr.ID())
rr.metrics.errors = utils.GetOrCreateGauge(fmt.Sprintf(`vmalert_recording_rules_error{%s}`, labels), rr.metrics.errors = utils.GetOrCreateGauge(fmt.Sprintf(`vmalert_recording_rules_error{%s}`, labels),
func() float64 { func() float64 {
@ -212,6 +217,7 @@ func (rr *RecordingRule) ToAPI() APIRule {
EvaluationTime: lastState.duration.Seconds(), EvaluationTime: lastState.duration.Seconds(),
Health: "ok", Health: "ok",
LastSamples: lastState.samples, LastSamples: lastState.samples,
MaxUpdates: rr.state.size(),
Updates: rr.state.getAll(), Updates: rr.state.getAll(),
// encode as strings to avoid rounding // encode as strings to avoid rounding

@ -19,7 +19,7 @@ func TestRecordingRule_Exec(t *testing.T) {
expTS []prompbmarshal.TimeSeries expTS []prompbmarshal.TimeSeries
}{ }{
{ {
&RecordingRule{Name: "foo", state: newRuleState()}, &RecordingRule{Name: "foo"},
[]datasource.Metric{metricWithValueAndLabels(t, 10, []datasource.Metric{metricWithValueAndLabels(t, 10,
"__name__", "bar", "__name__", "bar",
)}, )},
@ -30,7 +30,7 @@ func TestRecordingRule_Exec(t *testing.T) {
}, },
}, },
{ {
&RecordingRule{Name: "foobarbaz", state: newRuleState()}, &RecordingRule{Name: "foobarbaz"},
[]datasource.Metric{ []datasource.Metric{
metricWithValueAndLabels(t, 1, "__name__", "foo", "job", "foo"), metricWithValueAndLabels(t, 1, "__name__", "foo", "job", "foo"),
metricWithValueAndLabels(t, 2, "__name__", "bar", "job", "bar"), metricWithValueAndLabels(t, 2, "__name__", "bar", "job", "bar"),
@ -53,8 +53,7 @@ func TestRecordingRule_Exec(t *testing.T) {
}, },
{ {
&RecordingRule{ &RecordingRule{
Name: "job:foo", Name: "job:foo",
state: newRuleState(),
Labels: map[string]string{ Labels: map[string]string{
"source": "test", "source": "test",
}}, }},
@ -80,6 +79,7 @@ func TestRecordingRule_Exec(t *testing.T) {
fq := &fakeQuerier{} fq := &fakeQuerier{}
fq.add(tc.metrics...) fq.add(tc.metrics...)
tc.rule.q = fq tc.rule.q = fq
tc.rule.state = newRuleState(10)
tss, err := tc.rule.Exec(context.TODO(), time.Now(), 0) tss, err := tc.rule.Exec(context.TODO(), time.Now(), 0)
if err != nil { if err != nil {
t.Fatalf("unexpected Exec err: %s", err) t.Fatalf("unexpected Exec err: %s", err)
@ -198,7 +198,7 @@ func TestRecordingRuleLimit(t *testing.T) {
metricWithValuesAndLabels(t, []float64{2, 3}, "__name__", "bar", "job", "bar"), metricWithValuesAndLabels(t, []float64{2, 3}, "__name__", "bar", "job", "bar"),
metricWithValuesAndLabels(t, []float64{4, 5, 6}, "__name__", "baz", "job", "baz"), metricWithValuesAndLabels(t, []float64{4, 5, 6}, "__name__", "baz", "job", "baz"),
} }
rule := &RecordingRule{Name: "job:foo", state: newRuleState(), Labels: map[string]string{ rule := &RecordingRule{Name: "job:foo", state: newRuleState(10), Labels: map[string]string{
"source": "test_limit", "source": "test_limit",
}} }}
var err error var err error
@ -216,7 +216,7 @@ func TestRecordingRuleLimit(t *testing.T) {
func TestRecordingRule_ExecNegative(t *testing.T) { func TestRecordingRule_ExecNegative(t *testing.T) {
rr := &RecordingRule{ rr := &RecordingRule{
Name: "job:foo", Name: "job:foo",
state: newRuleState(), state: newRuleState(10),
Labels: map[string]string{ Labels: map[string]string{
"job": "test", "job": "test",
}, },

@ -37,6 +37,8 @@ type ruleState struct {
sync.RWMutex sync.RWMutex
entries []ruleStateEntry entries []ruleStateEntry
cur int cur int
// disabled defines whether ruleState tracks ruleStateEntry
disabled bool
} }
type ruleStateEntry struct { type ruleStateEntry struct {
@ -57,21 +59,36 @@ type ruleStateEntry struct {
curl string curl string
} }
const defaultStateEntriesLimit = 20 func newRuleState(size int) *ruleState {
if size < 1 {
func newRuleState() *ruleState { return &ruleState{disabled: true}
}
return &ruleState{ return &ruleState{
entries: make([]ruleStateEntry, defaultStateEntriesLimit), entries: make([]ruleStateEntry, size),
} }
} }
func (s *ruleState) getLast() ruleStateEntry { func (s *ruleState) getLast() ruleStateEntry {
if s.disabled {
return ruleStateEntry{}
}
s.RLock() s.RLock()
defer s.RUnlock() defer s.RUnlock()
return s.entries[s.cur] return s.entries[s.cur]
} }
func (s *ruleState) size() int {
s.RLock()
defer s.RUnlock()
return len(s.entries)
}
func (s *ruleState) getAll() []ruleStateEntry { func (s *ruleState) getAll() []ruleStateEntry {
if s.disabled {
return nil
}
entries := make([]ruleStateEntry, 0) entries := make([]ruleStateEntry, 0)
s.RLock() s.RLock()
@ -94,6 +111,10 @@ func (s *ruleState) getAll() []ruleStateEntry {
} }
func (s *ruleState) add(e ruleStateEntry) { func (s *ruleState) add(e ruleStateEntry) {
if s.disabled {
return
}
s.Lock() s.Lock()
defer s.Unlock() defer s.Unlock()
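
The hunks above show only part of `ruleState`; the bodies of `add` and `getAll` are elided in the diff. The following is a minimal, self-contained sketch of a fixed-size ring buffer with the same "size below 1 disables tracking" rule. The `entry` and `state` types, the `n` counter, and the newest-first ordering are assumptions for illustration and may differ from the actual implementation.

```go
package main

import (
	"fmt"
	"sync"
)

// entry is a simplified, hypothetical stand-in for ruleStateEntry.
type entry struct{ seq int }

// state is a fixed-size ring buffer guarded by a RWMutex.
type state struct {
	mu       sync.RWMutex
	entries  []entry
	cur      int // index of the most recently written entry
	n        int // number of entries written so far, capped at len(entries)
	disabled bool
}

// newState returns a disabled state when size is below 1,
// so no memory is allocated for entries.
func newState(size int) *state {
	if size < 1 {
		return &state{disabled: true}
	}
	return &state{entries: make([]entry, size)}
}

// add overwrites the oldest slot once the buffer is full.
func (s *state) add(e entry) {
	if s.disabled {
		return
	}
	s.mu.Lock()
	defer s.mu.Unlock()
	s.cur = (s.cur + 1) % len(s.entries)
	s.entries[s.cur] = e
	if s.n < len(s.entries) {
		s.n++
	}
}

// getAll returns the stored entries, newest first.
func (s *state) getAll() []entry {
	if s.disabled {
		return nil
	}
	s.mu.RLock()
	defer s.mu.RUnlock()
	out := make([]entry, 0, s.n)
	for i := 0; i < s.n; i++ {
		idx := (s.cur - i + len(s.entries)) % len(s.entries)
		out = append(out, s.entries[idx])
	}
	return out
}

func main() {
	s := newState(3)
	for i := 1; i <= 5; i++ {
		s.add(entry{seq: i})
	}
	fmt.Println(len(s.getAll()), s.getAll()[0].seq)

	off := newState(-1)
	off.add(entry{seq: 1})
	fmt.Println(len(off.getAll()))
}
```

Checking `disabled` before taking the lock keeps the disabled path free of synchronization cost, which is safe here because the field is set once at construction and never changed.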

@ -6,8 +6,27 @@ import (
"time" "time"
) )
func TestRule_stateDisabled(t *testing.T) {
state := newRuleState(-1)
e := state.getLast()
if !e.at.IsZero() {
t.Fatalf("expected entry to be zero")
}
state.add(ruleStateEntry{at: time.Now()})
// re-read the last entry; otherwise the check below re-tests the stale value
e = state.getLast()
if !e.at.IsZero() {
t.Fatalf("expected entry to be zero")
}
if len(state.getAll()) != 0 {
t.Fatalf("expected for state to have %d entries; got %d",
0, len(state.getAll()),
)
}
}
func TestRule_state(t *testing.T) { func TestRule_state(t *testing.T) {
state := newRuleState() stateEntriesN := 20
state := newRuleState(stateEntriesN)
e := state.getLast() e := state.getLast()
if !e.at.IsZero() { if !e.at.IsZero() {
t.Fatalf("expected entry to be zero") t.Fatalf("expected entry to be zero")
@ -39,7 +58,7 @@ func TestRule_state(t *testing.T) {
} }
var last time.Time var last time.Time
for i := 0; i < defaultStateEntriesLimit*2; i++ { for i := 0; i < stateEntriesN*2; i++ {
last = time.Now() last = time.Now()
state.add(ruleStateEntry{at: last}) state.add(ruleStateEntry{at: last})
} }
@ -50,9 +69,9 @@ func TestRule_state(t *testing.T) {
e.at, last) e.at, last)
} }
if len(state.getAll()) != defaultStateEntriesLimit { if len(state.getAll()) != stateEntriesN {
t.Fatalf("expected for state to have %d entries only; got %d", t.Fatalf("expected for state to have %d entries only; got %d",
defaultStateEntriesLimit, len(state.getAll()), stateEntriesN, len(state.getAll()),
) )
} }
} }
@ -61,7 +80,7 @@ func TestRule_state(t *testing.T) {
// execution of state updates. // execution of state updates.
// Should be executed with -race flag // Should be executed with -race flag
func TestRule_stateConcurrent(t *testing.T) { func TestRule_stateConcurrent(t *testing.T) {
state := newRuleState() state := newRuleState(20)
const workers = 50 const workers = 50
const iterations = 100 const iterations = 100

@ -440,7 +440,7 @@
</div> </div>
<br> <br>
<div class="display-6 pb-3">Last {%d len(rule.Updates) %} updates</span>:</div> <div class="display-6 pb-3">Last {%d len(rule.Updates) %}/{%d rule.MaxUpdates %} updates</span>:</div>
<table class="table table-striped table-hover table-sm"> <table class="table table-striped table-hover table-sm">
<thead> <thead>
<tr> <tr>

@ -1345,6 +1345,10 @@ func StreamRuleDetails(qw422016 *qt422016.Writer, r *http.Request, rule APIRule)
<div class="display-6 pb-3">Last `) <div class="display-6 pb-3">Last `)
//line app/vmalert/web.qtpl:443 //line app/vmalert/web.qtpl:443
qw422016.N().D(len(rule.Updates)) qw422016.N().D(len(rule.Updates))
//line app/vmalert/web.qtpl:443
qw422016.N().S(`/`)
//line app/vmalert/web.qtpl:443
qw422016.N().D(rule.MaxUpdates)
//line app/vmalert/web.qtpl:443 //line app/vmalert/web.qtpl:443
qw422016.N().S(` updates</span>:</div> qw422016.N().S(` updates</span>:</div>
<table class="table table-striped table-hover table-sm"> <table class="table table-striped table-hover table-sm">

@ -17,7 +17,7 @@ func TestHandler(t *testing.T) {
alerts: map[uint64]*notifier.Alert{ alerts: map[uint64]*notifier.Alert{
0: {State: notifier.StateFiring}, 0: {State: notifier.StateFiring},
}, },
state: newRuleState(), state: newRuleState(10),
} }
g := &Group{ g := &Group{
Name: "group", Name: "group",

@ -121,6 +121,8 @@ type APIRule struct {
// GroupID is an unique Group's ID // GroupID is an unique Group's ID
GroupID string `json:"group_id"` GroupID string `json:"group_id"`
// MaxUpdates is the max number of recorded ruleStateEntry objects
MaxUpdates int `json:"max_updates_entries"`
// Updates contains the ordered list of recorded ruleStateEntry objects // Updates contains the ordered list of recorded ruleStateEntry objects
Updates []ruleStateEntry `json:"updates"` Updates []ruleStateEntry `json:"updates"`
} }
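
To show how the new field surfaces in the JSON API response, here is a small sketch using a trimmed copy of the struct. Only the three JSON tags (`group_id`, `max_updates_entries`, `updates`) come from the hunk above; the `apiRule` and `apiUpdate` types and the `samples` field are hypothetical simplifications, since the real `ruleStateEntry` encoding is not shown in this diff.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// apiUpdate is a hypothetical, simplified stand-in for ruleStateEntry.
type apiUpdate struct {
	Samples int `json:"samples"`
}

// apiRule keeps only the three fields visible in the hunk above.
type apiRule struct {
	GroupID    string      `json:"group_id"`
	MaxUpdates int         `json:"max_updates_entries"`
	Updates    []apiUpdate `json:"updates"`
}

// encodeRule marshals the trimmed rule into a JSON string.
func encodeRule(r apiRule) (string, error) {
	b, err := json.Marshal(r)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	out, err := encodeRule(apiRule{
		GroupID:    "1",
		MaxUpdates: 20,
		Updates:    []apiUpdate{{Samples: 3}},
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
}
```

A client can compare the length of `updates` with `max_updates_entries` to tell whether the buffer is already full, which is what the `Last N/M updates` header on the Details page displays.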

@ -18,6 +18,7 @@ The following tip changes can be tested by building VictoriaMetrics components f
* FEATURE: [vmui](https://docs.victoriametrics.com/#vmui): add ability to explore metrics exported by a particular `job` / `instance`. See [these docs](https://docs.victoriametrics.com/#metrics-explorer) and [this feature request](https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3386). * FEATURE: [vmui](https://docs.victoriametrics.com/#vmui): add ability to explore metrics exported by a particular `job` / `instance`. See [these docs](https://docs.victoriametrics.com/#metrics-explorer) and [this feature request](https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3386).
* FEATURE: allow passing partial `RFC3339` date/time to `time`, `start` and `end` query args at [querying APIs](https://docs.victoriametrics.com/#prometheus-querying-api-usage) and [export APIs](https://docs.victoriametrics.com/#how-to-export-time-series). For example, `2022` is equivalent to `2022-01-01T00:00:00Z`, while `2022-01-30T14` is equivalent to `2022-01-30T14:00:00Z`. See [these docs](https://docs.victoriametrics.com/#timestamp-formats). * FEATURE: allow passing partial `RFC3339` date/time to `time`, `start` and `end` query args at [querying APIs](https://docs.victoriametrics.com/#prometheus-querying-api-usage) and [export APIs](https://docs.victoriametrics.com/#how-to-export-time-series). For example, `2022` is equivalent to `2022-01-01T00:00:00Z`, while `2022-01-30T14` is equivalent to `2022-01-30T14:00:00Z`. See [these docs](https://docs.victoriametrics.com/#timestamp-formats).
* FEATURE: [relabeling](https://docs.victoriametrics.com/vmagent.html#relabeling): add support for `keepequal` and `dropequal` relabeling actions, which are supported by Prometheus starting from [v2.41.0](https://github.com/prometheus/prometheus/releases/tag/v2.41.0). These relabeling actions are almost identical to `keep_if_equal` and `drop_if_equal` relabeling actions supported by VictoriaMetrics since `v1.38.0` - see [these docs](https://docs.victoriametrics.com/vmagent.html#relabeling-enhancements) - so it is recommended sticking to `keep_if_equal` and `drop_if_equal` actions instead of switching to `keepequal` and `dropequal`. * FEATURE: [relabeling](https://docs.victoriametrics.com/vmagent.html#relabeling): add support for `keepequal` and `dropequal` relabeling actions, which are supported by Prometheus starting from [v2.41.0](https://github.com/prometheus/prometheus/releases/tag/v2.41.0). These relabeling actions are almost identical to `keep_if_equal` and `drop_if_equal` relabeling actions supported by VictoriaMetrics since `v1.38.0` - see [these docs](https://docs.victoriametrics.com/vmagent.html#relabeling-enhancements) - so it is recommended sticking to `keep_if_equal` and `drop_if_equal` actions instead of switching to `keepequal` and `dropequal`.
* FEATURE: [vmalert](https://docs.victoriametrics.com/vmalert.html): allow configuring the default number of stored rule's update states in memory via global `-rule.updateEntriesLimit` command-line flag or per-rule via rule's `update_entries_limit` configuration param.
* BUGFIX: [vmui](https://docs.victoriametrics.com/#vmui): properly update the `step` value in url after the `step` input field has been manually changed. This allows preserving the proper `step` when copy-n-pasting the url to another instance of web browser. See [this issue](https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3513). * BUGFIX: [vmui](https://docs.victoriametrics.com/#vmui): properly update the `step` value in url after the `step` input field has been manually changed. This allows preserving the proper `step` when copy-n-pasting the url to another instance of web browser. See [this issue](https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3513).

@ -195,6 +195,11 @@ expr: <string>
# Is applicable to alerting rules only. # Is applicable to alerting rules only.
[ debug: <bool> | default = false ] [ debug: <bool> | default = false ]
# Defines the number of rule's update entries stored in memory
# and available for view on rule's Details page.
# Overrides the `-rule.updateEntriesLimit` value for this specific rule.
[ update_entries_limit: <integer> | default 0 ]
# Labels to add or overwrite for each alert. # Labels to add or overwrite for each alert.
labels: labels:
[ <labelname>: <tmpl_string> ] [ <labelname>: <tmpl_string> ]
@ -323,6 +328,12 @@ expr: <string>
# Labels to add or overwrite before storing the result. # Labels to add or overwrite before storing the result.
labels: labels:
[ <labelname>: <labelvalue> ] [ <labelname>: <labelvalue> ]
# Defines the number of rule's update entries stored in memory
# and available for view on rule's Details page.
# Overrides the `-rule.updateEntriesLimit` value for this specific rule.
[ update_entries_limit: <integer> | default 0 ]
``` ```
For recording rules to work `-remoteWrite.url` must be specified. For recording rules to work `-remoteWrite.url` must be specified.
@ -699,7 +710,7 @@ may get empty response from datasource and produce empty recording rules or rese
<img alt="vmalert evaluation when data is delayed" src="vmalert_ts_data_delay.gif"> <img alt="vmalert evaluation when data is delayed" src="vmalert_ts_data_delay.gif">
By default recently written samples to VictoriaMetrics aren't visible for queries for up to 30s. By default, recently written samples to VictoriaMetrics aren't visible for queries for up to 30s.
This behavior is controlled by `-search.latencyOffset` command-line flag and the `latency_offset` query arg at `vmselect`. This behavior is controlled by `-search.latencyOffset` command-line flag and the `latency_offset` query arg at `vmselect`.
Usually, this results in a 30s shift for recording rules results. Usually, this results in a 30s shift for recording rules results.
Note that too small value passed to `-search.latencyOffset` or to `latency_offset` query arg may lead to incomplete query results. Note that too small value passed to `-search.latencyOffset` or to `latency_offset` query arg may lead to incomplete query results.
@ -725,8 +736,9 @@ If `-remoteWrite.url` command-line flag is configured, vmalert will persist aler
[vmui](https://docs.victoriametrics.com/Single-server-VictoriaMetrics.html#vmui) or Grafana to track how alerts state [vmui](https://docs.victoriametrics.com/Single-server-VictoriaMetrics.html#vmui) or Grafana to track how alerts state
changed in time. changed in time.
vmalert also stores last N state updates for each rule. To check updates, click on `Details` link next to rule's name vmalert stores last `-rule.updateEntriesLimit` (or `update_entries_limit` [per-rule config](https://docs.victoriametrics.com/vmalert.html#alerting-rules))
on `/vmalert/groups` page and check the `Last updates` section: state updates for each rule. To check updates, click on `Details` link next to rule's name on `/vmalert/groups` page
and check the `Last updates` section:
<img alt="vmalert state" src="vmalert_state.png"> <img alt="vmalert state" src="vmalert_state.png">
@ -735,7 +747,7 @@ HTTP request sent by vmalert to the `-datasource.url` during evaluation. If spec
no samples returned and curl command returns data - then it is very likely there was no data in datasource on the no samples returned and curl command returns data - then it is very likely there was no data in datasource on the
moment when rule was evaluated. moment when rule was evaluated.
vmalert also alows configuring more detailed logging for specific rule. Just set `debug: true` in rule's configuration vmalert allows configuring more detailed logging for specific alerting rule. Just set `debug: true` in rule's configuration
and vmalert will start printing additional log messages: and vmalert will start printing additional log messages:
```terminal ```terminal
2022-09-15T13:35:41.155Z DEBUG rule "TestGroup":"Conns" (2601299393013563564) at 2022-09-15T15:35:41+02:00: query returned 0 samples (elapsed: 5.896041ms) 2022-09-15T13:35:41.155Z DEBUG rule "TestGroup":"Conns" (2601299393013563564) at 2022-09-15T15:35:41+02:00: query returned 0 samples (elapsed: 5.896041ms)
@ -894,6 +906,8 @@ The shortlist of configuration flags is the following:
Per-second limit on the number of ERROR messages. If more than the given number of errors are emitted per second, the remaining errors are suppressed. Zero values disable the rate limit Per-second limit on the number of ERROR messages. If more than the given number of errors are emitted per second, the remaining errors are suppressed. Zero values disable the rate limit
-loggerFormat string -loggerFormat string
Format for logs. Possible values: default, json (default "default") Format for logs. Possible values: default, json (default "default")
-loggerJSONFields string
Allows renaming fields in JSON formatted logs. Example: "ts:timestamp,msg:message" renames "ts" to "timestamp" and "msg" to "message". Supported fields: ts, level, caller, msg
-loggerLevel string -loggerLevel string
Minimum level of errors to log. Possible values: INFO, WARN, ERROR, FATAL, PANIC (default "INFO") Minimum level of errors to log. Possible values: INFO, WARN, ERROR, FATAL, PANIC (default "INFO")
-loggerOutput string -loggerOutput string
@ -1096,6 +1110,8 @@ The shortlist of configuration flags is the following:
Interval for checking for changes in '-rule' files. By default the checking is disabled. Send SIGHUP signal in order to force config check for changes. DEPRECATED - see '-configCheckInterval' instead Interval for checking for changes in '-rule' files. By default the checking is disabled. Send SIGHUP signal in order to force config check for changes. DEPRECATED - see '-configCheckInterval' instead
-rule.maxResolveDuration duration -rule.maxResolveDuration duration
Limits the maximum duration for automatic alert expiration, which is by default equal to 3 evaluation intervals of the parent group. Limits the maximum duration for automatic alert expiration, which is by default equal to 3 evaluation intervals of the parent group.
-rule.updateEntriesLimit int
Defines the max number of rule's state updates. (default 20)
-rule.resendDelay duration -rule.resendDelay duration
Minimum amount of time to wait before resending an alert to notifier Minimum amount of time to wait before resending an alert to notifier
-rule.templates array -rule.templates array