diff --git a/app/vmalert/README.md b/app/vmalert/README.md index 7d1c4a150..f268cf959 100644 --- a/app/vmalert/README.md +++ b/app/vmalert/README.md @@ -638,6 +638,61 @@ Use the official [Grafana dashboard](https://grafana.com/grafana/dashboards/1495 If you have suggestions for improvements or have found a bug - please open an issue on github or add a review to the dashboard. +## Troubleshooting + +vmalert executes configured rules within certain intervals. It is expected that at the moment when a rule is executed, +the data is already present in the configured `-datasource.url`: + +vmalert expected evaluation + +Usually, troubles start to appear when data in `-datasource.url` is delayed or absent. In such cases, evaluations +may get an empty response from the datasource and produce empty recording rules or reset alerts' state: + +vmalert evaluation when data is delayed + +Try the following recommendations in such cases: + +* Always configure the group's `evaluationInterval` to be greater than or equal to the `scrape_interval` at which metrics +are delivered to the datasource; +* If you know in advance that data in the datasource is delayed - try changing vmalert's `-datasource.lookback` +command-line flag to add a time shift for evaluations; +* If time intervals between datapoints in the datasource are irregular - try changing vmalert's `-datasource.queryStep` +command-line flag to specify how far back a search query can look for the most recent datapoint. By default, this value +is equal to the group's `evaluationInterval`. + +Sometimes, it is not clear why a specific alert fired or didn't fire. It is important to remember that +alerts with `for: 0` fire immediately when their expression becomes true, while alerts with `for > 0` fire only +after multiple consecutive evaluations, and their expression must be true at each of them. If the expression is false +at even one evaluation, the alert's state resets to the initial state. 
+ +If the `-remoteWrite.url` command-line flag is configured, vmalert will persist alerts' state in the form of time series +`ALERTS` and `ALERTS_FOR_STATE` to the specified destination. Such time series can then be queried via +[vmui](https://docs.victoriametrics.com/Single-server-VictoriaMetrics.html#vmui) or Grafana to track how alerts' state +changed over time. + +vmalert also stores the last N state updates for each rule. To check updates, click the `Details` link next to the rule's name +on the `/vmalert/groups` page and check the `Last updates` section: + +vmalert state + +Rows in the section represent ordered rule evaluations and their results. The column `curl` contains an example of +the HTTP request sent by vmalert to the `-datasource.url` during evaluation. If a specific state shows that no samples +were returned but the curl command returns data - then it is very likely that there was no data in the datasource at the +moment when the rule was evaluated. + +vmalert also allows configuring more detailed logging for a specific rule. 
Just set `debug: true` in rule's configuration +and vmalert will start printing additional log messages: +```terminal +2022-09-15T13:35:41.155Z DEBUG rule "TestGroup":"Conns" (2601299393013563564) at 2022-09-15T15:35:41+02:00: query returned 0 samples (elapsed: 5.896041ms) +2022-09-15T13:35:56.149Z DEBUG datasource request: executing POST request with params "denyPartialResponse=true&query=sum%28vm_tcplistener_conns%7Binstance%3D%22localhost%3A8429%22%7D%29+by%28instance%29+%3E+0&step=15s&time=1663248945" +2022-09-15T13:35:56.178Z DEBUG rule "TestGroup":"Conns" (2601299393013563564) at 2022-09-15T15:35:56+02:00: query returned 1 samples (elapsed: 28.368208ms) +2022-09-15T13:35:56.178Z DEBUG datasource request: executing POST request with params "denyPartialResponse=true&query=sum%28vm_tcplistener_conns%7Binstance%3D%22localhost%3A8429%22%7D%29&step=15s&time=1663248945" +2022-09-15T13:35:56.179Z DEBUG rule "TestGroup":"Conns" (2601299393013563564) at 2022-09-15T15:35:56+02:00: alert 10705778000901301787 {alertgroup="TestGroup",alertname="Conns",cluster="east-1",instance="localhost:8429",replica="a"} created in state PENDING +... 
+2022-09-15T13:36:56.153Z DEBUG rule "TestGroup":"Conns" (2601299393013563564) at 2022-09-15T15:36:56+02:00: alert 10705778000901301787 {alertgroup="TestGroup",alertname="Conns",cluster="east-1",instance="localhost:8429",replica="a"} PENDING => FIRING: 1m0s since becoming active at 2022-09-15 15:35:56.126006 +0200 CEST m=+39.384575417 +``` + + ## Profiling `vmalert` provides handlers for collecting the following [Go profiles](https://blog.golang.org/profiling-go-programs): diff --git a/app/vmalert/vmalert_state.png b/app/vmalert/vmalert_state.png new file mode 100644 index 000000000..5bf656b01 Binary files /dev/null and b/app/vmalert/vmalert_state.png differ diff --git a/app/vmalert/vmalert_ts_data_delay.gif b/app/vmalert/vmalert_ts_data_delay.gif new file mode 100644 index 000000000..2da024b46 Binary files /dev/null and b/app/vmalert/vmalert_ts_data_delay.gif differ diff --git a/app/vmalert/vmalert_ts_normal.gif b/app/vmalert/vmalert_ts_normal.gif new file mode 100644 index 000000000..a05c74061 Binary files /dev/null and b/app/vmalert/vmalert_ts_normal.gif differ diff --git a/app/vmalert/web.qtpl b/app/vmalert/web.qtpl index 2d35e5880..7346a2705 100644 --- a/app/vmalert/web.qtpl +++ b/app/vmalert/web.qtpl @@ -384,6 +384,7 @@ + {% if rule.Type == "alerting" %}
@@ -394,6 +395,7 @@
+ {% endif %}
@@ -406,6 +408,7 @@
+ {% if rule.Type == "alerting" %}
@@ -419,6 +422,7 @@
+ {% endif %}
diff --git a/app/vmalert/web.qtpl.go b/app/vmalert/web.qtpl.go index b4648bbbd..08baf5333 100644 --- a/app/vmalert/web.qtpl.go +++ b/app/vmalert/web.qtpl.go @@ -1187,6 +1187,11 @@ func StreamRuleDetails(qw422016 *qt422016.Writer, r *http.Request, rule APIRule)
+ `) +//line app/vmalert/web.qtpl:387 + if rule.Type == "alerting" { +//line app/vmalert/web.qtpl:387 + qw422016.N().S(`
@@ -1194,13 +1199,18 @@ func StreamRuleDetails(qw422016 *qt422016.Writer, r *http.Request, rule APIRule)
`) -//line app/vmalert/web.qtpl:393 - qw422016.E().V(rule.Duration) -//line app/vmalert/web.qtpl:393 - qw422016.N().S(` seconds +//line app/vmalert/web.qtpl:394 + qw422016.E().V(rule.Duration) +//line app/vmalert/web.qtpl:394 + qw422016.N().S(` seconds
+ `) +//line app/vmalert/web.qtpl:398 + } +//line app/vmalert/web.qtpl:398 + qw422016.N().S(`
@@ -1208,27 +1218,32 @@ func StreamRuleDetails(qw422016 *qt422016.Writer, r *http.Request, rule APIRule)
`) -//line app/vmalert/web.qtpl:403 +//line app/vmalert/web.qtpl:405 for _, k := range labelKeys { -//line app/vmalert/web.qtpl:403 +//line app/vmalert/web.qtpl:405 qw422016.N().S(` `) -//line app/vmalert/web.qtpl:404 +//line app/vmalert/web.qtpl:406 qw422016.E().S(k) -//line app/vmalert/web.qtpl:404 +//line app/vmalert/web.qtpl:406 qw422016.N().S(`=`) -//line app/vmalert/web.qtpl:404 +//line app/vmalert/web.qtpl:406 qw422016.E().S(rule.Labels[k]) -//line app/vmalert/web.qtpl:404 +//line app/vmalert/web.qtpl:406 qw422016.N().S(` `) -//line app/vmalert/web.qtpl:405 +//line app/vmalert/web.qtpl:407 } -//line app/vmalert/web.qtpl:405 +//line app/vmalert/web.qtpl:407 qw422016.N().S(`
+ `) +//line app/vmalert/web.qtpl:411 + if rule.Type == "alerting" { +//line app/vmalert/web.qtpl:411 + qw422016.N().S(`
@@ -1236,28 +1251,33 @@ func StreamRuleDetails(qw422016 *qt422016.Writer, r *http.Request, rule APIRule)
`) -//line app/vmalert/web.qtpl:415 - for _, k := range annotationKeys { -//line app/vmalert/web.qtpl:415 - qw422016.N().S(` +//line app/vmalert/web.qtpl:418 + for _, k := range annotationKeys { +//line app/vmalert/web.qtpl:418 + qw422016.N().S(` `) -//line app/vmalert/web.qtpl:416 - qw422016.E().S(k) -//line app/vmalert/web.qtpl:416 - qw422016.N().S(`:
+//line app/vmalert/web.qtpl:419 + qw422016.E().S(k) +//line app/vmalert/web.qtpl:419 + qw422016.N().S(`:

`) -//line app/vmalert/web.qtpl:417 - qw422016.E().S(rule.Annotations[k]) -//line app/vmalert/web.qtpl:417 - qw422016.N().S(`

+//line app/vmalert/web.qtpl:420 + qw422016.E().S(rule.Annotations[k]) +//line app/vmalert/web.qtpl:420 + qw422016.N().S(`

`) -//line app/vmalert/web.qtpl:418 - } -//line app/vmalert/web.qtpl:418 - qw422016.N().S(` +//line app/vmalert/web.qtpl:421 + } +//line app/vmalert/web.qtpl:421 + qw422016.N().S(`
+ `) +//line app/vmalert/web.qtpl:425 + } +//line app/vmalert/web.qtpl:425 + qw422016.N().S(`
@@ -1265,17 +1285,17 @@ func StreamRuleDetails(qw422016 *qt422016.Writer, r *http.Request, rule APIRule)
@@ -1283,9 +1303,9 @@ func StreamRuleDetails(qw422016 *qt422016.Writer, r *http.Request, rule APIRule)
Last `) -//line app/vmalert/web.qtpl:434 +//line app/vmalert/web.qtpl:438 qw422016.N().D(len(rule.Updates)) -//line app/vmalert/web.qtpl:434 +//line app/vmalert/web.qtpl:438 qw422016.N().S(` updates:
@@ -1300,201 +1320,201 @@ func StreamRuleDetails(qw422016 *qt422016.Writer, r *http.Request, rule APIRule) `) -//line app/vmalert/web.qtpl:447 +//line app/vmalert/web.qtpl:451 for _, u := range rule.Updates { -//line app/vmalert/web.qtpl:447 +//line app/vmalert/web.qtpl:451 qw422016.N().S(` `) -//line app/vmalert/web.qtpl:460 +//line app/vmalert/web.qtpl:464 if u.err != nil { -//line app/vmalert/web.qtpl:460 +//line app/vmalert/web.qtpl:464 qw422016.N().S(` - `) -//line app/vmalert/web.qtpl:466 +//line app/vmalert/web.qtpl:470 } -//line app/vmalert/web.qtpl:466 +//line app/vmalert/web.qtpl:470 qw422016.N().S(` `) -//line app/vmalert/web.qtpl:467 +//line app/vmalert/web.qtpl:471 } -//line app/vmalert/web.qtpl:467 +//line app/vmalert/web.qtpl:471 qw422016.N().S(` `) -//line app/vmalert/web.qtpl:469 +//line app/vmalert/web.qtpl:473 tpl.StreamFooter(qw422016, r) -//line app/vmalert/web.qtpl:469 +//line app/vmalert/web.qtpl:473 qw422016.N().S(` `) -//line app/vmalert/web.qtpl:470 +//line app/vmalert/web.qtpl:474 } -//line app/vmalert/web.qtpl:470 +//line app/vmalert/web.qtpl:474 func WriteRuleDetails(qq422016 qtio422016.Writer, r *http.Request, rule APIRule) { -//line app/vmalert/web.qtpl:470 +//line app/vmalert/web.qtpl:474 qw422016 := qt422016.AcquireWriter(qq422016) -//line app/vmalert/web.qtpl:470 +//line app/vmalert/web.qtpl:474 StreamRuleDetails(qw422016, r, rule) -//line app/vmalert/web.qtpl:470 +//line app/vmalert/web.qtpl:474 qt422016.ReleaseWriter(qw422016) -//line app/vmalert/web.qtpl:470 +//line app/vmalert/web.qtpl:474 } -//line app/vmalert/web.qtpl:470 +//line app/vmalert/web.qtpl:474 func RuleDetails(r *http.Request, rule APIRule) string { -//line app/vmalert/web.qtpl:470 +//line app/vmalert/web.qtpl:474 qb422016 := qt422016.AcquireByteBuffer() -//line app/vmalert/web.qtpl:470 +//line app/vmalert/web.qtpl:474 WriteRuleDetails(qb422016, r, rule) -//line app/vmalert/web.qtpl:470 +//line app/vmalert/web.qtpl:474 qs422016 := string(qb422016.B) -//line 
app/vmalert/web.qtpl:470 +//line app/vmalert/web.qtpl:474 qt422016.ReleaseByteBuffer(qb422016) -//line app/vmalert/web.qtpl:470 +//line app/vmalert/web.qtpl:474 return qs422016 -//line app/vmalert/web.qtpl:470 +//line app/vmalert/web.qtpl:474 } -//line app/vmalert/web.qtpl:474 +//line app/vmalert/web.qtpl:478 func streambadgeState(qw422016 *qt422016.Writer, state string) { -//line app/vmalert/web.qtpl:474 +//line app/vmalert/web.qtpl:478 qw422016.N().S(` `) -//line app/vmalert/web.qtpl:476 +//line app/vmalert/web.qtpl:480 badgeClass := "bg-warning text-dark" if state == "firing" { badgeClass = "bg-danger" } -//line app/vmalert/web.qtpl:480 +//line app/vmalert/web.qtpl:484 qw422016.N().S(` `) -//line app/vmalert/web.qtpl:481 +//line app/vmalert/web.qtpl:485 qw422016.E().S(state) -//line app/vmalert/web.qtpl:481 +//line app/vmalert/web.qtpl:485 qw422016.N().S(` `) -//line app/vmalert/web.qtpl:482 +//line app/vmalert/web.qtpl:486 } -//line app/vmalert/web.qtpl:482 +//line app/vmalert/web.qtpl:486 func writebadgeState(qq422016 qtio422016.Writer, state string) { -//line app/vmalert/web.qtpl:482 +//line app/vmalert/web.qtpl:486 qw422016 := qt422016.AcquireWriter(qq422016) -//line app/vmalert/web.qtpl:482 +//line app/vmalert/web.qtpl:486 streambadgeState(qw422016, state) -//line app/vmalert/web.qtpl:482 +//line app/vmalert/web.qtpl:486 qt422016.ReleaseWriter(qw422016) -//line app/vmalert/web.qtpl:482 +//line app/vmalert/web.qtpl:486 } -//line app/vmalert/web.qtpl:482 +//line app/vmalert/web.qtpl:486 func badgeState(state string) string { -//line app/vmalert/web.qtpl:482 +//line app/vmalert/web.qtpl:486 qb422016 := qt422016.AcquireByteBuffer() -//line app/vmalert/web.qtpl:482 +//line app/vmalert/web.qtpl:486 writebadgeState(qb422016, state) -//line app/vmalert/web.qtpl:482 +//line app/vmalert/web.qtpl:486 qs422016 := string(qb422016.B) -//line app/vmalert/web.qtpl:482 +//line app/vmalert/web.qtpl:486 qt422016.ReleaseByteBuffer(qb422016) -//line app/vmalert/web.qtpl:482 
+//line app/vmalert/web.qtpl:486 return qs422016 -//line app/vmalert/web.qtpl:482 +//line app/vmalert/web.qtpl:486 } -//line app/vmalert/web.qtpl:484 +//line app/vmalert/web.qtpl:488 func streambadgeRestored(qw422016 *qt422016.Writer) { -//line app/vmalert/web.qtpl:484 +//line app/vmalert/web.qtpl:488 qw422016.N().S(` restored `) -//line app/vmalert/web.qtpl:486 +//line app/vmalert/web.qtpl:490 } -//line app/vmalert/web.qtpl:486 +//line app/vmalert/web.qtpl:490 func writebadgeRestored(qq422016 qtio422016.Writer) { -//line app/vmalert/web.qtpl:486 +//line app/vmalert/web.qtpl:490 qw422016 := qt422016.AcquireWriter(qq422016) -//line app/vmalert/web.qtpl:486 +//line app/vmalert/web.qtpl:490 streambadgeRestored(qw422016) -//line app/vmalert/web.qtpl:486 +//line app/vmalert/web.qtpl:490 qt422016.ReleaseWriter(qw422016) -//line app/vmalert/web.qtpl:486 +//line app/vmalert/web.qtpl:490 } -//line app/vmalert/web.qtpl:486 +//line app/vmalert/web.qtpl:490 func badgeRestored() string { -//line app/vmalert/web.qtpl:486 +//line app/vmalert/web.qtpl:490 qb422016 := qt422016.AcquireByteBuffer() -//line app/vmalert/web.qtpl:486 +//line app/vmalert/web.qtpl:490 writebadgeRestored(qb422016) -//line app/vmalert/web.qtpl:486 +//line app/vmalert/web.qtpl:490 qs422016 := string(qb422016.B) -//line app/vmalert/web.qtpl:486 +//line app/vmalert/web.qtpl:490 qt422016.ReleaseByteBuffer(qb422016) -//line app/vmalert/web.qtpl:486 +//line app/vmalert/web.qtpl:490 return qs422016 -//line app/vmalert/web.qtpl:486 +//line app/vmalert/web.qtpl:490 } diff --git a/docs/vmalert.md b/docs/vmalert.md index 2f7e19fc5..8d6f18159 100644 --- a/docs/vmalert.md +++ b/docs/vmalert.md @@ -642,6 +642,61 @@ Use the official [Grafana dashboard](https://grafana.com/grafana/dashboards/1495 If you have suggestions for improvements or have found a bug - please open an issue on github or add a review to the dashboard. +## Troubleshooting + +vmalert executes configured rules within certain intervals. 
It is expected that at the moment when a rule is executed, +the data is already present in the configured `-datasource.url`: + +vmalert expected evaluation + +Usually, troubles start to appear when data in `-datasource.url` is delayed or absent. In such cases, evaluations +may get an empty response from the datasource and produce empty recording rules or reset alerts' state: + +vmalert evaluation when data is delayed + +Try the following recommendations in such cases: + +* Always configure the group's `evaluationInterval` to be greater than or equal to the `scrape_interval` at which metrics +are delivered to the datasource; +* If you know in advance that data in the datasource is delayed - try changing vmalert's `-datasource.lookback` +command-line flag to add a time shift for evaluations; +* If time intervals between datapoints in the datasource are irregular - try changing vmalert's `-datasource.queryStep` +command-line flag to specify how far back a search query can look for the most recent datapoint. By default, this value +is equal to the group's `evaluationInterval`. + +Sometimes, it is not clear why a specific alert fired or didn't fire. It is important to remember that +alerts with `for: 0` fire immediately when their expression becomes true, while alerts with `for > 0` fire only +after multiple consecutive evaluations, and their expression must be true at each of them. If the expression is false +at even one evaluation, the alert's state resets to the initial state. + +If the `-remoteWrite.url` command-line flag is configured, vmalert will persist alerts' state in the form of time series +`ALERTS` and `ALERTS_FOR_STATE` to the specified destination. Such time series can then be queried via +[vmui](https://docs.victoriametrics.com/Single-server-VictoriaMetrics.html#vmui) or Grafana to track how alerts' state +changed over time. + +vmalert also stores the last N state updates for each rule. 
To check updates, click the `Details` link next to the rule's name +on the `/vmalert/groups` page and check the `Last updates` section: + +vmalert state + +Rows in the section represent ordered rule evaluations and their results. The column `curl` contains an example of +the HTTP request sent by vmalert to the `-datasource.url` during evaluation. If a specific state shows that no samples +were returned but the curl command returns data - then it is very likely that there was no data in the datasource at the +moment when the rule was evaluated. + +vmalert also allows configuring more detailed logging for a specific rule. Just set `debug: true` in rule's configuration +and vmalert will start printing additional log messages: +```terminal +2022-09-15T13:35:41.155Z DEBUG rule "TestGroup":"Conns" (2601299393013563564) at 2022-09-15T15:35:41+02:00: query returned 0 samples (elapsed: 5.896041ms) +2022-09-15T13:35:56.149Z DEBUG datasource request: executing POST request with params "denyPartialResponse=true&query=sum%28vm_tcplistener_conns%7Binstance%3D%22localhost%3A8429%22%7D%29+by%28instance%29+%3E+0&step=15s&time=1663248945" +2022-09-15T13:35:56.178Z DEBUG rule "TestGroup":"Conns" (2601299393013563564) at 2022-09-15T15:35:56+02:00: query returned 1 samples (elapsed: 28.368208ms) +2022-09-15T13:35:56.178Z DEBUG datasource request: executing POST request with params "denyPartialResponse=true&query=sum%28vm_tcplistener_conns%7Binstance%3D%22localhost%3A8429%22%7D%29&step=15s&time=1663248945" +2022-09-15T13:35:56.179Z DEBUG rule "TestGroup":"Conns" (2601299393013563564) at 2022-09-15T15:35:56+02:00: alert 10705778000901301787 {alertgroup="TestGroup",alertname="Conns",cluster="east-1",instance="localhost:8429",replica="a"} created in state PENDING +... 
+2022-09-15T13:36:56.153Z DEBUG rule "TestGroup":"Conns" (2601299393013563564) at 2022-09-15T15:36:56+02:00: alert 10705778000901301787 {alertgroup="TestGroup",alertname="Conns",cluster="east-1",instance="localhost:8429",replica="a"} PENDING => FIRING: 1m0s since becoming active at 2022-09-15 15:35:56.126006 +0200 CEST m=+39.384575417 +``` + + ## Profiling `vmalert` provides handlers for collecting the following [Go profiles](https://blog.golang.org/profiling-go-programs): diff --git a/docs/vmalert_state.png b/docs/vmalert_state.png new file mode 100644 index 000000000..5bf656b01 Binary files /dev/null and b/docs/vmalert_state.png differ diff --git a/docs/vmalert_ts_data_delay.gif b/docs/vmalert_ts_data_delay.gif new file mode 100644 index 000000000..2da024b46 Binary files /dev/null and b/docs/vmalert_ts_data_delay.gif differ diff --git a/docs/vmalert_ts_normal.gif b/docs/vmalert_ts_normal.gif new file mode 100644 index 000000000..a05c74061 Binary files /dev/null and b/docs/vmalert_ts_normal.gif differ
`) -//line app/vmalert/web.qtpl:450 +//line app/vmalert/web.qtpl:454 qw422016.E().S(u.time.Format(time.RFC3339)) -//line app/vmalert/web.qtpl:450 +//line app/vmalert/web.qtpl:454 qw422016.N().S(` `) -//line app/vmalert/web.qtpl:452 +//line app/vmalert/web.qtpl:456 qw422016.N().D(u.samples) -//line app/vmalert/web.qtpl:452 +//line app/vmalert/web.qtpl:456 qw422016.N().S(` `) -//line app/vmalert/web.qtpl:453 +//line app/vmalert/web.qtpl:457 qw422016.N().FPrec(u.duration.Seconds(), 3) -//line app/vmalert/web.qtpl:453 +//line app/vmalert/web.qtpl:457 qw422016.N().S(`s `) -//line app/vmalert/web.qtpl:454 +//line app/vmalert/web.qtpl:458 qw422016.E().S(u.at.Format(time.RFC3339)) -//line app/vmalert/web.qtpl:454 +//line app/vmalert/web.qtpl:458 qw422016.N().S(`
+ `) -//line app/vmalert/web.qtpl:463 +//line app/vmalert/web.qtpl:467 qw422016.E().V(u.err) -//line app/vmalert/web.qtpl:463 +//line app/vmalert/web.qtpl:467 qw422016.N().S(`