github-mirrors/VictoriaMetrics

mirror of https://github.com/VictoriaMetrics/VictoriaMetrics.git synced 2024-11-21 14:44:00 +00:00

Author	SHA1	Message	Date
Roman Khavronenko	4cdffb04a4	vmalert: update groups on config reload only if changes detected (#759 ) On config reload event `vmalert` reloads configuration for every group. While it works for simple configurations, the more complex and heavy installations may suffer from frequent config reloads. The change introduces the `checksum` field for every group and is set to md5 hash of yaml configuration. The checksum will change if on any change to group definition like rules order or annotation change. Comparing the `checksum` field on config reload event helps to detect if group should be updated. The groups update is now done concurrently, so reload duration will be limited by the slowest group now. Partially solves #691 by improving config reload speed.	2020-09-11 20:14:30 +01:00
Aliaksandr Valialkin	5f16ceb294	app/vmalert: imrovements over `3f932c2db1`	2020-09-03 01:00:55 +03:00
DexterZhang	3f932c2db1	feat: spread load of rule evaluation by group when starting new groups (#724 ) * feat: spread load of rule evaluation by group when starting new groups * review: reduce the resulting diff. * Update app/vmalert/group.go Co-authored-by: Roman Khavronenko <hagen1778@gmail.com> Co-authored-by: Aliaksandr Valialkin <valyala@gmail.com> Co-authored-by: Roman Khavronenko <hagen1778@gmail.com>	2020-09-03 00:58:54 +03:00
Roman Khavronenko	0be5b09fb4	app/vmalert: extend metrics set exported by `vmalert` #573 (#654 ) * app/vmalert: extend metrics set exported by `vmalert` #573 New metrics were added to improve observability: + vmalert_alerts_pending{alertname, group} - number of pending alerts per group per alert; + vmalert_alerts_acitve{alertname, group} - number of active alerts per group per alert; + vmalert_alerts_error{alertname, group} - is 1 if alertname ended up with error during prev execution, is 0 if no errors happened; + vmalert_recording_rules_error{recording, group} - is 1 if recording rule ended up with error during prev execution, is 0 if no errors happened; * vmalert_iteration_total{group, file} - now contains group and file name labels. This should improve control over specific groups; * vmalert_iteration_duration_seconds{group, file} - now contains group and file name labels. This should improve control over specific groups; Some collisions for alerts and recording rules are possible, because neither group name nor alert/recording rule name are unique for compatibility reasons. Commit contains list of TODOs for Unregistering metrics since groups and rules are ephemeral and could be removed without application restart. In order to unlock Unregistering feature corresponding PR was filed - https://github.com/VictoriaMetrics/metrics/pull/13 * app/vmalert: extend metrics set exported by `vmalert` #573 The changes are following: * add an ID label to rules metrics, since `name` collisions within one group is a common case - see the k8s example alerts; * supports metrics unregistering on rule updates. Consider the case when one rule was added or removed from the group, or the whole group was added or removed. The change depends on https://github.com/VictoriaMetrics/metrics/pull/16 where race condition for Unregister method was fixed.	2020-08-09 09:41:29 +03:00
Roman Khavronenko	2f1e7298ce	app/vmalert: support `external.label` to specify global labelset for all rules #622 (#652 ) `external.label` flag supposed to help to distinguish alert or recording rules source in situations when more than one `vmalert` runs for the same datasource or AlertManager.	2020-07-28 14:20:31 +03:00
Roman Khavronenko	703def4b2e	app/vmalert: add retries to remotewrite (#605 ) * app/vmalert: add retries to remotewrite Remotewrite pkg now does limited number of retries if write request failed. This suppose to make vmalert state persisting more reliable. New metrics were added to remotewrite in order to track rows/bytes sent/dropped. defaultFlushInterval was increased from 1s to 5s for sanity reasons. * fix * wip * wip * wip * fix bits alignment bug for 32-bit systems * fix mistakenly dropped field	2020-07-05 18:46:52 +03:00
Aliaksandr Valialkin	d5dddb0953	all: use %w instead of %s for wrapping errors in `fmt.Errorf` This will simplify examining the returned errors such as httpserver.ErrorWithStatusCode . See https://blog.golang.org/go1.13-errors for details.	2020-06-30 23:05:11 +03:00
Roman Khavronenko	88538df267	app/vmalert: support multiple notifier urls (#584 ) (#590 ) * app/vmalert: support multiple notifier urls (#584) User now can set multiple notifier URLs in the same fashion as for other vmutils (e.g. vmagent). The same is correct for TLS setting for every configured URL. Alerts sending is done in sequential way for respecting the specified URLs order. * app/vmalert: add basicAuth support for notifier client (#585) The change adds possibility to set basicAuth creds for notifier client in the same fasion as for remote write/read and datasource.	2020-06-29 22:21:03 +03:00
Roman Khavronenko	3e277020a5	vmalert-491: allow to configure concurrent rules execution per group. (#542 ) The feature allows to speed up group rules execution by executing them concurrently. Change also contains README changes to reflect configuration details.	2020-06-09 15:21:20 +03:00
Roman Khavronenko	270552fde4	vmalert: Add recording rules support. (#519 ) * vmalert: Add recording rules support. Recording rules support required additional service refactoring since it wasn't planned to support them from the very beginning. The list of changes is following: * new entity RecordingRule was added for writing results of MetricsQL expressions into remote storage; * interface Rule now unites both recording and alerting rules; * configuration parser was moved to separate package and now performs more strict validation; * new endpoint for listing all groups and rules in json format was added; * evaluation interval may be set to every particular group; * vmalert: uncomment tests * vmalert: rm outdated TODO * vmalert: fix typos in README	2020-06-01 13:46:37 +03:00
Roman Khavronenko	92212f04da	vmalert: avoid sending resolves for pending alerts (#498 ) Before the change we were sending notifications to notifier if following conditions are met: * alert is in Fire state * alert is in Inactive state We were sending Inactive notifications to resolve alert ASAP. Unfortunately, we were sending resolves for Pending alerts that become Inactive, which is wrong. In this change we delete alert from the active list if it was Pending and become Inactive. In this way we now have Inactive alerts only if they were in state Fire before. See test change for example.	2020-05-17 15:13:22 +01:00
Roman Khavronenko	de60ad0cd6	vmalert: fix potential race during configuration reloads (#497 ) Configuration reload and rules evaluation can't be executed in same time now. This may make reload time longer but prevents from potential races.	2020-05-17 15:12:09 +01:00
Roman Khavronenko	a249cd9d22	vmalert: fix the access to rules slice element by wrong index (#486 ) During group's update rules deletion was causing slice mutations while slice index was assumed to be unchanged. This caused "slice bounds out of range" errors when multiple rules were deleted sequentially.	2020-05-15 07:55:22 +01:00
Roman Khavronenko	8c8ff5d0cb	vmalert: cleanup and restructure of code to improve maintainability (#471 ) The change introduces new entity `manager` which replaces `watchdog`, decouples requestHandler and groups. Manager supposed to control life cycle of groups, rules and config reloads. Groups export an ID method which returns a hash from filename and group name. ID supposed to be unique identifier across all loaded groups. Some tests were added to improve coverage. Bug with wrong annotation value if $value is used in templates after metrics being restored fixed. Notifier interface was extended to accept context. New set of metrics was introduced for config reload.	2020-05-10 17:58:17 +01:00

14 commits