github-mirrors/VictoriaMetrics

mirror of https://github.com/VictoriaMetrics/VictoriaMetrics.git synced 2024-11-21 14:44:00 +00:00

Author	SHA1	Message	Date
Haleygo	dc28196237	vmalert-tool: implement unittest (#4789 ) 1. split package rule under /app/vmalert, expose needed objects 2. add vmalert-tool with unittest subcmd https://github.com/VictoriaMetrics/VictoriaMetrics/issues/2945	2023-10-13 13:54:33 +02:00
Haleygo	2aa0f5fc41	vmalert: add `evalAlignment` for rule group and fix evalutaion timstamp (#5066 ) * vmalert: add `query_time_alignment` for rule group 1. add `eval_alignment` attribute for group which by default is true. So group rule query stamp will be aligned with interval and propagated to ALERT metrics and the messages for alertmanager; 2. deprecate `datasource.queryTimeAlignment` flag. https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5049	2023-10-10 12:41:19 +02:00
Haleygo	45c0e4bb31	vmalert: add `eval_offset` for group (#4693 ) Adds `eval_offset` attribute for Groups. If specified, Group will be evaluated at the exact time offset on the range of [0...evaluationInterval]. The setting might be useful for cron-like rules which must be evaluated at specific moments of time. https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3409 Signed-off-by: Haley Wang <pipilong.25@gmail.com> Co-authored-by: hagen1778 <roman@victoriametrics.com>	2023-09-06 16:29:59 +02:00
Haleygo	ae0e4a8c90	vmalert: add `keep_firing_for` field for alerting rule (#4669 ) vmalert: support `keep_firing_for` field for alerting rule https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4529 --------- Signed-off-by: hagen1778 <roman@victoriametrics.com> Co-authored-by: hagen1778 <roman@victoriametrics.com>	2023-07-27 15:13:13 +02:00
Roman Khavronenko	d6fa4da712	vmalert: cancel in-flight requests on group's update or close (#3886 ) When group's update() or close() method is called, the group still need to wait for its current evaluation to finish. Sometimes, evaluation could take a significant amount of time which slows configuration update or vmalert's graceful shutdown. The change interrupts current evaluation in order to speed up the graceful shutdown or config update procedures. Signed-off-by: hagen1778 <roman@victoriametrics.com>	2023-03-01 15:48:20 +01:00
Roman Khavronenko	6fd10e8871	vmalert: speed up state restore procedure on start (#3758 ) * vmalert: speed up state restore procedure on start Alerts state restore procedure has been changed to become asynchronous. It doesn't block groups start anymore which significantly improves vmalert's startup time. Instead, state restore is called by each group in their goroutines after the first rules evaluation. While previously state restore attempt was made for all loaded alerting rules, now it is called only for alerts which became active after the first evaluation. This reduces the amount of API calls to the configured remote read URL. This also means that `remoteRead.ignoreRestoreErrors` command-line flag becomes deprecated now and will have no effect if configured. See relevant issue https://github.com/VictoriaMetrics/VictoriaMetrics/issues/2608 Signed-off-by: hagen1778 <roman@victoriametrics.com> * make lint happy Signed-off-by: hagen1778 <roman@victoriametrics.com> * Apply suggestions from code review --------- Signed-off-by: hagen1778 <roman@victoriametrics.com> Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>	2023-02-03 19:46:13 -08:00
Roman Khavronenko	6588fcbfca	vmalert: allow configuring the default number of stored rule's update states (#3556 ) Allow configuring the default number of stored rule's update states in memory via global `-rule.updateEntriesLimit` command-line flag or per-rule via rule's `update_entries_limit` configuration param. Signed-off-by: hagen1778 <roman@victoriametrics.com>	2022-12-29 12:36:44 +01:00
Roman Khavronenko	3dec847c93	vmalert: correctly return error for RW failures (#3452 ) * vmalert: correctly return error for RW failures By mistake, in `0989649ad0` the error for remote write failures weren't return to user. This change fixes it. Signed-off-by: hagen1778 <roman@victoriametrics.com>	2022-12-06 15:36:46 +01:00
Roman Khavronenko	1aa5112771	vmalert: remove `notifier` dependency from `config` (#2906 ) Signed-off-by: hagen1778 <roman@victoriametrics.com>	2022-07-22 13:50:41 +02:00
cui fliter	6f4d9b2a48	fix some typos (#2882 ) Signed-off-by: cui fliter <imcusg@gmail.com>	2022-07-18 12:02:51 +03:00
Roman Khavronenko	ef7f52e0e6	Vmalert notifiers (#2744 ) * vmalert: remove head of line blocking for sending alerts This change makes sending alerts to notifiers concurrent instead of sequential. This eliminates head of line blocking, where first faulty notifier address prevents the rest of notifiers from receiving notifications. Signed-off-by: hagen1778 <roman@victoriametrics.com> * vmalert: make default timeout for sending alerts 10s Previous value of 1m was too high and was inconsistent with default timeout defined for notifiers via configuration file. Signed-off-by: hagen1778 <roman@victoriametrics.com> * vmalert: linter checks fix Signed-off-by: hagen1778 <roman@victoriametrics.com>	2022-06-18 09:11:37 +02:00
Andrii Chubatiuk	a531a96193	added reusable templates support (#2532 ) Signed-off-by: Andrii Chubatiuk <andrew.chubatiuk@gmail.com>	2022-05-14 11:38:44 +02:00
Roman Khavronenko	3f0ecee128	vmalert: properly cleanup stale series tracker on rules update (#2577 ) Rules executor within group tracks series sent to remote write in order to mark them as stale if they had disappeared in next evaluation round. The executor uses rules ID as a key to identifies series which belong to rule. On config reload, executor remains active but the set of rules could change. Hence, we need to properly cleanup the tracker for rules which has been disappeared on config reload. Signed-off-by: hagen1778 <roman@victoriametrics.com>	2022-05-13 10:04:49 +02:00
Roman Khavronenko	2b59fff526	vmalert: fix labels and annotations processing for alerts (#2403 ) To improve compatibility with Prometheus alerting the order of templates processing has changed. Before, vmalert did all labels processing beforehand. It meant all extra labels (such as `alertname`, `alertgroup` or rule labels) were available in templating. All collisions were resolved in favour of extra labels. In Prometheus, only labels from the received metric are available in templating, so no collisions are possible. This change makes vmalert's behaviour similar to Prometheus. For example, consider alerting rule which is triggered by time series with `alertname` label. In vmalert, this label would be overriden by alerting rule's name everywhere: for alert labels, for annotations, etc. In Prometheus, it would be overriden for alert's labels only, but in annotations the original label value would be available. See more details here https://github.com/prometheus/compliance/issues/80 Signed-off-by: hagen1778 <roman@victoriametrics.com>	2022-04-06 20:24:45 +02:00
Roman Khavronenko	0989649ad0	Vmalert compliance 2 (#2340 ) * vmalert: split alert's `Start` field into `ActiveAt` and `Start` The `ActiveAt` field identifies when alert becomes active for rules with `for > 0`. Previously, this value was stored in field `Start`. The field `Start` now identifies the moment alert became `FIRING`. The split is needed in order to distinguish these two moments in the API responses for alerts. Signed-off-by: hagen1778 <roman@victoriametrics.com> * vmalert: support specific moment of time for rules evaluation The Querier interface was extended to accept a new argument used as a timestamp at which evaluation should be made. It is needed to align rules execution time within the group. Signed-off-by: hagen1778 <roman@victoriametrics.com> * vmalert: mark disappeared series as stale Series generated by alerting rules, which were sent to remote write now will be marked as stale if they will disappear on the next evaluation. This would make ALERTS and ALERTS_FOR_TIME series more precise. Signed-off-by: hagen1778 <roman@victoriametrics.com> * wip Signed-off-by: hagen1778 <roman@victoriametrics.com> * vmalert: evaluate rules at fixed timestamp Before, time at which rules were evaluated was calculated right before rule execution. The change makes sure that timestamp is calculated only once per evalution round and all rules are using the same timestamp. It also updates the logic of resending of already resolved alert notification. Signed-off-by: hagen1778 <roman@victoriametrics.com> * vmalert: allow overridin `alertname` label value if it is present in response Previously, `alertname` was always equal to the Alerting Rule name. Now, its value can be overriden if series in response containt the different value for this label. The change is needed for improving compatibility with Prometheus. Signed-off-by: hagen1778 <roman@victoriametrics.com> * vmalert: align rules evaluation in time Now, evaluation timestamp for rules evaluates as if there was no delay in rules evaluation. It means, that rules will be evaluated at fixed timestamps+group_interval. This way provides more consistent evaluation results and improves compatibility with Prometheus, Signed-off-by: hagen1778 <roman@victoriametrics.com> * vmalert: add metric for missed iterations New metric `vmalert_iteration_missed_total` will show whether rules evaluation round was missed. Signed-off-by: hagen1778 <roman@victoriametrics.com> * vmalert: reduce delay before the initial rule evaluation in group Signed-off-by: hagen1778 <roman@victoriametrics.com> * vmalert: rollback alertname override According to the spec: ``` The alert name from the alerting rule (HighRequestLatency from the example above) MUST be added to the labels of the alert with the label name as alertname. It MUST override any existing alertname label. ``` https://github.com/prometheus/compliance/blob/main/alert_generator/specification.md#step-3 Signed-off-by: hagen1778 <roman@victoriametrics.com> * vmalert: throw err immediately on dedup detection ``` The execution of an alerting rule MUST error out immediately and MUST NOT send any alerts or add samples to samples receiver if there is more than one alert with the same labels ``` https://github.com/prometheus/compliance/blob/main/alert_generator/specification.md#step-4 Signed-off-by: hagen1778 <roman@victoriametrics.com> * vmalert: cleanup Signed-off-by: hagen1778 <roman@victoriametrics.com> * vmalert: use strings builder to reduce allocs Signed-off-by: hagen1778 <roman@victoriametrics.com>	2022-03-29 15:09:07 +02:00
Dmytro Kozlov	11ae1ae924	Added resendDelay for alerts (#2296 ) * vmalert: add support of `resendDelay` flag for alerts Co-authored-by: dmitryk-dk <dmitry.kozlov@brightlocal.com> Co-authored-by: hagen1778 <roman@victoriametrics.com>	2022-03-16 15:26:33 +00:00
Roman Khavronenko	e3adcbec6e	lib/promscrape: support prometheus-like duration in scrape configs (#2169 ) * lib/promscrape: support prometheus-like duration in scrape configs The change allows to specify duration values like `1d`, `1w` for fields `scrape_interval`, `scrape_timeout`, etc. https://github.com/VictoriaMetrics/VictoriaMetrics/issues/817#issuecomment-1033384766 Signed-off-by: hagen1778 <roman@victoriametrics.com> * lib/blockcache: make linter happy Signed-off-by: hagen1778 <roman@victoriametrics.com> * lib/promscrape: support prometheus-like duration in scrape configs * add support for extra fields `scrape_align_interval` and `scrape_offset`; * support Prometheus duration parsing for `__scrape_interval__` and `__scrape_duration__` labels; Signed-off-by: hagen1778 <roman@victoriametrics.com> * wip * wip * docs/CHANGELOG.md: document the feature Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>	2022-02-11 16:17:00 +02:00
Roman Khavronenko	5da71eb685	vmalert: support configuration file for notifiers (#2127 ) vmalert: support configuration file for notifiers * vmalert notifiers now can be configured via file see https://docs.victoriametrics.com/vmalert.html#notifier-configuration-file * add support of Consul service discovery for notifiers config see https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1947 * add UI section for currently loaded/discovered notifiers * deprecate `-rule.configCheckInterval` in favour of `-configCheckInterval` * add ability to suppress logs for duplicated targets for notifiers discovery * change behaviour of `vmalert_alerts_send_errors_total` - it now accounts for failed alerts, not HTTP calls.	2022-02-02 14:11:41 +02:00
Aliaksandr Valialkin	bf814320b0	app/vmalert: remove `rule.type` config, since it doesnt play well with the upcoming default tenants for -clusterMode It is better from the consistency point of view to set up rule types at group level where tenant config is set up.	2021-11-05 19:52:32 +02:00
Roman Khavronenko	43a7984cd8	vmalert: correctly calculate alert ID including extra labels (#1734 ) Previously, ID for alert entity was generated without alertname or groupname. This led to collision, when multiple alerting rules within the same group producing same labelsets. E.g. expr: `sum(metric1) by (job) > 0` and expr: `sum(metric2) by (job) > 0` could result into same labelset `job: "job"`. The issue affects only UI and Web API parts of vmalert, because alert ID is used only for displaying and finding active alerts. It does not affect state restore procedure, since this label was added right before pushing to remote storage. The change now adds all extra labels right after receiving response from the datasource. And removes adding extra labels before pushing to remote storage. Additionally, change introduces a new flag `Restored` which will be displayed in UI for alerts which have been restored from remote storage on restart.	2021-10-22 12:30:38 +03:00
Roman Khavronenko	5494bc02a6	vmalert: add flag to limit the max value for auto-resovle duration for alerts (#1609 ) * vmalert: add flag to limit the max value for auto-resovle duration for alerts The new flag `rule.maxResolveDuration` suppose to limit max value for alert.End param, which is used by notifiers like Alertmanager for alerts auto resolve. https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1586	2021-09-13 15:48:18 +03:00
Aliaksandr Valialkin	bfba4c28a4	app/vmalert: accept Prometheus-like durations in `interval` config option inside `group` section	2021-07-12 12:35:17 +03:00
Nikolay	15609ee447	changes vmalert Querier with per rule querier (#1249 ) * changes vmalert Querier with per rule querier it allows to changes some parametrs based on rule setting for instance - alert type, tenant for cluster version or event endpoint url.	2021-04-28 21:41:15 +01:00
Roman Khavronenko	6247884057	vmalert: add function "query", "first" and "value" to alert templates functions (#960 ) The commit adds a support for template function `query`, `first` and `value`. The function `query` executes a MetricsQL query for active alerts. In vmalert we update templates on every evaluation for active alerts to keep them up to date. With `query` func it may become a perf issue since it will fire a query on every execution. We should keep it in mind for now. https://github.com/VictoriaMetrics/VictoriaMetrics/issues/539	2020-12-14 20:11:45 +02:00
Roman Khavronenko	3adf8c5a6f	vmalert: explicitly set extra labels to alert entities (#886 ) The previous implementation treated extra labels (global and rule labels) as separate label set to returned time series labels. Hence, time series always contained only original labels and alert ID was generated from sorted labels key-values. Extra labels didn't affect the generated ID and were applied on the following actions: - templating for Summary and Annotations; - persisting state via remote write; - restoring state via remote read. Such behaviour caused difficulties on restore procedure because extra labels had to be dropped before checking the alert ID, but that not always worked. Consider the case when expression returns the following time series `up{job="foo"}` and rule has extra label `job=bar`. This would mean that restored alert ID will be always different to the real time series because of collision. To solve the situation extra labels are now always applied beforehand and `vmalert` doesn't store original labels anymore. However, this could result into a new error situation. Consider the case when expression returns two time series `up{job="foo"}` and `up{job="baz"}`, while rule has extra label `job=bar`. In such case, applying extra labels will result into two identical time series and `vmalert` will return error: `result contains metrics with the same labelset after applying rule labels` https://github.com/VictoriaMetrics/VictoriaMetrics/issues/870	2020-11-10 00:27:32 +02:00
Aliaksandr Valialkin	f4e8687c88	app/vmalert: accept days, weeks and years in `for:` part of config like Prometheus does Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/817	2020-10-08 20:13:15 +03:00
Aliaksandr Valialkin	5f16ceb294	app/vmalert: imrovements over `3f932c2db1`	2020-09-03 01:00:55 +03:00
Roman Khavronenko	2f1e7298ce	app/vmalert: support `external.label` to specify global labelset for all rules #622 (#652 ) `external.label` flag supposed to help to distinguish alert or recording rules source in situations when more than one `vmalert` runs for the same datasource or AlertManager.	2020-07-28 14:20:31 +03:00
Roman Khavronenko	88538df267	app/vmalert: support multiple notifier urls (#584 ) (#590 ) * app/vmalert: support multiple notifier urls (#584) User now can set multiple notifier URLs in the same fashion as for other vmutils (e.g. vmagent). The same is correct for TLS setting for every configured URL. Alerts sending is done in sequential way for respecting the specified URLs order. * app/vmalert: add basicAuth support for notifier client (#585) The change adds possibility to set basicAuth creds for notifier client in the same fasion as for remote write/read and datasource.	2020-06-29 22:21:03 +03:00
Roman Khavronenko	e91d758831	vmalert-537: allow name duplication for rules within one group. (#559 ) Uniqueness of rule is now defined by combination of its name, expression and labels. The hash of the combination is now used as rule ID and identifies rule within the group. Set of rules from coreos/kube-prometheus was added for testing purposes to verify compatibility. The check also showed that `vmalert` doesn't support `query` template function that was mentioned as limitation in README.	2020-06-15 20:15:47 +01:00
Roman Khavronenko	3e277020a5	vmalert-491: allow to configure concurrent rules execution per group. (#542 ) The feature allows to speed up group rules execution by executing them concurrently. Change also contains README changes to reflect configuration details.	2020-06-09 15:21:20 +03:00
Roman Khavronenko	ffa75c423d	vmalert-521: allow to disable rules expression validation. (#536 ) This feature may be useful for using `vmalert` with PromQL compatible datasources like Loki.	2020-06-06 21:27:09 +01:00
Roman Khavronenko	270552fde4	vmalert: Add recording rules support. (#519 ) * vmalert: Add recording rules support. Recording rules support required additional service refactoring since it wasn't planned to support them from the very beginning. The list of changes is following: * new entity RecordingRule was added for writing results of MetricsQL expressions into remote storage; * interface Rule now unites both recording and alerting rules; * configuration parser was moved to separate package and now performs more strict validation; * new endpoint for listing all groups and rules in json format was added; * evaluation interval may be set to every particular group; * vmalert: uncomment tests * vmalert: rm outdated TODO * vmalert: fix typos in README	2020-06-01 13:46:37 +03:00
Roman Khavronenko	a249cd9d22	vmalert: fix the access to rules slice element by wrong index (#486 ) During group's update rules deletion was causing slice mutations while slice index was assumed to be unchanged. This caused "slice bounds out of range" errors when multiple rules were deleted sequentially.	2020-05-15 07:55:22 +01:00
Roman Khavronenko	8c8ff5d0cb	vmalert: cleanup and restructure of code to improve maintainability (#471 ) The change introduces new entity `manager` which replaces `watchdog`, decouples requestHandler and groups. Manager supposed to control life cycle of groups, rules and config reloads. Groups export an ID method which returns a hash from filename and group name. ID supposed to be unique identifier across all loaded groups. Some tests were added to improve coverage. Bug with wrong annotation value if $value is used in templates after metrics being restored fixed. Notifier interface was extended to accept context. New set of metrics was introduced for config reload.	2020-05-10 17:58:17 +01:00

35 commits