On templates validation stage vmalert does not acutally send queries, so for complex
chained expression validation may fail. To avoid this, we add a blank sample in response
so validation can pass successfully. Later, during the rule execution, stub will be replaced
with real `query` function.
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/989
The commit adds a support for template function `query`,
`first` and `value`. The function `query` executes
a MetricsQL query for active alerts. In vmalert we
update templates on every evaluation for active alerts
to keep them up to date. With `query` func it may become
a perf issue since it will fire a query on every execution.
We should keep it in mind for now.
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/539
The previous implementation treated extra labels (global and rule labels) as
separate label set to returned time series labels. Hence, time series always contained
only original labels and alert ID was generated from sorted labels key-values.
Extra labels didn't affect the generated ID and were applied on the following actions:
- templating for Summary and Annotations;
- persisting state via remote write;
- restoring state via remote read.
Such behaviour caused difficulties on restore procedure because extra labels had to be dropped
before checking the alert ID, but that not always worked. Consider the case when expression
returns the following time series `up{job="foo"}` and rule has extra label `job=bar`.
This would mean that restored alert ID will be always different to the real time series because
of collision.
To solve the situation extra labels are now always applied beforehand and `vmalert` doesn't
store original labels anymore. However, this could result into a new error situation.
Consider the case when expression returns two time series `up{job="foo"}` and `up{job="baz"}`,
while rule has extra label `job=bar`. In such case, applying extra labels will result into
two identical time series and `vmalert` will return error:
`result contains metrics with the same labelset after applying rule labels`
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/870
Label `alertgroup` was introduced in #611 and automatically added to generated
time series. By mistake, this new label wasn't correctly purged on restore event
and affected alert's ID uniqueness. This commit removes `alertgroup` label
in restore function.
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/870
On config reload event `vmalert` reloads configuration for every group. While
it works for simple configurations, the more complex and heavy installations may
suffer from frequent config reloads.
The change introduces the `checksum` field for every group and is set to md5 hash
of yaml configuration. The checksum will change if on any change to group
definition like rules order or annotation change. Comparing the `checksum` field
on config reload event helps to detect if group should be updated.
The groups update is now done concurrently, so reload duration will be limited by
the slowest group now.
Partially solves #691 by improving config reload speed.
* VMAlert start with empty rules dir
There are some applications (operator for instance), that generates alerts configuration at runtime
and vmalert must start correctly without rules to support this behaviour.
Later application will add rules files and send SIGHUP to vmalert,
which will trigger reading rules files and start rules exectuion.
Removing rules files with SIGHUP signal must stop rules execution and
vmalert will wait for new rules.
* imports sorted
* added test cases for empty rules, removed blank line
* fixed imports conflict
* updated tests
* feat: spread load of rule evaluation by group when starting new groups
* review: reduce the resulting diff.
* Update app/vmalert/group.go
Co-authored-by: Roman Khavronenko <hagen1778@gmail.com>
Co-authored-by: Aliaksandr Valialkin <valyala@gmail.com>
Co-authored-by: Roman Khavronenko <hagen1778@gmail.com>
Description for `-rule` flag uses as example specific chars like asterisks
which could be interpreted wrong by different shells. To avoid this, description
now contains quoted flag values.
See also #708
* app/vmalert: extend metrics set exported by `vmalert` #573
New metrics were added to improve observability:
+ vmalert_alerts_pending{alertname, group} - number of pending alerts per group
per alert;
+ vmalert_alerts_acitve{alertname, group} - number of active alerts per group
per alert;
+ vmalert_alerts_error{alertname, group} - is 1 if alertname ended up with error
during prev execution, is 0 if no errors happened;
+ vmalert_recording_rules_error{recording, group} - is 1 if recording rule
ended up with error during prev execution, is 0 if no errors happened;
* vmalert_iteration_total{group, file} - now contains group and file name labels.
This should improve control over specific groups;
* vmalert_iteration_duration_seconds{group, file} - now contains group and file name labels. This should improve control over specific groups;
Some collisions for alerts and recording rules are possible, because neither
group name nor alert/recording rule name are unique for compatibility reasons.
Commit contains list of TODOs for Unregistering metrics since groups and rules
are ephemeral and could be removed without application restart. In order to
unlock Unregistering feature corresponding PR was filed - https://github.com/VictoriaMetrics/metrics/pull/13
* app/vmalert: extend metrics set exported by `vmalert` #573
The changes are following:
* add an ID label to rules metrics, since `name` collisions within one group is
a common case - see the k8s example alerts;
* supports metrics unregistering on rule updates. Consider the case when one rule
was added or removed from the group, or the whole group was added or removed.
The change depends on https://github.com/VictoriaMetrics/metrics/pull/16
where race condition for Unregister method was fixed.
`external.label` flag supposed to help to distinguish alert or recording rules
source in situations when more than one `vmalert` runs for the same datasource
or AlertManager.
* app/vmalert: add retries to remotewrite
Remotewrite pkg now does limited number of retries if write request failed.
This suppose to make vmalert state persisting more reliable.
New metrics were added to remotewrite in order to track rows/bytes sent/dropped.
defaultFlushInterval was increased from 1s to 5s for sanity reasons.
* fix
* wip
* wip
* wip
* fix bits alignment bug for 32-bit systems
* fix mistakenly dropped field
* Fix Auto metrics relabeled errors
* Finalize auto-genenated Labels
* Fix Test Errors
* fix error logs when queue is full
Co-authored-by: xinyulong <xinyulong@kuaishou.com>
* app/vmalert: support multiple notifier urls (#584)
User now can set multiple notifier URLs in the same fashion
as for other vmutils (e.g. vmagent). The same is correct for
TLS setting for every configured URL. Alerts sending is done
in sequential way for respecting the specified URLs order.
* app/vmalert: add basicAuth support for notifier client (#585)
The change adds possibility to set basicAuth creds for notifier
client in the same fasion as for remote write/read and datasource.
The change adds no new functionality and aims to move flags definitions
to subpackages that are using them. This should improve readability
of the main function.
app/vmalert: add support for TLS configuration
Add support for TLS optional configuration in a similar fashion to what
is currently supported in other vmutils such as vmagent. TLS
configuration options are distinct for datasource, remoteRead,
remoteWrite as well as notifier.
app/vmalert: Support custom URL for alerts source
Add flag `external.alert.source` for configuring custom URL
for alert's source. This may be handy to re-point default source
URL to other systems like Grafana.
Updates #517
Uniqueness of rule is now defined by combination of its name, expression and
labels. The hash of the combination is now used as rule ID and identifies rule within the group.
Set of rules from coreos/kube-prometheus was added for testing purposes to
verify compatibility. The check also showed that `vmalert` doesn't support
`query` template function that was mentioned as limitation in README.
The feature allows to speed up group rules execution by
executing them concurrently.
Change also contains README changes to reflect configuration
details.
* vmalert: Add recording rules support.
Recording rules support required additional service refactoring since
it wasn't planned to support them from the very beginning. The list
of changes is following:
* new entity RecordingRule was added for writing results of MetricsQL
expressions into remote storage;
* interface Rule now unites both recording and alerting rules;
* configuration parser was moved to separate package and now performs
more strict validation;
* new endpoint for listing all groups and rules in json format was added;
* evaluation interval may be set to every particular group;
* vmalert: uncomment tests
* vmalert: rm outdated TODO
* vmalert: fix typos in README
Before the change we were sending notifications to notifier
if following conditions are met:
* alert is in Fire state
* alert is in Inactive state
We were sending Inactive notifications to resolve alert ASAP.
Unfortunately, we were sending resolves for Pending alerts that become
Inactive, which is wrong.
In this change we delete alert from the active list if
it was Pending and become Inactive. In this way we now
have Inactive alerts only if they were in state Fire before.
See test change for example.
During group's update rules deletion was causing slice
mutations while slice index was assumed to be unchanged.
This caused "slice bounds out of range" errors when multiple
rules were deleted sequentially.
The check for non-nil remoteRead was mistakenly dropped
during refactoring which caused panics when `vmalert`
wasn't configured with `remoteRead` flag.
* feat: add remoteWrite.maxQueueSize to reduce queue full
* rename remote(write|read) flags to remote(Write|Read) for the sake of consistency
Co-authored-by: xiaobeibei <xiaobeibei@bigo.sg>
The change introduces new entity `manager` which replaces
`watchdog`, decouples requestHandler and groups. Manager
supposed to control life cycle of groups, rules and
config reloads.
Groups export an ID method which returns a hash
from filename and group name. ID supposed to be unique
identifier across all loaded groups.
Some tests were added to improve coverage.
Bug with wrong annotation value if $value is used in
templates after metrics being restored fixed.
Notifier interface was extended to accept context.
New set of metrics was introduced for config reload.
* app/vmalert: restore alerts state from datasource metrics
Vmalert will restore alerts state for rules that have `rule.For` > 0 from previously written timeseries via `remotewrite.url` flag.
* app/vmalert: mention remotewerite and remoteread configuration in README
* app/vmalert: initial remote-write support for alerts state persistence.
If `remotewrite.url` flag is set, vmalert will send alerts state via remote-write protocol to remote storage. The sending is asynchronous to avoid blocking calls in rules evaluation loop.
* app/vmalert: merge with master
* app/vmalert: write both `instant` and `for` alerts timeseries states in remote storage.
vmalert: Extend web responses for alerts
* populate apiAlert object with additional fields
* return all active alerts, not only firing
* sort list of API alerts for deterministic output
* add helper for available path list
* Initial rules evaluation support.
Rules are now store alerts state in private field `alerts`. Every evaluation updates
the alerts and state. Every unique metric received from datastore represents a unique alert,
uniqueness is guaranteed by hashing ordered labelset.
* merge with master
* cleanup
* support endAt parameter as 3*evaluationInterval for active alerts
* make golint happy