update guide and changelog to 1.9.1 (#5706)

Fred Navruzov 2024-01-28 08:43:28 +01:00 committed by GitHub
parent 6c6c2c185f
commit 9b555a0034
4 changed files with 51 additions and 40 deletions


@@ -1,7 +1,7 @@
scheduler:
infer_every: "1m"
fit_every: "2m"
fit_window: "14d"
fit_window: "3h"
model:
class: "model.prophet.ProphetModel"


@@ -15,14 +15,20 @@ aliases:
Please find the changelog for VictoriaMetrics Anomaly Detection below.
The following `tip` changes can be tested by building from the `latest` tag:
```sh
docker pull us-docker.pkg.dev/victoriametrics-test/public/vmanomaly-trial:latest
```
> **Important note: Users are strongly encouraged to upgrade to `vmanomaly` [v1.9.1](https://hub.docker.com/repository/docker/victoriametrics/vmanomaly/tags?page=1&ordering=name) or later for optimal performance and accuracy. This is especially important for configurations with a low `infer_every` parameter [in your scheduler](https://docs.victoriametrics.com/anomaly-detection/components/scheduler/#parameters-1), and in scenarios where data exhibits significant high-order seasonality patterns (such as hourly or daily cycles). Versions v1.5.1 through v1.8.0 contained a critical issue impacting model training, where models were inadvertently trained on limited data subsets, leading to suboptimal fits and degraded anomaly detection accuracy. Upgrading to v1.9.1 addresses this issue, ensuring proper model training and enhanced reliability. For users utilizing Helm charts, it is recommended to temporarily revert to version [0.5.0](https://github.com/VictoriaMetrics/helm-charts/blob/master/charts/victoria-metrics-anomaly/CHANGELOG.md#050) until an updated chart with the necessary bug fixes becomes available.**
Please find [launch instructions here](/anomaly-detection/Overview.html#run-vmanomaly-docker-container).
# tip
## v1.9.1
Released: 2024-01-27
- IMPROVEMENT: Updated the offline license verification backbone to mitigate a critical vulnerability identified in the [`ecdsa`](https://pypi.org/project/ecdsa/) library. Although the issue had no direct impact on `vmanomaly`, the dependency was updated to ensure enhanced security.
- IMPROVEMENT: Bumped 3rd-party dependencies for Python 3.12.1.
## v1.9.0
Released: 2024-01-26
- BUGFIX: The `query_from_last_seen_timestamp` internal logic in [VmReader](https://docs.victoriametrics.com/anomaly-detection/components/reader.html#vm-reader), first introduced in [v1.5.1](https://docs.victoriametrics.com/anomaly-detection/CHANGELOG.html#v151), now functions correctly. This fix ensures that the input data shape remains consistent for subsequent `fit`-based model calls in the service.
- BREAKING CHANGE: The `sampling_period` parameter is now mandatory in [VmReader](https://docs.victoriametrics.com/anomaly-detection/components/reader.html#vm-reader). This change aims to clarify and standardize the frequency of input/output in `vmanomaly`, thereby reducing uncertainty and aligning with user expectations.
> **Note**: The majority of users, who have been proactively specifying the `sampling_period` parameter in their configurations, will experience no disruption from this update. This transition formalizes a practice that was already prevalent and expected among our user base.
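To illustrate, a minimal `reader` section that satisfies the new requirement might look like this (a sketch; the class path and URL are typical placeholders, not taken from this changelog):
```yaml
reader:
  class: "reader.vm.VmReader"
  datasource_url: "http://victoriametrics:8428/"
  sampling_period: "1m"  # required since v1.9.0
```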
## v1.8.0


@@ -57,7 +57,7 @@ Future updates will introduce additional readers, expanding the range of data so
<tr>
<td><code>sampling_period</code></td>
<td><code>"1h"</code></td>
<td>Optional. Frequency of the points returned. Will be converted to <code>"/query_range?step=%s"</code> param (in seconds).</td>
<td>Frequency of the points returned. Will be converted to <code>"/query_range?step=%s"</code> param (in seconds). <b>Required</b> since <a href="https://docs.victoriametrics.com/anomaly-detection/changelog/#v190">v1.9.0</a>.</td>
</tr>
<tr>
<td><code>query_range_path</code></td>


@@ -36,7 +36,8 @@ aliases:
All the service parameters are defined in a config file.
A single config file supports only one model. It is ok to run multiple vmanomaly processes, each using its own config.
> **Note**: As of the time of writing, in the [1.9.1](https://docs.victoriametrics.com/anomaly-detection/changelog/#v191) release and earlier versions, each `vmanomaly` configuration file is limited to supporting only one model type. To utilize *different models* on your data, it is necessary to run multiple instances of the `vmanomaly` process. Each instance should operate with its own configuration file, differing in the `model` section.
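For example, to analyze the same data with two different model types, you would keep two configs that differ only in the `model` section and start one `vmanomaly` process per file. A sketch (the Z-score class is shown purely as an illustration of a second built-in model):
``` yaml
# config_prophet.yml
model:
  class: "model.prophet.ProphetModel"

# config_zscore.yml (a second, illustrative configuration)
model:
  class: "model.zscore.ZscoreModel"
```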
**vmanomaly** does the following:
- periodically queries user-specified metrics
@@ -98,55 +99,58 @@ node_cpu_seconds_total{cpu="1",mode="iowait"} 51.22
...
```
Here, metric `node_cpu_seconds_total` tells us how many seconds each CPU spent in different modes: _user_, _system_, _iowait_, _idle_, _irq&softirq_, _guest_, or _steal_.
These modes are mutually exclusive. A high _iowait_ means that you are disk or network bound, high _user_ or _system_ means that you are CPU bound.
In this context, the metric `node_cpu_seconds_total` provides a comprehensive breakdown of the time each CPU core has spent in various operational modes. These modes include: _user_, _system_, _iowait_, _idle_, _irq&softirq_, _guest_, and _steal_. Each of these eight modes is mutually exclusive, offering distinct insights into CPU activity. For instance, a predominant _iowait_ suggests disk or network bottlenecks, while elevated levels in _user_ or _system_ indicate significant CPU utilization.
The metric `node_cpu_seconds_total` is a [counter](https://docs.victoriametrics.com/keyConcepts.html#counter) type of metric. If we'd like to see how much time CPU spent in each of the nodes, we need to calculate the per-second values change via [rate function](https://docs.victoriametrics.com/MetricsQL.html#rate): `rate(node_cpu_seconds_total)`. To aggregate data by mode we'll use median or 50% quantile function. Resulting query will look likt this: `quantile by (mode) (0.5, rate(node_cpu_seconds_total[5m])`
Here is how this query may look like in Grafana:
The `node_cpu_seconds_total` metric is classified as a [counter](https://docs.victoriametrics.com/keyConcepts.html#counter) type. To analyze the duration each CPU core spends in these modes, it is necessary to compute the rate of change per second using the [rate function](https://docs.victoriametrics.com/MetricsQL.html#rate): `rate(node_cpu_seconds_total)`. For a more refined and smoother aggregation of data by mode, we apply the median function, or the 50% quantile. The resulting query is formulated as follows: `quantile by (mode) (0.5, rate(node_cpu_seconds_total[5m]))`.
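Putting both steps together, the full expression reads:
```promql
quantile by (mode) (0.5, rate(node_cpu_seconds_total[5m]))
```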
Below is an illustrative example of how this query might be visualized in Grafana:
<img alt="node_cpu_rate_graph" src="guide-vmanomaly-vmalert-query.webp">
This query result will generate 8 time series per each cpu, and we will use them as an input for our VM Anomaly Detection. vmanomaly will start learning configured model type separately for each of the time series.
This query will yield a total of eight time series, one per CPU mode. The number of series is unaffected by the number of CPU cores, due to the `by (mode)` aggregation applied. These series serve as the input for `vmanomaly`, which independently fits a separate instance of the configured model type to each of the time series.
## 5. vmanomaly configuration and parameter description
**Parameter description**:
There are 4 required sections in config file:
The configuration file for `vmanomaly` comprises 4 essential sections:
[`scheduler`](/anomaly-detection/components/scheduler.html) - defines how often to run and make inferences, as well as what timerange to use to train the model.
1. [`scheduler`](/anomaly-detection/components/scheduler.html) - This section determines the frequency of model inferences and training, including the time range for model training.
[`model`](/anomaly-detection/components/models.html) - specific model parameters and configurations,
2. [`model`](/anomaly-detection/components/models.html) - Here, you define specific parameters and configurations for the model being used for anomaly detection.
[`reader`](/anomaly-detection/components/reader.html) - how to read data and where it is located
3. [`reader`](/anomaly-detection/components/reader.html) - This section outlines the methodology for data reading, including the data source location.
[`writer`](/anomaly-detection/components/writer.html) - where and how to write generated output.
4. [`writer`](/anomaly-detection/components/writer.html) - Specifies the destination and method for writing the generated output.
[`monitoring`](/anomaly-detection/components/monitoring.html) (optional) - how to expose healthckeck metrics of `vmanomaly`.
5. [`monitoring`](/anomaly-detection/components/monitoring.html) (optional) - Describes how to monitor and expose health check metrics of `vmanomaly`.
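Taken together, the skeleton of a config file mirrors these five sections:
``` yaml
scheduler:
  # how often to fit and infer, and what time range to train on
model:
  # which model class to run and its arguments
reader:
  # where the input data lives and how to query it
writer:
  # where to write the generated output
monitoring:
  # optional: how to expose health check metrics
```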
Let's look into parameters in each section:
Detailed parameters in each section:
* `scheduler`
* `infer_every` - how often trained models will make inferences on new data. Basically, how often to generate new datapoints for anomaly_score. Format examples: 30s, 4m, 2h, 1d. Time granularity ('s' - seconds, 'm' - minutes, 'h' - hours, 'd' - days). You can look at this as how often a model will write its conclusions on newly added data. Here in example we are asking every 1 minute: based on the previous data, do these new datapoints look abnormal?
* `fit_every` - how often to retrain the models. The higher the frequency -- the fresher the model, but the more CPU it consumes. If omitted, the models will be retrained on each infer_every cycle. Format examples: 30s, 4m, 2h, 1d. Time granularity ('s' - seconds, 'm' - minutes, 'h' - hours, 'd' - days).
* `fit_window` - what data interval to use for model training. Longer intervals capture longer historical behavior and detect seasonalities better, but is slower to adapt to permanent changes to metrics behavior. Recommended value is at least two full seasons. Format examples: 30s, 4m, 2h, 1d. Time granularity ('s' - seconds, 'm' - minutes, 'h' - hours, 'd' - days). Here is the previous 14 days of data to put into the model training.
* `model`
* `class` - what model to run. You can [use your own model](/anomaly-detection/components/models.html#custom-model-guide) or choose from [built-in models](/anomaly-detection/components/models.html#built-in-models). Here we use [Facebook Prophet](/anomaly-detection/components/models.html#prophet) (`model.prophet.ProphetModel`).
* `args` - Model specific parameters, represented as YAML dictionary in a simple `key: value` form. For example, you can use parameters that are available in [FB Prophet](https://facebook.github.io/prophet/docs/quick_start.html).
* `reader`
* `datasource_url` - Data source. An HTTP endpoint that serves `/api/v1/query_range`.
* `queries`: - MetricsQL (extension of PromQL) expressions, where you want to find anomalies.
You can put several queries in a form:
`<QUERY_ALIAS>: "QUERY"`. QUERY_ALIAS will be used as a `for` label in generated metrics and anomaly scores.
* `writer`
* `datasource_url` - Output destination. An HTTP endpoint that serves `/api/v1/import`.
* `infer_every` - Specifies the frequency at which the trained models perform inferences on new data, essentially determining how often new anomaly score data points are generated. Format examples: 30s, 4m, 2h, 1d (time units: 's' for seconds, 'm' for minutes, 'h' for hours, 'd' for days). In effect, this parameter asks at regular intervals (e.g., every 1 minute): based on historical data, do the latest data points look abnormal?
* `fit_every` - Sets the frequency for retraining the models. A higher frequency ensures more updated models but requires more CPU resources. If omitted, models are retrained in each `infer_every` cycle. Format is similar to `infer_every`.
* `fit_window` - Defines the data interval for training the models. Longer intervals allow for capturing extensive historical behavior and better seasonal pattern detection but may slow down the model's response to permanent metric changes and increase resource consumption. A minimum of two full seasonal cycles is recommended. Example format: 3h for three hours of data.
Here is an example of the config file `vmanomaly_config.yml`.
* `model`
* `class` - Specifies the model to be used. Options include custom models ([guide here](/anomaly-detection/components/models.html#custom-model-guide)) or a selection from [built-in models](/anomaly-detection/components/models.html#built-in-models), such as the [Facebook Prophet](/anomaly-detection/components/models.html#prophet) (`model.prophet.ProphetModel`).
* `args` - Model-specific parameters, formatted as a YAML dictionary in the `key: value` structure. For example, any of the parameters accepted by [FB Prophet](https://facebook.github.io/prophet/docs/quick_start.html) can be passed here.
* `reader`
* `datasource_url` - The URL for the data source, typically an HTTP endpoint serving `/api/v1/query_range`.
* `queries` - MetricsQL (an extension of PromQL) expressions for anomaly detection. Multiple queries can be formatted as `<QUERY_ALIAS>: "QUERY"`, where `QUERY_ALIAS` is used as the `for` label in the generated metrics and anomaly scores (see the sketch after this list).
* `writer`
* `datasource_url` - Specifies the endpoint for output data, typically an HTTP endpoint serving `/api/v1/import`.
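As a brief sketch (the URLs are placeholders for your own setup), the `reader` and `writer` sections could look like this:
``` yaml
reader:
  datasource_url: "http://victoriametrics:8428/"  # serves /api/v1/query_range
  sampling_period: "1m"
  queries:
    # the alias `node_cpu_rate` becomes the `for` label on the output metrics
    node_cpu_rate: "quantile by (mode) (0.5, rate(node_cpu_seconds_total[5m]))"
writer:
  datasource_url: "http://victoriametrics:8428/"  # serves /api/v1/import
```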
Below is an illustrative example of a `vmanomaly_config.yml` configuration file.
``` yaml
scheduler:
infer_every: "1m"
fit_every: "2m"
fit_window: "14d"
fit_window: "3h"
model:
class: "model.prophet.ProphetModel"
@@ -179,7 +183,7 @@ As a result of running vmanomaly, it produces the following metrics:
- `y` - initial query result value.
Here is an example of how the output metric will be written into VictoriaMetrics:
`anomaly_score{for="node_cpu_rate", cpu="0", instance="node-xporter:9100", job="node-exporter", mode="idle"} 0.85`
`anomaly_score{for="node_cpu_rate", instance="node-xporter:9100", job="node-exporter", mode="idle"} 0.85`
## 7. vmalert configuration
@@ -198,9 +202,10 @@ groups:
summary: Anomaly Score exceeded 1.0. `rate(node_cpu_seconds_total)` is showing abnormal behavior.
```
In the query expression `expr`, it's crucial to establish a criterion based on the generated anomaly scores. Typically, an [anomaly score](https://docs.victoriametrics.com/anomaly-detection/faq/#what-is-anomaly-score) ranging from 0.0 to 1.0 indicates that the analyzed value falls within normal behavior. Scores exceeding 1.0 signal increasing confidence from our model that the observed value is anomalous.
Selecting an appropriate threshold for the anomaly score depends on your specific requirements and the context of the data. One effective method for determining this threshold is through visual analysis. By plotting the `anomaly_score` metric in conjunction with the predicted 'expected' range, delineated by `yhat_lower` and `yhat_upper`, you can make a more informed decision. Later in this tutorial, we will demonstrate this process with a practical example.
In the query expression we need to put a condition on the generated anomaly scores. Usually if the anomaly score is between 0.0 and 1.0, the analyzed value is not abnormal. The more anomaly score exceeded 1 the more our model is sure that value is an anomaly.
You can choose your threshold value that you consider reasonable based on the anomaly score metric, generated by vmanomaly. One of the best ways is to estimate it visually, by plotting the `anomaly_score` metric, along with predicted "expected" range of `yhat_lower` and `yhat_upper`. Later in this tutorial we will show an example
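For instance, a sketch of such a rule, built around the `summary` shown above (the threshold value is yours to tune):
``` yaml
groups:
  - name: anomaly-detection
    rules:
      - alert: HighAnomalyScore
        expr: 'anomaly_score > 1.0'
        annotations:
          summary: Anomaly Score exceeded 1.0. `rate(node_cpu_seconds_total)` is showing abnormal behavior.
```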
## 8. Docker Compose configuration
@@ -454,7 +459,7 @@ To check if vmanomaly is up and running, you can check docker logs:
```sh
docker logs vmanomaly
docker logs vmanomaly -f
```