docs/vmanomaly: add self-monitoring section (#7558)
### Describe Your Changes

- Added self-monitoring guide for `vmanomaly`.
- Added cross-referencing on other pages.
- Slight improvements in wording on related pages.
- Updated references to v1.18.4.
- [x] Published Grafana dashboard to https://grafana.com/orgs/victoriametrics/dashboards: https://grafana.com/grafana/dashboards/22337-victoriametrics-vmanomaly/

@AndrewChubatiuk, JFYI in case this somehow impacts your work on supporting `vmanomaly` in the operator.

### Checklist

The following checks are **mandatory**:

- [x] My change adheres to the [VictoriaMetrics contributing guidelines](https://docs.victoriametrics.com/contributing/).
@ -167,6 +167,8 @@ The list of alerting rules is the following:

  alerting rules related to [vmauth](https://docs.victoriametrics.com/vmauth/) component;
* [alerts-vlogs.yml](https://github.com/VictoriaMetrics/VictoriaMetrics/blob/master/deployment/docker/rules/alerts-vlogs.yml):
  alerting rules related to [VictoriaLogs](https://docs.victoriametrics.com/victorialogs/);
* [alerts-vmanomaly.yml](https://github.com/VictoriaMetrics/VictoriaMetrics/blob/master/deployment/docker/rules/alerts-vmanomaly.yml):
  alerting rules related to [VictoriaMetrics Anomaly Detection](https://docs.victoriametrics.com/anomaly-detection/);

Please, also see [how to monitor](https://docs.victoriametrics.com/single-server-victoriametrics/#monitoring)
VictoriaMetrics installations.

deployment/docker/rules/alerts-vmanomaly.yml (new file, 121 lines)
@ -0,0 +1,121 @@
# This file provides a recommended list of alerts to monitor the health of VictoriaMetrics Anomaly Detection (vmanomaly).
# Note: The alerts below are general recommendations and may require customization,
# including threshold adjustments, to suit the specifics of your setup.

groups:
  # Note - Adjust the `job` filter to match your specific setup.
  # By default, the `job` label for vmanomaly in push-based self-monitoring mode is set to `vmanomaly`.
  # However, this can be overridden using additional labels. For further details, refer to the example here:
  # https://docs.victoriametrics.com/anomaly-detection/components/monitoring/?highlight=extra_labels#monitoring-section-config-example
  - name: vmanomaly-health
    rules:
      - alert: TooManyRestarts
        expr: changes(process_start_time_seconds{job=~".*vmanomaly.*"}[15m]) > 2
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.job }} too many restarts (instance {{ $labels.instance }})"
          description: |
            Job {{ $labels.job }} (instance {{ $labels.instance }}) has restarted more than twice in the last 15 minutes.
            It might be crashlooping. Please check the logs for more details.
            Additionally, refer to the "r:errors" value in the "Instance Overview" section of the self-monitoring Grafana dashboard.

      # works if you use Prometheus scraping (pull model only)
      - alert: ServiceDown
        expr: up{job=~".*vmanomaly.*"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.job }} is down on {{ $labels.instance }}"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5m"

      - alert: ProcessNearFDLimits
        expr: (process_max_fds{job=~".*vmanomaly.*"} - process_open_fds{job=~".*vmanomaly.*"}) < 100
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Number of free file descriptors is less than 100 for \"{{ $labels.job }}\"(\"{{ $labels.instance }}\") for the last 5m"
          description: |
            Exhausting the OS file descriptor limit can cause severe degradation of the process.
            Consider increasing the limit as soon as possible.

      - alert: TooHighCPUUsage
        expr: >
          sum(rate(process_cpu_seconds_total{job=~".*vmanomaly.*"}[5m])) by (job, instance) /
          sum(vmanomaly_cpu_cores_available{job=~".*vmanomaly.*"}[5m]) by (job, instance) > 0.9
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "More than 90% of CPU is used by \"{{ $labels.job }}\"(\"{{ $labels.instance }}\") during the last 5m"
          description: >
            Too high CPU usage may be a sign of insufficient resources and may make the process unstable.
            Consider either increasing available CPU resources or decreasing the load on the process.

      - alert: TooHighMemoryUsage
        expr: (min_over_time(process_resident_memory_bytes[10m]) / vmanomaly_available_memory_bytes) > 0.85
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "More than 85% of memory is used by \"{{ $labels.job }}\"(\"{{ $labels.instance }}\")"
          description: |
            Too high memory usage may result in multiple issues, such as OOMs or degraded performance.
            E.g., it can be caused by a high churn rate in your input data.
            Consider either increasing available memory or decreasing the load on the process.

  - name: vmanomaly-issues
    rules:
      - alert: ServiceErrorsDetected
        expr: sum(increase(vmanomaly_model_run_errors_total{job=~".*vmanomaly.*"}[5m])) by (job, instance, stage) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Model Run Errors in \"{{ $labels.job }}\"(\"{{ $labels.instance }}\") stage: {{ $labels.stage }} during the last 5m"
          description: >
            Errors in the service may indicate a problem with the service itself or its dependencies.
            Investigate the logs for more details.
      - alert: SkippedModelRunsDetected
        expr: sum(increase(vmanomaly_model_runs_skipped_total{job=~".*vmanomaly.*"}[5m])) by (job, instance, stage) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Skipped Model Runs in \"{{ $labels.job }}\"(\"{{ $labels.instance }}\") stage: {{ $labels.stage }} during the last 5m"
          description: >
            Skipped model runs may indicate issues like:
            1. No new or valid data is available for the current run.
            2. The presence of new time series that do not have a trained model yet.
            3. No new (or valid) datapoints produced during inference.
            Investigate the logs for more details.
      - alert: HighReadErrorRate
        expr: >
          (
            sum(increase(vmanomaly_reader_responses_total{job=~".*vmanomaly.*", code=~"2.."}[5m])) by (job, instance, url) /
            sum(increase(vmanomaly_reader_responses_total{job=~".*vmanomaly.*"}[5m])) by (job, instance, url)
          ) < 0.95
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error rate in read requests for \"{{ $labels.job }}\"(\"{{ $labels.instance }}\") for url: {{ $labels.url }} during the last 5m"
          description: >
            Reading errors may indicate issues with the input data source, server-side constraint violations, security or network issues.
            Investigate the logs for more details.
      - alert: HighWriteErrorRate
        expr: >
          (
            sum(increase(vmanomaly_writer_responses_total{job=~".*vmanomaly.*", code=~"2.."}[5m])) by (job, instance, url) /
            sum(increase(vmanomaly_writer_responses_total{job=~".*vmanomaly.*"}[5m])) by (job, instance, url)
          ) < 0.95
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error rate in write requests for \"{{ $labels.job }}\"(\"{{ $labels.instance }}\") for url: {{ $labels.url }} during the last 5m"
          description: >
            Writing errors may indicate issues with the destination source, server-side constraint violations, security, or network issues.
            Investigate the logs for more details.

@ -72,7 +72,7 @@ services:
    restart: always
  vmanomaly:
    container_name: vmanomaly
-   image: victoriametrics/vmanomaly:v1.18.3
+   image: victoriametrics/vmanomaly:v1.18.4
    depends_on:
      - "victoriametrics"
    ports:

@ -11,6 +11,11 @@ aliases:
---
Please find the changelog for VictoriaMetrics Anomaly Detection below.

## v1.18.4
Released: 2024-11-18

- IMPROVEMENT: Introduced [self-monitoring guide](https://docs.victoriametrics.com/anomaly-detection/self-monitoring/) for `vmanomaly`. Added metrics for total RAM `vmanomaly_available_memory_bytes` and the number of logical CPU cores `vmanomaly_cpu_cores_available` to the [self-monitoring metrics](https://docs.victoriametrics.com/anomaly-detection/components/monitoring/#metrics-generated-by-vmanomaly).

## v1.18.3
Released: 2024-11-14

@ -159,7 +159,7 @@ services:
  # ...
  vmanomaly:
    container_name: vmanomaly
-   image: victoriametrics/vmanomaly:v1.18.3
+   image: victoriametrics/vmanomaly:v1.18.4
    # ...
    ports:
      - "8490:8490"

@ -361,3 +361,9 @@ reader:
      - '{region="region_name_2"}'
    # ...
```

## Monitoring vmanomaly

`vmanomaly` includes self-monitoring features that allow you to track its health and performance and to detect emerging issues. Metrics related to resource usage, model runs, errors, and I/O operations are visualized on a Grafana dashboard and complemented by alerting rules that notify you of critical conditions. These monitoring tools help ensure stability and efficient troubleshooting of the service.

For detailed instructions on setting up self-monitoring, dashboards, and alerting rules, refer to the [self-monitoring documentation](https://docs.victoriametrics.com/anomaly-detection/self-monitoring/).
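
For the [pull](https://docs.victoriametrics.com/keyconcepts/#pull-model) model, a minimal vmagent/Prometheus scrape config might look like the sketch below. The target address and port are assumptions for a Docker-based setup; point it at the address where your `vmanomaly` instance serves its `/metrics` endpoint:

```yaml
scrape_configs:
  - job_name: vmanomaly
    static_configs:
      # assumption: vmanomaly is reachable at this address inside the same Docker network
      - targets:
          - "vmanomaly:8490"
```
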
@ -229,7 +229,7 @@ This will expose metrics at `http://0.0.0.0:8080/metrics` page.
To use *vmanomaly* you need to pull docker image:

```sh
-docker pull victoriametrics/vmanomaly:v1.18.3
+docker pull victoriametrics/vmanomaly:v1.18.4
```

> Note: please check what is latest release in [CHANGELOG](https://docs.victoriametrics.com/anomaly-detection/changelog/)

@ -239,7 +239,7 @@ docker pull victoriametrics/vmanomaly:v1.18.3
You can put a tag on it for your convenience:

```sh
-docker image tag victoriametrics/vmanomaly:v1.18.3 vmanomaly
+docker image tag victoriametrics/vmanomaly:v1.18.4 vmanomaly
```

Here is an example of how to run *vmanomaly* docker container with [license file](#licensing):

@ -58,13 +58,13 @@ Below are the steps to get `vmanomaly` up and running inside a Docker container:
1. Pull Docker image:

   ```sh
-  docker pull victoriametrics/vmanomaly:v1.18.3
+  docker pull victoriametrics/vmanomaly:v1.18.4
   ```

2. (Optional step) tag the `vmanomaly` Docker image:

   ```sh
-  docker image tag victoriametrics/vmanomaly:v1.18.3 vmanomaly
+  docker image tag victoriametrics/vmanomaly:v1.18.4 vmanomaly
   ```

3. Start the `vmanomaly` Docker container with a *license file*, use the command below.

@ -98,7 +98,7 @@ docker run -it --user 1000:1000 \
services:
  # ...
  vmanomaly:
-   image: victoriametrics/vmanomaly:v1.18.3
+   image: victoriametrics/vmanomaly:v1.18.4
    volumes:
      - $YOUR_LICENSE_FILE_PATH:/license
      - $YOUR_CONFIG_FILE_PATH:/config.yml

@ -1,17 +1,19 @@
In the dynamic and complex world of system monitoring, VictoriaMetrics Anomaly Detection, being a part of our [Enterprise offering](https://victoriametrics.com/products/enterprise/), stands as a pivotal tool for achieving advanced observability. It empowers SREs and DevOps teams by automating the intricate task of identifying abnormal behavior in time-series data. It goes beyond traditional threshold-based alerting, utilizing machine learning techniques to not only detect anomalies but also minimize false positives, thus reducing alert fatigue. By providing simplified alerting mechanisms atop of [unified anomaly scores](https://docs.victoriametrics.com/anomaly-detection/components/models/#vmanomaly-output), it enables teams to spot and address potential issues faster, ensuring system reliability and operational efficiency.
In the dynamic and complex world of system monitoring, [VictoriaMetrics Anomaly Detection](https://victoriametrics.com/products/enterprise/anomaly-detection/) (or shortly, `vmanomaly`), being a part of our [Enterprise offering](https://victoriametrics.com/products/enterprise/), stands as a pivotal tool for achieving advanced observability. It empowers SREs and DevOps teams by automating the identification of abnormal behavior in time-series data. It goes beyond traditional threshold-based alerting, utilizing machine learning techniques to not only detect anomalies but also minimize false positives, thus reducing alert fatigue. By providing simplified alerting mechanisms atop of [unified anomaly scores](https://docs.victoriametrics.com/anomaly-detection/components/models/#vmanomaly-output), it enables teams to spot and address potential issues faster, ensuring system reliability and operational efficiency.

## Practical Guides and Installation

Begin your VictoriaMetrics Anomaly Detection journey with ease using our guides and installation instructions:

- **Quickstart**: Check out how to get `vmanomaly` up and running [here](https://docs.victoriametrics.com/anomaly-detection/quickstart/).
- **Overview**: Find out how `vmanomaly` service operates [here](https://docs.victoriametrics.com/anomaly-detection/overview/)
- **Integration**: Integrate anomaly detection into your observability ecosystem. Get started [here](https://docs.victoriametrics.com/anomaly-detection/guides/guide-vmanomaly-vmalert/).
- **Anomaly Detection Presets**: Enable anomaly detection on predefined set of indicators, that require frequently changing static thresholds for alerting. Find more information [here](https://docs.victoriametrics.com/anomaly-detection/presets/).

Get started with VictoriaMetrics Anomaly Detection efficiently by following our guides and installation options:

- **Installation Options**: Select the method that aligns with your technical requirements:
  - **Docker Installation**: Suitable for containerized environments. See [Docker guide](https://docs.victoriametrics.com/anomaly-detection/overview/#run-vmanomaly-docker-container).
  - **Helm Chart Installation**: Appropriate for those using Kubernetes. See our [Helm charts](https://github.com/VictoriaMetrics/helm-charts/tree/master/charts/victoria-metrics-anomaly).

- **Quickstart**: Learn how to quickly set up `vmanomaly` by following the [Quickstart Guide](https://docs.victoriametrics.com/anomaly-detection/quickstart/).
- **Overview**: Understand the architecture and operation of the `vmanomaly` service [here](https://docs.victoriametrics.com/anomaly-detection/overview/).
- **Integration**: Integrate anomaly detection into your existing observability stack. Find detailed steps [here](https://docs.victoriametrics.com/anomaly-detection/guides/guide-vmanomaly-vmalert/).
- **Anomaly Detection Presets**: Enable anomaly detection on predefined sets of metrics that require frequent static threshold changes for alerting. Learn more [here](https://docs.victoriametrics.com/anomaly-detection/presets/).

- **Installation Options**: Choose the installation method that best fits your infrastructure:
  - **Docker Installation**: Ideal for containerized environments. Follow the [Docker Installation Guide](https://docs.victoriametrics.com/anomaly-detection/overview/#run-vmanomaly-docker-container).
  - **Helm Chart Installation**: Recommended for Kubernetes deployments. See our [Helm charts](https://github.com/VictoriaMetrics/helm-charts/tree/master/charts/victoria-metrics-anomaly).

- **Self-Monitoring**: Ensure `vmanomaly` is functioning optimally with built-in self-monitoring capabilities. Use the provided Grafana dashboards and alerting rules to track service health and operational metrics. Find the complete docs [here](https://docs.victoriametrics.com/anomaly-detection/self-monitoring/).

> **Note**: starting from [v1.5.0](https://docs.victoriametrics.com/anomaly-detection/changelog/#v150) `vmanomaly` requires a [license key](https://docs.victoriametrics.com/anomaly-detection/overview/#licensing) to run. You can obtain a trial license key [**here**](https://victoriametrics.com/products/enterprise/trial/).

@ -32,6 +34,9 @@ Enhance your knowledge with our handbook on Anomaly Detection & Root Cause Analy
- [Techniques and Models for Anomaly Detection](https://victoriametrics.com/blog/victoriametrics-anomaly-detection-handbook-chapter-3/)
* Follow the [`#anomaly-detection`](https://victoriametrics.com/blog/tags/anomaly-detection/) tag in our blog

## Product Updates
Stay up-to-date with the latest improvements and features in VictoriaMetrics Anomaly Detection by following our [blog post series](https://victoriametrics.com/blog/victoriamerics-anomaly-detection-updates-2024-q1-2/).

## Frequently Asked Questions (FAQ)
Got questions about VictoriaMetrics Anomaly Detection? Chances are, we've got the answers ready for you.

docs/anomaly-detection/Self-monitoring.md (new file, 147 lines)

@ -0,0 +1,147 @@
---
weight: 5
title: Self-monitoring
menu:
  docs:
    identifier: "vmanomaly-self-monitoring"
    parent: "anomaly-detection"
    weight: 5
aliases:
  - /anomaly-detection/self-monitoring.html
---

## What is Self-Monitoring

Self-monitoring refers to the ability of the service to track and report its own health, operational performance, and any potential issues in real time. This process enables proactive maintenance and problem identification, making it easier for administrators and users to maintain system stability and performance.

VictoriaMetrics Anomaly Detection (`vmanomaly`) supports self-monitoring by generating metrics related to different operational aspects of the service, including model execution, reader behavior, writer behavior, and overall service health. These metrics provide insights into system bottlenecks, errors, or unusual behavior, and allow for data-driven decisions on tuning performance and reliability.

Self-monitoring metrics are available in both the [push](https://docs.victoriametrics.com/keyconcepts/#push-model) and [pull](https://docs.victoriametrics.com/keyconcepts/#pull-model) models, providing flexibility to fit into different monitoring environments. By specifying relevant parameters in the `monitoring` section of the configuration, `vmanomaly` components can seamlessly integrate with VictoriaMetrics, Prometheus, or other monitoring solutions, enabling centralized visibility into both the service and anomaly detection outcomes.

> **Note**: For a detailed overview of the self-monitoring metrics produced by `vmanomaly` and how to enable their tracking for the push/pull models, please refer to the [monitoring](https://docs.victoriametrics.com/anomaly-detection/components/monitoring) section of the docs.
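
For reference, a minimal `monitoring` section for the push model might look like the sketch below. The URL, push frequency, and `extra_labels` values are assumptions for a Docker-based setup; see the monitoring docs linked above for the full list of parameters:

```yaml
monitoring:
  push:
    # assumption: VictoriaMetrics is reachable at this URL from the vmanomaly container
    url: "http://victoriametrics:8428"
    push_frequency: "15m"
    extra_labels:
      # keep the `job` label aligned with the filters used in the dashboard and alerting rules
      job: "vmanomaly"
```
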
The self-monitoring assets of `vmanomaly` include a Grafana dashboard and accompanying alerting rules.

## Grafana Dashboard

> **Note**: The most recent revision of the Grafana dashboard is designed to work with metrics produced by `vmanomaly` version [v1.18.4](https://docs.victoriametrics.com/anomaly-detection/changelog/#v1184) or higher.

### Overview

To visualize and interact with the self-monitoring metrics, `vmanomaly` provides a [Grafana Dashboard](https://grafana.com/grafana/dashboards/22337/). It offers an overview of service health (per job, instance) and performance metrics from different operational stages, including model behavior, reader, and writer components.

The Grafana Dashboard is helpful for:

- Monitoring common metrics such as uptime, RAM, and CPU usage.
- Tracking key metrics like skipped model runs, request errors, and stage timings across `vmanomaly` components.
- Identifying service-specific bottlenecks or performance issues at a glance.

### Navigating the Dashboard

> **Note**: Use the top-level filters to refine metrics by job, instance, or specific components for more focused monitoring. The time range filter, along with `job` and `instance` filters, is applied across all components. All other filters apply to all dashboard sections except "Instance Overview." Hover over the (i) icon for detailed filter descriptions.

<img src="vmanomaly-dashboard-1-filters.webp" alt="vmanomaly-dashboard-1-filters" width="800px"/>

The Grafana Dashboard for `vmanomaly` is organized into various panels that offer insights into different components and their operational metrics. The main sections are as follows:

### Instance Overview

<img src="vmanomaly-dashboard-2-instance-overview.webp" alt="vmanomaly-dashboard-2-instance-overview" width="800px"/>

This panel provides general information about the state at the individual `instance` level, including metrics such as uptime, restarts, errors, license expiration, and overall status. It serves as a critical starting point for assessing the health of the anomaly detection service. If any issues are identified with a particular instance — such as a low success rate, a high number of skipped or erroneous runs, or increased resource consumption — you can drill down further by using the dashboard filter `instance={{instance}}` for more detailed analysis.

**Healthy scenario**:
- **I/O: Reads, Writes, Total**: These should all be close to 100%, indicating successful data handling.
- **Acceptance Rate (`r:acc.rate`)**: Should be close to 100%, meaning there is no invalid data (e.g., `Inf` or `NaN`).
- **Resource Usage (CPU, RAM)**: CPU and RAM usage should be within expected limits.
- **Errors (`r:errors`)**: Should be zero, indicating no major faults in operations.
- **Skipped Runs (`r:skipped`)**: Should be close to zero, reflecting minimal disruptions in data processing.

### Global Statistics

#### Models
This global panel holds statistics related to models, filtered by the dashboard settings, including:
- The number of unique models in use and the entities they interact with (e.g., queries, schedulers, instances).
- Counts of successful, skipped, or erroneous model runs.
- Average timings for different model stages.

<img src="vmanomaly-dashboard-3-global-panel-models.webp" alt="vmanomaly-dashboard-3-global-panel-models" width="800px"/>

**Healthy scenario**:
- **Data Acceptance**: Should be consistently high, ideally close to 100%. This indicates that the system is successfully processing the majority of incoming data without issues (e.g., no NaNs or Inf values).
- **Erroneous Runs**: There should be zero erroneous runs. Any errors suggest potential issues with the service or uncaught corner cases that need immediate attention.
- **Skipped Runs**: Should be minimal, ideally none. A significant number of skipped runs may indicate missing data, configuration problems, high churn rate (resulting in no models trained yet for new time series), or other underlying issues.
- **Problematic Runs Percentage**: This should be close to 0%. A higher percentage warrants further investigation into potential bottlenecks or misconfigurations.
- **Timings**: Processing time sparklines should remain stable over time without significant peaks.

#### I/O
This global panel holds statistics related to I/O operations and data processing, filtered by the dashboard settings.

<img src="vmanomaly-dashboard-3-global-panel-io.webp" alt="vmanomaly-dashboard-3-global-io" width="800px"/>

**Healthy scenario**:
- **I/O success, %**: Should be close to 100%.
- **Timeseries graphs**: Should appear stable over time, without significant spikes or drops.

#### Latency
This global panel holds latency statistics (reads, writes, response processing by stages), filtered by the dashboard settings.

<img src="vmanomaly-dashboard-3-global-panel-latency.webp" alt="vmanomaly-dashboard-3-global-latency" width="800px"/>

**Healthy scenario**:
- **Timeseries graphs**: Should appear stable over time, without significant spikes or drops.
- **Histograms**: Should not have right-hand buckets significantly higher than all the other buckets, indicating the absence of consecutive latency spikes.

#### Resource

This global panel holds resource utilization (CPU, RAM, File Descriptors) on both an overall and per-`instance` level, filtered by the dashboard settings.

<img src="vmanomaly-dashboard-3-global-panel-resources.webp" alt="vmanomaly-dashboard-3-global-resources" width="800px"/>

**Healthy scenario**:
- **Timeseries graphs**: Should appear stable over time, without significant spikes, drops, or upward trends (e.g., an upward trend in RAM usage may indicate a [high churn rate](https://docs.victoriametrics.com/faq/#what-is-high-churn-rate) in your input data).

### Model Statistics

These panels contain repeated blocks for each unique `model_alias` (a distinct entity defined in the `models` [configuration section](https://docs.victoriametrics.com/anomaly-detection/components/models)), filtered according to the current dashboard settings. They provide information on the number of unique entities (such as queries, schedulers, and instances) that a particular `model_alias` interacts with, as well as the count of active model instances available for inferring new data.

<img src="vmanomaly-dashboard-4-model-sections.webp" alt="vmanomaly-dashboard-4-model-sections" width="800px"/>

**Healthy scenario**:
- **Erroneous Runs**: There should be zero erroneous runs. Any errors suggest potential issues with the service or uncaught corner cases that need immediate attention.
- **Skipped Runs**: Should be minimal, ideally none. A significant number of skipped runs may indicate missing data, configuration problems, high churn rate (resulting in no models trained yet for new time series), or other underlying issues.
- **Timeseries graphs**: Should remain stable over time, without significant spikes or drops. The absence of upward trends (e.g., in RAM usage) indicates normal operation without an excessive [churn rate](https://docs.victoriametrics.com/faq/#what-is-high-churn-rate).

---

## Alerting Rules

### Overview

Alerting rules for `vmanomaly` ensure that severe issues are promptly detected and that users are notified of problems such as service downtime, excessive error rates, or insufficient system resources.

The alerting rules are provided in a YAML file called [`alerts-vmanomaly.yml`](https://github.com/VictoriaMetrics/VictoriaMetrics/tree/master/deployment/docker/rules/alerts-vmanomaly.yml).
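
For example, the rules file can be evaluated by [vmalert](https://docs.victoriametrics.com/vmalert/); a sketch of the relevant flags is shown below (the URLs and rules path are illustrative and should be adjusted to your environment):

```sh
./vmalert-prod \
  -datasource.url=http://victoriametrics:8428 \
  -notifier.url=http://alertmanager:9093 \
  -rule=/path/to/alerts-vmanomaly.yml
```
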
### Using the Alerting Rules

These alerting rules complement the [dashboard](#grafana-dashboard) to monitor the health of `vmanomaly`. Each alert has annotations to help understand the issue and guide troubleshooting efforts. Below are the key alerts included, grouped into two sections:

<img src="firing-alerts-groups.webp" alt="firing-alerts-groups" width="800px"/>

`vmanomaly-health` alerting group:
- **`TooManyRestarts`**: Triggers if an instance restarts more than twice within 15 minutes, suggesting the process might be crashlooping and needs investigation.
- **`ServiceDown`**: Alerts if an instance is down for more than 5 minutes, indicating a service outage.
- **`ProcessNearFDLimits`**: Alerts when the number of available file descriptors falls below 100, which could lead to severe degradation if the limit is exhausted.
- **`TooHighCPUUsage`**: Alerts when CPU usage exceeds 90% for a continuous 5-minute period, indicating possible resource exhaustion and the need to adjust resource allocation or load.
- **`TooHighMemoryUsage`**: Alerts when RAM usage exceeds 85% for a continuous 5-minute period, indicating the need to adjust resource allocation or reduce the load.

<img src="firing-alerts-example-too-many-restarts.webp" alt="firing-alerts-example-too-many-restarts" width="800px"/>

`vmanomaly-issues` alerting group:
- **`ServiceErrorsDetected`**: Alerts if model run errors are detected, indicating problems with the anomaly detection service or its dependencies.
- **`SkippedModelRunsDetected`**: Alerts if model runs are skipped, potentially due to no new valid data, absence of trained ML models for new time series, or invalid data points (like `Inf` or `NaN`).
- **`HighReadErrorRate`**: Alerts when the error rate for read operations exceeds 5% in a 5-minute window, suggesting issues with the data source, server constraints, or network.
- **`HighWriteErrorRate`**: Alerts when the error rate for write operations exceeds 5% in a 5-minute window, indicating issues with data writing, potential server-side violations, or network problems.

<img src="firing-alerts-example-skipped-runs.webp" alt="firing-alerts-example-skipped-runs" width="800px"/>
@ -993,7 +993,7 @@ monitoring:
Let's pull the docker image for `vmanomaly`:

```sh
-docker pull victoriametrics/vmanomaly:v1.18.3
+docker pull victoriametrics/vmanomaly:v1.18.4
```

Now we can run the docker container putting as volumes both config and model file:

@ -1007,7 +1007,7 @@ docker run -it \
 -v $(PWD)/license:/license \
 -v $(PWD)/custom_model.py:/vmanomaly/model/custom.py \
 -v $(PWD)/custom.yaml:/config.yaml \
-victoriametrics/vmanomaly:v1.18.3 /config.yaml \
+victoriametrics/vmanomaly:v1.18.4 /config.yaml \
 --licenseFile=/license
 ```

|
@ -266,6 +266,22 @@ For detailed guidance on configuring mTLS parameters such as `verify_tls`, `tls_
|
|||
<td>Gauge</td>
|
||||
<td>vmanomaly UI version information, contained in `version` label. Added in [v1.17.2](https://docs.victoriametrics.com/anomaly-detection/changelog/#v1172)</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>
|
||||
|
||||
`vmanomaly_available_memory_bytes`
|
||||
</td>
|
||||
<td>Gauge</td>
|
||||
<td>Virtual memory size in bytes, available to the process. Added in [v1.18.4](https://docs.victoriametrics.com/anomaly-detection/changelog/#v1184)</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>
|
||||
|
||||
`vmanomaly_cpu_cores_available`
|
||||
</td>
|
||||
<td>Gauge</td>
|
||||
<td>Number of (logical) CPU cores available to the process. Added in [v1.18.4](https://docs.victoriametrics.com/anomaly-detection/changelog/#v1184)</td>
|
||||
</tr>
|
||||
</tbody>
|
||||
</table>
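
As an example, the two gauges above can be combined with the standard process metrics to track utilization over time. Below is a sketch of optional [vmalert](https://docs.victoriametrics.com/vmalert/) recording rules built on them; the rule names are illustrative:

```yaml
groups:
  - name: vmanomaly-utilization
    rules:
      # fraction of the available memory currently used by the process
      - record: vmanomaly:memory_usage:ratio
        expr: process_resident_memory_bytes{job=~".*vmanomaly.*"} / vmanomaly_available_memory_bytes{job=~".*vmanomaly.*"}
      # fraction of the available CPU cores currently used by the process
      - record: vmanomaly:cpu_usage:ratio
        expr: >
          sum(rate(process_cpu_seconds_total{job=~".*vmanomaly.*"}[5m])) by (job, instance)
          /
          sum(vmanomaly_cpu_cores_available{job=~".*vmanomaly.*"}) by (job, instance)
```
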
docs/anomaly-detection/firing-alerts-example-skipped-runs.webp (new binary file)
docs/anomaly-detection/firing-alerts-groups.webp (new binary file)

@ -385,7 +385,7 @@ services:
    restart: always
  vmanomaly:
    container_name: vmanomaly
-   image: victoriametrics/vmanomaly:v1.18.3
+   image: victoriametrics/vmanomaly:v1.18.4
    depends_on:
      - "victoriametrics"
    ports:

docs/anomaly-detection/vmanomaly-dashboard-1-filters.webp (new binary file)
docs/anomaly-detection/vmanomaly-dashboard-4-model-sections.webp (new binary file)