- Added self-monitoring guide for `vmanomaly`. - Added cross-referencing on other pages. - Slight improvements in wording on related pages - Update references to v1.18.4 - [x] publish Grafana dashboard to https://grafana.com/orgs/victoriametrics/dashboards: https://grafana.com/grafana/dashboards/22337-victoriametrics-vmanomaly/ @AndrewChubatiuk , JFYI if it somehow impacts your work on supporting `vmanomaly` in operator. The following checks are **mandatory**: - [x] My change adheres [VictoriaMetrics contributing guidelines](https://docs.victoriametrics.com/contributing/).
title | weight | menu | aliases |
---|---|---|---|
Monitoring | 5 | | |
There are two models for monitoring VictoriaMetrics Anomaly Detection behavior: push and pull. Parameters for each of them are specified in the `monitoring` section of the config file.
Note: self-monitoring metrics were enhanced for consistency across the components in v1.17.0, and the documentation was updated accordingly. Key changes included:
- Converting several self-monitoring metrics from `Summary` to `Histogram` to enable quantile calculation. This addresses the limitation of `prometheus_client`'s Summary implementation, which does not support quantiles. The change makes the metrics more informative for performance analysis. The affected metrics are marked as `Histogram (was Summary prior to v1.17.0)` in the tables below.
- Adding a `query_key` label to the `vmanomaly_reader_response_parsing_seconds` metric to provide finer granularity in tracking the performance of individual queries. This metric has also been switched from `Summary` to `Histogram` to align with the other metrics and support quantile calculations.
- Adding `preset` and `scheduler_alias` keys to VmReader and VmWriter metrics for consistency in multi-scheduler setups.
- Renaming Counters `vmanomaly_reader_response_count` to `vmanomaly_reader_responses` and `vmanomaly_writer_response_count` to `vmanomaly_writer_responses`.
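To illustrate why `Histogram` metrics allow quantile calculation, here is a minimal sketch that estimates a quantile from cumulative bucket counts by linear interpolation, the same general idea as `histogram_quantile` in PromQL/MetricsQL. The bucket boundaries and counts are made up for the example and are not the buckets `vmanomaly` actually uses.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    `buckets` mimics the `le` buckets of a Prometheus-style histogram:
    each count includes all observations less than or equal to upper_bound.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be within [0, 1]")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile falls into the +Inf bucket: report the
                # highest finite boundary, as there is no upper limit.
                return prev_bound
            if count == prev_count:
                return bound
            # Linear interpolation inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical request-duration buckets (seconds), cumulative counts.
example_buckets = [(0.1, 2), (0.5, 6), (1.0, 9), (math.inf, 10)]
median = estimate_quantile(0.5, example_buckets)
p99 = estimate_quantile(0.99, example_buckets)
```

A `Summary` without quantiles exposes only the `_sum` and `_count` series, so only the average can be derived from it; the per-bucket counts are what make the estimate above possible.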
Pull Model Config parameters
Parameter | Default | Description |
---|---|---|
`addr` | | Server IP Address |
`port` | | Port |
Push Config parameters
Parameter | Default | Description |
---|---|---|
`url` | | Link where to push metrics to. Example: |
`tenant_id` | | Tenant ID for cluster version. Example: |
 | | Deprecated since v1.8.0. Absolute, to override |
`user` | | BasicAuth username |
`password` | | BasicAuth password |
`bearer_token` | `token` | Token is passed in the standard format with the header `Authorization: bearer {token}`. Available since v1.15.9. |
`bearer_token_file` | `path_to_file` | Path to a file containing the token, which is passed in the standard format with the header `Authorization: bearer {token}`. Available since v1.15.9. |
`verify_tls` | `false` | Verify TLS certificate. If `False`, the TLS certificate is not verified. If `True`, the certificate is verified using the system's CA store. If set to a path to a CA bundle file (like `ca.crt`), the certificate is verified using the provided CA bundle. |
`tls_cert_file` | `path/to/cert.crt` | Path to a file with the client certificate, i.e. `client.crt`. Available since v1.16.3. |
`tls_key_file` | `path/to/key.crt` | Path to a file with the client certificate key, i.e. `client.key`. Available since v1.16.3. |
`timeout` | | Stop waiting for a response after a given number of seconds. |
`extra_labels` | | Section for custom labels specified by user. |
Monitoring section config example
monitoring:
  pull: # Enable /metrics endpoint.
    addr: "0.0.0.0"
    port: 8080
  push:
    url: "http://localhost:8480/"
    tenant_id: "0:0" # For cluster version only
    user: "USERNAME"
    password: "PASSWORD"
    verify_tls: False
    timeout: "5s"
    extra_labels:
      job: "vmanomaly-push"
      test: "test-1"
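With the pull model enabled, the `/metrics` endpoint serves metrics in the Prometheus text exposition format. As a rough sketch, the snippet below filters such a payload down to `vmanomaly_*` samples. The sample payload, its label values, and the helper name `parse_samples` are invented for illustration, and the parser assumes lines without trailing timestamps.

```python
import re

# One sample line: metric name, optional {labels}, then the value.
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$'
)
# One label pair inside the braces: key="value" (with escaped characters).
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_samples(text, prefix="vmanomaly_"):
    """Return (name, labels, value) tuples for metrics starting with prefix."""
    samples = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None or not match.group("name").startswith(prefix):
            continue
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


# Hypothetical payload; real output contains many more series.
SAMPLE_TEXT = """
# TYPE vmanomaly_models_active gauge
vmanomaly_models_active{query_key="q1",model_alias="m1",scheduler_alias="s1",preset="default"} 3.0
python_gc_objects_collected_total{generation="0"} 120.0
"""

samples = parse_samples(SAMPLE_TEXT)
```

In a live setup the text would come from the address configured under `pull`, for example `http://localhost:8080/metrics` for the config above when queried from the same host.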
mTLS protection
Starting from v1.16.3, `vmanomaly` components such as VmWriter support mTLS to ensure secure communication with VictoriaMetrics Enterprise configured with mTLS.
For detailed guidance on configuring mTLS parameters such as `verify_tls`, `tls_cert_file`, and `tls_key_file`, please refer to the mTLS protection section in the Reader documentation. The configuration principles apply consistently across all these `vmanomaly` components.
Metrics generated by vmanomaly
Startup metrics
Metric | Type | Description |
---|---|---|
 | Gauge | vmanomaly start time in UNIX time |
 | Gauge | vmanomaly version information, contained in the `version` label. Added in v1.17.2 |
 | Gauge | vmanomaly UI version information, contained in the `version` label. Added in v1.17.2 |
 | Gauge | Virtual memory size in bytes, available to the process. Added in v1.18.4 |
 | Gauge | Number of (logical) CPU cores available to the process. Added in v1.18.4 |
Reader behaviour metrics
Label names description
Note: additional labels (`scheduler_alias`, `preset`) were added to writer and reader metrics in v1.17.0 to improve consistency across the components. Also, metrics `vmanomaly_reader_request_duration_seconds` and `vmanomaly_reader_response_parsing_seconds` changed their type to `Histogram` (was `Summary` prior to v1.17.0).
Metric | Type | Description | Labelnames |
---|---|---|---|
`vmanomaly_reader_request_duration_seconds` | Histogram (was Summary prior to v1.17.0) | The total time (in seconds) taken by queries to VictoriaMetrics `url` for the `query_key` query within the specified scheduler `scheduler_alias`, in the `vmanomaly` service running in `preset` mode. | |
`vmanomaly_reader_responses` | Counter | The count of responses received from VictoriaMetrics `url` for the `query_key` query, categorized by `code`, within the specified scheduler `scheduler_alias`, in the `vmanomaly` service running in `preset` mode. | |
`vmanomaly_reader_received_bytes` | Counter | The total number of bytes received in responses for the `query_key` query within the specified scheduler `scheduler_alias`, in the `vmanomaly` service running in `preset` mode. | |
`vmanomaly_reader_response_parsing_seconds` | Histogram (was Summary prior to v1.17.0) | The total time (in seconds) taken for data parsing at each `step` (json, dataframe) for the `query_key` query within the specified scheduler `scheduler_alias`, in the `vmanomaly` service running in `preset` mode. | |
`vmanomaly_reader_timeseries_received` | Counter | The total number of timeseries received from VictoriaMetrics for the `query_key` query within the specified scheduler `scheduler_alias`, in the `vmanomaly` service running in `preset` mode. | |
`vmanomaly_reader_datapoints_received` | Counter | The total number of datapoints received from VictoriaMetrics for the `query_key` query within the specified scheduler `scheduler_alias`, in the `vmanomaly` service running in `preset` mode. | |
Models behaviour metrics
Label names description
Note: a new label key, `model_alias`, was introduced with multi-model support in v1.10.0. It was added to keep label sets unique when writing produced metrics back to VictoriaMetrics.
Note: as part of the self-monitoring metrics enhancement (v1.17.0), new metrics, like `vmanomaly_model_run_errors`, were added. Some metrics changed their type (`Summary` -> `Histogram`), like `vmanomaly_model_run_duration_seconds`.
Metric | Type | Description | Labelnames |
---|---|---|---|
`vmanomaly_model_runs` | Counter | How many successful `stage` (`fit`, `infer`, `fit_infer`) runs occurred for models of class `model_alias` based on results from the `query_key` query, within the specified scheduler `scheduler_alias`, in the `vmanomaly` service running in `preset` mode. | |
`vmanomaly_model_run_duration_seconds` | Histogram (was Summary prior to v1.17.0) | The total time (in seconds) taken by model invocations during the `stage` (`fit`, `infer`, `fit_infer`), based on the results of the `query_key` query, for models of class `model_alias`, within the specified scheduler `scheduler_alias`, in the `vmanomaly` service running in `preset` mode. | |
`vmanomaly_model_datapoints_accepted` | Counter | The number of datapoints accepted (excluding NaN or Inf values) by models of class `model_alias` from the results of the `query_key` query during the `stage` (`infer`, `fit_infer`), within the specified scheduler `scheduler_alias`, in the `vmanomaly` service running in `preset` mode. | |
`vmanomaly_model_datapoints_produced` | Counter | The number of datapoints generated by models of class `model_alias` during the `stage` (`infer`, `fit_infer`) based on results from the `query_key` query, within the specified scheduler `scheduler_alias`, in the `vmanomaly` service running in `preset` mode. | |
`vmanomaly_models_active` | Gauge | The number of model instances of class `model_alias` currently available for inference for the `query_key` query, within the specified scheduler `scheduler_alias`, in the `vmanomaly` service running in `preset` mode. | |
`vmanomaly_model_runs_skipped` | Counter | The number of times model runs (of class `model_alias`) were skipped in expected situations (e.g., no data for fitting/inference, or no new data to infer on) during the `stage` (`fit`, `infer`, `fit_infer`), based on results from the `query_key` query, within the specified scheduler `scheduler_alias`, in the `vmanomaly` service running in `preset` mode. | |
`vmanomaly_model_run_errors` | Counter | The number of times model runs (of class `model_alias`) failed due to internal service errors during the `stage` (`fit`, `infer`, `fit_infer`), based on results from the `query_key` query, within the specified scheduler `scheduler_alias`, in the `vmanomaly` service running in `preset` mode. | `stage`, `query_key`, `model_alias`, `scheduler_alias`, `preset` |
Writer behaviour metrics
Label names description
Note: additional labels (`scheduler_alias`, `preset`) were added to writer and reader metrics in v1.17.0 to improve consistency across the components. Also, metrics `vmanomaly_writer_request_duration_seconds` and `vmanomaly_writer_request_serialize_seconds` changed their type to `Histogram` (was `Summary` prior to v1.17.0).
Metric | Type | Description | Labelnames |
---|---|---|---|
`vmanomaly_writer_request_duration_seconds` | Histogram (was Summary prior to v1.17.0) | The total time (in seconds) taken by write requests to VictoriaMetrics `url` for the `query_key` query within the specified scheduler `scheduler_alias`, in the `vmanomaly` service running in `preset` mode. | |
`vmanomaly_writer_responses` | Counter | The count of response codes received from VictoriaMetrics `url` for the `query_key` query, categorized by `code`, within the specified scheduler `scheduler_alias`, in the `vmanomaly` service running in `preset` mode. | |
`vmanomaly_writer_sent_bytes` | Counter | The total number of bytes sent to VictoriaMetrics `url` for the `query_key` query within the specified scheduler `scheduler_alias`, in the `vmanomaly` service running in `preset` mode. | |
`vmanomaly_writer_request_serialize_seconds` | Histogram (was Summary prior to v1.17.0) | The total time (in seconds) taken for serializing data for the `query_key` query within the specified scheduler `scheduler_alias`, in the `vmanomaly` service running in `preset` mode. | |
`vmanomaly_writer_datapoints_sent` | Counter | The total number of datapoints sent to VictoriaMetrics for the `query_key` query within the specified scheduler `scheduler_alias`, in the `vmanomaly` service running in `preset` mode. | |
`vmanomaly_writer_timeseries_sent` | Counter | The total number of timeseries sent to VictoriaMetrics for the `query_key` query within the specified scheduler `scheduler_alias`, in the `vmanomaly` service running in `preset` mode. | |
Labelnames
- `stage` - stage of the model: 'fit', 'infer', or 'fit_infer' for models that do both simultaneously, see model types.
- `query_key` - query alias from the `reader` config section.
- `model_alias` - model alias from the `models` config section. Introduced in v1.10.0.
- `scheduler_alias` - scheduler alias from the `schedulers` config section. Introduced in v1.11.0.
- `preset` - preset alias for `preset` mode of `vmanomaly`. Introduced in v1.12.0.
- `url` - writer or reader url endpoint.
- `code` - response status code or `connection_error`, `timeout`.
- `step` - json or dataframe reading step.
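These label names can be combined into series selectors when querying the self-monitoring metrics. The following is a small sketch of a helper that builds such a selector string. The helper name `build_selector` and the label values are hypothetical, and it only produces exact-match (`=`) matchers.

```python
def build_selector(metric, **labels):
    """Build a Prometheus-style series selector with exact-match labels."""
    parts = []
    for key, value in sorted(labels.items()):
        # Escape backslashes and double quotes inside label values.
        escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
        parts.append(key + '="' + escaped + '"')
    if not parts:
        return metric
    return metric + "{" + ",".join(parts) + "}"


# Hypothetical aliases; use the ones from your own config sections.
selector = build_selector(
    "vmanomaly_models_active",
    scheduler_alias="s1",
    query_key="cpu",
)
```

The resulting string can then be pasted into a query or wrapped in an aggregation, depending on which label dimensions are of interest.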
Logs generated by vmanomaly
The `vmanomaly` service logs operations, errors, and performance for its components (service, reader, writer), alongside self-monitoring metrics updates. Below is a description of key logs for each component and the related metrics affected.
`{{X}}` indicates a placeholder in the log message templates described below, which will be replaced with the appropriate entity during logging.
Note: Applicable to version v1.17.1 or newer.
Note: By default, `vmanomaly` uses the `INFO` logging level. You can change this by specifying the `--loggerLevel` argument. See command-line arguments here.
Startup logs
The `vmanomaly` service logs important information during the startup process. This includes checking for the license, validating configurations, and setting up schedulers, readers, and writers. Below are key logs that are generated during startup, which can help troubleshoot issues with the service's initial configuration or license validation.
License check. If no license key or file is provided, the service will fail to start and log an error message. If a license file is provided but cannot be read, the service logs a failure. Log messages:
Please provide a license code using --license or --licenseFile arg, or as VM_LICENSE_FILE env. See https://victoriametrics.com/products/enterprise/trial/ to obtain a trial license.
failed to read file {{args.license_file}}: {{error_message}}
Config validation. If the service's configuration fails to load or does not meet validation requirements, an error message is logged and the service will exit. If the configuration is loaded successfully, a message confirming the successful load is logged. Log messages:
Config validation failed, please fix these errors: {{error_details}}
Config has been loaded successfully.
Model and data directory setup. The service checks the environment variables `VMANOMALY_MODEL_DUMPS_DIR` and `VMANOMALY_DATA_DUMPS_DIR` to determine where to store models and data. If these variables are not set, models and data will be stored in memory. Please find the on-disk mode details here. Log messages:
Using ENV MODEL_DUMP_DIR=`{{model_dump_dir}}` to store anomaly detection models.
ENV MODEL_DUMP_DIR is not set. Models will be kept in RAM between consecutive `fit` calls.
Using ENV DATA_DUMP_DIR=`{{data_dump_dir}}` to store anomaly detection data.
ENV DATA_DUMP_DIR is not set. Models' training data will be stored in RAM.
Scheduler and service initialization. After the configuration is successfully loaded, the service initializes schedulers and services for each defined `scheduler_alias`. If there are issues with a specific scheduler (e.g., no models or queries found to attach to a scheduler), a warning is logged. When schedulers are initialized, the service logs a list of active schedulers. Log messages:
Scheduler {{scheduler_alias}} wrapped and initialized with {{N}} model spec(s).
No model spec(s) found for scheduler `{{scheduler_alias}}`, skipping setting it up.
Active schedulers: {{list_of_schedulers}}.
Reader logs
The `reader` component logs events during the process of querying VictoriaMetrics and retrieving the data necessary for anomaly detection. This includes making HTTP requests, handling SSL, parsing responses, and processing data into formats like DataFrames. The logs help to troubleshoot issues such as connection problems, timeout errors, or misconfigured queries.
Starting a healthcheck request. When the `reader` component initializes, it checks whether the VictoriaMetrics endpoint is accessible by sending a request for `_vmanomaly_healthcheck`. Log messages:
[Scheduler {{scheduler_alias}}] Max points per timeseries set as: {{vm_max_datapoints_per_ts}}
[Scheduler {{scheduler_alias}}] Reader endpoint SSL error {{url}}: {{error_message}}
[Scheduler {{scheduler_alias}}] Reader endpoint inaccessible {{url}}: {{error_message}}
[Scheduler {{scheduler_alias}}] Reader endpoint timeout {{url}}: {{error_message}}
No data found (False). Based on the `query_from_last_seen_timestamp` VmReader flag. A `warning` log is generated when no data is found in the requested range. This could indicate that the query was misconfigured or that no new data exists for the time period requested. Log message format:
[Scheduler {{scheduler_alias}}] No data between {{start_s}} and {{end_s}} for query "{{query_key}}"
No unseen data found (True). Based on the `query_from_last_seen_timestamp` VmReader flag. A `warning` log is generated when no new data is returned (i.e., all data has already been seen in previous inference steps). This helps in identifying situations where data for inference has already been processed. Based on VmReader's `adjust` flag. Log messages:
[Scheduler {{scheduler_alias}}] No unseen data between {{start_s}} and {{end_s}} for query "{{query_key}}"
Connection or timeout errors. When the reader fails to retrieve data due to connection or timeout errors, a `warning` log is generated. These errors could result from network issues, incorrect query endpoints, or VictoriaMetrics being temporarily unavailable. Log message format:
[Scheduler {{scheduler_alias}}] Error querying {{query_key}} for {{url}}: {{error_message}}
Max datapoints warning. If the requested query range (defined by the `fit_every` or `infer_every` scheduler args) exceeds the maximum number of datapoints allowed by VictoriaMetrics, a `warning` log is generated, and the request is split into multiple intervals (option available since v1.14.1). This ensures that the request does not violate VictoriaMetrics’ constraints. Log messages:
[Scheduler {{scheduler_alias}}] Query "{{query_key}}" from {{start_s}} to {{end_s}} with step {{step}} may exceed max datapoints per timeseries and will be split...
Multi-tenancy warnings. If the reader detects any issues related to missing or misconfigured multi-tenancy labels (supported since v1.16.2), a `warning` log is generated to indicate the issue. See additional details here. Log message format:
The label vm_account_id was not found in the label set of {{query_key}}, but tenant_id='multitenant' is set in reader configuration...
Metrics updated in read operations. During successful query execution, the following reader self-monitoring metrics are updated:
- `vmanomaly_reader_request_duration_seconds`: Records the time (in seconds) taken to complete the query request.
- `vmanomaly_reader_responses`: Tracks the number of response codes received from VictoriaMetrics.
- `vmanomaly_reader_received_bytes`: Counts the number of bytes received in the response.
- `vmanomaly_reader_response_parsing_seconds`: Records the time spent parsing the response into different formats (e.g., JSON or DataFrame).
- `vmanomaly_reader_timeseries_received`: Tracks how many timeseries were retrieved in the query result.
- `vmanomaly_reader_datapoints_received`: Counts the number of datapoints retrieved in the query result.

Metrics skipped in case of failures. If an error occurs (connection or timeout), `vmanomaly_reader_received_bytes`, `vmanomaly_reader_timeseries_received`, and `vmanomaly_reader_datapoints_received` are not incremented because no valid data was received.
Service logs
The `model` component (wrapped in service) logs operations during the fitting and inference stages for each model spec attached to a particular scheduler `scheduler_alias`. These logs inform about skipped runs, connection or timeout issues, invalid data points, and successful or failed model operations.
Skipped runs. When there are insufficient valid data points to fit or infer using a model, the run is skipped and a `warning` log is generated. This can occur when the query returns no new data or when the data contains invalid values (e.g., `NaN`, `INF`). The skipped run is also reflected in the `vmanomaly_model_runs_skipped` metric. Log messages:
When there are insufficient valid data points (at least 1 for online models and 2 for offline models)
[Scheduler {{scheduler_alias}}] Skipping run for stage 'fit' for model '{{model_alias}}' (query_key: {{query_key}}): Not enough valid data to fit: {{valid_values_cnt}}
When all the received timestamps during an `infer` call have already been processed, meaning the `anomaly_score` has already been produced for those points
[Scheduler {{scheduler_alias}}] Skipping run for stage 'infer' for model '{{model_alias}}' (query_key: {{query_key}}): No unseen data to infer on.
When the model fails to produce any valid or finite outputs (such as `anomaly_score`)
[Scheduler {{scheduler_alias}}] Skipping run for stage 'infer' for model '{{model_alias}}' (query_key: {{query_key}}): No (valid) datapoints produced.
Errors during model execution. If the model fails to fit or infer data due to internal service errors or model spec misconfigurations, an `error` log is generated and the error is also reflected in the `vmanomaly_model_run_errors` metric. This can occur during both `fit` and `infer` stages. Log messages:
[Scheduler {{scheduler_alias}}] Error during stage 'fit' for model '{{model_alias}}' (query_key: {{query_key}}): {{error_message}}
[Scheduler {{scheduler_alias}}] Error during stage 'infer' for model '{{model_alias}}' (query_key: {{query_key}}): {{error_message}}
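When triaging skipped runs and errors from collected logs, it can help to pull the scheduler, stage, model, and query out of each line. The sketch below does this with a regular expression derived from the message templates above. It assumes the message text appears in the log line exactly as in the templates (any timestamp or level prefix is ignored), and the aliases in the sample line are made up.

```python
import re

# Matches the "Skipping run" and "Error during" service log templates.
SERVICE_LOG_RE = re.compile(
    r"\[Scheduler (?P<scheduler_alias>.+?)\] "
    r"(?P<kind>Skipping run for|Error during) "
    r"stage '(?P<stage>\w+)' "
    r"for model '(?P<model_alias>.+?)' "
    r"\(query_key: (?P<query_key>.+?)\): "
    r"(?P<message>.*)"
)


def parse_service_log(line):
    """Return a dict of fields for a matching log line, or None otherwise."""
    match = SERVICE_LOG_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    # Normalize the event kind to something easier to aggregate on.
    fields["kind"] = "error" if fields["kind"] == "Error during" else "skipped"
    return fields


# Hypothetical log line following the 'fit' error template.
sample_line = (
    "[Scheduler s1] Error during stage 'fit' for model 'zscore' "
    "(query_key: cpu): division by zero"
)
parsed = parse_service_log(sample_line)
```

Counting parsed events per `(scheduler_alias, model_alias, query_key)` gives a log-side view that should roughly track the `vmanomaly_model_runs_skipped` and `vmanomaly_model_run_errors` counters.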
Model instance created during inference. In cases where an online model instance is created during the inference stage (without a prior fit, a feature introduced in v1.15.2), a `debug` log is produced. This helps track models that are created dynamically based on incoming data. Log messages:
[Scheduler {{scheduler_alias}}] Model instance '{{model_alias}}' created for '{{query_key}}' during inference.
Successful model runs. When a model successfully fits, logs track the number of valid datapoints processed and the time taken for the operation. These logs are accompanied by updates to self-monitoring metrics like `vmanomaly_model_runs`, `vmanomaly_model_run_duration_seconds`, `vmanomaly_model_datapoints_accepted`, and `vmanomaly_model_datapoints_produced`. Log messages:
[Scheduler {{scheduler_alias}}] Fitting on {{valid_values_cnt}}/{{total_values_cnt}} valid datapoints for "{{query_key}}" using model "{{model_alias}}".
[Scheduler {{scheduler_alias}}] Model '{{model_alias}}' fit completed in {{model_run_duration}} seconds for {{query_key}}.
For rolling models (combined stage)
[Scheduler {{scheduler_alias}}] Fit-Infer on {{datapoint_count}} points for "{{query_key}}" using model "{{model_alias}}".
Metrics updated in model runs. During successful fit or infer operations, the following self-monitoring metrics are updated for each run:
- `vmanomaly_model_runs`: Tracks how many times the model ran (`fit`, `infer`, or `fit_infer`) for a specific `query_key`.
- `vmanomaly_model_run_duration_seconds`: Records the total time (in seconds) for the model invocation, based on the results of the `query_key`.
- `vmanomaly_model_datapoints_accepted`: The number of valid datapoints processed by the model during the run.
- `vmanomaly_model_datapoints_produced`: The number of datapoints generated by the model during inference.
- `vmanomaly_models_active`: Tracks the number of models currently available for infer for a specific `query_key`.

Metrics skipped in case of failures. If a model run fails due to an error or if no valid data is available, metrics such as `vmanomaly_model_datapoints_accepted`, `vmanomaly_model_datapoints_produced`, and `vmanomaly_model_run_duration_seconds` are not updated.
Writer logs
The `writer` component logs events during the process of sending produced data (like `anomaly_score` metrics) to VictoriaMetrics. This includes data preparation, serialization, and network requests to VictoriaMetrics endpoints. The logs can help identify issues in data transmission, such as connection errors and invalid data points, and track the performance of write requests.
Starting a write request. A `debug` level log is produced when the `writer` component starts the process of writing data to VictoriaMetrics. It includes details like the number of datapoints, bytes of payload, and the query being written. This is useful for tracking the payload size and performance at the start of the request. Log messages:
[Scheduler {{scheduler_alias}}] POST {{url}} with {{N}} datapoints, {{M}} bytes of payload, for {{query_key}}
No valid data points. A `warning` log is generated if there are no valid datapoints to write (i.e., all are `NaN` or unsupported like `INF`). This indicates that the writer will not send any data to VictoriaMetrics. Log messages:
[Scheduler {{scheduler_alias}}] No valid datapoints to save for metric: {{query_key}}
Connection, timeout, or I/O errors. When the writer fails to send data due to connection, timeout, or I/O errors, an `error` log is generated. These errors often arise from network problems, incorrect URLs, or VictoriaMetrics being unavailable. The log includes details of the failed request and the reason for the failure. Log messages:
[Scheduler {{scheduler_alias}}] Cannot write {{N}} points for {{query_key}}: connection error {{url}} {{error_message}}
[Scheduler {{scheduler_alias}}] Cannot write {{N}} points for {{query_key}}: timeout for {{url}} {{error_message}}
[Scheduler {{scheduler_alias}}] Cannot write {{N}} points for {{query_key}}: I/O error for {{url}} {{error_message}}
Multi-tenancy warnings. If the `tenant_id` is set to `multitenant` but the `vm_account_id` label is missing from the query result, or vice versa, a `warning` log is produced (supported since v1.16.2). This helps in debugging label set issues that may occur due to the multi-tenant configuration; see this section for details. Log messages:
The label vm_account_id was not found in the label set of {{query_key}}, but tenant_id='multitenant' is set in writer...
The label set for the metric {{query_key}} contains multi-tenancy labels, but the write endpoint is configured for single-tenant mode (tenant_id != 'multitenant')...
Metrics updated in write operations. During a successful write of non-empty data, the following self-monitoring metrics are updated:
- `vmanomaly_writer_request_duration_seconds`: Records the time (in seconds) taken to complete the write request.
- `vmanomaly_writer_sent_bytes`: Tracks the number of bytes sent in the request.
- `vmanomaly_writer_responses`: Captures the HTTP response code returned by VictoriaMetrics. In case of connection, timeout, or I/O errors, a specific error code (`connection_error`, `timeout`, or `io_error`) is recorded instead.
- `vmanomaly_writer_request_serialize_seconds`: Records the time taken for data serialization.
- `vmanomaly_writer_datapoints_sent`: Counts the number of valid datapoints that were successfully sent.
- `vmanomaly_writer_timeseries_sent`: Tracks the number of timeseries sent to VictoriaMetrics.

Metrics skipped in case of failures. If an error occurs (connection, timeout, or I/O error), only `vmanomaly_writer_request_duration_seconds` is updated with the appropriate error code.
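Because response counts are split by the `code` label, a write failure ratio can be derived from them. The sketch below assumes the per-code counts have already been collected into a plain dictionary (for instance from query results); the numbers are invented, and treating only 2xx codes as successful is an assumption of this example rather than documented `vmanomaly` behaviour.

```python
def write_error_ratio(responses_by_code):
    """Share of write responses that were not 2xx HTTP codes.

    Keys are values of the `code` label: numeric HTTP status codes as
    strings, or symbolic codes such as `connection_error` or `timeout`.
    """
    total = sum(responses_by_code.values())
    if total == 0:
        return 0.0
    successful = sum(
        count
        for code, count in responses_by_code.items()
        if code.isdigit() and 200 <= int(code) < 300
    )
    return (total - successful) / total


# Hypothetical counts per `code` label value.
example_counts = {"204": 95, "500": 3, "timeout": 1, "connection_error": 1}
ratio = write_error_ratio(example_counts)
```

A ratio that stays above zero for a given `url` or `scheduler_alias` is a reasonable trigger for checking the writer error logs described above.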