---
title: Monitoring
weight: 5
menu:
  docs:
    parent: "vmanomaly-components"
    weight: 5
    identifier: "vmanomaly-monitoring"
aliases:
  - ./monitoring.html
---

There are two models for monitoring VictoriaMetrics Anomaly Detection behavior: push and pull. Parameters for each of them should be specified in the `monitoring` section of the config file.

> **Note**: Self-monitoring metrics were enhanced for consistency across the components in v1.17.0, and the documentation was updated accordingly. Key changes included:
>
> - Converting several self-monitoring metrics from `Summary` to `Histogram` to enable quantile calculation. This addresses the limitation of the `prometheus_client` `Summary` implementation, which does not support quantiles, and makes the metrics more informative for performance analysis. Affected metrics are:
>   - `vmanomaly_reader_request_duration_seconds` (VmReader)
>   - `vmanomaly_reader_response_parsing_seconds` (VmReader)
>   - `vmanomaly_writer_request_duration_seconds` (VmWriter)
>   - `vmanomaly_writer_request_serialize_seconds` (VmWriter)
> - Adding a `query_key` label to the `vmanomaly_reader_response_parsing_seconds` metric to provide finer granularity in tracking the performance of individual queries. This metric has also been switched from `Summary` to `Histogram` to align with the other metrics and support quantile calculations.
> - Adding `preset` and `scheduler_alias` labels to VmReader and VmWriter metrics for consistency in multi-scheduler setups.
> - Renaming the Counters `vmanomaly_reader_response_count` to `vmanomaly_reader_responses` and `vmanomaly_writer_response_count` to `vmanomaly_writer_responses`.

## Pull Model Config parameters

| Parameter | Default     | Description       |
|-----------|-------------|-------------------|
| `addr`    | `"0.0.0.0"` | Server IP Address |
| `port`    | `8080`      | Port              |
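
When the pull model is enabled, the `/metrics` endpoint can be scraped by vmagent or Prometheus. Below is a minimal scrape config sketch; the job name and target address are illustrative and assume vmanomaly is reachable at `vmanomaly:8080`, matching the `addr`/`port` values above.

```yaml
scrape_configs:
  - job_name: "vmanomaly"            # illustrative job name
    metrics_path: "/metrics"         # endpoint exposed by the pull model
    static_configs:
      - targets: ["vmanomaly:8080"]  # assumed host:port of the vmanomaly instance
```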

## Push Config parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| `url` | | URL to push metrics to. Example: `"http://localhost:8480/"` |
| `tenant_id` | | Tenant ID for the cluster version. Example: `"0:0"` |
| `health_path` | `"health"` | Deprecated since v1.8.0. Absolute path to override the `/health` path |
| `user` | | BasicAuth username |
| `password` | | BasicAuth password |
| `bearer_token` | `token` | Token passed in the standard format with header `Authorization: bearer {token}`. Available since v1.15.9 |
| `bearer_token_file` | `path_to_file` | Path to a file containing a token, passed in the standard format with header `Authorization: bearer {token}`. Available since v1.15.9 |
| `verify_tls` | `false` | Verify TLS certificate. If `False`, the TLS certificate is not verified. If `True`, the certificate is verified using the system's CA store. If set to a path to a CA bundle file (like `ca.crt`), the certificate is verified using the provided CA bundle. |
| `tls_cert_file` | `path/to/cert.crt` | Path to a file with the client certificate, i.e. `client.crt`. Available since v1.16.3. |
| `tls_key_file` | `path/to/key.crt` | Path to a file with the client certificate key, i.e. `client.key`. Available since v1.16.3. |
| `timeout` | `"5s"` | Stop waiting for a response after the given number of seconds. |
| `extra_labels` | | Section for custom labels specified by the user. |

## Monitoring section config example

```yaml
monitoring:
  pull: # Enable /metrics endpoint.
    addr: "0.0.0.0"
    port: 8080
  push:
    url: "http://localhost:8480/"
    tenant_id: "0:0" # For cluster version only
    user: "USERNAME"
    password: "PASSWORD"
    verify_tls: False
    timeout: "5s"
    extra_labels:
      job: "vmanomaly-push"
      test: "test-1"
```

## mTLS protection

Starting from v1.16.3, vmanomaly components such as VmWriter support mTLS to ensure secure communication with VictoriaMetrics Enterprise configured with mTLS.

For detailed guidance on configuring mTLS parameters such as `verify_tls`, `tls_cert_file`, and `tls_key_file`, please refer to the mTLS protection section in the Reader documentation. The configuration principles apply consistently across all these vmanomaly components.
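
For reference, a push section with mTLS enabled might look like the sketch below. The URL and certificate paths are placeholders; only the parameter names come from the table above.

```yaml
monitoring:
  push:
    url: "https://vminsert.example.com:8480/"  # placeholder URL of an mTLS-protected endpoint
    verify_tls: "path/to/ca.crt"               # verify the server certificate against this CA bundle
    tls_cert_file: "path/to/client.crt"        # client certificate presented to the server
    tls_key_file: "path/to/client.key"         # key for the client certificate
```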

## Metrics generated by vmanomaly

### Startup metrics

| Metric | Type | Description |
|--------|------|-------------|
| `vmanomaly_start_time_seconds` | Gauge | vmanomaly start time in UNIX time |
| `vmanomaly_version_info` | Gauge | vmanomaly version information, contained in the `version` label. Added in v1.17.2 |
| `vmanomaly_ui_version_info` | Gauge | vmanomaly UI version information, contained in the `version` label. Added in v1.17.2 |
| `vmanomaly_available_memory_bytes` | Gauge | Virtual memory size in bytes, available to the process. Added in v1.18.4 |
| `vmanomaly_cpu_cores_available` | Gauge | Number of (logical) CPU cores available to the process. Added in v1.18.4 |
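
These gauges make it easy to alert on restarts of the service. The vmalert rule below is a sketch; the group name, alert name, and lookback window are illustrative.

```yaml
groups:
  - name: vmanomaly-startup            # illustrative group name
    rules:
      - alert: VmanomalyRestarted      # illustrative alert name
        # `changes()` over the start-time gauge detects recent restarts
        expr: changes(vmanomaly_start_time_seconds[15m]) > 0
        labels:
          severity: warning
```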

[Back to metric sections](#metrics-generated-by-vmanomaly)

### Reader behaviour metrics

Label names are described in the [Labelnames](#labelnames) section.

> **Note**: Additional labels (`scheduler_alias`, `preset`) were added to writer and reader metrics in v1.17.0 to improve consistency across the components. Also, the metrics `vmanomaly_reader_request_duration_seconds` and `vmanomaly_reader_response_parsing_seconds` changed their type to `Histogram` (they were `Summary` prior to v1.17.0).

| Metric | Type | Description | Labelnames |
|--------|------|-------------|------------|
| `vmanomaly_reader_request_duration_seconds` | Histogram (was Summary prior to v1.17.0) | The total time (in seconds) taken by queries to VictoriaMetrics `url` for the `query_key` query within the specified scheduler `scheduler_alias`, in the vmanomaly service running in `preset` mode. | `url`, `query_key`, `scheduler_alias`, `preset` |
| `vmanomaly_reader_responses` (named `vmanomaly_reader_response_count` prior to v1.17.0) | Counter | The count of responses received from VictoriaMetrics `url` for the `query_key` query, categorized by `code`, within the specified scheduler `scheduler_alias`, in the vmanomaly service running in `preset` mode. | `url`, `query_key`, `code`, `scheduler_alias`, `preset` |
| `vmanomaly_reader_received_bytes` | Counter | The total number of bytes received in responses for the `query_key` query within the specified scheduler `scheduler_alias`, in the vmanomaly service running in `preset` mode. | `url`, `query_key`, `scheduler_alias`, `preset` |
| `vmanomaly_reader_response_parsing_seconds` | Histogram (was Summary prior to v1.17.0) | The total time (in seconds) taken for data parsing at each `step` (json, dataframe) for the `query_key` query within the specified scheduler `scheduler_alias`, in the vmanomaly service running in `preset` mode. | `step`, `url`, `query_key`, `scheduler_alias`, `preset` |
| `vmanomaly_reader_timeseries_received` | Counter | The total number of timeseries received from VictoriaMetrics for the `query_key` query within the specified scheduler `scheduler_alias`, in the vmanomaly service running in `preset` mode. | `url`, `query_key`, `scheduler_alias`, `preset` |
| `vmanomaly_reader_datapoints_received` | Counter | The total number of datapoints received from VictoriaMetrics for the `query_key` query within the specified scheduler `scheduler_alias`, in the vmanomaly service running in `preset` mode. | `url`, `query_key`, `scheduler_alias`, `preset` |
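
Since these duration metrics are Histograms, latency quantiles can be derived from their `_bucket` series with `histogram_quantile`. The vmalert recording rule below is a sketch; the group and rule names are illustrative.

```yaml
groups:
  - name: vmanomaly-reader-latency     # illustrative group name
    rules:
      # p95 of read-request latency per query, derived from histogram buckets
      - record: vmanomaly:reader_request_duration_seconds:p95
        expr: |
          histogram_quantile(
            0.95,
            sum(rate(vmanomaly_reader_request_duration_seconds_bucket[5m])) by (le, query_key)
          )
```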

[Back to metric sections](#metrics-generated-by-vmanomaly)

### Models behaviour metrics

Label names are described in the [Labelnames](#labelnames) section.

> **Note**: A new label key `model_alias` was introduced with multi-model support in v1.10.0. This label key adjustment was made to preserve unique label sets when writing produced metrics back to VictoriaMetrics.

> **Note**: As part of the self-monitoring metrics enhancement (v1.17.0), new metrics, like `vmanomaly_model_run_errors`, were added. Some metrics changed their type (Summary -> Histogram), like `vmanomaly_model_run_duration_seconds`.

| Metric | Type | Description | Labelnames |
|--------|------|-------------|------------|
| `vmanomaly_model_runs` | Counter | How many successful `stage` (fit, infer, fit_infer) runs occurred for models of class `model_alias` based on results from the `query_key` query, within the specified scheduler `scheduler_alias`, in the vmanomaly service running in `preset` mode. | `stage`, `query_key`, `model_alias`, `scheduler_alias`, `preset` |
| `vmanomaly_model_run_duration_seconds` | Histogram (was Summary prior to v1.17.0) | The total time (in seconds) taken by model invocations during the `stage` (fit, infer, fit_infer), based on the results of the `query_key` query, for models of class `model_alias`, within the specified scheduler `scheduler_alias`, in the vmanomaly service running in `preset` mode. | `stage`, `query_key`, `model_alias`, `scheduler_alias`, `preset` |
| `vmanomaly_model_datapoints_accepted` | Counter | The number of datapoints accepted (excluding NaN or Inf values) by models of class `model_alias` from the results of the `query_key` query during the `stage` (infer, fit_infer), within the specified scheduler `scheduler_alias`, in the vmanomaly service running in `preset` mode. | `stage`, `query_key`, `model_alias`, `scheduler_alias`, `preset` |
| `vmanomaly_model_datapoints_produced` | Counter | The number of datapoints generated by models of class `model_alias` during the `stage` (infer, fit_infer) based on results from the `query_key` query, within the specified scheduler `scheduler_alias`, in the vmanomaly service running in `preset` mode. | `stage`, `query_key`, `model_alias`, `scheduler_alias`, `preset` |
| `vmanomaly_models_active` | Gauge | The number of model instances of class `model_alias` currently available for inference for the `query_key` query, within the specified scheduler `scheduler_alias`, in the vmanomaly service running in `preset` mode. | `query_key`, `model_alias`, `scheduler_alias`, `preset` |
| `vmanomaly_model_runs_skipped` | Counter | The number of times model runs (of class `model_alias`) were skipped in expected situations (e.g., no data for fitting/inference, or no new data to infer on) during the `stage` (fit, infer, fit_infer), based on results from the `query_key` query, within the specified scheduler `scheduler_alias`, in the vmanomaly service running in `preset` mode. | `stage`, `query_key`, `model_alias`, `scheduler_alias`, `preset` |
| `vmanomaly_model_run_errors` | Counter | The number of times model runs (of class `model_alias`) failed due to internal service errors during the `stage` (fit, infer, fit_infer), based on results from the `query_key` query, within the specified scheduler `scheduler_alias`, in the vmanomaly service running in `preset` mode. | `stage`, `query_key`, `model_alias`, `scheduler_alias`, `preset` |
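
A simple alert on failed model runs can be built on top of `vmanomaly_model_run_errors`. The vmalert rule below is a sketch; the group name, alert name, and window are illustrative.

```yaml
groups:
  - name: vmanomaly-models                  # illustrative group name
    rules:
      - alert: VmanomalyModelRunErrors      # illustrative alert name
        # fires when any model run failed within the last 10 minutes
        expr: sum(increase(vmanomaly_model_run_errors[10m])) by (model_alias, query_key, stage) > 0
        labels:
          severity: warning
```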

[Back to metric sections](#metrics-generated-by-vmanomaly)

### Writer behaviour metrics

Label names are described in the [Labelnames](#labelnames) section.

> **Note**: Additional labels (`scheduler_alias`, `preset`) were added to writer and reader metrics in v1.17.0 to improve consistency across the components. Also, the metrics `vmanomaly_writer_request_duration_seconds` and `vmanomaly_writer_request_serialize_seconds` changed their type to `Histogram` (they were `Summary` prior to v1.17.0).

| Metric | Type | Description | Labelnames |
|--------|------|-------------|------------|
| `vmanomaly_writer_request_duration_seconds` | Histogram (was Summary prior to v1.17.0) | The total time (in seconds) taken by write requests to VictoriaMetrics `url` for the `query_key` query within the specified scheduler `scheduler_alias`, in the vmanomaly service running in `preset` mode. | `url`, `query_key`, `scheduler_alias`, `preset` |
| `vmanomaly_writer_responses` (named `vmanomaly_writer_response_count` prior to v1.17.0) | Counter | The count of response codes received from VictoriaMetrics `url` for the `query_key` query, categorized by `code`, within the specified scheduler `scheduler_alias`, in the vmanomaly service running in `preset` mode. | `url`, `code`, `query_key`, `scheduler_alias`, `preset` |
| `vmanomaly_writer_sent_bytes` | Counter | The total number of bytes sent to VictoriaMetrics `url` for the `query_key` query within the specified scheduler `scheduler_alias`, in the vmanomaly service running in `preset` mode. | `url`, `query_key`, `scheduler_alias`, `preset` |
| `vmanomaly_writer_request_serialize_seconds` | Histogram (was Summary prior to v1.17.0) | The total time (in seconds) taken for serializing data for the `query_key` query within the specified scheduler `scheduler_alias`, in the vmanomaly service running in `preset` mode. | `url`, `query_key`, `scheduler_alias`, `preset` |
| `vmanomaly_writer_datapoints_sent` | Counter | The total number of datapoints sent to VictoriaMetrics for the `query_key` query within the specified scheduler `scheduler_alias`, in the vmanomaly service running in `preset` mode. | `url`, `query_key`, `scheduler_alias`, `preset` |
| `vmanomaly_writer_timeseries_sent` | Counter | The total number of timeseries sent to VictoriaMetrics for the `query_key` query within the specified scheduler `scheduler_alias`, in the vmanomaly service running in `preset` mode. | `url`, `query_key`, `scheduler_alias`, `preset` |
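
Write failures surface through the `code` label of `vmanomaly_writer_responses` (an HTTP status code or `connection_error` / `timeout` / `io_error`). A sketch of an alerting rule with an illustrative name, window, and threshold:

```yaml
groups:
  - name: vmanomaly-writer                  # illustrative group name
    rules:
      - alert: VmanomalyWriteFailures       # illustrative alert name
        # non-2xx responses and transport errors over the last 10 minutes
        expr: sum(increase(vmanomaly_writer_responses{code!~"2.."}[10m])) by (url, code) > 0
        labels:
          severity: warning
```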

[Back to metric sections](#metrics-generated-by-vmanomaly)

### Labelnames

- `stage` - stage of the model: 'fit', 'infer', or 'fit_infer' for models that do both simultaneously; see model types.
- `query_key` - query alias from the reader config section.
- `model_alias` - model alias from the models config section. Introduced in v1.10.0.
- `scheduler_alias` - scheduler alias from the schedulers config section. Introduced in v1.11.0.
- `preset` - preset alias for the preset mode of vmanomaly. Introduced in v1.12.0.
- `url` - writer or reader url endpoint.
- `code` - response status code, or `connection_error` / `timeout`.
- `step` - `json` or `dataframe` reading step.

[Back to metric sections](#metrics-generated-by-vmanomaly)

## Logs generated by vmanomaly

The vmanomaly service logs operations, errors, and performance for its components (service, reader, writer), alongside self-monitoring metrics updates. Below is a description of the key logs for each component and the related metrics affected.

`{{X}}` indicates a placeholder in the log message templates described below, which is replaced with the appropriate entity during logging.

> **Note**: Applicable to version v1.17.1 or newer.

> **Note**: By default, vmanomaly uses the INFO logging level. You can change this by specifying the `--loggerLevel` argument. See the command-line arguments documentation.
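
For example, raising the logging level to DEBUG in a containerized setup might look like the docker-compose sketch below. The image tag, mount paths, and port mapping are illustrative assumptions; only the `--loggerLevel` and `--licenseFile` flags are taken from this documentation, and the exact entrypoint arguments may differ between releases.

```yaml
services:
  vmanomaly:
    image: victoriametrics/vmanomaly:v1.18.4   # assumed image tag; pin to the release you run
    volumes:
      - ./config.yaml:/config.yaml             # vmanomaly config with the `monitoring` section
      - ./license.key:/license.key             # license file (see the license check logs below)
    # config path is assumed to be a positional argument; flags raise log verbosity and point to the license
    command: ["/config.yaml", "--licenseFile=/license.key", "--loggerLevel=DEBUG"]
    ports:
      - "8080:8080"                            # expose the pull-model /metrics endpoint
```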

### Startup logs

The vmanomaly service logs important information during the startup process. This includes checking for the license, validating configurations, and setting up schedulers, readers, and writers. Below are the key logs generated during startup, which can help troubleshoot issues with the service's initial configuration or license validation.

**License check.** If no license key or file is provided, the service fails to start and logs an error message. If a license file is provided but cannot be read, the service logs a failure. Log messages:

```
Please provide a license code using --license or --licenseFile arg, or as VM_LICENSE_FILE env. See https://victoriametrics.com/products/enterprise/trial/ to obtain a trial license.
failed to read file {{args.license_file}}: {{error_message}}
```

**Config validation.** If the service's configuration fails to load or does not meet validation requirements, an error message is logged and the service exits. If the configuration is loaded successfully, a confirmation message is logged. Log messages:

```
Config validation failed, please fix these errors: {{error_details}}
Config has been loaded successfully.
```

**Model and data directory setup.** The service checks the environment variables `VMANOMALY_MODEL_DUMPS_DIR` and `VMANOMALY_DATA_DUMPS_DIR` to determine where to store models and data. If these variables are not set, models and data are kept in memory. Please find the on-disk mode details in the corresponding documentation section. Log messages:

```
Using ENV MODEL_DUMP_DIR=`{{model_dump_dir}}` to store anomaly detection models.
ENV MODEL_DUMP_DIR is not set. Models will be kept in RAM between consecutive `fit` calls.
Using ENV DATA_DUMP_DIR=`{{data_dump_dir}}` to store anomaly detection data.
ENV DATA_DUMP_DIR is not set. Models' training data will be stored in RAM.
```

**Scheduler and service initialization.** After the configuration is successfully loaded, the service initializes schedulers and services for each defined `scheduler_alias`. If there are issues with a specific scheduler (e.g., no models or queries found to attach to it), a warning is logged. Once schedulers are initialized, the service logs the list of active schedulers. Log messages:

```
Scheduler {{scheduler_alias}} wrapped and initialized with {{N}} model spec(s).
No model spec(s) found for scheduler `{{scheduler_alias}}`, skipping setting it up.
Active schedulers: {{list_of_schedulers}}.
```

[Back to logging sections](#logs-generated-by-vmanomaly)


### Reader logs

The reader component logs events during the process of querying VictoriaMetrics and retrieving the data necessary for anomaly detection. This includes making HTTP requests, handling SSL, parsing responses, and processing data into formats like DataFrames. The logs help troubleshoot issues such as connection problems, timeout errors, or misconfigured queries.

**Starting a healthcheck request.** When the reader component initializes, it checks whether the VictoriaMetrics endpoint is accessible by sending a request for `_vmanomaly_healthcheck`. Log messages:

```
[Scheduler {{scheduler_alias}}] Max points per timeseries set as: {{vm_max_datapoints_per_ts}}
[Scheduler {{scheduler_alias}}] Reader endpoint SSL error {{url}}: {{error_message}}
[Scheduler {{scheduler_alias}}] Reader endpoint inaccessible {{url}}: {{error_message}}
[Scheduler {{scheduler_alias}}] Reader endpoint timeout {{url}}: {{error_message}}
```

**No data found (`query_from_last_seen_timestamp: False`).** A warning log is generated when no data is found in the requested range. This could indicate that the query was misconfigured or that no new data exists for the requested time period. Log message format:

```
[Scheduler {{scheduler_alias}}] No data between {{start_s}} and {{end_s}} for query "{{query_key}}"
```

**No unseen data found (`query_from_last_seen_timestamp: True`).** A warning log is generated when no new data is returned, i.e. all data has already been seen in previous inference step(s). This helps identify situations where data for inference has already been processed (see also the VmReader `adjust` flag). Log message format:

```
[Scheduler {{scheduler_alias}}] No unseen data between {{start_s}} and {{end_s}} for query "{{query_key}}"
```

**Connection or timeout errors.** When the reader fails to retrieve data due to connection or timeout errors, a warning log is generated. These errors could result from network issues, incorrect query endpoints, or VictoriaMetrics being temporarily unavailable. Log message format:

```
[Scheduler {{scheduler_alias}}] Error querying {{query_key}} for {{url}}: {{error_message}}
```

**Max datapoints warning.** If the requested query range (defined by the `fit_every` or `infer_every` scheduler args) exceeds the maximum number of datapoints allowed by VictoriaMetrics, a warning log is generated and the request is split into multiple intervals (option available since v1.14.1). This ensures that the request does not violate VictoriaMetrics constraints. Log message:

```
[Scheduler {{scheduler_alias}}] Query "{{query_key}}" from {{start_s}} to {{end_s}} with step {{step}} may exceed max datapoints per timeseries and will be split...
```

**Multi-tenancy warnings.** If the reader detects issues related to missing or misconfigured multi-tenancy labels (supported since v1.16.2), a warning log is generated to indicate the issue. See the multi-tenancy documentation for additional details. Log message format:

```
The label vm_account_id was not found in the label set of {{query_key}}, but tenant_id='multitenant' is set in reader configuration...
```

**Metrics updated in read operations.** During successful query execution, the following reader self-monitoring metrics are updated:

- `vmanomaly_reader_request_duration_seconds`: records the time (in seconds) taken to complete the query request.
- `vmanomaly_reader_responses`: tracks the number of response codes received from VictoriaMetrics.
- `vmanomaly_reader_received_bytes`: counts the number of bytes received in the response.
- `vmanomaly_reader_response_parsing_seconds`: records the time spent parsing the response into different formats (e.g., JSON or DataFrame).
- `vmanomaly_reader_timeseries_received`: tracks how many timeseries were retrieved in the query result.
- `vmanomaly_reader_datapoints_received`: counts the number of datapoints retrieved in the query result.

**Metrics skipped in case of failures.** If an error occurs (connection or timeout), `vmanomaly_reader_received_bytes`, `vmanomaly_reader_timeseries_received`, and `vmanomaly_reader_datapoints_received` are not incremented, because no valid data was received.

[Back to logging sections](#logs-generated-by-vmanomaly)

### Service logs

The model component (wrapped in a service) logs operations during the fitting and inference stages for each model spec attached to a particular scheduler `scheduler_alias`. These logs report skipped runs, connection or timeout issues, invalid data points, and successful or failed model operations.

**Skipped runs.** When there are insufficient valid data points to fit or infer using a model, the run is skipped and a warning log is generated. This can occur when the query returns no new data or when the data contains invalid values (e.g., NaN, INF). The skipped run is also reflected in the `vmanomaly_model_runs_skipped` metric. Log messages:

When there are insufficient valid data points (at least 1 for online models and 2 for offline models):

```
[Scheduler {{scheduler_alias}}] Skipping run for stage 'fit' for model '{{model_alias}}' (query_key: {{query_key}}): Not enough valid data to fit: {{valid_values_cnt}}
```

When all the received timestamps during an infer call have already been processed, meaning the anomaly_score has already been produced for those points:

```
[Scheduler {{scheduler_alias}}] Skipping run for stage 'infer' for model '{{model_alias}}' (query_key: {{query_key}}): No unseen data to infer on.
```

When the model fails to produce any valid or finite outputs (such as anomaly_score):

```
[Scheduler {{scheduler_alias}}] Skipping run for stage 'infer' for model '{{model_alias}}' (query_key: {{query_key}}): No (valid) datapoints produced.
```

**Errors during model execution.** If the model fails to fit or infer data due to internal service errors or model spec misconfigurations, an error log is generated, and the error is also reflected in the `vmanomaly_model_run_errors` metric. This can occur during both fit and infer stages. Log messages:

```
[Scheduler {{scheduler_alias}}] Error during stage 'fit' for model '{{model_alias}}' (query_key: {{query_key}}): {{error_message}}
[Scheduler {{scheduler_alias}}] Error during stage 'infer' for model '{{model_alias}}' (query_key: {{query_key}}): {{error_message}}
```

**Model instance created during inference.** In cases where an online model instance is created during the inference stage (without a prior fit, a feature introduced in v1.15.2), a debug log is produced. This helps track models that are created dynamically based on incoming data. Log message:

```
[Scheduler {{scheduler_alias}}] Model instance '{{model_alias}}' created for '{{query_key}}' during inference.
```

**Successful model runs.** When a model successfully fits, logs track the number of valid datapoints processed and the time taken for the operation. These logs are accompanied by updates to self-monitoring metrics like `vmanomaly_model_runs`, `vmanomaly_model_run_duration_seconds`, `vmanomaly_model_datapoints_accepted`, and `vmanomaly_model_datapoints_produced`. Log messages:

For non-rolling models:

```
[Scheduler {{scheduler_alias}}] Fitting on {{valid_values_cnt}}/{{total_values_cnt}} valid datapoints for "{{query_key}}" using model "{{model_alias}}".
[Scheduler {{scheduler_alias}}] Model '{{model_alias}}' fit completed in {{model_run_duration}} seconds for {{query_key}}.
```

For rolling models (combined stage):

```
[Scheduler {{scheduler_alias}}] Fit-Infer on {{datapoint_count}} points for "{{query_key}}" using model "{{model_alias}}".
```

**Metrics updated in model runs.** During successful fit or infer operations, the following self-monitoring metrics are updated for each run:

- `vmanomaly_model_runs`: tracks how many times the model ran (fit, infer, or fit_infer) for a specific `query_key`.
- `vmanomaly_model_run_duration_seconds`: records the total time (in seconds) for the model invocation, based on the results of the `query_key` query.
- `vmanomaly_model_datapoints_accepted`: the number of valid datapoints processed by the model during the run.
- `vmanomaly_model_datapoints_produced`: the number of datapoints generated by the model during inference.
- `vmanomaly_models_active`: tracks the number of models currently available for inference for a specific `query_key`.

**Metrics skipped in case of failures.** If a model run fails due to an error or if no valid data is available, metrics such as `vmanomaly_model_datapoints_accepted`, `vmanomaly_model_datapoints_produced`, and `vmanomaly_model_run_duration_seconds` are not updated.


[Back to logging sections](#logs-generated-by-vmanomaly)

### Writer logs

The writer component logs events during the process of sending produced data (like anomaly_score metrics) to VictoriaMetrics. This includes data preparation, serialization, and network requests to VictoriaMetrics endpoints. The logs help identify issues in data transmission, such as connection errors or invalid data points, and track the performance of write requests.

**Starting a write request.** A debug-level log is produced when the writer component starts writing data to VictoriaMetrics. It includes details like the number of datapoints, bytes of payload, and the query being written. This is useful for tracking the payload size and performance at the start of the request. Log message:

```
[Scheduler {{scheduler_alias}}] POST {{url}} with {{N}} datapoints, {{M}} bytes of payload, for {{query_key}}
```

**No valid data points.** A warning log is generated if there are no valid datapoints to write (i.e., all are NaN or unsupported values like INF). This indicates that the writer will not send any data to VictoriaMetrics. Log message:

```
[Scheduler {{scheduler_alias}}] No valid datapoints to save for metric: {{query_key}}
```

**Connection, timeout, or I/O errors.** When the writer fails to send data due to connection, timeout, or I/O errors, an error log is generated. These errors often arise from network problems, incorrect URLs, or VictoriaMetrics being unavailable. The log includes details of the failed request and the reason for the failure. Log messages:

```
[Scheduler {{scheduler_alias}}] Cannot write {{N}} points for {{query_key}}: connection error {{url}} {{error_message}}
[Scheduler {{scheduler_alias}}] Cannot write {{N}} points for {{query_key}}: timeout for {{url}} {{error_message}}
[Scheduler {{scheduler_alias}}] Cannot write {{N}} points for {{query_key}}: I/O error for {{url}} {{error_message}}
```

**Multi-tenancy warnings.** If `tenant_id` is set to `multitenant` but the `vm_account_id` label is missing from the query result, or vice versa, a warning log is produced (supported since v1.16.2). This helps in debugging label set issues that may occur due to the multi-tenant configuration; see the multi-tenancy documentation for details. Log messages:

```
The label vm_account_id was not found in the label set of {{query_key}}, but tenant_id='multitenant' is set in writer...
The label set for the metric {{query_key}} contains multi-tenancy labels, but the write endpoint is configured for single-tenant mode (tenant_id != 'multitenant')...
```

**Metrics updated in write operations.** During a successful write of non-empty data, the following self-monitoring metrics are updated:

- `vmanomaly_writer_request_duration_seconds`: records the time (in seconds) taken to complete the write request.
- `vmanomaly_writer_sent_bytes`: tracks the number of bytes sent in the request.
- `vmanomaly_writer_responses`: captures the HTTP response code returned by VictoriaMetrics. In case of connection, timeout, or I/O errors, a specific error code (`connection_error`, `timeout`, or `io_error`) is recorded instead.
- `vmanomaly_writer_request_serialize_seconds`: records the time taken for data serialization.
- `vmanomaly_writer_datapoints_sent`: counts the number of valid datapoints that were successfully sent.
- `vmanomaly_writer_timeseries_sent`: tracks the number of timeseries sent to VictoriaMetrics.

**Metrics skipped in case of failures.** If an error occurs (connection, timeout, or I/O error), only `vmanomaly_writer_request_duration_seconds` is updated, with the appropriate error code.

[Back to logging sections](#logs-generated-by-vmanomaly)