

Common mistakes

Place aggregation agents behind a load balancer

Partial aggregation, where an aggregation agent receives only a subset of the data that satisfies the match expression, is not acceptable. It produces incorrect results that are unusable and cannot be compared with the equivalent recording rules.

To keep aggregation results consistent, either process all the data on a single VMAgent or shard it across multiple VMAgents by metric name.
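
The difference between the two setups can be illustrated with a toy model. The sketch below is not VMAgent code: the sample data, the `route` helper and both routing functions are hypothetical, and a plain average stands in for a real aggregation output. It only shows why round-robin balancing yields several conflicting partial results, while routing by metric name does not.

```python
from collections import defaultdict
from zlib import crc32

# Hypothetical samples: (metric name, instance, value).
SAMPLES = [
    ("http_requests", "a", 10.0),
    ("http_requests", "b", 20.0),
    ("http_requests", "c", 60.0),
    ("queue_length", "a", 4.0),
    ("queue_length", "b", 8.0),
]


def aggregate_avg(samples):
    """Average per metric name, as one aggregator would compute it."""
    grouped = defaultdict(list)
    for name, _instance, value in samples:
        grouped[name].append(value)
    return {name: sum(vals) / len(vals) for name, vals in grouped.items()}


def route(samples, agents, pick):
    """Distribute samples over `agents` buckets using the `pick` function."""
    buckets = [[] for _ in range(agents)]
    for i, sample in enumerate(samples):
        buckets[pick(i, sample) % agents].append(sample)
    return buckets


def round_robin(i, _sample):
    # A load balancer that ignores the sample content.
    return i


def by_metric_name(_i, sample):
    # All samples of one metric always land on the same agent.
    return crc32(sample[0].encode())


# Result of a single aggregator that sees all the data.
expected = aggregate_avg(SAMPLES)

# Two agents behind a round-robin balancer: each emits its own partial average.
balanced = [aggregate_avg(b) for b in route(SAMPLES, 2, round_robin)]

# Two agents with sharding by metric name: each metric is aggregated once.
sharded = [aggregate_avg(b) for b in route(SAMPLES, 2, by_metric_name)]
```

In this model both balanced agents report `http_requests`, and neither value matches `expected`. With sharding by metric name, every metric appears in exactly one agent's output and equals the single-aggregator result.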

Create separate aggregator for each recording rule

As mentioned in the use case scenarios, stream aggregation can be used as a substitute for recording rules. However, a straightforward one-to-one conversion of recording rules into a stream aggregation config can lead to inefficient resource usage on the component where it is configured (VMAgent or VMSingle).

To optimize this, we recommend merging aggregations that differ only in their match expressions. For example, the following recording rules:

- expr: sum(rate(node_cpu_seconds_total{mode!="idle",mode!="iowait",mode!="steal"}[3m])) BY (instance)
  record: instance:node_cpu:rate:sum
- expr: sum(rate(node_network_receive_bytes_total[3m])) BY (instance)
  record: instance:node_network_receive_bytes:rate:sum
- expr: sum(rate(node_network_transmit_bytes_total[3m])) BY (instance)
  record: instance:node_network_transmit_bytes:rate:sum
- expr: sum(rate(node_cpu_seconds_total{mode!="idle",mode!="iowait",mode!="steal"}[5m]))
  record: cluster:node_cpu:sum_rate5m

can be converted to aggregation rules:

- match: node_cpu_seconds_total{mode!="idle",mode!="iowait",mode!="steal"}
  interval: 3m
  outputs: [rate_sum]
  by:
  - instance
  output_relabel_configs:
  - source_labels: [__name__]
    target_label: __name__
    replacement: instance:node_cpu:rate:sum
- match: node_network_receive_bytes_total
  interval: 3m
  outputs: [rate_sum]
  by:
  - instance
  output_relabel_configs:
  - source_labels: [__name__]
    target_label: __name__
    replacement: instance:node_network_receive_bytes:rate:sum
- match: node_network_transmit_bytes_total
  interval: 3m
  outputs: [rate_sum]
  by:
  - instance
  output_relabel_configs:
  - source_labels: [__name__]
    target_label: __name__
    replacement: instance:node_network_transmit_bytes:rate:sum
- match: node_cpu_seconds_total{mode!="idle",mode!="iowait",mode!="steal"}
  interval: 5m
  outputs: [rate_sum]
  output_relabel_configs:
  - source_labels: [__name__]
    target_label: __name__
    replacement: cluster:node_cpu:sum_rate5m

Note that the first three aggregation rules differ only in match, so they can be merged:

- match:
  - node_cpu_seconds_total{mode!="idle",mode!="iowait",mode!="steal"}
  - node_network_receive_bytes_total
  - node_network_transmit_bytes_total
  interval: 3m
  outputs: [rate_sum]
  by:
  - instance
  output_relabel_configs:
  - source_labels: [__name__]
    target_label: __name__
    # The lazy (.+?) group lets the optional suffixes be stripped from $1.
    regex: "(.+?)(_seconds)?(_total)?:.+"
    replacement: "instance:$1:rate:sum"
- match: node_cpu_seconds_total{mode!="idle",mode!="iowait",mode!="steal"}
  interval: 5m
  outputs: [rate_sum]
  output_relabel_configs:
  - source_labels: [__name__]
    target_label: __name__
    replacement: cluster:node_cpu:sum_rate5m

Note: a separate aggregator for a particular match expression is justified only when a single aggregator cannot keep up with all the data pushed to it within an aggregation interval.
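
The rename performed by the merged rule is easy to get wrong, so it may be worth checking the regex before deploying it. The sketch below uses Python's `re` module as a stand-in for the relabeling engine, which is an assumption; the series names in `AGGREGATED` assume the `<metric>:<interval>_by_<labels>_<output>` naming of aggregated series, and the `rename` helper is hypothetical. It compares a greedy first group with a lazy one.

```python
import re

# Assumed names of the aggregated series before output relabeling.
AGGREGATED = [
    "node_cpu_seconds_total:3m_by_instance_rate_sum",
    "node_network_receive_bytes_total:3m_by_instance_rate_sum",
    "node_network_transmit_bytes_total:3m_by_instance_rate_sum",
]

# Greedy first group: it swallows the optional suffixes as well.
GREEDY = r"(.+)(_seconds)?(_total)?:.+"
# Lazy first group: it leaves the suffixes to the optional groups.
LAZY = r"(.+?)(_seconds)?(_total)?:.+"


def rename(pattern, name, template=r"instance:\1:rate:sum"):
    """Apply a fully anchored regex; keep the name if it does not match."""
    match = re.fullmatch(pattern, name)
    return match.expand(template) if match else name


greedy_names = [rename(GREEDY, n) for n in AGGREGATED]
lazy_names = [rename(LAZY, n) for n in AGGREGATED]

for before, after in zip(AGGREGATED, lazy_names):
    print(f"{before} -> {after}")
```

With the greedy group the suffixes stay in the captured name, for example `instance:node_cpu_seconds_total:rate:sum`. The lazy variant reproduces the names of the original recording rules.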

Use identical -remoteWrite.streamAggr.config for all remote writes

As described in the previous case, every additional aggregator increases resource usage. Passing n identical -remoteWrite.streamAggr.config configurations, one per -remoteWrite.url, therefore requires n * x resources, where x is the cost of a single copy of the configuration.

As an optimization, we suggest using -streamAggr.config as a replacement for -remoteWrite.streamAggr.config. It places the global aggregator in front of all remote writes, which helps to reduce resource usage.

Treat aggregated metrics in the same manner as original ones

Stream aggregation can keep the name of the source metric for the aggregation result when keep_metric_names: true is set. However, graphs and alerts that were built for the raw metric can become incorrect when applied to the aggregated one.

Dashboards and alerts should be updated according to aggregation configurations.
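
One way such a breakage can happen is sketched below as a toy model, not as VictoriaMetrics code. The `RAW` series, the `aggregate_sum_by` helper and the `select` helper are all hypothetical. The aggregation keeps the metric name but drops the `instance` label, so a panel that still filters by `instance` silently returns nothing.

```python
# Hypothetical raw series of a single metric.
RAW = [
    {"__name__": "requests_total", "job": "api", "instance": "a", "value": 5.0},
    {"__name__": "requests_total", "job": "api", "instance": "b", "value": 7.0},
]


def aggregate_sum_by(series, by):
    """Sum values grouped by the `by` labels, keeping the source metric name."""
    totals = {}
    for s in series:
        key = (s["__name__"], *(s[label] for label in by))
        totals[key] = totals.get(key, 0.0) + s["value"]
    return [
        {"__name__": key[0], **dict(zip(by, key[1:])), "value": total}
        for key, total in totals.items()
    ]


def select(series, **matchers):
    """Return the series whose labels equal every matcher."""
    return [s for s in series if all(s.get(k) == v for k, v in matchers.items())]


aggregated = aggregate_sum_by(RAW, by=["job"])

# The same selector, applied to the raw and to the aggregated data.
per_instance_raw = select(RAW, instance="a")
per_instance_agg = select(aggregated, instance="a")

print(len(per_instance_raw), len(per_instance_agg))
```

The metric name is unchanged, so the old query still parses and runs, which makes this kind of breakage easy to miss.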