'any' type is supported starting from Go1.18. Let's consistently use it
instead of 'interface{}' type across the code base, since `any` is easier to read than 'interface{}'.
The change adds a new docs section with examples on how source can be
overridden. It should address questions like
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/6536
While there, fix the example in `external.alert.source` cmd-line flag
and docker-compose examples.
### Checklist
The following checks are **mandatory**:
- [x] My change adheres [VictoriaMetrics contributing
guidelines](https://docs.victoriametrics.com/contributing/).
---------
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Reason for revert:
There are many statsd servers exist:
- https://github.com/statsd/statsd - classical statsd server
- https://docs.datadoghq.com/developers/dogstatsd/ - statsd server from DataDog built into DatDog Agent ( https://docs.datadoghq.com/agent/ )
- https://github.com/avito-tech/bioyino - high-performance statsd server
- https://github.com/atlassian/gostatsd - statsd server in Go
- https://github.com/prometheus/statsd_exporter - statsd server, which exposes the aggregated data as Prometheus metrics
These servers can be used for efficient aggregating of statsd data and sending it to VictoriaMetrics
according to https://docs.victoriametrics.com/#how-to-send-data-from-graphite-compatible-agents-such-as-statsd (
the https://github.com/prometheus/statsd_exporter can be scraped as usual Prometheus target
according to https://docs.victoriametrics.com/#how-to-scrape-prometheus-exporters-such-as-node-exporter ).
Adding support for statsd data ingestion protocol into VictoriaMetrics makes sense only if it provides
significant advantages over the existing statsd servers, while has no significant drawbacks comparing
to existing statsd servers.
The main advantage of statsd server built into VictoriaMetrics and vmagent - getting rid of additional statsd server.
The main drawback is non-trivial and inconvenient streaming aggregation configs, which must be used for the ingested statsd metrics (
see https://docs.victoriametrics.com/stream-aggregation/ ). These configs are incompatible with the configs for standalone statsd servers.
So you need to manually translate configs of the used statsd server to stream aggregation configs when migrating
from standalone statsd server to statsd server built into VictoriaMetrics (or vmagent).
Another important drawback is that it is very easy to shoot yourself in the foot when using built-in statsd server
with the -statsd.disableAggregationEnforcement command-line flag or with improperly configured streaming aggregation.
In this case the ingested statsd metrics will be stored to VictoriaMetrics as is without any aggregation.
This may result in high CPU usage during data ingestion, high disk space usage for storing all the unaggregated
statsd metrics and high CPU usage during querying, since all the unaggregated metrics must be read, unpacked and processed
during querying.
P.S. Built-in statsd server can be added to VictoriaMetrics and vmagent after figuring out more ergonomic
specialized configuration for aggregating of statsd metrics. The main requirements for this configuration:
- easy to write, read and update (ideally it should work out of the box for most cases without additional configuration)
- hard to misconfigure (e.g. hard to shoot yourself in the foot)
It would be great if this configuration will be compatible with the configuration of the most widely used statsd server.
In the mean time it is recommended continue using external statsd server.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/6265
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/5053
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5052
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/206
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4600
This reverts commit 5a3abfa041.
Reason for revert: exemplars aren't in wide use because they have numerous issues which prevent their adoption (see below).
Adding support for examplars into VictoriaMetrics introduces non-trivial code changes. These code changes need to be supported forever
once the release of VictoriaMetrics with exemplar support is published. That's why I don't think this is a good feature despite
that the source code of the reverted commit has an excellent quality. See https://docs.victoriametrics.com/goals/ .
Issues with Prometheus exemplars:
- Prometheus still has only experimental support for exemplars after more than three years since they were introduced.
It stores exemplars in memory, so they are lost after Prometheus restart. This doesn't look like production-ready feature.
See 0a2f3b3794/content/docs/instrumenting/exposition_formats.md (L153-L159)
and https://prometheus.io/docs/prometheus/latest/feature_flags/#exemplars-storage
- It is very non-trivial to expose exemplars alongside metrics in your application, since the official Prometheus SDKs
for metrics' exposition ( https://prometheus.io/docs/instrumenting/clientlibs/ ) either have very hard-to-use API
for exposing histograms or do not have this API at all. For example, try figuring out how to expose exemplars
via https://pkg.go.dev/github.com/prometheus/client_golang@v1.19.1/prometheus .
- It looks like exemplars are supported for Histogram metric types only -
see https://pkg.go.dev/github.com/prometheus/client_golang@v1.19.1/prometheus#Timer.ObserveDurationWithExemplar .
Exemplars aren't supported for Counter, Gauge and Summary metric types.
- Grafana has very poor support for Prometheus exemplars. It looks like it supports exemplars only when the query
contains histogram_quantile() function. It queries exemplars via special Prometheus API -
https://prometheus.io/docs/prometheus/latest/querying/api/#querying-exemplars - (which is still marked as experimental, btw.)
and then displays all the returned exemplars on the graph as special dots. The issue is that this doesn't work
in production in most cases when the histogram_quantile() is calculated over thousands of histogram buckets
exposed by big number of application instances. Every histogram bucket may expose an exemplar on every timestamp shown on the graph.
This makes the graph unusable, since it is litterally filled with thousands of exemplar dots.
Neither Prometheus API nor Grafana doesn't provide the ability to filter out unneeded exemplars.
- Exemplars are usually connected to traces. While traces are good for some
I doubt exemplars will become production-ready in the near future because of the issues outlined above.
Alternative to exemplars:
Exemplars are marketed as a silver bullet for the correlation between metrics, traces and logs -
just click the exemplar dot on some graph in Grafana and instantly see the corresponding trace or log entry!
This doesn't work as expected in production as shown above. Are there better solutions, which work in production?
Yes - just use time-based and label-based correlation between metrics, traces and logs. Assign the same `job`
and `instance` labels to metrics, logs and traces, so you can quickly find the needed trace or log entry
by these labes on the time range with the anomaly on metrics' graph.
- Export streamaggr.LoadFromData() function, so it could be used in tests outside the lib/streamaggr package.
This allows removing a hack with creation of temporary files at TestRemoteWriteContext_TryPush_ImmutableTimeseries.
- Move common code for mustParsePromMetrics() function into lib/prompbmarshal package,
so it could be used in tests for building []prompbmarshal.TimeSeries from string.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/6205
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/6206
Move the code responsible for relabelCtx clearing into deferred function.
This allows making more clear the remoteWriteCtx.TryPush code.
This is a follow-up for 879771808b
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/6205
While at it, clarify the description of the bugfix at docs/CHANGELOG.md
Use metricsql.IsLikelyInvalid() function for determining whether the given query is likely invalid,
e.g. there is high change the query is incorrectly written, so it will return unexpected results.
The query is invalid most of the time if it passes something other than series selector into rollup function.
For example:
- rate(sum(foo))
- rate(foo + bar)
- rate(foo > bar)
Improtant note: the query is considered valid if it misses the lookbehind window in square brackes inside rollup function,
e.g. rate(foo), since this is very convenient MetricsQL extention to PromQL, and this query returns the expected results
most of the time.
Other unsafe query types can be added in the future into metricsql.IsLikelyInvalid().
TODO: probably, the -search.disableImplicitConversion command-line flag must be set by default in the future releases of VictoriaMetrics.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4338
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/6180
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/6450
- Clarify docs for `Ignore aggregation intervals on start` feature.
- Make more clear the code dealing with ignoreFirstIntervals at aggregator.runFlusher() functions.
It is better from readability and maintainability PoV using distinct a.flush() calls
for distinct cases instead of merging them into a single a.flush() call.
- Take into account the first incomplete interval when tracking the number of skipped aggregation intervals,
since this behaviour is easier to understand by the end users.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/6137
Previously the in-memory buffer could remain unflushed for long periods of time under low ingestion rate.
The ingested logs weren't visible for search during this time.
### Describe Your Changes
- added stale metrics counters for input and output samples
- added labels for aggregator metrics =>
`name="{rwctx}:{aggrId}:{aggrSuffix}"`
- rwctx - global or number starting from 1
- aggrid - aggregator id starting from 1
- aggrSuffix - <interval>_(by|without)_label1_label2_labeln
e.g: `name="global:1:1m_without_instance_pod"`
### Checklist
The following checks are **mandatory**:
- [ ] My change adheres [VictoriaMetrics contributing
guidelines](https://docs.victoriametrics.com/contributing/).
---------
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Co-authored-by: hagen1778 <roman@victoriametrics.com>
### Describe Your Changes
Fixes#6453
### Checklist
The following checks are **mandatory**:
- [ ] My change adheres [VictoriaMetrics contributing
guidelines](https://docs.victoriametrics.com/contributing/).
### Describe Your Changes
- Fixed the update of the relative time range when `Execute Query` is
clicked
- Optimized server requests: now, if an error occurs in the `/query`
request, the `/hits` request will not be executed.
#6345 (duplicates: #6440, #6312)
The TryParseTimestampRFC3339Nano() must properly parse RFC3339 timestamps with timezone offsets.
While at it, make tryParseTimestampISO8601 function private in order to prevent
from improper usage of this function from outside the lib/logstorage package.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/6508
This reverts commit 6e395048d3.
Reason for revert: the previous logic was correct.
The purpose of `-search.maxSamplesPerQuery` command-line flag is to limit the amounts of CPU resources,
which could be taken by a single query - see https://docs.victoriametrics.com/#resource-usage-limits .
VictoriaMetrics processes samples in blocks during querying - it reads the block, then unpacks it,
then filters out samples outside the selected time range. This means that it _spends CPU time_
on reading and unpacking of _all the samples_ in every block on the requested time range,
even if only a single sample per each block matches the given time range.
The previous logic was effectively limiting CPU time a single query could take.
The new logic fails limiting CPU time a single query could take in some pathological cases
when only a small fraction of samples per each requested block fit the requested time range.
This allows performing multiplication DoS-attacks by querying very narrow time ranges over historical blocks,
which tend to be full. For example, if the `-search.maxSamplesPerQuery` equals to a billion,
and the query requests a single sample out of 8K samples per each block, this means that the query
may unpack a billion of such blocks without exceeding the limit, e.g. it may unpack and process 8K*1e9=8e12 samples.
This is not what the resource usage limits were created for originally - see https://docs.victoriametrics.com/#resource-usage-limits
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5851
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/6464
* rm extra interface method for rw Client, as it has low applicability
and doesn't fit multitenancy well
* add `GetDroppedRows` method instead
Signed-off-by: hagen1778 <roman@victoriametrics.com>
- Added a bar chart displaying the number of log entries over a time
range.
#6404
- When `_msg` is empty, all fields are displayed in a single line.
- Added double quotes when copying pairs: `key: "value"`.
- Minor style adjustments.
Check for ranged vector arguments in aggregate expressions when
`-search.disableImplicitConversion` or `-search.logImplicitConversion`
are enabled.
For example, `sum(up[5m])` will fail to execute if these flags are set.
### Describe Your Changes
Please provide a brief description of the changes you made. Be as
specific as possible to help others understand the purpose and impact of
your modifications.
### Checklist
The following checks are **mandatory**:
- [*] My change adheres [VictoriaMetrics contributing
guidelines](https://docs.victoriametrics.com/contributing/).
---------
Signed-off-by: hagen1778 <roman@victoriametrics.com>
This change fixes the following panic:
```
2024-06-04T11:16:52.899Z warn app/vmauth/auth_config.go:353 cannot discover backend SRV records for http://srv+localhost:8080: lookup localhost on 10.100.10.4:53: server misbehaving; use it literally
panic: runtime error: integer divide by zero
goroutine 9 [running]:
github.com/VictoriaMetrics/VictoriaMetrics/lib/httpserver.handlerWrapper.func1()
/Users/lhhdz/wd/projects/go/VictoriaMetrics/lib/httpserver/httpserver.go:291 +0x58
panic({0x103115100?, 0x10338d700?})
/Users/lhhdz/go/pkg/mod/golang.org/toolchain@v0.0.1-go1.22.3.darwin-arm64/src/runtime/panic.go:770 +0x124
main.getLeastLoadedBackendURL({0x0?, 0x22?, 0x1400014757b?}, 0x1400013c120?)
/Users/lhhdz/wd/projects/go/VictoriaMetrics/app/vmauth/auth_config.go:473 +0x210
main.(*URLPrefix).getBackendURL(0x140000aa080)
/Users/lhhdz/wd/projects/go/VictoriaMetrics/app/vmauth/auth_config.go:312 +0xb8
```
---------
Co-authored-by: Haley Wang <haley@victoriametrics.com>
Add support for markdown format and emoji for the `_msg` field in the
"Group" view.
Add markdown rendering toggle. Disabled by default. Value is stored in
`localStorage`.
It is better to remove deprecated flag completely, so vmctl will
fail if this flag is used and user can immediately fix the issue.
Before, flag was ignored and it is worse then fail fast.
follow-up after 8b46bb0c41 (diff-2bfab3db5cc1baf4c6d3ff6b19901926e3bdf4411ec685dac973e5fcff1c723b)
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* adds idleConnTimeout flag, which must reduce probability of `broken
pipe` and `connection reset` errors.
* one-time retry trivial network requests for the same backend
---------
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Co-authored-by: hagen1778 <roman@victoriametrics.com>
* deprecate `--vm-disable-progress-bar` in favour of `--disable-progress-bar`
* new `--disable-progress-bar` consistently disables usage of progress bar
for all migration modes.
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/6367
---------
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Co-authored-by: hagen1778 <roman@victoriametrics.com>
Unsupported path is already handled by `lib/httpserver`.
This prevents from misleading errors in logs caused by double-writing response headers.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* "*.idleConnTimeout" flags must reduce probability of `write: broken
pipe` and `read: connection reset by peer` errors Those errors may occur
if remote server closes TCP socket for connection, while it's still
exist at client.
* single time retries for `write: broken pipe` and `read: connection
reset by peer` must handle a case for incorrectly configured timeouts at
middleware proxies, mitigate minor network issues.
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5661
### Describe Your Changes
Please provide a brief description of the changes you made. Be as
specific as possible to help others understand the purpose and impact of
your modifications.
---------
Co-authored-by: Roman Khavronenko <roman@victoriametrics.com>
It allows to create alert for possible item drops at indexdb. It may
happen, if ingested metric size exceeds max indexdb item size.
---------
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Co-authored-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
Co-authored-by: hagen1778 <roman@victoriametrics.com>
* It must reduce memory usage for misbehaving clients. Since
VictoriaMetrics stores sparse index inmemory.
* Reduce disk space usage for indexdb.
* Prevent possible indexDB items drops.
* It may trigger slow insert and new timeseries registration due to
default value for flag change
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/6176
---------
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
This PR fixes an issue where parsing long `_msg` values caused errors,
resulting in some log records not being displayed.
The error occurred due to partial processing of strings. In some cases,
a long record could be split into multiple chunks, causing only part of
the record to be processed instead of the entire entry.
#6281
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
Prevent excessive resource usage when stream aggregation config file
contains no matchers by prevent pushing data into Aggregators object.
Before this change a lot of extra work was invoked without reason.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Occasionally, vmagent sends empty blocks to downstream servers. If a
downstream server returns an unexpected response, vmagent gets stuck in
a retry loop. While vmagent handles 400 and 409 errors, there are
various prometheus remote write implementations that return different
error codes. For example, vector returns a 422 error. To mitigate the
risk of vmagent getting stuck in a retry loop, it is advisable to skip
sending empty blocks to downstream servers.
Co-authored-by: hao.peng <hao.peng@smartx.com>
Co-authored-by: Zhu Jiekun <jiekun.dev@gmail.com>
Co-authored-by: hagen1778 <roman@victoriametrics.com>
* adds datadog extensions for statsd:
- multiple packed values (v1.1)
- additional types distribution, histogram
* adds type check and append metric type to the labels with special tag
name `__statsd_metric_type__`. It simplifies streaming aggregation
config.
---------
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Co-authored-by: hagen1778 <roman@victoriametrics.com>
Allocations are reduced by implementing custom json parser via fastjson
lib.
The change also re-uses `promInstant` object in attempt to reduce number
of
allocations when parsing big responses, as usually happens with heavy
recording rules.
```
name old allocs/op new allocs/op delta
ParsePrometheusResponse/Instant-10 9.65k ± 0% 5.60k ± 0% ~ (p=1.000 n=1+1)
```
Signed-off-by: hagen1778 <roman@victoriametrics.com>
---------
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Allocations are reduced by re-using the byte buffer when converting
labels to string keys.
```
name old allocs/op new allocs/op delta
GetStaleSeries-10 703 ± 0% 203 ± 0% ~ (p=1.000 n=1+1)
```
Signed-off-by: hagen1778 <roman@victoriametrics.com>
it's needed to remove Summary metric type from the global state of
metrics package. metrics package tracks each bucket of summary and
periodically swaps old buckets with new.
Simple set unregister is not enough to release memory used by Set
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/6247
* FEATURE: [vmagent](https://docs.victoriametrics.com/vmagent.html):
allow configuring `-remoteWrite.disableOnDiskQueue` and
`-remoteWrite.dropSamplesOnOverload` cmd-line flags per each
`-remoteWrite.url`. See this [pull
request](https://github.com/VictoriaMetrics/VictoriaMetrics/pull/6065).
Thanks to @rbizos for implementaion!
* FEATURE: [vmagent](https://docs.victoriametrics.com/vmagent.html): add
labels `path` and `url` to metrics
`vmagent_remotewrite_push_failures_total` and
`vmagent_remotewrite_samples_dropped_total`. Now number of failed pushes
and dropped samples can be tracked per `-remoteWrite.url`.
---------
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Co-authored-by: Raphael Bizos <r.bizos@criteo.com>
### Describe Your Changes
Added makefile rule for `GOARCH=loong64` to support building all
VictoriaMetrics components on the `loongarch64` platform.
### Checklist
The following checks are **mandatory**:
* [X] My change adheres [VictoriaMetrics contributing
guidelines](https://docs.victoriametrics.com/contributing/).
Signed-off-by: qiangxuhui <qiangxuhui@loongson.cn>
This code adds Exemplars to VMagent and the promscrape parser adhering
to OpenMetrics Specifications. This will allow forwarding of exemplars
to Prometheus and other third party apps that support OpenMetrics specs.
---------
Signed-off-by: Ted Possible <ted_possible@cable.comcast.com>
When at least one remote write has deduplication configured it cleans up
timeseries while they can be in use by another remote write without
deduplication
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/6205
---------
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Co-authored-by: hagen1778 <roman@victoriametrics.com>
The new flag type is supposed to be used for specifying URL values which
could contain sensitive information such as auth tokens in GET params or
HTTP basic authentication.
The URL flag also allows loading its value from files if `file://`
prefix is specified. As example, the new flag type was used in
app/vmbackup as it requires specifying `authKey` param for making the
snapshot.
See related issue
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5973
Thanks to @wasim-nihal for initial implementation
https://github.com/VictoriaMetrics/VictoriaMetrics/pull/6060
---------
Signed-off-by: hagen1778 <roman@victoriametrics.com>
See https://github.com/VictoriaMetrics/VictoriaMetrics/issues/6140
We don't cover this corner case as it has low chance for reproduction.
Precisely, the requirements are following:
1. vmagent need to be configured with multiple identical `remoteWrite.url` flags;
2. At least one of the persistent queues need to be non-empty, which already
signalizes about issues with setup;
3. vmagent need to be restarted with removing of one of `remoteWrite.url` flags.
We do not document this case in vmagent.md as it seems to be a rare corner case
and its explanation will require too much of explanation and confuse users.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Starting from v1.99.0 vmalert could ignore anchors pointing to specific
rule groups if `search` param was present in URL.
This change makes anchors compatible with `search` param in UI.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Stream aggregation may yield inaccurate results if it processes incomplete data.
This issue can arise when data is sourced from clients that maintain a queue of unsent data, such as Prometheus or vmagent.
If the queue isn't fully cleared within the aggregation interval, only a portion of the time series may be included in that period, leading to distorted calculations.
To mitigate this we add an option to ignore first N aggregation intervals. It is expected, that client queues
will be cleared during the time while aggregation ignores first N intervals and all subsequent aggregations
will be correct.
The memory usage for plain sync.Pool doesn't increase comparing to the memory usage for the hybrid scheme,
so it is better to use plain sync.Pool in order to simplify the code and make it more readable and maintainable.
This is a follow-up for c22da2f917
Data ingestion benchmark doesn't show memory usage difference between two approaches,
so let's use simpler approach in order to improve code readability and maintainability.
This is a follow-up for 77c597738c
This scheme was used for reducing memory usage when vmagent runs on a machine with big number of CPU cores
and the ingestion rate isn't too big. The scheme with channel-based pool could reduce memory usage,
since it minimizes the number of PushCtx structs in the pool in this case.
Performance tests didn't reveal significant difference in memory usage under both low and high ingestion rate
between plain sync.Pool and the current hybrid scheme, so replace the scheme with plain sync.Pool in order
to simplify the code.
There was a sleep statement in the test, waiting for Group
to perform a couple of evaluation. But looks like
it worked unreliable for some CI tests like the one below
https://github.com/VictoriaMetrics/VictoriaMetrics/actions/runs/8718213844/job/23915007958?pr=6115
This commit changes the sleep statement on a function that
waits for a specific number of evaluations. It should make this
test faster in general case, and more reliable for slow environemnts.