Examples:
1) -metricsAuthKey=file:///abs/path/to/file - reads flag value from the given absolute filepath
2) -metricsAuthKey=file://./relative/path/to/file - reads flag value from the given relative filepath
3) -metricsAuthKey=http://some-host/some/path?query_arg=abc - reads flag value from the given url
The flag value is automatically updated when the file contents changes.
* app/vmauth: adds metric_labels and backend_errors counter
it must improve observability for user requests with new metric - per user backend errors counter.
it's needed to calculate requests fail rate to the configured backends.
metric_labels configuration allows to perform additional aggregations on top of multiple users from configuration section.
It could be multiple clients or clients with separate read/write tokens
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5565
* wip
---------
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
* app/vmselect/promql: properly handle possible negative results caused by float operations precision error in rollup functions like rate() or increase()
* fix test
- docs/sd_configs.md: moved hetzner_sd_configs docs to the correct place according to alphabetical order of SD names,
document missing __meta_hetzner_role label.
- lib/promscrape/config.go: added missing MustStop() call for Hetzner SD,
and moved the code to the correct place according to alphabetical order of SD names.
- lib/promscrape/discovery/hetzner: properly handle pagination for hloud API responses,
populate missing __meta_hetzner_role label like Prometheus does.
- Properly populate __meta_hetzner_public_ipv6_network label like Prometheus does.
- Remove unused SDConfig.Token.
- Remove "omitempty" annotation from SDConfig.Role field, since this field is mandatory.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/5550
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3154
Each Grafana dashboard has unique ID which can be used to fetch the dashboard
from grafana.com: https://grafana.com/grafana/dashboards/11176
The same dashboard can be accessed via URL with slug: https://grafana.com/grafana/dashboards/11176-victoriametrics-cluster/
But using slug implies that any change to dashboard name will break the link.
So it is better to just use ID, so the dashboard URL will never break.
This is follow-up for ff33e60a3d
Signed-off-by: hagen1778 <roman@victoriametrics.com>
It is likely this value was borrowed from panel `Slow inserts` panel
from Grafana dasbhoard for VM single/cluster installations. This is a mistake.
There is no such thing as "tolerable churn rate", as tolerancy depends on the amount
of allocated resources.
Although, it is unclear what is meant by 5%. If it refers to 5% of new time series per second,
then such churn rate is extremely high. It would mean that the avg life of a time series is 20s.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Properly determine time range search for instant queries with too big look-behind window like `foo[100y]`.
Previously, such queries could return empty responses even if `foo` is present in database.
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5553
Signed-off-by: hagen1778 <roman@victoriametrics.com>
- Clarify the bugfix description at docs/CHANGELOG.md
- Simplify the code by accessing prefetchedMetricIDs struct under the lock
instead of using lockless access to immutable struct.
This shouldn't worsen code scalability too much on busy systems with many CPU cores,
since the code executed under the lock is quite small and fast.
This allows removing cloning of prefetchedMetricIDs struct every time
new metric names are pre-fetched. This should reduce load on Go GC,
since the cloning of uin64set.Set struct allocates many new objects.
Add docs to explain that `latest` backup folder can be accessed
more frequently than others. Hence, it is recommended to keep it
on storages with low access price.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Before, this cache was limited only by size.
Cache invalidation by time happens with jitter to prevent thundering herd problem.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
It was calculating the number of dropped time series instead of the number of dropped samples.
While at it, drop vmalert_remotewrite_dropped_bytes_total metric, since it was inconsistently calculated -
at one place it was calculating raw protobuf-encoded sample sizes, while at another place it was calculating
the size of snappy-compressed prompbmarshal.WriteRequest protobuf message.
Additionally, this metric has zero practical sense, so just drop it in order to reduce the level of confusion.
* update link to book a demo for AD & RCA
* fix invalid refs in components/model
* - fix staging -> prod links
- replace capitalized FAQ headers
- change the section order on main page
- replace :latest tag with current stable
* add panels for detailed visualization of traffic usage between vmstorage, vminsert, vmselect
components and their clients. New panels are available in the rows dedicated to specific components.
* update "Slow Queries" panel to show percentage of the slow queries to the total number of read queries
served by vmselect. The percentage value should make it more clear for users whether there is a service degradation.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
(cherry picked from commit 463455665b)
automatically add `exported_` prefix for original evaluation result label if it's conflicted with external or reserved one,
previously it was overridden.
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5161
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Co-authored-by: hagen1778 <roman@victoriametrics.com>
(cherry picked from commit 1f477aba41)
The user may which to control the endpoint parameters for instance to
set the audience when requesting an access token. Exposing the
parameters as a map allows for additional use cases without requiring
modification.
* lib/awsapi: properly assume role with webIdentity token
introduce new irsaRoleArn param for config. It's only needed for authorization with webIdentity token.
First credentials obtained with irsa role and the next sts assume call for an actual roleArn made with those credentials.
Common use case for it - cross AWS accounts authorization
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3822
* wip
---------
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
Some clients may ingest samples via OpenTelemetry protocol without Resource labels.
Previously VictoriaMetrics was silently dropping such samples.
The commit 317834f876 added vm_protoparser_rows_dropped_total{type="opentelemetry",reason="resource_not_set"}
counter for tracking of such dropped samples. See https://github.com/VictoriaMetrics/VictoriaMetrics/pull/5459
It is better from usability PoV to accept such samples instead of dropping them and incrementing the corresponding counter.
Before, retries happened only on writes into a network connection
between source and destination. But errors returned by server after
all the data was transmitted were logged, but not retried.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
(cherry picked from commit 664fa5cb78)
Requests with wrong or no schema in `-external.url` could be rejected by alertmanager.
So we validate schema on start up.
(cherry picked from commit 9253c24dd6)
* vmui: add show quick tip for autocomplete
* vmui: auto-completion usability improvements #5348
* vmui: add const for min symbols in autocomplete
* Use proper queries to VictoriaMetrics
* vmui: fix comments for autocomplete
* app/vmselect: run `make vmui-update`
---------
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
* fix typos in docs
* add `shard-` prefix to generated links when `-promscrape.cluster.memberURLTemplate` is enabled
Signed-off-by: hagen1778 <roman@victoriametrics.com>
(cherry picked from commit e0fc5ef140)
Before, vmalert would send notifications with labels containing characters
not supported by Alertmanager validator, resulting into validation errors
like `msg="Failed to validate alerts" err="invalid label set: invalid name "foo.bar"`
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Previously the lower bound could be too small, which could result in missing values at the beginning of the graph
for default_rollup() function. This function is automatically applied to all the series selectors if they aren't
explicitly wrapped into a rollup function - see https://docs.victoriametrics.com/MetricsQL.html#implicit-query-conversions
While at it, properly take into account `-search.minStalenessInterval` command-line flag when adjusting
the lower bound for the selected time range.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5388
* vmagent: export `vm_promscrape_scrape_pool_targets` metric to track the number of targets that each scrape_job discovers
* add extra panel for new metric
- Add links to relevant docs into descriptions for every -kafka.* and -gcp.pubsub.* command-line flags.
- Wait until message processing goroutines are stopped before returning from gcppubsub.Stop().
- Prevent from multiple calls to Init() without Stop().
- Drop message if tenantID cannot be parsed properly.
- Take into account tenantID for all the supported message formats.
- Support gzip-compressed messages for graphite format.
- Use exponential backoff sleep when the message cannot be pushed to remote storage systems
because of disabled on-disk persistence - https://docs.victoriametrics.com/vmagent.html#disabling-on-disk-persistence
- Unblock from sleep as soon as Stop() is called. Previously the sleep could take up to 2 seconds after Stop() is called.
- Remove unused globalCtx and initContext from app/vmagent/remotewrite/gcppubsub
- Mention Google PubSub support at docs/enterprise.md
- Make Google PubSub docs more clear at docs/vmagent.md
This is a follow-up for commits 115245924a5f096c5a3383d6cc8e8b6fbd421984
and e6eab781ce42285a6a1750dc01eba6801dd35516 .
Updates https://github.com/VictoriaMetrics/VictoriaMetrics-enterprise/pull/717
Updates https://github.com/VictoriaMetrics/VictoriaMetrics-enterprise/pull/713
* app/vmalert: expose `/vmalert/api/v1/rule` and `/api/v1/rule` API which returns rule status in JSON format
* app/vmalert: hide updates if query param not set
* app/vmalert: fix panic (recursion call)
* app/vmalert: add needed group name and file name
* app/vmalert: fix comment, update behavior
* app/vmalert: fix description
* app/vmalert: simplify API for /api/v1/rule
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* app/vmalert: simplify API for /api/v1/rule
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* app/vmalert: simplify API for /api/v1/rule
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* app/vmalert: simplify API for /api/v1/rule
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* app/vmalert: simplify API for /api/v1/rule
Signed-off-by: hagen1778 <roman@victoriametrics.com>
---------
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
Co-authored-by: hagen1778 <roman@victoriametrics.com>
Previously concurrency for static and fast queries was limited with the -search.maxConcurrentRequests
command-line flag. This could complicate identifying heavy queries via `vmui` at `Top queries` and `Active queries` pages,
since `vmui` and these pages couldn't be opened on overloaded vmselect.
Thanks to @f41gh7 for the idea.
Previously the /service-discovery page didn't show targets dropped because of sharding
( https://docs.victoriametrics.com/vmagent.html#scraping-big-number-of-targets ).
Show also the reason why every target is dropped at /service-discovery page.
This should improve debuging why particular targets are dropped.
While at it, do not remove dropped targets from the list at /service-discovery page
until the total number of targets exceeds the limit passed to -promscrape.maxDroppedTargets .
Previously the list was cleaned up every 10 minutes from the entries, which weren't updated
for the last minute. This could complicate debugging of dropped targets.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5389
* lib/backup/s3remote: remove prev object versions for recursive delete
- fix error caused by sending empty objects list to be deleted. This was possible in case old versions of objects where deleted, but root-level entries where still available. This caused paginator to return an empty page which wasn't skipped.
- delete previous versions of objects recursively for S3 remote
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* docs/changelog: add vmbackupmanager fix entry
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* lib/backup/s3remote: unify path construction for S3 objects
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
---------
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* lib/streamaggr: properly reference slice with labels
by limiting slice capacity. It must fix issues with slice modification, in case of append new slice will be allocated, instead of modifying refrenced slice
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5402
* Reduce memory allocations when output_relabel_configs adds new labels to output samples
---------
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
(cherry picked from commit 41f7940f97)
* prevent /api/v1 from panic on parsing rows
* add tests for Extract function for v1 and v2 api's
* separate request types in different pools to prevent different objects mixing
* add changelog line
543f218fe9
Signed-off-by: hagen1778 <roman@victoriametrics.com>
(cherry picked from commit 98d0f81f21)
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Co-authored-by: Andrew Chubatiuk <andrew.chubatiuk@motional.com>
Co-authored-by: Nikolay <https://github.com/f41gh7>
Co-authored-by: Roman Khavronenko <roman@victoriametrics.com>
(cherry picked from commit 543f218fe9)
Signed-off-by: hagen1778 <roman@victoriametrics.com>
- Add Try* prefix to functions, which return bool result in order to improve readability and reduce the probability of missing check
for the result returned from these functions.
- Call the adjustSampleValues() only once on input samples. Previously it was called on every attempt to flush data to peristent queue.
- Properly restore the initial state of WriteRequest passed to tryPushWriteRequest() before returning from this function
after unsuccessful push to persistent queue. Previously a part of WriteRequest samples may be lost in such case.
- Add -remoteWrite.dropSamplesOnOverload command-line flag, which can be used for dropping incoming samples instead
of returning 429 Too Many Requests error to the client when -remoteWrite.disableOnDiskQueue is set and the remote storage
cannot keep up with the data ingestion rate.
- Add vmagent_remotewrite_samples_dropped_total metric, which counts the number of dropped samples.
- Add vmagent_remotewrite_push_failures_total metric, which counts the number of unsuccessful attempts to push
data to persistent queue when -remoteWrite.disableOnDiskQueue is set.
- Remove vmagent_remotewrite_aggregation_metrics_dropped_total and vm_promscrape_push_samples_dropped_total metrics,
because they are replaced with vmagent_remotewrite_samples_dropped_total metric.
- Update 'Disabling on-disk persistence' docs at docs/vmagent.md
- Update stale comments in the code
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/5088
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/2110
* app/vmagent: allow to disabled on-disk queue
Previously, it wasn't possible to build data processing pipeline with a
chain of vmagents. In case when remoteWrite for the last vmagent in the
chain wasn't accessible, it persisted data only when it has enough disk
capacity. If disk queue is full, it started to silently drop ingested
metrics.
New flags allows to disable on-disk persistent and immediatly return an
error if remoteWrite is not accessible anymore. It blocks any writes and
notify client, that data ingestion isn't possible.
Main use case for this feature - use external queue such as kafka for
data persistence.
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/2110
* adds test, updates readme
* apply review suggestions
* update docs for vmagent
* makes linter happy
---------
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
Tests showed that importing a single line with 70MB size takes 5.3GiB
RSS memory for VictoriaMetrics single-node.
In the scenario when user exports and imports data from one VM to another,
it could possibly lead to OOM exception for destination VM.
Importing a single line with 16MB size taks 1.3GiB RSS memory.
Hence, the limit for `import.maxLineLen` was decreased from 100MB to 10MB
to improve reliability of VictoriaMetrics during imports.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
This case is possible after the following steps:
1. vmagent successfully performed handshake with the -remoteWrite.url and the remote storage supports zstd-compressed data.
2. remote storage became unavailable or slow to ingest data, vmagent compressed the collected data into blocks with zstd and puts these blocks to persistent queue on disk.
3. vmagent restarts and the remote storage is unavailable during the handshake, then vmagent falls back to Snappy compression.
4. vmagent starts sending zstd-compressed data from persistent queue to the remote storage, while falsely advertizing it sends Snappy-compressed data.
5. The remote storage receives zstd-compressed data and fails unpacking it with Snappy.
The solution is the same as 12cd32fd75, just fall back to zstd decompression if Snappy decompression fails.
The buffered connection could have exceeded the underlying connection
deadline during reading or writing to an internal buffer.
With this change, buffered connection struct additionally checks
for a deadline in Read/Write methods.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
evalRollupFuncNoCache() may return time series with identical labels (aka duplicate series)
when performing queries satisfying all the following conditions:
- It must select time series with multiple metric names. For example, {__name__=~"foo|bar"}
- The series selector must be wrapped into rollup function, which drops metric names. For example, rate({__name__=~"foo|bar"})
- The rollup function must be wrapped into aggregate function, which has no streaming optimization.
For example, quantile(0.9, rate({__name__=~"foo|bar"})
In this case VictoriaMetrics shouldn't return `cannot merge series: duplicate series found` error.
Instead, it should fall back to query execution with disabled cache.
Also properly store the merged results. Previously they were incorrectly stored because of a typo
introduced in the commit 41a0fdaf39
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5332
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/5337
`version` label won't show the difference if various flavors of the same
version were deployed. But `short_version` will.
For example, on the sandbox env we test VM builds before new version release.
Without this change, the version update won't be visible on dashboard.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
(cherry picked from commit d389a4fcf3)
* lib/querytracer: makes package concurrent safe to use
it must fix various issues with concurrent code usage.
Especially, when it's not reasonable to wait for all goroutines to be finished
* wip
---------
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
* app/vmstorage: close vminsert connections gradually before stopping storage
Implements graceful shutdown approach suggested here - https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4922#issuecomment-1768146878
Test results for this can be found here - https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4922#issuecomment-1790640274
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* app/vmstorage: update graceful shutdown logic
- close connections from vminsert in determenistic order
- update flag description
- lower default timeout to 25 seconds. 25 seconds value was chosen because the lowest default value used in default configuration deployments is 30s(default value in Kubernetes and ansible-playbooks).
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* docs/cluster: add information about re-routing enhancement during restart
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* docs/changelog: add entry for new command-line flag
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* {app/vmstorage,lib/ingestserver}: address review feedback
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* docs/cluster: add note to update workload scheduler timeout
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* wip
---------
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
This case is possible after the following steps:
1. vmagent tries to perform handshake with the -remoteWrite.url in order to determine whether
the remote storage supports zstd-compressed data.
2. The remote storage is unavailable during the handshake. In this case vmagent falls back to Snappy compression
for the data sent to the remote storage.
3. vmagent compresses the collected data into blocks with Snappy and puts these blocks to persistent queue on disk.
4. The remote storage becomes available.
5. vmagent restarts, performs the handshake with the remote storage and detects that it supports zstd-compressed data.
6. vmagent starts sending Snappy-compressed data from persistent queue to the remote storage,
while falsely advertizing it sends zstd-compressed data.
7. The remote storage receives Snappy-compressed data and fails unpacking it with zstd.
The solution is to just fall back to Snappy decompression if zstd decompression fails.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5301
Previously the `Host` header was remained unchanged when passing it in requests to backends.
This may improperly work if the backend uses host-based routing.
While at it, allows http/2.0 requests to backends. While VictoriaMetrics components
do not accept http/2.0 requests, other backends can require such requests.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5240
- Re-use identically configured http.Transport across multiple users.
This fixes handling of the limit on the number of connection, which can be established per each backend
via -maxIdleConnsPerBackend command-line flag. This limit stopped working after 323f3720ed
- Add docs about backend TLS setup at https://docs.victoriametrics.com/vmauth.html#backend-tls-setup
- Add ability to disable backend TLS verification for all the users via -backend.tlsInsecureSkipVerify command-line flag.
This flag may be useful when -auth.config contains big number of users, and every user must disable backend TLS verification.
- Add ability to specify TLS Root CA via tls_ca_file option at per-user basis and via -backend.tlsCAFile command-line flag
across all the users.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5240
Previously entries which were accessed only 1 time weren't cached.
It has been appeared that some rarely executed heavy queries may read indexdb block twice
in a row instead of once. There is no need in caching such a block then.
This change should eliminate cache size spikes for indexdb/dataBlocks when such heavy queries are executed.
Expose -blockcache.missesBeforeCaching command-line flag, which can be used for fine-tuning
the number of cache misses needed before storing the block in the caching.
* app/vmalert: update remote-write process
* automatically retry remote-write requests on closed connections. The change should reduce the amount of logs produced in environments with short-living connections or environments without support of keep-alive on network balancers.
* increment `vmalert_remotewrite_errors_total` metric if all retries to send remote-write request failed. Before, this metric was incremented only if remote-write client's buffer is overloaded.
* increment `vmalert_remotewrite_dropped_rows_total` amd `vmalert_remotewrite_dropped_bytes_total` metrics if remote-write client's buffer is overloaded. Before, these metrics were incremented only after unsuccessful HTTP calls.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* Update docs/CHANGELOG.md
---------
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Co-authored-by: Hui Wang <haley@victoriametrics.com>
Add `tls_insecure_skip_verify` option on per-user basis which allows to disable TLS verification for all requests to backend on behalf of this user.
See: https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5240
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
(cherry picked from commit 323f3720ed)
reduce the number of queries for restoring alerts state on start-up.
The change should speed up the restore process and reduce pressure on `remoteRead.url`.
(cherry picked from commit 90d45574bf)
* vmauth: add browser authorization request for http requests without credentials to a route that is not in the `unauthorized_user` section (when `unauthorized_user` is specified).
* add link to issue in CHANGELOG
* Extend vmauth docs
* wip
---------
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
Repeated instant queries with long lookbehind windows, which contain one of the following rollup functions,
are optimized via partial result caching:
- sum_over_time()
- count_over_time()
- avg_over_time()
- increase()
- rate()
The basic idea of optimization is to calculate
rf(m[d] @ t)
as
rf(m[offset] @ t) + rf(m[d] @ (t-offset)) - rf(m[offset] @ (t-d))
where rf(m[d] @ (t-offset)) is cached query result, which was calculated previously
The offset may be in the range of up to 1 hour.
The SECURITY label should be applied only to changes, which fix security issues.
The change at ad839aa492 adds new command-line flags, which can be used
for improving security in some cases. They do not fix any security issues.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/5111
The new metric gets increased each time `-search.logQueryMemoryUsage` memory limit
is exceeded by a query. This metric should help to identify expensive and heavy queries
without inspecting the logs.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
The new panel supposed to indicate alerting groups that miss their evaluations.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
(cherry picked from commit aaf9e3d526)
The new rule for vmalert supposed to detect groups that miss their
evaulations due to slow queries.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
(cherry picked from commit 9866974a53)
The panel `Errors rate to Alertmanager` had `group` label filter
applied to the expression, while the metric `vmalert_alerts_send_errors_total`
doesn't have that label. This resulted into always empty results.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
(cherry picked from commit 8874b525b7)
fix possible missing firing states for alerting rules in replay mode
Before if one firing stage is bigger than single query request range, like rule with a big `for`, alerting rule won't able to be detected as firing.
Co-authored-by: hagen1778 <roman@victoriametrics.com>
(cherry picked from commit abcb21aa5e)
support `Strict-Transport-Security`, `Content-Security-Policy` and `X-Frame-Options`
HTTP headers in all VictoriaMetrics components.
The values for headers can be specified by users via the following flags:
`-http.header.hsts`, `-http.header.csp` and `-http.header.frameOptions`.
Co-authored-by: hagen1778 <roman@victoriametrics.com>
(cherry picked from commit ad839aa492)
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Previously url watchers for pod, service and node objects could be mistakenly closed
when service discovery was set up only for endpoints and endpointslice roles,
since watchers for these roles may start start pod, service and node url watchers
with nil apiWatcher passed to groupWatcher.startWatchersForRole().
Now all the url watchers, which belong to a particular groupWatcher, are stopped at once
when this groupWatcher has no apiWatcher subscribers.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5216
The issue has been introduced in v1.93.5 when addressing https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4850
* do not print redundant error logs when failed to scrape consul or nomad target
prometheus performs the same because it uses consul lib which just drops the error(1806bcb38c/api/api.go (L1134))
- Make sure that invalid/missing TLS CA file or TLS client certificate files at vmagent startup
don't prevent from processing the corresponding scrape targets after the file becomes correct,
without the need to restart vmagent.
Previously scrape targets with invalid TLS CA file or TLS client certificate files
were permanently dropped after the first attempt to initialize them, and they didn't
appear until the next vmagent reload or the next change in other places of the loaded scrape configs.
- Make sure that TLS CA is properly re-loaded from file after it changes without the need to restart vmagent.
Previously the old TLS CA was used until vmagent restart.
- Properly handle errors during http request creation for the second attempt to send data to remote system
at vmagent and vmalert. Previously failed request creation could result in nil pointer dereferencing,
since the returned request is nil on error.
- Add more context to the logged error during AWS sigv4 request signing before sending the data to -remoteWrite.url at vmagent.
Previously it could miss details on the source of the request.
- Do not create a new HTTP client per second when generating OAuth2 token needed to put in Authorization header
of every http request issued by vmagent during service discovery or target scraping.
Re-use the HTTP client instead until the corresponding scrape config changes.
- Cache error at lib/promauth.Config.GetAuthHeader() in the same way as the auth header is cached,
e.g. the error is cached for a second now. This should reduce load on CPU and OAuth2 server
when auth header cannot be obtained because of temporary error.
- Share tls.Config.GetClientCertificate function among multiple scrape targets with the same tls_config.
Cache the loaded certificate and the error for one second. This should significantly reduce CPU load
when scraping big number of targets with the same tls_config.
- Allow loading TLS certificates from HTTP and HTTPs urls by specifying these urls at `tls_config->cert_file` and `tls_config->key_file`.
- Improve test coverage at lib/promauth
- Skip unreachable or invalid files specified at `scrape_config_files` during vmagent startup, since these files may become valid later.
Previously vmagent was exitting in this case.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4959
* app/vmselect: limit the number of parallel workers by 32
The change should improve performance and memory usage during query processing
on machines with big number of CPU cores. The number of parallel workers for
query processing is controlled via `-search.maxWorkersPerQuery` command-line flag.
By default, the number of workers is limited by the number of available CPU cores,
but not more than 32. The limit can be increased via `-search.maxWorkersPerQuery`.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* wip
- The `-search.maxWorkersPerQuery` command-line flag doesn't limit resource usage,
so move it from the `resource usage limits` to `troubleshooting` chapter at docs/Single-server-VictoriaMetrics.md
- Make more clear the description for the `-search.maxWorkersPerQuery` command-line flag
- Add the description of `-search.maxWorkersPerQuery` to docs/Cluster-VictoriaMetrics.md
- Limit the maximum value, which can be passed to `-search.maxWorkersPerQuery`, to GOMAXPROCS,
because bigger values may worsen query performance and increase CPU usage
- Improve the the description of the change at docs/CHANGELOG.md. Mark it as FEATURE instead of BUGFIX,
since it is closer to a feature than to a bugfix.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5087
---------
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
* fix inconsistent behaviors with prometheus when scraping
1. address https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4959. skip job with wrong syntax in `scrape_configs` with error logs instead of exiting;
2. show error messages on vmagent /targets ui if there are wrong auth configs in `scrape_configs`, previously will print error logs and do scrape without auth header;
3. don't send requests if there are wrong auth configs in:
1. vmagent remoteWrite;
2. vmalert datasource/remoteRead/remoteWrite/notifier.
* add changelogs
* address review comments
* fix ut
c9375cac5e
Descriptions were updated in attempt to make it more clear for readers,
re-phrasing and linking missing docs.
`eval_delay` was added to tests to verify it can be unmarshalled.
`eval_delay` is now applied before timestamp alignment to make it more predictable.
Before, if delay < interval the timestamp won't be aligned.
`eval_delay` and `eval_offset` was added to API output.
`PreviouslySentSeriesToRW` converted to private `previouslySentSeriesToRW`.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
(cherry picked from commit a216fe6728)
Using `min_over_time` should reduce the amount of false positives when
component is running in near-the-threshold state. Now it should trigger
only if all collected samples were above the threshold on 10m interval.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
(cherry picked from commit 003ef3a518)
This can be useful in the following queries:
drop_empty_series(temperature <= 30) default 40
This query drops temperature series with all the values bigger than 30 on the selected time range,
while replacing gaps in the remaining series with 40.
The query without drop_empty_series:
(temperature <= 30) default 40
would leave all the temperature series with all the values bigger than 30 on the selected time range,
and replace all their values with 40. This is not what could be epxected in some cases
like here - https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5071
This reverts commit 3d7a77bf82.
Reason for revert: relative links do not work properly at GitHub code
and at GitHub wiki. For example, the following page contains broken links
before reverting this commit:
https://github.com/VictoriaMetrics/VictoriaMetrics/blob/master/docs/VictoriaLogs/CHANGELOG.md
It is always better to use absolute links thank relative links, since the page contents
can be copy-n-pasted to other pages, which are located in vastly different directories,
and all the links will remain working.
* fixed error when creating a full backup using the `-origin` flag (#5144)
* Update docs/CHANGELOG.md
---------
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
This is a follow-up for f60c08a7bd
Changes:
- Make sure all the urls related to NewRelic protocol start from /newrelic . Previously some urls were started from /api/v1/newrelic
- Remove /api/v1 part from NewRelic urls, since it has no sense
- Remove automatic transformation from CamelCase to snake_case for NewRelic labels and metric names,
since it may complicate the transition from NewRelic to VictoriaMetrics. Preserve all the metric names and label names,
so users could query metrics and labels by the same names which are used in NewRelic.
The automatic transformation from CamelCase to snake_case can be added later as a special action for relabeling rules if needed.
- Properly update per-tenant data ingestion stats at app/vmagent/newrelic/request_handler.go . Previously it was always zero.
- Fix NewRelic urls in vmagent when multitenant data ingestion is enabled. Previously they were mistakenly started from `/`.
- Document NewRelic data ingestion url at docs/Cluster-VictoriaMetrics.md
- Remove superflouos memory allocations at lib/protoparser/newrelic
- Improve tests at lib/protoparser/newrelic/*
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3520
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/4712
- Make more clear the docs at docs/enterprise.md, so readers could figure out faster
on how to obtain enterprise key and how to pass it to VictoriaMetrics Enterprise components.
- Fix examples at docs/enterprise.md, which were referring to non-existing `-license-file` command-line flag.
The `-licenseFile` command-line flag must be used instead.
- Improve the description of `-license*` command-line flags, so users could understand
faster how to use them.
- Improve the warning message, which is emitted when the deprecated -eula command-line flag is passed,
so the user could figure out how to switch faster to -license* command-line flags.
- Disallow running VictoriaMetrics components with both -license and -licenseFile command-line flags.
- Disallow running VictoriaMetrics components when -licensFile points to an empty file.
- Consistently use the phrase "This flag is available only in Enterprise binaries" across
all the enterprise-specific command-line flags.
- Remove unneeded level of indirection for `noLicenseMessage` and `expiredMessage` string contants
in order to improve code readability and maintainability.
- Remove unneded `return` statements after `logger.Fatalf()` calls, since these calls exit the app and never return.
- Make sure that the info log message about successful license verification is emitted
when the license is verified successfully. Previously the error message could be logged
when the license payload is invalid or if it misses some required features.
- Reduce vertical space usage, so more information is available on the screen without the need to scroll.
- Show information for lines with higher values at the top of the legend under the graph.
This should simplify graph analysis when it contains many lines.
Fix vminsert/vmstorage/vmselect metrics filtering when dashboard is used
to display data from many sub-clusters with unique job names.
Before, only one specific job could have been accounted for component-specific panels,
instead of all available jobs for the component.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
reduce lock contention for heavy aggregation requests
previously lock contetion may happen on machine with big number of CPU due to enabled string interning. sync.Map was a choke point for all aggregation requests.
Now instead of interning, new string is created. It may increase CPU and memory usage for some cases.
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5087
* vmalert: add `query_time_alignment` for rule group
1. add `eval_alignment` attribute for group which by default is true. So group rule query stamp will be aligned with interval and propagated to ALERT metrics and the messages for alertmanager;
2. deprecate `datasource.queryTimeAlignment` flag.
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5049
(cherry picked from commit 2aa0f5fc41)
Strip sensitive information such as auth headers or passwords from datasource, remote-read,
remote-write or notifier URLs in log messages or UI. This behavior is by default and is controlled via
`-datasource.showURL`, `-remoteRead.showURL`, `remoteWrite.showURL` or `-notifier.showURL` cmd-line flags.
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5044
(cherry picked from commit 244c887825)
* lib/promscrape: make concurrency control optional
Before, `-maxConcurrentInserts` was limiting all calls to `promscrape.Parse`
function: during ingestion and scraping. This behavior is incorrect.
Cmd-line flag `-maxConcurrentInserts` should have effect onl on ingestion.
Since both pipelines use the same `promscrape.Parse` function, we extend it
to make concurrency limiter optional. So caller can decide whether concurrency
should be limited or not.
This commit makes c53b5788b4
obsolete.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* Revert "dashboards: move `Concurrent inserts` panel to Troubleshooting section"
This reverts commit c53b5788b4.
---------
Signed-off-by: hagen1778 <roman@victoriametrics.com>
- Move uniqueFields from rows to blockStreamMerger struct.
This allows localizing all the references to uniqueFields inside blockStreamMerger.mustWriteBlock(),
which should improve readability and maintainability of the code.
- Remove logging of the event when blocks cannot be merged because they contain more than maxColumnsPerBlock,
since the provided logging didn't provide the solution for the issue with too many columns.
I couldn't figure out the proper solution, which could be helpful for end user,
so decided to remove the logging until we find the solution.
This commit also contains the following additional changes:
- It truncates field names longer than 128 chars during logs ingestion.
This should prevent from ingesting bogus field names.
This also should prevent from too big columnsHeader blocks,
which could negatively affect search query performance,
since columnsHeader is read on every scan of the corresponding data block.
- It limits the maximum length of const column value to 256.
Longer values are stored in an ordinary columns.
This helps limiting the size of columnsHeader blocks
and improving search query performance by avoiding
reading too long const columns on every scan of the corresponding data block.
- It deduplicates columns with identical names during data ingestion
and background merging. Previously it was possible to pass columns with duplicate names
to block.mustInitFromRows(), and they were stored as is in the block.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4762
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/4969
docs: add clarification of the retention filter usage
Updated documentation regarding retention filter usage if duration is set lower than
`-retentionPeriod` flag value.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Co-authored-by: hagen1778 <roman@victoriametrics.com>
* lib/promscrape: add metric `vm_promscrape_scrapes_skipped_total`
add metric `vm_promscrape_scrapes_skipped_total`to show whether vmagent skips the scrapes.
This could happen if vmagent is overloaded or target is responding too slow for configured `scrape_interval`.
The follow-up commit should add a corresponding alerting rule and panel to vmagent dashboard.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* deployment/docker: add `TooManyScrapeSkips` alerting rule for vmagent
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* dashboards: add panels `Scrape duration 0.99 quantile` and `Skipped scrapes` to vmagent dashboard
Signed-off-by: hagen1778 <roman@victoriametrics.com>
---------
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* lib/logstorage: prevent from panic during background merge
Fixes panic during background merge when resulting block would contain more columns than maxColumnsPerBlock.
Buffered data will be flushed and replaced by the next block.
See: https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4762
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* lib/logstorage: clarify field description and comment
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
---------
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* lib/logstorage: switch to read-only mode when running out of disk space
Added support of `--storage.minFreeDiskSpaceBytes` command-line flag to allow graceful handling of running out of disk space at `--storageDataPath`.
See: https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4737
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* lib/logstorage: fix error handling logic during merge
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* lib/logstorage: fix log level
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
---------
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
Co-authored-by: Nikolay <nik@victoriametrics.com>
* vmui: update information about tsdb usage in cluster version
* vmui: cleanup
* vmui: add CHANGELOG.md
* vmui: cleanup
* vmui: update logic, move information to the visible place
* app/vmui: remove values fetch, update documentation for cardinality explorer
* app/vmui: update CHANGELOG.md
Moved because this panel is related to both: scraped and ingested data.
Before, it could have give a misleading impression that it is related to ingested metrics only.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* docker-compose: add vmauth to cluster env
vmauth acts as a balancer and used as an example of how to interconnect
VM components via vmauth.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* docker-compose: add vmauth to cluster env
vmauth acts as a balancer and used as an example of how to interconnect
VM components via vmauth.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
---------
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Co-authored-by: Nikolay <nik@victoriametrics.com>
`median_over_time` is handled by predefined WITH template in MetricsQL library which translates it to `quantile_over_time(0.5)`
This makes it impossble to use `median_over_time` as a usual rollup function for `aggr_over_time`.
See: https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5034
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
lib/promscrape/discovery/kubernetes: supress context.Cancelled error in logs
It is possible that context.Cancelled will appear after k8s watcher was closed due to reload(see https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4850).
Logging an error misinforms user and looks like vmagent discovery will stop working even though this does not affect discovery.
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
(cherry picked from commit 8d99c12a7d)
* lib/backup: fix issue with inconsistent copying of appliedRetention.txt
appliedRetention.txt can be modified in place, so it should be always copied just the same as parts.json
Updates: https://github.com/victoriaMetrics/victoriaMetrics/issues/5005
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* docs: add changelog entry for appliedRetention.txt copying fix
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
---------
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* expose metrics `vmauth_config_last_reload_*` for tracking the state of config reloads, similarly to vmagent/vmalert components.
* do not print logs like `SIGHUP received...` once per configured `-configCheckInterval` cmd-line flag. This log will be printed only if config reload was invoked manually.
* prevent configuration reloading if there were no changes in config. This improves memory usage when `-configCheckInterval` cmd-line flag is configured and config has extensive list of regexp expressions requiring additional memory on parsing.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
- Switch from summary to histogram for vl_http_request_duration_seconds metric.
This allows calculating request duration quantiles across multiple hosts
via histogram_quantile(0.99, sum(vl_http_request_duration_seconds_bucket) by (vmrange)).
- Take into account only successfully processed data ingestion requests
when updating vl_http_request_duration_seconds histogram.
Failed requests are ignored, since they may significantly skew measurements.
- Clarify the description of the change at docs/VictoriaLogs/CHANGELOG.md.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/4934
- Move the bugfix description to the correct place in docs/CHANGELOG.md
- Prevent from logging of 'context canceled' errors after the url watcher is stopped,
since these errors are expected and may confuse users.
- Remove unused urlWatcher.refCount field.
- Remove unused urlWatcher.close() method.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4850
* lib/promscrape/discovery/kubernetes: fix leaking api watcher
goroutine which was polling k8s API had no execution control. This leaded to leaking goroutines during config reload.
See: https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4850
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* lib/promscrape/discovery/kubernetes: use reference counting for urlWatcher cleanup
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* lib/promscrape/discovery/kubernetes: remove waitgroup sync for goroutines polling API server
This is unnecessary since context will is cancelled and new requests will not be sent. Also, using waitgroup will increase time required to perform reload which might result in missed scrapes.
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* lib/promscrape/discovery/kubernetes: clarify comment
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* Apply suggestions from code review
* lib/promscrape/discovery/kubernetes: address review feedback
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
---------
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
Co-authored-by: Nikolay <nik@victoriametrics.com>
* lib/backup: force copying of parts.json
Copying of parts.json is required because `part.key()` comparison can create same key value for files with different contents. This will result in inconsistent backup being created or restored.
See: https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5005
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* lib/backup: ensure parts.json is only copied once
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
---------
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
Co-authored-by: Nikolay <nik@victoriametrics.com>
Adding limit on ingestion allows to avoid issues like this one https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4762
Such issues are often caused by misconfigurtion on log persing/ingestion side and preventing such rows from being ingested allows to avoid performance implications created by storing such log rows.
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
This should prevent from emitting too long lines when too long args are passed to logger.* functions.
For example, too long MetricsQL queries or too long data samples.
Adds `eval_offset` attribute for Groups.
If specified, Group will be evaluated at the exact time offset on the range of [0...evaluationInterval].
The setting might be useful for cron-like rules which must be evaluated at specific moments of time.
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3409
Signed-off-by: Haley Wang <pipilong.25@gmail.com>
Co-authored-by: hagen1778 <roman@victoriametrics.com>
(cherry picked from commit 45c0e4bb31)
* Introduce flagutil.Duration
To avoid conversion bugs
* Fix tests
* Clarify documentation re. month=31 days
* Add fasttime.UnixTime() to obtain time.Time
The goal is to refactor out the last usage of `.Msecs`.
* Use fasttime for time.Now()
* wip
- Remove fasttime.UnixTime(), since it doesn't improve code readability and maintainability
- Run `make docs-sync` for syncing changes from README.md to docs/ folder
- Make lib/flagutil.Duration.Msec private
- Rename msecsPerMonth const to msecsPer31Days in order to be consistent with retention31Days
---------
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
* lib/vmselectapi: do not send empty label names for labelNames request
it breaks cluster communication, since vmselect incorrectly reads request buffer, leaving unread data on it
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4932
* typo fix
* wip
---------
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
* app/vminsert: properly close vmstorage connection
previously vmstorage may stuck in broken state until vminsert restarts
since vmstorage was marked as read-only and connection was broken to it.
checkReadonly function never marked connection as broken
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4870
* wip
---------
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
* feat: add cardinality support for prometheus (#4320)
* docs/CHANGELOG.md: add cardinality support for prometheus
---------
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
* added ability to set and clear response headers (#4825)
Signed-off-by: Alexander Marshalov <_@marshalov.org>
* added ability to set and clear response headers (#4825)
Signed-off-by: Alexander Marshalov <_@marshalov.org>
* fix review comment
Signed-off-by: Alexander Marshalov <_@marshalov.org>
---------
Signed-off-by: Alexander Marshalov <_@marshalov.org>
* app/vminsert: fixes readonly check
previously vminsert doesn't check readOnly state for vmstorage, since check was never performed for nil buffer
In this case every 30 second storage node loss readonly state and received some data.
It caused re-routing and possible slow down for ingestion
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4870
* wip
---------
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
app/vmselect: fix panic when using `/select/multitenant` endpoint
Such requests must be rejected as not found since vmselect does not support multitenant endpoint.
See: https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4910
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* deployment/docker: disable provenance in buildx
it must fix an issue with multi-platform manifest generation
at buildx >= 0.10 backward compatibility was broken and generated image cannot be used with docker systems that doesn't support oci.
disabling attestat temporary fixes it.
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4907https://docs.docker.com/build/attestations/slsa-provenance/
* Update docs/CHANGELOG.md
---------
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
- Document the change at docs/CHANGELOG.md
- Set the default value for -vmstorageUserTimeout to 3 seconds. This is much better
than the 0 value, which means that TCP connection to unreachable vmstorage could block
for up to 16 minutes.
- Document -vmstorageUserTimeout at docs/Cluster-VictoriaMetrics.md
Initially, stream parse mode was reading data from response and parsing it on flight. This was causing longer delay to read the whole response and required increasing timeout value to allow data processing while reading. So that 908e35affd increased timeout value to fix this.
But after 74c00a8762 response in stream parse mode is saved into memory and then parsed eliminating necessity of having timeout value higher that for usual scrape.
Updates: https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4847
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
(cherry picked from commit 6e8611f301)
* vmagent: retry failed write request on the closed connection
Retry failed write request on the closed connection immediately,
without waiting for backoff. This should improve data delivery speed
and reduce amount of error logs emitted by vmagent when using idle connections.
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4139
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* vmagent: retry failed write request on the closed connection
Re-instantinate request before retry as body could have been already spoiled.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
---------
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Co-authored-by: Nikolay <nik@victoriametrics.com>
(cherry picked from commit 992a1c0a3a)
* vmalert: correctly re-instantinate HTTP req on retries
Previosly, request retry to datasource re-used existing HTTP request.
But if request object was already partially processed (body was read),
then retry will be unsuccessful.
The change re-instantinates HTTP request object before retry.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* vmalert: review fix
Signed-off-by: hagen1778 <roman@victoriametrics.com>
---------
Signed-off-by: hagen1778 <roman@victoriametrics.com>
(cherry picked from commit ddf87b32ed)
* Add button to prettify query
Just capitalizes query text for now
* Add /prettify-query API handler
* Replace UI pretiffier using prettifier API
* Add showing server errors
Had to pass setQueryErrors from useFetchQuery.ts
* Use serverUrl from global AppState
* Change icon to AutoAwsome icon + added style change color when button is active
* Add sync/await to prettifyQuery function
* Doc public function for lint
* Minor async fix
* Removed extra blank lines
* Extract usePrettifyQuery hook
* Made more generic style for :active button
* Refactor usePrettifyQuery
However, prettify errors don't clean up query errors, but should
* Add prettyQuery functionality to CHANGELOG.md
* Reuse queryErrors
* Unhide errors on start
---------
Co-authored-by: Tamara <toma.vashchuk@gmail.com>
(cherry picked from commit 7349f18c55)
Signed-off-by: hagen1778 <roman@victoriametrics.com>
exclude assets/README.md from publishing on the docs website
as its purpose is different to other docs.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
(cherry picked from commit e9d246f367)
app/vmctl: don't interrupt migration process if tenant has no data
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Co-authored-by: Alexander Marshalov <_@marshalov.org>
(cherry picked from commit 39623ae428)
vmagent: properly add extra labels before sending data to remote storage
labels from `remoteWrite.label` are now added to sent metrics just before they
are pushed to `remoteWrite.url` after all relabelings, including stream aggregation relabelings (#4247)
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4247
Signed-off-by: Alexander Marshalov <_@marshalov.org>
Co-authored-by: Roman Khavronenko <roman@victoriametrics.com>
(cherry picked from commit a27c2f3773)
Fix display of ingested rows rate for `Samples ingested/s`
and `Samples rate` panels for vmagent's dasbhoard.
Previously, not all ingested protocols were accounted in these panels.
An extra panel `Rows rate` was added to `Ingestion` section to display the split
for rows ingested rate by protocol.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
(cherry picked from commit 481a2c70fd)
Opentelemetry format allows histograms with non-counter buckets. In this case it makes no sense to add buckets into database
and save only counter with _count suffix.
It could be used as gauge.
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4814
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
* app/vlinsert/elasticsearch: add a command-line flag to provide ES version
Adds a flag which will allow to change version which will be reported by ES endpoint for compatibility checks performed by external logs shippers(such as filebeat).
See: https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4777
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* Document the -elasticsearch.version command-line flag
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4777
---------
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
* lib/promrelabel: fix relabeling if clause being applied to labels outside of current context
Relabeling is applied to each metric row separately, but in order to lower amount of memory allocations it is reusing labels.
Functions which are working on current metric row labels are supposed to use only current metric labels by using provided offset, but if clause matcher was using the whole labels set instead of local metrics.
This leaded to invalid relabeling results such as one described here: https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4806
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* docs/CHANGELOG.md: document the bugfix
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1998
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4806
---------
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
{vmagent/remotewrite,vminsert/common}: fix dropInput and keepInput flags inconsistency
Sync behavior for dropInput and keepInput flags between single-node and vmagent.
Fix vmagent not respecting dropInput flag and reverse logic for keepInput.
* rename `configErr` to `lastConfigErr` to reduce confusion
* add tests to verify metrics and msg are set properly
* fix mistake when config success metric wasn't restored after an error
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Value of `-dedup.minScrapeInterval` comand-line flag must be higher than `evaluation_interval` in order to make sure that only one sample on each evaluation will be left after deduplication.
Moreover, value of `-dedup.minScrapeInterval` must be a multiple of vmalert's `evaluation_interval` in order to make sure that samples will be aligned between deduplication window periods.
See: https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4774#issuecomment-1663940811
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
Correctly calculate `Bytes per point` value for single-server and cluster VM dashboards.
Before, the calculation mistakenly accounted for the number of entries in indexdb in
denominator, which could have shown lower values than expected.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
The `ConcurrentFlushesHitTheLimit` could be related to components like
vminsert, vmstorage, vm-single-node and vmagent. Moving this alert
to the `health` section of alerts will be benefitial for all components
and will remove the duplicates from single/cluster alerts.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
The new panel supposed to show whether the number of concurrent
inserts processed by vmagent isn't reaching the limit.
The panel contains recommendation what to do if limit is reached.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Binary export API protocol can be disabled via `-vm-native-disable-binary-protocol` cmd-line flag when migrating data from VictoriaMetrics. Disabling binary protocol
can be useful for deduplication of the exported data before ingestion.
For this, deduplication need to be configured at `-vm-native-src-addr` side
and `-vm-native-disable-binary-protocol` should be set on vmctl side.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Links of form `/api/v1/<groupID>/<alertID>/status` were deprecated
in favour of `/api/v1/alerts?group_id=<>&alert_id=<>` links in
v1.79.0. See more details here https://github.com/VictoriaMetrics/VictoriaMetrics/issues/2825
This change removes code responsible for deprecated functionality.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Adding anchors to the 1.92 changelog breaks consistency
of navigation section at https://docs.victoriametrics.com/CHANGELOG.html
All other releases do not have subsections, so should 1.92.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* Revert "vmalert: unittest support stale datapoint (#4696)"
This reverts commit 0b44df7ec8.
* Revert "docs: specify min version and limitations for vmalert's unit tests"
This reverts commit a24541bd
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* Revert "vmalert: init unit test (#4596)"
This reverts commit da60a68d
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* docs: mention unittest revert in changelog
Signed-off-by: hagen1778 <roman@victoriametrics.com>
---------
Signed-off-by: hagen1778 <roman@victoriametrics.com>
(cherry picked from commit 9f1b9b86cc)
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* lib/protoparser: adds opentelemetry parser
app/{vmagent,vminsert}: adds opentelemetry ingestion path
Adds ability to ingest data with opentelemetry protocol
protobuf and json encoding is supported
data converted into prometheus protobuf timeseries
each data type has own converter and it may produce multiple timeseries
from single datapoint (for summary and histogram).
only cumulative aggregationFamily is supported for sum(prometheus
counter) and histogram.
Apply suggestions from code review
Co-authored-by: Roman Khavronenko <roman@victoriametrics.com>
updates deps
fixes tests
wip
wip
wip
wip
lib/protoparser/opentelemetry: moves to vtprotobuf generator
go mod vendor
lib/protoparse/opentelemetry: reduce memory allocations
* wip
- Remove support for JSON parsing, since it is too fragile and is rarely used in practice.
The most clients send OpenTelemetry metrics in protobuf.
The JSON parser can be added in the future if needed.
- Remove unused code from lib/protoparser/opentelemetry/pb and lib/protoparser/opentelemetry/proto
- Do not re-use protobuf message between ParseStream() calls, since there is high chance
of high fragmentation of the re-used message because of too complex nested structure of the message.
* wip
* wip
* wip
---------
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
The important change is to highlight that restore procedure happens
only once and only for already loaded rules. Config hot-reload
doesn't trigger the restore procedure.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Add -remoteWrite.shardByURL command-line flag, which instructs vmagent to spread evenly
outgoing time series data among the configured remote storage systems specified via -remoteWrite.url .
Samples for the same time series go to the same -remoteWrite.url . This allows building horizontally
scalable stream aggregation when samples for counter and histogram series must be aggregated
by the same second-level vmagent instance.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4637
- Use a byte slice instead of a map for tracking indexes for matching series.
This improves performance, since access by slice index is faster than access by map key.
- Re-use the byte slice for tracking indexes for matching series.
This removes unnecessary memory allocations and improves stream aggregation performance a bit.
- Add an ability to return to the previous behvaiour by specifying -remoteWrite.streamAggr.dropInput command-line flag.
In this case all the input samples are dropped when stream aggregation is enabled.
- Backport the new stream aggregation behaviour from vmagent to single-node VictoriaMetrics when -streamAggr.config
option is set.
- Improve docs regarding this change at docs/CHANGELOG.md
- Document the new behavior at docs/stream-aggregation.md
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4243
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/4575
* {lib/streamaggr,vmagent/remotewrite}: breaking change for keepInput flag
Changes default behaviour of keepInput flag to write series which did not match any aggregators to the remote write.
See: https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4243
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* Update app/vmagent/remotewrite/remotewrite.go
Co-authored-by: Roman Khavronenko <roman@victoriametrics.com>
---------
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
Co-authored-by: Roman Khavronenko <roman@victoriametrics.com>
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
* lib/storage: pre-create timeseries before indexDB rotation
during an hour before indexDB rotation start creating records at the next indexDB
it must improve performance during switch for the next indexDB and remove ingestion issues.
Since there is no need for creation new index records for timeseries already ingested into current indexDB
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4563
* lib/storage: further work on indexdb rotation optimization
- Document the change at docs/CHAGNELOG.md
- Move back various caches from indexDB to Storage. This makes the change less intrusive.
The dateMetricIDCache now takes into account indexDB generation, so it stores (date, metricID)
entries for both the current and the next indexDB.
- Consolidate the code responsible for idbNext pre-filling into prefillNextIndexDB() function.
This improves code readability and maintainability a bit.
- Rewrite and simplify the code responsible for calculating the next retention timestamp.
Add various tests for corner cases of this code.
- Remove indexdb pre-filling from RegisterMetricNames() function, since this function is rarely called.
It is OK to add indexdb entries on demand in this function. This simplifies the code.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1401
* docs/CHANGELOG.md: refer to https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4563
---------
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
* app/vmalert/datasource/graphite: allow overriding "from" parameter for datasource queries
Fixes construction of URL parameters for graphite render to allow overriding "from" parameter.
See: https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4685
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* app/vmalert/datasource/graphite: update flow for building URL parameters
Makes flow of building URL parameters same as Prometheus datasource has:
1) Setting all default values
2) Merging those values with provided `extraParams`
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* Update docs/CHANGELOG.md
Co-authored-by: Roman Khavronenko <roman@victoriametrics.com>
---------
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
Co-authored-by: Roman Khavronenko <roman@victoriametrics.com>
- Improve docs
- Hide `debug relabeling` column when -promscrape.dropOriginalLabels command-line flag is set
- Inline the code from the added template functions, since the code is harder to follow
with the template functions, especially when these functions have misleading names.
Also, these functions are used only in one place, e.g. they do not reduce the amounts of code.
- Hide `click to show original labels` title at `labels` column when original labels aren't available.
- Show the reason on whey original labels aren't available at /service-discovery page.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4597
- Parse protobuf if Content-Type isn't set to `application/json` - this behavior is documented at https://grafana.com/docs/loki/latest/api/#push-log-entries-to-loki
- Properly handle gzip'ped JSON requests. The `gzip` header must be read from `Content-Encoding` instead of `Content-Type` header
- Properly flush all the parsed logs with the explicit call to vlstorage.MustAddRows() at the end of query handler
- Check JSON field types more strictly.
- Allow parsing Loki timestamp as floating-point number. Such a timestamp can be generated by some clients,
which store timestamps in float64 instead of int64.
- Optimize parsing of Loki labels in Prometheus text exposition format.
- Simplify tests.
- Remove lib/slicesutil, since there are no more users for it.
- Update docs with missing info and fix various typos. For example, it should be enough to have `instance` and `job` labels
as stream fields in most Loki setups.
- Allow empty of missing timestamps in the ingested logs.
The current timestamp at VictoriaLogs side is then used for the ingested logs.
This simplifies debugging and testing of the provided HTTP-based data ingestion APIs.
The remaining MAJOR issue, which needs to be addressed: victoria-logs binary size increased from 13MB to 22MB
after adding support for Loki data ingestion protocol at https://github.com/VictoriaMetrics/VictoriaMetrics/pull/4482 .
This is because of shitty protobuf dependencies. They must be replaced with another protobuf implementation
similar to the one used at lib/prompb or lib/prompbmarshal .
* app/vmagent: fix creating target id if `--promscrape.dropOriginalLabels` flag was used
* app/vmagent: hide links if OriginalLabels was dropped
* app/vmagent: update CHANGELOG.md and added information to the docs
* app/vmagent: fix comments
* app/vlinsert: add support of loki push protocol
- implemented loki push protocol for both Protobuf and JSON formats
- added examples in documentation
- added example docker-compose
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* app/vlinsert: move protobuf metric into its own file
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* deployment/docker/victorialogs/promtail: update reference to docker image
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* deployment/docker/victorialogs/promtail: make volume name unique
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* app/vlinsert/loki: add license reference
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* deployment/docker/victorialogs/promtail: fix volume name
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* docs/VictoriaLogs/data-ingestion: add stream fields for loki JSON ingestion example
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* app/vlinsert/loki: move entities to places where those are used
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* app/vlinsert/loki: refactor to use common components
- use CommonParameters from insertutils
- stop ingestion after first error similar to elasticsearch and jsonline
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* app/vlinsert/loki: address review feedback
- add missing logstorage.PutLogRows calls
- refactor tenant ID parsing to use common function
- reduce number of allocations for parsing by reusing logfields slices
- add tests and benchmarks for requests processing funcs
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
---------
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
- Add `Active queries` chapter to VMUI docs
- Set `Content-Type: json` header inside promql.WriteActiveQueries() handler,
in order to be consistent with other request handlers called at app/vmselect/main.go
- Pass the request to promql.WriteActiveQueries() handler, so it can change its output
depending on the provided request params. This also improves consistency of
promql.WriteActiveQueries() args with other request hanlers at app/vmselect/main.go
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/4653
* feat: add page to display a list of active queries (#4598)
* app/vmagent: code formatting
* fix: remove console
---------
Co-authored-by: dmitryk-dk <kozlovdmitriyy@gmail.com>
The following syntax is supported: _time:filter offset off
For example:
- _time:5m offset 1h - 5-minute duration one hour before the current time
- _time:2023 offset 2w - 2023 year with the 2 weeks offset in the past
The `a op b keep_metric_names` is ambigouos to `a op (b keep_metric_names)` when `b` is a transform or rollup function.
For example, `a + rate(b) keep_metric_names`. So it is better to use more clear syntax: `(a op b) keep_metric_names`
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3710
* metricsql: add support of using keep_metric_names for binary operations
This should help to avoid confusion with queries like one in the issue #3710.
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* wip
---------
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
Previously all the newly ingested time series were registered in global `MetricName -> TSID` index.
This index was used during data ingestion for locating the TSID (internal series id)
for the given canonical metric name (the canonical metric name consists of metric name plus all its labels sorted by label names).
The `MetricName -> TSID` index is stored on disk in order to make sure that the data
isn't lost on VictoriaMetrics restart or unclean shutdown.
The lookup in this index is relatively slow, since VictoriaMetrics needs to read the corresponding
data block from disk, unpack it, put the unpacked block into `indexdb/dataBlocks` cache,
and then search for the given `MetricName -> TSID` entry there. So VictoriaMetrics
uses in-memory cache for speeding up the lookup for active time series.
This cache is named `storage/tsid`. If this cache capacity is enough for all the currently ingested
active time series, then VictoriaMetrics works fast, since it doesn't need to read the data from disk.
VictoriaMetrics starts reading data from `MetricName -> TSID` on-disk index in the following cases:
- If `storage/tsid` cache capacity isn't enough for active time series.
Then just increase available memory for VictoriaMetrics or reduce the number of active time series
ingested into VictoriaMetrics.
- If new time series is ingested into VictoriaMetrics. In this case it cannot find
the needed entry in the `storage/tsid` cache, so it needs to consult on-disk `MetricName -> TSID` index,
since it doesn't know that the index has no the corresponding entry too.
This is a typical event under high churn rate, when old time series are constantly substituted
with new time series.
Reading the data from `MetricName -> TSID` index is slow, so inserts, which lead to reading this index,
are counted as slow inserts, and they can be monitored via `vm_slow_row_inserts_total` metric exposed by VictoriaMetrics.
Prior to this commit the `MetricName -> TSID` index was global, e.g. it contained entries sorted by `MetricName`
for all the time series ever ingested into VictoriaMetrics during the configured -retentionPeriod.
This index can become very large under high churn rate and long retention. VictoriaMetrics
caches data from this index in `indexdb/dataBlocks` in-memory cache for speeding up index lookups.
The `indexdb/dataBlocks` cache may occupy significant share of available memory for storing
recently accessed blocks at `MetricName -> TSID` index when searching for newly ingested time series.
This commit switches from global `MetricName -> TSID` index to per-day index. This allows significantly
reducing the amounts of data, which needs to be cached in `indexdb/dataBlocks`, since now VictoriaMetrics
consults only the index for the current day when new time series is ingested into it.
The downside of this change is increased indexdb size on disk for workloads without high churn rate,
e.g. with static time series, which do no change over time, since now VictoriaMetrics needs to store
identical `MetricName -> TSID` entries for static time series for every day.
This change removes an optimization for reducing CPU and disk IO spikes at indexdb rotation,
since it didn't work correctly - see https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1401 .
At the same time the change fixes the issue, which could result in lost access to time series,
which stop receving new samples during the first hour after indexdb rotation - see https://github.com/VictoriaMetrics/VictoriaMetrics/issues/2698
The issue with the increased CPU and disk IO usage during indexdb rotation will be addressed
in a separate commit according to https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1401#issuecomment-1553488685
This is a follow-up for 1f28b46ae9
* app/vmctl: fix panic `--remote-read-filter-time-start` flag not defined
* app/vmctl: update CHANGELOG.md
---------
Co-authored-by: Nikolay <nik@victoriametrics.com>
It could happen for low evaluation intervals and irregular
delays during execution that evaluation time would get
a negative offset. This could result into cumulative
discrepancy between the actual time and evaluation time for rules.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* docs: make `httpAuth.*` flags description less ambiguous
Currently, it may confuse users whether `httpAuth.*` flags are used by HTTP client or server configuration(see https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4586 for example).
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* docs: fix a typo
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
---------
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
- Clarify docs about -replicationFactor command-line flag at vmselect
- Clarify description for -replicationFactor and -search.skipSlowReplicas command-line flags
- Fix the logic for returning responses if -search.skipSlowReplicas command-line flag
is enabled. The logic was broken in the 173ccf4333,
so it could return responses only if some of vmstorage nodes return error,
while it should return when query results are successfully collected from more than
(len(storageNodes) - replicationFactor) vmstorage nodes.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1207
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/711
* vmselect: introduce `search.skipSlowReplicas` cmd-line flag
vmselect has two logical conditions during request processing when
`-replicationFactor` cmd-line flag is set:
1. If at least `len(storageNodes) - replicationFactor` responded, it could skip
waiting for the rest of nodes to respond. This could lead to problems described
here https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1207.
2. Mark response as partial if less than `len(storageNodes) - replicationFactor` responded
without an error.
The P1 showed itself error-prone and became the main reason why
`-replicationFactor` wasn't recommended to use at vmselect level.
However, this optimization could be still very useful in situations
when there are slow and fast replicas in cluster.
But P2 remains viable and important conditionless.
Hiding P1 behind the feature-flag `search.skipSlowReplicas`
should make `-replicationFactor` flag usable again. And let users
choose whether they want P1 to be respected.
Related issues
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1207https://github.com/VictoriaMetrics/VictoriaMetrics/issues/711
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* docs: update changelog
Signed-off-by: hagen1778 <roman@victoriametrics.com>
---------
Signed-off-by: hagen1778 <roman@victoriametrics.com>
vmalert: allow disabling of `step` param attached to instant queries
This might be useful for using vmalert with datasources that to not support this param,
unlike VictoriaMetrics.
See https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4573
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* fix removing storage data dir before restoring from backup
Signed-off-by: Alexander Marshalov <_@marshalov.org>
* fix review comment
Signed-off-by: Alexander Marshalov <_@marshalov.org>
* fix review comment
Signed-off-by: Alexander Marshalov <_@marshalov.org>
* fixes after merge with `enterprise-single-node` branch
Signed-off-by: Alexander Marshalov <_@marshalov.org>
---------
Signed-off-by: Alexander Marshalov <_@marshalov.org>
- Clarify the scope of the fix at docs/CHANGELOG.md
- Handle the case when -search.maxSamplesPerSeries limit is exceeded
in the same way as the -search.maxSamplesPerQuery limit.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/4472
The validation was needed for covering corner cases when storage is tested with data from 1970.
This resulted into unexpected search results, as year was parsed incorrectly from the given timestamp.
Co-authored-by: hagen1778 <roman@victoriametrics.com>
It is required for periods to be multiplies, but it was not stated clearly in documentation.
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
The change focuses on rectifying inconsistencies in the navigation behavior of the application
and eliminating issues encountered when manually altering the URL.
The key updates include:
- Refactoring of the routing mechanism to handle all possible routes and their states.
- Enhancement of the React Router usage to ensure a smoother navigation experience.
- Handling application state when the URL is manually changed.
Added a line for `probeSelector` and links to objects selectors section to make it easier to find more details about selectors.
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
expose `vmauth_user_request_duration_seconds`
and `vmauth_unauthorized_user_request_duration_seconds` summary metrics
for measuring requests latency per user.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
It is impossible to run OS vmauth with the provided config.
The example of using ip filters should be only a part of docs.
All other examples should work seamlessly with OS version.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* docs: add scroll to the selected element
* docs: scroll to root if element not found
* docs: simplify code
* docs: code cleanup
* docs: fix comments (fix code formatting, check element only inside sidebar container)
By default, vmalert will make multiple retry attempts with exponential delay.
The total time spent during retry attempts shouldn't exceed `-remoteWrite.retryMaxTime` (default is 30s).
When retry time is exceeded vmalert drops the data dedicated for `-remoteWrite.url`.
Before, vmalert dropped data after 5 retry attempts with 1s delay between attempts (not configurable).
See `-remoteWrite.retryMinInterval` and `-remoteWrite.retryMaxTime` cmd-line flags.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Co-authored-by: Nikolay <nik@victoriametrics.com>
This simplifies routing at auth proxies such as vmauth to vlselect component,
which serves VMUI - just route all the requests, which start with /select/, to vlselect.
If the `Content-Type: application/json` request header isn't set,
then the server can improperly consume the request body when parsing request parameters
vmalert: retry all errors except 4XX status codes
Retry all errors except 4XX status codes while pushing via remote-write
to the remote storage. Previously, errors like broken connection could
prevent vmalert from retrying the request.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* fix: optimize the preparation of data for the graph
* fix: optimize tooltip rendering
* fix: optimize re-rendering of the chart
* vmui: memory leak fix
* lib/storage: creates parts.json on start-up if it not exists.
It fixes migrations from versions below v1.90.0.
Previously parts.json was created only after successful merge.
But if merge was interruped for some reason (OOM or shutdown), parts.json wasn't created and partitions left after interruped merge weren't properly deleted.
Since VM cannot check if it must be removed or not.
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4336
* Apply suggestions from code review
Co-authored-by: Roman Khavronenko <roman@victoriametrics.com>
* Update lib/storage/partition.go
Co-authored-by: Roman Khavronenko <roman@victoriametrics.com>
---------
Co-authored-by: Roman Khavronenko <roman@victoriametrics.com>
- Clarify the scope of the fix at docs/CHANGELOG.md
- Handle the case when -search.maxSamplesPerSeries limit is exceeded
in the same way as the -search.maxSamplesPerQuery limit.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/4472
Properly return the error to user when `-search.maxSamplesPerQuery` limit is exceeded.
Before, user could have received a partial response instead.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
app/vmstorage: clarify the min value for `-retentionPeriod` flag
Co-authored-by: Roman Khavronenko <roman@victoriametrics.com>
(cherry picked from commit 24f34347f1)
Signed-off-by: hagen1778 <roman@victoriametrics.com>
app/vmagent/remotewrite: fix vmagent panic on shutdown
Currently, when vmagent is stopping it first flushes pending series in remote write context and proceeds to stop streaming aggregation. This leads to streaming aggregation being unable to write results into pending timeseries (since it is already nil) and panic.
This can lead to losing some aggregation results being lost almost silently.
The fix is reordering flow to first stop streaming aggregation and flush all pending time series after that.
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
(cherry picked from commit ce7141383d)
* app/vmctl: add verbose output for docker installations or when TTY isn't available
* app/vmctl: fix tests
* app/vmctl: make vmctl interactive if no tty
* app/vmctl: cleanup
* app/vmctl: add comment
---------
Co-authored-by: Nikolay <nik@victoriametrics.com>
(cherry picked from commit fc5292d8ed)
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* vmalert: fix nil map assignment
The storage instance with nil map params was created for remote-read purposes.
And before change 7a9ae9de0d this map was ignored in ApplyParams.
Now, it started to be used and vmalert panics in runtime.
The fix properly inits map for at `NewVMStorage` and verifies it is not nil
on assignment in `ApplyParams`.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* vmalert: add to changelog
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* vmalert: properly clone Storage params
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* vmalert: properly clone Storage params
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* vmalert: properly clone Storage params
Signed-off-by: hagen1778 <roman@victoriametrics.com>
---------
Signed-off-by: hagen1778 <roman@victoriametrics.com>
(cherry picked from commit de94812088)
* app/vmui: fix behavior when changing url in global settings
* app/vmctl: minor fix
* app/vmui: fix behavior when changing url in global settings
(cherry picked from commit 9843ec0e1d)
The purpose of the change is too highlight what HA pair is
and how deduplication needs identical labels to be present
in raw samples.
See https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4367
Signed-off-by: hagen1778 <roman@victoriametrics.com>
(cherry picked from commit 8185c2466c)
Signed-off-by: hagen1778 <roman@victoriametrics.com>
The update should make understanding of multi-tenancy more clear
for influxdb users.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
(cherry picked from commit 1c3f50f791)
Signed-off-by: hagen1778 <roman@victoriametrics.com>
This reverts the following commits:
- e0e16a2d36
- 2ce02a7fe6
The reason for revert: the updated logic breaks assumptions made
when fixing https://github.com/VictoriaMetrics/VictoriaMetrics/issues/2698 .
For example, if a time series stop receiving new samples during the first
day after the indexdb rotation, there are chances that the time series
won't be registered in the new indexdb. This is OK until the next indexdb
rotation, since the time series is registered in the previous indexdb,
so it can be found during queries. But the time series will become invisible
for search after the next indexdb rotation, while its data is still there.
There is also incompletely solved issue with the increased CPU and disk IO resource
usage just after the indexdb rotation. There was an attempt to fix it, but it didn't fix
it in full, while introducing the issue mentioned above. See https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1401
TODO: to find out the solution, which simultaneously solves the following issues:
- increased memory usage for setups high churn rate and long retention (e.g. what the reverted commit does)
- increased CPU and disk IO usage during indexdb rotation ( https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1401 )
- https://github.com/VictoriaMetrics/VictoriaMetrics/issues/2698
- Document the change at docs/CHANGELOG.md
- Clarify comments for non-trivial code touched by the commit
- Improve the logic behind maybeCreateIndexes():
- Correctly create per-day indexes if the indexdb rotation is performed during
the first hour or the last hour of the day by UTC.
Previously there was a possibility of missing index entries on that day.
- Increase the duration for creating new indexes in the current indexdb for up to 22 hours
after indexdb rotation. This should reduce the increased resource usage
after indexdb rotation.
It is safe to postpone index creation for the current day until the last hour
of the current day after indexdb rotation by UTC, since the corresponding (date, ...)
entries exist in the previous indexdb.
- Search for TSID by (date, MetricName) in both the current and the previous indexdb.
Previously the search was performed only in the current indexdb. This could lead
to excess creation of per-day indexes for the current day just after indexdb rotation.
- Search for (date, metricID) entries in both the current and the previous indexdb.
Previously the search was performed only in the current indexdb. This could lead
to excess creation of per-day indexes for the current day just after indexdb rotation.
* added backup locking/unlocking against retention policy to vmbackupmanager
Signed-off-by: Alexander Marshalov <_@marshalov.org>
* added docs for new commands
Signed-off-by: Alexander Marshalov <_@marshalov.org>
* fix review comments
Signed-off-by: Alexander Marshalov <_@marshalov.org>
---------
Signed-off-by: Alexander Marshalov <_@marshalov.org>
* feat: improvement of the top queries page
* vmui/docs: enhancements to top queries page
* Apply suggestions from code review
---------
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
vmui: change default font size to 14px for better readability
vmui: fix bug with missing text on buttons in safari
---------
Co-authored-by: Roman Khavronenko <roman@victoriametrics.com>
* app/vmui: added Labels with the highest number of unique values
* app/vmui: cleanup
* app/vmui: cleanup
* app/vmui: add table description
* app/vmui: fix comment, updated CHANGELOG.md
* app/vmui: disable links
* app/vmui: added actions to the table, it will show values for selected label with the highest number of series
* app/vmui: fix comment
* vmalert: expand rule groups on anchor click
before, anchor click was only updating the URL.
To expand the group, user had to click on rule's block.
Now, group will toggle automatically.
* vmalert: allow filtering group in web UI
The new filter allows to filter groups and rules within
groups by: errors only or noMatch only.
The filtering supposed to help navigating big numbers of groups/rules.
Filtering is reflected in URL, so can be shared as a link.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Without reset, labels duplicates could have been added during stream aggregation.
Since `ctx.Labels` is reused during processing of many series, each series will
add its labels to the context. Even if the same labels were already addeded on prev
iteration. Now, we reset `ctx.Labels` on each iteration to contain so labels from
different series didn't interfere.
This could have cause exceeding of the limit on number of labels per pushed time series.
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4277
Signed-off-by: hagen1778 <roman@victoriametrics.com>
app/vmalert: detect alerting rules which don't match any series at all
vmalert starts to understand /query responses which contain object:
```
"stats":{"seriesFetched": "42"}
```
If object is present, vmalert parses it and populates a new field
`SeriesFetched`. This field is then used to populate the new metric
`vmalert_alerting_rules_last_evaluation_series_fetched` and to
display warnings in the vmalert's UI.
If response doesn't contain the new object (Prometheus or
VictoriaMetrics earlier than v1.90), then `SeriesFetched=nil`.
In this case, UI will contain no additional warnings.
And `vmalert_alerting_rules_last_evaluation_series_fetched` will
be set to `-1`. Negative value of the metric will help to compile
correct alerting rule in follow-up.
Thanks for the initial implementation to @Haleygo
See https://github.com/VictoriaMetrics/VictoriaMetrics/pull/4056
See https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4039
Signed-off-by: hagen1778 <roman@victoriametrics.com>
It appears that 90% usage for anonymous mem usage
is already concerning. So we lowering the threshold to 80%.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
When using `retentionTimezoneOffset` and having local timezone being more than 4 hours different from UTC indexdb retention calculation could return negative value. This caused indexdb rotation to get in loop.
Fix calculation of offset to use `retentionTimezoneOffset` value properly and add test to cover all legit timezone configs.
See:
- https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4207
- https://github.com/VictoriaMetrics/VictoriaMetrics/pull/4206
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
Co-authored-by: Nikolay <nik@victoriametrics.com>
Windows doesn't allow to remove dir with opened files. Usually it's a case for snapshots, hard cannot be removed if file is openned.
With this change, dir will be renamed and properly deleted at the next process start.
It's recommended to restart vmstorage/vmsingle for snapshots deletion completion periodically.
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/70
* lib/promscrape/discovery/kubernetes: add common labels to all ports discovered from endpoints
Sets
`__meta_kubernetes_endpoints_name` and `__meta_kubernetes_namespace` labels to all ports of pod.
Prometheus sets those labels to all ports in pod (0ab9553611/discovery/kubernetes/endpoints.go (L267C15-L269)) even if port is not matching any service.
See: #4154
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* lib/promscrape/discovery/kubernetes: fix test for updated discovery logic
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
---------
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
Properly return empty slices instead of nil for `/api/v1/rules` and `/api/v1/alerts` API handlers.
This improves compatibility with Grafana.
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4221
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Supports using `**` for `-rule` and `-rule.templates`: `dir/**/*.tpl` loads contents of dir and all subdirectories recursively.
See: #4041
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
Co-authored-by: Artem Navoiev <tenmozes@gmail.com>
Co-authored-by: Nikolay <nik@victoriametrics.com>
* lib/promscrape: adds filter for consul_sd_configs:
it allows advanced filtering for consul service discovery requests
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4183
* typo fix
* removes deprecation mentions since it's not relevant
* Update docs/CHANGELOG.md
Co-authored-by: Roman Khavronenko <roman@victoriametrics.com>
---------
Co-authored-by: Roman Khavronenko <roman@victoriametrics.com>
Templating of `-external.alert.source` is not expected to have access to the query which was causing runtime error when query function was passed as nil.
See: #4181
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* app/vmagent,lib/persistentqueue: show warning message if `--remoteWrite.maxDiskUsagePerURL` flag lower than 500MB
* app/vmagent,lib/persistentqueue: linter fix
* app/vmagent,lib/persistentqueue: fix comment
* feat: display heatmap in the explore metrics (#4111)
* fix: correct calc step for heatmap
* fix: remove spaces in the result of getDurationFromMilliseconds
* feat: add button "show today" to date picker
* feat: add comparison with the prev day (#3967)
* vmui/docs: add comparison of data to cardinality page
* feat: add WithTemplate page
* app/vmselect/prometheus: enable json mode for expand with expr API
* app/vmselect/prometheus: enable CORS and add content type
* feat: add api for expand with templates
* fix: remove console from useExpandWithExprs
* app/vmselect/prometheus: fix escaping
* vmui: integrate WITH template
* app/vmctl: check content type instead of form param
* fix: add content-type for fetch with-exprs
* fix: add a header to the server's response that allows the "Content-Type" header
* app/vmctl: added comment and cleanup
* app/vmctl: use format query param
---------
Co-authored-by: dmitryk-dk <kozlovdmitriyy@gmail.com>
* app/vmctl: add support for the different time format in the native binary protocol
* app/vmctl: update flag description, update CHANGELOG.md
* app/vmctl: add comment to exported function
* lib/httpserver: introduce `-http.maxConcurrentRequests` command-line flag
Introduce `-http.maxConcurrentRequests` command-line flag to protect
VM components from resource exhaustion during unexpected spikes of HTTP requests.
By default, the new flag's value is set to 0 which means no limits are applied.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* lib/httpserver: mention http.maxConcurrentRequests in docs
Signed-off-by: hagen1778 <roman@victoriametrics.com>
---------
Signed-off-by: hagen1778 <roman@victoriametrics.com>
- added info about metric `vm_vminsert_metrics_read_total`,
- small doc refactoring
- and added make-command for running docs in docker.
Signed-off-by: Alexander Marshalov <_@marshalov.org>
* vmalert: retry datasource requests with EOF or unexpected EOF errors
Retry failed read request on the closed connection one more time.
This may improve rules execution reliability when connection
between vmalert and datasource closes unexpectedly.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* vmalert: fix old tests
Signed-off-by: hagen1778 <roman@victoriametrics.com>
---------
Signed-off-by: hagen1778 <roman@victoriametrics.com>
This handler will instruct search engines that indexing is not allowed for the content exposed to the internet. This should help to address issues like #4128 when instances are exposed to the internet without authentication.
"How do We Keep Metrics for a Long Time in VictoriaMetrics" article is referenced twice in "Third-party articles and slides about VictoriaMetrics" section
Improperly configured -bigMergeConcurrency command-line flag usually leads to uncontrolled
growth of unmerged parts, which, in turn, increases CPU usage and query durations.
So it is better deprecating this flag. In rare cases -smallMergeConcurrency command-line flag
can be used instead for controlling the concurrency of background merges.
This makes it easier to understand exact point in time which is included in this backup.
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
The logic employed for re-using the previously loaded scrape target was broken initially.
The commit cc0427897c tried to fix it, but the new logic
became too complex and fragile. So it is better to just remove this logic,
since the targets from temporarily broken file should be eventually loaded on next
attempts every -promscrape.fileSDCheckInterval
This also allows removing fragile hacks around __vm_filepath label.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3989
* lib/promscrape: fix the problem with scrape work duplicates when file_sd_config can't be read
* lib/promscrape: clarified comment
* lib/promscrape: made better approach to handle a problem with growing []*ScrapeWork on each error when loading config
* lib/promscrape: added CHANGELOG.md
* Update docs/CHANGELOG.md
---------
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
* feat: add tips for working with the graph and legend
* feat: add the ability to collapse the legend
* vmui/docs: add the ability to collapse the legend
---------
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
* lib/storage: check for free disk space before opening tables
We check for free disk space before call to `openTable`,
so `Storage` can be set to ReadOnly before mergeWorkers start.
Before the change, there was a chance that merges will start
even if Storage has to start in ReadOnly mode because of
`-storage.minFreeDiskSpaceBytes` limit.
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4023
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* lib/storage: chore
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* Update lib/storage/storage.go
---------
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
- Make sure that the last successfully loaded config is used on hot-reload failure
- Properly cleanup resources occupied by already initialized aggregators
when the current aggregator fails to be initialized
- Expose distinct vmagent_streamaggr_config_reload* metrics per each -remoteWrite.streamAggr.config
This should simplify monitoring and debugging failed reloads
- Remove race condition at app/vminsert/common.MustStopStreamAggr when calling sa.MustStop() while sa
could be in use at realoadSaConfig()
- Remove lib/streamaggr.aggregator.hasState global variable, since it may negatively impact scalability
on system with big number of CPU cores at hasState.Store(true) call inside aggregator.Push().
- Remove fine-grained aggregator reload - reload all the aggregators on config change instead.
This simplifies the code a bit. The fine-grained aggregator reload may be returned back
if there will be demand from real users for it.
- Check -relabelConfig and -streamAggr.config files when single-node VictoriaMetrics runs with -dryRun flag
- Return back accidentally removed changelog for v1.87.4 at docs/CHANGELOG.md
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3639
Verifying status code helps to avoid misleading errors caused by attempt to parse unsuccessful response.
Related issue: #4034
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
- Compare directory names instead of paths to directory when determining which persistent queues must be deleted
This is less error-prone solution, since paths to the same directory can differ, which could lead
to accidental directory removal for the existing -remoteWrite.url
- Log the `removed %d dangling queues` message when at least a single queue has been removed
- Consistently use filepath.Join() for creating paths to persistent queues.
This is needed for Windows support (see https://github.com/VictoriaMetrics/VictoriaMetrics/issues/70 )
- Clarify the description of the change at docs/CHANGELOG.md
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4014
- Allocate and initialize seriesByWorkerID slice in a single go instead
of initializing every item in the list separately.
This should reduce CPU usage a bit.
- Properly set anti-false sharing padding at timeseriesWithPadding structure
- Document the change at docs/CHANGELOG.md
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3966
* allowed using dashes and dots in environment variables names for templating config files with envtemplate (#3999)
Signed-off-by: Alexander Marshalov <_@marshalov.org>
* Apply suggestions from code review
---------
Signed-off-by: Alexander Marshalov <_@marshalov.org>
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
Setting up VMAnomaly on NodeExporter metrics with VictoriaMetrics and AlertManager.
* vmanomaly-guide-draft
* aletr graphs and description
* readme vmanomaly tutorial
* Added back fit_every param for performance
* vmanomaly guide fixes
* added spaces div
* spaces + resize image
* alert example grammar
* quotation marks
* docker link
* typo fixed
* more links
* reader section rephrased
* label change
* lower case for grafana service
* lower case for vm service
* yaml markdown
---------
Co-authored-by: Dima Lazerka <dima@victoriametrics.com>
* lib/netutil: log only parsing errors for proxy-protocol
Previosly every error was logged. With configured TCP health checks at load-balancer or kubernetes, vmauth spams a lot of false positive error message into logs
* Update docs/CHANGELOG.md
Co-authored-by: Roman Khavronenko <roman@victoriametrics.com>
* Update lib/netutil/tcplistener.go
Co-authored-by: Roman Khavronenko <roman@victoriametrics.com>
---------
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
Co-authored-by: Roman Khavronenko <roman@victoriametrics.com>
This commit changes background merge algorithm, so it becomes compatible with Windows file semantics.
The previous algorithm for background merge:
1. Merge source parts into a destination part inside tmp directory.
2. Create a file in txn directory with instructions on how to atomically
swap source parts with the destination part.
3. Perform instructions from the file.
4. Delete the file with instructions.
This algorithm guarantees that either source parts or destination part
is visible in the partition after unclean shutdown at any step above,
since the remaining files with instructions is replayed on the next restart,
after that the remaining contents of the tmp directory is deleted.
Unfortunately this algorithm doesn't work under Windows because
it disallows removing and moving files, which are in use.
So the new algorithm for background merge has been implemented:
1. Merge source parts into a destination part inside the partition directory itself.
E.g. now the partition directory may contain both complete and incomplete parts.
2. Atomically update the parts.json file with the new list of parts after the merge,
e.g. remove the source parts from the list and add the destination part to the list
before storing it to parts.json file.
3. Remove the source parts from disk when they are no longer used.
This algorithm guarantees that either source parts or destination part
is visible in the partition after unclean shutdown at any step above,
since incomplete partitions from step 1 or old source parts from step 3 are removed
on the next startup by inspecting parts.json file.
This algorithm should work under Windows, since it doesn't remove or move files in use.
This algorithm has also the following benefits:
- It should work better for NFS.
- It fits object storage semantics.
The new algorithm changes data storage format, so it is impossible to downgrade
to the previous versions of VictoriaMetrics after upgrading to this algorithm.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3236
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3821
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/70
* app/vmagent: allow vm proto for kafka consumer and producer
it should reduce network usage up to 50%.
According to benchmarks without any encoding at kafka topic, it reduces traffic up to 50%.
With enabled zstd at kafka topic, it shows no diffence in traffic. So it
doesn't make much sense to use it.
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1225
* mention eb61a7dd68b834b08d01727a918f207700348ada at changelog
* app/vmagent: bumps kafka lib version
it allows compiling vmagent for arm64 machines
fixes https://github.com/VictoriaMetrics/VictoriaMetrics/issues/2271
* mention d19b1a888248c96cfd7ccee00ba6f596d89be1d7 at change log
* app/vmagent: adds natural concurrency for kafka consumer
it should improve performance for data consumption
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1957
* mention change 0c143bb22ca2e7e0b7eec9bc84a94ee2b41626ca
* Update app/vmagent/kafka/consumer.go
Co-authored-by: Roman Khavronenko <roman@victoriametrics.com>
* Update app/vmagent/kafka/consumer_cgo.go
Co-authored-by: Roman Khavronenko <roman@victoriametrics.com>
---------
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
Co-authored-by: Roman Khavronenko <roman@victoriametrics.com>
* vmalert: support concurrent reading from object storage
Config reading from GCS or S3 can be slow if object storage
contains a big number of files. Object storages are usually
fast for downloading and are slow for individual operations.
If there would be thousands of files to read, vmalert could
spend significant time for retrieving those because it is
done sequentially.
The change introduces ability to read configs from object
storage concurrently. By default, both GCS and S3 are now
read with 50 concurrent readers. This significantly reduces
the load time:
* loading 500 files with concurrency=1 takes 27s
* loading 500 files with concurrency=50 takes <1s
* vmalert: add note to Changelog
* vmalert: cleanup
* vmalert: use ticker properly
* app/vmalert: improve status reporting during config loading
* vmalert: support concurrent reading from object storage
Config reading from GCS or S3 can be slow if object storage
contains a big number of files. Object storages are usually
fast for downloading and are slow for individual operations.
If there would be thousands of files to read, vmalert could
spend significant time for retrieving those because it is
done sequentially.
The change introduces ability to read configs from object
storage concurrently. By default, both GCS and S3 are now
read with 50 concurrent readers. This significantly reduces
the load time:
* loading 500 files with concurrency=1 takes 27s
* loading 500 files with concurrency=50 takes <1s
* app/vmalert: make linter happy
The change also introduces `List` method to `FS` interface.
The `List` method can be used for wildcard support in object storage FS.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Co-authored-by: Nikolay <nik@victoriametrics.com>
- Sync the description for -httpListenAddr.useProxyProtocol command-line flag at vmagent and vmauth,
so it is consistent with the description at vmauth and victoria-metrics
- Add a sample of panic text to docs/CHANGELOG.md, so it could be googled
- Mention the -httpListenAddr.useProxyProtocol command-line flag in the description for the bugfix
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3335
app/vmctl: vm-native - split migration on per-metric basis
`vm-native` mode now splits the migration process on per-metric basis.
This allows to migrate metrics one-by-one according to the specified filter.
This change allows to retry export/import requests for a specific metric and provides a better
understanding of the migration progress.
---------
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Co-authored-by: hagen1778 <roman@victoriametrics.com>
lib{mergset,storage}: prevent possible race condition with logging stats for merges
Previously partwrapper could be release by background process and reference for part may be invalid
during logging stats. It will lead to panic at vmstorage
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3897
- Use flag.Duration instead of flagutil.Duration for -snapshotCreateTimeout,
since the flagutil.Duration is intended mostly for big durations, e.g. days, months and years,
while the -snapshotCreateTimeout is usually smaller than one hour.
- Add links to https://docs.victoriametrics.com/#how-to-work-with-snapshots in docs/CHANGELOG.md,
so readers could easily find the corresponding docs when reading the changelog.
- Properly remove all the created directories on unsuccessful attempt to create
snapshot in Storage.CreateSnapshot().
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3551
* lib/{fs,mergeset,storage}: skip `.must-remove.` dirs when creating snapshot (#3858)
* lib/{mergeset,storage}: add timeout configuration for snapshots creation, remove incomplete snapshots from storage
* docs: fix formatting
* app/vmstorage: add metrics to track status of snapshots
* app/vmstorage: use `vm_http_requests_total` metric for snapshot endpoints metrics, rename new flag to make name more clear
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* app/vmstorage: update flag name in docs
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* app/vmstorage: reflect new metrics names change in docs
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
---------
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
* lib/promscrape: set `vm_promscrape_config_last_reload_successful` to 1 if there was no promscrape config provided
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* lib/promscrape: register `vm_promscrape_config_*` metrics only in case promscrape config is used
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
---------
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
* vmselect/promql: check for deadline in `count_values` fn
`count_values` could be very slow during the data processing.
Checking for deadline between iterations supposed to reduce
probability of exceeding `search.maxQueryDuration`.
The change also adds a new trace record, which captures the time
spent in aggregation function. Before that, the trace for aggr funcs
could be confusing since it doesn't account for all the places where
time was spent.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* wip
---------
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
* metricsql: support optional 2nd argument for rollup functions
Support optional 2nd argument `min`, `max` or `avg` for rollup functions:
* rollup
* rollup_delta
* rollup_deriv
* rollup_increase
* rollup_rate
* rollup_scrape_interval
If second argument is passed, then rollup function will return only the selected aggregation type.
This change can be useful for situations where only one type of rollup calculation is needed.
For example, `rollup_rate(requests_total[5m], "max")`.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* wip
---------
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
- Return immediately on context cancel during the backoff sleep.
This should help with https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3747
- Add a comment describing why the second attempt to obtain the response from remote side
is perfromed immediately after the first attempt.
- Remove fasthttp dependency from lib/promscrape/discoveryutils
- Set context deadline before calling doRequestWithPossibleRetry().
This simplifies the doRequestWithPossibleRetry() a bit.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3293
* fix: do not use exponential backoff for first retry of scrape request (#3293)
* lib/promscrape: refactor `doRequestWithPossibleRetry` backoff to simplify logic
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* Update lib/promscrape/client.go
Co-authored-by: Roman Khavronenko <roman@victoriametrics.com>
* lib/promscrape: refactor `doRequestWithPossibleRetry` to make it more straightforward
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
---------
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
Co-authored-by: Roman Khavronenko <roman@victoriametrics.com>
- Do not generate __meta_server label, since it is unavailable in Prometheus.
- Add a link to https://docs.victoriametrics.com/sd_configs.html#kuma_sd_configs to docs/CHANGELOG.md,
so users could click it and read the docs without the need to search the corresponding docs.
- Remove kumaTarget struct, since it is easier generating labels for discovered targets
directly from the response returned by Kuma. This simplifies the code.
- Store the generated labels for discovered targets inside atomic.Value. This allows reading them
from concurrent goroutines without the need to use mutex.
- Use synchronouse requests to Kuma instead of long polling, since there is a little sense
in the long polling when the Kuma server may return 304 Not Modified response every -promscrape.kumaSDCheckInterval.
- Remove -promscrape.kuma.waitTime command-line flag, since it is no longer needed when long polling isn't used.
- Set default value for -promscrape.kumaSDCheckInterval to 30s in order to be consistent with Prometheus.
- Remove unnecessary indirections for string literals, which are used only once, in order to improve code readability.
- Remove unused fields from discoveryRequest and discoveryResponse.
- Update tests.
- Document why fetch_timeout and refresh_interval options are missing in kuma_sd_config.
- Add docs to discoveryutils.RequestCallback and discoveryutils.ResponseCallback,
since these are public types.
Side notes: it is weird that Prometheus implementation for kuma_sd_configs sets `instance` label,
since usually this label is set by the Prometheus itself to __address__ after the relabeling phase.
See https://www.robustperception.io/life-of-a-label/
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3389
See https://github.com/prometheus/prometheus/issues/7919
and https://github.com/prometheus/prometheus/pull/8844
as a reference implementation in Prometheus
While at it, stop sending requests to unavailable backend for 3 seconds
before the next attempt. This should reduce the amounts of useless work
and the number of useless network packets when the backend is temporarily unavailable.
* app/vmauth: add concurent requests limit per auth record
* app/vmauth: added clarification comment
* app/vmauth: remove unused code
* app/vmauth: move read from limiter
* app/vmauth: fix text
* app/vmauth: fix comments
* - Clarify the docs for the max_concurrent_requests option at docs/vmauth.md
- Clarify the description of the change at docs/CHANGELOG.md
- Make sure that the -maxConcurrentRequests takes precedence over per-user max_concurrent_requests
- Update tests for verifying that the max_concurrent_requests option is parsed properly
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3346
---------
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
- Document the change at docs/CHANGELOG.md
- Add `Reading rules from object storage` section to docs/vmalert.md
- Add `s3` prefix to command-line flags related to the configuration of s3 and gcs clients
- Explicitly mention that reading rules from object storage is supported only in enterprise version
* vmalert: use group's ID in UI to avoid collisions
Identical group names are allowed. So we should used IDs
for various groupings and aggregations in UI.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* vmalert: prevent disabling state updates tracking
The minimum number of update states to track is now set to 1.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* vmalert: properly update `debug` and `update_entries_limit` params on hot-reload
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* vmalert: display `debug` field for rule in UI
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* vmalert: exclude `updates` field from json marhsaling
This field isn't correctly marshaled right now.
And implementing the correct marshaling for it doesn't
seem right, since json representation is mostly used
by systems like Grafana. And Grafana doesn't expect this
field to be present.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* fix test for disabled state
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* fix test for disabled state
Signed-off-by: hagen1778 <roman@victoriametrics.com>
---------
Signed-off-by: hagen1778 <roman@victoriametrics.com>
docs: clarifications between standalone/cluster ingestion endpoints
This is an attempt to make it a bit clearer to the user that the cluster version ingestion URLs are different from the standalone ones. I have also changed the order of the list items to make it a bit clearer and hopefully stop the user simply inferring that `/prometheus/api/v1` is only related to Prometheus data.
* vmalert: speed up state restore procedure on start
Alerts state restore procedure has been changed to become asynchronous.
It doesn't block groups start anymore which significantly improves vmalert's startup time.
Instead, state restore is called by each group in their goroutines after the first rules
evaluation.
While previously state restore attempt was made for all loaded alerting rules,
now it is called only for alerts which became active after the first evaluation.
This reduces the amount of API calls to the configured remote read URL.
This also means that `remoteRead.ignoreRestoreErrors` command-line flag becomes deprecated now
and will have no effect if configured.
See relevant issue https://github.com/VictoriaMetrics/VictoriaMetrics/issues/2608
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* make lint happy
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* Apply suggestions from code review
---------
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
This allows better controlling requests to backends and providing better error logging.
For example, if the backend was unavailable, then the ReverseProxy was logging the error
message without client ip and the initial request uri. This could harden debugging.
This is based on https://github.com/VictoriaMetrics/VictoriaMetrics/pull/3486
The docs/assets folder should be used only for assets specific to docs generation at https://docs.victoriametrics.com, e.g. css, js and images.
All the other assets related to specific docs should be placed in the same folder as the corresponding *.md file.
These assets should have the same name prefix as the corresponding doc file name. This simplifies tracking the lifetime of these assets.
For example, if the doc is removed, it is very easy to remove all assets associated with it with a simple `rm -rf docs/doc-name*` command.
This also simplifies generating correct urls for doc-specific assets from both https://docs.victoriametrics.com
and from https://github.com/VictoriaMetrics/VictoriaMetrics/blob/master/docs/ - just refer to the asset name without any directory prefixes.
* feat: include fonts in the build
* fix: reduce size fonts
* wip
- Document the change at docs/CHANGELOG.md
- Run `make vmui-update`
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
Allow users fine-tuning the maximum string length for interning via -internStringMaxLen command-line flag.
This may be used for fine-tuning RAM vs CPU usage for certain workloads.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3692
- Document the fix at docs/CHANGELOG.md
- Limit the concurrency for sendStaleMarkers() function in order to limit its memory usage
when big number of targets disappear and staleness markers are sent
for all the metrics exposed by these targets.
- Make sure that the writeRequestCtx is returned to the pool
when there is no need to send staleness markers.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3668
* feat: make the step input field global
* fix: correct get step from url
* fix: set minimumSignificantDigits to 1
* app/vmselect/vmui: `make vmui-update`
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>