* app/vmalert: update remote-write process
* automatically retry remote-write requests on closed connections. The change should reduce the amount of logs produced in environments with short-living connections or environments without support of keep-alive on network balancers.
* increment `vmalert_remotewrite_errors_total` metric if all retries to send remote-write request failed. Before, this metric was incremented only if remote-write client's buffer is overloaded.
* increment `vmalert_remotewrite_dropped_rows_total` amd `vmalert_remotewrite_dropped_bytes_total` metrics if remote-write client's buffer is overloaded. Before, these metrics were incremented only after unsuccessful HTTP calls.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* Update docs/CHANGELOG.md
---------
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Co-authored-by: Hui Wang <haley@victoriametrics.com>
Add `tls_insecure_skip_verify` option on per-user basis which allows to disable TLS verification for all requests to backend on behalf of this user.
See: https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5240
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
(cherry picked from commit 323f3720ed)
reduce the number of queries for restoring alerts state on start-up.
The change should speed up the restore process and reduce pressure on `remoteRead.url`.
(cherry picked from commit 90d45574bf)
* vmauth: add browser authorization request for http requests without credentials to a route that is not in the `unauthorized_user` section (when `unauthorized_user` is specified).
* add link to issue in CHANGELOG
* Extend vmauth docs
* wip
---------
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
This reduction is based on production testing.
Also expose -search.minWindowForInstantRollupOptimization command-line flag, so users could fine-tune this arg for their needs
vmalert expects string value for stats.seriesFetched, so it is impossible
switching to number without breaking compatibility with old vmalert releases :(
It is still unclear why stats.seriesFetched has string type in the first place...
Repeated instant queries with long lookbehind windows, which contain one of the following rollup functions,
are optimized via partial result caching:
- sum_over_time()
- count_over_time()
- avg_over_time()
- increase()
- rate()
The basic idea of optimization is to calculate
rf(m[d] @ t)
as
rf(m[offset] @ t) + rf(m[d] @ (t-offset)) - rf(m[offset] @ (t-d))
where rf(m[d] @ (t-offset)) is cached query result, which was calculated previously
The offset may be in the range of up to 1 hour.
The new metric gets increased each time `-search.logQueryMemoryUsage` memory limit
is exceeded by a query. This metric should help to identify expensive and heavy queries
without inspecting the logs.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
fix possible missing firing states for alerting rules in replay mode
Before if one firing stage is bigger than single query request range, like rule with a big `for`, alerting rule won't able to be detected as firing.
Co-authored-by: hagen1778 <roman@victoriametrics.com>
(cherry picked from commit abcb21aa5e)
support `Strict-Transport-Security`, `Content-Security-Policy` and `X-Frame-Options`
HTTP headers in all VictoriaMetrics components.
The values for headers can be specified by users via the following flags:
`-http.header.hsts`, `-http.header.csp` and `-http.header.frameOptions`.
Co-authored-by: hagen1778 <roman@victoriametrics.com>
(cherry picked from commit ad839aa492)
Signed-off-by: hagen1778 <roman@victoriametrics.com>
- Make sure that invalid/missing TLS CA file or TLS client certificate files at vmagent startup
don't prevent from processing the corresponding scrape targets after the file becomes correct,
without the need to restart vmagent.
Previously scrape targets with invalid TLS CA file or TLS client certificate files
were permanently dropped after the first attempt to initialize them, and they didn't
appear until the next vmagent reload or the next change in other places of the loaded scrape configs.
- Make sure that TLS CA is properly re-loaded from file after it changes without the need to restart vmagent.
Previously the old TLS CA was used until vmagent restart.
- Properly handle errors during http request creation for the second attempt to send data to remote system
at vmagent and vmalert. Previously failed request creation could result in nil pointer dereferencing,
since the returned request is nil on error.
- Add more context to the logged error during AWS sigv4 request signing before sending the data to -remoteWrite.url at vmagent.
Previously it could miss details on the source of the request.
- Do not create a new HTTP client per second when generating OAuth2 token needed to put in Authorization header
of every http request issued by vmagent during service discovery or target scraping.
Re-use the HTTP client instead until the corresponding scrape config changes.
- Cache error at lib/promauth.Config.GetAuthHeader() in the same way as the auth header is cached,
e.g. the error is cached for a second now. This should reduce load on CPU and OAuth2 server
when auth header cannot be obtained because of temporary error.
- Share tls.Config.GetClientCertificate function among multiple scrape targets with the same tls_config.
Cache the loaded certificate and the error for one second. This should significantly reduce CPU load
when scraping big number of targets with the same tls_config.
- Allow loading TLS certificates from HTTP and HTTPs urls by specifying these urls at `tls_config->cert_file` and `tls_config->key_file`.
- Improve test coverage at lib/promauth
- Skip unreachable or invalid files specified at `scrape_config_files` during vmagent startup, since these files may become valid later.
Previously vmagent was exitting in this case.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4959
It could return either `failed to read` or `failed to parse` errors depending
on whether the given url can be loaded or not under the current environment
* app/vmselect: limit the number of parallel workers by 32
The change should improve performance and memory usage during query processing
on machines with big number of CPU cores. The number of parallel workers for
query processing is controlled via `-search.maxWorkersPerQuery` command-line flag.
By default, the number of workers is limited by the number of available CPU cores,
but not more than 32. The limit can be increased via `-search.maxWorkersPerQuery`.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* wip
- The `-search.maxWorkersPerQuery` command-line flag doesn't limit resource usage,
so move it from the `resource usage limits` to `troubleshooting` chapter at docs/Single-server-VictoriaMetrics.md
- Make more clear the description for the `-search.maxWorkersPerQuery` command-line flag
- Add the description of `-search.maxWorkersPerQuery` to docs/Cluster-VictoriaMetrics.md
- Limit the maximum value, which can be passed to `-search.maxWorkersPerQuery`, to GOMAXPROCS,
because bigger values may worsen query performance and increase CPU usage
- Improve the the description of the change at docs/CHANGELOG.md. Mark it as FEATURE instead of BUGFIX,
since it is closer to a feature than to a bugfix.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5087
---------
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
* fix inconsistent behaviors with prometheus when scraping
1. address https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4959. skip job with wrong syntax in `scrape_configs` with error logs instead of exiting;
2. show error messages on vmagent /targets ui if there are wrong auth configs in `scrape_configs`, previously will print error logs and do scrape without auth header;
3. don't send requests if there are wrong auth configs in:
1. vmagent remoteWrite;
2. vmalert datasource/remoteRead/remoteWrite/notifier.
* add changelogs
* address review comments
* fix ut
Follow up 130e0ea5f0.
vmalert-tool can't be easily adapted for vmcluster now, cause it needs to set up the whole vmcluster[vminsert+vmstorage+vmselect] first.
You can use vmalert-tool to run unit tests for alerting and recording rules.
It will perform the following actions:
- sets up an isolated VictoriaMetrics instance;
- simulates the periodic ingestion of time series;
- queries the ingested data for recording and alerting rules evaluation like vmalert;
But component packages have functions that not exported and variables with same name, so to implement this for cluster will need amount of code refactor and doesn't look like a good thing to themselves.
So I want to remove it from the cluster branch.
c9375cac5e
Descriptions were updated in attempt to make it more clear for readers,
re-phrasing and linking missing docs.
`eval_delay` was added to tests to verify it can be unmarshalled.
`eval_delay` is now applied before timestamp alignment to make it more predictable.
Before, if delay < interval the timestamp won't be aligned.
`eval_delay` and `eval_offset` was added to API output.
`PreviouslySentSeriesToRW` converted to private `previouslySentSeriesToRW`.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
(cherry picked from commit a216fe6728)
This can be useful in the following queries:
drop_empty_series(temperature <= 30) default 40
This query drops temperature series with all the values bigger than 30 on the selected time range,
while replacing gaps in the remaining series with 40.
The query without drop_empty_series:
(temperature <= 30) default 40
would leave all the temperature series with all the values bigger than 30 on the selected time range,
and replace all their values with 40. This is not what could be epxected in some cases
like here - https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5071
This is a follow-up for f60c08a7bd
Changes:
- Make sure all the urls related to NewRelic protocol start from /newrelic . Previously some urls were started from /api/v1/newrelic
- Remove /api/v1 part from NewRelic urls, since it has no sense
- Remove automatic transformation from CamelCase to snake_case for NewRelic labels and metric names,
since it may complicate the transition from NewRelic to VictoriaMetrics. Preserve all the metric names and label names,
so users could query metrics and labels by the same names which are used in NewRelic.
The automatic transformation from CamelCase to snake_case can be added later as a special action for relabeling rules if needed.
- Properly update per-tenant data ingestion stats at app/vmagent/newrelic/request_handler.go . Previously it was always zero.
- Fix NewRelic urls in vmagent when multitenant data ingestion is enabled. Previously they were mistakenly started from `/`.
- Document NewRelic data ingestion url at docs/Cluster-VictoriaMetrics.md
- Remove superflouos memory allocations at lib/protoparser/newrelic
- Improve tests at lib/protoparser/newrelic/*
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3520
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/4712
- Make more clear the docs at docs/enterprise.md, so readers could figure out faster
on how to obtain enterprise key and how to pass it to VictoriaMetrics Enterprise components.
- Fix examples at docs/enterprise.md, which were referring to non-existing `-license-file` command-line flag.
The `-licenseFile` command-line flag must be used instead.
- Improve the description of `-license*` command-line flags, so users could understand
faster how to use them.
- Improve the warning message, which is emitted when the deprecated -eula command-line flag is passed,
so the user could figure out how to switch faster to -license* command-line flags.
- Disallow running VictoriaMetrics components with both -license and -licenseFile command-line flags.
- Disallow running VictoriaMetrics components when -licensFile points to an empty file.
- Consistently use the phrase "This flag is available only in Enterprise binaries" across
all the enterprise-specific command-line flags.
- Remove unneeded level of indirection for `noLicenseMessage` and `expiredMessage` string contants
in order to improve code readability and maintainability.
- Remove unneded `return` statements after `logger.Fatalf()` calls, since these calls exit the app and never return.
- Make sure that the info log message about successful license verification is emitted
when the license is verified successfully. Previously the error message could be logged
when the license payload is invalid or if it misses some required features.
- Reduce vertical space usage, so more information is available on the screen without the need to scroll.
- Show information for lines with higher values at the top of the legend under the graph.
This should simplify graph analysis when it contains many lines.
reduce lock contention for heavy aggregation requests
previously lock contetion may happen on machine with big number of CPU due to enabled string interning. sync.Map was a choke point for all aggregation requests.
Now instead of interning, new string is created. It may increase CPU and memory usage for some cases.
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5087
* vmalert: add `query_time_alignment` for rule group
1. add `eval_alignment` attribute for group which by default is true. So group rule query stamp will be aligned with interval and propagated to ALERT metrics and the messages for alertmanager;
2. deprecate `datasource.queryTimeAlignment` flag.
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5049
(cherry picked from commit 2aa0f5fc41)
Strip sensitive information such as auth headers or passwords from datasource, remote-read,
remote-write or notifier URLs in log messages or UI. This behavior is by default and is controlled via
`-datasource.showURL`, `-remoteRead.showURL`, `remoteWrite.showURL` or `-notifier.showURL` cmd-line flags.
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5044
(cherry picked from commit 244c887825)
* lib/promscrape: make concurrency control optional
Before, `-maxConcurrentInserts` was limiting all calls to `promscrape.Parse`
function: during ingestion and scraping. This behavior is incorrect.
Cmd-line flag `-maxConcurrentInserts` should have effect onl on ingestion.
Since both pipelines use the same `promscrape.Parse` function, we extend it
to make concurrency limiter optional. So caller can decide whether concurrency
should be limited or not.
This commit makes c53b5788b4
obsolete.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* Revert "dashboards: move `Concurrent inserts` panel to Troubleshooting section"
This reverts commit c53b5788b4.
---------
Signed-off-by: hagen1778 <roman@victoriametrics.com>
- Compare the actual free disk space to the value provided via -storage.minFreeDiskSpaceBytes
directly inside the Storage.IsReadOnly(). This should work fast in most cases.
This simplifies the logic at lib/storage.
- Do not take into account -storage.minFreeDiskSpaceBytes during background merges, since
it results in uncontrolled growth of small parts when the free disk space approaches -storage.minFreeDiskSpaceBytes.
The background merge logic uses another mechanism for determining whether there is enough
disk space for the merge - it reserves the needed disk space before the merge
and releases it after the merge. This prevents from out of disk space errors during background merge.
- Properly handle corner cases for flushing in-memory data to disk when the storage
enters read-only mode. This is better than losing the in-memory data.
- Return back Storage.MustAddRows() instead of Storage.AddRows(),
since the only case when AddRows() can return error is when the storage is in read-only mode.
This case must be handled by the caller by calling Storage.IsReadOnly()
before adding rows to the storage.
This simplifies the code a bit, since the caller of Storage.MustAddRows() shouldn't handle
errors returned by Storage.AddRows().
- Properly store parsed logs to Storage if parts of the request contain invalid log lines.
Previously the parsed logs could be lost in this case.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4737
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/4945
* lib/logstorage: switch to read-only mode when running out of disk space
Added support of `--storage.minFreeDiskSpaceBytes` command-line flag to allow graceful handling of running out of disk space at `--storageDataPath`.
See: https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4737
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* lib/logstorage: fix error handling logic during merge
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* lib/logstorage: fix log level
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
---------
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
Co-authored-by: Nikolay <nik@victoriametrics.com>
* vmui: update information about tsdb usage in cluster version
* vmui: cleanup
* vmui: add CHANGELOG.md
* vmui: cleanup
* vmui: update logic, move information to the visible place
* app/vmui: remove values fetch, update documentation for cardinality explorer
* app/vmui: update CHANGELOG.md
* docker-compose: add vmauth to cluster env
vmauth acts as a balancer and used as an example of how to interconnect
VM components via vmauth.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* docker-compose: add vmauth to cluster env
vmauth acts as a balancer and used as an example of how to interconnect
VM components via vmauth.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
---------
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Co-authored-by: Nikolay <nik@victoriametrics.com>
`median_over_time` is handled by predefined WITH template in MetricsQL library which translates it to `quantile_over_time(0.5)`
This makes it impossble to use `median_over_time` as a usual rollup function for `aggr_over_time`.
See: https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5034
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* expose metrics `vmauth_config_last_reload_*` for tracking the state of config reloads, similarly to vmagent/vmalert components.
* do not print logs like `SIGHUP received...` once per configured `-configCheckInterval` cmd-line flag. This log will be printed only if config reload was invoked manually.
* prevent configuration reloading if there were no changes in config. This improves memory usage when `-configCheckInterval` cmd-line flag is configured and config has extensive list of regexp expressions requiring additional memory on parsing.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
- Switch from summary to histogram for vl_http_request_duration_seconds metric.
This allows calculating request duration quantiles across multiple hosts
via histogram_quantile(0.99, sum(vl_http_request_duration_seconds_bucket) by (vmrange)).
- Take into account only successfully processed data ingestion requests
when updating vl_http_request_duration_seconds histogram.
Failed requests are ignored, since they may significantly skew measurements.
- Clarify the description of the change at docs/VictoriaLogs/CHANGELOG.md.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/4934
Adding limit on ingestion allows to avoid issues like this one https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4762
Such issues are often caused by misconfigurtion on log persing/ingestion side and preventing such rows from being ingested allows to avoid performance implications created by storing such log rows.
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
Adds `eval_offset` attribute for Groups.
If specified, Group will be evaluated at the exact time offset on the range of [0...evaluationInterval].
The setting might be useful for cron-like rules which must be evaluated at specific moments of time.
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3409
Signed-off-by: Haley Wang <pipilong.25@gmail.com>
Co-authored-by: hagen1778 <roman@victoriametrics.com>
(cherry picked from commit 45c0e4bb31)
* app/vminsert: properly close vmstorage connection
previously vmstorage may stuck in broken state until vminsert restarts
since vmstorage was marked as read-only and connection was broken to it.
checkReadonly function never marked connection as broken
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4870
* wip
---------
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
* feat: add cardinality support for prometheus (#4320)
* docs/CHANGELOG.md: add cardinality support for prometheus
---------
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
* added ability to set and clear response headers (#4825)
Signed-off-by: Alexander Marshalov <_@marshalov.org>
* added ability to set and clear response headers (#4825)
Signed-off-by: Alexander Marshalov <_@marshalov.org>
* fix review comment
Signed-off-by: Alexander Marshalov <_@marshalov.org>
---------
Signed-off-by: Alexander Marshalov <_@marshalov.org>
* app/vminsert: fixes readonly check
previously vminsert doesn't check readOnly state for vmstorage, since check was never performed for nil buffer
In this case every 30 second storage node loss readonly state and received some data.
It caused re-routing and possible slow down for ingestion
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4870
* wip
---------
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
Disallow parsing multitenant token at auth.NewToken().
Use auth.NewTokenPossibleMultitenant() at vminsert only. All the other callers should call auth.NewToken(),
since they do not support multitenant token.
This is a follow-up for f0c06b428e
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4910
Remove extra error message when auth token is nil. The default message
about unsupported path should be more clear to the user who mistakenly
requested /multitenant path.
f0c06b428e
Signed-off-by: hagen1778 <roman@victoriametrics.com>
app/vmselect: fix panic when using `/select/multitenant` endpoint
Such requests must be rejected as not found since vmselect does not support multitenant endpoint.
See: https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4910
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
- Document the change at docs/CHANGELOG.md
- Set the default value for -vmstorageUserTimeout to 3 seconds. This is much better
than the 0 value, which means that TCP connection to unreachable vmstorage could block
for up to 16 minutes.
- Document -vmstorageUserTimeout at docs/Cluster-VictoriaMetrics.md
`TCP_USER_TIMEOUT` (since Linux 2.6.37) specifies the maximum amount of
time that transmitted data may remain unacknowledged before TCP will
forcibly close the connection and return `ETIMEDOUT` to the application.
Setting a low TCP user timeout allows RPC connections quickly reroute
around unavailable storage nodes during network interruptions.
* vmagent: retry failed write request on the closed connection
Retry failed write request on the closed connection immediately,
without waiting for backoff. This should improve data delivery speed
and reduce amount of error logs emitted by vmagent when using idle connections.
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4139
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* vmagent: retry failed write request on the closed connection
Re-instantinate request before retry as body could have been already spoiled.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
---------
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Co-authored-by: Nikolay <nik@victoriametrics.com>
(cherry picked from commit 992a1c0a3a)
* vmalert: correctly re-instantinate HTTP req on retries
Previosly, request retry to datasource re-used existing HTTP request.
But if request object was already partially processed (body was read),
then retry will be unsuccessful.
The change re-instantinates HTTP request object before retry.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* vmalert: review fix
Signed-off-by: hagen1778 <roman@victoriametrics.com>
---------
Signed-off-by: hagen1778 <roman@victoriametrics.com>
(cherry picked from commit ddf87b32ed)