Commit graph

1705 commits

Author SHA1 Message Date
Aliaksandr Valialkin
12cd32fd75
lib/protoparser/promremotewrite: fall back to Snappy decoding if zstd decoding fails
This case is possible after the following steps:

1. vmagent tries to perform handshake with the -remoteWrite.url in order to determine whether
   the remote storage supports zstd-compressed data.
2. The remote storage is unavailable during the handshake. In this case vmagent falls back to Snappy compression
   for the data sent to the remote storage.
3. vmagent compresses the collected data into blocks with Snappy and puts these blocks to persistent queue on disk.
4. The remote storage becomes available.
5. vmagent restarts, performs the handshake with the remote storage and detects that it supports zstd-compressed data.
6. vmagent starts sending Snappy-compressed data from persistent queue to the remote storage,
   while falsely advertizing it sends zstd-compressed data.
7. The remote storage receives Snappy-compressed data and fails unpacking it with zstd.

The solution is to just fall back to Snappy decompression if zstd decompression fails.

Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5301
2023-11-13 21:25:39 +01:00
Aliaksandr Valialkin
356deada8c
lib/htmlcomponents: use relative links for the top page and for favicon.ico
This allows hiding VictoriaMetrics components behind proxies with arbitrary path prefixes.
For example, vmagent HTTP handlers can be served via /vmagent/ path prefix:

- http://proxy/vmagent/targets
- http://proxy/vmagent/service-discovery

The path prefix can be arbitrary. For example, below are vmagent urls
for /tenantID/vmagent/ path prefix:

- http://proxy/tenantID/vmagent/targets
- http://proxy/tenantID/vmagent/service-discovery

While at it, consistently serve favicon.ico from any path directory.

Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5306
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/5307
2023-11-13 20:28:17 +01:00
Aliaksandr Valialkin
fb2071a01e
lib/regexutil: properly handle alternate regexps surrounded by .+ or .*
Previously the following regexps were improperly handled:

  .+foo|bar.+
  .*foo|bar.*

This could lead to unexpected regexp match results.

See https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5297

Thanks to @Haleygo for the initial attempt to fix the issue at https://github.com/VictoriaMetrics/VictoriaMetrics/pull/5308
2023-11-13 18:25:57 +01:00
Aliaksandr Valialkin
2a3352c70e
docs/CHANGELOG.md: remove trailing whitespace after bffd30b57a 2023-11-13 09:47:36 +01:00
Aliaksandr Valialkin
b9aba7edfb
app/vmauth: properly pass Host header to backends
Previously the `Host` header was remained unchanged when passing it in requests to backends.
This may improperly work if the backend uses host-based routing.

While at it, allows http/2.0 requests to backends. While VictoriaMetrics components
do not accept http/2.0 requests, other backends can require such requests.

Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5240
2023-11-13 09:45:34 +01:00
Aliaksandr Valialkin
78bc816220
app/vmauth: follow-up for 323f3720ed
- Re-use identically configured http.Transport across multiple users.
  This fixes handling of the limit on the number of connection, which can be established per each backend
  via -maxIdleConnsPerBackend command-line flag. This limit stopped working after 323f3720ed

- Add docs about backend TLS setup at https://docs.victoriametrics.com/vmauth.html#backend-tls-setup

- Add ability to disable backend TLS verification for all the users via -backend.tlsInsecureSkipVerify command-line flag.
  This flag may be useful when -auth.config contains big number of users, and every user must disable backend TLS verification.

- Add ability to specify TLS Root CA via tls_ca_file option at per-user basis and via -backend.tlsCAFile command-line flag
  across all the users.

Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5240
2023-11-13 09:45:16 +01:00
Aliaksandr Valialkin
76384b6d28
app/vmauth: improve docs a bit after 323f3720ed
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5240
2023-11-13 09:44:25 +01:00
Aliaksandr Valialkin
d9ecc3f6d7
lib/logger: add -loggerMaxArgLen command-line flag for fine-tuning the maximum length of logged args 2023-11-13 09:43:49 +01:00
Aliaksandr Valialkin
c916294b61
app/vmselect/promql: optimize instant queries with min_over_time() and max_over_time() rollup functions
This is a follow-up for 41a0fdaf39
2023-11-13 09:43:18 +01:00
Aliaksandr Valialkin
7bbdecb79a
deployment: update Go builder from Go1.21.3 to Go1.21.4
See https://github.com/golang/go/issues?q=milestone%3AGo1.21.4+label%3ACherryPickApproved
2023-11-13 09:40:08 +01:00
Aliaksandr Valialkin
ed79f9806a
lib/blockcache: do not cache entries, which were attempted to be accessed 1 or 2 times
Previously entries which were accessed only 1 time weren't cached.
It has been appeared that some rarely executed heavy queries may read indexdb block twice
in a row instead of once. There is no need in caching such a block then.
This change should eliminate cache size spikes for indexdb/dataBlocks when such heavy queries are executed.

Expose -blockcache.missesBeforeCaching command-line flag, which can be used for fine-tuning
the number of cache misses needed before storing the block in the caching.
2023-11-13 09:38:57 +01:00
Zakhar Bessarab
7c7e0a5caa
docs/changelog: document vmbackupmanager bugfix (#5303)
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
2023-11-13 09:31:58 +01:00
Roman Khavronenko
becf7bf8df
app/vmalert: update remote-write process (#5284)
* app/vmalert: update remote-write process

* automatically retry remote-write requests on closed connections. The change should reduce the amount of logs produced in environments with short-living connections or environments without support of keep-alive on network balancers.
* increment `vmalert_remotewrite_errors_total` metric if all retries to send remote-write request failed. Before, this metric was incremented only if remote-write client's buffer is overloaded.
* increment `vmalert_remotewrite_dropped_rows_total` amd `vmalert_remotewrite_dropped_bytes_total` metrics if remote-write client's buffer is overloaded. Before, these metrics were incremented only after unsuccessful HTTP calls.

Signed-off-by: hagen1778 <roman@victoriametrics.com>

* Update docs/CHANGELOG.md

---------

Signed-off-by: hagen1778 <roman@victoriametrics.com>
Co-authored-by: Hui Wang <haley@victoriametrics.com>
2023-11-13 09:25:29 +01:00
Yury Molodov
d7c6153f68
vmui: display query error on Explore metrics page (#5272)
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5202

(cherry picked from commit f90d2ec843)
Signed-off-by: hagen1778 <roman@victoriametrics.com>
2023-11-03 16:25:21 +01:00
Zakhar Bessarab
dea4695df5
app/vmauth: add option to skip TLS verification (#5256)
Add `tls_insecure_skip_verify` option on per-user basis which allows to disable TLS verification for all requests to backend on behalf of this user.

See: https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5240

Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
(cherry picked from commit 323f3720ed)
2023-11-03 12:05:26 +01:00
Aliaksandr Valialkin
bf01a97f17
docs/CHANGELOG.md: update the description of the optimization for SLO/SLI-like queries according to latest changes
See commits 4497a08e3d and 92826b0b4a
2023-11-02 20:09:22 +01:00
Roman Khavronenko
4e8c762fd9
app/vmalert: add label file pointing to the group's filename to metrics (#5281)
The filename should help identifying alerting rules belonging to specific groups
with identical names but different filenames.

https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5267

Signed-off-by: hagen1778 <roman@victoriametrics.com>
(cherry picked from commit b5254199c6)
2023-11-02 16:02:29 +01:00
Hui Wang
44fcdf0cf0
vmalert: reduce restore query request for each alerting rule (#5265)
reduce the number of queries for restoring alerts state on start-up.
The change should speed up the restore process and reduce pressure on `remoteRead.url`.

(cherry picked from commit 90d45574bf)
2023-11-02 16:02:28 +01:00
Aliaksandr Valialkin
44227c0287
docs/CHANGELOG.md: typo fix: tis -> this 2023-11-02 08:33:48 +01:00
Aliaksandr Valialkin
c04e667f9d
docs/CHANGELOG.md: document v1.93.7 LTS release 2023-11-02 08:21:10 +01:00
Aliaksandr Valialkin
369d37749d
app/vmagent/remotewrite: add -remoteWrite.shardByURL.labels command-line flag
This command-line flag can be used for specifying a list of labels used for sharding
among -remoteWrite.url entries when -remoteWrite.shardByURL command-line flag is set.

Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4942
2023-11-01 23:09:08 +01:00
Alexander Marshalov
ffeec24811
vmauth: add browser authorization request for http requests without… (#5234)
* vmauth: add browser authorization request for http requests without credentials to a route that is not in the `unauthorized_user` section (when `unauthorized_user` is specified).

* add link to issue in CHANGELOG

* Extend vmauth docs

* wip

---------

Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
2023-11-01 21:00:52 +01:00
Aliaksandr Valialkin
6a98f9df54
app/vmui: show query execution duration in the header of query input field
This should simplify the process of query optimization
2023-11-01 16:46:42 +01:00
Hui Wang
4fafdda13e
vmalert: support specifying full http url in notifier static_configs target (#5261)
* vmalert: support specifying full http or https urls in notifier static_configs target address
* show right label results in ui
2023-11-01 16:44:54 +01:00
Aliaksandr Valialkin
28f0610e14
app/vmui: fix non-working Disable cache checkbox at JSON and Table views 2023-10-31 22:58:15 +01:00
Aliaksandr Valialkin
744f8c3fe7
app/vmselect/promql: add outliers_iqr(q) and outlier_iqr_over_time(m[d]) functions
These functions allow detecting anomalies in series and samples using Interquartile range method.
See Outliers section at https://en.wikipedia.org/wiki/Interquartile_range for more details.
2023-10-31 22:14:14 +01:00
Aliaksandr Valialkin
9661918bb4
app/vmselect/promql: optimize repeated SLI-like instant queries with lookbehind windows >= 1d
Repeated instant queries with long lookbehind windows, which contain one of the following rollup functions,
are optimized via partial result caching:

- sum_over_time()
- count_over_time()
- avg_over_time()
- increase()
- rate()

The basic idea of optimization is to calculate

  rf(m[d] @ t)

as

  rf(m[offset] @ t) + rf(m[d] @ (t-offset)) - rf(m[offset] @ (t-d))

where rf(m[d] @ (t-offset)) is cached query result, which was calculated previously

The offset may be in the range of up to 1 hour.
2023-10-31 20:08:38 +01:00
Aliaksandr Valialkin
5e7d495eb1
lib/httpserver: follow-up for 0638bbe69c
- Replace spaces with underscores in the `reason` label value for the vm_http_request_errors_total metric
  in order be consistent with Prometheus-like naming

- Clarify the description for the change at docs/CHANGELOG.md

Updates https://github.com/victoriaMetrics/victoriaMetrics/issues/4590
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/5166
2023-10-31 19:10:48 +01:00
Aliaksandr Valialkin
40a53b516d
docs/CHANGELOG.md: move the description for -http.header.* command-line flags from SECURITY to FEATURE
The SECURITY label should be applied only to changes, which fix security issues.
The change at ad839aa492 adds new command-line flags, which can be used
for improving security in some cases. They do not fix any security issues.

Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/5111
2023-10-31 19:05:01 +01:00
hagen1778
8c3bac8f40
dashboards/cluster: fix description about max threshold for Concurrent selects panel.
Before, it was mistakenly implying that `max` is equal to the double of available CPUs.

Addresses https://github.com/VictoriaMetrics/VictoriaMetrics/pull/5214

Signed-off-by: hagen1778 <roman@victoriametrics.com>
2023-10-31 19:03:21 +01:00
Roman Khavronenko
9d8f93050c
app/vmselect: expose vm_memory_intensive_queries_total counter metric (#5208)
The new metric gets increased each time `-search.logQueryMemoryUsage` memory limit
is exceeded by a query. This metric should help to identify expensive and heavy queries
without inspecting the logs.

Signed-off-by: hagen1778 <roman@victoriametrics.com>
2023-10-31 19:02:22 +01:00
hagen1778
f9c7822588
docs: follow-up for 0638bbe69c
0638bbe69c
Signed-off-by: hagen1778 <roman@victoriametrics.com>
(cherry picked from commit a8051d48c4)
2023-10-31 12:54:57 +01:00
hagen1778
9debdb497c
dashboards/vmalert: add new panel Missed evaluations
The new panel supposed to indicate alerting groups that miss their evaluations.

Signed-off-by: hagen1778 <roman@victoriametrics.com>
(cherry picked from commit aaf9e3d526)
2023-10-31 10:35:57 +01:00
hagen1778
659171686c
deployment/alerts: add TooManyMissedIterations alerting rule
The new rule for vmalert supposed to detect groups that miss their
evaulations due to slow queries.

Signed-off-by: hagen1778 <roman@victoriametrics.com>
(cherry picked from commit 9866974a53)
2023-10-31 10:35:57 +01:00
hagen1778
497c708aaa
dashboards: fix Errors rate to Alertmanager filter
The panel `Errors rate to Alertmanager` had `group` label filter
applied to the expression, while the metric `vmalert_alerts_send_errors_total`
doesn't have that label. This resulted into always empty results.

Signed-off-by: hagen1778 <roman@victoriametrics.com>
(cherry picked from commit 8874b525b7)
2023-10-31 10:35:57 +01:00
Hui Wang
8a786e5df4
vmalert: fix alert firing state in replay mode (#5192)
fix possible missing firing states for alerting rules in replay mode
Before if one firing stage is bigger than single query request range, like rule with a big `for`, alerting rule won't able to be detected as firing.

Co-authored-by: hagen1778 <roman@victoriametrics.com>
(cherry picked from commit abcb21aa5e)
2023-10-30 13:55:48 +01:00
Dima Lazerka
ed8fc04898
lib/httpserver: add flags to specify HSTS / Frame-Options / CSP headers for httpserver (#5111)
support `Strict-Transport-Security`, `Content-Security-Policy` and `X-Frame-Options`
HTTP headers in all VictoriaMetrics components.
The values for headers can be specified by users via the following flags:
`-http.header.hsts`, `-http.header.csp` and `-http.header.frameOptions`.

Co-authored-by: hagen1778 <roman@victoriametrics.com>

(cherry picked from commit ad839aa492)
Signed-off-by: hagen1778 <roman@victoriametrics.com>
2023-10-30 11:41:38 +01:00
Roman Khavronenko
733b73ffed
lib/storage: log warning about RO mode only on state change (#5191)
Before, vmstorage would log the same message each second producing excessive
amount of logs.

See https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5159

Signed-off-by: hagen1778 <roman@victoriametrics.com>
(cherry picked from commit 29cebd82fb)
2023-10-30 11:29:49 +01:00
Aliaksandr Valialkin
20aeb8b65d
lib/promscrape/discovery/kubernetes: stop all the url watchers, which belong to a particular groupWatcher, at once
Previously url watchers for pod, service and node objects could be mistakenly closed
when service discovery was set up only for endpoints and endpointslice roles,
since watchers for these roles may start start pod, service and node url watchers
with nil apiWatcher passed to groupWatcher.startWatchersForRole().

Now all the url watchers, which belong to a particular groupWatcher, are stopped at once
when this groupWatcher has no apiWatcher subscribers.

Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5216

The issue has been introduced in v1.93.5 when addressing https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4850
2023-10-27 14:34:25 +02:00
Hui Wang
69f4a58f76
do not print redundant error logs when failed to scrape consul or no… (#5239)
* do not print redundant error logs when failed to scrape consul or nomad target
prometheus performs the same because it uses consul lib which just drops the error(1806bcb38c/api/api.go (L1134))
2023-10-27 14:18:47 +02:00
Aliaksandr Valialkin
f03e81c693
lib/promauth: follow-up for e16d3f5639
- Make sure that invalid/missing TLS CA file or TLS client certificate files at vmagent startup
  don't prevent from processing the corresponding scrape targets after the file becomes correct,
  without the need to restart vmagent.
  Previously scrape targets with invalid TLS CA file or TLS client certificate files
  were permanently dropped after the first attempt to initialize them, and they didn't
  appear until the next vmagent reload or the next change in other places of the loaded scrape configs.

- Make sure that TLS CA is properly re-loaded from file after it changes without the need to restart vmagent.
  Previously the old TLS CA was used until vmagent restart.

- Properly handle errors during http request creation for the second attempt to send data to remote system
  at vmagent and vmalert. Previously failed request creation could result in nil pointer dereferencing,
  since the returned request is nil on error.

- Add more context to the logged error during AWS sigv4 request signing before sending the data to -remoteWrite.url at vmagent.
  Previously it could miss details on the source of the request.

- Do not create a new HTTP client per second when generating OAuth2 token needed to put in Authorization header
  of every http request issued by vmagent during service discovery or target scraping.
  Re-use the HTTP client instead until the corresponding scrape config changes.

- Cache error at lib/promauth.Config.GetAuthHeader() in the same way as the auth header is cached,
  e.g. the error is cached for a second now. This should reduce load on CPU and OAuth2 server
  when auth header cannot be obtained because of temporary error.

- Share tls.Config.GetClientCertificate function among multiple scrape targets with the same tls_config.
  Cache the loaded certificate and the error for one second. This should significantly reduce CPU load
  when scraping big number of targets with the same tls_config.

- Allow loading TLS certificates from HTTP and HTTPs urls by specifying these urls at `tls_config->cert_file` and `tls_config->key_file`.

- Improve test coverage at lib/promauth

- Skip unreachable or invalid files specified at `scrape_config_files` during vmagent startup, since these files may become valid later.
  Previously vmagent was exitting in this case.

Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4959
2023-10-26 09:55:47 +02:00
Aliaksandr Valialkin
02684a0b29
lib/promauth: properly parse string contents for ca, cert and key fields at tls_config
Previously yaml parser wasn't accepting string values for these fields,
because it was mistakenly expecting a list of uint8 values instead.
2023-10-26 09:54:18 +02:00
Alexander Marshalov
cf42a080af
lib/streamaggr: respect streamAgg.dropInput with empty stream aggr config (#5213)
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5207
2023-10-26 09:30:12 +02:00
Roman Khavronenko
cd2247b24a
app/vmselect: limit the number of parallel workers by 32 (#5195)
* app/vmselect: limit the number of parallel workers by 32

The change should improve performance and memory usage during query processing
on machines with big number of CPU cores. The number of parallel workers for
query processing is controlled via `-search.maxWorkersPerQuery` command-line flag.
By default, the number of workers is limited by the number of available CPU cores,
but not more than 32. The limit can be increased via `-search.maxWorkersPerQuery`.

Signed-off-by: hagen1778 <roman@victoriametrics.com>

* wip

- The `-search.maxWorkersPerQuery` command-line flag doesn't limit resource usage,
  so move it from the `resource usage limits` to `troubleshooting` chapter at docs/Single-server-VictoriaMetrics.md

- Make more clear the description for the `-search.maxWorkersPerQuery` command-line flag

- Add the description of `-search.maxWorkersPerQuery` to docs/Cluster-VictoriaMetrics.md

- Limit the maximum value, which can be passed to `-search.maxWorkersPerQuery`, to GOMAXPROCS,
  because bigger values may worsen query performance and increase CPU usage

- Improve the the description of the change at docs/CHANGELOG.md. Mark it as FEATURE instead of BUGFIX,
  since it is closer to a feature than to a bugfix.

Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5087

---------

Signed-off-by: hagen1778 <roman@victoriametrics.com>
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
2023-10-26 09:15:27 +02:00
hagen1778
afab547821
lib/storage: follow-up after 188cfe3a85
188cfe3a85

See https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5159

Signed-off-by: hagen1778 <roman@victoriametrics.com>
2023-10-26 09:04:49 +02:00
Hui Wang
d7dd7614eb
fix inconsistent behaviors with prometheus when scraping (#5153)
* fix inconsistent behaviors with prometheus when scraping

1. address https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4959. skip job with wrong syntax in `scrape_configs` with error logs instead of exiting;
2. show error messages on vmagent /targets ui if there are wrong auth configs in `scrape_configs`, previously will print error logs and do scrape without auth header;
3. don't send requests if there are wrong auth configs in:
    1. vmagent remoteWrite;
    2. vmalert datasource/remoteRead/remoteWrite/notifier.

* add changelogs

* address review comments

* fix ut
2023-10-26 08:56:54 +02:00
hagen1778
6c63ca18f5
app/vmalert: follow-up after c9375cac5e
c9375cac5e

Descriptions were updated in attempt to make it more clear for readers,
re-phrasing and linking missing docs.

`eval_delay` was added to tests to verify it can be unmarshalled.

`eval_delay` is now applied before timestamp alignment to make it more predictable.
Before, if delay < interval the timestamp won't be aligned.

`eval_delay` and `eval_offset` was added to API output.

`PreviouslySentSeriesToRW` converted to private `previouslySentSeriesToRW`.

Signed-off-by: hagen1778 <roman@victoriametrics.com>
(cherry picked from commit a216fe6728)
2023-10-25 14:39:49 +02:00
Hui Wang
86d861ec55
vmalert: add -rule.evalDelay flag and eval_delay as group attribute (#5185)
Also mark `-datasource.lookback` as will be deprecated, see https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5155.

(cherry picked from commit c9375cac5e)
2023-10-25 14:39:49 +02:00
hagen1778
d349d6a9ce
deployment/alerts: make TooHighMemoryUsage more tolerable to spikes
Using `min_over_time` should reduce the amount of false positives when
component is running in near-the-threshold state. Now it should trigger
only if all collected samples were above the threshold on 10m interval.

Signed-off-by: hagen1778 <roman@victoriametrics.com>
(cherry picked from commit 003ef3a518)
2023-10-25 14:39:48 +02:00
hagen1778
46770409d9
dashboards/vmalert: respect job and instance filters in No data errors
Signed-off-by: hagen1778 <roman@victoriametrics.com>
(cherry picked from commit c2d252c045)
2023-10-17 10:26:32 +02:00