Commit graph

7310 commits

Author SHA1 Message Date
hagen1778
659171686c
deployment/alerts: add TooManyMissedIterations alerting rule
The new rule for vmalert supposed to detect groups that miss their
evaulations due to slow queries.

Signed-off-by: hagen1778 <roman@victoriametrics.com>
(cherry picked from commit 9866974a53)
2023-10-31 10:35:57 +01:00
hagen1778
497c708aaa
dashboards: fix Errors rate to Alertmanager filter
The panel `Errors rate to Alertmanager` had `group` label filter
applied to the expression, while the metric `vmalert_alerts_send_errors_total`
doesn't have that label. This resulted into always empty results.

Signed-off-by: hagen1778 <roman@victoriametrics.com>
(cherry picked from commit 8874b525b7)
2023-10-31 10:35:57 +01:00
Roman Khavronenko
afd160f1dc
docs: mention information loss when downsampling gauges (#5204)
Signed-off-by: hagen1778 <roman@victoriametrics.com>

(cherry picked from commit 23369321f1)
Signed-off-by: hagen1778 <roman@victoriametrics.com>
2023-10-30 19:05:39 +01:00
Roman Khavronenko
c6d74108fb
docs: explain motivation behind having -downsampling.period on vmselect (#5205)
Signed-off-by: hagen1778 <roman@victoriametrics.com>
2023-10-30 15:29:35 +01:00
Hui Wang
8a786e5df4
vmalert: fix alert firing state in replay mode (#5192)
fix possible missing firing states for alerting rules in replay mode
Before if one firing stage is bigger than single query request range, like rule with a big `for`, alerting rule won't able to be detected as firing.

Co-authored-by: hagen1778 <roman@victoriametrics.com>
(cherry picked from commit abcb21aa5e)
2023-10-30 13:55:48 +01:00
hagen1778
f0d10e2004
docs/troubleshooting: mention issue with un-ordered labels
See https://github.com/VictoriaMetrics/VictoriaMetrics/pull/5219#issuecomment-1773441711

Signed-off-by: hagen1778 <roman@victoriametrics.com>
(cherry picked from commit e964df8039)
2023-10-30 13:55:45 +01:00
hagen1778
14b1997659
docs: rm mention of default values for security HTTP headers
The headers, their corresponding flags are mentioned at
https://docs.victoriametrics.com/#security

Signed-off-by: hagen1778 <roman@victoriametrics.com>
(cherry picked from commit a64b37cf24)
2023-10-30 11:46:49 +01:00
Dima Lazerka
ed8fc04898
lib/httpserver: add flags to specify HSTS / Frame-Options / CSP headers for httpserver (#5111)
support `Strict-Transport-Security`, `Content-Security-Policy` and `X-Frame-Options`
HTTP headers in all VictoriaMetrics components.
The values for headers can be specified by users via the following flags:
`-http.header.hsts`, `-http.header.csp` and `-http.header.frameOptions`.

Co-authored-by: hagen1778 <roman@victoriametrics.com>

(cherry picked from commit ad839aa492)
Signed-off-by: hagen1778 <roman@victoriametrics.com>
2023-10-30 11:41:38 +01:00
Roman Khavronenko
733b73ffed
lib/storage: log warning about RO mode only on state change (#5191)
Before, vmstorage would log the same message each second producing excessive
amount of logs.

See https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5159

Signed-off-by: hagen1778 <roman@victoriametrics.com>
(cherry picked from commit 29cebd82fb)
2023-10-30 11:29:49 +01:00
Aliaksandr Valialkin
a66c261b55
app/vmui: change the order of tables at Top queries tab
Move the most interesting table - queries with the most summary time to execute - to the top
2023-10-28 11:57:08 +02:00
Aliaksandr Valialkin
15dda54e79
lib/promscrape/discovery/kubernetes: propagate possible errors at newAPIWatcher() to the caller
This allows substituting FATAL panics with recoverable runtime errors such as missing or invalid TLS CA file
and/or missing/invalid /var/run/secrets/kubernetes.io/serviceaccount/namespace file.
Now these errors are logged instead of PANIC'ing, so they can be fixed by updating the corresponding files
without the need to restart vmagent.

This is a follow-up for 90427abc65
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/5243
2023-10-27 20:27:58 +02:00
Hui Wang
a37125d043
lib/promscrape/discovery/kubernetes: avoid possible panic if given caFile under kubernetes.SDConfig.HTTPClientConfig is not exist (#5243)
follow up d5a599badc
2023-10-27 20:27:58 +02:00
Aliaksandr Valialkin
20aeb8b65d
lib/promscrape/discovery/kubernetes: stop all the url watchers, which belong to a particular groupWatcher, at once
Previously url watchers for pod, service and node objects could be mistakenly closed
when service discovery was set up only for endpoints and endpointslice roles,
since watchers for these roles may start start pod, service and node url watchers
with nil apiWatcher passed to groupWatcher.startWatchersForRole().

Now all the url watchers, which belong to a particular groupWatcher, are stopped at once
when this groupWatcher has no apiWatcher subscribers.

Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5216

The issue has been introduced in v1.93.5 when addressing https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4850
2023-10-27 14:34:25 +02:00
Hui Wang
69f4a58f76
do not print redundant error logs when failed to scrape consul or no… (#5239)
* do not print redundant error logs when failed to scrape consul or nomad target
prometheus performs the same because it uses consul lib which just drops the error(1806bcb38c/api/api.go (L1134))
2023-10-27 14:18:47 +02:00
hagen1778
ddedeb1d42
app/vmalert: remove unclear comment
The timestamp alignment should be applied as a last step
to keep the timestamp consistent.

Signed-off-by: hagen1778 <roman@victoriametrics.com>
2023-10-27 14:09:01 +02:00
Daria Karavaieva
076a796061
model list - isolation forest (#5235)
* model list - isolation forest

* curse of dimensionality

* isol forest definition change, minor fixes

* blank line fix
2023-10-27 14:08:30 +02:00
Aliaksandr Valialkin
8fbe5a0893
lib/promscrape: do not add a suggestion for enabling TCP6 in error message when the dial address is TCPv4 2023-10-27 14:06:49 +02:00
Dima Lazerka
1e48ad486e
Revert "lib/promscrape: do not add a suggestion for enabling TCP6 in error message when the dial address is TCPv4"
It broke CI (lint)

This reverts commit 5464376d16.
2023-10-27 14:06:31 +02:00
Aliaksandr Valialkin
46dd504d81
lib/promscrape: do not add a suggestion for enabling TCP6 in error message when the dial address is TCPv4 2023-10-26 09:56:55 +02:00
Aliaksandr Valialkin
af6dc9c963
lib/promscrape: properly track the number of updated service discovery routines inside Config.mustRestart()
This is a follow-up for d5a599badc
2023-10-26 09:56:36 +02:00
Aliaksandr Valialkin
b3bcf99cea
lib/promauth: typo fix in the error message after d5a599badc: obtaine -> obtain 2023-10-26 09:56:20 +02:00
Aliaksandr Valialkin
f03e81c693
lib/promauth: follow-up for e16d3f5639
- Make sure that invalid/missing TLS CA file or TLS client certificate files at vmagent startup
  don't prevent from processing the corresponding scrape targets after the file becomes correct,
  without the need to restart vmagent.
  Previously scrape targets with invalid TLS CA file or TLS client certificate files
  were permanently dropped after the first attempt to initialize them, and they didn't
  appear until the next vmagent reload or the next change in other places of the loaded scrape configs.

- Make sure that TLS CA is properly re-loaded from file after it changes without the need to restart vmagent.
  Previously the old TLS CA was used until vmagent restart.

- Properly handle errors during http request creation for the second attempt to send data to remote system
  at vmagent and vmalert. Previously failed request creation could result in nil pointer dereferencing,
  since the returned request is nil on error.

- Add more context to the logged error during AWS sigv4 request signing before sending the data to -remoteWrite.url at vmagent.
  Previously it could miss details on the source of the request.

- Do not create a new HTTP client per second when generating OAuth2 token needed to put in Authorization header
  of every http request issued by vmagent during service discovery or target scraping.
  Re-use the HTTP client instead until the corresponding scrape config changes.

- Cache error at lib/promauth.Config.GetAuthHeader() in the same way as the auth header is cached,
  e.g. the error is cached for a second now. This should reduce load on CPU and OAuth2 server
  when auth header cannot be obtained because of temporary error.

- Share tls.Config.GetClientCertificate function among multiple scrape targets with the same tls_config.
  Cache the loaded certificate and the error for one second. This should significantly reduce CPU load
  when scraping big number of targets with the same tls_config.

- Allow loading TLS certificates from HTTP and HTTPs urls by specifying these urls at `tls_config->cert_file` and `tls_config->key_file`.

- Improve test coverage at lib/promauth

- Skip unreachable or invalid files specified at `scrape_config_files` during vmagent startup, since these files may become valid later.
  Previously vmagent was exitting in this case.

Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4959
2023-10-26 09:55:47 +02:00
Aliaksandr Valialkin
8c9e3b7b50
lib/promscrape/discovery/kubernetes/kubeconfig_test.go: make TestParseKubeConfigSuccess test code easier to follow 2023-10-26 09:54:40 +02:00
Aliaksandr Valialkin
02684a0b29
lib/promauth: properly parse string contents for ca, cert and key fields at tls_config
Previously yaml parser wasn't accepting string values for these fields,
because it was mistakenly expecting a list of uint8 values instead.
2023-10-26 09:54:18 +02:00
Aliaksandr Valialkin
194deeea1b
lib/promscrape: move duplicate code from functions, which collect ScrapeWork lists for distinct SD types into Config.getScrapeWorkGeneric()
This removes more than 200 lines of duplicate code
2023-10-26 09:53:59 +02:00
Aliaksandr Valialkin
19940b5629
app/vmalert/config: fix flacky test TestParseBad
It could return either `failed to read` or `failed to parse` errors depending
on whether the given url can be loaded or not under the current environment
2023-10-26 09:53:40 +02:00
Aliaksandr Valialkin
36a1fdca6c
all: consistently use %w instead of %s in when error is passed to fmt.Errorf()
This allows consistently using errors.Is() for verifying whether the given error wraps some other known error.
2023-10-26 09:44:40 +02:00
Aliaksandr Valialkin
dc6a718a52
lib/workingsetcache: fix outdated comments for Load() and New() functions 2023-10-26 09:43:04 +02:00
Artem Navoiev
3b00e7b599
github-actions: build search index during the doc sync (#5224)
* github-actions: build search index during the doc sync

Signed-off-by: Artem Navoiev <tenmozes@gmail.com>

---------

Signed-off-by: Artem Navoiev <tenmozes@gmail.com>
2023-10-26 09:40:46 +02:00
Alexander Marshalov
cf42a080af
lib/streamaggr: respect streamAgg.dropInput with empty stream aggr config (#5213)
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5207
2023-10-26 09:30:12 +02:00
krakazyabra
c85ff74ac3
docs/case-studies: update Wedos info (#5211) 2023-10-26 09:26:31 +02:00
Aliaksandr Valialkin
94e061087f
docs: use https://github.com/VictoriaMetrics/VictoriaMetrics/releases/latest instead of https://github.com/VictoriaMetrics/VictoriaMetrics/releases link where needed
The https://github.com/VictoriaMetrics/VictoriaMetrics/releases link may show non-latest
releases at the top, such as LTS releases or VictoriaLogs releases.
So it is better to use https://github.com/VictoriaMetrics/VictoriaMetrics/releases/latest link,
which always redirect to the latest available release of VictoriaMetrics.
2023-10-26 09:23:17 +02:00
Aliaksandr Valialkin
190beb79d3
lib/storage: fix test TestStorageSeriesAreNotCreatedOnStaleMarkers 2023-10-26 09:22:15 +02:00
Roman Khavronenko
cd2247b24a
app/vmselect: limit the number of parallel workers by 32 (#5195)
* app/vmselect: limit the number of parallel workers by 32

The change should improve performance and memory usage during query processing
on machines with big number of CPU cores. The number of parallel workers for
query processing is controlled via `-search.maxWorkersPerQuery` command-line flag.
By default, the number of workers is limited by the number of available CPU cores,
but not more than 32. The limit can be increased via `-search.maxWorkersPerQuery`.

Signed-off-by: hagen1778 <roman@victoriametrics.com>

* wip

- The `-search.maxWorkersPerQuery` command-line flag doesn't limit resource usage,
  so move it from the `resource usage limits` to `troubleshooting` chapter at docs/Single-server-VictoriaMetrics.md

- Make more clear the description for the `-search.maxWorkersPerQuery` command-line flag

- Add the description of `-search.maxWorkersPerQuery` to docs/Cluster-VictoriaMetrics.md

- Limit the maximum value, which can be passed to `-search.maxWorkersPerQuery`, to GOMAXPROCS,
  because bigger values may worsen query performance and increase CPU usage

- Improve the the description of the change at docs/CHANGELOG.md. Mark it as FEATURE instead of BUGFIX,
  since it is closer to a feature than to a bugfix.

Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5087

---------

Signed-off-by: hagen1778 <roman@victoriametrics.com>
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
2023-10-26 09:15:27 +02:00
Yury Molodov
45501eccab
vmui: update dependencies (#5194) 2023-10-26 09:08:05 +02:00
Aliaksandr Valialkin
76606eed22
docs/Articles.md: add an article https://rtfm.co.ua/en/victoriametrics-vmauth-proxy-authentication-and-authorization/ 2023-10-26 09:06:54 +02:00
Aliaksandr Valialkin
52d7464095
docs/FAQ.md: add questions on why VictoriaMetrics doesnt rebalance data and doesnt restore replication factor between vmstorage nodes 2023-10-26 09:06:25 +02:00
Aliaksandr Valialkin
d3cab41426
docs/FAQ.md: refresh the answer to the question about how does VictoriaMetrics compare to competing solutions
- Mention Grafana Mimir
- Fix broken links
2023-10-26 09:06:01 +02:00
Github Actions
a35ae0716e
Automatic update operator docs from VictoriaMetrics/operator@2c826bb (#5188) 2023-10-26 09:05:33 +02:00
hagen1778
afab547821
lib/storage: follow-up after 188cfe3a85
188cfe3a85

See https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5159

Signed-off-by: hagen1778 <roman@victoriametrics.com>
2023-10-26 09:04:49 +02:00
Ilya Trefilov
1fd3385965
lib/storage: do not create tsid if metric contains stale marker(#5069) (#5174)
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5069
2023-10-26 09:03:19 +02:00
Hui Wang
d7dd7614eb
fix inconsistent behaviors with prometheus when scraping (#5153)
* fix inconsistent behaviors with prometheus when scraping

1. address https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4959. skip job with wrong syntax in `scrape_configs` with error logs instead of exiting;
2. show error messages on vmagent /targets ui if there are wrong auth configs in `scrape_configs`, previously will print error logs and do scrape without auth header;
3. don't send requests if there are wrong auth configs in:
    1. vmagent remoteWrite;
    2. vmalert datasource/remoteRead/remoteWrite/notifier.

* add changelogs

* address review comments

* fix ut
2023-10-26 08:56:54 +02:00
hagen1778
f00729ee24
app/vmalert: fix typo in tests
Signed-off-by: hagen1778 <roman@victoriametrics.com>
(cherry picked from commit c07909a20b)
2023-10-26 08:55:20 +02:00
hagen1778
cf541c757a
app/vmalert: fix tests after a216fe6728
a216fe6728
Signed-off-by: hagen1778 <roman@victoriametrics.com>
(cherry picked from commit eed0c3c6b0)
2023-10-26 08:55:06 +02:00
Hui Wang
855c25b6c4
remove vmalert-tool code from branch cluster (#5229)
Follow up 130e0ea5f0.
vmalert-tool can't be easily adapted for vmcluster now, cause it needs to set up the whole vmcluster[vminsert+vmstorage+vmselect] first.

You can use vmalert-tool to run unit tests for alerting and recording rules.
It will perform the following actions:

- sets up an isolated VictoriaMetrics instance;
- simulates the periodic ingestion of time series;
- queries the ingested data for recording and alerting rules evaluation like vmalert;

But component packages have functions that not exported and variables with same name, so to implement this for cluster will need amount of code refactor and doesn't look like a good thing to themselves.
So I want to remove it from the cluster branch.
2023-10-25 14:48:11 +02:00
hagen1778
6c63ca18f5
app/vmalert: follow-up after c9375cac5e
c9375cac5e

Descriptions were updated in attempt to make it more clear for readers,
re-phrasing and linking missing docs.

`eval_delay` was added to tests to verify it can be unmarshalled.

`eval_delay` is now applied before timestamp alignment to make it more predictable.
Before, if delay < interval the timestamp won't be aligned.

`eval_delay` and `eval_offset` was added to API output.

`PreviouslySentSeriesToRW` converted to private `previouslySentSeriesToRW`.

Signed-off-by: hagen1778 <roman@victoriametrics.com>
(cherry picked from commit a216fe6728)
2023-10-25 14:39:49 +02:00
Hui Wang
86d861ec55
vmalert: add -rule.evalDelay flag and eval_delay as group attribute (#5185)
Also mark `-datasource.lookback` as will be deprecated, see https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5155.

(cherry picked from commit c9375cac5e)
2023-10-25 14:39:49 +02:00
hagen1778
d9118cdaab
deployment/alerts: update TooHighMemoryUsage annotation
The memory usage isn't measured on 5m interval anymore.

Signed-off-by: hagen1778 <roman@victoriametrics.com>
(cherry picked from commit 4e0a779efe)
2023-10-25 14:39:48 +02:00
hagen1778
d349d6a9ce
deployment/alerts: make TooHighMemoryUsage more tolerable to spikes
Using `min_over_time` should reduce the amount of false positives when
component is running in near-the-threshold state. Now it should trigger
only if all collected samples were above the threshold on 10m interval.

Signed-off-by: hagen1778 <roman@victoriametrics.com>
(cherry picked from commit 003ef3a518)
2023-10-25 14:39:48 +02:00
hagen1778
361a3dcc0c
deployment/alerts: make RemoteWriteConnectionIsSaturated expr readable
Signed-off-by: hagen1778 <roman@victoriametrics.com>
(cherry picked from commit 685f9c3c98)
2023-10-25 14:39:47 +02:00