Commit graph

631 commits

Author SHA1 Message Date
Aliaksandr Valialkin
613b545dfd
lib/promscrape/discovery/kubernetes: propagate possible errors at newAPIWatcher() to the caller
This allows substituting FATAL panics with recoverable runtime errors such as missing or invalid TLS CA file
and/or missing/invalid /var/run/secrets/kubernetes.io/serviceaccount/namespace file.
Now these errors are logged instead of PANIC'ing, so they can be fixed by updating the corresponding files
without the need to restart vmagent.

This is a follow-up for 90427abc65
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/5243
2023-10-27 20:24:46 +02:00
Hui Wang
90427abc65
lib/promscrape/discovery/kubernetes: avoid possible panic if given caFile under kubernetes.SDConfig.HTTPClientConfig is not exist (#5243)
follow up d5a599badc
2023-10-27 20:20:22 +02:00
Aliaksandr Valialkin
632d788b63
lib/promscrape/discovery/kubernetes: stop all the url watchers, which belong to a particular groupWatcher, at once
Previously url watchers for pod, service and node objects could be mistakenly closed
when service discovery was set up only for endpoints and endpointslice roles,
since watchers for these roles may start start pod, service and node url watchers
with nil apiWatcher passed to groupWatcher.startWatchersForRole().

Now all the url watchers, which belong to a particular groupWatcher, are stopped at once
when this groupWatcher has no apiWatcher subscribers.

Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5216

The issue has been introduced in v1.93.5 when addressing https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4850
2023-10-27 13:51:35 +02:00
Hui Wang
7c90ce39cb
do not print redundant error logs when failed to scrape consul or no… (#5239)
* do not print redundant error logs when failed to scrape consul or nomad target
prometheus performs the same because it uses consul lib which just drops the error(1806bcb38c/api/api.go (L1134))
2023-10-27 13:31:55 +08:00
Aliaksandr Valialkin
cdbc06a639
lib/promscrape: do not add a suggestion for enabling TCP6 in error message when the dial address is TCPv4 2023-10-25 17:57:56 -07:00
Dima Lazerka
8b41b506c2
Revert "lib/promscrape: do not add a suggestion for enabling TCP6 in error message when the dial address is TCPv4"
It broke CI (lint)

This reverts commit 5464376d16.
2023-10-25 16:24:31 -07:00
Aliaksandr Valialkin
5464376d16
lib/promscrape: do not add a suggestion for enabling TCP6 in error message when the dial address is TCPv4 2023-10-26 00:29:51 +02:00
Aliaksandr Valialkin
ac933cc423
lib/promscrape: properly track the number of updated service discovery routines inside Config.mustRestart()
This is a follow-up for d5a599badc
2023-10-26 00:06:29 +02:00
Aliaksandr Valialkin
d5a599badc
lib/promauth: follow-up for e16d3f5639
- Make sure that invalid/missing TLS CA file or TLS client certificate files at vmagent startup
  don't prevent from processing the corresponding scrape targets after the file becomes correct,
  without the need to restart vmagent.
  Previously scrape targets with invalid TLS CA file or TLS client certificate files
  were permanently dropped after the first attempt to initialize them, and they didn't
  appear until the next vmagent reload or the next change in other places of the loaded scrape configs.

- Make sure that TLS CA is properly re-loaded from file after it changes without the need to restart vmagent.
  Previously the old TLS CA was used until vmagent restart.

- Properly handle errors during http request creation for the second attempt to send data to remote system
  at vmagent and vmalert. Previously failed request creation could result in nil pointer dereferencing,
  since the returned request is nil on error.

- Add more context to the logged error during AWS sigv4 request signing before sending the data to -remoteWrite.url at vmagent.
  Previously it could miss details on the source of the request.

- Do not create a new HTTP client per second when generating OAuth2 token needed to put in Authorization header
  of every http request issued by vmagent during service discovery or target scraping.
  Re-use the HTTP client instead until the corresponding scrape config changes.

- Cache error at lib/promauth.Config.GetAuthHeader() in the same way as the auth header is cached,
  e.g. the error is cached for a second now. This should reduce load on CPU and OAuth2 server
  when auth header cannot be obtained because of temporary error.

- Share tls.Config.GetClientCertificate function among multiple scrape targets with the same tls_config.
  Cache the loaded certificate and the error for one second. This should significantly reduce CPU load
  when scraping big number of targets with the same tls_config.

- Allow loading TLS certificates from HTTP and HTTPs urls by specifying these urls at `tls_config->cert_file` and `tls_config->key_file`.

- Improve test coverage at lib/promauth

- Skip unreachable or invalid files specified at `scrape_config_files` during vmagent startup, since these files may become valid later.
  Previously vmagent was exitting in this case.

Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4959
2023-10-25 23:19:37 +02:00
Aliaksandr Valialkin
c22e3e7b1d
lib/promscrape/discovery/kubernetes/kubeconfig_test.go: make TestParseKubeConfigSuccess test code easier to follow 2023-10-25 23:17:18 +02:00
Aliaksandr Valialkin
eed5206376
lib/promauth: properly parse string contents for ca, cert and key fields at tls_config
Previously yaml parser wasn't accepting string values for these fields,
because it was mistakenly expecting a list of uint8 values instead.
2023-10-25 23:12:21 +02:00
Aliaksandr Valialkin
4afcb2a689
lib/promscrape: move duplicate code from functions, which collect ScrapeWork lists for distinct SD types into Config.getScrapeWorkGeneric()
This removes more than 200 lines of duplicate code
2023-10-25 23:03:40 +02:00
Aliaksandr Valialkin
42dd71bb63
all: consistently use %w instead of %s in when error is passed to fmt.Errorf()
This allows consistently using errors.Is() for verifying whether the given error wraps some other known error.
2023-10-25 21:24:03 +02:00
Hui Wang
e16d3f5639
fix inconsistent behaviors with prometheus when scraping (#5153)
* fix inconsistent behaviors with prometheus when scraping

1. address https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4959. skip job with wrong syntax in `scrape_configs` with error logs instead of exiting;
2. show error messages on vmagent /targets ui if there are wrong auth configs in `scrape_configs`, previously will print error logs and do scrape without auth header;
3. don't send requests if there are wrong auth configs in:
    1. vmagent remoteWrite;
    2. vmalert datasource/remoteRead/remoteWrite/notifier.

* add changelogs

* address review comments

* fix ut
2023-10-17 17:58:19 +08:00
Aliaksandr Valialkin
b8c267075e
lib/promscrape: add a link to https://docs.victoriametrics.com/vmagent.html#scraping-big-number-of-targets in descriptions for -promscrape.cluster.* command-line flags
This should help users figuring out the purpose of -promscrape.cluster.* command-line flags
2023-10-16 14:46:22 +02:00
Roman Khavronenko
a4bd73ec7e
lib/promscrape: make concurrency control optional (#5073)
* lib/promscrape: make concurrency control optional

Before, `-maxConcurrentInserts` was limiting all calls to `promscrape.Parse`
function: during ingestion and scraping. This behavior is incorrect.
Cmd-line flag `-maxConcurrentInserts` should have effect onl on ingestion.

Since both pipelines use the same `promscrape.Parse` function, we extend it
to make concurrency limiter optional. So caller can decide whether concurrency
should be limited or not.

This commit makes c53b5788b4
obsolete.

Signed-off-by: hagen1778 <roman@victoriametrics.com>

* Revert "dashboards: move `Concurrent inserts` panel to Troubleshooting section"

This reverts commit c53b5788b4.

---------

Signed-off-by: hagen1778 <roman@victoriametrics.com>
2023-10-02 21:32:11 +02:00
Aliaksandr Valialkin
859977d591
Revert "lib/promscrape: add metric vm_promscrape_scrapes_skipped_total (#5074)"
This reverts commit 74301cdbf5.

Reason for revert:

vmagent already provides better approach for detecting slow scrape targets via the following query:

    scrape_duration_seconds / scrape_timeout_seconds > 1

This query depends on automatically generated per-target metrics.
See https://docs.victoriametrics.com/vmagent.html#automatically-generated-metrics for more details.

Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/5074
2023-10-02 20:59:56 +02:00
Roman Khavronenko
74301cdbf5
lib/promscrape: add metric vm_promscrape_scrapes_skipped_total (#5074)
* lib/promscrape: add metric `vm_promscrape_scrapes_skipped_total`

add metric `vm_promscrape_scrapes_skipped_total`to show whether vmagent skips the scrapes.
This could happen if vmagent is overloaded or target is responding too slow for configured `scrape_interval`.

The follow-up commit should add a corresponding alerting rule and panel to vmagent dashboard.

Signed-off-by: hagen1778 <roman@victoriametrics.com>

* deployment/docker: add `TooManyScrapeSkips` alerting rule for vmagent

Signed-off-by: hagen1778 <roman@victoriametrics.com>

* dashboards: add panels `Scrape duration 0.99 quantile` and `Skipped scrapes` to vmagent dashboard

Signed-off-by: hagen1778 <roman@victoriametrics.com>

---------

Signed-off-by: hagen1778 <roman@victoriametrics.com>
2023-10-02 17:12:12 +02:00
Zakhar Bessarab
8d99c12a7d
lib/promscrape/discovery/kubernetes: supress context.Cancelled error in logs (#5048)
lib/promscrape/discovery/kubernetes: supress context.Cancelled error in logs

It is possible that context.Cancelled will appear after k8s watcher was closed due to reload(see https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4850).

Logging an error misinforms user and looks like vmagent discovery will stop working even though this does not affect discovery.

Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
2023-09-22 13:01:33 +02:00
Aliaksandr Valialkin
30a645cd82
lib/promscrape/discovery/kubernetes: follow-up after 03fece44e0
- Properly update vm_promscrape_discovery_kubernetes_url_watchers
  and vm_promscrape_discovery_kubernetes_group_watchers metrics after config changes

- Properly stop goroutine responsible for recreating scrapeWorks after the corresponding urlWatcher is stopped

- Log the event when urlWatcher is stopped in order to simplify debugging

Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4850
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/4861
2023-09-18 23:23:45 +02:00
Aliaksandr Valialkin
03fece44e0
lib/promscrape/discovery/kubernetes: wait for 10 seconds before checking whether the urlWatcher must be stopped
This should prevent from excess urlWatcher churn on config reload, since it leads to removal of all the apiWatchers
before creating new apiWatchers. So, every config reload would lead to stopping of all the previous urlWatchers
and starting new urlWatchers.

The new logic gives 10 seconds for config reload before stopping unused urlWatchers.

Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4850
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/4861
2023-09-18 17:45:12 +02:00
Aliaksandr Valialkin
76af32d869
lib/promscrape/discovery/kubernetes: follow-up after eeb862f3ff
- Move the bugfix description to the correct place in docs/CHANGELOG.md
- Prevent from logging of 'context canceled' errors after the url watcher is stopped,
  since these errors are expected and may confuse users.
- Remove unused urlWatcher.refCount field.
- Remove unused urlWatcher.close() method.

Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4850
2023-09-18 17:06:39 +02:00
Zakhar Bessarab
eeb862f3ff
lib/promscrape/discovery/kubernetes: fix leaking api watcher (#4861)
* lib/promscrape/discovery/kubernetes: fix leaking api watcher

goroutine which was polling k8s API had no execution control. This leaded to leaking goroutines during config reload.

See: https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4850
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>

* lib/promscrape/discovery/kubernetes: use reference counting for urlWatcher cleanup

Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>

* lib/promscrape/discovery/kubernetes: remove waitgroup sync for goroutines polling API server

This is unnecessary since context will is cancelled and new requests will not be sent. Also, using waitgroup will increase time required to perform reload which might result in missed scrapes.

Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>

* lib/promscrape/discovery/kubernetes: clarify comment

Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>

* Apply suggestions from code review

* lib/promscrape/discovery/kubernetes: address review feedback

Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>

---------

Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
Co-authored-by: Nikolay <nik@victoriametrics.com>
2023-09-15 19:40:13 +02:00
Aliaksandr Valialkin
edee262ecc
Makefile: update golangci-lint from v1.51.2 to v1.54.2
See https://github.com/golangci/golangci-lint/releases/tag/v1.54.2
2023-09-01 10:16:42 +02:00
Nikolay
00685b627f
lib/promscrape/k8s_sd: set resourceVersion to 0 by default for watch … (#4901)
* lib/promscrape/k8s_sd: set resourceVersion to 0 by default for watch requests
it must reduce load for kubernetes ETCD servers. Since requests without resourceVersion performs force cache sync at kubernetes API server with ETCD
more info at https://kubernetes.io/docs/reference/using-api/api-concepts/\#semantics-for-watch
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4855

* wip

---------

Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
2023-08-30 16:03:41 +02:00
hagen1778
5d848363f0
lib/promscrape: follow-up after eabcfc9bcd
`-promscrape.cluster.membersCount` by default should be `1`, like every
single vmagent is a cluster of one member on its own.
The change additionally validates that user can't set `-promscrape.cluster.membersCount`
to value lower than `1`.

eabcfc9bcd
Signed-off-by: hagen1778 <roman@victoriametrics.com>
2023-08-29 10:04:57 +02:00
Haleygo
eabcfc9bcd
fix clusterMembersCount check (#4900) 2023-08-29 15:58:24 +08:00
Zakhar Bessarab
6e8611f301
lib/promscrape/client: sync timeout for HostClient and http.Client (#4889)
Initially, stream parse mode was reading data from response and parsing it on flight. This was causing longer delay to read the whole response and required increasing timeout value to allow data processing while reading. So that 908e35affd increased timeout value to fix this.

But after 74c00a8762 response in stream parse mode is saved into memory and then parsed eliminating necessity of having timeout value higher that for usual scrape.

Updates: https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4847
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
2023-08-25 15:47:11 +02:00
Aliaksandr Valialkin
f1c2508243
lib/promscrape: add -promscrape.cluster.memberLabel command-line flag
This flag allows specifying an additional label to add to all the scraped metrics.
The flag must contain label name to add. The label value will be equal to -promscrape.cluster.memberNum.

This functionality can help when there is a need to differentiate metrics scraped
by distinct vmagent instances in the cluster according to https://docs.victoriametrics.com/vmagent.html#scraping-big-number-of-targets

Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4247

See https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4247#issuecomment-1692279393
2023-08-24 22:03:54 +02:00
hagen1778
4ebe8bb1d5
app/vmagent: follow-up after 6788704152
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4884
Signed-off-by: hagen1778 <roman@victoriametrics.com>
2023-08-24 11:36:42 +02:00
Zakhar Bessarab
6788704152
lib/promscrape/client: make User-Agent consistent between fasthttp and native client (#4886)
User agent was not set for native client which resulted in using one provided by Golang.

See: https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4884

Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
2023-08-24 11:31:13 +02:00
Aliaksandr Valialkin
d18ff993e6
lib/promscrape: add a comment why honor_timestamps is set to false by default
This should prevent from returning it back to true in the future

Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4697
2023-07-28 21:36:32 -07:00
Aliaksandr Valialkin
e3ef3df938
lib/promscrape: use local scrape timestamp for scraped metrics unless honor_timestamps: true is set explicitly
This fixes the case with gaps for metrics collected from cadvisor,
which exports invalid timestamps, which break staleness detection at VictoriaMetrics side.

See https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4697 ,
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4697#issuecomment-1654614799
and https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4697#issuecomment-1656540535

Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1773
2023-07-28 21:11:26 -07:00
Aliaksandr Valialkin
3d73640815
lib/promscrape/discovery: close unused HTTP connections to service discovery servers
This should prevent from connection leaks

See https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4724
2023-07-27 14:48:56 -07:00
Aliaksandr Valialkin
49bd2905fa
lib/promscrape: follow-up after 6aa50ca954
- Improve docs
- Hide `debug relabeling` column when -promscrape.dropOriginalLabels command-line flag is set
- Inline the code from the added template functions, since the code is harder to follow
  with the template functions, especially when these functions have misleading names.
  Also, these functions are used only in one place, e.g. they do not reduce the amounts of code.
- Hide `click to show original labels` title at `labels` column when original labels aren't available.
- Show the reason on whey original labels aren't available at /service-discovery page.

Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4597
2023-07-20 19:14:33 -07:00
Dmytro Kozlov
6aa50ca954
app/vmagent: fix creating target id if --promscrape.dropOriginalLabels flag was used (#4616)
* app/vmagent: fix creating target id if `--promscrape.dropOriginalLabels` flag was used

* app/vmagent: hide links if OriginalLabels was dropped

* app/vmagent: update CHANGELOG.md and added information to the docs

* app/vmagent: fix comments
2023-07-20 10:13:39 +02:00
Aliaksandr Valialkin
140e7b6b74
all: replace atomic.Value with atomic.Pointer[T]
This eliminates the need in .(*T) casting for results obtained from Load()

Leave atomic.Value for map, since atomic.Pointer[map[...]...] makes double pointer to map,
because map is already a pointer type.
2023-07-19 17:42:06 -07:00
Aliaksandr Valialkin
8a07621a0c
lib/promscrape: disable support for service discovery and metrics scrape via http2
Reasons for disabling http2:

- http2 is used very rarely comparing to http for Prometheus metrics exposition and service discovery
- http2 is much harder to debug than http
- http2 has very bad security record because of its complexity - see https://portswigger.net/research/http2

VictoriaMetrics components are compiled with nethttpomithttp2 tag because of these issues.

Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4283
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4274

This is a follow-up for 72c3cd47eb
2023-07-06 16:03:37 -07:00
Aliaksandr Valialkin
792860db10
lib/promscrape/discoveryutils: re-use checkRedirect function for both client and blockingClient
Also document follow_redirects option at https://docs.victoriametrics.com/sd_configs.html#http-api-client-options

This is a follow-up for b3d0ff463a

Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4282
2023-07-06 10:51:33 -07:00
Alexander Marshalov
40d12be607
fixed service name detection for consulagent service discovery in case of a difference in service name and service id (#4390) (#4439)
Signed-off-by: Alexander Marshalov <_@marshalov.org>
2023-06-12 16:16:43 +02:00
Roman Khavronenko
dfe53a36fc
lib/promscrape/discoveryutils: properly check for net.ErrClosed (#4426)
This error may be wrapped in another error, and should normally be tested using
`errors.Is(err, net.ErrClosed)`.

Signed-off-by: hagen1778 <roman@victoriametrics.com>
2023-06-09 09:26:33 +02:00
Roman Khavronenko
3305a6901c
app/vmagent: mention enable_http2 in changelog (#4403)
Follow-up after
72c3cd47eb

Signed-off-by: hagen1778 <roman@victoriametrics.com>
2023-06-05 16:31:58 +02:00
Haleygo
72c3cd47eb
vmagent:scrape config support enable_http2 (#4295)
app/vmagent: support `enable_http2` in scrape config 

This change adds HTTP2 support for scrape config
and improves compatibility with Prometheus config. 

See https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4283
2023-06-05 15:56:49 +02:00
Haleygo
b3d0ff463a
vmagent:support follow_redirects on SD level (#4286)
* vmagent:support follow_redirects on SD level

* fix follow_redirects on sd level

https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4282
2023-05-26 09:39:45 +02:00
Alexander Marshalov
2e494e2375
fixed typos in documentation and commandline flags descriptions (#4275) 2023-05-10 09:50:41 +02:00
Aliaksandr Valialkin
b9bb64ce55
lib/promscrape/discovery/consulagent: substitute metaPrefix with the __meta_consulagent_ plaintext string
This simplifies future code navigation and search for the specific meta-label starting from __meta_consulagent_* prefix.
For example, `grep __meta_consulagent_namespace` finds the exact place where this label is defined.

Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3953
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/4217
2023-05-08 23:40:13 -07:00
Aliaksandr Valialkin
74155afb71
docs: clarify docs after 5ee344824f
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4183
2023-05-08 16:11:44 -07:00
Aliaksandr Valialkin
ec3943d14a
app/vmselect: small cleanup after 4f3f9950d0
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3807
2023-05-08 14:57:11 -07:00
Alexander Marshalov
8225a48b56
fixed vm_promscrape_config_last_reload_successful metric value recovery after successful reloading with unchanged content (#4260) (#4268)
Signed-off-by: Alexander Marshalov <_@marshalov.org>
2023-05-08 13:32:51 +02:00
Zakhar Bessarab
4e71003620
lib/promscrape/discovery/kubernetes: follow-up for d5e94721db (#4255)
- add changelog reference to an author
- fix tests
- add metadata to match Prometheus behavior

Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
2023-05-05 14:41:17 +02:00