Commit graph

71 commits

Author SHA1 Message Date
Aliaksandr Valialkin
d52fd73f18
all: add up to 10% random jitter to the interval between periodic tasks performed by various components
This should smooth CPU and RAM usage spikes related to these periodic tasks,
by reducing the probability that multiple concurrent periodic tasks are performed at the same time.
2024-01-22 18:39:16 +02:00
Hui Wang
49fa92c1d0
lib/promscrape/discovery/kubernetes: fix watcher start order for roles endpoints and endpointslice (#5557)
* lib/promscrape/discovery/kubernetes: fix watcher start order for roles endpoints and endpointslice

Previously the groupWatcher could be mistakenly stopped when requests for pod or services resources take too long.

* remove mislead comment

* docs/sd_configs.md: mention -promscrape.kubernetes.attachNodeMetadataAll flag in the description for attach_metadata section

Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4640

* wip

* lib/promscrape/kubernetes: prevent from stopping groupWatcher when there are in-flight apiWatcher.mustStart() calls

groupWatcher is stopped if it has zero registered apiWatchers during 14 seconds.
But such a groupWatcher can be still in use if apiWatcher for `role: endpoints` or `role: endpointslice`
is being registered and the discovery of the associated `pod` and/or `service` objects takes longer
than 14 seconds - see the beginning of groupWatcher.startWatchersForRole() function for details.

Track the number of in-flight calls to apiWatcher.mustStart() and prevent from stopping the associated groupWatcher
if the number of in-flight calls is non-zero.

P.S. postponing the discovery of `pod` and/or `service` objects associated with `endpoints` or `endpointslice` roles
isn't the best solution, since it slows down initial discovery of `endpoints` and `endpointslice` targets.

* typo fix

---------

Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
2024-01-22 01:33:17 +02:00
Aliaksandr Valialkin
2c7c812a9d
lib/promscrape/discovery/kubernetes: add -promscrape.kubernetes.attachNodeMetadataAll command-line flag
This flag allows setting attach_metadata.node=true for all the kubernetes_sd_configs defined at -promscrape.config

Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4640

Thanks to wasim-nihal for the initial implementation at https://github.com/VictoriaMetrics/VictoriaMetrics/pull/5593
2024-01-22 01:08:52 +02:00
Aliaksandr Valialkin
15dda54e79
lib/promscrape/discovery/kubernetes: propagate possible errors at newAPIWatcher() to the caller
This allows substituting FATAL panics with recoverable runtime errors such as missing or invalid TLS CA file
and/or missing/invalid /var/run/secrets/kubernetes.io/serviceaccount/namespace file.
Now these errors are logged instead of PANIC'ing, so they can be fixed by updating the corresponding files
without the need to restart vmagent.

This is a follow-up for 90427abc65
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/5243
2023-10-27 20:27:58 +02:00
Hui Wang
a37125d043
lib/promscrape/discovery/kubernetes: avoid possible panic if given caFile under kubernetes.SDConfig.HTTPClientConfig is not exist (#5243)
follow up d5a599badc
2023-10-27 20:27:58 +02:00
Aliaksandr Valialkin
20aeb8b65d
lib/promscrape/discovery/kubernetes: stop all the url watchers, which belong to a particular groupWatcher, at once
Previously url watchers for pod, service and node objects could be mistakenly closed
when service discovery was set up only for endpoints and endpointslice roles,
since watchers for these roles may start start pod, service and node url watchers
with nil apiWatcher passed to groupWatcher.startWatchersForRole().

Now all the url watchers, which belong to a particular groupWatcher, are stopped at once
when this groupWatcher has no apiWatcher subscribers.

Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5216

The issue has been introduced in v1.93.5 when addressing https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4850
2023-10-27 14:34:25 +02:00
Aliaksandr Valialkin
f03e81c693
lib/promauth: follow-up for e16d3f5639
- Make sure that invalid/missing TLS CA file or TLS client certificate files at vmagent startup
  don't prevent from processing the corresponding scrape targets after the file becomes correct,
  without the need to restart vmagent.
  Previously scrape targets with invalid TLS CA file or TLS client certificate files
  were permanently dropped after the first attempt to initialize them, and they didn't
  appear until the next vmagent reload or the next change in other places of the loaded scrape configs.

- Make sure that TLS CA is properly re-loaded from file after it changes without the need to restart vmagent.
  Previously the old TLS CA was used until vmagent restart.

- Properly handle errors during http request creation for the second attempt to send data to remote system
  at vmagent and vmalert. Previously failed request creation could result in nil pointer dereferencing,
  since the returned request is nil on error.

- Add more context to the logged error during AWS sigv4 request signing before sending the data to -remoteWrite.url at vmagent.
  Previously it could miss details on the source of the request.

- Do not create a new HTTP client per second when generating OAuth2 token needed to put in Authorization header
  of every http request issued by vmagent during service discovery or target scraping.
  Re-use the HTTP client instead until the corresponding scrape config changes.

- Cache error at lib/promauth.Config.GetAuthHeader() in the same way as the auth header is cached,
  e.g. the error is cached for a second now. This should reduce load on CPU and OAuth2 server
  when auth header cannot be obtained because of temporary error.

- Share tls.Config.GetClientCertificate function among multiple scrape targets with the same tls_config.
  Cache the loaded certificate and the error for one second. This should significantly reduce CPU load
  when scraping big number of targets with the same tls_config.

- Allow loading TLS certificates from HTTP and HTTPs urls by specifying these urls at `tls_config->cert_file` and `tls_config->key_file`.

- Improve test coverage at lib/promauth

- Skip unreachable or invalid files specified at `scrape_config_files` during vmagent startup, since these files may become valid later.
  Previously vmagent was exitting in this case.

Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4959
2023-10-26 09:55:47 +02:00
Hui Wang
d7dd7614eb
fix inconsistent behaviors with prometheus when scraping (#5153)
* fix inconsistent behaviors with prometheus when scraping

1. address https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4959. skip job with wrong syntax in `scrape_configs` with error logs instead of exiting;
2. show error messages on vmagent /targets ui if there are wrong auth configs in `scrape_configs`, previously will print error logs and do scrape without auth header;
3. don't send requests if there are wrong auth configs in:
    1. vmagent remoteWrite;
    2. vmalert datasource/remoteRead/remoteWrite/notifier.

* add changelogs

* address review comments

* fix ut
2023-10-26 08:56:54 +02:00
Zakhar Bessarab
0be8960875
lib/promscrape/discovery/kubernetes: supress context.Cancelled error in logs (#5048)
lib/promscrape/discovery/kubernetes: supress context.Cancelled error in logs

It is possible that context.Cancelled will appear after k8s watcher was closed due to reload(see https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4850).

Logging an error misinforms user and looks like vmagent discovery will stop working even though this does not affect discovery.

Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
(cherry picked from commit 8d99c12a7d)
2023-09-22 13:02:57 +02:00
Aliaksandr Valialkin
de2b3ff9b0
lib/promscrape/discovery/kubernetes: follow-up after 03fece44e0
- Properly update vm_promscrape_discovery_kubernetes_url_watchers
  and vm_promscrape_discovery_kubernetes_group_watchers metrics after config changes

- Properly stop goroutine responsible for recreating scrapeWorks after the corresponding urlWatcher is stopped

- Log the event when urlWatcher is stopped in order to simplify debugging

Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4850
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/4861
2023-09-19 00:44:21 +02:00
Aliaksandr Valialkin
705b31c351
lib/promscrape/discovery/kubernetes: wait for 10 seconds before checking whether the urlWatcher must be stopped
This should prevent from excess urlWatcher churn on config reload, since it leads to removal of all the apiWatchers
before creating new apiWatchers. So, every config reload would lead to stopping of all the previous urlWatchers
and starting new urlWatchers.

The new logic gives 10 seconds for config reload before stopping unused urlWatchers.

Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4850
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/4861
2023-09-19 00:43:26 +02:00
Aliaksandr Valialkin
fe24523e19
lib/promscrape/discovery/kubernetes: follow-up after eeb862f3ff
- Move the bugfix description to the correct place in docs/CHANGELOG.md
- Prevent from logging of 'context canceled' errors after the url watcher is stopped,
  since these errors are expected and may confuse users.
- Remove unused urlWatcher.refCount field.
- Remove unused urlWatcher.close() method.

Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4850
2023-09-19 00:41:29 +02:00
Zakhar Bessarab
55d25fb844
lib/promscrape/discovery/kubernetes: fix leaking api watcher (#4861)
* lib/promscrape/discovery/kubernetes: fix leaking api watcher

goroutine which was polling k8s API had no execution control. This leaded to leaking goroutines during config reload.

See: https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4850
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>

* lib/promscrape/discovery/kubernetes: use reference counting for urlWatcher cleanup

Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>

* lib/promscrape/discovery/kubernetes: remove waitgroup sync for goroutines polling API server

This is unnecessary since context will is cancelled and new requests will not be sent. Also, using waitgroup will increase time required to perform reload which might result in missed scrapes.

Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>

* lib/promscrape/discovery/kubernetes: clarify comment

Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>

* Apply suggestions from code review

* lib/promscrape/discovery/kubernetes: address review feedback

Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>

---------

Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
Co-authored-by: Nikolay <nik@victoriametrics.com>
2023-09-18 17:13:55 +02:00
Nikolay
ae85b20c5b
lib/promscrape/k8s_sd: set resourceVersion to 0 by default for watch … (#4901)
* lib/promscrape/k8s_sd: set resourceVersion to 0 by default for watch requests
it must reduce load for kubernetes ETCD servers. Since requests without resourceVersion performs force cache sync at kubernetes API server with ETCD
more info at https://kubernetes.io/docs/reference/using-api/api-concepts/\#semantics-for-watch
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4855

* wip

---------

Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
2023-08-30 16:04:14 +02:00
my-git9
7d86c5c94a
chore: Use http constants to replace numbers (#3846)
Signed-off-by: xin.li <xin.li@daocloud.io>
2023-02-22 18:59:32 -08:00
Oleksandr Redko
0e1c395609
app,lib: fix typos in comments (#3804) 2023-02-13 09:32:35 -08:00
Aliaksandr Valialkin
be6da5053f
lib/promscrape: optimize service discovery speed
- Return meta-labels for the discovered targets via promutils.Labels
  instead of map[string]string. This improves the speed of generating
  meta-labels for discovered targets by up to 5x.

- Remove memory allocations in hot paths during ScrapeWork generation.
  The ScrapeWork contains scrape settings for a single discovered target.
  This improves the service discovery speed by up to 2x.
2022-11-29 21:26:23 -08:00
Nikolay
ea0596d9d8
lib/promscrape/discovery/kubernetes: correctly wrap error (#3250)
* lib/promscrape/discovery/kubernetes: correctly wrap error
follow-up after 1304824201

* Update docs/CHANGELOG.md

Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
2022-10-18 20:40:37 +03:00
Aliaksandr Valialkin
592612b63f
lib/promscrape/discovery/kubernetes: add more context on WatchEvent parse error
This should improve debugging issues with Kubernetes API server
2022-09-13 19:37:40 +03:00
Aliaksandr Valialkin
1905618d10
all: subsitute ioutil.ReadAll with io.ReadAll
ioutil.ReadAll is deprecated since Go1.16 - see https://tip.golang.org/doc/go1.16#ioutil
VictoriaMetrics requires at least Go1.18, so it is OK to switch from ioutil.ReadAll to io.ReadAll.

This is a follow-up for 02ca2342ab
2022-08-22 00:16:04 +03:00
Aliaksandr Valialkin
06f6de6d47
all: use os.{Read|Write}File instead of ioutil.{Read|Write}File
The ioutil.{Read|Write}File is deprecated since Go1.16 -
see https://tip.golang.org/doc/go1.16#ioutil

VictoriaMetrics needs at least Go1.18, so it is safe to remove ioutil usage
from source code.

This is a follow-up for 02ca2342ab
2022-08-21 23:55:20 +03:00
Aliaksandr Valialkin
c4cc45d7f8
lib/promscrape/discovery/kubernetes: allow attaching node-level labels to role: endpoints and role: endpointlice targets in the same way as Prometheus does
See https://github.com/prometheus/prometheus/pull/10759
2022-07-07 00:36:24 +03:00
Aliaksandr Valialkin
3ae6300497
lib/promauth: add ability to send additional http headers in requests to scrape targets
This solves https://stackoverflow.com/questions/66032498/prometheus-scrape-metric-with-custom-header
2022-06-22 20:40:50 +03:00
Roman Khavronenko
7406665fc3
lib/promscrape/discovery/kubernetes: fixes kubernetes service discovery (#2615)
* lib/promscrape/discovery/kubernetes: properly updates discovered scrape works
previously, added or updated scrapeworks may override previuosly
discovered.
it happens because swosByKey may contain small subset of kubernetes
objects with it's labels.
It happens for objectsUpdated and objectsAdded maps, which include only changed elements

* Properly calculate vm_promscrape_discovery_kubernetes_scrape_works

Co-authored-by: f41gh7 <nik@victoriametrics.com>
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
2022-05-21 01:17:21 +03:00
Aliaksandr Valialkin
810dd74fb9
lib/promscrape: properly implement ScrapeConfig.clone()
Previously ScrapeConfig.clone() was improperly copying promauth.Secret fields -
their contents was replaced with `<secret>` value.

This led to inability to use passwords and secrets in `-promscrape.config` file.
The bug has been introduced in v1.77.0 in the commit 67b10896d2

Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/2551
2022-05-07 00:06:19 +03:00
Aliaksandr Valialkin
c2b13e6a04
lib/promscrape/discovery/kubernetes: limit the minimum sleep time between updating dependent ScrapeWork objects
Previously the sleep time could be dropped to nanoseconds, which could result in CPU time waste
2022-04-22 23:15:34 +03:00
Aliaksandr Valialkin
a89e31b304
lib/promscrape/discovery/kubernetes: allow attaching node-level labels and annotations to discovered pod targets in the same way as Prometheus 2.35 does
See https://github.com/prometheus/prometheus/issues/9510
and https://github.com/prometheus/prometheus/pull/10080
2022-04-22 20:15:34 +03:00
Aliaksandr Valialkin
cc6eae6992
lib/promscrape/discovery/kubernetes: improve the performance of urlWatcher.reloadObjects() on multi-CPU systems
Parallelize the generation of ScrapeWork objects there. Previously they were generated in a single goroutine.
2022-04-22 13:23:39 +03:00
Aliaksandr Valialkin
fea9d1e6ee
lib/promscrape/discovery/kubernetes: properly update endpoints and endpointslice objects when the related pod or service objects are updated
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1240

This is a follow-up for 2341bd48d7
2022-04-21 13:06:49 +03:00
Aliaksandr Valialkin
e9f08b1e6a
lib/promscrape/discovery/kubernetes: do not pre-allocate memory for ScrapeWork objects
There is high chance that ScrapeWork objects won't be generated because of relabeling
2022-04-20 16:42:41 +03:00
Aliaksandr Valialkin
909a3ee0e4
lib/promscrape: follow-up after 91e290a8ff 2022-04-20 16:12:26 +03:00
Nikolay
429848a67d
lib/promscrape: reduce latency for k8s GetLabels (#2454)
replaces internStringMap with sync.Map - it greatly reduces lock contention
concurently reload scrape work for api watcher - each object labels added by dedicated CPU

changes can be tested with following script https://gist.github.com/f41gh7/6f8f8d8719786aff1f18a85c23aebf70
2022-04-20 16:12:25 +03:00
Aliaksandr Valialkin
d0bac8e224
all: typo fix: Kuberntes -> Kubernetes 2022-04-20 10:51:41 +03:00
Aliaksandr Valialkin
bc18368c15
lib/promscrape/discovery/kubernetes: add the ability to limit service discovery to the current namespace
See https://github.com/prometheus/prometheus/issues/9782 and https://github.com/prometheus/prometheus/pull/9881
2022-01-13 22:44:59 +02:00
Aliaksandr Valialkin
146c14d879 lib/promscrape/discovery/kubernetes: return back support role: endpointslices, since it is used by VictoriaMetrics operator
This is a follow up commit after 31b42b30b6
2021-08-29 12:37:36 +03:00
Aliaksandr Valialkin
ca61d7c82b lib/promscrape/discovery/kubernetes: rename role: endpointslices to role: endpointslice to be consistent with Prometheus
See 2ec6c7dbb8/discovery/kubernetes/kubernetes.go (L99)
2021-08-29 11:23:59 +03:00
Aliaksandr Valialkin
327034b54f lib/promscrape/discovery/kubernetes: use v1 API instead of v1beta1 API for role: ingress and role: endpointslices
This should fix service discovery for these roles in Kubernetes v1.22 and newer versions.
See https://kubernetes.io/docs/reference/using-api/deprecation-guide/#ingress-v122

The corresponding change in Prometheus - https://github.com/prometheus/prometheus/pull/9205
2021-08-29 11:23:58 +03:00
Aliaksandr Valialkin
110a888e39 lib/promscrape/discovery/kubernetes: make golangci-lint happy by removing empty branches 2021-05-20 12:00:17 +03:00
Aliaksandr Valialkin
9d97f44772 lib/promscrape/discovery/kubernetes: reload objects on object parse error
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1240
2021-05-18 23:27:24 +03:00
Aliaksandr Valialkin
c507faec0b lib/promscrape/discovery/kubernetes: simplify the reload logic for urlWatcher.objectsByKey 2021-05-18 15:41:51 +03:00
Aliaksandr Valialkin
0f54c0121b lib/promscrape/discovery/kubernetes: properly update vm_promscrape_discovery_kubernetes_scrape_works metric
Previously it wasn't descreased during config update.
2021-05-18 15:41:51 +03:00
Aliaksandr Valialkin
9f62d348db lib/promscrape/discovery/kubernetes: log errors and stop service discovery when unexpected updates are received from Kubernetes API server
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1240
2021-05-18 15:41:51 +03:00
Aliaksandr Valialkin
8764b0ae21 lib/promscrape/discovery/kubernetes: key ScrapeWork objects by urlWatcher instead of namespace
This makes the code less fragile if urlWatcher would depend on additional to namepsace properties.

Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1170
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1240
2021-05-17 23:49:48 +03:00
Aliaksandr Valialkin
e08287f017 lib/promscrape: reload auth tokens from files every second
Previously auth tokens were loaded at startup and couldn't be updated without vmagent restart.
Now there is no need in vmagent restart.

Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1297
2021-05-14 20:03:35 +03:00
Aliaksandr Valialkin
a6cb4f10a7 app/{vmalert,vmauth}: explicitly set MaxIdleConnsPerHost in net/http.Client.Transport
By default MaxIdleConnsPerHost is set to 2. This limits the possibility to re-use http keep-alive connections.

Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1300
2021-05-14 18:13:34 +03:00
Aliaksandr Valialkin
027607db3e lib/promscrape/discovery/kubernetes: refresh endpoints and endpointslices scrape targets every 5 seconds, since they may depend on changed service and pod objects
This should make endpoints and endpointslices scrape targets eventually consistent with the maximum delay of 5 seconds after the related service or pod object changes.

Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1240
2021-05-12 14:12:43 +03:00
Aliaksandr Valialkin
e6c19cb09d lib/promscrape/discovery/kubernetes: start watchers for pods and services before starting watchers for endpoints
This should eliminate possible race when an update on endpoints depends on pods and/or services, which are missing in the cache yet.
This could result in missing targets based on endpoints or endpointslices.

Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1240
2021-05-05 12:23:16 +03:00
Aliaksandr Valialkin
421a92983a lib/promscrape/discovery/kubernetes: remove a mutex at urlWatcher - use groupWatcher mutex for accessing all the urlWatcher children
This simplifies the code a bit and reduces the probability of improper mutex handling and deadlocks.
2021-04-29 10:17:45 +03:00
Aliaksandr Valialkin
b3da457629 lib/promscrape/discovery/kubernetes: fix a deadlock introduced in eddba29664
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1240

Thanks to @f41gh7 for providing the initial idea for deadlock fix at https://github.com/VictoriaMetrics/VictoriaMetrics/pull/1248
2021-04-27 14:59:56 +03:00
Aliaksandr Valialkin
34321e5f8d lib/promscrape/discovery/kubernetes: refresh role: endpoints targets on service object removal as Prometheus does
This is a follow-up for ae37cfd528

Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1240
2021-04-23 20:27:29 +03:00