This commit adds Kubernetes Native Sidecar support.
It's the special type of init containers, that have restartPolicy == "Always" and continue to run after container initialization.
related issue https://github.com/VictoriaMetrics/VictoriaMetrics/issues/7287
### Describe Your Changes
Currently, vmagent always uses a separate `http.Client` for every group
watcher in Kubernetes SD. With a high number of group watchers this
leads to large amount of opened connections.
This PR adds 2 changes to address this:
- re-use of existing `http.Client` - in case `http.Client` is connecting
to the same API server and uses the same parameters it will be re-used
between group watchers
- HTTP2 support - this allows to reuse connections more efficiently due
to ability of using streaming via existing connections.
See this issue for the details and test results -
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5971
### Checklist
The following checks are **mandatory**:
- [ ] My change adheres [VictoriaMetrics contributing
guidelines](https://docs.victoriametrics.com/contributing/).
---------
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
Co-authored-by: Roman Khavronenko <roman@victoriametrics.com>
Previous bugfix at 49f63b2 only partially fixed pagination host validation error.
Before this fix it was:
```
unexpected nextLink host \"management.azure.com\", expecting \"https://management.azure.com\"
```
Now we only check the `Host` without schema.
However, when Azure respond `nextLink` in `Host:Port` format, the
`nextLink` check will fail:
```
unexpected nextLink host \"management.azure.com:443\", expecting \"management.azure.com\"
```
This pull request further relaxes the checks by only checking the
`Hostname`.
---
related issue: https://github.com/VictoriaMetrics/VictoriaMetrics/issues/6912
- Obtain IAM token via GCE-like API instead of Amazon EC2 IMDSv2 API,
since it looks like IMDBSv2 API isn't supported by Yandex Cloud
according to https://yandex.cloud/en/docs/security/standard/authentication#aws-token :
> So far, Yandex Cloud does not support version 2, so it is strongly recommended
> to technically disable getting a service account token via the Amazon EC2 metadata service.
- Try obtaining IAM token via GCE-like API at first and then fall back to the deprecated Amazon EC2 IMDBSv1.
This should prevent from auth errors for instances with disabled GCE-like auth API.
This addresses @ITD27M01 concern at https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5513#issuecomment-1867794884
- Make more clear the description of the change at docs/CHANGELOG.md , add reference to the related issue.
P.S. This change wasn't tested in prod because I have no access to Yandex Cloud.
It is recommended to test this change by @ITD27M01 and @vmazgo , who filed
the issue https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5513
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/6524
'any' type is supported starting from Go1.18. Let's consistently use it
instead of 'interface{}' type across the code base, since `any` is easier to read than 'interface{}'.
This makes easier to read and debug these tests. This also reduces test lines count by 15% from 3K to 2.5K
See https://itnext.io/f-tests-as-a-replacement-for-table-driven-tests-in-go-8814a8b19e9e
While at it, consistently use t.Fatal* instead of t.Error*, since t.Error* usually leads
to more complicated and fragile tests, while it doesn't bring any practical benefits over t.Fatal*.
### Describe Your Changes
related issue:
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/6041
#### Added
- Added service discovery support for Vultr.
#### Docs
- `CHANGELOG.md`, `sd_configs.md`, `vmagent.md` are updated.
#### Note
- Useful links:
- Vultr API:
https://www.vultr.com/api/#tag/instances/operation/list-instances
- Vultr client SDK: https://github.com/vultr/govultr
- Prometheus SD:
https://github.com/prometheus/prometheus/tree/main/discovery/vultr
---
### Checklist
The following checks are mandatory:
- [X] I have read the [Contributing
Guidelines](https://github.com/VictoriaMetrics/VictoriaMetrics/blob/master/CONTRIBUTING.md)
- [x] All commits are signed and include `Signed-off-by` line. Use `git
commit -s` to include `Signed-off-by` your commits. See this
[doc](https://git-scm.com/book/en/v2/Git-Tools-Signing-Your-Work) about
how to sign your commits.
- [x] Tests are passing locally. Use `make test` to run all tests
locally.
- [x] Linting is passing locally. Use `make check-all` to run all
linters locally.
Further checks are optional for External Contributions:
- [X] Include a link to the GitHub issue in the commit message, if issue
exists.
- [x] Mention the change in the
[Changelog](https://github.com/VictoriaMetrics/VictoriaMetrics/blob/master/docs/CHANGELOG.md).
Explain what has changed and why. If there is a related issue or
documentation change - link them as well.
Tips for writing a good changelog message::
* Write a human-readable changelog message that describes the problem
and solution.
* Include a link to the issue or pull request in your changelog message.
* Use specific language identifying the fix, such as an error message,
metric name, or flag name.
* Provide a link to the relevant documentation for any new features you
add or modify.
- [ ] After your pull request is merged, please add a message to the
issue with instructions for how to test the fix or try the feature you
added. Here is an
[example](https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4048#issuecomment-1546453726)
- [x] Do not close the original issue before the change is released.
Please note, in some cases Github can automatically close the issue once
PR is merged. Re-open the issue in such case.
- [x] If the change somehow affects public interfaces (a new flag was
added or updated, or some behavior has changed) - add the corresponding
change to documentation.
Signed-off-by: Jiekun <jiekun.dev@gmail.com>
- Automatically reload changed TLS root CA pointed by -remoteWrite.tlsCAFile command-line flag
- Automatically reload changed TLS root CA configured via oauth2.tsl_config.ca_file option at -promscrape.config
- Document the change as a feature instead of a bug at docs/CHANGELOG.md
- Simplify the code at lib/promauth, which is responsible for reloading changed TLS root CA files.
- Simplify the usage of lib/promauth.Config.NewRoundTripper() - now it accepts the base http.Transport
instead of a callback, which can change the internal http.Transport.
- Reuse the default tls config if lib/promauth.Config doesn't contain tls-specific configs.
This should reduce memory usage a bit when tls isn't used for scraping big number of targets.
- Do not re-read TLS root CA files on every processed request. Re-read them once per second.
This should reduce CPU usage when scraping big number of targets over https.
- Do not store cert.pem and key.pem files in TestTLSConfigWithCertificatesFilesUpdate, since they can be loaded
from byte slices via crypto/tls.X509KeyPair().
- Remove obsolete comparisons of string representations for authConfig and proxyAuthConfig at areEqualScrapeConfigs().
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/5725
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5526
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/2171
* lib/{promauth,promscrape}: automatically refresh root CA certificates after changes on disk
Added a custom `http.RoundTripper` implementation which checks for root CA content changes and updates `tls.Config` used by `http.RoundTripper` after detecting CA change.
Client certificate changes are not tracked by this implementation since `tls.Config` already supports passing certificate dynamically by overriding `tls.Config.GetClientCertificate`.
This change implements dynamic reload of root CA only for streaming client used for scraping. Blocking client (`fasthttp.HostClient`) does not support using custom transport so can't use this implementation.
See: https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5526
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* lib/promauth/config: update NewRoundTripper API
Update API to allow user to update only parameters required for transport.
Add warning log when reloading Root CA failed.
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* lib/promauth/config: fix mutex acquire logic
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* lib/promauth/config: replace RWMutex with regular mutex to simplify the code
- remove additional mutex used for getRootCABytes - require callee to use mutex
- replace RWMutex with regular mutex
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* lib/promauth/config: refactor
- hold the mutex lock to avoid round tripper being re-created twice
- move recreation logic into separate func to simplify the code
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
---------
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
Co-authored-by: Nikolay <nik@victoriametrics.com>
This should smooth CPU and RAM usage spikes related to these periodic tasks,
by reducing the probability that multiple concurrent periodic tasks are performed at the same time.
* lib/promscrape/discovery/kubernetes: fix watcher start order for roles endpoints and endpointslice
Previously the groupWatcher could be mistakenly stopped when requests for pod or services resources take too long.
* remove mislead comment
* docs/sd_configs.md: mention -promscrape.kubernetes.attachNodeMetadataAll flag in the description for attach_metadata section
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4640
* wip
* lib/promscrape/kubernetes: prevent from stopping groupWatcher when there are in-flight apiWatcher.mustStart() calls
groupWatcher is stopped if it has zero registered apiWatchers during 14 seconds.
But such a groupWatcher can be still in use if apiWatcher for `role: endpoints` or `role: endpointslice`
is being registered and the discovery of the associated `pod` and/or `service` objects takes longer
than 14 seconds - see the beginning of groupWatcher.startWatchersForRole() function for details.
Track the number of in-flight calls to apiWatcher.mustStart() and prevent from stopping the associated groupWatcher
if the number of in-flight calls is non-zero.
P.S. postponing the discovery of `pod` and/or `service` objects associated with `endpoints` or `endpointslice` roles
isn't the best solution, since it slows down initial discovery of `endpoints` and `endpointslice` targets.
* typo fix
---------
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
Examples:
1) -metricsAuthKey=file:///abs/path/to/file - reads flag value from the given absolute filepath
2) -metricsAuthKey=file://./relative/path/to/file - reads flag value from the given relative filepath
3) -metricsAuthKey=http://some-host/some/path?query_arg=abc - reads flag value from the given url
The flag value is automatically updated when the file contents changes.
- docs/sd_configs.md: moved hetzner_sd_configs docs to the correct place according to alphabetical order of SD names,
document missing __meta_hetzner_role label.
- lib/promscrape/config.go: added missing MustStop() call for Hetzner SD,
and moved the code to the correct place according to alphabetical order of SD names.
- lib/promscrape/discovery/hetzner: properly handle pagination for hloud API responses,
populate missing __meta_hetzner_role label like Prometheus does.
- Properly populate __meta_hetzner_public_ipv6_network label like Prometheus does.
- Remove unused SDConfig.Token.
- Remove "omitempty" annotation from SDConfig.Role field, since this field is mandatory.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/5550
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3154
This allows substituting FATAL panics with recoverable runtime errors such as missing or invalid TLS CA file
and/or missing/invalid /var/run/secrets/kubernetes.io/serviceaccount/namespace file.
Now these errors are logged instead of PANIC'ing, so they can be fixed by updating the corresponding files
without the need to restart vmagent.
This is a follow-up for 90427abc65
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/5243
Previously url watchers for pod, service and node objects could be mistakenly closed
when service discovery was set up only for endpoints and endpointslice roles,
since watchers for these roles may start start pod, service and node url watchers
with nil apiWatcher passed to groupWatcher.startWatchersForRole().
Now all the url watchers, which belong to a particular groupWatcher, are stopped at once
when this groupWatcher has no apiWatcher subscribers.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5216
The issue has been introduced in v1.93.5 when addressing https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4850
* do not print redundant error logs when failed to scrape consul or nomad target
prometheus performs the same because it uses consul lib which just drops the error(1806bcb38c/api/api.go (L1134))
- Make sure that invalid/missing TLS CA file or TLS client certificate files at vmagent startup
don't prevent from processing the corresponding scrape targets after the file becomes correct,
without the need to restart vmagent.
Previously scrape targets with invalid TLS CA file or TLS client certificate files
were permanently dropped after the first attempt to initialize them, and they didn't
appear until the next vmagent reload or the next change in other places of the loaded scrape configs.
- Make sure that TLS CA is properly re-loaded from file after it changes without the need to restart vmagent.
Previously the old TLS CA was used until vmagent restart.
- Properly handle errors during http request creation for the second attempt to send data to remote system
at vmagent and vmalert. Previously failed request creation could result in nil pointer dereferencing,
since the returned request is nil on error.
- Add more context to the logged error during AWS sigv4 request signing before sending the data to -remoteWrite.url at vmagent.
Previously it could miss details on the source of the request.
- Do not create a new HTTP client per second when generating OAuth2 token needed to put in Authorization header
of every http request issued by vmagent during service discovery or target scraping.
Re-use the HTTP client instead until the corresponding scrape config changes.
- Cache error at lib/promauth.Config.GetAuthHeader() in the same way as the auth header is cached,
e.g. the error is cached for a second now. This should reduce load on CPU and OAuth2 server
when auth header cannot be obtained because of temporary error.
- Share tls.Config.GetClientCertificate function among multiple scrape targets with the same tls_config.
Cache the loaded certificate and the error for one second. This should significantly reduce CPU load
when scraping big number of targets with the same tls_config.
- Allow loading TLS certificates from HTTP and HTTPs urls by specifying these urls at `tls_config->cert_file` and `tls_config->key_file`.
- Improve test coverage at lib/promauth
- Skip unreachable or invalid files specified at `scrape_config_files` during vmagent startup, since these files may become valid later.
Previously vmagent was exitting in this case.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4959
* fix inconsistent behaviors with prometheus when scraping
1. address https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4959. skip job with wrong syntax in `scrape_configs` with error logs instead of exiting;
2. show error messages on vmagent /targets ui if there are wrong auth configs in `scrape_configs`, previously will print error logs and do scrape without auth header;
3. don't send requests if there are wrong auth configs in:
1. vmagent remoteWrite;
2. vmalert datasource/remoteRead/remoteWrite/notifier.
* add changelogs
* address review comments
* fix ut
lib/promscrape/discovery/kubernetes: supress context.Cancelled error in logs
It is possible that context.Cancelled will appear after k8s watcher was closed due to reload(see https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4850).
Logging an error misinforms user and looks like vmagent discovery will stop working even though this does not affect discovery.
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
- Move the bugfix description to the correct place in docs/CHANGELOG.md
- Prevent from logging of 'context canceled' errors after the url watcher is stopped,
since these errors are expected and may confuse users.
- Remove unused urlWatcher.refCount field.
- Remove unused urlWatcher.close() method.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4850
* lib/promscrape/discovery/kubernetes: fix leaking api watcher
goroutine which was polling k8s API had no execution control. This leaded to leaking goroutines during config reload.
See: https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4850
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* lib/promscrape/discovery/kubernetes: use reference counting for urlWatcher cleanup
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* lib/promscrape/discovery/kubernetes: remove waitgroup sync for goroutines polling API server
This is unnecessary since context will is cancelled and new requests will not be sent. Also, using waitgroup will increase time required to perform reload which might result in missed scrapes.
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* lib/promscrape/discovery/kubernetes: clarify comment
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* Apply suggestions from code review
* lib/promscrape/discovery/kubernetes: address review feedback
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
---------
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
Co-authored-by: Nikolay <nik@victoriametrics.com>
This eliminates the need in .(*T) casting for results obtained from Load()
Leave atomic.Value for map, since atomic.Pointer[map[...]...] makes double pointer to map,
because map is already a pointer type.