Commit graph

2103 commits

Author SHA1 Message Date
Aliaksandr Valialkin
3a831da481
lib/promscrape: go fmt after d918a7267d 2024-02-12 20:03:19 +02:00
Aliaksandr Valialkin
d918a7267d
lib/promscrape: fix TestGetStaticScrapeWorkSuccess test after ca67926952 2024-02-12 20:00:29 +02:00
Aliaksandr Valialkin
c07c9ad028
lib/mergeset: do not panic on too long items passed to Table.AddItems()
Instead, log a sample of these long items once per 5 seconds into error log,
so users could notice and fix the issue with too long labels or too many labels.

Previously this panic could occur in production when ingesting samples with too long labels.
2024-02-12 19:45:13 +02:00
Aliaksandr Valialkin
246a7758ac
lib/mergeset: prevent from possible too big indexBlockSize panic
This panic could occur when samples with too long label values are ingested into VictoriaMetrics.
This could result in too long fistItem and commonPrefix values at blockHeader (up to 64kb each).
This may inflate the maximum index block size by 4 * maxIndexBlockSize.
2024-02-08 12:58:57 +02:00
Aliaksandr Valialkin
c7b3969b43
lib/storage: move fixupTimestamps() call to Block.Init()
This is a follow-up for 0bf7921721
2024-02-06 22:45:20 +02:00
Zakhar Bessarab
25f83e6c54
lib/storage/raw_row: properly initialize TS for tmp blocks (#5762)
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
2024-02-06 22:45:18 +02:00
Roman Khavronenko
ca67926952
lib/promscrape: respect 0 value for series_limit param (#5663)
* lib/promscrape: respect `0` value for `series_limit` param

Respect `0` value for `series_limit` param in `scrape_config`
even if global limit was set via `-promscrape.seriesLimitPerTarget`.
Previously, `0` value will be ignored in favor of `-promscrape.seriesLimitPerTarget`.

This behavior aligns with possibility to override `series_limit` value via
relabeling with `__series_limit__` label.

Signed-off-by: hagen1778 <roman@victoriametrics.com>

* Update docs/CHANGELOG.md

---------

Signed-off-by: hagen1778 <roman@victoriametrics.com>
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
2024-01-23 13:17:13 +02:00
Aliaksandr Valialkin
48a851d7d1
app/vmselect: handle negative time range start in a generic manner inside NewSearchQuery()
This is a follow-up for cf03e11d89

Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5553
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/5630
2024-01-22 01:41:24 +02:00
Hui Wang
26980cd76d
lib/promscrape/discovery/kubernetes: fix watcher start order for roles endpoints and endpointslice (#5557)
* lib/promscrape/discovery/kubernetes: fix watcher start order for roles endpoints and endpointslice

Previously the groupWatcher could be mistakenly stopped when requests for pod or services resources take too long.

* remove mislead comment

* docs/sd_configs.md: mention -promscrape.kubernetes.attachNodeMetadataAll flag in the description for attach_metadata section

Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4640

* wip

* lib/promscrape/kubernetes: prevent from stopping groupWatcher when there are in-flight apiWatcher.mustStart() calls

groupWatcher is stopped if it has zero registered apiWatchers during 14 seconds.
But such a groupWatcher can be still in use if apiWatcher for `role: endpoints` or `role: endpointslice`
is being registered and the discovery of the associated `pod` and/or `service` objects takes longer
than 14 seconds - see the beginning of groupWatcher.startWatchersForRole() function for details.

Track the number of in-flight calls to apiWatcher.mustStart() and prevent from stopping the associated groupWatcher
if the number of in-flight calls is non-zero.

P.S. postponing the discovery of `pod` and/or `service` objects associated with `endpoints` or `endpointslice` roles
isn't the best solution, since it slows down initial discovery of `endpoints` and `endpointslice` targets.

* typo fix

---------

Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
2024-01-22 01:35:43 +02:00
Nikolay
c79ef7fbde
app/vmselect: abort streaming connections for vmselect (#5650)
* app/vmselect: abort streaming connections for vmselect
due to streaming nature of export APIs, curl and simmilr tools cannot
detect errors that happened after http.Header with status 200 was
written to it.

This PR tracks if body write was already started and closes connection.

It allows client to detect not expected chunk sequence and return error
to the caller.

Mostly it affects vmselect at cluster version

https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5645

* wip

Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5645
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/5650

---------

Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
2024-01-22 00:56:36 +02:00
Aliaksandr Valialkin
abd3641328
lib/storage: do not prefetch metric names for small number of metricIDs
This eliminates prefetchedMetricIDsLock lock contention for queries, which return less than 500 time series.

This is a follow-up for 9d886a2eb0
2024-01-17 13:52:51 +02:00
Aliaksandr Valialkin
25564e66e6
lib/httputils: handle step=undefined query arg as an empty value
This is needed for Grafana, which may send step=undefined
when working with alerting rules and instant queries.
2024-01-17 00:13:25 +02:00
Aliaksandr Valialkin
fc42900eca
lib/pushmetrics: wait until the background goroutines, which push metrics, are stopped at pushmetrics.Stop()
Previously the was a race condition when the background goroutine still could try collecting metrics
from already stopped resources after returning from pushmetrics.Stop().
Now the pushmetrics.Stop() waits until the background goroutine is stopped before returning.

This is a follow-up for https://github.com/VictoriaMetrics/VictoriaMetrics/pull/5549
and the commit fe2d9f6646 .

Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5548
2024-01-16 23:18:29 +02:00
hagen1778
07fe834f54
app/all: follow-up after 84d710beab
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5548
Signed-off-by: hagen1778 <roman@victoriametrics.com>
2024-01-16 23:08:21 +02:00
zhdd99
66bafee651
lib/pushmetrics: fix a panic caused by pushing metrics during the graceful shutdown process of vmstorage nodes. (#5549)
Co-authored-by: zhangdongdong <zhangdongdong@kuaishou.com>
Co-authored-by: Roman Khavronenko <roman@victoriametrics.com>
Signed-off-by: hagen1778 <roman@victoriametrics.com>
2024-01-16 23:06:51 +02:00
Aliaksandr Valialkin
30ae1b7901
lib/storage: follow-up for 4b8088e377
- Clarify the bugfix description at docs/CHANGELOG.md
- Simplify the code by accessing prefetchedMetricIDs struct under the lock
  instead of using lockless access to immutable struct.
  This shouldn't worsen code scalability too much on busy systems with many CPU cores,
  since the code executed under the lock is quite small and fast.
  This allows removing cloning of prefetchedMetricIDs struct every time
  new metric names are pre-fetched. This should reduce load on Go GC,
  since the cloning of uin64set.Set struct allocates many new objects.
2024-01-16 22:41:55 +02:00
Roman Khavronenko
3240fed7d7
lib/storage: properly check for storage/prefetchedMetricIDs cache expiration deadline (#5607)
Before, this cache was limited only by size.
Cache invalidation by time happens with jitter to prevent thundering herd problem.

Signed-off-by: hagen1778 <roman@victoriametrics.com>
2024-01-16 22:39:20 +02:00
Aliaksandr Valialkin
b3b2909925
app/vmselect/promql: follow-up for ce4f26db02
- Document the bugfix at docs/CHANGELOG.md
- Filter out NaN values before sorting as suggested at https://github.com/VictoriaMetrics/VictoriaMetrics/pull/5509#discussion_r1447369218
- Revert unrelated changes in lib/filestream and lib/fs
- Use simpler test at app/vmselect/promql/exec_test.go

Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/5509
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5506
2024-01-16 22:19:04 +02:00
Zongyang
d371848c2f
FIX bottomk doesn't return any data when there are no time range overlap between timeseries (#5509)
* FIX sort order in bottomk

* Add lessWithNaNsReversed for bottomk

* Add ut for TopK

* Move lt from loop

* FIX lint

* FIX lint

* FIX lint

* Mod log format

---------

Co-authored-by: xiaozongyang <xiaozngyang@kanyun.com>
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
2024-01-16 22:18:17 +02:00
Nikolay
08ba65a728
lib/awsapi: properly assume role with webIdentity token (#5495)
* lib/awsapi: properly assume role with webIdentity token
introduce new irsaRoleArn param for config. It's only needed for authorization with webIdentity token.
First credentials obtained with irsa role and the next sts assume call for an actual roleArn made with those credentials.
Common use case for it - cross AWS accounts authorization
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3822

* wip

---------

Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
2023-12-20 19:12:44 +02:00
Aliaksandr Valialkin
e0902e4928
lib/protoparser/opentelemetry: allow ingesting metrics without resource labels
Some clients may ingest samples via OpenTelemetry protocol without Resource labels.
Previously VictoriaMetrics was silently dropping such samples.

The commit 317834f876 added vm_protoparser_rows_dropped_total{type="opentelemetry",reason="resource_not_set"}
counter for tracking of such dropped samples. See https://github.com/VictoriaMetrics/VictoriaMetrics/pull/5459

It is better from usability PoV to accept such samples instead of dropping them and incrementing the corresponding counter.
2023-12-17 19:24:08 +02:00
Zakhar Bessarab
11998ecde9
lib/protoparser/opentelemetry: add metric to track skipped rows without resource (#5459)
Currently, it is impossible to understand why metrics are not ingested when resource is not set by OTEL exporter. Adding metric should simplify debugging and make it improve debuggability.

Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
Co-authored-by: Roman Khavronenko <roman@victoriametrics.com>
2023-12-17 19:22:26 +02:00
Aliaksandr Valialkin
bbe78235d5
vendor: run make vendor-update 2023-12-10 14:34:35 +02:00
Aliaksandr Valialkin
53bd68440c
Makefile: update golangci-lint version from v1.54.2 to v1.55.1
See https://github.com/golangci/golangci-lint/releases/tag/v1.55.1
2023-12-08 18:28:36 +02:00
noodles2hg
591f626a33
lib/streamaggr/streamaggr.go: fix link in error message (#5439) 2023-12-08 18:14:43 +02:00
Aliaksandr Valialkin
e27d69caeb
lib/backup: consistently use path.Join() when constructing paths for s3, gs and azblob
E.g. replace `fs.Dir + filePath` with `path.Join(fs.Dir, filePath)`

The fs.Dir is guaranteed to end with slash - see Init() functions.
The filePath may start with slash. If it starts with slash, then `fs.Dir + filePath` constructs
an incorrect path with double slashes.
path.Join() properly substitutes duplicate slashes with a single slash in this case.

While at it, also substitute incorrect usage of filepath.Join() with path.Join()
for constructing paths to object storage systems, which expect forward slashes in paths.
filepath.Join() substittues forward slashes with backslashes on Windows, so this may break
creating or managing backups from Windows.

This is a follow-up for 0399367be602b577baf6a872ca81bf0f99ba401b
Updates https://github.com/VictoriaMetrics/VictoriaMetrics-enterprise/pull/719
2023-12-04 10:44:52 +02:00
Nikolay
e2f566878f
lib/streamaggr: properly reference slice with labels (#5406)
* lib/streamaggr: properly reference slice with labels
by limiting slice capacity. It must fix issues with slice modification, in case of append new slice will be allocated, instead of modifying refrenced slice
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5402

* Reduce memory allocations when output_relabel_configs adds new labels to output samples

---------

Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
2023-11-29 12:30:00 +02:00
Hui Wang
e2f74f3ba0
lib/protoparser/promremotewrite: fall back to zstd decoding if Snappy-decoding fails (#5344)
This case is possible after the following steps:
1. vmagent successfully performed handshake with the -remoteWrite.url and the remote storage supports zstd-compressed data.
2. remote storage became unavailable or slow to ingest data, vmagent compressed the collected data into blocks with zstd and puts these blocks to persistent queue on disk.
3. vmagent restarts and the remote storage is unavailable during the handshake, then vmagent falls back to Snappy compression.
4. vmagent starts sending zstd-compressed data from persistent queue to the remote storage, while falsely advertizing it sends Snappy-compressed data.
5. The remote storage receives zstd-compressed data and fails unpacking it with Snappy.

The solution is the same as 12cd32fd75, just fall back to zstd decompression if Snappy decompression fails.
2023-11-23 20:46:56 +02:00
Aliaksandr Valialkin
bdaaea6820
lib/snapshot: make golagci-lint happy 2023-11-15 16:06:00 +01:00
Aliaksandr Valialkin
87eeac43cc
lib/querytracer: add missing blank comment line after 3121d76bee 2023-11-15 16:03:46 +01:00
Nikolay
a9ad7f45d4
lib/querytracer: makes package concurrent safe to use (#5322)
* lib/querytracer: makes package concurrent safe to use
it must fix various issues with concurrent code usage.
Especially, when it's not reasonable to wait for all goroutines to be finished

* wip

---------

Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
2023-11-14 21:00:01 +01:00
Aliaksandr Valialkin
86fd89e524
lib/protoparser/promremotewrite: fall back to Snappy decoding if zstd decoding fails
This case is possible after the following steps:

1. vmagent tries to perform handshake with the -remoteWrite.url in order to determine whether
   the remote storage supports zstd-compressed data.
2. The remote storage is unavailable during the handshake. In this case vmagent falls back to Snappy compression
   for the data sent to the remote storage.
3. vmagent compresses the collected data into blocks with Snappy and puts these blocks to persistent queue on disk.
4. The remote storage becomes available.
5. vmagent restarts, performs the handshake with the remote storage and detects that it supports zstd-compressed data.
6. vmagent starts sending Snappy-compressed data from persistent queue to the remote storage,
   while falsely advertizing it sends zstd-compressed data.
7. The remote storage receives Snappy-compressed data and fails unpacking it with zstd.

The solution is to just fall back to Snappy decompression if zstd decompression fails.

Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5301
2023-11-13 21:26:55 +01:00
Aliaksandr Valialkin
25cf57910d
lib/regexutil: properly handle alternate regexps surrounded by .+ or .*
Previously the following regexps were improperly handled:

  .+foo|bar.+
  .*foo|bar.*

This could lead to unexpected regexp match results.

See https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5297

Thanks to @Haleygo for the initial attempt to fix the issue at https://github.com/VictoriaMetrics/VictoriaMetrics/pull/5308
2023-11-13 18:27:15 +01:00
Aliaksandr Valialkin
5e826af497
lib/persistentqueue: properly re-create flock.lock file inside directory if persistent queue is broken.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5249

Thanks to @Sniper91 for the bugreport and initial fix at https://github.com/VictoriaMetrics/VictoriaMetrics/pull/5233
2023-10-31 18:36:49 +01:00
Aliaksandr Valialkin
13f40cac61
lib/storage: follow-up for 29cebd82fb
Use atomic.CompareAndSwapUint32() instead of atomic.LoadUint32() followed by atomic.StoreUint32().
This makes the code more clear.

Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5159
2023-10-31 16:28:00 +01:00
Roman Khavronenko
1c83e2c28c
lib/storage: log warning about RO mode only on state change (#5191)
Before, vmstorage would log the same message each second producing excessive
amount of logs.

See https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5159

Signed-off-by: hagen1778 <roman@victoriametrics.com>
2023-10-31 16:27:31 +01:00
Aliaksandr Valialkin
67284377f3
lib/promscrape/discovery/kubernetes: stop all the url watchers, which belong to a particular groupWatcher, at once
Previously url watchers for pod, service and node objects could be mistakenly closed
when service discovery was set up only for endpoints and endpointslice roles,
since watchers for these roles may start start pod, service and node url watchers
with nil apiWatcher passed to groupWatcher.startWatchersForRole().

Now all the url watchers, which belong to a particular groupWatcher, are stopped at once
when this groupWatcher has no apiWatcher subscribers.

Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5216

The issue has been introduced in v1.93.5 when addressing https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4850
2023-10-27 14:37:31 +02:00
hagen1778
3362774ab2
lib/storage: follow-up after 188cfe3a85
188cfe3a85

See https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5159

Signed-off-by: hagen1778 <roman@victoriametrics.com>
2023-10-27 14:26:36 +02:00
Ilya Trefilov
c25fe07cd1
lib/storage: do not create tsid if metric contains stale marker(#5069) (#5174)
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5069
2023-10-27 14:24:42 +02:00
Aliaksandr Valialkin
8be5ebe809
lib/promauth: properly parse string contents for ca, cert and key fields at tls_config
Previously yaml parser wasn't accepting string values for these fields,
because it was mistakenly expecting a list of uint8 values instead.
2023-10-27 14:22:55 +02:00
Hui Wang
27a2e119cf
do not print redundant error logs when failed to scrape consul or no… (#5239)
* do not print redundant error logs when failed to scrape consul or nomad target
prometheus performs the same because it uses consul lib which just drops the error(1806bcb38c/api/api.go (L1134))
2023-10-27 14:20:11 +02:00
Alexander Marshalov
a1063a5ef3
lib/streamaggr: respect streamAgg.dropInput with empty stream aggr config (#5213)
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5207
2023-10-26 09:31:43 +02:00
Zakhar Bessarab
be88c2c85b
lib/promscrape/discovery/kubernetes: supress context.Cancelled error in logs (#5048)
lib/promscrape/discovery/kubernetes: supress context.Cancelled error in logs

It is possible that context.Cancelled will appear after k8s watcher was closed due to reload(see https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4850).

Logging an error misinforms user and looks like vmagent discovery will stop working even though this does not affect discovery.

Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
2023-09-22 13:14:59 +02:00
Zakhar Bessarab
ecd523662f
lib/backup: fix issue with inconsistent copying of appliedRetention.txt (#5027)
* lib/backup: fix issue with inconsistent copying of appliedRetention.txt

appliedRetention.txt can be modified in place, so it should be always copied just the same as parts.json

Updates: https://github.com/victoriaMetrics/victoriaMetrics/issues/5005
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>

* docs: add changelog entry for appliedRetention.txt copying fix

Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>

---------

Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
2023-09-21 11:27:38 +02:00
Aliaksandr Valialkin
1dd4c72e27
lib/promscrape/discovery/kubernetes: follow-up after 03fece44e0
- Properly update vm_promscrape_discovery_kubernetes_url_watchers
  and vm_promscrape_discovery_kubernetes_group_watchers metrics after config changes

- Properly stop goroutine responsible for recreating scrapeWorks after the corresponding urlWatcher is stopped

- Log the event when urlWatcher is stopped in order to simplify debugging

Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4850
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/4861
2023-09-19 00:44:38 +02:00
Aliaksandr Valialkin
c9f1a573e6
lib/promscrape/discovery/kubernetes: wait for 10 seconds before checking whether the urlWatcher must be stopped
This should prevent from excess urlWatcher churn on config reload, since it leads to removal of all the apiWatchers
before creating new apiWatchers. So, every config reload would lead to stopping of all the previous urlWatchers
and starting new urlWatchers.

The new logic gives 10 seconds for config reload before stopping unused urlWatchers.

Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4850
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/4861
2023-09-19 00:43:43 +02:00
Aliaksandr Valialkin
88e385e273
lib/promscrape/discovery/kubernetes: follow-up after eeb862f3ff
- Move the bugfix description to the correct place in docs/CHANGELOG.md
- Prevent from logging of 'context canceled' errors after the url watcher is stopped,
  since these errors are expected and may confuse users.
- Remove unused urlWatcher.refCount field.
- Remove unused urlWatcher.close() method.

Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4850
2023-09-19 00:42:24 +02:00
Aliaksandr Valialkin
63384291eb
lib/backup: properly copy parts.json files inside indexdb directory additional to data directory
This is a follow-up for 264ffe3fa1

Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5005
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/5006
2023-09-19 00:39:47 +02:00
Aliaksandr Valialkin
f36d20de08
lib/backup/common: consistently use canonical path with / directory separators at Part.Path
Previously Part.Path could contain `\` directory separators on Windows OS,
which could result in incorrect filepaths generation when making backups at object storage.

Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4704

This is a follow-up for f2df8ad480
2023-09-19 00:37:21 +02:00
Zakhar Bessarab
9b4b77975b
lib/promscrape/discovery/kubernetes: fix leaking api watcher (#4861)
* lib/promscrape/discovery/kubernetes: fix leaking api watcher

goroutine which was polling k8s API had no execution control. This leaded to leaking goroutines during config reload.

See: https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4850
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>

* lib/promscrape/discovery/kubernetes: use reference counting for urlWatcher cleanup

Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>

* lib/promscrape/discovery/kubernetes: remove waitgroup sync for goroutines polling API server

This is unnecessary since context will is cancelled and new requests will not be sent. Also, using waitgroup will increase time required to perform reload which might result in missed scrapes.

Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>

* lib/promscrape/discovery/kubernetes: clarify comment

Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>

* Apply suggestions from code review

* lib/promscrape/discovery/kubernetes: address review feedback

Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>

---------

Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
Co-authored-by: Nikolay <nik@victoriametrics.com>
2023-09-18 17:14:08 +02:00