Commit graph

2674 commits

Author SHA1 Message Date
hagen1778
8bb3f2fd43
lib/promscrape: make linter happy
Signed-off-by: hagen1778 <roman@victoriametrics.com>
2024-09-24 15:12:55 +02:00
hagen1778
c7569dac50
lib/promscrape: temporary disable TestClientProxyReadOk
This test is very flaky and prevents other tests from running in CI.
Disabling this test should improve tests quality, since it isn't reliable anyway.

There is a ticket to fix this test - https://github.com/VictoriaMetrics/VictoriaMetrics/issues/7062

Once fixed, this test should be uncommented.

Signed-off-by: hagen1778 <roman@victoriametrics.com>
2024-09-24 14:59:25 +02:00
Dmytro Kozlov
cbeb7d50e8
lib/promscrape: show only unhealthy targets if show_only_unhealthy filter is enabled (#6960)
### Describe Your Changes

It is better to show only unhealthy targets instead of all of them when
`show_only_unhealthy` filter is enabled.
Related issue:
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3536

### Checklist

The following checks are **mandatory**:

- [x] My change adheres [VictoriaMetrics contributing
guidelines](https://docs.victoriametrics.com/contributing/).

---------

Co-authored-by: Roman Khavronenko <roman@victoriametrics.com>
2024-09-24 12:18:24 +02:00
Aliaksandr Valialkin
109772bdc4
lib/cgroup: round GOMAXPROCS to the lower integer value of cpuQuota
Rounding GOMAXPROCS to the upper interger value of cpuQuota increases chances of CPU starvation,
non-optimimal goroutine scheduling and additional CPU overhead related to context switching.

So it is better to round GOMAXPROCS to the lower integer value of cpuQuota.
2024-09-23 16:09:12 +02:00
Artem Fetishev
55febc0920
lib/storage: restore ability to put empty metric ID list into tagFiltersToMetricIDsCache (#7064)
### Describe Your Changes

Currently it the metricID list is empty it won't be mashalled and as the
result won't be put into the tagFiltersToMetricIDsCache which causes the
cache misses for the corresponding tagFilters. In some setups this
causes severe search speed detradation (see #7009).

The empty metric IDs was covered before but then was accidentally
removed in 6c21439.

This PR restores the coverage of this case.

A new unit test can be used as a proof that empty metricID lists are not
added to the cache (just remove the fix in index_db.go and run the test
to see the result)

Also a benchmark has been added to see the implications of the
compression.

```
user@laptop:~/p/github.com/rtm0/VictoriaMetrics/01/src$ go test ./lib/storage/ -run=NONE -bench BenchmarkMarshalUnmarshalMetricIDs --loggerLevel=ERROR
goos: linux
goarch: amd64
pkg: github.com/VictoriaMetrics/VictoriaMetrics/lib/storage
cpu: 13th Gen Intel(R) Core(TM) i7-1355U
BenchmarkMarshalUnmarshalMetricIDs/numMetricIDs-0-12             3237240               363.5 ns/op               0 compression-rate
BenchmarkMarshalUnmarshalMetricIDs/numMetricIDs-1-12             2831049               451.8 ns/op               0.4706 compression-rate
BenchmarkMarshalUnmarshalMetricIDs/numMetricIDs-10-12            1152764              1009 ns/op                 1.667 compression-rate
BenchmarkMarshalUnmarshalMetricIDs/numMetricIDs-100-12            297055              3998 ns/op                 5.755 compression-rate
BenchmarkMarshalUnmarshalMetricIDs/numMetricIDs-1000-12            31172             34566 ns/op                 8.484 compression-rate
BenchmarkMarshalUnmarshalMetricIDs/numMetricIDs-10000-12            4900            289659 ns/op                 9.416 compression-rate
BenchmarkMarshalUnmarshalMetricIDs/numMetricIDs-100000-12            447           2341173 ns/op                 9.456 compression-rate
BenchmarkMarshalUnmarshalMetricIDs/numMetricIDs-1000000-12            42          24926928 ns/op                 9.468 compression-rate
BenchmarkMarshalUnmarshalMetricIDs/numMetricIDs-10000000-12            5         204098872 ns/op                 9.467 compression-rate
PASS
ok      github.com/VictoriaMetrics/VictoriaMetrics/lib/storage  15.018s
```

### Checklist

The following checks are **mandatory**:

- [x] My change adheres [VictoriaMetrics contributing
guidelines](https://docs.victoriametrics.com/contributing/).

---------

Signed-off-by: Artem Fetishev <wwctrsrx@gmail.com>
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
2024-09-20 17:21:53 +02:00
Aliaksandr Valialkin
787b9cd9a0
lib/storage: improve performance for indexSearch.containsTimeRange()
The indexSearch.containsTimeRange() function is called for the current indexDB and the previous indexDB
every time when searching for metricIDs by label filters. This function consumes a lot of additional CPU time
for cases when queries with lightweight label filters are sent to VictoriaMetrics at high rate (e.g. thousands of RPS),
like in the issue https://github.com/VictoriaMetrics/VictoriaMetrics/issues/7009 .

Optimize indexSearch.containsTimeRange() function in the following ways:

- Unconditionally return true if this function is called for the current indexDB, since there are very high
  chances that the current indexDB contains the data with timestamps in the requested time range.

- Cache the minimum timestamp, which is missing in the indexed data for the previous indexDB.
  This is safe to do, since the previous indexDB is readonly.
  This optimization eliminates potentially slow lookup in the previous indexDB for typical
  use cases when the requested time range is close to the current time.
2024-09-20 13:07:20 +02:00
Aliaksandr Valialkin
6f61e9d49d
lib/storage: simplify indexDB.doExtDB() usage by removing the returned value
Previously indexDB.doExtDB() was returning boolean value, which was indicating whether f callback was called.
There is no need in returning this boolean value, since the f callback can determine on itself whether it was called.

This simplifies the code a bit.

While at it, document indexDB.doExtDB().
2024-09-20 11:59:57 +02:00
Roman Khavronenko
218c533874
lib/storage: follow-up after d8f8822fa5 (#7036)
Make function name and comments more clear.

d8f8822fa5

Signed-off-by: hagen1778 <roman@victoriametrics.com>
Co-authored-by: Nikolay <nik@victoriametrics.com>
2024-09-20 11:50:47 +02:00
Aliaksandr Valialkin
a3d8077959
lib/logstorage: make sure that getCommonTokens returns common tokens in the original order of tokens inside tokenSets arg
This fixes flaky test TestGetCommonTokensForOrFilters:

    filter_or_test.go:143: unexpected tokens for field "_msg"; got ["foo" "bar"]; want ["bar" "foo"]
2024-09-19 15:59:48 +02:00
Roman Khavronenko
e115b85770
lib/logger: increase default value of -loggerMaxArgLen cmd-line fla… (#7008)
…g from 1e3 to 5e3

This should improve visibility on errors produced by very long queries.

The change is classified as BUG in order to port it to LTS releases.

### Describe Your Changes

Please provide a brief description of the changes you made. Be as
specific as possible to help others understand the purpose and impact of
your modifications.

### Checklist

The following checks are **mandatory**:

- [ ] My change adheres [VictoriaMetrics contributing
guidelines](https://docs.victoriametrics.com/contributing/).

---------

Signed-off-by: hagen1778 <roman@victoriametrics.com>
Co-authored-by: Mathias Palmersheim <mathias@victoriametrics.com>
2024-09-19 14:29:18 +02:00
Nikolay
d8f8822fa5
lib/storage: consistently check for missing metricID index records (#6967)
* Previously, only metricID->metricName missing index records were
tracked with deadline But it was possible a case for missing
metricID->TSID index records. IndexDB metrics fix exposed misleading
metric for such missing records.

* This commit adds check for metricID->TSID missing index records. And
delete missing metricID entry if it hit 60 second deadline.

Related issue
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/6931

Signed-off-by: f41gh7 <nik@victoriametrics.com>
2024-09-16 10:05:08 +02:00
Nikolay
264c2ec6bd
lib/fs: properly call windows APIs (#6998)
Previously we manually imported system windows DDLs
and made direct syscall.

 But golang exposes syscall wrappers with sys/windows package.
It seems, that direct syscall was broken at 1.23 golang release. It was
`GetDiskFreeSpace` syscall in our case.

This commit replaces all manual syscalls with wrappers

Related issue:
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/6973

Related golang issue:
https://github.com/golang/go/issues/69029

Signed-off-by: f41gh7 <nik@victoriametrics.com>
2024-09-13 12:22:25 +02:00
Aliaksandr Valialkin
657988ac3a
app/vlselect: consistently reuse the original query timestamp when executing /select/logsql/query with positive limit=N query arg
Previously the query could return incorrect results, since the query timestamp was updated with every Query.Clone() call
during iterative search for the time range with up to limit=N rows.

While at it, optimize queries, which find low number of matching logs, while spend a lot of CPU time for searching
across big number of logs. The optimization reduces the upper bound of the time range to search if the current time range
contains zero matching rows.

Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/6785
2024-09-08 14:32:23 +02:00
Aliaksandr Valialkin
45a3713bdb
lib/logstorage: preserve the order of tokens to check against bloom filters in AND filters
Previously tokens from AND filters were extracted in random order. This could slow down
checking them agains bloom filters if the most specific tokens go at the beginning of the AND filters.
Preserve the original order of tokens when matching them against bloom filters,
so the user could control the performance of the query by putting the most specific AND filters
at the beginning of the query.

While at it, add tests for getCommonTokensForAndFilters() and getCommonTokensForOrFilters().

Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/6554
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/6556
2024-09-08 12:27:30 +02:00
Aliaksandr Valialkin
eaee2d7db4
lib/logstorage: improve error logging for incorrect queries passed to /select/logsql/stats_query and /select/logsql/stats_query_range functions 2024-09-08 11:24:44 +02:00
Aliaksandr Valialkin
1cd06ace5a
lib/logstorage: properly extract common tokens from unsupported OR filters
Previously the following query could miss rows matching !bar if these rows do not contain foo:

   foo OR !bar

This is because of incorrect detection of common tokens for OR filters - all the unsupported filters
were skipped (including the NOT filter (aka `!`)), while in this case zero common tokens must be returned.

While at it, move repetiteve code in TestFilterAnd and TestFilterOr into f function.

Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/6554
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/6556
2024-09-08 11:14:55 +02:00
Aliaksandr Valialkin
0a40064a6f
app/vlselect: add /select/logsql/stats_query_range endpoint for building time series panels in VictoriaLogs plugin for Grafana
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/6943
Updates https://github.com/VictoriaMetrics/victorialogs-datasource/issues/61
2024-09-07 00:41:47 +02:00
Aliaksandr Valialkin
c9bb4ddeed
app/vlselect: add /select/logsql/stats_query endpoint, which is going to be used by vmalert
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/6942
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/6706
2024-09-06 23:06:43 +02:00
Aliaksandr Valialkin
00e7d5add3
lib/logstorage: substitute | operator with or operator at math pipe
This is needed for avoiding confusion between the `|` operator at `math` pipe and `|` pipe delimiter.
For example, the following query was parsed unexpectedly:

   * | math foo / bar | fields x

as

   * | math foo / (bar | fields) as x

Substituting `|` with `or` inside `math` pipe fixes this ambiguity.
2024-09-06 22:44:14 +02:00
Artem Fetishev
a5424e95b3
lib/storage: adds metrics that count records that failed to insert
### Describe Your Changes

Add storage metrics that count records that failed to insert:

- `RowsReceivedTotal`: the number of records that have been received by
the storage from the clients
- `RowsAddedTotal`: the number of records that have actually been
persisted. This value must be equal to `RowsReceivedTotal` if all the
records have been valid ones. But it will be smaller otherwise. The
values of the metrics below should provide the insight of why some
records hasn't been added
-   `NaNValueRows`: the number of records whose value was `NaN`
- `StaleNaNValueRows`: the number of records whose value was `Stale NaN`
- `InvalidRawMetricNames`: the number of records whose raw metric name
has failed to unmarshal.

The following metrics existed before this PR and are listed here for
completeness:

- `TooSmallTimestampRows`: the number of records whose timestamp is
negative or is older than retention period
- `TooBigTimestampRows`: the number of records whose timestamp is too
far in the future.
- `HourlySeriesLimitRowsDropped`: the number of records that have not
been added because the hourly series limit has been exceeded.
- `DailySeriesLimitRowsDropped`: the number of records that have not
been added because the daily series limit has been exceeded.

---
Signed-off-by: Artem Fetishev <wwctrsrx@gmail.com>
2024-09-06 17:57:21 +02:00
Aliaksandr Valialkin
0205170409
lib/logstorage: consistently use nsecsPerDay constant and remove nsecPerDay constant 2024-09-06 16:17:04 +02:00
Aliaksandr Valialkin
258ccfb953
lib/logstorage: pre-calculate hashes from tokens used in bloom filter search
Previously per-token hashes for per-block bloom filters were re-calculated on every scanned block.
This could be slow when the number of tokens is big or when the number of blocks to scan is big.
Pre-calculate hashes for bloom filters and then use them for searching in bloom filters.
This improves performance by 2.5x for in(...) filters with many values to search inside `in()`.
2024-09-05 19:44:17 +02:00
Zhu Jiekun
c193e6d43e
lib/discovery/azure: fix host check in next link in Azure SD (#6915)
Previous bugfix at 49f63b2 only partially fixed pagination host validation error.

 Before this fix it was:
```
unexpected nextLink host \"management.azure.com\", expecting \"https://management.azure.com\"
```

Now we only check the `Host` without schema. 

However, when Azure respond `nextLink` in `Host:Port` format, the
`nextLink` check will fail:
```
unexpected nextLink host \"management.azure.com:443\", expecting \"management.azure.com\"
```

This pull request further relaxes the checks by only checking the
`Hostname`.

---

 related issue: https://github.com/VictoriaMetrics/VictoriaMetrics/issues/6912
2024-09-05 16:48:09 +02:00
Artem Fetishev
39294b4919
lib/storage: do not drop stale NaN samples (#6936)
This patch reverts 1fd3385

After discussing it we've come to conclusion that this is a valid
behavior which can be avoided by deleting the time series only once the
corresponding stale NaNs have been received.

On the other hand, the fix leads to lost stale NaNs in some rare but
valid use cases. For example:

- In a cluster configuration the samples for a given time series are
normally sent to the same vmstorage replica. However, wminsert may
reroute the samples to another replica because the original one is down
or is overloaded. In this case the stale NaN may end up on a replica
that has no data for that time series, but we still want to record that
sample.

Thus, reverting that fix.

---

related issue https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5069

Signed-off-by: Artem Fetishev <wwctrsrx@gmail.com>
2024-09-05 16:45:09 +02:00
Hui Wang
b48f5f3e59
lib/storage: fix metric vm_object_references{type="indexdb"} (#6937)
follow up
4ecc370acb

### Describe Your Changes

Please provide a brief description of the changes you made. Be as
specific as possible to help others understand the purpose and impact of
your modifications.

### Checklist

The following checks are **mandatory**:

- [ ] My change adheres [VictoriaMetrics contributing
guidelines](https://docs.victoriametrics.com/contributing/).
2024-09-05 16:42:49 +02:00
Aliaksandr Valialkin
49e57ea80e
lib/logstorage: delete unused function - bloomfilter.containsAny 2024-09-05 16:21:06 +02:00
Aliaksandr Valialkin
2dd845fa53
lib/logstorage: properly fix incorrect extraction of common tokens for OR filters at distinct log fields
Previously (f1:foo OR f2:bar) was incorrectly returning `foo` token for `f1` and `bar` token for `f2`.
These tokens were used for checking against bloom filter for every data block, so the data block,
which didn't contain simultaneously `foo` token for `f1` field and `bar` token for `f2` field, was skipped.
This was incorrect, since such a block may contain logs matching the original OR filter.

The fix is to return common tokens from `OR`-delimted filters only if these tokens exist at EVERY such filter
for the given field name. If some `OR`-delimited filter misses the given field name, then `OR`-delimited filters
do not contain common tokens, which could be used for checking against bloom filter.

While at it, add more tests covering various edge cases for filters delimited by AND and OR.

Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/6554
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/6556
2024-09-05 14:29:50 +02:00
f41gh7
7b0aaf1ea2
follow-up after 01430a155c
* properly check SeverityNumber at FormatSeverity function
 it could be negative, which could cause panic for victorialogs
2024-09-04 15:36:34 +02:00
Andrii Chubatiuk
01430a155c
vlinsert: added opentelemetry logs support
Commit adds the following changes:

* Adds support of OpenTelemetry logs for Victoria Logs with protobuf encoded messages

*  json encoding is not supported for the following reasons:
   - It brings a lot of fragile code, which works inefficiently.
   - json encoding is impossible to use with language SDK.

* splits metrics and logs structures at lib/protoparser/opentelemetry/pb package.

* adds docs with examples for opentelemetry logs.

---
Related issue: https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4839

Co-authored-by: AndrewChubatiuk <andrew.chubatiuk@gmail.com>
Co-authored-by: f41gh7 <nik@victoriametrics.com>
2024-09-03 20:12:05 +02:00
rtm0
4df243d530
lib/storage: improve the message of the tooManyTimeseries error (#6893)
### Describe Your Changes

This is a follow-up for #6836. Per @valyala's
[comment](https://github.com/VictoriaMetrics/VictoriaMetrics/pull/6836#discussion_r1730291704),
the error message does not reflect which flag needs to be adjusted.

### Checklist

The following checks are **mandatory**:

- [x ] My change adheres [VictoriaMetrics contributing
guidelines](https://docs.victoriametrics.com/contributing/).

---------

Signed-off-by: Artem Fetishev <wwctrsrx@gmail.com>
2024-09-03 10:28:03 +02:00
jackyin
975ed27a76
lib/logstorage: and filter results in unexpected response (#6556)
fix #6554
andfilter shouldn't return orfilter field which result in bloomfilter
return false.

---------

Co-authored-by: hagen1778 <roman@victoriametrics.com>
2024-09-03 10:17:44 +02:00
rtm0
2c856c6951
tests: check Metrics.RowsAddedTotal in unit tests (#6895)
### Describe Your Changes

This is a follow-up PR: Unit tests introduced in #6872 can now use
RowsAddedTotal counter whose scope was fixed in #6841.

### Checklist

The following checks are **mandatory**:

- [x] My change adheres [VictoriaMetrics contributing
guidelines](https://docs.victoriametrics.com/contributing/).

Signed-off-by: Artem Fetishev <wwctrsrx@gmail.com>
Co-authored-by: Nikolay <nik@victoriametrics.com>
2024-08-30 14:31:15 +02:00
Roman Khavronenko
f586082520
attempt to fix flaky TestClientProxyReadOk (#6899)
Signed-off-by: hagen1778 <roman@victoriametrics.com>
2024-08-30 13:23:32 +02:00
dufucun
95bafc8caf
tests: fix slice init length (#6897)
### Describe Your Changes

fix slice init length

### Checklist

The following checks are **mandatory**:

- [ ] My change adheres [VictoriaMetrics contributing
guidelines](https://docs.victoriametrics.com/contributing/).

Signed-off-by: dufucun <dufuchun@sohu.com>
2024-08-30 10:55:25 +02:00
rtm0
334cd92a6c
testing: allow disabling fsync to make tests run faster (#6871)
### Describe Your Changes

fsync() ensures that the data is written to disk. In production this is
needed for data durability. However, during the development, when the
unit tests are run, this level of durability is not needed. Therefore
fsync() can be disabled which will makes test runs two times faster.

The disabling is done by setting the `DISABLE_FSYNC_FOR_TESTING`
environment variable. The valid values for this variable are the same as
the values of the arg of `go doc strconv.ParseBool`:

```
1, t, T, TRUE, true, True, 0, f, F, FALSE, false, False.
```

Any other value means `false`.

The variable is set for all test build targets. Compare running times:

Build Target | DISABLE_FSYNC_FOR_TESTING=0 | DISABLE_FSYNC_FOR_TESTING=1
----------------- | ------------------------------------------------ |
-------------------------------------------------
make test | 1m5s  | 0m22s
make test-race | 3m1s | 1m42s
make test-pure | 1m7s | 0m20s
make test-full | 1m21s | 0m32s
make test-full-386 | 1m42s | 0m36s

When running tests for a given package, fsync can be disabled as
follows:

```shell
DISABLE_FSYNC_FOR_TESTING=1 go test ./lib/storage
```

Disabling fsync() is intended for testing purposes only and the name of
the variables reflects that.

What could also have been done but haven't:

- lib/filestream/filestream.go: `Writer.MustFlush()` also uses f.Sync()
but nothing has been done to it, because the Writer.MustFlush() is not
used anywhere in the VM codebase. A side question: what is the general
policy for the unused code?
- lib/filestream/filestream.go: Writer.Write() calls `adviceDontNeed()`
which calls unix.Fdatasync(). Disabling it could potentially improve
running time, but running tests with this code disabled has shown
otherwise.

### Checklist

The following checks are **mandatory**:

- [ x] My change adheres [VictoriaMetrics contributing
guidelines](https://docs.victoriametrics.com/contributing/).

---------

Signed-off-by: Artem Fetishev <wwctrsrx@gmail.com>
2024-08-30 10:54:46 +02:00
Nikolay
4ecc370acb
lib/storage: properly add previous indexDB metrics (#6890)
Previously, some extIndexDB metrics were not registered. It resulted
into missing metrics, if metric value was added to the extIndexDB. It's
a usual case for search requests at both indexes.

 Current commit updates all metrics from extIndexDB according to the
current IndexDB. It must fix such cases

Related issue:
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/6868

### Describe Your Changes

Please provide a brief description of the changes you made. Be as
specific as possible to help others understand the purpose and impact of
your modifications.

### Checklist

The following checks are **mandatory**:

- [ ] My change adheres [VictoriaMetrics contributing
guidelines](https://docs.victoriametrics.com/contributing/).
2024-08-28 11:14:28 +02:00
rtm0
9fcfba3927
lib/storage: properly handle maxMetrics limit at metricID search
`TL;DR` This PR improves the metric IDs search in IndexDB:

- Avoid seaching for metric IDs twice when `maxMetrics` limit is
exceeded
- Use correct error type for indicating that the `maxMetrics` limit is
exceded
- Simplify the logic of deciding between per-day and global index search

A unit test has been added to ensure that this refactoring does not
break anything.

---

Function calls before the fix:

```
idb.searchMetricIDs
    |__ is.searchMetricIDs
        |__ is.searchMetricIDsInternal
            |__ is.updateMetricIDsForTagFilters
                |__ is.tryUpdatingMetricIDsForDateRange
                |                       |
                |__ is.getMetricIDsForDateAndFilters
```

- `searchMetricIDsInternal` searches metric IDs for each filter set. It
maintains a metric ID set variable which is updated every time the
`updateMetricIDsForTagFilters` function is called. After each successful
call, the function checks the length of the updated metric ID set and if
it is greater than `maxMetrics`, the function returns `too many
timeseries` error.
- `updateMetricIDsForTagFilters` uses either per-day or global index to
search metric IDs for the given filter set. The decision of which index
to use is made is made within the `tryUpdatingMetricIDsForDateRange`
function and if it returns `fallback to global search` error then the
function uses global index by calling `getMetricIDsForDateAndFilters`
with zero date.
- `tryUpdatingMetricIDsForDateRange` first checks if the given time
range is larger than 40 days and if so returns `fallback to global
search` error. Otherwise it proceeds to searching for metric IDs within
that time range by calling `getMetricIDsForDateAndFilters` for each
date.
- `getMetricIDsForDateAndFilters` searches for metric IDs for the given
date and returns `fallback to global search` error if the number of
found metric IDs is greater than `maxMetrics`.

Problems with this solution:

1. The `fallback to global search` error returned by
`getMetricIDsForDateAndFilters` in case when maxMetrics is exceeded is
misleading.
2. If `tryUpdatingMetricIDsForDateRange` proceeds to date range search
and returns `fallback to global search` error (because
`getMetricIDsForDateAndFilters` returns it) then this will trigger
global search in `updateMetricIDsForTagFilters`. However the global
search uses the same maxMetrics value which means this search is
destined to fail too. I.e. the same search is performed twice and fails
twice.
3. `too many timeseries` error is already handled in
`searchMetricIDsInternal` and therefore handing this error in
`updateMetricIDsForTagFilters` is redundant
4. updateMetricIDsForTagFilters is a better place to make a decision on
whether to use per-day or global index.

Solution:

1.  Use a dedicated error for `too many timeseries` case
2. Handle `too many timeseries` error in  `searchMetricIDsInternal` only
3. Move the per-day or global search decision from
`tryUpdatingMetricIDsForDateRange` to `updateMetricIDsForTagFilters` and
remove `fallback to global search` error.


---------

Signed-off-by: Artem Fetishev <wwctrsrx@gmail.com>
Co-authored-by: Nikolay <nik@victoriametrics.com>
2024-08-27 21:39:03 +02:00
rtm0
eef6943084
lib/storage: properly register index records with RegisterMetricNames
Once the timeseries is in tsidCache, new entries won't be created in
per-day index because the RegisterMetricNames() code does consider
different dates for the same timeseries. So this case has been added.

The same bug exists for AddRows() but it is not manifested because the
index entries are finally created in updatePerDateData().

RegisterMetricNames also updated to increase the newTimeseriesCreated
counter because it actually creates new time series in index.

A unit tests has been added that check all possible data patterns
(different metric names and dates) and code branches in both
RegisterMetricNames and AddRows. The total number of new unit tests is
around 100 which increaded the running time of storage tests by 50%. 

---------

Signed-off-by: Artem Fetishev <wwctrsrx@gmail.com>
Co-authored-by: Roman Khavronenko <hagen1778@gmail.com>
2024-08-27 21:33:53 +02:00
rtm0
30f98916f9
Move rowsAddedTotal counter to Storage (#6841)
### Describe Your Changes

Reduced the scope of rowsAddedTotal variable from global to Storage.

This metric clearly belongs to a given Storage object as it counts the
number of records added by a given Storage instance.
Reducing the scope improves the incapsulation and allows to reset this
variable during the unit tests (i.e. every time a new Storage object is
created by a test, that object gets a new variable).



Signed-off-by: Artem Fetishev <wwctrsrx@gmail.com>
2024-08-27 21:30:37 +02:00
Zhu Jiekun
e97e966f82
lib/promrelabel: follow-up for 8958cecad6
In the previous commit 8958cecad6
the default ports (80/443) were removed for both the `scrapeURL` and
`instance` label values for those targets without a port in
`__address__`. Different values in the `instance` label generate new
time series.

This commit reverts the changes made to the `instance` label. Now,
for those targets:
- `scrapeURL` will remain unchanged.
- The `instance` label value will include the default port.

https://github.com/VictoriaMetrics/VictoriaMetrics/issues/6792
2024-08-27 13:04:26 +02:00
Nikolay
9feee15493
lib/promscrape: fixes proxy autorization (#6783)
* Adds custom dial func for HTTP-Connect and socks5 proxy tunnels.
  Standard golang http.transport exposes GetProxyConnectHeader function,
  but it doesn't allow to use separate tls config for proxy.
  It also not possible to enforce HTTP-Connect with standard http lib.
* For http scrape targets, by default http.Transport.Proxy function must
  be used. Since it has special case with full uri forward.
* Adds proxy.URL json methods that allow to properly copy internal
fields, like User/Password.
It should fix bug with proxy_url. When credentials specified at URL was
ignored.
* Adds tests for scrape client proxy requests

related issue https://github.com/VictoriaMetrics/VictoriaMetrics/issues/6771
2024-08-19 22:31:18 +02:00
Zhu Jiekun
723d834c1a
lib/promrelabel: stop adding default port 80/433 to address label
*  It was necessary to add default ports for fasthttp client. After migration to the std.httpclient it's no longer needed.
* An additional configuration is required at proxy servers with implicitly set 80/443 ports to the host header (such as HA proxy.

It's expected that after upgrade __address_ label may change. But it should be rare case. 80/443 ports are not widely used at monitoring ecosystem. And it shouldn't have much impact. 

Related issue https://github.com/VictoriaMetrics/VictoriaMetrics/issues/6792

Co-authored-by: Nikolay <nik@victoriametrics.com>
2024-08-19 22:28:49 +02:00
hagen1778
febba3971b
make go vet happy
Address `non-constant format string in call` check:
https://github.com/golang/go/issues/60529

Signed-off-by: hagen1778 <roman@victoriametrics.com>
2024-08-19 21:15:33 +02:00
Roman Khavronenko
e58dde6925
lib/httputils: parse URL before creating HTTP transport (#6820)
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/6740

---------

Signed-off-by: hagen1778 <roman@victoriametrics.com>
2024-08-16 11:32:04 +02:00
Hui Wang
62d19369a3
stream aggregation: do not allow to enable -stream.keepInput and `k… (#6723)
…eep_metric_names` options in stream aggregation config together

With aggregated data and raw data under the same metric, results would
be confusing.

---------

Signed-off-by: hagen1778 <roman@victoriametrics.com>
Co-authored-by: hagen1778 <roman@victoriametrics.com>
2024-08-13 08:54:35 -04:00
Zhu Jiekun
9e2bd82376
app/vmagent: fixes azure service discovery pagination
Azure API response with link to the next page was incorrectly validate. Validation used url.Host header to match configure API URL.


https://github.com/VictoriaMetrics/VictoriaMetrics/issues/6784
2024-08-09 15:22:47 +02:00
Zakhar Bessarab
cb00b4b00f
lib/backup/s3remote: add retryer configuration (#6747)
### Describe Your Changes

This helps to improve reliability of performing backups in environments
with unreliable connection and tolerate temporary errors at S3 provider
side.

See: https://github.com/VictoriaMetrics/VictoriaMetrics/issues/6732

Default retry timeout is up to 3 minutes to make this consistent with
the same configuration for GCS:
a05317f61f/lib/backup/gcsremote/gcs.go (L70-L76)

### Checklist

The following checks are **mandatory**:

- [x] My change adheres [VictoriaMetrics contributing
guidelines](https://docs.victoriametrics.com/contributing/).

---------

Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
2024-08-07 16:55:29 +02:00
Roman Khavronenko
f28f496a9d
lib/bytesutil: smooth buffer growth rate (#6761)
Before, buffer growth was always x2 of its size, which could lead to
excessive memory usage when processing big amount of data.
For example, scraping a target with hundreds of MBs in response could
result into hih memory spikes in vmagent because buffer has to double
its size to fit the response. See
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/6759

The change smoothes out the growth rate, trading higher allocation rate
for lower mem usage at certain conditions.

---------

Signed-off-by: hagen1778 <roman@victoriametrics.com>
2024-08-07 16:49:43 +02:00
hagen1778
1154f90d2d
lib/mergeset: fix typos in comments
Signed-off-by: hagen1778 <roman@victoriametrics.com>
2024-08-07 15:54:15 +02:00
Aliaksandr Valialkin
04981c7a7f
lib/streamaggr: remove resetState arg from aggrState.flushState()
The resetState arg was used only for the BenchmarkAggregatorsFlushInternalSerial benchmark.
This benchmark was testing aggregate state flush performance by keeping the same state across flushes.
The benhmark didn't reflect the performance and scalability of stream aggregation in production,
while it led to non-trivial code changes related to resetState arg handling.

So let's drop the benchmark together with all the code related to resetState handling,
in order to simplify the code at lib/streamaggr a bit.

Thanks to @AndrewChubatiuk for the original idea at https://github.com/VictoriaMetrics/VictoriaMetrics/pull/6314
2024-08-07 11:39:14 +02:00
Aliaksandr Valialkin
86c7afd126
lib/streamaggr: consistently use the same timestamp across all the output aggregated samples in a single aggregation interval
Prevsiously every aggregation output was using its own timestamp for the output aggregated samples
in a single aggregation interval. This could result in unexpected inconsitent timesetamps for the output
aggregated samples.

This commit consistently uses the same timestamp across all the output aggregated samples.
This commit makes sure that the duration between subsequent timestamps strictly equals
the configured aggregation interval.

Thanks to @AndrewChubatiuk for the original idea at https://github.com/VictoriaMetrics/VictoriaMetrics/pull/6314
This commit should help https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4580
2024-08-07 11:39:13 +02:00
Anzor
994796367b
app/vmagent: read __sample_limit__ from labels (#6665) (#6666)
By introducing this feature, users will have the ability to customize
the sampleLimit parameter on a per-target basis, providing more
flexibility and control over the job execution behavior.
2024-08-07 09:36:14 +02:00
hagen1778
f283126084
fix typos in comments
Signed-off-by: hagen1778 <roman@victoriametrics.com>
2024-08-06 14:54:49 +02:00
Zakhar Bessarab
9877a5e7d5
app/{vminsert,vmagent}: add healthcheck for influx ingestion endpoints (#6749)
### Describe Your Changes

This is useful for clients which validate InfluxDB is available before
data ingestion can be started.

See: https://github.com/VictoriaMetrics/VictoriaMetrics/issues/6653

### Checklist

The following checks are **mandatory**:

- [x] My change adheres [VictoriaMetrics contributing
guidelines](https://docs.victoriametrics.com/contributing/).

---------

Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Co-authored-by: hagen1778 <roman@victoriametrics.com>
2024-08-05 09:34:54 +02:00
Juraj Bubniak
11c0b05e8a
lib/backup/s3remote: fix typos (#6694)
Fixes a few typos in errors in lib/backup/s3remote package.
2024-07-29 14:18:31 +02:00
jackyin
e5d279bb71
lib/netutil: validate TLS cert and key files immediately (#6621)
Validate files specified via `-tlsKeyFile` and `-tlsCertFile` cmd-line flags on the process start-up. Previously, validation happened on the first connection accepted by HTTP server.

https://github.com/VictoriaMetrics/VictoriaMetrics/issues/6608

---------

Co-authored-by: hagen1778 <roman@victoriametrics.com>
2024-07-29 13:58:53 +02:00
Aliaksandr Valialkin
8551fbe9f3
Revert "refactor(vmstorage): Refactor the code to reduce the time complexity of MustAddRows and improve readability (#6629)"
This reverts commit e280d90e9a.

Reason for revert: the updated code doesn't improve the performance of table.MustAddRows for the typical case
when rows contain timestamps belonging to ptws[0].

The performance may be improved in theory for the case when all the rows belong to partiton other than ptws[0],
but this partition is automatically moved to ptws[0] by the code at lines
6aad1d43e9/lib/storage/table.go (L287-L298) ,
so the next time the typical case will work.

Also the updated code makes the code harder to follow, since it introduces an additional level of indirection
with non-trivial semantics inside table.MustAddRows - the partition.TimeRangeInPartition() function.
This function needs to be inspected and understood when reading the code at table.MustAddRows().
This function depends on minTsInRows and maxTsInRows vars, which are defined and initialized
many lines above the partition.TimeRangeInPartition() call. This complicates reading and understanding
the code even more.

The previous code was using clearer loop over rows with the clear call to partition.HasTimestamp()
for every timestamp in the row. The partition.HasTimestamp() call is used in the table.MustAddRows()
function multiple times. This makes the use of partition.HasTimestamp() call more consistent,
easier to understand and easier to maintain comparing to the mix of partition.HasTimestamp() and partition.TimeRangeInPartition()
calls.

Aslo, there is no need in documenting some hardcore software engineering refactoring at docs/CHANGLELOG.md,
since the docs/CHANGELOG.md is intended for VictoriaMetrics users, who may not know software engineering.
The docs/CHANGELOG.md must document user-visible changes, and the docs must be concise and clear for VictoriaMetrics users.
See https://docs.victoriametrics.com/contributing/#pull-request-checklist for more details.

Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/6629
2024-07-25 14:32:09 +02:00
Ruixiang Tan
e280d90e9a
refactor(vmstorage): Refactor the code to reduce the time complexity of MustAddRows and improve readability (#6629)
### Describe Your Changes
The original logic is not only highly complex but also poorly readable,
so it can be modified to increase readability and reduce time
complexity.


---------

Co-authored-by: Zhu Jiekun <jiekun@victoriametrics.com>
2024-07-25 08:55:12 +02:00
Aliaksandr Valialkin
65ce4e30ab
lib/backup/azremote: follow-up for 5fd3aef549
- Mention that credentials can be configured via env variables at both vmbackup and vmrestore docs.

- Make clear that the AZURE_STORAGE_DOMAIN env var is optional at https://docs.victoriametrics.com/vmbackup/#providing-credentials-via-env-variables

- Use string literals as is for env variable names instead of indirecting them via string constants.
  This makes easier to read and understand the code. These environment variable names aren't going to change
  in the future, so there is no sense in hiding them under string constants with some other names.

- Refer to https://docs.victoriametrics.com/vmbackup/#providing-credentials-via-env-variables in error messages
  when auth creds are improperly configured. This should simplify figuring out how to fix the error.

- Simplify the code a bit at FS.newClient(), so it is easier to follow it now.
  While at it, remove the check when superflouos environment variables are set, since it is too fragile
  and it looks like it doesn't help properly configuring vmbackup / vmrestore.

- Remove envLookuper indirection - just use 'func(name string) (string, bool)' type inline.
  This simplifies code reading and understanding.

- Split TestFSInit() into TestFSInit_Failure() and TestFSInit_Success(). This simplifies the test code,
  so it should be easier to maintain in the future.

Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/6518
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5984
2024-07-17 17:55:06 +02:00
Aliaksandr Valialkin
eaed0465d2
all: substitute double "the the" with "the"
This is a follow-up for 8786a08d27

Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/6600
2024-07-17 14:28:12 +02:00
Aliaksandr Valialkin
9c4b0334f2
all: consistently use stringsutil.JSONString() for formatting JSON strings with fmt.* functions instead of using "%q" formatter
The %q formatter may result in incorrectly formatted JSON string if the original string
contains special chars such as \x1b . They must be encoded as \u001b , otherwise the resulting JSON string
cannot be parsed by JSON parsers.

This is a follow-up for c0caa69939

See https://github.com/VictoriaMetrics/victorialogs-datasource/issues/24
2024-07-17 13:52:13 +02:00
Aliaksandr Valialkin
8ff051b287
lib/protoparser/graphite: use Regex.ReplaceAllLiteralString instead of Regex.ReplaceAllString for the case when the replacement cannot contain placeholders for capturing groups
This is a follow-up for 74affa3aec
2024-07-17 13:01:05 +02:00
Aliaksandr Valialkin
74affa3aec
lib/protoparser/graphite: follow-up for 476faf5578
- Clarify the description of -graphite.sanitizeMetricName command-line flag at README.md
- Do not sanitize tag values - only metric names and tag names must be sanitized,
  since they are treated specially by Grafana. Grafana doesn't apply any restrictions on tag values.
- Properly replace more than two consecutive dots with a single dot.
- Disallow unicode letters in metric names and tag names, since neither Prometheus nor Grafana
  do not support them.

Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/6489
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/6077
2024-07-17 12:41:55 +02:00
Aliaksandr Valialkin
58a757cd01
lib: consistently use regexp.Regexp.ReplaceAllLiteralString instead of regexp.Regexp.ReplaceAllString in places where the replacement cannot contain matching group placeholders 2024-07-17 12:41:54 +02:00
rtm0
bdc0e688e8
Fix inconsistent error handling in Storage.AddRows() (#6583)
### Describe Your Changes

`Storage.AddRows()` returns an error only in one case: when
`Storage.updatePerDateData()` fails to unmarshal a `metricNameRaw`. But
the same error is treated as a warning when it happens inside
`Storage.add()` or returned by `Storage.prefillNextIndexDB()`.

This commit fixes this inconsistency by treating the error returned by
`Storage.updatePerDateData()` as a warning as well. As a result
`Storage.add()` does not need a return value anymore and so doesn't
`Storage.AddRows()`.

Additionally, this commit adds a unit test that checks all cases that
result in a row not being added to the storage.



---------

Signed-off-by: Artem Fetishev <wwctrsrx@gmail.com>
Co-authored-by: Nikolay <nik@victoriametrics.com>
2024-07-17 12:07:14 +02:00
Aliaksandr Valialkin
c1e32f4517
lib/promrelabel: add test for IfExpression.String() function
While at it, simplify this function a bit after the commit 861852f262

Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/6462
2024-07-16 18:31:05 +02:00
Aliaksandr Valialkin
4304950391
lib/promscrape/discovery/yandexcloud: follow-up for 070abe5c71
- Obtain IAM token via GCE-like API instead of Amazon EC2 IMDSv2 API,
  since it looks like IMDBSv2 API isn't supported by Yandex Cloud
  according to https://yandex.cloud/en/docs/security/standard/authentication#aws-token :

  > So far, Yandex Cloud does not support version 2, so it is strongly recommended
  > to technically disable getting a service account token via the Amazon EC2 metadata service.

- Try obtaining IAM token via GCE-like API at first and then fall back to the deprecated Amazon EC2 IMDBSv1.
  This should prevent from auth errors for instances with disabled GCE-like auth API.
  This addresses @ITD27M01 concern at https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5513#issuecomment-1867794884

- Make more clear the description of the change at docs/CHANGELOG.md , add reference to the related issue.

P.S. This change wasn't tested in prod because I have no access to Yandex Cloud.
It is recommended to test this change by @ITD27M01 and @vmazgo , who filed
the issue https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5513

Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/6524
2024-07-16 17:58:40 +02:00
Aliaksandr Valialkin
57000f5105
lib/promscrape: follow-up for 1e83598be3
- Clarify that the -promscrape.maxScrapeSize value is used for limiting the maximum
  scrape size if max_scrape_size option isn't set at https://docs.victoriametrics.com/sd_configs/#scrape_configs

- Fix query example for scrape_response_size_bytes metric at https://docs.victoriametrics.com/vmagent/#automatically-generated-metrics

- Mention about max_scrape_size option at the -help description for -promscrape.maxScrapeSize command-line flag

- Treat zero value for max_scrape_size option as 'no scrape size limit'

- Change float64 to int type for scrapeResponseSize struct fields and function args, since response size cannot be fractional

- Optimize isAutoMetric() function a bit

- Sort auto metrics in alphabetical order in isAutoMetric() and in scrapeWork.addAutoMetrics() functions
  for better maintainability in the future

Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/6434
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/6429
2024-07-16 12:38:21 +02:00
Aliaksandr Valialkin
7a3394bbe1
Revert "lib/protoparser/opentelemetry/firehose: escape requestID before returning it to user (#6451)"
This reverts commit cd1aca217c.

Reason for revert: this commit has no sense, since the firehose response has application/json content-type,
so it must contain JSON-encoded timestamp and requestId fields according to https://docs.aws.amazon.com/firehose/latest/dev/httpdeliveryrequestresponse.html#responseformat .
HTML-escaping the requestId field may break the response, so the client couldn't correctly recognize the html-escaped requestId.

Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/6451
2024-07-16 09:49:19 +02:00
Aliaksandr Valialkin
233e5f0a9e
lib/httpserver: skip basic auth check for additional request paths, which should call httpserver.CheckAuthFlag()
This is a follow-up for 61dce6f2a1

Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/6338
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/6329
2024-07-16 01:00:45 +02:00
Aliaksandr Valialkin
784327ea30
lib/uint64set: optimize Set.Has() for nil Set - it should be inlined now
This makes unnecessary the checkDeleted variable at lib/storage/index_db.go

This is a follow-up for b984f4672e
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/6342
2024-07-15 23:59:20 +02:00
Aliaksandr Valialkin
832e088659
lib/mergeset: properly update TableMetrics.TooLongItemsDroppedTotal inside Table.UpdateMetrics
Substitute '+=' with '=', since tooLongItemsTotal is global counter, which doesn't belong to the Table struct.

This is a follow-up for 69d244e6fb
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/6297
2024-07-15 23:39:10 +02:00
Aliaksandr Valialkin
a468a6e985
lib/{httputils,netutil}: move httputils.GetStatDialFunc to netutil.NewStatDialFunc
- Rename GetStatDialFunc to NewStatDialFunc, since it returns new function with every call
- NewStatDialFunc isn't related to http in any way, so it must be moved from lib/httputils to lib/netutil
- Simplify the implementation of NewStatDialFunc by removing sync.Map from there.
- Use netutil.NewStatDialFunc at app/vmauth and lib/promscrape/discoveryutils
- Use gauge instead of counter type for *_conns metric

This is a follow-up for d7b5062917
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/6299
2024-07-15 23:02:34 +02:00
Aliaksandr Valialkin
ad367c17bf
lib/streamaggr/streamaggr.go: typo fix after 5e29ef5ed5: IgnoredNaNSamples -> ignoredNaNSamples 2024-07-15 21:58:56 +02:00
Aliaksandr Valialkin
db557b86ee
app/vmagent/remotewrite: follow-up for f153f54d11
- Move the remaining code responsible for stream aggregation initialization from remotewrite.go to streamaggr.go .
  This improves code maintainability a bit.

- Properly shut down streamaggr.Aggregators initialized inside remotewrite.CheckStreamAggrConfigs().
  This prevents from potential resource leaks.

- Use separate functions for initializing and reloading of global stream aggregation and per-remoteWrite.url stream aggregation.
  This makes the code easier to read and maintain. This also fixes INFO and ERROR logs emitted by these functions.

- Add an ability to specify `name` option in every stream aggregation config. This option is used as `name` label
  in metrics exposed by stream aggregation at /metrics page. This simplifies investigation of the exposed metrics.

- Add `path` label additionally to `name`, `url` and `position` labels at metrics exposed by streaming aggregation.
  This label should simplify investigation of the exposed metrics.

- Remove `match` and `group` labels from metrics exposed by streaming aggregation, since they have little practical applicability:
  it is hard to use these labels in query filters and aggregation functions.

- Rename the metric `vm_streamaggr_flushed_samples_total` to less misleading `vm_streamaggr_output_samples_total` .
  This metric shows the number of samples generated by the corresponding streaming aggregation rule.
  This metric has been added in the commit 861852f262 .
  See https://github.com/VictoriaMetrics/VictoriaMetrics/pull/6462

- Remove the metric `vm_streamaggr_stale_samples_total`, since it is unclear how it can be used in practice.
  This metric has been added in the commit 861852f262 .
  See https://github.com/VictoriaMetrics/VictoriaMetrics/pull/6462

- Remove Alias and aggrID fields from streamaggr.Options struct, since these fields aren't related to optional params,
  which could modify the behaviour of the constructed streaming aggregator.
  Convert the Alias field to regular argument passed to LoadFromFile() function, since this argument is mandatory.

- Pass Options arg to LoadFromFile() function by reference, since this structure is quite big.
  This also allows passing nil instead of Options when default options are enough.

- Add `name`, `path`, `url` and `position` labels to `vm_streamaggr_dedup_state_size_bytes` and `vm_streamaggr_dedup_state_items_count` metrics,
  so they have consistent set of labels comparing to the rest of streaming aggregation metrics.

- Convert aggregator.aggrStates field type from `map[string]aggrState` to `[]aggrOutput`, where `aggrOutput` contains the corresponding
  `aggrState` plus all the related metrics (currently only `vm_streamaggr_output_samples_total` metric is exposed with the corresponding
  `output` label per each configured output function). This simplifies and speeds up the code responsible for updating per-output
  metrics. This is a follow-up for the commit 2eb1bc4f81 .
  See https://github.com/VictoriaMetrics/VictoriaMetrics/pull/6604

- Added missing urls to docs ( https://docs.victoriametrics.com/stream-aggregation/ ) in error messages. These urls help users
  figuring out why VictoriaMetrics or vmagent generates the corresponding error messages. The urls were removed for unknown reason
  in the commit 2eb1bc4f81 .

- Fix incorrect update for `vm_streamaggr_output_samples_total` metric in flushCtx.appendSeriesWithExtraLabel() function.
  While at it, reduce memory usage by limiting the maximum number of samples per flush to 10K.

Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5467
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/6268
2024-07-15 20:24:01 +02:00
Aliaksandr Valialkin
202e5704e6
vendor: update github.com/VictoriaMetrics/metrics from v1.34.1 to v1.35.0
Fix potential memory leaks across VictoriaMetrics codebase after metrics.UnregisterSet(s) call
because of missing s.UnregisterAllMetrics() call.

This is a follow-up for 6a6e34ab8e . It is OK if some vmauth metrics
aren't visible for a few microseconds when the previous metrics are unregistered and new metrics
weren't registered yet.

Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/6247
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4690
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/6252
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/5805
2024-07-15 10:43:37 +02:00
Aliaksandr Valialkin
c995ccad93
lib/{storage,mergeset}: do not allow setting dataFlushInterval to values smaller than pending{Items,Rows}FlushInterval
Pending rows and items unconditionally remain in memory for up to pending{Items,Rows}FlushInterval,
so there is no any sense in setting dataFlushInterval (the interval for guaranteed flush of in-memory data to disk)
to values smaller than pending{Items,Rows}FlushInterval, since this doesn't affect the interval
for flushing pending rows and items from memory to disk.

This is a follow-up for 4c80b17027

Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/6221
2024-07-15 10:08:15 +02:00
Aliaksandr Valialkin
48ec66883a
lib/streamaggr: consistently use alphabetical order of benchmarked stream aggregation outputs 2024-07-15 09:53:19 +02:00
Aliaksandr Valialkin
5354374b62
lib/streamaggr: follow-up for 9c3d44c8c9
- Consistently enumerate stream aggregation outputs in alphabetical order across the source code and docs.
  This should simplify future maintenance of the corresponding code and docs.

- Fix the link to `rate_sum()` at `see also` section of `rate_avg()` docs.

- Make more clear the docs for `rate_sum()` and `rate_avg()` outputs.

- Encapsulate output metric suffix inside rateAggrState. This eliminates possible bugs related
  to incorrect suffix passing to newRateAggrState().

- Rename rateAggrState.total field to less misleading rateAggrState.increase name, since it calculates
  counter increase in the current aggregation window.

- Set rateLastValueState.prevTimestamp on the first sample in time series instead of the second sample.
  This makes more clear the code logic.

- Move the code for removing outdated entries at rateAggrState into removeOldEntries() function.
  This make the code logic inside rateAggrState.flushState() more clear.

- Do not write output sample with zero value if there are no input series, which could be used
  for calculating the rate, e.g. if only a single sample is registered for every input series.

- Do not take into account input series with a single registered sample when calculating rate_avg(),
  since this leads to incorrect results.

- Move {rate,total}AggrState.flushState() function to the end of rate.go and total.go files, so they look more similar.
  This shuld simplify future mantenance.

Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/6243
2024-07-15 08:40:09 +02:00
Aliaksandr Valialkin
0145b65f25
app/vmagent/remotewrite: follow-up for 87fd400dfc
- Drop samples and return true from remotewrite.TryPush() at fast path when all the remote storage
  systems are configured with the disabled on-disk queue, every in-memory queue is full
  and -remoteWrite.dropSamplesOnOverload is set to true. This case is quite common,
  so it should be optimized. Previously additional CPU time was spent on per-remoteWriteCtx
  relabeling and other processing in this case.

- Properly count the number of dropped samples inside remoteWriteCtx.pushInternalTrackDropped().
  Previously dropped samples were counted only if -remoteWrite.dropSamplesOnOverload flag is set.
  In reality, the samples are dropped when they couldn't be sent to the queue because in-memory queue is full
  and on-disk queue is disabled.
  The remoteWriteCtx.pushInternalTrackDropped() function is called by streaming aggregation for pushing
  the aggregated data to the remote storage. Streaming aggregation cannot wait until the remote storage
  processes pending data, so it drops aggregated samples in this case.

- Clarify the description for -remoteWrite.disableOnDiskQueue command-line flag at -help output,
  so it is clear that this flag can be set individually per each -remoteWrite.url.

- Make the -remoteWrite.dropSamplesOnOverload flag global. If some of the remote storage systems
  are configured with the disabled on-disk queue, then there is no sense in keeping samples
  on some of these systems, while dropping samples on the remaining systems, since this
  will result in global stall on the remote storage system with the disabled on-disk queue
  and with the -remoteWrite.dropSamplesOnOverload=false flag. vmagent will always return false
  from remotewrite.TryPush() in this case. This will result in infinite duplicate samples
  written to the remaining remote storage systems. That's why the -remoteWrite.dropSamplesOnOverload
  is forcibly set to true if more than one -remoteWrite.disableOnDiskQueue flag is set.
  This allows proceeding with newly scraped / pushed samples by sending them to the remaining
  remote storage systems, while dropping them on overloaded systems with the -remoteWrite.disableOnDiskQueue flag set.

- Verify that the remoteWriteCtx.TryPush() returns true in the TestRemoteWriteContext_TryPush_ImmutableTimeseries test.

- Mention in vmagent docs that the -remoteWrite.disableOnDiskQueue command-line flag can be set individually per each -remoteWrite.url.
  See https://docs.victoriametrics.com/vmagent/#disabling-on-disk-persistence

Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/6248
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/6065
2024-07-13 02:25:19 +02:00
Aliaksandr Valialkin
0078399788
app/vmalert: switch from table-driven tests to f-tests
This makes test code more clear and reduces the number of code lines by 500.
This also simplifies debugging tests. See https://itnext.io/f-tests-as-a-replacement-for-table-driven-tests-in-go-8814a8b19e9e

While at it, consistently use t.Fatal* instead of t.Error* across tests, since t.Error*
requires more boilerplate code, which can result in additional bugs inside tests.
While t.Error* allows writing logging errors for the same, this doesn't simplify fixing
broken tests most of the time.

This is a follow-up for a9525da8a4
2024-07-12 22:41:11 +02:00
hagen1778
2f65956259
lib/streamaggr: add missing test cases
Signed-off-by: hagen1778 <roman@victoriametrics.com>
2024-07-12 11:06:45 +02:00
Hui Wang
2eb1bc4f81
vmagent: fix vm_streamaggr_flushed_samples_total counter (#6604)
We use `vm_streamaggr_flushed_samples_total` to show the number of
produced samples by aggregation rule, previously it was overcounted, and
doesn't account for `output_relabel_configs`.

Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/6462

---------

Signed-off-by: hagen1778 <roman@victoriametrics.com>
Co-authored-by: hagen1778 <roman@victoriametrics.com>
2024-07-12 10:56:07 +02:00
hagen1778
03e4c5c19c
lib/bakcup/azremote: follow-up after 5fd3aef549
Simplify tests by converting them to f-tests.

5fd3aef549
Signed-off-by: hagen1778 <roman@victoriametrics.com>
2024-07-10 13:06:27 +02:00
justinrush
5fd3aef549
lib/backup: add support for Azure Managed Identity (#6518)
### Describe Your Changes

These changes support using Azure Managed Identity for the `vmbackup`
utility. It adds two new environment variables:

* `AZURE_USE_DEFAULT_CREDENTIAL`: Instructs the `vmbackup` utility to
build a connection using the [Azure Default
Credential](https://pkg.go.dev/github.com/Azure/azure-sdk-for-go/sdk/azidentity@v1.5.2#NewDefaultAzureCredential)
mode. This causes the Azure SDK to check for a variety of environment
variables to try and make a connection. By default, it tries to use
managed identity if that is set up.

This will close
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5984

### Checklist

The following checks are **mandatory**:

- [x] My change adheres [VictoriaMetrics contributing
guidelines](https://docs.victoriametrics.com/contributing/).

### Testing

However you normally test the `vmbackup` utility using Azure Blob should
continue to work without any changes. The set up for that is environment
specific and not listed out here.

Once regression testing has been done you can set up [Azure Managed
Identity](https://learn.microsoft.com/en-us/entra/identity/managed-identities-azure-resources/overview)
so your resource (AKS, VM, etc), can use that credential method. Once it
is set up, update your environment variables according to the updated
documentation.

I added unit tests to the `FS.Init` function, then made my changes, then
updated the unit tests to capture the new branches.

I tested this in our environment, but with SAS token auth and managed
identity and it works as expected.

---------

Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
Co-authored-by: Justin Rush <jarush@epic.com>
Co-authored-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
Co-authored-by: hagen1778 <roman@victoriametrics.com>
2024-07-10 11:52:05 +02:00
Aliaksandr Valialkin
ac06569c49
app/vlinsert/loki: use easyproto instead for parsing Loki protobuf messages 2024-07-10 03:05:17 +02:00
Aliaksandr Valialkin
aa9bb99527
lib/logstorage: drop all the pipes from the query when calculating the number of matching logs at /select/logsql/hits API 2024-07-10 00:39:28 +02:00
Aliaksandr Valialkin
3c02937a34
all: consistently use 'any' instead of 'interface{}'
'any' type is supported starting from Go1.18. Let's consistently use it
instead of 'interface{}' type across the code base, since `any` is easier to read than 'interface{}'.
2024-07-10 00:20:37 +02:00
Aliaksandr Valialkin
a9525da8a4
lib: consistently use f-tests instead of table-driven tests
This makes easier to read and debug these tests. This also reduces test lines count by 15% from 3K to 2.5K
See https://itnext.io/f-tests-as-a-replacement-for-table-driven-tests-in-go-8814a8b19e9e

While at it, consistently use t.Fatal* instead of t.Error*, since t.Error* usually leads
to more complicated and fragile tests, while it doesn't bring any practical benefits over t.Fatal*.
2024-07-09 22:40:50 +02:00
Aliaksandr Valialkin
35b3b95cbc
lib/promscrape/discovery/vultr: follow-up after 17e3d019d2
- Sort the discovered labels in alphabetical order at https://docs.victoriametrics.com/sd_configs/#vultr_sd_configs
- Rename VultrConfigs to VultrSDConfigs to be consistent with the naming for other SD configs.
- Prepare query arg filters for `list instances API` at newAPIConfig() instead of passing them in a separate listParams struct.
  This simplifies the code a bit.
- Return error when bearer token isn't set at vultr_sd_configs, since this token is mandatory
  according to https://docs.victoriametrics.com/sd_configs/#vultr_sd_configs
- Remove unused fields from the parsed response from Vultr list instances API in order to simplify the code a bit.
- Remove double logging of errors inside getInstances() function, since these errors must be already logged by the caller.
- Simplify tests, so they are easier to maintain.

Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/6041
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/6068
2024-07-05 17:40:03 +02:00
Aliaksandr Valialkin
c0caa69939
lib/logstorage: use quicktemplate.AppendJSONString instead of strconv.AppendQuote for encoding JSON strings
The strconv.AppendQuote improperly encodes special chars such as \x1b . They must be encoded as \u001b .

See https://github.com/VictoriaMetrics/victorialogs-datasource/issues/24
2024-07-05 01:22:23 +02:00
Aliaksandr Valialkin
2da7dfc754
Revert c6c5a5a186 and b2765c45d0
Reason for revert:

There are many statsd servers exist:

- https://github.com/statsd/statsd - classical statsd server
- https://docs.datadoghq.com/developers/dogstatsd/ - statsd server from DataDog built into DatDog Agent ( https://docs.datadoghq.com/agent/ )
- https://github.com/avito-tech/bioyino - high-performance statsd server
- https://github.com/atlassian/gostatsd - statsd server in Go
- https://github.com/prometheus/statsd_exporter - statsd server, which exposes the aggregated data as Prometheus metrics

These servers can be used for efficient aggregating of statsd data and sending it to VictoriaMetrics
according to https://docs.victoriametrics.com/#how-to-send-data-from-graphite-compatible-agents-such-as-statsd (
the https://github.com/prometheus/statsd_exporter can be scraped as usual Prometheus target
according to https://docs.victoriametrics.com/#how-to-scrape-prometheus-exporters-such-as-node-exporter ).

Adding support for statsd data ingestion protocol into VictoriaMetrics makes sense only if it provides
significant advantages over the existing statsd servers, while has no significant drawbacks comparing
to existing statsd servers.

The main advantage of statsd server built into VictoriaMetrics and vmagent - getting rid of additional statsd server.
The main drawback is non-trivial and inconvenient streaming aggregation configs, which must be used for the ingested statsd metrics (
see https://docs.victoriametrics.com/stream-aggregation/ ). These configs are incompatible with the configs for standalone statsd servers.
So you need to manually translate configs of the used statsd server to stream aggregation configs when migrating
from standalone statsd server to statsd server built into VictoriaMetrics (or vmagent).

Another important drawback is that it is very easy to shoot yourself in the foot when using built-in statsd server
with the -statsd.disableAggregationEnforcement command-line flag or with improperly configured streaming aggregation.
In this case the ingested statsd metrics will be stored to VictoriaMetrics as is without any aggregation.
This may result in high CPU usage during data ingestion, high disk space usage for storing all the unaggregated
statsd metrics and high CPU usage during querying, since all the unaggregated metrics must be read, unpacked and processed
during querying.

P.S. Built-in statsd server can be added to VictoriaMetrics and vmagent after figuring out more ergonomic
specialized configuration for aggregating of statsd metrics. The main requirements for this configuration:

- easy to write, read and update (ideally it should work out of the box for most cases without additional configuration)
- hard to misconfigure (e.g. hard to shoot yourself in the foot)

It would be great if this configuration will be compatible with the configuration of the most widely used statsd server.

In the mean time it is recommended continue using external statsd server.

Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/6265
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/5053
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5052
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/206
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4600
2024-07-03 23:51:56 +02:00
Aliaksandr Valialkin
d8c7cc266b
lib/promscrape: use prompbmarshal.MustParsePromMetrics function at parseData() test function
The prompbmarshal.MustParsePromMetrics function has been added in the commit cc4d57d650
2024-07-03 16:08:13 +02:00
Aliaksandr Valialkin
bb00bae353
Revert "Exemplar support (#5982)"
This reverts commit 5a3abfa041.

Reason for revert: exemplars aren't in wide use because they have numerous issues which prevent their adoption (see below).
Adding support for examplars into VictoriaMetrics introduces non-trivial code changes. These code changes need to be supported forever
once the release of VictoriaMetrics with exemplar support is published. That's why I don't think this is a good feature despite
that the source code of the reverted commit has an excellent quality. See https://docs.victoriametrics.com/goals/ .

Issues with Prometheus exemplars:

- Prometheus still has only experimental support for exemplars after more than three years since they were introduced.
  It stores exemplars in memory, so they are lost after Prometheus restart. This doesn't look like production-ready feature.
  See 0a2f3b3794/content/docs/instrumenting/exposition_formats.md (L153-L159)
  and https://prometheus.io/docs/prometheus/latest/feature_flags/#exemplars-storage

- It is very non-trivial to expose exemplars alongside metrics in your application, since the official Prometheus SDKs
  for metrics' exposition ( https://prometheus.io/docs/instrumenting/clientlibs/ ) either have very hard-to-use API
  for exposing histograms or do not have this API at all. For example, try figuring out how to expose exemplars
  via https://pkg.go.dev/github.com/prometheus/client_golang@v1.19.1/prometheus .

- It looks like exemplars are supported for Histogram metric types only -
  see https://pkg.go.dev/github.com/prometheus/client_golang@v1.19.1/prometheus#Timer.ObserveDurationWithExemplar .
  Exemplars aren't supported for Counter, Gauge and Summary metric types.

- Grafana has very poor support for Prometheus exemplars. It looks like it supports exemplars only when the query
  contains histogram_quantile() function. It queries exemplars via special Prometheus API -
  https://prometheus.io/docs/prometheus/latest/querying/api/#querying-exemplars - (which is still marked as experimental, btw.)
  and then displays all the returned exemplars on the graph as special dots. The issue is that this doesn't work
  in production in most cases when the histogram_quantile() is calculated over thousands of histogram buckets
  exposed by big number of application instances. Every histogram bucket may expose an exemplar on every timestamp shown on the graph.
  This makes the graph unusable, since it is litterally filled with thousands of exemplar dots.
  Neither Prometheus API nor Grafana doesn't provide the ability to filter out unneeded exemplars.

- Exemplars are usually connected to traces. While traces are good for some

I doubt exemplars will become production-ready in the near future because of the issues outlined above.

Alternative to exemplars:

Exemplars are marketed as a silver bullet for the correlation between metrics, traces and logs -
just click the exemplar dot on some graph in Grafana and instantly see the corresponding trace or log entry!
This doesn't work as expected in production as shown above. Are there better solutions, which work in production?
Yes - just use time-based and label-based correlation between metrics, traces and logs. Assign the same `job`
and `instance` labels to metrics, logs and traces, so you can quickly find the needed trace or log entry
by these labes on the time range with the anomaly on metrics' graph.
2024-07-03 15:30:21 +02:00
Aliaksandr Valialkin
cc4d57d650
app/vmagent/remotewrite,lib/streamaggr: re-use common code in tests after 879771808b
- Export streamaggr.LoadFromData() function, so it could be used in tests outside the lib/streamaggr package.
  This allows removing a hack with creation of temporary files at TestRemoteWriteContext_TryPush_ImmutableTimeseries.

- Move common code for mustParsePromMetrics() function into lib/prompbmarshal package,
  so it could be used in tests for building []prompbmarshal.TimeSeries from string.

Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/6205
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/6206
2024-07-03 15:21:36 +02:00
Aliaksandr Valialkin
f17b408643
lib/streamaggr: follow-up for the commit c0e4ccb7b5
- Clarify docs for `Ignore aggregation intervals on start` feature.

- Make more clear the code dealing with ignoreFirstIntervals at aggregator.runFlusher() functions.
  It is better from readability and maintainability PoV using distinct a.flush() calls
  for distinct cases instead of merging them into a single a.flush() call.

- Take into account the first incomplete interval when tracking the number of skipped aggregation intervals,
  since this behaviour is easier to understand by the end users.

Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/6137
2024-07-02 21:24:50 +02:00
Andrii Chubatiuk
476faf5578
lib/protoparser/graphite: added -graphite.sanitizeMetricName flag (#6489)
### Describe Your Changes

Added flag to sanitize graphite metrics
fixes #6077

### Checklist

The following checks are **mandatory**:

- [ ] My change adheres [VictoriaMetrics contributing
guidelines](https://docs.victoriametrics.com/contributing/).

---------

Signed-off-by: hagen1778 <roman@victoriametrics.com>
Co-authored-by: hagen1778 <roman@victoriametrics.com>
2024-07-02 14:56:41 +02:00
Aliaksandr Valialkin
3b6c78c26c
lib/logstorage: allow writing after N in front of before N at stream_context pipe 2024-07-02 01:38:20 +02:00
Andrii Chubatiuk
861852f262
lib/streamaggr: added stale samples metric, added metrics labels (#6462)
### Describe Your Changes

- added stale metrics counters for input and output samples
- added labels for aggregator metrics =>
`name="{rwctx}:{aggrId}:{aggrSuffix}"`
   - rwctx - global or number starting from 1
   - aggrid - aggregator id starting from 1
   - aggrSuffix - <interval>_(by|without)_label1_label2_labeln
   e.g: `name="global:1:1m_without_instance_pod"`

### Checklist

The following checks are **mandatory**:

- [ ] My change adheres [VictoriaMetrics contributing
guidelines](https://docs.victoriametrics.com/contributing/).

---------

Signed-off-by: hagen1778 <roman@victoriametrics.com>
Co-authored-by: hagen1778 <roman@victoriametrics.com>
2024-07-01 14:56:17 +02:00
Aliaksandr Valialkin
6bb66cb3e9
lib/logstorage: properly search for the surrounding logs in stream_context pipe
The set of log fields in the found logs may differ from the set of log fields present in the log stream.
So compare only the log fields in the found logs when searching for the matching log entry in the log stream.

While at it, return _stream field in the delimiter log entry, since this field is used by VictoriaLogs Web UI
for grouping logs by log streams.
2024-07-01 02:29:50 +02:00