lib/storage: properly handle maxMetrics limit at metricID search

`TL;DR` This PR improves the metric IDs search in IndexDB:

- Avoid seaching for metric IDs twice when `maxMetrics` limit is
exceeded
- Use correct error type for indicating that the `maxMetrics` limit is
exceded
- Simplify the logic of deciding between per-day and global index search

A unit test has been added to ensure that this refactoring does not
break anything.

---

Function calls before the fix:

```
idb.searchMetricIDs
    |__ is.searchMetricIDs
        |__ is.searchMetricIDsInternal
            |__ is.updateMetricIDsForTagFilters
                |__ is.tryUpdatingMetricIDsForDateRange
                |                       |
                |__ is.getMetricIDsForDateAndFilters
```

- `searchMetricIDsInternal` searches metric IDs for each filter set. It
maintains a metric ID set variable which is updated every time the
`updateMetricIDsForTagFilters` function is called. After each successful
call, the function checks the length of the updated metric ID set and if
it is greater than `maxMetrics`, the function returns `too many
timeseries` error.
- `updateMetricIDsForTagFilters` uses either per-day or global index to
search metric IDs for the given filter set. The decision of which index
to use is made is made within the `tryUpdatingMetricIDsForDateRange`
function and if it returns `fallback to global search` error then the
function uses global index by calling `getMetricIDsForDateAndFilters`
with zero date.
- `tryUpdatingMetricIDsForDateRange` first checks if the given time
range is larger than 40 days and if so returns `fallback to global
search` error. Otherwise it proceeds to searching for metric IDs within
that time range by calling `getMetricIDsForDateAndFilters` for each
date.
- `getMetricIDsForDateAndFilters` searches for metric IDs for the given
date and returns `fallback to global search` error if the number of
found metric IDs is greater than `maxMetrics`.

Problems with this solution:

1. The `fallback to global search` error returned by
`getMetricIDsForDateAndFilters` in case when maxMetrics is exceeded is
misleading.
2. If `tryUpdatingMetricIDsForDateRange` proceeds to date range search
and returns `fallback to global search` error (because
`getMetricIDsForDateAndFilters` returns it) then this will trigger
global search in `updateMetricIDsForTagFilters`. However the global
search uses the same maxMetrics value which means this search is
destined to fail too. I.e. the same search is performed twice and fails
twice.
3. `too many timeseries` error is already handled in
`searchMetricIDsInternal` and therefore handing this error in
`updateMetricIDsForTagFilters` is redundant
4. updateMetricIDsForTagFilters is a better place to make a decision on
whether to use per-day or global index.

Solution:

1.  Use a dedicated error for `too many timeseries` case
2. Handle `too many timeseries` error in  `searchMetricIDsInternal` only
3. Move the per-day or global search decision from
`tryUpdatingMetricIDsForDateRange` to `updateMetricIDsForTagFilters` and
remove `fallback to global search` error.

---------

Signed-off-by: Artem Fetishev <wwctrsrx@gmail.com>
Co-authored-by: Nikolay <nik@victoriametrics.com>
This commit is contained in:
rtm0 2024-08-27 22:39:03 +03:00 committed by f41gh7
parent b51c6bf75d
commit 4c31a6a1fc
No known key found for this signature in database
GPG key ID: 4558311CF775EC72
3 changed files with 270 additions and 30 deletions

View file

@ -46,6 +46,7 @@ See also [LTS releases](https://docs.victoriametrics.com/lts-releases/).
* BUGFIX: [vmalert](https://docs.victoriametrics.com/vmalert/), [vmctl](https://docs.victoriametrics.com/vmctl/) and snapshot API: verify correctness of URLs provided via cmd-line flags before executing HTTP requests. See [this](https://github.com/VictoriaMetrics/VictoriaMetrics/issues/6740) issue for details.
* BUGFIX: [vmalert](https://docs.victoriametrics.com/vmalert): reduce memory usage when parsing responses with big number of metrics in response. The memory usage was increased in [v1.102.0-rc1](https://github.com/VictoriaMetrics/VictoriaMetrics/releases/tag/v1.102.0-rc1) after attempt to reduce CPU usage for heavy loaded vmalerts.
* BUGFIX: all VictoriaMetrics components: forcefully set owner/group for release tars to 1000:1000. This helps to avoid unpacking [issues](https://github.com/VictoriaMetrics/VictoriaMetrics/issues/6788) on systems with limitations around UID:GID configuration. See [this pull request](https://github.com/VictoriaMetrics/VictoriaMetrics/pull/6846).
* BUGFIX: [Single-node VictoriaMetrics](https://docs.victoriametrics.com/) and `vmstorage` in [VictoriaMetrics cluster](https://docs.victoriametrics.com/cluster-victoriametrics/): Removes the fallback to global index search when the search using per-day index fails due to too many time series found (the global index will fail anyway with the same error and so the fallback is not needed and only slows down the search). See [this](https://github.com/VictoriaMetrics/VictoriaMetrics/pull/6836) for details.
* BUGFIX: [Single-node VictoriaMetrics](https://docs.victoriametrics.com/) and `vmstorage` in [VictoriaMetrics cluster](https://docs.victoriametrics.com/cluster-victoriametrics/): fix metric names registering in the per-day index for new dates for existing time series when making calls to `/tags/tagSeries` and `/tags/tagMultiSeries` handlers of [Grpahite API](https://docs.victoriametrics.com/#graphite-api-usage). See [this](https://github.com/VictoriaMetrics/VictoriaMetrics/pull/6872/) for details.
## [v1.102.1](https://github.com/VictoriaMetrics/VictoriaMetrics/releases/tag/v1.102.1)

View file

@ -2365,6 +2365,12 @@ func (is *indexSearch) searchMetricIDs(qt *querytracer.Tracer, tfss []*TagFilter
return sortedMetricIDs, nil
}
func errTooManyTimeseries(maxMetrics int) error {
return fmt.Errorf("the number of matching timeseries exceeds %d; "+
"either narrow down the search or increase -search.max* command-line flag values at vmselect; "+
"see https://docs.victoriametrics.com/#resource-usage-limits", maxMetrics)
}
func (is *indexSearch) searchMetricIDsInternal(qt *querytracer.Tracer, tfss []*TagFilters, tr TimeRange, maxMetrics int) (*uint64set.Set, error) {
qt = qt.NewChild("search for metric ids: filters=%s, timeRange=%s, maxMetrics=%d", tfss, &tr, maxMetrics)
defer qt.Done()
@ -2401,32 +2407,30 @@ func (is *indexSearch) searchMetricIDsInternal(qt *querytracer.Tracer, tfss []*T
return nil, err
}
if metricIDs.Len() > maxMetrics {
return nil, fmt.Errorf("the number of matching timeseries exceeds %d; either narrow down the search "+
"or increase -search.max* command-line flag values at vmselect; see https://docs.victoriametrics.com/#resource-usage-limits", maxMetrics)
return nil, errTooManyTimeseries(maxMetrics)
}
}
return metricIDs, nil
}
const maxDaysForPerDaySearch = 40
func (is *indexSearch) updateMetricIDsForTagFilters(qt *querytracer.Tracer, metricIDs *uint64set.Set, tfs *TagFilters, tr TimeRange, maxMetrics int) error {
err := is.tryUpdatingMetricIDsForDateRange(qt, metricIDs, tfs, tr, maxMetrics)
if err == nil {
// Fast path: found metricIDs by date range.
return nil
}
if !errors.Is(err, errFallbackToGlobalSearch) {
return err
minDate := uint64(tr.MinTimestamp) / msecPerDay
maxDate := uint64(tr.MaxTimestamp-1) / msecPerDay
if minDate <= maxDate && maxDate-minDate <= maxDaysForPerDaySearch {
// Fast path - search metricIDs by date range in the per-day inverted
// index.
is.db.dateRangeSearchCalls.Add(1)
qt.Printf("search metric ids in the per-day index")
return is.updateMetricIDsForDateRange(qt, metricIDs, tfs, minDate, maxDate, maxMetrics)
}
// Slow path - fall back to search in the global inverted index.
qt.Printf("cannot find metric ids in per-day index; fall back to global index")
// Slow path - search metricIDs in the global inverted index.
qt.Printf("search metric ids in the global index")
is.db.globalSearchCalls.Add(1)
m, err := is.getMetricIDsForDateAndFilters(qt, 0, tfs, maxMetrics)
if err != nil {
if errors.Is(err, errFallbackToGlobalSearch) {
return fmt.Errorf("the number of matching timeseries exceeds %d; either narrow down the search "+
"or increase -search.max* command-line flag values at vmselect; see https://docs.victoriametrics.com/#resource-usage-limits", maxMetrics)
}
return err
}
metricIDs.UnionMayOwn(m)
@ -2605,18 +2609,7 @@ func (is *indexSearch) updateMetricIDsForOrSuffix(prefix []byte, metricIDs *uint
return loopsCount, nil
}
var errFallbackToGlobalSearch = errors.New("fall back from per-day index search to global index search")
const maxDaysForPerDaySearch = 40
func (is *indexSearch) tryUpdatingMetricIDsForDateRange(qt *querytracer.Tracer, metricIDs *uint64set.Set, tfs *TagFilters, tr TimeRange, maxMetrics int) error {
is.db.dateRangeSearchCalls.Add(1)
minDate := uint64(tr.MinTimestamp) / msecPerDay
maxDate := uint64(tr.MaxTimestamp-1) / msecPerDay
if minDate > maxDate || maxDate-minDate > maxDaysForPerDaySearch {
// Too much dates must be covered. Give up, since it may be slow.
return errFallbackToGlobalSearch
}
func (is *indexSearch) updateMetricIDsForDateRange(qt *querytracer.Tracer, metricIDs *uint64set.Set, tfs *TagFilters, minDate, maxDate uint64, maxMetrics int) error {
if minDate == maxDate {
// Fast path - query only a single date.
m, err := is.getMetricIDsForDateAndFilters(qt, minDate, tfs, maxMetrics)
@ -2781,7 +2774,7 @@ func (is *indexSearch) getMetricIDsForDateAndFilters(qt *querytracer.Tracer, dat
}
if m.Len() >= maxDateMetrics {
// Too many time series found for the given (date). Fall back to global search.
return nil, errFallbackToGlobalSearch
return nil, errTooManyTimeseries(maxDateMetrics)
}
metricIDs = m
qt.Printf("found %d metric ids", metricIDs.Len())

View file

@ -1162,6 +1162,35 @@ func testGenerateMetricRowsForTenant(accountID, projectID uint32, rng *rand.Rand
}
func testGenerateMetricRows(rng *rand.Rand, rows uint64, timestampMin, timestampMax int64) []MetricRow {
return testGenerateMetricRowsWithPrefix(rng, rows, "metric", TimeRange{timestampMin, timestampMax})
}
func testGenerateMetricRowsWithPrefixForTenantID(rng *rand.Rand, accountID, projectID uint32, rows uint64, prefix string, tr TimeRange) []MetricRow {
var mrs []MetricRow
var mn MetricName
mn.Tags = []Tag{
{[]byte("job"), []byte("webservice")},
{[]byte("instance"), []byte("1.2.3.4")},
}
for i := 0; i < int(rows); i++ {
mn.AccountID = accountID
mn.ProjectID = projectID
mn.MetricGroup = []byte(fmt.Sprintf("%s_%d", prefix, i))
metricNameRaw := mn.marshalRaw(nil)
timestamp := rng.Int63n(tr.MaxTimestamp-tr.MinTimestamp) + tr.MinTimestamp
value := rng.NormFloat64() * 1e6
mr := MetricRow{
MetricNameRaw: metricNameRaw,
Timestamp: timestamp,
Value: value,
}
mrs = append(mrs, mr)
}
return mrs
}
func testGenerateMetricRowsWithPrefix(rng *rand.Rand, rows uint64, prefix string, tr TimeRange) []MetricRow {
var mrs []MetricRow
var mn MetricName
mn.Tags = []Tag{
@ -1171,9 +1200,9 @@ func testGenerateMetricRows(rng *rand.Rand, rows uint64, timestampMin, timestamp
for i := 0; i < int(rows); i++ {
mn.AccountID = uint32(rand.Intn(2))
mn.ProjectID = uint32(rand.Intn(3))
mn.MetricGroup = []byte(fmt.Sprintf("metric_%d", i))
mn.MetricGroup = []byte(fmt.Sprintf("%s_%d", prefix, i))
metricNameRaw := mn.marshalRaw(nil)
timestamp := rng.Int63n(timestampMax-timestampMin) + timestampMin
timestamp := rng.Int63n(tr.MaxTimestamp-tr.MinTimestamp) + tr.MinTimestamp
value := rng.NormFloat64() * 1e6
mr := MetricRow{
@ -1629,6 +1658,223 @@ func testCountAllMetricNames(s *Storage, accountID, projectID uint32, tr TimeRan
return len(names)
}
func TestStorageSearchMetricNames_TooManyTimeseries(t *testing.T) {
defer testRemoveAll(t)
const (
numDays = 100
numRows = 10
)
rng := rand.New(rand.NewSource(1))
var (
days []TimeRange
mrs []MetricRow
)
for i := range numDays {
day := TimeRange{
MinTimestamp: time.Date(2000, 1, i+1, 0, 0, 0, 0, time.UTC).UnixMilli(),
MaxTimestamp: time.Date(2000, 1, i+1, 23, 59, 59, 999, time.UTC).UnixMilli(),
}
days = append(days, day)
prefix1 := fmt.Sprintf("metric1_%d", i)
mrs = append(mrs, testGenerateMetricRowsWithPrefixForTenantID(rng, 0, 0, numRows, prefix1, day)...)
prefix2 := fmt.Sprintf("metric2_%d", i)
mrs = append(mrs, testGenerateMetricRowsWithPrefixForTenantID(rng, 0, 0, numRows, prefix2, day)...)
}
type options struct {
path string
filters []string
tr TimeRange
maxMetrics int
wantErr bool
wantCount int
}
f := func(opts *options) {
t.Helper()
s := MustOpenStorage(t.Name()+"/"+opts.path, 0, 0, 0)
defer s.MustClose()
s.AddRows(mrs, defaultPrecisionBits)
s.DebugFlush()
var tfss []*TagFilters
for _, filter := range opts.filters {
filter := fmt.Sprintf("%s.*", filter)
tfs := NewTagFilters(0, 0)
if err := tfs.Add(nil, []byte(filter), false, true); err != nil {
t.Fatalf("unexpected error in TagFilters.Add: %v", err)
}
tfss = append(tfss, tfs)
}
names, err := s.SearchMetricNames(nil, tfss, opts.tr, opts.maxMetrics, noDeadline)
gotErr := err != nil
if gotErr != opts.wantErr {
t.Errorf("SeachMetricNames(%v, %v, %d): unexpected error: got %v, want error to happen %v", []any{
tfss, &opts.tr, opts.maxMetrics, err, opts.wantErr,
}...)
}
if got := len(names); got != opts.wantCount {
t.Errorf("SeachMetricNames(%v, %v, %d): unexpected metric name count: got %d, want %d", []any{
tfss, &opts.tr, opts.maxMetrics, got, opts.wantCount,
}...)
}
}
// Using one filter to search metric names within one day. The maxMetrics
// param is set to match exactly the number of time series that match the
// filter within that time range. Search operation must complete
// successfully.
f(&options{
path: "OneDay/OneTagFilter/MaxMetricsNotExeeded",
filters: []string{"metric1"},
tr: days[0],
maxMetrics: numRows,
wantCount: numRows,
})
// Using one filter to search metric names within one day. The maxMetrics
// param is less than the number of time series that match the filter
// within that time range. Search operation must fail.
f(&options{
path: "OneDay/OneTagFilter/MaxMetricsExeeded",
filters: []string{"metric1"},
tr: days[0],
maxMetrics: numRows - 1,
wantErr: true,
})
// Using two filters to search metric names within one day. The maxMetrics
// param is set to match exactly the number of time series that match the
// two filters within that time range. Search operation must complete
// successfully.
f(&options{
path: "OneDay/TwoTagFilters/MaxMetricsNotExeeded",
filters: []string{"metric1", "metric2"},
tr: days[0],
maxMetrics: numRows * 2,
wantCount: numRows * 2,
})
// Using two filters to search metric names within one day. The maxMetrics
// param is less than the number of time series that match the two filters
// within that time range. Search operation must fail.
f(&options{
path: "OneDay/TwoTagFilters/MaxMetricsExeeded",
filters: []string{"metric1", "metric2"},
tr: days[0],
maxMetrics: numRows*2 - 1,
wantErr: true,
})
// Using one filter to search metric names within two days. The maxMetrics
// param is set to match exactly the number of time series that match the
// filter within that time range. Search operation must complete
// successfully.
f(&options{
path: "TwoDays/OneTagFilter/MaxMetricsNotExeeded",
filters: []string{"metric1"},
tr: TimeRange{
MinTimestamp: days[0].MinTimestamp,
MaxTimestamp: days[1].MaxTimestamp,
},
maxMetrics: numRows * 2,
wantCount: numRows * 2,
})
// Using one filter to search metric names within two days. The maxMetrics
// param is less than the number of time series that match the filter
// within that time range. Search operation must fail.
f(&options{
path: "TwoDays/OneTagFilter/MaxMetricsExeeded",
filters: []string{"metric1"},
tr: TimeRange{
MinTimestamp: days[0].MinTimestamp,
MaxTimestamp: days[1].MaxTimestamp,
},
maxMetrics: numRows*2 - 1,
wantErr: true,
})
// Using two filters to search metric names within two days. The maxMetrics
// param is set to match exactly the number of time series that match the
// two filters within that time range. Search operation must complete
// successfully.
f(&options{
path: "TwoDays/TwoTagFilters/MaxMetricsNotExeeded",
filters: []string{"metric1", "metric2"},
tr: TimeRange{
MinTimestamp: days[0].MinTimestamp,
MaxTimestamp: days[1].MaxTimestamp,
},
maxMetrics: numRows * 4,
wantCount: numRows * 4,
})
// Using two filters to search metric names within two days. The maxMetrics
// param is less than the number of time series that match the two filters
// within that time range. Search operation must fail.
f(&options{
path: "TwoDays/TwoTagFilters/MaxMetricsExeeded",
filters: []string{"metric1", "metric2"},
tr: TimeRange{
MinTimestamp: days[0].MinTimestamp,
MaxTimestamp: days[1].MaxTimestamp,
},
maxMetrics: numRows*4 - 1,
wantErr: true,
})
// Using one filter to search metric names within the time range of 41 days.
// This time range corresponds to the day difference of 40 days, which is
// the max day difference when the per-day index is still used for
// searching. The maxMetrics param is set to match exactly the number of
// time series that match the filter within that time range. Search
// operation must complete successfully.
f(&options{
path: "40Days/OneTagFilter/MaxMetricsNotExeeded",
filters: []string{"metric1"},
tr: TimeRange{
MinTimestamp: days[0].MinTimestamp,
MaxTimestamp: days[40].MaxTimestamp,
},
maxMetrics: numRows * 41,
wantCount: numRows * 41,
})
// Using one filter to search metric names within the time range of 42 days.
// This time range corresponds to the day difference of 41 days, which is
// longer than than 40 days. In this case, the search is performed using
// global index instead of per-day index and the metric names will be
// searched within the entire retention period. The maxMetrics parameter,
// however, is set to the number of time series within the 42 days. The
// search must fail because the number of metrics will be much larger.
f(&options{
path: "MoreThan40Days/OneTagFilter/MaxMetricsExeeded",
filters: []string{"metric1"},
tr: TimeRange{
MinTimestamp: days[0].MinTimestamp,
MaxTimestamp: days[41].MaxTimestamp,
},
maxMetrics: numRows * 42,
wantErr: true,
})
// To fix the above case, the maxMetrics must be adjusted to be not less
// than the number of time series within the entire retention period.
f(&options{
path: "MoreThan40Days/OneTagFilter/MaxMetricsNotExeeded",
filters: []string{"metric1"},
tr: TimeRange{
MinTimestamp: days[0].MinTimestamp,
MaxTimestamp: days[41].MaxTimestamp,
},
maxMetrics: numRows * numDays,
wantCount: numRows * numDays,
})
}
// testCountAllMetricIDs is a test helper function that counts the IDs of
// all time series within the given time range.
func testCountAllMetricIDs(s *Storage, tr TimeRange) int {