app/vmselect/netstorage: add support for the ability to set cross-group replication factor at vmselect

The cross-group replication factor can be set via `-globalReplicationFactor` command-line flag at vmselect.
In this case vmselect continues returning full responses if up to globalReplicationFactor-1 groups are unavailable.

See https://docs.victoriametrics.com/cluster-victoriametrics/#vmstorage-groups-at-vmselect for details.

Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/6054
Aliaksandr Valialkin 2024-04-19 00:09:15 +02:00
parent e138f4827e
commit 90794e84bc
GPG key ID: 52C003EE2BCDB9EB
4 changed files with 124 additions and 60 deletions

@ -773,38 +773,60 @@ These issues are addressed by [vmagent](https://docs.victoriametrics.com/vmagent
`vmselect` can be configured to query multiple distinct groups of `vmstorage` nodes with individual `-replicationFactor` per each group.
The following format for `-storageNode` command-line flag value should be used for assigning a particular `addr` of `vmstorage` to a particular `groupName` -
`-storageNode=groupName/addr`. For example, the following command runs `vmselect`, which continues returning full responses if up to one node per each group is temporarily unavailable
`-storageNode=groupName/addr`. The `groupName` can contain an arbitrary value. The only rule is that every `vmstorage` group must have a unique name.
For example, the following command runs `vmselect`, which continues returning full responses if up to one node per each group is temporarily unavailable
because the given `-replicationFactor=2` is applied individually per each group:
```
/path/to/vmselect \
-replicationFactor=2 \
-storageNode=group1/host1 \
-storageNode=group1/host2 \
-storageNode=group1/host3 \
-storageNode=group2/host4 \
-storageNode=group2/host5 \
-storageNode=group2/host6 \
-storageNode=group3/host7 \
-storageNode=group3/host8 \
-storageNode=group3/host9
-storageNode=g1/host1,g1/host2,g1/host3 \
-storageNode=g2/host4,g2/host5,g2/host6 \
-storageNode=g3/host7,g3/host8,g3/host9
```
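The `groupName/addr` format above can be illustrated with a minimal Go sketch. It is not the parser used by `vmselect`; `groupStorageNodes` is a hypothetical helper, and placing addresses without a `groupName/` prefix into a group with an empty name is an assumption made only for this example.

```go
package main

import (
	"fmt"
	"strings"
)

// groupStorageNodes is a hypothetical helper that splits -storageNode values
// of the form groupName/addr (optionally comma-separated) into a map
// from group name to the addresses assigned to that group.
func groupStorageNodes(values []string) map[string][]string {
	groups := make(map[string][]string)
	for _, v := range values {
		for _, item := range strings.Split(v, ",") {
			name, addr, ok := strings.Cut(item, "/")
			if !ok {
				// Assumption of this sketch: an address without a group prefix
				// is collected under the empty group name.
				name, addr = "", item
			}
			groups[name] = append(groups[name], addr)
		}
	}
	return groups
}

func main() {
	groups := groupStorageNodes([]string{
		"g1/host1,g1/host2,g1/host3",
		"g2/host4,g2/host5,g2/host6",
		"g3/host7,g3/host8,g3/host9",
	})
	// Print the groups in a fixed order, since map iteration order is random.
	for _, name := range []string{"g1", "g2", "g3"} {
		fmt.Println(name, groups[name])
	}
}
```

With three groups of three nodes each and `-replicationFactor=2`, every group in this map tolerates one unavailable node independently of the others.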
It is possible to specify distinct `-replicationFactor` per each group via the following format - `-replicationFactor=groupName:rf`.
For example, the following command runs `vmselect`, which uses `-replicationFactor=3` for the `group1`, while it uses `-replicationFactor=1` for the `group2`:
A distinct `-replicationFactor` can be specified per each group via the following format - `-replicationFactor=groupName:rf`.
For example, the following command runs `vmselect`, which uses `-replicationFactor=3` for the group `g1`, `-replicationFactor=2` for the group `g2`
and `-replicationFactor=1` for the group `g3`:
```
/path/to/vmselect \
-replicationFactor=group1:3 \
-storageNode=group1/host1 \
-storageNode=group1/host2 \
-storageNode=group1/host3 \
-replicationFactor=group2:1 \
-storageNode=group2/host4 \
-storageNode=group2/host5 \
-storageNode=group2/host6
-replicationFactor=g1:3 \
-storageNode=g1/host1,g1/host2,g1/host3 \
-replicationFactor=g2:2 \
-storageNode=g2/host4,g2/host5,g2/host6 \
-replicationFactor=g3:1 \
-storageNode=g3/host7,g3/host8,g3/host9
```
If every ingested sample is replicated across multiple `vmstorage` groups, then pass `-globalReplicationFactor=N` command-line flag to `vmselect`,
so it could continue returning full responses if up to `N-1` `vmstorage` groups are temporarily unavailable.
For example, the following command runs `vmselect`, which continues returning full responses if any number of `vmstorage` nodes
in a single `vmstorage` group are temporarily unavailable:
```
/path/to/vmselect \
-globalReplicationFactor=2 \
-storageNode=g1/host1,g1/host2,g1/host3 \
-storageNode=g2/host4,g2/host5,g2/host6 \
-storageNode=g3/host7,g3/host8,g3/host9
```
It is OK to mix `-replicationFactor` and `-globalReplicationFactor`. For example, the following command runs `vmselect`, which continues returning full responses
if any number of `vmstorage` nodes in a single `vmstorage` group are temporarily unavailable and each of the remaining groups contains up to two unavailable `vmstorage` nodes:
```
/path/to/vmselect \
-globalReplicationFactor=2 \
-replicationFactor=3 \
-storageNode=g1/host1,g1/host2,g1/host3 \
-storageNode=g2/host4,g2/host5,g2/host6 \
-storageNode=g3/host7,g3/host8,g3/host9
```
See also [multi-level cluster setup](#multi-level-cluster-setup).
## Helm
Helm chart simplifies managing cluster version of VictoriaMetrics in Kubernetes.
@ -1360,6 +1382,8 @@ Below is the output for `/path/to/vmselect -help`:
Flag value can be read from the given file when using -flagsAuthKey=file:///abs/path/to/file or -flagsAuthKey=file://./relative/path/to/file . Flag value can be read from the given http/https url when using -flagsAuthKey=http://host/path or -flagsAuthKey=https://host/path
-fs.disableMmap
Whether to use pread() instead of mmap() for reading data files. By default, mmap() is used for 64-bit arches and pread() is used for 32-bit arches, since they cannot read data files bigger than 2^32 bytes in memory. mmap() is usually faster for reading small data chunks than pread()
-globalReplicationFactor int
How many copies of every ingested sample is available across vmstorage groups. vmselect continues returning full responses when up to globalReplicationFactor-1 vmstorage groups are temporarily unavailable. See https://docs.victoriametrics.com/cluster-victoriametrics/#vmstorage-groups-at-vmselect . See also -replicationFactor (default 1)
-http.connTimeout duration
Incoming connections to -httpListenAddr are closed after the configured timeout. This may help evenly spreading load among a cluster of services behind TCP-level load balancer. Zero value disables closing of incoming connections (default 2m0s)
-http.disableResponseCompression
@ -1461,7 +1485,7 @@ Below is the output for `/path/to/vmselect -help`:
Supports an array of values separated by comma or specified via multiple flags.
Value can contain comma inside single-quoted or double-quoted string, {}, [] and () braces.
-replicationFactor array
How many copies of every time series is available on the provided -storageNode nodes. vmselect continues returning full responses when up to replicationFactor-1 vmstorage nodes are temporarily unavailable during querying. See also -search.skipSlowReplicas (default 1)
How many copies of every ingested sample is available across -storageNode nodes. vmselect continues returning full responses when up to replicationFactor-1 vmstorage nodes are temporarily unavailable. See also -globalReplicationFactor and -search.skipSlowReplicas (default 1)
Supports an array of `key:value` entries separated by comma or specified via multiple flags.
-search.cacheTimestampOffset duration
The maximum duration since the current time for response data, which is always queried from the original raw data, without using the response cache. Increase this value if you see gaps in responses due to time synchronization issues between VictoriaMetrics and data sources (default 5m0s)

@ -35,9 +35,12 @@ import (
)
var (
replicationFactor = flagutil.NewDictInt("replicationFactor", 1, "How many copies of every time series is available on the provided -storageNode nodes. "+
"vmselect continues returning full responses when up to replicationFactor-1 vmstorage nodes are temporarily unavailable during querying. "+
"See also -search.skipSlowReplicas")
globalReplicationFactor = flag.Int("globalReplicationFactor", 1, "How many copies of every ingested sample is available across vmstorage groups. "+
"vmselect continues returning full responses when up to globalReplicationFactor-1 vmstorage groups are temporarily unavailable. "+
"See https://docs.victoriametrics.com/cluster-victoriametrics/#vmstorage-groups-at-vmselect . See also -replicationFactor")
replicationFactor = flagutil.NewDictInt("replicationFactor", 1, "How many copies of every ingested sample is available across -storageNode nodes. "+
"vmselect continues returning full responses when up to replicationFactor-1 vmstorage nodes are temporarily unavailable. "+
"See also -globalReplicationFactor and -search.skipSlowReplicas")
skipSlowReplicas = flag.Bool("search.skipSlowReplicas", false, "Whether to skip -replicationFactor - 1 slowest vmstorage nodes during querying. "+
"Enabling this setting may improve query speed, but it could also lead to incomplete results if some queried data has less than -replicationFactor "+
"copies at vmstorage nodes. Consider enabling this setting only if all the queried data contains -replicationFactor copies in the cluster")
@ -1870,6 +1873,7 @@ func (snr *storageNodesRequest) collectResults(partialResultsCounter *metrics.Co
groupsCount := sns[0].group.groupsCount
resultsCollectedPerGroup := make(map[*storageNodesGroup]int, groupsCount)
errsPartialPerGroup := make(map[*storageNodesGroup][]error)
groupsPartial := make(map[*storageNodesGroup]struct{})
for range sns {
// There is no need in timer here, since all the goroutines executing the f function
// passed to startStorageNodesRequest must be finished until the deadline.
@ -1895,6 +1899,12 @@ func (snr *storageNodesRequest) collectResults(partialResultsCounter *metrics.Co
errsPartialPerGroup[group] = append(errsPartialPerGroup[group], err)
if snr.denyPartialResponse && len(errsPartialPerGroup[group]) >= group.replicationFactor {
groupsPartial[group] = struct{}{}
if len(groupsPartial) < *globalReplicationFactor {
// Ignore this error, since the number of groups with partial results is smaller than the globalReplicationFactor.
continue
}
// Return the error to the caller if partial responses are denied
// and the number of partial responses for the given group reach its replicationFactor,
// since this means that the response is partial.
@ -1932,36 +1942,41 @@ func (snr *storageNodesRequest) collectResults(partialResultsCounter *metrics.Co
}
// Verify whether the full result can be returned
isFullResponse := true
failedGroups := 0
for g, errsPartial := range errsPartialPerGroup {
if len(errsPartial) >= g.replicationFactor {
isFullResponse = false
break
failedGroups++
}
}
if isFullResponse {
// Assume that the result is full if the the number of failing vmstorage nodes
// is smaller than the replicationFactor per each group.
if failedGroups < *globalReplicationFactor {
// Assume that the result is full if the number of failed groups is smaller than the globalReplicationFactor.
return false, nil
}
// Verify whether there is at least a single node per each group, which successfully returned result,
// in order to return partial result.
// Verify whether at least a single node per each group successfully returned a result, so that a partial result can be returned.
missingGroups := 0
var firstErr error
for g, errsPartial := range errsPartialPerGroup {
if len(errsPartial) == g.nodesCount {
// All the vmstorage nodes at the given group g returned error.
// Return only the first error, since it has no sense in returning all errors.
// Returns 503 status code for partial response, so the caller could retry it if needed.
err := &httpserver.ErrorWithStatusCode{
Err: errsPartial[0],
StatusCode: http.StatusServiceUnavailable,
missingGroups++
if firstErr == nil {
// Return only the first error, since there is no sense in returning all the errors.
firstErr = errsPartial[0]
}
return false, err
}
if len(errsPartial) > 0 {
partialErrorsLogger.Warnf("%d out of %d vmstorage nodes at group %q were unavailable during the query; a sample error: %s", len(errsPartial), len(sns), g.name, errsPartial[0])
}
}
if missingGroups >= *globalReplicationFactor {
// Too many groups have all their vmstorage nodes unavailable.
// Returns 503 status code, so the caller could retry it if needed.
err := &httpserver.ErrorWithStatusCode{
Err: firstErr,
StatusCode: http.StatusServiceUnavailable,
}
return false, err
}
// Return partial results.
// This allows continuing returning responses in the case
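The two checks in the hunk above can be sketched as a standalone function. This is a simplified illustration rather than the code from `collectResults`: the `groupState` type, the `outcome` values and the `classify` function are hypothetical names, and the sketch assumes the per-group error counts have already been collected.

```go
package main

import "fmt"

// groupState is a hypothetical summary of a single vmstorage group after a query.
type groupState struct {
	name              string
	nodesCount        int // total number of vmstorage nodes in the group
	failedNodes       int // number of nodes in the group which returned an error
	replicationFactor int // per-group -replicationFactor
}

// outcome describes what can be returned to the caller.
type outcome int

const (
	fullResponse outcome = iota
	partialResponse
	unavailable
)

// classify applies the two checks from the hunk above:
//   - the response is full if fewer than globalRF groups have
//     failedNodes >= replicationFactor;
//   - otherwise the query fails if at least globalRF groups have no working nodes;
//   - otherwise a partial response can be returned.
func classify(groups []groupState, globalRF int) outcome {
	failedGroups := 0
	missingGroups := 0
	for _, g := range groups {
		if g.failedNodes >= g.replicationFactor {
			failedGroups++
		}
		if g.failedNodes > 0 && g.failedNodes == g.nodesCount {
			missingGroups++
		}
	}
	if failedGroups < globalRF {
		return fullResponse
	}
	if missingGroups >= globalRF {
		return unavailable
	}
	return partialResponse
}

func main() {
	// Three groups of three nodes, -replicationFactor=3, -globalReplicationFactor=2.
	groups := []groupState{
		{name: "g1", nodesCount: 3, failedNodes: 3, replicationFactor: 3},
		{name: "g2", nodesCount: 3, failedNodes: 2, replicationFactor: 3},
		{name: "g3", nodesCount: 3, failedNodes: 0, replicationFactor: 3},
	}
	fmt.Println(classify(groups, 2) == fullResponse)
}
```

In this configuration only `g1` counts as failed, so the number of failed groups stays below `-globalReplicationFactor=2` and the response is treated as full.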

@ -30,6 +30,7 @@ See also [LTS releases](https://docs.victoriametrics.com/lts-releases/).
## tip
* FEATURE: [VictoriaMetrics cluster](https://docs.victoriametrics.com/cluster-victoriametrics/): add support for fault domain awareness to `vmselect`. It can be configured to return full responses if up to `-globalReplicationFactor - 1` fault domains (aka `vmstorage` groups) are unavailable. See [this feature request](https://github.com/VictoriaMetrics/VictoriaMetrics/issues/6054) and [these docs](https://docs.victoriametrics.com/cluster-victoriametrics/#vmstorage-groups-at-vmselect).
* FEATURE: all VictoriaMetrics [enterprise](https://docs.victoriametrics.com/enterprise/) components: add support for automatic issuing of TLS certificates for HTTPS server at `-httpListenAddr` via [Let's Encrypt service](https://letsencrypt.org/). See [these docs](https://docs.victoriametrics.com/#automatic-issuing-of-tls-certificates) and [this feature request](https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5949).
* FEATURE: [vmagent](https://docs.victoriametrics.com/vmagent/): support [DNS SRV](https://en.wikipedia.org/wiki/SRV_record) addresses in `-remoteWrite.url` command-line option and in scrape target urls. For example, `-remoteWrite.url=http://srv+victoria-metrics/api/v1/write` automatically resolves the `victoria-metrics` DNS SRV to a list of hostnames with TCP ports and then sends the collected metrics to these TCP addresses. See [these docs](https://docs.victoriametrics.com/vmagent/#srv-urls) and [this feature request](https://github.com/VictoriaMetrics/VictoriaMetrics/issues/6053).
* FEATURE: [vmauth](https://docs.victoriametrics.com/vmauth/): support automatic discovering and load balancing for TCP addresses behind DNS SRV addresses. These addresses can be put inside `url_prefix` urls in the form `http://srv+addr/path`, where the `addr` is the [DNS SRV](https://en.wikipedia.org/wiki/SRV_record) address, which is automatically resolved to hostnames with TCP ports. See [these docs](https://docs.victoriametrics.com/vmauth/#srv-urls) for details.

@ -784,38 +784,60 @@ These issues are addressed by [vmagent](https://docs.victoriametrics.com/vmagent
`vmselect` can be configured to query multiple distinct groups of `vmstorage` nodes with individual `-replicationFactor` per each group.
The following format for `-storageNode` command-line flag value should be used for assigning a particular `addr` of `vmstorage` to a particular `groupName` -
`-storageNode=groupName/addr`. For example, the following command runs `vmselect`, which continues returning full responses if up to one node per each group is temporarily unavailable
`-storageNode=groupName/addr`. The `groupName` can contain an arbitrary value. The only rule is that every `vmstorage` group must have a unique name.
For example, the following command runs `vmselect`, which continues returning full responses if up to one node per each group is temporarily unavailable
because the given `-replicationFactor=2` is applied individually per each group:
```
/path/to/vmselect \
-replicationFactor=2 \
-storageNode=group1/host1 \
-storageNode=group1/host2 \
-storageNode=group1/host3 \
-storageNode=group2/host4 \
-storageNode=group2/host5 \
-storageNode=group2/host6 \
-storageNode=group3/host7 \
-storageNode=group3/host8 \
-storageNode=group3/host9
-storageNode=g1/host1,g1/host2,g1/host3 \
-storageNode=g2/host4,g2/host5,g2/host6 \
-storageNode=g3/host7,g3/host8,g3/host9
```
It is possible to specify distinct `-replicationFactor` per each group via the following format - `-replicationFactor=groupName:rf`.
For example, the following command runs `vmselect`, which uses `-replicationFactor=3` for the `group1`, while it uses `-replicationFactor=1` for the `group2`:
A distinct `-replicationFactor` can be specified per each group via the following format - `-replicationFactor=groupName:rf`.
For example, the following command runs `vmselect`, which uses `-replicationFactor=3` for the group `g1`, `-replicationFactor=2` for the group `g2`
and `-replicationFactor=1` for the group `g3`:
```
/path/to/vmselect \
-replicationFactor=group1:3 \
-storageNode=group1/host1 \
-storageNode=group1/host2 \
-storageNode=group1/host3 \
-replicationFactor=group2:1 \
-storageNode=group2/host4 \
-storageNode=group2/host5 \
-storageNode=group2/host6
-replicationFactor=g1:3 \
-storageNode=g1/host1,g1/host2,g1/host3 \
-replicationFactor=g2:2 \
-storageNode=g2/host4,g2/host5,g2/host6 \
-replicationFactor=g3:1 \
-storageNode=g3/host7,g3/host8,g3/host9
```
If every ingested sample is replicated across multiple `vmstorage` groups, then pass `-globalReplicationFactor=N` command-line flag to `vmselect`,
so it could continue returning full responses if up to `N-1` `vmstorage` groups are temporarily unavailable.
For example, the following command runs `vmselect`, which continues returning full responses if any number of `vmstorage` nodes
in a single `vmstorage` group are temporarily unavailable:
```
/path/to/vmselect \
-globalReplicationFactor=2 \
-storageNode=g1/host1,g1/host2,g1/host3 \
-storageNode=g2/host4,g2/host5,g2/host6 \
-storageNode=g3/host7,g3/host8,g3/host9
```
It is OK to mix `-replicationFactor` and `-globalReplicationFactor`. For example, the following command runs `vmselect`, which continues returning full responses
if any number of `vmstorage` nodes in a single `vmstorage` group are temporarily unavailable and each of the remaining groups contains up to two unavailable `vmstorage` nodes:
```
/path/to/vmselect \
-globalReplicationFactor=2 \
-replicationFactor=3 \
-storageNode=g1/host1,g1/host2,g1/host3 \
-storageNode=g2/host4,g2/host5,g2/host6 \
-storageNode=g3/host7,g3/host8,g3/host9
```
See also [multi-level cluster setup](#multi-level-cluster-setup).
## Helm
Helm chart simplifies managing cluster version of VictoriaMetrics in Kubernetes.
@ -1371,6 +1393,8 @@ Below is the output for `/path/to/vmselect -help`:
Flag value can be read from the given file when using -flagsAuthKey=file:///abs/path/to/file or -flagsAuthKey=file://./relative/path/to/file . Flag value can be read from the given http/https url when using -flagsAuthKey=http://host/path or -flagsAuthKey=https://host/path
-fs.disableMmap
Whether to use pread() instead of mmap() for reading data files. By default, mmap() is used for 64-bit arches and pread() is used for 32-bit arches, since they cannot read data files bigger than 2^32 bytes in memory. mmap() is usually faster for reading small data chunks than pread()
-globalReplicationFactor int
How many copies of every ingested sample is available across vmstorage groups. vmselect continues returning full responses when up to globalReplicationFactor-1 vmstorage groups are temporarily unavailable. See https://docs.victoriametrics.com/cluster-victoriametrics/#vmstorage-groups-at-vmselect . See also -replicationFactor (default 1)
-http.connTimeout duration
Incoming connections to -httpListenAddr are closed after the configured timeout. This may help evenly spreading load among a cluster of services behind TCP-level load balancer. Zero value disables closing of incoming connections (default 2m0s)
-http.disableResponseCompression
@ -1472,7 +1496,7 @@ Below is the output for `/path/to/vmselect -help`:
Supports an array of values separated by comma or specified via multiple flags.
Value can contain comma inside single-quoted or double-quoted string, {}, [] and () braces.
-replicationFactor array
How many copies of every time series is available on the provided -storageNode nodes. vmselect continues returning full responses when up to replicationFactor-1 vmstorage nodes are temporarily unavailable during querying. See also -search.skipSlowReplicas (default 1)
How many copies of every ingested sample is available across -storageNode nodes. vmselect continues returning full responses when up to replicationFactor-1 vmstorage nodes are temporarily unavailable. See also -globalReplicationFactor and -search.skipSlowReplicas (default 1)
Supports an array of `key:value` entries separated by comma or specified via multiple flags.
-search.cacheTimestampOffset duration
The maximum duration since the current time for response data, which is always queried from the original raw data, without using the response cache. Increase this value if you see gaps in responses due to time synchronization issues between VictoriaMetrics and data sources (default 5m0s)