datadb.rb contains logRows shards, which weren't freed up after the data ingestion
for the given per-day datadb is stopped. This leads to slow memory leak when VictoriaLogs runs
for multiple days without restarts. Avoid this memory leak by freeing up the logRows shards
after converting them to in-memory parts. Re-use the freed up logRows shards via a pool in order
to reduce the pressure on GC.
(cherry picked from commit ec6f33f526)
Use multiple independent logRows shards for storing the pending log entries before converting them to searchable parts.
Every shard is protected by its own mutex, so multiple CPU cores may add multiple log rows into datadb at the same time.
This increases the performance of BenchmarkStorageMustAddRows/rowsPerInsert-1, which ingests log rows own-by-one
from concurrently running goroutines, by 2x.
(cherry picked from commit 8ad81220d3)
This commit modifies the logging behavior for client network errors
(e.g., EOFs, timeouts) during the handshake process. They are now logged
as warnings instead of errors, as they are not actionable from the
server’s perspective. Here's some examples of such errors.
Timeouts during the initial read phase:
2025-04-09T07:08:59.323Z error
VictoriaMetrics/lib/vmselectapi/server.go:204 cannot perform
vmselect handshake with client "<REDACTED>": cannot read hello: cannot
read message with size 11: read tcp4 <REDACTED>-><REDACTED>: i/o
timeout; read only 0 bytes
EOFs occurring later in the handshake process:
2025-04-08T18:01:30.783Z error
VictoriaMetrics/lib/vmselectapi/server.go:204 cannot perform
vmselect handshake with client "<REDACTED>": cannot read isCompressed
flag: cannot read message with size 1: EOF; read only 0 bytes
By logging these as warnings, we reduce noise in error logs while
preserving valuble information for debug.
Previously, if `cpu.max` file has only `max` resource defined without
`period`, it was parsed incorrectly and silently drop error. While this
syntax is valid and actually used by some container runtimes. If period
is not defined, default value for it 100_000 must be used.
This commit fixes parsing function by using default value for period.
In addition, it adds zero value check, which fixes possible panic if
period has 0 value.
Related issue:
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/8808
This reverts commit fa6a32a39d.
Reason for revert: the broken tests were fixed on GOARCH=386 by skipping the check for the state size
after improting the state of stats function, since the state size depends on the hardware architecture.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/8710
### Describe Your Changes
HTTP/2 support is used by some S3-compatible storage providers, so
disabling it default leads to unexpected errors when trying to connect
to S3 endpoint.
For example, using MinIO as S3 storage backend: `net/http: HTTP/1.x
transport connection broken: malformed HTTP response`.
HTTP/2 was enabled by default previously, but while fixing inconsistency
e5f4826 commit disabled this by default.
cc: @valyala
### Checklist
The following checks are **mandatory**:
- [x] My change adheres to [VictoriaMetrics contributing
guidelines](https://docs.victoriametrics.com/contributing/).
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
### Describe Your Changes
Fixes which are required in order to build FIPS-compliant binaries.
These changes were originally added for enterprise version and synced to
opensource for consistency and easier maintenance.
- consistently use `hash/fnv` at `app/vmalert` when calculating
checksums. Usage of md5 is not allowed in FIPS mode.
- increase encryption keys size used in testing in order to allow tests
to successfully run in FIPS mode
### Checklist
The following checks are **mandatory**:
- [x] My change adheres to [VictoriaMetrics contributing
guidelines](https://docs.victoriametrics.com/contributing/).
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
### Describe Your Changes
Previously, it was possible to use any UTF-8 string to specify list of
labels. While this makes it easier to use it also leads to unexpected
parsing results in some cases (see
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/8584 as an
example).
Enforce specifying metric in format {label="value"...} in order to avoid
issues with unexpected parsing results.
### Checklist
The following checks are **mandatory**:
- [x] My change adheres to [VictoriaMetrics contributing
guidelines](https://docs.victoriametrics.com/contributing/).
---------
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Co-authored-by: hagen1778 <roman@victoriametrics.com>
(cherry picked from commit b1523f650d)
This reduces the overhead needed for converting the ingested log entries to searchable in-memory parts
when small number of log entries are passed to Storage.MustAddRows().
The BenchmarkStorageMustAddRows shows up to 10x performance increase for rowsPerInsert=1,
up to 5x performance increase for rowsPerInsert=10 and up to 2x performance increase for rowsPerInsert=100.
This should reduce CPU usage during data ingestion when every request contains small number of rows.
(cherry picked from commit 5491d54c11)
Previously, metric names stats API had a false assumption, that max
size of metric name is 256 byte. But this is configurable parameter with
4096 bytes max size. It triggered errors during API requests.
This commit replaces hard-coded 256 byte limit with common constant:
maxLabelValueSize. It has 16 MB limit.
In addition, this commit adds check for metric name stats tracker,
if metric name size exceeds default buffer limit, it will be allocated
directly on heap. It must be rare case, since most metric names has
16-64 byte size.
Related issue:
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/8759
Do not reset wc.labels in order to properly keep track of the number of used labels for the scrape,
and properly re-use the same number of wc.labels on subsequent scrapes.
See 12f26668a6 (r155481168)
### Describe Your Changes
fixes#8707
### Checklist
The following checks are **mandatory**:
- [ ] My change adheres to [VictoriaMetrics contributing
guidelines](https://docs.victoriametrics.com/contributing/).
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
Fix tests by adding accountID and projectID.
The tests were cherry-picked from master and were failing to build because
cluster version requires accountID and projectID.
Signed-off-by: Artem Fetishev <rtm@victoriametrics.com>
Using this package lets to manipulate time. In this particular case, it
lets to advance the time 61 second forward instantly.
A few side changes were necessary:
- Do not use fasttime in unit tests. The fasttime package starts a
goroutine outside the test bubble which causes the clock to be real, not
fake.
- Stop the time.Ticker explicitly and also stop idbNext. These two
create goroutines with infinite loops which causes the unit tests that
use synctest to hang forever. All goroutines created inside the bubble
must exit in order for the syntest to finish.
- synctest is an experimental package and requires an environment
variable to be set. The Makefile was changed to set it.
Signed-off-by: Artem Fetishev <rtm@victoriametrics.com>
### Describe Your Changes
The write concurrency limiter in ReadUncompressedData was previously
removed in
22d1b916bf
to avoid suboptimal behavior in certain scenarios. However, follow-up
reports—including issue
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/8674 and
production feedback from VictoriaMetrics Cloud—indicated a noticeable
degradation in performance after its removal.
To mitigate these regressions, this commit reintroduces the concurrency
limiter. A long-term, more optimal solution will be explored separately
in issue https://github.com/VictoriaMetrics/VictoriaMetrics/issues/8728.
TODO:
* [x] Changelog
### Checklist
The following checks are **mandatory**:
- [x] My change adheres to [VictoriaMetrics contributing
guidelines](https://docs.victoriametrics.com/contributing/).
---------
Co-authored-by: hagen1778 <roman@victoriametrics.com>
(cherry picked from commit 231810fe49)
This commit adds new fields - `requestsCount` and `lastRequestTimestamp`
to series count be metric names stats.
It allows to display an additional stats at explore cardinality page.
Stats will only be added if `storage.trackMetricNameStats` flag is set.
This change requires an update to RPC protocol in order to properly
marshal data.
In addition, this commit adds integration tests to TSDB stats API.
Related issue:
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/6145
This change unblocks testing pipelines in CI for other contributions.
The tests are commented because I don't have full understanding of
fixing them.
Related issue:
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/8710
---------
Signed-off-by: hagen1778 <roman@victoriametrics.com>
### Describe Your Changes
Under heavy load, vmagent's wirte concurrency limiter
(2ab53acce4/lib/writeconcurrencylimiter/concurrencylimiter.go (L111))
queues incoming requests. If a client's timeout is shorter than the wait
time in the
queue, the client may close the connection before vmagent starts
processing it. When vmagent then tries to read the request body, it
encounters an ambiguous `unexpected EOF` error
(https://github.com/VictoriaMetrics/VictoriaMetrics/issues/8675).
This commit adds more context to such errors to help users diagnose and
resolve
the issue when it's related to vmagent's own load and queuing behavior.
Possible user actions include:
- Lowering `-insert.maxQueueDuration` below the client's timeout.
- Increasing the client-side timeout, if applicable.
- Scaling up vmagent (e.g., adding more CPU resources).
- Increasing `-maxConcurrentInserts` if CPU capacity allows.
Steps to reproduce:
https://gist.github.com/makasim/6984e20f57bfd944411f56a7ebe5b6bf
### Checklist
The following checks are **mandatory**:
- [x] My change adheres to [VictoriaMetrics contributing
guidelines](https://docs.victoriametrics.com/contributing/).
This commit improves how vmagent selects the remote write protocol.
Previously, vmagent [performed a handshake
probe](0ff1a3b154/lib/protoparser/protoparserutil/vmproto_handshake.go (L11))
at
[startup](0ff1a3b154/app/vmagent/remotewrite/client.go (L173)):
- If the probe succeeded, it used the VictoriaMetrics (VM) protocol.
- If the probe failed, it downgraded to the Prometheus protocol.
- No protocol changes occurred after the initial probe at runtime.
However, this approach had limitations:
- If vmstorage was unavailable during vmagent startup, vmagent would
immediately downgrade to the Prometheus protocol, leading to higher
network usage unitl vmagent restarted. This case has been reported in
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/7615.
- If the remote write server was updated or downgraded (e.g., during a
fallback or migration), vmagent would not detect the protocol change. It
would continue retrying failed requests and eventually drop them.
Require a restart of vmagent to pick up the new protocol.
This commit introduces a more adaptive mechanism.
vmagent always starts with the VM protocol and downgrades to the
Prometheus protocol only if an unsupported media type or bad request
response is received.
When this happens, the protocol is downgraded for all future requests.
In-flight requests are re-packed from Zstd to Snappy and retried
immediately.
Snappy-encoded requests are dropped if an unsupported media type or bad
request is received (no retrying).
Additionally, the in-memory and persisted queues could mix snappy and
zstd encoded blocks. The proper encoding is decided before sending by
encoding.IsZstd function.
TODO:
* [x] Add tests
* [x] Update documentation
* [x] Changelog
* [x] Research on
[content-type](https://github.com/VictoriaMetrics/VictoriaMetrics/pull/8462#issuecomment-2786918054),
[accept-encoding](https://github.com/VictoriaMetrics/VictoriaMetrics/pull/8462#issuecomment-2786923382)
Fixes https://github.com/VictoriaMetrics/VictoriaMetrics/issues/7615#top
issue.
The following checks are **mandatory**:
- [x] My change adheres [VictoriaMetrics contributing
guidelines](https://docs.victoriametrics.com/contributing/).
'authKey' is well-known url and form param for VictoriaMetrics
components authorization. Previously, it could be printed into stdout
via httpserver error logger. It makes this authKey insecure and hard to
use.
This commit prevents from logging authKey defined at PostForm or as part
of url.Query.
It's recommneded to transfer authKey via PostForm and it should be
implemented at separate PRs.
Related issue:
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5973
---------
Signed-off-by: f41gh7 <nik@victoriametrics.com>
Previously, if ProfileName is set to empty value (as default). AWS s3
lib ignored any profile config defined with `-configProfilePath`.
This commit correctly configure client options and set profile name only
if it's set to non-empty value.
Related issue:
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/8668
When creating and listing snapshots, panic instead of returning an error
since errors are not recoverable anyway.
Also do not cleanup the filesystem on panic. Leave as is for further
manual inspection.
Signed-off-by: Artem Fetishev <rtm@victoriametrics.com>
This reduces memory usage when reading large response bodies because the underlying buffer
doesn't need to be re-allocated during the read of large response body in the buffer.
Also decompress response body under the processScrapedDataConcurrencyLimitCh .
This reduces CPU usage and RAM usage a bit when scraping thousands of targets.
This reduces memory usage for vmagent when scraping big number of targets at the cost of slightly higher CPU usage.
The increased CPU usage can be decreased by disabling tracking of stale markers either via -promscrape.noStaleMarkers
command-line flag or via `no_stale_markers: true` option at the scrape config pointed by -promscrape.config command-line flag.
See https://docs.victoriametrics.com/vmagent/#prometheus-staleness-markers
The pool is used mostly for obtaining byte buffers for responses from scrape targets.
There are no responses smaller than 256 bytes in practice, so there is no sense in maintaining
pools for byte slices up to 64 and 128 bytes.
Previously the maxLabelsLen could be updated with smaller value after it is updated to bigger value by concurrently running goroutines.
Prevent this by loading the latest maxLabelsLen value and updating it only if it is smaller than the current len(wc.labels)
before the exit from callback passed to stream.Parse.
While at it, return early from the callback on the sample_limit exceeding error,
since the rest of the code in the callback becomes no-op after wc.reset().
This simplifies following the logic in the code a bit.
Also remove outdated misleading comment in front of sw.pushData() call inside callbacks passed to stream.Parse.
This comment has no sense after every callback start working with its own goroutine-local wc.