app/vmagent/remotewrite: fix vmagent panic on shutdown
Currently, when vmagent is stopping it first flushes pending series in remote write context and proceeds to stop streaming aggregation. This leads to streaming aggregation being unable to write results into pending timeseries (since it is already nil) and panic.
This can lead to losing some aggregation results being lost almost silently.
The fix is reordering flow to first stop streaming aggregation and flush all pending time series after that.
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* app/vmctl: add verbose output for docker installations or when TTY isn't available
* app/vmctl: fix tests
* app/vmctl: make vmctl interactive if no tty
* app/vmctl: cleanup
* app/vmctl: add comment
---------
Co-authored-by: Nikolay <nik@victoriametrics.com>
* vmalert: fix nil map assignment
The storage instance with nil map params was created for remote-read purposes.
And before change 7a9ae9de0d this map was ignored in ApplyParams.
Now, it started to be used and vmalert panics in runtime.
The fix properly inits map for at `NewVMStorage` and verifies it is not nil
on assignment in `ApplyParams`.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* vmalert: add to changelog
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* vmalert: properly clone Storage params
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* vmalert: properly clone Storage params
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* vmalert: properly clone Storage params
Signed-off-by: hagen1778 <roman@victoriametrics.com>
---------
Signed-off-by: hagen1778 <roman@victoriametrics.com>
The purpose of the change is too highlight what HA pair is
and how deduplication needs identical labels to be present
in raw samples.
See https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4367
Signed-off-by: hagen1778 <roman@victoriametrics.com>
This reverts the following commits:
- e0e16a2d36
- 2ce02a7fe6
The reason for revert: the updated logic breaks assumptions made
when fixing https://github.com/VictoriaMetrics/VictoriaMetrics/issues/2698 .
For example, if a time series stop receiving new samples during the first
day after the indexdb rotation, there are chances that the time series
won't be registered in the new indexdb. This is OK until the next indexdb
rotation, since the time series is registered in the previous indexdb,
so it can be found during queries. But the time series will become invisible
for search after the next indexdb rotation, while its data is still there.
There is also incompletely solved issue with the increased CPU and disk IO resource
usage just after the indexdb rotation. There was an attempt to fix it, but it didn't fix
it in full, while introducing the issue mentioned above. See https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1401
TODO: to find out the solution, which simultaneously solves the following issues:
- increased memory usage for setups high churn rate and long retention (e.g. what the reverted commit does)
- increased CPU and disk IO usage during indexdb rotation ( https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1401 )
- https://github.com/VictoriaMetrics/VictoriaMetrics/issues/2698
Possible solution - to create the new indexdb in one hour before the indexdb rotation
and to gradually pre-populate it with the needed index data during the last hour before indexdb rotation.
Then the new indexdb will contain all the needed data just after the rotation,
so it won't trigger increased CPU and disk IO.
- Document the change at docs/CHANGELOG.md
- Clarify comments for non-trivial code touched by the commit
- Improve the logic behind maybeCreateIndexes():
- Correctly create per-day indexes if the indexdb rotation is performed during
the first hour or the last hour of the day by UTC.
Previously there was a possibility of missing index entries on that day.
- Increase the duration for creating new indexes in the current indexdb for up to 22 hours
after indexdb rotation. This should reduce the increased resource usage
after indexdb rotation.
It is safe to postpone index creation for the current day until the last hour
of the current day after indexdb rotation by UTC, since the corresponding (date, ...)
entries exist in the previous indexdb.
- Search for TSID by (date, MetricName) in both the current and the previous indexdb.
Previously the search was performed only in the current indexdb. This could lead
to excess creation of per-day indexes for the current day just after indexdb rotation.
- Search for (date, metricID) entries in both the current and the previous indexdb.
Previously the search was performed only in the current indexdb. This could lead
to excess creation of per-day indexes for the current day just after indexdb rotation.
* added backup locking/unlocking against retention policy to vmbackupmanager
Signed-off-by: Alexander Marshalov <_@marshalov.org>
* added docs for new commands
Signed-off-by: Alexander Marshalov <_@marshalov.org>
* fix review comments
Signed-off-by: Alexander Marshalov <_@marshalov.org>
---------
Signed-off-by: Alexander Marshalov <_@marshalov.org>
* feat: improvement of the top queries page
* vmui/docs: enhancements to top queries page
* Apply suggestions from code review
---------
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
vmui: change default font size to 14px for better readability
vmui: fix bug with missing text on buttons in safari
---------
Co-authored-by: Roman Khavronenko <roman@victoriametrics.com>
* app/vmui: added Labels with the highest number of unique values
* app/vmui: cleanup
* app/vmui: cleanup
* app/vmui: add table description
* app/vmui: fix comment, updated CHANGELOG.md
* app/vmui: disable links
* app/vmui: added actions to the table, it will show values for selected label with the highest number of series
* app/vmui: fix comment
* vmalert: expand rule groups on anchor click
before, anchor click was only updating the URL.
To expand the group, user had to click on rule's block.
Now, group will toggle automatically.
* vmalert: allow filtering group in web UI
The new filter allows to filter groups and rules within
groups by: errors only or noMatch only.
The filtering supposed to help navigating big numbers of groups/rules.
Filtering is reflected in URL, so can be shared as a link.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Without reset, labels duplicates could have been added during stream aggregation.
Since `ctx.Labels` is reused during processing of many series, each series will
add its labels to the context. Even if the same labels were already addeded on prev
iteration. Now, we reset `ctx.Labels` on each iteration to contain so labels from
different series didn't interfere.
This could have cause exceeding of the limit on number of labels per pushed time series.
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4277
Signed-off-by: hagen1778 <roman@victoriametrics.com>
app/vmalert: detect alerting rules which don't match any series at all
vmalert starts to understand /query responses which contain object:
```
"stats":{"seriesFetched": "42"}
```
If object is present, vmalert parses it and populates a new field
`SeriesFetched`. This field is then used to populate the new metric
`vmalert_alerting_rules_last_evaluation_series_fetched` and to
display warnings in the vmalert's UI.
If response doesn't contain the new object (Prometheus or
VictoriaMetrics earlier than v1.90), then `SeriesFetched=nil`.
In this case, UI will contain no additional warnings.
And `vmalert_alerting_rules_last_evaluation_series_fetched` will
be set to `-1`. Negative value of the metric will help to compile
correct alerting rule in follow-up.
Thanks for the initial implementation to @Haleygo
See https://github.com/VictoriaMetrics/VictoriaMetrics/pull/4056
See https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4039
Signed-off-by: hagen1778 <roman@victoriametrics.com>
It appears that 90% usage for anonymous mem usage
is already concerning. So we lowering the threshold to 80%.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
When using `retentionTimezoneOffset` and having local timezone being more than 4 hours different from UTC indexdb retention calculation could return negative value. This caused indexdb rotation to get in loop.
Fix calculation of offset to use `retentionTimezoneOffset` value properly and add test to cover all legit timezone configs.
See:
- https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4207
- https://github.com/VictoriaMetrics/VictoriaMetrics/pull/4206
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
Co-authored-by: Nikolay <nik@victoriametrics.com>
Windows doesn't allow to remove dir with opened files. Usually it's a case for snapshots, hard cannot be removed if file is openned.
With this change, dir will be renamed and properly deleted at the next process start.
It's recommended to restart vmstorage/vmsingle for snapshots deletion completion periodically.
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/70
* lib/promscrape/discovery/kubernetes: add common labels to all ports discovered from endpoints
Sets
`__meta_kubernetes_endpoints_name` and `__meta_kubernetes_namespace` labels to all ports of pod.
Prometheus sets those labels to all ports in pod (0ab9553611/discovery/kubernetes/endpoints.go (L267C15-L269)) even if port is not matching any service.
See: #4154
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* lib/promscrape/discovery/kubernetes: fix test for updated discovery logic
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
---------
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
Properly return empty slices instead of nil for `/api/v1/rules` and `/api/v1/alerts` API handlers.
This improves compatibility with Grafana.
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4221
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Supports using `**` for `-rule` and `-rule.templates`: `dir/**/*.tpl` loads contents of dir and all subdirectories recursively.
See: #4041
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
Co-authored-by: Artem Navoiev <tenmozes@gmail.com>
Co-authored-by: Nikolay <nik@victoriametrics.com>
* lib/promscrape: adds filter for consul_sd_configs:
it allows advanced filtering for consul service discovery requests
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4183
* typo fix
* removes deprecation mentions since it's not relevant
* Update docs/CHANGELOG.md
Co-authored-by: Roman Khavronenko <roman@victoriametrics.com>
---------
Co-authored-by: Roman Khavronenko <roman@victoriametrics.com>
Templating of `-external.alert.source` is not expected to have access to the query which was causing runtime error when query function was passed as nil.
See: #4181
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* app/vmagent,lib/persistentqueue: show warning message if `--remoteWrite.maxDiskUsagePerURL` flag lower than 500MB
* app/vmagent,lib/persistentqueue: linter fix
* app/vmagent,lib/persistentqueue: fix comment
* feat: display heatmap in the explore metrics (#4111)
* fix: correct calc step for heatmap
* fix: remove spaces in the result of getDurationFromMilliseconds
* feat: add button "show today" to date picker
* feat: add comparison with the prev day (#3967)
* vmui/docs: add comparison of data to cardinality page
* feat: add WithTemplate page
* app/vmselect/prometheus: enable json mode for expand with expr API
* app/vmselect/prometheus: enable CORS and add content type
* feat: add api for expand with templates
* fix: remove console from useExpandWithExprs
* app/vmselect/prometheus: fix escaping
* vmui: integrate WITH template
* app/vmctl: check content type instead of form param
* fix: add content-type for fetch with-exprs
* fix: add a header to the server's response that allows the "Content-Type" header
* app/vmctl: added comment and cleanup
* app/vmctl: use format query param
---------
Co-authored-by: dmitryk-dk <kozlovdmitriyy@gmail.com>
* app/vmctl: add support for the different time format in the native binary protocol
* app/vmctl: update flag description, update CHANGELOG.md
* app/vmctl: add comment to exported function
* lib/httpserver: introduce `-http.maxConcurrentRequests` command-line flag
Introduce `-http.maxConcurrentRequests` command-line flag to protect
VM components from resource exhaustion during unexpected spikes of HTTP requests.
By default, the new flag's value is set to 0 which means no limits are applied.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* lib/httpserver: mention http.maxConcurrentRequests in docs
Signed-off-by: hagen1778 <roman@victoriametrics.com>
---------
Signed-off-by: hagen1778 <roman@victoriametrics.com>
- added info about metric `vm_vminsert_metrics_read_total`,
- small doc refactoring
- and added make-command for running docs in docker.
Signed-off-by: Alexander Marshalov <_@marshalov.org>
* vmalert: retry datasource requests with EOF or unexpected EOF errors
Retry failed read request on the closed connection one more time.
This may improve rules execution reliability when connection
between vmalert and datasource closes unexpectedly.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* vmalert: fix old tests
Signed-off-by: hagen1778 <roman@victoriametrics.com>
---------
Signed-off-by: hagen1778 <roman@victoriametrics.com>
This handler will instruct search engines that indexing is not allowed for the content exposed to the internet. This should help to address issues like #4128 when instances are exposed to the internet without authentication.
"How do We Keep Metrics for a Long Time in VictoriaMetrics" article is referenced twice in "Third-party articles and slides about VictoriaMetrics" section
Improperly configured -bigMergeConcurrency command-line flag usually leads to uncontrolled
growth of unmerged parts, which, in turn, increases CPU usage and query durations.
So it is better deprecating this flag. In rare cases -smallMergeConcurrency command-line flag
can be used instead for controlling the concurrency of background merges.
This makes it easier to understand exact point in time which is included in this backup.
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
The logic employed for re-using the previously loaded scrape target was broken initially.
The commit cc0427897c tried to fix it, but the new logic
became too complex and fragile. So it is better to just remove this logic,
since the targets from temporarily broken file should be eventually loaded on next
attempts every -promscrape.fileSDCheckInterval
This also allows removing fragile hacks around __vm_filepath label.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3989
* lib/promscrape: fix the problem with scrape work duplicates when file_sd_config can't be read
* lib/promscrape: clarified comment
* lib/promscrape: made better approach to handle a problem with growing []*ScrapeWork on each error when loading config
* lib/promscrape: added CHANGELOG.md
* Update docs/CHANGELOG.md
---------
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
* feat: add tips for working with the graph and legend
* feat: add the ability to collapse the legend
* vmui/docs: add the ability to collapse the legend
---------
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
* lib/storage: check for free disk space before opening tables
We check for free disk space before call to `openTable`,
so `Storage` can be set to ReadOnly before mergeWorkers start.
Before the change, there was a chance that merges will start
even if Storage has to start in ReadOnly mode because of
`-storage.minFreeDiskSpaceBytes` limit.
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4023
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* lib/storage: chore
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* Update lib/storage/storage.go
---------
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
- Make sure that the last successfully loaded config is used on hot-reload failure
- Properly cleanup resources occupied by already initialized aggregators
when the current aggregator fails to be initialized
- Expose distinct vmagent_streamaggr_config_reload* metrics per each -remoteWrite.streamAggr.config
This should simplify monitoring and debugging failed reloads
- Remove race condition at app/vminsert/common.MustStopStreamAggr when calling sa.MustStop() while sa
could be in use at realoadSaConfig()
- Remove lib/streamaggr.aggregator.hasState global variable, since it may negatively impact scalability
on system with big number of CPU cores at hasState.Store(true) call inside aggregator.Push().
- Remove fine-grained aggregator reload - reload all the aggregators on config change instead.
This simplifies the code a bit. The fine-grained aggregator reload may be returned back
if there will be demand from real users for it.
- Check -relabelConfig and -streamAggr.config files when single-node VictoriaMetrics runs with -dryRun flag
- Return back accidentally removed changelog for v1.87.4 at docs/CHANGELOG.md
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3639
Verifying status code helps to avoid misleading errors caused by attempt to parse unsuccessful response.
Related issue: #4034
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>