github-mirrors/VictoriaMetrics

mirror of https://github.com/VictoriaMetrics/VictoriaMetrics.git synced 2024-12-11 14:53:49 +00:00

Author	SHA1	Message	Date
Hui Wang	9616814728	vmalert: integrate with victorialogs (#7255) address https://github.com/VictoriaMetrics/VictoriaMetrics/issues/6706. See https://github.com/VictoriaMetrics/VictoriaMetrics/blob/vmalert-support-vlog-ds/docs/VictoriaLogs/vmalert.md. Related fix https://github.com/VictoriaMetrics/VictoriaMetrics/pull/7254. Note: in this pull request, vmalert doesn't support [backfilling](https://github.com/VictoriaMetrics/VictoriaMetrics/blob/vmalert-support-vlog-ds/docs/VictoriaLogs/vmalert.md#rules-backfilling) for rules with a customized time filter. It might be added in the future, see [this issue](https://github.com/VictoriaMetrics/VictoriaMetrics/issues/7289) for details. Feature can be tested with image `victoriametrics/vmalert:heads-vmalert-support-vlog-ds-0-g420629c-scratch`. --------- Signed-off-by: hagen1778 <roman@victoriametrics.com> Co-authored-by: hagen1778 <roman@victoriametrics.com> (cherry picked from commit `68bad22fd2`)	2024-10-29 16:32:00 +01:00
Nikolay	5025ede7bc	lib/mergeset: adds tracking for indexdb records drop (#6297) It allows creating an alert for possible item drops in indexdb. This may happen if the ingested metric size exceeds the max indexdb item size. --------- Signed-off-by: hagen1778 <roman@victoriametrics.com> Co-authored-by: Zakhar Bessarab <z.bessarab@victoriametrics.com> Co-authored-by: hagen1778 <roman@victoriametrics.com> (cherry picked from commit `69d244e6fb`) Signed-off-by: hagen1778 <roman@victoriametrics.com>	2024-05-24 16:08:34 +02:00
Corporte Gadfly	916170cca2	deployment: minor grammatical fixes in alert descriptions (#6199) (cherry picked from commit `8bca4d2de4`)	2024-04-30 10:30:17 +02:00
Hui Wang	0cd0ddc1c1	deployment/dashboards: fix `Storage full ETA` panels (#5747) During background downsampling, rate(vm_deduplicated_samples_total{type="merge"}) could be much bigger than rate(vm_rows_added_to_storage_total) for quite some time, which causes negative values of Storage full ETA and confuses users, see playground. Instead of trying to get more accurate results during downsampling, I think it's OK to ignore vm_deduplicated_samples_total entirely; it's more reasonable to see Storage full ETA increase after downsampling. --------- Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>	2024-02-08 12:54:31 +02:00
hagen1778	343176438e	deployment/alerts: add `job` label to `DiskRunsOutOfSpace` alerting rule So it is easier to understand to which installation the triggered instance belongs. Signed-off-by: hagen1778 <roman@victoriametrics.com>	2024-01-16 22:21:34 +02:00
hagen1778	383dce7201	alerts: simplify aggregation of alerting rules This is a follow-up after `75196d7234`. It updates some of the alerting rules to remove unnecessary aggregations. It keeps aggregations for expressions which use multiple time series filters to make sure their labels will match. Signed-off-by: hagen1778 <roman@victoriametrics.com> (cherry picked from commit `8fb68152e6`)	2023-12-11 15:38:16 +01:00
hagen1778	f2b06484f2	alerts: move `ConcurrentFlushesHitTheLimit` alert to health alerts The `ConcurrentFlushesHitTheLimit` could be related to components like vminsert, vmstorage, vm-single-node and vmagent. Moving this alert to the `health` section of alerts will be beneficial for all components and will remove the duplicates from single/cluster alerts. Signed-off-by: hagen1778 <roman@victoriametrics.com>	2023-08-11 04:39:28 -07:00
Aliaksandr Valialkin	531b35b6c0	docs/Troubleshooting.md: document an additional case, which could result in slow inserts If `-cacheExpireDuration` is lower than the interval between ingested samples for the same time series, then `vm_slow_row_inserts_total` metric is increased. See https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3976#issuecomment-1476883183	2023-03-20 14:33:27 -07:00
Aliaksandr Valialkin	b275983403	lib/writeconcurrencylimiter: improve the logic behind -maxConcurrentInserts limit Previously the -maxConcurrentInserts was limiting the number of established client connections, which write data to VictoriaMetrics. Some of these connections could be idle. Such connections do not consume large amounts of CPU and RAM, so there is little sense in limiting the number of such connections. So now the -maxConcurrentInserts command-line option limits the number of concurrently executed insert requests, not including idle connections. It is recommended to remove the -maxConcurrentInserts command-line option, since the default value for this option should work well for most cases.	2023-01-06 22:07:16 -08:00
Roman Khavronenko	70c3664b40	alerts: remove `show_at` label for RequestErrorsToAPI alert (#3455) Alert `RequestErrorsToAPI` could be permanently triggered due to mistakes in clients configuration. However, such requests are unlikely to cause VM health state change. So there is no need to display this alert, because there will be no correlation caused by it. Signed-off-by: hagen1778 <roman@victoriametrics.com>	2022-12-07 09:48:59 -08:00
Aliaksandr Valialkin	d2e34b8052	{dashboards,alerts}: substitute `{type="indexdb"}` with `{type=~"indexdb.*"}` inside queries after `8189770c50` Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3337	2022-12-05 16:00:42 -08:00
Roman Khavronenko	b2f45b4856	dashboards: update VM single dash (#3400) The change list is the following: * bump Grafana version to 9.2.6; * replace old "Graph" panel with "TimeSeries" panel; * show % usage of Mem and CPU in addition to absolute values; * `Caches` row was removed. All needed info for caches is now part of `Troubleshooting`; * add Annotations for Alert triggers. Not all alerts are supposed to be displayed on the dashboard, but only those with label `show_at: dashboard`. See `alerts.yml` change. Signed-off-by: hagen1778 <roman@victoriametrics.com>	2022-11-29 20:39:05 -08:00
Zakhar Bessarab	434b00cee8	docker-compose: move `TooManyLogs` into `vm-health` alerts set (#3199)	2022-10-05 22:42:31 +03:00
Roman Khavronenko	f772ee8326	deployment/docker: move cluster compose env to master branch (#3130) * deployment/docker: move cluster compose env to master branch The change is supposed to simplify the maintenance of single/cluster docker-compose envs, alerts, dashboards. It is also supposed to reduce confusion for users when looking for cluster related alerts/configs. Signed-off-by: hagen1778 <roman@victoriametrics.com> * deployment/docker: move cluster compose env to master branch Review updates. Signed-off-by: hagen1778 <roman@victoriametrics.com>	2022-09-21 12:03:10 +03:00
Roman Khavronenko	23e85e0fc5	vmagent: expose metric `vmagent_remotewrite_queues` (#2871) The new metric `vmagent_remotewrite_queues` exports the static number of configured remote write queues. This metric is useful to calculate total saturation for each configured URL with the given number of queues. See corresponding changes to vmagent alerts and dashboard. Signed-off-by: hagen1778 <roman@victoriametrics.com>	2022-07-18 14:41:04 +03:00
Roman Khavronenko	a42063909f	alerts: correct expression for `DiskRunsOutOfSpaceIn3Days` (#2856) The negative value for ETA can happen when deduplication is enabled and `rate` over `vm_deduplicated_samples_total` becomes bigger than actual ingestion rate. Signed-off-by: hagen1778 <roman@victoriametrics.com>	2022-07-12 14:14:47 +02:00
Yurii Kravets	14397ba23e	Changed the level type in alerts.yml for TooManyLogs alert (#2759) alerts: filter out non-error log messages for `TooManyLogs` Info and Warn error levels aren't always a result of malfunctioning or faulty state. So we filter them out.	2022-06-20 16:45:52 +02:00
Roman Khavronenko	3458a3d593	Monitoring cluster (#2191) * dashboards: add `CPU percentage` panel for cluster dashboards The new panel `CPU percentage` was added instead of adding a limit to the existing `CPU` panel because the dashboard may display a large number of components, each with its own limits. The separate panel should provide a clear display of CPU load. Signed-off-by: hagen1778 <roman@victoriametrics.com> * dashboards: sync vmagent and vmalert changes from single version Signed-off-by: hagen1778 <roman@victoriametrics.com> * docker: remove unsupported param from vmagent config Signed-off-by: hagen1778 <roman@victoriametrics.com> * alerts: add `TooHighCPUUsage` alert for all VM components Signed-off-by: hagen1778 <roman@victoriametrics.com>	2022-02-15 11:57:58 +02:00
Roman Khavronenko	ada18cd963	Dashboards vmagent updates (#1973) * dashboards/vmagent: shuffle panels for better visibility More important error/dropped panels were moved higher on the main row. Network usage panel moved to Resource usage row. Signed-off-by: hagen1778 <roman@victoriametrics.com> * dashboards/vmagent: add Troubleshooting row to show top 5 instances/jobs by churn rate New panels are supposed to show the top 5 jobs or targets which generate most of the churn rate. They were placed into a new row "Troubleshooting". Signed-off-by: hagen1778 <roman@victoriametrics.com> * dashboards/vmagent: add panels for showing persistent queue saturation New panels were added to the Troubleshooting row to show the persistent queue saturation. The corresponding alerts were added and linked to these panels as well. Signed-off-by: hagen1778 <roman@victoriametrics.com> * dashboards/vmagent: add alert "RejectedRemoteWriteDataBlocksAreDropped" The new alert is supposed to send a notification when vmagent starts to drop data blocks rejected by the configured remote write destination. Signed-off-by: hagen1778 <roman@victoriametrics.com>	2021-12-20 12:19:17 +02:00
Aliaksandr Valialkin	e4ebcebc8a	deployment/docker/alerts.yml: formatting fixes after `865a60f13e`	2021-10-19 09:00:05 +03:00
Yurii Kravets	34f52de3a5	Update alerts.yml Added Series Limit day\hour alerts	2021-10-19 09:00:05 +03:00
Roman Khavronenko	18313f3f8e	Cluster dashboard update (#1594) * dashboards: sync `vmagent` updates from master branch * dashboards: add new `Storage connection saturation` panel for cluster dashboard * dashboards: add new cluster alert for corresponding `Storage connection saturation` panel	2021-09-01 17:05:17 +03:00
Roman Khavronenko	af8c1feddb	Single dashboards upd (#1593) * dashboard: replace `null` datasources A `null` datasource value may confuse Grafana and make it drop the panel query in some versions. * docker: bump grafana image version * dashboards: add URL variable selector to vmagent dashboard * dashboards: add new panel `Remote write connection saturation` to vmagent dashboard * alerts: add new alert for `Remote write connection saturation` panel of vmagent dashboard * dashboards: add "Logging rate" panel to vmagent dashboard	2021-09-01 12:24:55 +03:00
Max Golionko	738741ab0d	rename group for cluster (#1546) rename group for cluster, so that they do not overlap when you have vmsingle and vmcluster deployed alongside each other	2021-08-18 16:03:04 +03:00
Roman Khavronenko	d63842cdbe	Cluster alerts (#1513) * alerts: move `ProcessNearFDLimits` to `vm-health` group since it is relevant for all services * alerts: add new `TooHighMemoryUsage` alerting rule	2021-08-02 17:54:24 +03:00
Roman Khavronenko	ce3f087d46	alerts: sync alert expression for `DiskRunsOutOfSpaceIn3Days` with dashboard (#1435)	2021-07-07 00:47:08 +03:00
k1rk	c6c789db8f	rename serviceHealth group name to vm-health (#1360) this causes conflicts in `victoria-metrics-k8s-stack` chart =)	2021-06-09 02:26:21 +03:00
Aliaksandr Valialkin	1c09e71f5b	app/vminsert: add `-disableRerouting` command-line flag for disabling re-routing if some vmstorage nodes have lower performance than the others Refactor the rerouting mechanism and make it more resilient to cases when some of vmstorage nodes are temporarily unavailable. Reduce the probability of rerouting storm. Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/791 Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1054 Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1165	2021-06-04 04:33:52 +03:00
Roman Khavronenko	c6fc3fa94d	alerts: make alerting rule `RPCErrors` compatible with PromQL (#1204) Original query can't be executed via PromQL which results in error if expression is evaluated by Prometheus. The new expression is compatible with both engines.	2021-04-13 08:10:23 +03:00
Roman Khavronenko	c4f6b79d76	alerts: add `ServiceDown` alert to detect "dead" services (#1196)	2021-04-08 18:23:10 +03:00
Roman Khavronenko	51faea5e4b	deployment: add vmalert+alertmanager services and list of default alerts for cluster version (#1187)	2021-04-05 22:29:04 +03:00

31 commits