Lowering the threshold from 50% to 5% is better suited
for discovering an unhealthy system state. It also keeps it in
sync with the alert definition in the cluster branch.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* dashboards: plot cpu limits for vmagent, vmalert and vm-single dashboards
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* alerts: add `TooHighCPUUsage` alert for all VM components
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* dashboards: bump components version requirements
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* dashboards/vmagent: shuffle panels for better visibility
The more important error/dropped panels were moved higher on the main row.
The Network usage panel was moved to the Resource usage row.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* dashboards/vmagent: add Troubleshooting row to show top 5 instances/jobs by churn rate
The new panels show the top 5 jobs or targets which generate most
of the churn rate. They were placed into a new row "Troubleshooting".
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* dashboards/vmagent: add panels for showing persistent queue saturation
New panels were added to the Troubleshooting row to show the persistent queue
saturation. The corresponding alerts were added and linked to these
panels as well.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* dashboards/vmagent: add alert "RejectedRemoteWriteDataBlocksAreDropped"
The new alert is supposed to send a notification when vmagent starts to drop
data blocks rejected by the configured remote write destination.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* dashboard: replace `null` datasources
A `null` datasource value may confuse Grafana and make it drop the panel query
in some versions.
* docker: bump grafana image version
* dashboards: add URL variable selector to vmagent dashboard
* dashboards: add new panel `Remote write connection saturation` to vmagent dashboard
* alerts: add new alert for `Remote write connection saturation` panel of vmagent dashboard
* dashboards: add "Logging rate" panel to vmagent dashboard
The `TooHighChurnRate24h` alert is supposed to cover cases when the churn rate
is low but, over 24h, produces a number of new series several times higher
than the total number of active series.
* dashboard: update single node dashboard
* add panel `Open FDs` for file descriptors metrics;
* add panel `Disk writes/reads` to show the real read/write
load on storage layer;
* add `process_resident_memory_bytes` metric to memory usage panel;
* add stats panel to show available CPUs, memory and disk space;
* remove flags panel since it did not prove useful.
* alerts: add alert for reaching FDs limit
The list contains examples of alerting rules which can be executed
via `vmalert` to track the health state of VM components. It is assumed
that the list will be revised and calibrated for each system individually.
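
For illustration only, a rule for the FDs limit case could look roughly like the
sketch below. It is not the rule added in this change: the group name, alert
name, threshold of 100, `for` duration and annotation text are all assumptions
chosen for the example. It relies only on the standard `process_open_fds` and
`process_max_fds` process metrics.

```yaml
# Hypothetical example, not the rule shipped with this change.
groups:
  - name: example-vm-health            # assumed group name
    rules:
      - alert: ExampleNearFDLimit      # assumed alert name
        # Fires when fewer than 100 file descriptors remain available.
        # The value 100 is a placeholder and should be calibrated per system.
        expr: (process_max_fds - process_open_fds) < 100
        for: 5m                        # assumed duration
        labels:
          severity: critical
        annotations:
          summary: "Fewer than 100 file descriptors left on {{ $labels.instance }}"
```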