dashboards: cluster dashboard update (#3380)

The purpose of the update is to make the dashboard more usable
for large installations with many instances. Panels which previously showed
per-instance metrics (Mem, CPU) now show per-job metrics or min/max/avg
aggregations in % instead. This should help to immediately identify
resource shortages and keeps the dashboard usable for both small and big installations.
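
For example, a per-job CPU panel in % could be built from targets similar to the sketch below. The metric names (`process_cpu_seconds_total`, `process_cpu_cores_available`) and the `$job` variable are assumptions for illustration and may differ from the exact expressions used in the dashboard:

```
{
  "targets": [
    {
      "expr": "max(rate(process_cpu_seconds_total{job=~\"$job\"}[$__rate_interval]) / process_cpu_cores_available{job=~\"$job\"}) by (job) * 100",
      "legendFormat": "max CPU %, {{job}}"
    },
    {
      "expr": "avg(rate(process_cpu_seconds_total{job=~\"$job\"}[$__rate_interval]) / process_cpu_cores_available{job=~\"$job\"}) by (job) * 100",
      "legendFormat": "avg CPU %, {{job}}"
    }
  ]
}
```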

For cases when detailed info is needed, a new row `Drilldown` was added
to the bottom of the dashboard. Panels like Mem or CPU now contain
a `data-link` named `Drilldown` (shown on line click) which takes
the user to a more detailed panel.
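
A minimal sketch of how such a `data-link` could look in the panel JSON. The panel ID `66` and the exact URL parameters are assumptions for illustration; only the dashboard UID `oS7Bi_0Wz` is taken from the alert annotations in this commit:

```
{
  "fieldConfig": {
    "defaults": {
      "links": [
        {
          "title": "Drilldown",
          "url": "/d/oS7Bi_0Wz?viewPanel=66&var-instance=${__field.labels.instance}&${__url_time_range}"
        }
      ]
    }
  }
}
```

Clicking the link keeps the selected time range and pre-fills the `instance` variable, so the detailed panel opens with the same context.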

The change list is the following:
* bump Grafana version to 9.1.0;
* replace old "Graph" panel with "TimeSeries" panel;
* improve Uptime panel to show number of instances per job;
* show % usage of Mem and CPU instead of absolute values;
* remove the `Caches` row. All needed info for caches is now part of `Troubleshooting`;
* add `Drilldown` section for detailed resource usage;
* add Annotations for Alert triggers. Not all alerts are supposed to be displayed
on the dashboard, but only those with label `show_at: dashboard`.
See the `alerts-cluster.yml` change and the annotation sketch below.
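
A sketch of how such an annotation query could look in the dashboard JSON. It filters the `ALERTS` series exposed by the alerting rules on the new label; field values other than the `show_at` label (datasource variable `$ds`, name, color) are assumptions for illustration:

```
{
  "annotations": {
    "list": [
      {
        "name": "alerts",
        "datasource": "$ds",
        "enable": true,
        "expr": "ALERTS{alertstate=\"firing\", show_at=\"dashboard\"}",
        "titleFormat": "{{ alertname }}",
        "iconColor": "red"
      }
    ]
  }
}
```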

Signed-off-by: hagen1778 <roman@victoriametrics.com>

(dashboard JSON diff suppressed because it is too large)

alerts-cluster.yml:

@@ -54,6 +54,7 @@ groups:
for: 15m
labels:
severity: warning
show_at: dashboard
annotations:
dashboard: "http://localhost:3000/d/oS7Bi_0Wz?viewPanel=52&var-instance={{ $labels.instance }}"
summary: "Too many errors served for {{ $labels.job }} path {{ $labels.path }} (instance {{ $labels.instance }})"
@@ -72,6 +73,7 @@ groups:
for: 15m
labels:
severity: warning
show_at: dashboard
annotations:
dashboard: "http://localhost:3000/d/oS7Bi_0Wz?viewPanel=44&var-instance={{ $labels.instance }}"
summary: "Too many RPC errors for {{ $labels.job }} (instance {{ $labels.instance }})"
@@ -83,6 +85,7 @@ groups:
for: 15m
labels:
severity: warning
show_at: dashboard
annotations:
dashboard: "http://localhost:3000/d/oS7Bi_0Wz?viewPanel=133&var-instance={{ $labels.instance }}"
summary: "vmstorage on instance {{ $labels.instance }} is constantly hitting concurrent flushes limit"
@@ -179,6 +182,7 @@ groups:
for: 15m
labels:
severity: warning
show_at: dashboard
annotations:
dashboard: "http://localhost:3000/d/oS7Bi_0Wz?viewPanel=139&var-instance={{ $labels.instance }}"
summary: "Connection between vminsert on {{ $labels.instance }} and vmstorage on {{ $labels.addr }} is saturated"
@@ -186,3 +190,4 @@ groups:
is saturated by more than 90% and vminsert won't be able to keep up.\n
This usually means that more vminsert or vmstorage nodes must be added to the cluster in order to increase
the total number of vminsert -> vmstorage links."