Commit graph

563 commits

Author SHA1 Message Date
Aliaksandr Valialkin
d297b65089 lib/storage: add vm_cache_size_bytes{type="storage/hour_metric_ids"} metric 2019-11-13 20:26:05 +02:00
Aliaksandr Valialkin
31376fd353 deployment/docker: update docker image tag from v1.29.2-cluster to v1.29.3-cluster 2019-11-13 18:32:08 +02:00
Aliaksandr Valialkin
494ad0fdb3 lib/storage: remove inmemory index for recent hour, since it uses too much memory
Production workload shows that the index requires ~4KB of RAM per active time series.
This is too much for a high number of active time series, so let's delete this index.

Now the queries should fall back to the index for the current day instead of the index
for the recent hour. The query performance for the current day index should be good enough
given the 100M rows/sec scan speed per CPU core.
2019-11-13 18:08:58 +02:00
Aliaksandr Valialkin
90bde025f0 deployment/docker: update image tag from v1.29.0-cluster to v1.29.2-cluster 2019-11-13 15:24:44 +02:00
Aliaksandr Valialkin
633dd81bb5 lib/storage: add -disableRecentHourIndex flag for disabling inmemory index for recent hour
This may be useful for saving RAM with a high number of time series, aka high cardinality.
2019-11-13 15:10:12 +02:00
Aliaksandr Valialkin
f1620ba7c0 lib/storage: fix inmemory inverted index issues found in v1.29
Issues fixed:
- Slow startup times. Now the index is loaded from cache during start.
- High memory usage related to superfluous index copies every 10 seconds.
2019-11-13 13:35:38 +02:00
Aliaksandr Valialkin
87b39222be Revert "lib/fs: do not postpone directory removal on NFS error"
This reverts commit 21aeb02b46649ac9906cb37733f7b155a77a0db9.
2019-11-12 16:29:50 +02:00
Mike Poindexter
955a592106 Add test for invalid caching of tsids (#232)
* Add test for invalid caching of tsids

* Clean up error handling
2019-11-12 15:52:46 +02:00
Roman Khavronenko
ce8cc76a42 add links and fix cache metric name (#233) 2019-11-12 15:06:56 +02:00
Aliaksandr Valialkin
6afb7a50a9 deployment/docker: upgrade Grafana release from v6.4.3 to v6.4.4 2019-11-12 03:50:54 +02:00
Aliaksandr Valialkin
5b677a57e3 deployment/docker: upgrade Go from v1.13.4 to v1.13.4 2019-11-12 03:49:07 +02:00
Aliaksandr Valialkin
d420871d79 deployment/docker: upgrade docker image tag from v1.28.3-cluster to v1.29.0-cluster 2019-11-12 03:44:45 +02:00
Aliaksandr Valialkin
584d8362c8 deployment: update Prometheus from v2.13.0 to v2.14.0 2019-11-12 03:43:59 +02:00
Roman Khavronenko
828f0a2a4b prepare dashboard for external sharing (#231) 2019-11-12 00:23:24 +02:00
Oleg Kovalov
74ba42d111 fix misspelled words (#229) 2019-11-12 00:18:24 +02:00
Aliaksandr Valialkin
c48e39eea9 lib/storage: add tests for dateMetricIDCache 2019-11-11 13:21:05 +02:00
Aliaksandr Valialkin
bdc9045485 README.md: mention that replication doesn't save from disaster 2019-11-11 00:58:08 +02:00
Aliaksandr Valialkin
01801e9e03 dashboards: there will be no 1.28.4 release. It will be 1.29.0 2019-11-10 22:05:10 +02:00
Aliaksandr Valialkin
6bdde0d6d4 lib/storage: eliminate data race when updating lastSyncTime in dateMetricIDCache.Has 2019-11-10 22:04:23 +02:00
Roman Khavronenko
7247a7862d add description, churn rate panel, storage.ingestion rate panel (#228) 2019-11-10 20:32:10 +02:00
Aliaksandr Valialkin
5f52eb7653 lib/fs: do not postpone directory removal on NFS error
Continue trying to remove NFS directory on temporary errors for up to a minute.

The previous async removal process breaks in the following case during VictoriaMetrics start:

- VictoriaMetrics opens the index, finds incomplete merge transactions and starts replaying them.
- The transaction instructs removing old directories for parts, which were already merged into a bigger part.
- VictoriaMetrics removes these directories, but their removal is delayed due to NFS errors.
- VictoriaMetrics scans the partition directory after all the incomplete merge transactions are finished
  and finds directories, which should have been removed, but still weren't removed due to NFS errors.
- VictoriaMetrics panics when it finds an unexpected empty directory.

Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/162
2019-11-10 13:27:16 +02:00
Aliaksandr Valialkin
9ea2bd822e lib/storage: implement per-day inverted index 2019-11-10 00:20:32 +02:00
Aliaksandr Valialkin
5d8de72414 app/vmrestore: the upcoming release would be 1.29.0 2019-11-10 00:20:18 +02:00
Aliaksandr Valialkin
dea2f3efed lib/storage: use specialized cache for (date, metricID) entries
This improves ingestion performance.
2019-11-09 23:09:18 +02:00
Aliaksandr Valialkin
9a43902bd8 lib/storage: remove unused code from getMetricIDsForTimeRange: it is expected that time range is always non-zero 2019-11-09 19:03:51 +02:00
Aliaksandr Valialkin
c16e17dede lib/storage: properly set time range when deleting time series 2019-11-09 18:50:02 +02:00
Aliaksandr Valialkin
8126007c15 lib/storage: obtain all the time series ids from (tag->metricIDs) rows instead of (metricID->TSID) rows, since this is much faster 2019-11-09 18:04:26 +02:00
Aliaksandr Valialkin
50773348d3 lib/storage: small code prettifying 2019-11-09 14:01:24 +02:00
Aliaksandr Valialkin
44fa8226df lib/uint64set: remove superfluous check for item existence before deleting it in Set.Subtract 2019-11-09 14:01:24 +02:00
Aliaksandr Valialkin
0bc54c23ce lib/storage: inmemoryInvertedIndex prettifying 2019-11-09 14:01:24 +02:00
Aliaksandr Valialkin
46e67bb78c lib/storage: export vm_new_timeseries_created_total metric for determining time series churn rate 2019-11-08 19:58:21 +02:00
Aliaksandr Valialkin
0063c857f5 lib/storage: add inmemory inverted index for the last hour
It should improve performance for `last N hours` dashboards with update intervals smaller than 1 hour.
2019-11-08 19:37:46 +02:00
Aliaksandr Valialkin
33abbec6b4 app/vmselect/promql: adjust memory limits calculations for incremental aggregate functions
Incremental aggregate functions don't keep all the selected time series in memory -
they keep only up to GOMAXPROCS time series for incremental aggregations.

Take into account that the number of time series in RAM can be higher if they are split
into many groups with `by (...)` or `without (...)` modifiers.

This should reduce the number of `not enough memory for processing ... data points` false
positive errors.
2019-11-08 19:37:43 +02:00
Aliaksandr Valialkin
7d7fbf890e app/{vmbackup,vmrestore}: add vmbackup and vmrestore tools for creating backups on s3 or gcs from instant snapshots
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/203
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/38
2019-11-07 21:26:43 +02:00
Roman Khavronenko
4e7a2a41a4 Cluster dashboard (#222)
* add dashboard for cluster version

* fix queries and panels

* review fixes

* use resident memory for memory usage panel

* fix job selectors
2019-11-07 12:09:27 +02:00
Aliaksandr Valialkin
89c03a5464 lib/storage: populate partition names from both small and big directories
Certain partition directories may be missing after restoring from backups
if they had no data. Re-create such directories on start.
2019-11-06 19:50:21 +02:00
Aliaksandr Valialkin
1c777e0245 lib/storage: substitute error message about unsorted items in the index block after metricIDs merge with counter
The origin of the error has been detected and documented in the code,
so it is enough to export a counter for such errors at `vm_index_blocks_with_metric_ids_incorrect_order_total`,
so it could be monitored and alerted on high error rates.

Export also the counter for processed index blocks with metricIDs - `vm_index_blocks_with_metric_ids_processed_total`,
so its rate could be compared to `rate(vm_index_blocks_with_metric_ids_incorrect_order_total)`.
2019-11-06 14:32:41 +02:00
Aliaksandr Valialkin
c567a4353a lib/storage: take into account the requested time range when caching TSIDs for the given tag filters 2019-11-06 14:32:41 +02:00
Aliaksandr Valialkin
c6564c5d26 lib/storage: dump incorrectly sorted items on a single line; this should simplify error reporting 2019-11-05 18:41:50 +02:00
Aliaksandr Valialkin
2ef5082ead deployment/docker: update docker images from v1.28.2-cluster to v1.28.3-cluster 2019-11-05 18:08:50 +02:00
Aliaksandr Valialkin
a10c4cad85 lib/storage: restore finalPartsToMerge from 2 to 3 in order to prevent excessive merges in old partitions 2019-11-05 17:28:57 +02:00
Aliaksandr Valialkin
e5b1fa0c38 lib/storage: separate the max inverted index scan loops per metric into fast and slow loops
Slow loops could require seeks and expensive regexp matching, while fast loops just scan
all the metricIDs for the given `tag=value` prefix. So these operations must have separate
max loops multiplier.
2019-11-05 17:28:57 +02:00
Aliaksandr Valialkin
f93c4f2493 lib/storage: skip repeated useless work when intersection of metricIDs with the given filter is too expensive
This should improve performance for query filters over a big number of time series.
2019-11-05 14:35:55 +02:00
Aliaksandr Valialkin
f48e97263c lib/storage: reduce the maximum inverted index scans before giving up to label filters matching by metric name
The new value reduces the amount of wasted work during index scans over a big number of time series.
2019-11-05 14:35:53 +02:00
Aliaksandr Valialkin
d2f688c550 lib/storage: try potentially faster tag filters at first, then apply slower tag filters
The fastest tag filters are non-negative non-regexp, since they are the most specific.
The slowest tag filters are negative regexp, since they require scanning
all the entries for the given label.
2019-11-05 14:35:48 +02:00
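The ordering idea from the commit above can be sketched as a sort by estimated cost. This is a minimal illustration, not the lib/storage implementation: the `tagFilter` type, its fields, and the `cost` ranking are hypothetical. The commit only states the two extremes (non-negative non-regexp filters are the fastest, negative regexp filters are the slowest); the relative order of the two middle cases below is an assumption.

```go
package main

import (
	"fmt"
	"sort"
)

// tagFilter is a hypothetical, simplified stand-in for a label filter.
type tagFilter struct {
	Key        string
	Value      string
	IsNegative bool
	IsRegexp   bool
}

// cost returns a rough rank, lower meaning "try earlier".
// 0: non-negative non-regexp (fastest, per the commit message)
// 3: negative regexp (slowest, per the commit message)
// 1 and 2: assumed ordering of the remaining combinations.
func (tf tagFilter) cost() int {
	c := 0
	if tf.IsRegexp {
		c++
	}
	if tf.IsNegative {
		c += 2
	}
	return c
}

// sortByCost reorders filters so cheaper ones are applied first.
// The stable sort keeps the original order among equally ranked filters.
func sortByCost(tfs []tagFilter) {
	sort.SliceStable(tfs, func(i, j int) bool {
		return tfs[i].cost() < tfs[j].cost()
	})
}

func main() {
	tfs := []tagFilter{
		{Key: "instance", Value: "test-.*", IsNegative: true, IsRegexp: true},
		{Key: "env", Value: "prod.*", IsRegexp: true},
		{Key: "job", Value: "node"},
	}
	sortByCost(tfs)
	for _, tf := range tfs {
		fmt.Println(tf.Key, tf.cost())
	}
}
```

Applying the most specific filter first shrinks the candidate set early, so the expensive filters have fewer entries left to check.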
Aliaksandr Valialkin
a72b22a8b1 Makefile: add pprof-cpu rule for inspecting CPU profiles with PPROF_FILE=/path/to/cpu.pprof make pprof-cpu 2019-11-04 12:43:57 +02:00
Aliaksandr Valialkin
2a38d30f93 lib/storage: pass pointer to MetricName in Fatalf, so it is properly detected as an interface with String() method
This fixes lint errors
2019-11-04 01:06:45 +02:00
Artem Navoiev
e05500cbd4 add unittests for bytesutil and storage (#221) 2019-11-04 00:57:24 +02:00
Aliaksandr Valialkin
f5fbc3ffd7 lib/{storage,uint64set}: add Set.Union() function and use it 2019-11-04 00:48:32 +02:00
Aliaksandr Valialkin
23e078261e lib/storage: tune the returned value from adjustMaxMetricsAdaptive 2019-11-04 00:45:28 +02:00