github-mirrors/VictoriaMetrics

mirror of https://github.com/VictoriaMetrics/VictoriaMetrics.git synced 2024-12-11 14:53:49 +00:00

Author	SHA1	Message	Date
Aliaksandr Valialkin	27a50b9551	lib/logstorage: print column names to the error log message when the number of unique columns in a block exceeds the limit This should simplify debugging issues related to too big number of columns per block in the future. Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/7568	2024-11-30 18:25:57 +01:00
Aliaksandr Valialkin	191180a1b5	lib/logstorage: follow-up for `22e6385f56` Make variable names and comments more clear. This should simplify code maintenance in the future. Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/7568	2024-11-30 18:06:59 +01:00
Nikolay	e45556fc05	lib/logstorage: fixes panic at Block.MustInitFromRows (#7695) Previously the number of Block columns wasn't properly limited by maxColumnsPerBlock, so more columns could be added to a block than expected. For example, this could happen when the ingested log stream has many unique fields and their total number exceeds maxColumnsPerBlock. Only the fieldsPerBlock limit is enforced during row parsing, which isn't enough to mitigate this issue. It would also be very expensive to apply the maxColumnsPerBlock limit during ingestion, since this requires tracking all the possible combinations of field tags. This commit adds a check for the maxColumnsPerBlock limit to the MustInitFromRows function call, which now returns the offset of the rows and timestamps added to the block. The caller must create another block and ingest the remaining rows into it. Related issue: https://github.com/VictoriaMetrics/VictoriaMetrics/issues/7568 Signed-off-by: f41gh7 <nik@victoriametrics.com> Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>	2024-11-30 18:06:58 +01:00
Aliaksandr Valialkin	4892d4d805	lib/logstorage: allow special chars in unquoted _stream tag names and values This simplifies writing _stream filters. For example, {foo-bar=abc:de} can be written instead of {"foo-bar"="abc:de"}	2024-11-30 17:27:58 +01:00
Aliaksandr Valialkin	75e4a8e64b	lib/logstorage: properly skip filtered out dict values when calculating uniq_values, min, max, row_min and row_max stats functions Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/7458	2024-11-14 17:21:28 +01:00
Aliaksandr Valialkin	8b287e8da4	lib/logstorage: properly clone field values at `values` stats function Previously field values weren't properly cloned, which could lead to garbage output for `values` stats function Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/7458	2024-11-14 17:21:28 +01:00
Aliaksandr Valialkin	a7e1575ebe	lib/logstorage: simplify the code for `uniq_values` stats function a bit Move the repeated check for an empty value into statsUniqValuesProcessor.updateState() function. This allows removing duplicate code for this check from statsUniqValuesProcessor.updateState() call sites.	2024-11-14 17:21:27 +01:00
Aliaksandr Valialkin	5b0b7d509f	lib/logstorage: support for `[label1=value1 ... labelN=valueN]` syntax inside syslog messages for adding arbitrary labels (fields) to log entries	2024-11-14 17:21:26 +01:00
Aliaksandr Valialkin	a02d26e853	lib/logstorage: properly take into account the `end` query arg when calculating time range for _time:duration filters (cherry picked from commit `e5537bc64d`)	2024-11-08 17:07:57 +01:00
Aliaksandr Valialkin	f82cfa16bf	lib/logstorage: allow specifying _time filter offset without time range This is useful when building graphs on time ranges in the past. (cherry picked from commit `a98fb495c6`)	2024-11-08 17:07:57 +01:00
Aliaksandr Valialkin	a4ea3b87d7	lib/logstorage: optimize query immediately after its parsing This eliminates possible bugs related to forgotten Query.Optimize() calls. This also allows removing optimize() function from pipe interface. While at it, drop filterNoop inside filterAnd. (cherry picked from commit `66b2987f49`)	2024-11-08 17:07:56 +01:00
Aliaksandr Valialkin	52929c060a	app/vlselect/logsql: call Query.Optimize() inside parseCommonArgs(), which is called at every /select/logsql/* endpoint. This reduces the probability of a forgotten call to Query.Optimize(). (cherry picked from commit `0550093802`)	2024-11-08 17:07:56 +01:00
Aliaksandr Valialkin	7d078dd591	lib/logstorage: add an ability to add prefix to resulting query field names in `join` pipe See https://docs.victoriametrics.com/victorialogs/logsql/#join-pipe (cherry picked from commit `5a6531b329`)	2024-11-08 17:07:56 +01:00
Aliaksandr Valialkin	364a2e3e1f	docs/VictoriaLogs: properly sort log fields with floating-point numbers (cherry picked from commit `42c9183281`)	2024-11-07 13:00:20 +01:00
Aliaksandr Valialkin	7a39f526ec	lib/logstorage: add `block_stats` pipe for analyzing per-block storage stats (cherry picked from commit `5ed54ebadf`)	2024-11-07 13:00:19 +01:00
Aliaksandr Valialkin	83c9d42263	lib/logstorage: add `join` pipe for joining multiple query results (cherry picked from commit `f9e23bf8e3`)	2024-11-07 13:00:19 +01:00
Zakhar Bessarab	718f8077a8	Revert "lib/mergeset: add sparse indexdb cache (#7269)" This reverts commit `837d0d136d`.	2024-11-04 10:33:22 -03:00
Aliaksandr Valialkin	fced48d540	app/vlinsert: implement the ability to add extra fields to the ingested logs This can be done via extra_fields query arg or via VL-Extra-Fields HTTP header. See https://github.com/VictoriaMetrics/VictoriaMetrics/issues/7354#issuecomment-2448671445 (cherry picked from commit `4478e48eb6`)	2024-11-04 10:23:16 -03:00
Aliaksandr Valialkin	bcbaecd73f	lib/logstorage: increase the maximum number of columns per block from 1000 to 2000 This will allow storing wide events with up to 2K fields per event into VictoriaLogs. While at it, remove the misleading comment that columnsHeader is read in full per each matching block. This is no longer the case after the improvements made at `202eb429a7`. Now only the needed columnHeader is read for the column mentioned in the query. Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/6425#issuecomment-2418337124 Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4762 (cherry picked from commit `9ba6be4179`)	2024-11-04 10:23:15 -03:00
Aliaksandr Valialkin	1ea65d662f	lib/logstorage: properly reset cached output fields for extract and extract_regexp pipes after the log entry matches if(...) condition Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/7162 (cherry picked from commit `c5d08d317c`)	2024-10-31 14:11:08 +01:00
Aliaksandr Valialkin	b74bcb7886	lib/logstorage: properly cache replace() and replace_regexp() results for identical adjacent field values Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/7162 (cherry picked from commit `2e635a42d8`)	2024-10-31 14:11:08 +01:00
Aliaksandr Valialkin	0c657a95dc	app/vlselect: add support for extra_filters and extra_stream_filters query args across all the HTTP querying APIs These query args are going to be used for quick filtering on field values at https://github.com/VictoriaMetrics/VictoriaMetrics/issues/7365 (cherry picked from commit `7603446850`)	2024-10-31 14:11:07 +01:00
Aliaksandr Valialkin	8baa5177aa	app/vlinsert: allow specifying comma-separated list of fields containing log message via _msg_field query arg and VL-Msg-Field HTTP request header This may be useful when ingesting logs from different sources, which store the log message in different fields. For example, `_msg_field=message,event.data,some_field` will get the log message from the first non-empty field among `message`, `event.data` and `some_field`. (cherry picked from commit `ed73f8350b`)	2024-10-30 15:19:52 +01:00
Aliaksandr Valialkin	bf243df9ce	lib/logstorage: make sure that the number of output (bloom, values) shards is bigger than zero. If the number of output (bloom, values) shards is zero, then this may lead to panic as shown at https://github.com/VictoriaMetrics/VictoriaMetrics/issues/7391 . This panic may happen when parts with only constant fields with distinct values are merged into output part with non-constant fields, which should be written to (bloom, values) shards. (cherry picked from commit `102e9d4f4e`)	2024-10-30 15:19:51 +01:00
cangqiaoyuzhuo	07cf3189f8	chore: fix function name (#7381) (cherry picked from commit `45896fb477`)	2024-10-30 13:13:05 +01:00
Aliaksandr Valialkin	1dd01b8a8f	lib/logstorage: follow-up for af831a6c906158f371f1b6810706fa0a54b78386 Sync the code between top and sort pipes regarding the code related to rank. (cherry picked from commit `7a623c225f`)	2024-10-30 09:52:52 +01:00
Aliaksandr Valialkin	329d9a46ee	lib/logstorage: add an ability to return rank from `top` pipe results (cherry picked from commit `3c06d083ea`)	2024-10-30 09:52:51 +01:00
Aliaksandr Valialkin	fe5f16b662	lib/logstorage: dynamically adjust the number of (bloom, values) shards in a part depending on the number of non-const columns This allows reducing the amounts of data, which must be read during queries over logs with big number of fields (aka "wide events"). This, in turn, improves query performance when the data, which needs to be scanned during the query, doesn't fit OS page cache. (cherry picked from commit `7a62eefa34`)	2024-10-30 09:52:51 +01:00
Aliaksandr Valialkin	76b21c8560	lib/logstorage: avoid reading columnsHeader data when `field_values` pipe is applied directly to log filters This improves performance of `field_values` pipe when it is applied to large number of data blocks. This also improves performance of /select/logsql/field_values HTTP API. (cherry picked from commit `8d968acd0a`)	2024-10-30 09:52:50 +01:00
Hui Wang	9616814728	vmalert: integrate with victorialogs (#7255 ) address https://github.com/VictoriaMetrics/VictoriaMetrics/issues/6706. See https://github.com/VictoriaMetrics/VictoriaMetrics/blob/vmalert-support-vlog-ds/docs/VictoriaLogs/vmalert.md. Related fix https://github.com/VictoriaMetrics/VictoriaMetrics/pull/7254. Note: in this pull request, vmalert doesn't support [backfilling](https://github.com/VictoriaMetrics/VictoriaMetrics/blob/vmalert-support-vlog-ds/docs/VictoriaLogs/vmalert.md#rules-backfilling) for rules with a customized time filter. It might be added in the future, see [this issue](https://github.com/VictoriaMetrics/VictoriaMetrics/issues/7289) for details. Feature can be tested with image `victoriametrics/vmalert:heads-vmalert-support-vlog-ds-0-g420629c-scratch`. --------- Signed-off-by: hagen1778 <roman@victoriametrics.com> Co-authored-by: hagen1778 <roman@victoriametrics.com> (cherry picked from commit `68bad22fd2`)	2024-10-29 16:32:00 +01:00
Zakhar Bessarab	8198e7241d	lib/mergeset: add sparse indexdb cache (#7269 ) Related issue: https://github.com/VictoriaMetrics/VictoriaMetrics/issues/7182 - add a separate index cache for searches which might read through large amounts of random entries. Primary use-case for this is retention and downsampling filters, when applying filters background merge needs to fetch large amount of random entries which pollutes an index cache. Using different caches allows to reduce effect on memory usage and cache efficiency of the main cache while still having high cache hit rate. A separate cache size is 5% of allowed memory. - reduce size of indexdb/dataBlocks cache in order to free memory for new sparse cache. Reduced size by 5% and moved this to a separate cache. - add a separate metricName search which does not cache metric names - this is needed in order to allow disabling metric name caching when applying downsampling/retention filters. Applying filters during background merge accesses random entries, this fills up cache and does not provide an actual improvement due to random access nature. Merge performance and memory usage stats before and after the change: - before ![image](https://github.com/user-attachments/assets/485fffbb-c225-47ae-b5c5-bc8a7c57b36e) - after ![image](https://github.com/user-attachments/assets/f4ba3440-7c1c-4ec1-bc54-4d2ab431eef5) --------- Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com> (cherry picked from commit `837d0d136d`)	2024-10-24 12:43:06 -03:00
Aliaksandr Valialkin	62e4baf556	lib/logstorage: use simpler in-memory cache instead of workingsetcache for caching recently ingested _stream values and recently queried set of streams These caches aren't expected to grow big, so it is OK to use the simplest cache based on sync.Map. The benefit of this cache compared to workingsetcache is better scalability on systems with many CPU cores, since it doesn't use mutexes at fast path. An additional benefit is lower memory usage on average, since the size of in-memory cache equals working set for the last 3 minutes. The downside is that there is no upper bound for the cache size, so it may grow big during workload spikes. But this is very unlikely for typical workloads. (cherry picked from commit `0f24078146`)	2024-10-18 11:42:16 +02:00
Aliaksandr Valialkin	f9d86a913c	lib/logstorage: do not persist streamIDCache, since it may go out of sync with partition directories, which can be changed manually between VictoriaLogs restarts Partition directories can be manually deleted and copied from other sources such as backups or other VictoriaLogs instances. In this case the persisted cache becomes out of sync with partitions. This can result in missing index entries during data ingestion or in incorrect results during querying. So it is better not to persist caches. This shouldn't hurt VictoriaLogs performance just after the restart too much, since its caches usually contain small amounts of data, which can be quickly re-populated from the persisted data. (cherry picked from commit `8aa144fa74`)	2024-10-18 11:42:16 +02:00
Aliaksandr Valialkin	b9fae4378a	lib/logstorage: consistently use "pHits := m[..]" pattern Consistency improves maintainability of the code a bit. (cherry picked from commit `1892e357c3`)	2024-10-18 11:42:16 +02:00
Aliaksandr Valialkin	92b9b13df1	lib/logstorage: optimize performance for queries, which select all the log fields for logs containing hundreds of log fields (aka "wide events") Unpack the full columnsHeader block instead of unpacking meta-information per each individual column when the query, which selects all the columns, is executed. This improves performance when scanning logs with big number of fields. (cherry picked from commit `2023f017b1`)	2024-10-18 11:42:15 +02:00
Aliaksandr Valialkin	5d541322c6	lib/logstorage: improve performance of `top` and `field_values` pipes on systems with many CPU cores - Parallelize merging of per-CPU results. - Parallelize writing the results to the next pipe. (cherry picked from commit `78c6fb0883`)	2024-10-18 11:42:15 +02:00
Aliaksandr Valialkin	cd7823a310	lib/logstorage: optimize 'stats by(...)' calculations for by(...) fields with millions of unique values on multi-CPU systems - Parallelize merging of per-CPU `stats by(...)` result shards. - Parallelize writing `stats by(...)` results to the next pipe. (cherry picked from commit `c4b2fdff70`)	2024-10-18 11:42:15 +02:00
Aliaksandr Valialkin	1000ae437c	lib/logstorage: optimize performance for `top` pipe when it is applied to a field with millions of unique values - Use parallel merge of per-CPU shard results. This improves merge performance on multi-CPU systems. - Use topN heap sort of per-shard results. This improves performance when results contain millions of entries. (cherry picked from commit `192c07f76a`)	2024-10-18 11:42:15 +02:00
Aliaksandr Valialkin	54ccf09fdd	lib/logstorage: follow-up for `72941eac36` - Allow dropping metrics if the query result contains at least a single metric. - Allow copying by(...) fields. - Disallow overriding by(...) fields via `math` pipe. - Allow using `format` pipe in stats query. This is useful for constructing some labels from the existing by(...) fields. - Add more tests. - Remove the check for time range in the query filter according to https://github.com/VictoriaMetrics/VictoriaMetrics/pull/7254/files#r1803405826 Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/7254	2024-10-17 11:09:16 -03:00
Hui Wang	21864de527	victorialogs: add more checks for stats query APIs (#7254) 1. Verify if field in [fields pipe](https://docs.victoriametrics.com/victorialogs/logsql/#fields-pipe) exists. If not, it generates a metric with illegal float value "" for prometheus metrics protocol. 2. Check if multiple time range filters produce a conflicting query time range, for instance: ``` query: _time: 5m | stats count(), start:2024-10-08T10:00:00.806Z, end: 2024-10-08T12:00:00.806Z, time: 2024-10-10T10:02:59.806Z ``` must give no result due to invalid final time range. --------- Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>	2024-10-17 11:09:16 -03:00
Aliaksandr Valialkin	3346576a3a	lib/logstorage: refactor storage format to be more efficient for querying wide events It turned out that VictoriaLogs is frequently used for collecting logs with tens of fields. For example, standard Kubernetes setup on top of Filebeat generates more than 20 fields per each log. Such logs are also known as "wide events". The previous storage format was optimized for logs with a few fields. When at least a single field was referenced in the query, all the meta-information about all the log fields was unpacked and parsed per each scanned block during the query. This could require a lot of additional disk IO and CPU time when logs contain many fields. Resolve this issue by providing a (field -> metainfo_offset) index per each field in every data block. This index allows reading and extracting only the needed metainfo for fields used in the query. This index is stored in columnsHeaderIndexFilename (columns_header_index.bin). This allows increasing performance for queries over wide events by 10x and more. Another issue was that the data for bloom filters and field values across all the log fields except _msg was intermixed in two files - fieldBloomFilename (field_bloom.bin) and fieldValuesFilename (field_values.bin). This could result in huge disk read IO overhead when some small field was referenced in the query, since the Operating System usually reads more data than requested. It reads the data from disk in at least 4KiB blocks (usually the block size is much bigger, in the range 64KiB - 512KiB). So, if a 512-byte bloom filter or values' block is read from the file, then the Operating System reads up to 512KiB of data from disk, which results in 1000x disk read IO overhead. This overhead isn't visible for recently accessed data, since this data is usually stored in RAM (aka Operating System page cache), but this overhead may become very annoying when performing the query over large volumes of data which isn't present in OS page cache. The solution for this issue is to split bloom filters and field values across multiple shards. This reduces the worst-case disk read IO overhead by at least Nx where N is the number of shards, while the disk read IO overhead is completely removed in the best case when the number of columns doesn't exceed N. Currently the number of shards is 8 - see bloomValuesShardsCount. This solution increases performance for queries over large volumes of newly ingested data by up to 1000x. The new storage format is versioned as v1, while the old storage format is versioned as v0. It is stored in the partHeader.FormatVersion. Parts with the old storage format are converted into parts with the new storage format during background merge. It is possible to force merge by querying /internal/force_merge HTTP endpoint - see https://docs.victoriametrics.com/victorialogs/#forced-merge .	2024-10-17 11:09:16 -03:00
Aliaksandr Valialkin	0881e5fd5c	app/vlselect: do not show empty fields in query results Empty fields are treated as non-existing fields by VictoriaLogs data model. So there is no sense in returning empty fields in query results, since they may mislead and confuse users. (cherry picked from commit `bac193e50b`)	2024-10-15 11:49:32 +02:00
Aliaksandr Valialkin	f627d7f686	app/vlstorage: add support for forced merge via /internal/force_merge HTTP endpoint (cherry picked from commit `3c73dbbacc`)	2024-10-15 11:49:31 +02:00
Aliaksandr Valialkin	ac2b6e8704	lib/logstorage: make a copy of s.partitions slice when performing queries over the selected partitions s.partitions can be changed when new partition is registered or when old partition is dropped. This could lead to data races and panics when s.partitions slice is accessed by concurrently executed queries. The fix is to make a copy of the selected partitions under s.partitionsLock before performing the query. (cherry picked from commit `b4b79a4961`)	2024-10-15 11:49:31 +02:00
Aliaksandr Valialkin	b694ca4952	lib/logstorage: move getConstColumnValue() and getColumnHeader() methods from columnsHeader to blockSearch This localizes blockSearch.getColumnsHeader() call at block_search.go . This call is going to be optimized in the next commits in order to avoid unmarshaling of header data for unneeded columns, which weren't requested by getConstColumnValue() / getColumnHeader(). (cherry picked from commit `507b206a7d`)	2024-10-15 11:49:30 +02:00
Aliaksandr Valialkin	beeb80e4f8	lib/logstorage: avoid redundant copying of column names and column values for dictionary-encoded columns during querying Refer to the original byte slice with the marshaled columnsHeader for column names and dictionary-encoded column values. This improves query performance a bit when big number of blocks with big number of columns are scanned during the query. (cherry picked from commit `279e25e7c8`)	2024-10-15 11:49:30 +02:00
Aliaksandr Valialkin	afe5158443	lib/logstorage: avoid calling columnsHeader.initFromBlockHeader() multiple times for the same blockSearch This should improve performance when blockSearch.getColumnsHeader() is called multiple times from different places of the code. (cherry picked from commit `9e48074b59`)	2024-10-15 11:49:30 +02:00
Aliaksandr Valialkin	e581338b84	lib/logstorage: make sure that bs.br is non-nil before checking br.bs.bsw.bh.rowsCount there br.bs may be nil when br contains the block with additional filters applied during pipe calculations. For example, `* | count() if (error) errors`. (cherry picked from commit `867f671cc4`)	2024-10-15 11:49:29 +02:00
Aliaksandr Valialkin	b3bbf94310	lib/logstorage: disallow using pipe names as the first unquoted words in `filter` pipe Improperly written pipes could be silently parsed as filter pipe. For example, the following query: * | by (x) was silently parsed to: * | filter "by" x It is better to return error, so the user could identify and fix invalid pipe instead of silently executing invalid query with `filter` pipe. (cherry picked from commit `7b475ed95d`)	2024-10-11 14:27:46 +02:00
Aliaksandr Valialkin	834e2ad855	lib/logstorage: disallow using by as the first word in log filters, since it frequently clashes with `stats by(...)` pipe where `stats` word is omitted (cherry picked from commit `6acf543b90`)	2024-10-11 14:27:46 +02:00
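
Commit e45556fc05 above describes MustInitFromRows returning the offset of the rows added to a block, so that the caller can put the remaining rows into another block once the maxColumnsPerBlock limit is reached. The following is a minimal sketch of that idea only, not the actual lib/logstorage code: the `row` type, `initFromRows`, `splitIntoBlocks` and the tiny limit of 3 are all hypothetical names and values chosen for illustration.

```go
package main

import "fmt"

// row is a hypothetical stand-in for a parsed log entry: field name -> value.
type row map[string]string

// initFromRows consumes rows until adding the next row would push the number
// of unique column names over limit. It returns the number of rows consumed.
// The first row is always consumed, so the caller is guaranteed to make progress.
func initFromRows(rows []row, limit int) int {
	seen := make(map[string]struct{})
	for i, r := range rows {
		// Count the column names this row would add to the block.
		added := 0
		for name := range r {
			if _, ok := seen[name]; !ok {
				added++
			}
		}
		if i > 0 && len(seen)+added > limit {
			return i
		}
		for name := range r {
			seen[name] = struct{}{}
		}
	}
	return len(rows)
}

// splitIntoBlocks plays the role of the caller: it keeps creating new blocks
// for the remaining rows until all the rows are consumed.
func splitIntoBlocks(rows []row, limit int) [][]row {
	var blocks [][]row
	for len(rows) > 0 {
		n := initFromRows(rows, limit)
		blocks = append(blocks, rows[:n])
		rows = rows[n:]
	}
	return blocks
}

func main() {
	rows := []row{
		{"a": "1", "b": "2"},
		{"b": "3", "c": "4"},
		{"d": "5", "e": "6"},
		{"a": "7"},
	}
	// An illustrative limit of 3 unique columns per block; the real limit is far larger.
	for i, b := range splitIntoBlocks(rows, 3) {
		fmt.Printf("block %d: %d rows\n", i, len(b))
	}
	// Prints:
	// block 0: 2 rows
	// block 1: 2 rows
}
```

The sketch counts unique column names per block at block construction time rather than during row parsing, which mirrors the trade-off the commit message gives: tracking every field combination during ingestion would be expensive.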

183 commits