github-mirrors/VictoriaMetrics

mirror of https://github.com/VictoriaMetrics/VictoriaMetrics.git synced 2024-12-11 14:53:49 +00:00

Author	SHA1	Message	Date
Aliaksandr Valialkin	27a50b9551	lib/logstorage: print column names to the error log message when the number of unique columns in a block exceeds the limit This should simplify debugging issues related to too big number of columns per block in the future. Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/7568	2024-11-30 18:25:57 +01:00
Aliaksandr Valialkin	191180a1b5	lib/logstorage: follow-up for `22e6385f56` Make variable names and comments more clear. This should simplify code maintenance in the future. Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/7568	2024-11-30 18:06:59 +01:00
Nikolay	e45556fc05	lib/logstorage: fixes panic at Block.MustInitFromRows (#7695 ) Previously Block columns wasn't properly limited by maxColumnsPerBlock. And it was possible a case, when more columns per block added than expected. For example, if ingested log stream has many unuqie fields and it's sum exceed maxColumnsPerBlock. We only enforce fieldsPerBlock limit during row parsing, which limits isn't enough to mitigate this issue. Also it would be very expensive to apply maxColumnsPerBlock limit during ingestion, since it requires to track all possible field tags combinations. This commit adds check for maxColumnsPerBlock limit during MustInitFromRows function call. And it returns offset of the rows and timestamps added to the block. Function caller must create another block and ingest remaining rows into it. Related issue: https://github.com/VictoriaMetrics/VictoriaMetrics/issues/7568 ### Describe Your Changes Please provide a brief description of the changes you made. Be as specific as possible to help others understand the purpose and impact of your modifications. ### Checklist The following checks are mandatory: - [ ] My change adheres [VictoriaMetrics contributing guidelines](https://docs.victoriametrics.com/contributing/). --------- Signed-off-by: f41gh7 <nik@victoriametrics.com> Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>	2024-11-30 18:06:58 +01:00
cangqiaoyuzhuo	07cf3189f8	chore: fix function name (#7381 ) ### Describe Your Changes fix function name ### Checklist The following checks are mandatory: - [x] My change adheres [VictoriaMetrics contributing guidelines](https://docs.victoriametrics.com/contributing/). (cherry picked from commit `45896fb477`)	2024-10-30 13:13:05 +01:00
Aliaksandr Valialkin	fe5f16b662	lib/logstorage: dynamically adjust the number of (bloom, values) shards in a part depending on the number of non-const columns This allows reducing the amounts of data, which must be read during queries over logs with big number of fields (aka "wide events"). This, in turn, improves query performance when the data, which needs to be scanned during the query, doesn't fit OS page cache. (cherry picked from commit `7a62eefa34`)	2024-10-30 09:52:51 +01:00
Aliaksandr Valialkin	3346576a3a	lib/logstorage: refactor storage format to be more efficient for querying wide events It has been appeared that VictoriaLogs is frequently used for collecting logs with tens of fields. For example, standard Kuberntes setup on top of Filebeat generates more than 20 fields per each log. Such logs are also known as "wide events". The previous storage format was optimized for logs with a few fields. When at least a single field was referenced in the query, then the all the meta-information about all the log fields was unpacked and parsed per each scanned block during the query. This could require a lot of additional disk IO and CPU time when logs contain many fields. Resolve this issue by providing an (field -> metainfo_offset) index per each field in every data block. This index allows reading and extracting only the needed metainfo for fields used in the query. This index is stored in columnsHeaderIndexFilename ( columns_header_index.bin ). This allows increasing performance for queries over wide events by 10x and more. Another issue was that the data for bloom filters and field values across all the log fields except of _msg was intermixed in two files - fieldBloomFilename ( field_bloom.bin ) and fieldValuesFilename ( field_values.bin ). This could result in huge disk read IO overhead when some small field was referred in the query, since the Operating System usually reads more data than requested. It reads the data from disk in at least 4KiB blocks (usually the block size is much bigger in the range 64KiB - 512KiB). So, if 512-byte bloom filter or values' block is read from the file, then the Operating System reads up to 512KiB of data from disk, which results in 1000x disk read IO overhead. This overhead isn't visible for recently accessed data, since this data is usually stored in RAM (aka Operating System page cache), but this overhead may become very annoying when performing the query over large volumes of data which isn't present in OS page cache. The solution for this issue is to split bloom filters and field values across multiple shards. This reduces the worst-case disk read IO overhead by at least Nx where N is the number of shards, while the disk read IO overhead is completely removed in best case when the number of columns doesn't exceed N. Currently the number of shards is 8 - see bloomValuesShardsCount . This solution increases performance for queries over large volumes of newly ingested data by up to 1000x. The new storage format is versioned as v1, while the old storage format is version as v0. It is stored in the partHeader.FormatVersion. Parts with the old storage format are converted into parts with the new storage format during background merge. It is possible to force merge by querying /internal/force_merge HTTP endpoint - see https://docs.victoriametrics.com/victorialogs/#forced-merge .	2024-10-17 11:09:16 -03:00
Aliaksandr Valialkin	f52275bbd7	lib/logstorage: work-in-progress Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/6258	2024-05-14 01:49:58 +02:00
Aliaksandr Valialkin	147704aab0	lib/logstorage: initial implementation of pipes in LogsQL See https://docs.victoriametrics.com/victorialogs/logsql/#pipes	2024-05-12 16:36:01 +02:00
Aliaksandr Valialkin	7bb5f75a2a	lib/logstorage: follow-up for `94627113db` - Move uniqueFields from rows to blockStreamMerger struct. This allows localizing all the references to uniqueFields inside blockStreamMerger.mustWriteBlock(), which should improve readability and maintainability of the code. - Remove logging of the event when blocks cannot be merged because they contain more than maxColumnsPerBlock, since the provided logging didn't provide the solution for the issue with too many columns. I couldn't figure out the proper solution, which could be helpful for end user, so decided to remove the logging until we find the solution. This commit also contains the following additional changes: - It truncates field names longer than 128 chars during logs ingestion. This should prevent from ingesting bogus field names. This also should prevent from too big columnsHeader blocks, which could negatively affect search query performance, since columnsHeader is read on every scan of the corresponding data block. - It limits the maximum length of const column value to 256. Longer values are stored in an ordinary columns. This helps limiting the size of columnsHeader blocks and improving search query performance by avoiding reading too long const columns on every scan of the corresponding data block. - It deduplicates columns with identical names during data ingestion and background merging. Previously it was possible to pass columns with duplicate names to block.mustInitFromRows(), and they were stored as is in the block. Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4762 Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/4969	2023-10-02 21:06:49 +02:00
Zakhar Bessarab	876bce5a57	lib/logstorage: prevent from panic during background merge (#4969 ) * lib/logstorage: prevent from panic during background merge Fixes panic during background merge when resulting block would contain more columns than maxColumnsPerBlock. Buffered data will be flushed and replaced by the next block. See: https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4762 Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com> * lib/logstorage: clarify field description and comment Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com> --------- Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>	2023-10-02 19:29:31 +02:00
Aliaksandr Valialkin	374890294e	app/victoria-logs: initial code release	2023-07-06 17:30:05 -07:00

11 commits