github-mirrors/VictoriaMetrics

mirror of https://github.com/VictoriaMetrics/VictoriaMetrics.git synced 2024-11-21 14:44:00 +00:00

Author	SHA1	Message	Date
Aliaksandr Valialkin	202eb429a7	lib/logstorage: refactor storage format to be more efficient for querying wide events It turned out that VictoriaLogs is frequently used for collecting logs with tens of fields. For example, a standard Kubernetes setup on top of Filebeat generates more than 20 fields per log. Such logs are also known as "wide events". The previous storage format was optimized for logs with a few fields. When at least a single field was referenced in the query, all the meta-information about all the log fields was unpacked and parsed for each scanned block during the query. This could require a lot of additional disk IO and CPU time when logs contain many fields. Resolve this issue by providing a (field -> metainfo_offset) index per each field in every data block. This index allows reading and extracting only the needed metainfo for fields used in the query. This index is stored in columnsHeaderIndexFilename (columns_header_index.bin). This allows increasing performance for queries over wide events by 10x and more. Another issue was that the data for bloom filters and field values across all the log fields except _msg was intermixed in two files - fieldBloomFilename (field_bloom.bin) and fieldValuesFilename (field_values.bin). This could result in huge disk read IO overhead when some small field was referenced in the query, since the Operating System usually reads more data than requested. It reads the data from disk in at least 4KiB blocks (usually the block size is much bigger, in the range 64KiB - 512KiB). So, if a 512-byte bloom filter or values' block is read from the file, then the Operating System reads up to 512KiB of data from disk, which results in 1000x disk read IO overhead. This overhead isn't visible for recently accessed data, since this data is usually stored in RAM (aka Operating System page cache), but this overhead may become very annoying when performing the query over large volumes of data that isn't present in the OS page cache. The solution for this issue is to split bloom filters and field values across multiple shards. This reduces the worst-case disk read IO overhead by at least Nx where N is the number of shards, while the disk read IO overhead is completely removed in the best case when the number of columns doesn't exceed N. Currently the number of shards is 8 - see bloomValuesShardsCount. This solution increases performance for queries over large volumes of newly ingested data by up to 1000x. The new storage format is versioned as v1, while the old storage format is versioned as v0. The version is stored in partHeader.FormatVersion. Parts with the old storage format are converted into parts with the new storage format during background merge. It is possible to force merge by querying the /internal/force_merge HTTP endpoint - see https://docs.victoriametrics.com/victorialogs/#forced-merge .	2024-10-16 17:35:07 +02:00
Aliaksandr Valialkin	bac193e50b	app/vlselect: do not show empty fields in query results Empty fields are treated as non-existing fields by the VictoriaLogs data model. So there is no point in returning empty fields in query results, since they may mislead and confuse users.	2024-10-14 23:43:58 +02:00
Aliaksandr Valialkin	279e25e7c8	lib/logstorage: avoid redundant copying of column names and column values for dictionary-encoded columns during querying Refer to the original byte slice with the marshaled columnsHeader for column names and dictionary-encoded column values. This improves query performance a bit when a large number of blocks with many columns are scanned during the query.	2024-10-13 13:25:38 +02:00
Aliaksandr Valialkin	462b7cd597	lib/logstorage: quote logfmt strings only if they contain special chars, which could break logfmt parsing and/or reading	2024-10-07 14:31:30 +02:00
Aliaksandr Valialkin	c0caa69939	lib/logstorage: use quicktemplate.AppendJSONString instead of strconv.AppendQuote for encoding JSON strings The strconv.AppendQuote improperly encodes special chars such as \x1b. They must be encoded as \u001b. See https://github.com/VictoriaMetrics/victorialogs-datasource/issues/24	2024-07-05 01:22:23 +02:00
Aliaksandr Valialkin	6bb66cb3e9	lib/logstorage: properly search for the surrounding logs in `stream_context` pipe The set of log fields in the found logs may differ from the set of log fields present in the log stream. So compare only the log fields in the found logs when searching for the matching log entry in the log stream. While at it, return _stream field in the delimiter log entry, since this field is used by VictoriaLogs Web UI for grouping logs by log streams.	2024-07-01 02:29:50 +02:00
Aliaksandr Valialkin	dc291d8980	lib/logstorage: add delimiter between log chunks returned from `| stream_context` pipe	2024-07-01 01:30:37 +02:00
Aliaksandr Valialkin	2b6a634ec0	lib/logstorage: work-in-progress	2024-06-17 12:13:18 +02:00
Aliaksandr Valialkin	43cf221681	lib/logstorage: work-in-progress	2024-06-05 03:18:12 +02:00
Aliaksandr Valialkin	539fce9227	lib/logstorage: work-in-progress	2024-06-04 01:49:02 +02:00
Aliaksandr Valialkin	22107421eb	lib/logstorage: work-in-progress	2024-05-22 21:01:20 +02:00
Aliaksandr Valialkin	cc2647d212	lib/encoding: optimize UnmarshalVarUint64, UnmarshalVarInt64 and UnmarshalBytes a bit Change the return values for these functions - now they return the unmarshaled result plus the size of the unmarshaled result in bytes, so the caller could re-slice the src for further unmarshaling. This improves performance of these functions in hot loops of VictoriaLogs a bit.	2024-05-14 01:23:54 +02:00
Aliaksandr Valialkin	9dbd0f9085	lib/logstorage: initial implementation of pipes in LogsQL See https://docs.victoriametrics.com/victorialogs/logsql/#pipes	2024-05-12 16:33:31 +02:00
Aliaksandr Valialkin	8dce4eb189	lib/logstorage: follow-up for `94627113db` - Move uniqueFields from rows to blockStreamMerger struct. This allows localizing all the references to uniqueFields inside blockStreamMerger.mustWriteBlock(), which should improve readability and maintainability of the code. - Remove logging of the event when blocks cannot be merged because they contain more than maxColumnsPerBlock, since the logging didn't suggest a solution for the issue with too many columns. I couldn't figure out a proper solution that could be helpful for the end user, so I decided to remove the logging until we find one. This commit also contains the following additional changes: - It truncates field names longer than 128 chars during logs ingestion. This should prevent ingestion of bogus field names. It should also prevent overly big columnsHeader blocks, which could negatively affect search query performance, since columnsHeader is read on every scan of the corresponding data block. - It limits the maximum length of a const column value to 256. Longer values are stored in ordinary columns. This helps limit the size of columnsHeader blocks and improves search query performance by avoiding reading too long const columns on every scan of the corresponding data block. - It deduplicates columns with identical names during data ingestion and background merging. Previously it was possible to pass columns with duplicate names to block.mustInitFromRows(), and they were stored as is in the block. Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4762 Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/4969	2023-10-02 19:19:08 +02:00
Zakhar Bessarab	94627113db	lib/logstorage: prevent from panic during background merge (#4969) * lib/logstorage: prevent from panic during background merge Fixes panic during background merge when resulting block would contain more columns than maxColumnsPerBlock. Buffered data will be flushed and replaced by the next block. See: https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4762 Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com> * lib/logstorage: clarify field description and comment Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com> --------- Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>	2023-09-29 11:58:20 +02:00
Aliaksandr Valialkin	87b66db47d	app/victoria-logs: initial code release	2023-06-19 22:55:12 -07:00

16 commits