Commit graph

43 commits

Author SHA1 Message Date
Aliaksandr Valialkin
9eb828b2c2 app/vminsert: add vm_rpc_send_duration_seconds_total metric per each vminsert->vmstorage link
This metric is useful for determining high link saturation with the following alerting rule:

rate(vm_rpc_send_duration_seconds_total) > 0.9s
2021-08-11 11:42:33 +03:00
Aliaksandr Valialkin
c18017a9c3 app/vminsert/netstorage: sort the -storageNode list passed to vminsert nodes
This should reduce resource usage (CPU, RAM, disk IO) at vmstorage nodes
if the addresses of vmstorage nodes are passed in random order to vminsert nodes.
2021-06-23 14:00:08 +03:00
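Why sorting helps can be sketched as follows. This is a minimal illustration, not the vminsert code: it assumes each series is routed by a hash of its key modulo the node count, and `pickNode`, the FNV hash, and the addresses are all hypothetical. Under that assumption, two vminsert nodes that receive the same addresses in a different order agree on the target only if both sort the list first.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// pickNode returns the address chosen for the given series key.
// The node list is sorted first, so the result does not depend on the
// order in which the addresses were passed. addrs must be non-empty.
func pickNode(addrs []string, seriesKey string) string {
	sorted := append([]string(nil), addrs...) // copy; leave the caller's slice untouched
	sort.Strings(sorted)
	h := fnv.New64a()
	h.Write([]byte(seriesKey)) // hypothetical hash; the real routing hash may differ
	return sorted[h.Sum64()%uint64(len(sorted))]
}

func main() {
	a := []string{"vmstorage-1:8400", "vmstorage-2:8400", "vmstorage-3:8400"}
	b := []string{"vmstorage-3:8400", "vmstorage-1:8400", "vmstorage-2:8400"}
	key := `cpu_usage{host="a"}`
	// Both orderings select the same node for the same series.
	fmt.Println(pickNode(a, key) == pickNode(b, key))
}
```

Without the sort, the same series could land on different vmstorage nodes depending on which vminsert node received it, so each vmstorage node would have to register more series.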
Aliaksandr Valialkin
2c6b917749 app/vminsert/netstorage: update storageNode.lastRerouteTime before the rerouting
This is needed for reliable detection of storage nodes with recent rerouting
2021-06-08 12:06:32 +03:00
Aliaksandr Valialkin
0d067eb112 app/vminsert/netstorage: tune re-routing algorithm
Do not re-route data to unavailable storage node. Send it to the remaining storage nodes instead
even if they cannot keep up with the load. This should spread the load more evenly among available
storage nodes.

Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/791
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1054
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1165
2021-06-05 16:23:44 +03:00
Aliaksandr Valialkin
1c09e71f5b app/vminsert: add -disableRerouting command-line flag for disabling re-routing if some vmstorage nodes have lower performance than the others
Refactor the rerouting mechanism and make it more resilient to cases when some of vmstorage nodes are temporarily unavailable.

Reduce the probability of rerouting storm.

Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/791
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1054
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1165
2021-06-04 04:33:52 +03:00
Aliaksandr Valialkin
4a5f45c77e app/vminsert: add support for data ingestion via other vminsert nodes 2021-05-08 19:53:45 +03:00
Aliaksandr Valialkin
9e79fc27c8 app/vminsert/netstorage: properly update vm_rpc_rerouted_rows_processed_total metric
Previously this metric wasn't updated because of improper defer call.
See https://github.com/VictoriaMetrics/VictoriaMetrics/issues/955

Thanks to @xemxx for spotting the bug.
2020-12-11 13:07:05 +02:00
Aliaksandr Valialkin
c6adcafedb app/vminsert: export vm_rpc_vmstorage_is_reachable metric, which can be used for monitoring reachability of vmstorage nodes from vminsert nodes 2020-11-17 22:13:26 +02:00
Aliaksandr Valialkin
882e2e2099 app/vminsert/netstorage: return 503 status code to client when all the vmstorage nodes are unavailable
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/896
2020-11-14 00:44:41 +02:00
Aliaksandr Valialkin
ffa6581c46 app/vminsert: refresh the list of healthy storage nodes only if the row cannot be sent to destination storage node
Previously the list had been generated for each rerouted row. This could consume additional CPU time during rerouting,
which could lead to rerouting slowdown.

Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/791
2020-09-29 01:29:24 +03:00
Aliaksandr Valialkin
bc37f1cbec app/vminsert: do not pollute logs with repeated cannot dial storageNode errors
Log only the first error per -storageNode
2020-09-29 00:20:32 +03:00
Aliaksandr Valialkin
9d123eb22a app/vminsert: remove useless delays when sending data to vmstorage
This improves the maximum data ingestion performance for cluster VictoriaMetrics

Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/791
2020-09-28 21:41:15 +03:00
Aliaksandr Valialkin
fe08b1eb26 app/vminsert: improve error message when the data cannot be sent to vmstorage - log reroutedBR buffer size
This should improve debuggability for improperly configured cluster
2020-08-31 17:51:44 +03:00
Aliaksandr Valialkin
d962568e93 all: use %w instead of %s for wrapping errors in fmt.Errorf
This will simplify examining the returned errors such as httpserver.ErrorWithStatusCode.
See https://blog.golang.org/go1.13-errors for details.
2020-06-30 23:33:46 +03:00
Aliaksandr Valialkin
a586b8b6d4 app/vminsert/netstorage: do not re-route every time series to more than two vmstorage nodes when certain vmstorage nodes are temporarily slower than the rest of them
Previously vminsert could spread data for a single time series across all the available vmstorage nodes
when vmstorage nodes couldn't handle the given ingestion rate. This could lead to increased usage
of CPU and memory on every vmstorage node, since every vmstorage node had to register all the time
series seen in the cluster. Now a time series may be spread across at most two vmstorage nodes under heavy load.
Every time series is routed to a single vmstorage node under normal load.
2020-06-25 16:42:37 +03:00
Aliaksandr Valialkin
2fc2679a3f app/vminsert/netstorage: remove possible race condition when broken connection may be recovered before acquiring storageNode.bcLock 2020-06-20 16:38:08 +03:00
Aliaksandr Valialkin
4400700832 app/vminsert: properly replicate data for the last RF-1 storage nodes for -replicationFactor=RF
Previously the data for the last `RF-1` storage nodes was incorrectly replicated to the first storage node.
2020-06-19 12:40:22 +03:00
Aliaksandr Valialkin
85c1ccb8b8 app/vminsert/netstorage: add missing return in storageNode.checkHealth on connection failure 2020-06-18 20:51:51 +03:00
Aliaksandr Valialkin
464682f380 app/vminsert/netstorage: periodically check for each -storageNode health, so it could be marked as healthy when it is ready to accept data
This fixes uneven data routing in cluster version when `-replicationFactor` is set to 1 (default value),
i.e. when the replication is disabled.

Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/546
2020-06-18 20:42:43 +03:00
Aliaksandr Valialkin
43b14b9569 app/vminsert/netstorage: free up unused memory in buffer after memory usage spikes 2020-06-01 14:33:35 +03:00
Aliaksandr Valialkin
f41a01332a app/vminsert/netstorage: evenly distribute rerouted rows among all the available storage nodes
Previously such rows were distributed to the original storage node or to the next storage node.
This could result in uneven load among the remaining storage nodes.
2020-05-30 13:51:09 +03:00
Aliaksandr Valialkin
02b2064d8e app/vminsert/netstorage: do not increment vm_rpc_rows_lost_total when all the vmstorage nodes are unavailable, since vminsert retries sending the data instead of dropping it 2020-05-28 22:36:56 +03:00
Aliaksandr Valialkin
7a61357b5d app/vminsert/netstorage: make sure that the data is always replicated among -replicationFactor vmstorage nodes
Previously vminsert could write multiple copies of the data to a single vmstorage node when the ingestion rate
exceeded the maximum throughput for connections to vmstorage nodes.
2020-05-28 19:59:07 +03:00
Aliaksandr Valialkin
b4e3bffe4b app/vminsert/netstorage: emit warnings instead of errors when re-routing data to healthy storage nodes 2020-05-27 16:31:41 +03:00
Aliaksandr Valialkin
75f2f3b09d app/vminsert/netstorage: improve ingestion performance when a single vmstorage node is slower than other vmstorage nodes
Previously the ingestion performance was limited by the slowest vmstorage node.
Now vminsert should re-route data from the slowest vmstorage node to the remaining nodes.
2020-05-27 15:08:22 +03:00
Aliaksandr Valialkin
9844845d79 app/vminsert: tune the maximum summary buffer size for pending data to 1/4 of available RAM, since 1/2 of RAM is too big considering GOGC overhead 2020-05-25 02:00:37 +03:00
Aliaksandr Valialkin
4a82631e44 app/vminsert: limit the summary buffer sizes for all the storage nodes to a half of the allowed memory 2020-05-25 01:39:33 +03:00
Aliaksandr Valialkin
4bd3d4b148 app/vminsert/netstorage: do not return error from storageNode.flushBufLocked when the buffer has been successfully re-routed to healthy nodes
This should reduce the number of false errors in the log and the number of falsely lost rows
2020-05-22 18:29:43 +03:00
Aliaksandr Valialkin
716bbe79d4 app/vminsert/netstorage: increase timeout for waiting for ack message after sending big data block to vmstorage 2020-04-28 11:19:46 +03:00
Aliaksandr Valialkin
989d84cf3f app/{vminsert,vmstorage}: wait for ack from vmstorage after each packet sent to it from vminsert
This should protect from possible data loss when `vmstorage` is stopped while the packet is sent from `vminsert`.

This commit switches to new protocol between vminsert and vmstorage, which is incompatible
with the previous protocol. So it is required that both vminsert and vmstorage nodes are updated.
2020-04-27 09:53:26 +03:00
Aliaksandr Valialkin
347aaba79d lib/{storage,mergeset}: use time.Ticker instead of time.Timer where appropriate
It turned out that time.Timer was used in places where time.Ticker must be used instead.
This could result in blocked goroutines as in https://github.com/VictoriaMetrics/VictoriaMetrics/issues/316 .
2020-02-13 13:21:48 +02:00
Aliaksandr Valialkin
4d70a81e18 app/vminsert: do not drop pending rows if all the vmstorage backends are unavailable
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/294
2020-01-24 22:10:10 +02:00
Aliaksandr Valialkin
ecddba30fe app/vminsert/netstorage: increase timeout for pushing data from vminsert to vmstorage by 3x
Our clients report that the previous timeout could lead to frequent errors when
vmstorage starts background merge for big parts on slow HDD.
2020-01-21 18:21:49 +02:00
Aliaksandr Valialkin
8d398af92f app/vminsert/netstorage: mention the data size that cannot be sent to vmstorage
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/175
2019-09-25 12:53:41 +03:00
Aliaksandr Valialkin
73ac7b8dd6 app/vminsert/netstorage: make sure the conn exists before closing it in storageNode.closeBrokenConn
The conn can be missing or already closed during the call to storageNode.closeBrokenConn.
Prevent `nil pointer dereference` panic by verifying whether the conn is already closed.

Thanks to @CH-anhngo for reporting the issue.

Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/189
2019-09-25 10:36:50 +03:00
Aliaksandr Valialkin
550a12415a app/vminsert/netstorage: log network errors when sending data to vmstorage nodes
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/175
2019-09-13 22:26:24 +03:00
Aliaksandr Valialkin
ed50b8792b app/vminsert/netstorage: reduce the maximum buffer size for rerouted rows, so it occupies less RAM 2019-09-11 14:50:30 +03:00
Aliaksandr Valialkin
694cc59ed1 app/vminsert/netstorage: dynamically adjust timeouts for sending packets from vminsert to vmstorage depending on packet size
Bigger packets now get longer timeouts, so they are more likely to be delivered to vmstorage.
2019-09-11 14:50:14 +03:00
Aliaksandr Valialkin
697de90893 app/vminsert: do not drop data in reroutedBuf if all the storage nodes are unhealthy 2019-08-23 10:38:19 +03:00
Aliaksandr Valialkin
a5dc54efc3 app/vminsert: properly limit the size of reroutedBuf 2019-08-23 10:29:51 +03:00
Aliaksandr Valialkin
5fcdb4a59a app/vminsert: improve handling of unhealthy vmstorage nodes
* Spread load evenly among remaining healthy nodes instead of hammering
  the next node after the unhealthy node.
* Make sure that the packet is flushed to storage node before returning success.
  Previously packets could stay in local buffers and thus be lost on connection errors.
* Keep rows in a limited amount of memory when all the storage nodes are unhealthy.
2019-06-09 00:42:36 +03:00
Aliaksandr Valialkin
f37903adb3 app/vminsert: add -rpc.disableCompression command-line flag for reducing CPU usage at the cost of higher network bandwidth usage 2019-05-24 12:51:07 +03:00
Aliaksandr Valialkin
24578b4bb1 all: open-sourcing cluster version 2019-05-23 00:25:38 +03:00