Shutdown Procedure for remaining sockets #745

humaite · 2025-03-07T14:46:09Z

When a node is stopped, cowboy waits for all connections to close cleanly. Usually it waits for the buffer to be empty before closing the socket. In most cases this behavior is acceptable, but in some cases, in particular when a connection is very slow, it can directly delay the shutdown of the node.

This commit adds a shutdown procedure for stopping client connections. It has 3 stages: (1) when a node is stopped, all sockets are set to read-only mode and all clients are notified of this change, so they can try to fetch the remaining data before closing the connection; (2) if some clients are still connected, the procedure sends another close request and waits for a specified delay; (3) if a connection is still up after that, it is simply killed: the linger option is set to {true, 0} and the socket is closed for good.

This draining procedure was inspired by the example in the ranch documentation, which can be found at: https://ninenines.eu/docs/en/ranch/2.2/guide/connection_draining/

see: https://github.com/ArweaveTeam/arweave-dev/issues/817
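
To illustrate stage (3) only, here is a minimal sketch using just the standard `gen_tcp` and `inet` modules. It is not the code from this commit: the module name `linger_sketch` and the functions `force_close/1` and `demo/0` are hypothetical, and how the real procedure finds the remaining sockets is not shown.

```erlang
-module(linger_sketch).
-export([force_close/1, demo/0]).

%% Stage 3 only: abort the connection instead of waiting for buffered data.
force_close(Socket) ->
	%% With {linger, {true, 0}}, closing does not wait for the send buffer
	%% to drain; on most platforms pending data is discarded.
	ok = inet:setopts(Socket, [{linger, {true, 0}}]),
	gen_tcp:close(Socket).

%% Loopback demo (hypothetical): accept one connection, force-close it on
%% the server side, and return what the client observes on its next read.
demo() ->
	Opts = [binary, {active, false}, {ip, {127, 0, 0, 1}}],
	{ok, Listen} = gen_tcp:listen(0, Opts),
	{ok, Port} = inet:port(Listen),
	{ok, Client} = gen_tcp:connect({127, 0, 0, 1}, Port, [binary, {active, false}]),
	{ok, Server} = gen_tcp:accept(Listen, 1000),
	ok = force_close(Server),
	Result = gen_tcp:recv(Client, 0, 1000),
	ok = gen_tcp:close(Client),
	ok = gen_tcp:close(Listen),
	Result.
```

Calling `linger_sketch:demo()` should return an error tuple on the client side, since the server dropped the connection without a graceful close. Setting the linger timeout to zero trades delivery of any unsent data for a bounded shutdown time, which is why it fits as the last stage.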


JamesPiechota · 2025-03-07T19:05:37Z

apps/arweave/src/ar.erl

@@ -664,6 +665,18 @@ parse_cli_args(["rocksdb_flush_interval", Seconds | Rest], C) ->
 parse_cli_args(["rocksdb_wal_sync_interval", Seconds | Rest], C) ->
 	parse_cli_args(Rest, C#config{ rocksdb_wal_sync_interval_s = list_to_integer(Seconds) });

+%% shutdown procedure
+parse_cli_args(["shutdown_tcp_connection_timeout", Delay|Rest], C) ->


For now I think it's probably best to keep to the format of the other parse lines, e.g.

parse_cli_args(["rocksdb_flush_interval", Seconds | Rest], C) ->
	parse_cli_args(Rest, C#config{ rocksdb_flush_interval_s = list_to_integer(Seconds) });

This file is already monstrous enough, but keeping the styling/format the same helps a little.

Plus, in general we fail hard and fast if any of the flags are set wrong rather than continuing to process (e.g. if list_to_integer fails, we should let the node exit as we do for the other flags).
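
As a sketch of what that could look like for the new flag, assuming the flag name from the diff and the `shutdown_tcp_connection_timeout` field from ar_config.hrl: the module name `cli_sketch`, the one-field `#config{}` record, the `default/0` helper, and the empty-list base clause below are stand-ins so the example compiles on its own, not the real ar.erl code.

```erlang
-module(cli_sketch).
-export([default/0, parse_cli_args/2]).

%% Hypothetical one-field stand-in for the real #config{} record.
-record(config, {shutdown_tcp_connection_timeout = 60_000}).

default() -> #config{}.

%% Assumed base clause: no arguments left, return the config.
parse_cli_args([], C) ->
	C;
%% Same shape as the other flags: no try/catch, so a non-integer value
%% makes list_to_integer/1 raise badarg and the caller fails fast.
parse_cli_args(["shutdown_tcp_connection_timeout", Delay | Rest], C) ->
	parse_cli_args(Rest, C#config{ shutdown_tcp_connection_timeout = list_to_integer(Delay) }).
```

With this shape, a value such as `"5000"` updates the field, while a value such as `"abc"` raises `badarg` instead of being silently ignored.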

JamesPiechota · 2025-03-07T19:09:07Z

apps/arweave/include/ar_config.hrl

@@ -229,7 +229,8 @@
 	%% Undocumented/unsupported options
 	chunk_storage_file_size = ?CHUNK_GROUP_SIZE,
 	rocksdb_flush_interval_s = ?DEFAULT_ROCKSDB_FLUSH_INTERVAL_S,
-	rocksdb_wal_sync_interval_s = ?DEFAULT_ROCKSDB_WAL_SYNC_INTERVAL_S
+	rocksdb_wal_sync_interval_s = ?DEFAULT_ROCKSDB_WAL_SYNC_INTERVAL_S,
+	shutdown_tcp_connection_timeout = 60_000


This timeout is for all connections rather than per connection, right? Like, if we have a sequence of 10 requests queued up that historically the process would move through one at a time, are we waiting 60s per request (e.g. 10 min)? Or 60s total?

JamesPiechota · 2025-03-07T19:17:13Z