-
Notifications
You must be signed in to change notification settings - Fork 212
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Shutdown Procedure for remaining sockets #745
base: master
Are you sure you want to change the base?
Conversation
89b186c
to
82c4542
Compare
When stopping a node, cowboy is waiting for all connections to be correctly closed. Usually, it waits for the buffer to be empty before closing the socket. In most of the cases, this behavior is acceptable, but in some cases, in particular when a connection is very slow, this can have a direct impact on the delay to shutdown a node. This commit adds a shutdown procedure when stopping the client connections. It has 3 stages: (1) when a node is stopped, all sockets are set in read-only mode and all clients are noticed of this change, and they will try to fetch the data before closing the connection (2) if some nodes are still present, the procedure sends another closing requests and will wait for a specific delay (3) if the connection is still up, the connection is simply killed, linger option is set to {true, 0} and the socket is definitively closed. This draining procedure was inspired by ranch documentation example, that can be found at: https://ninenines.eu/docs/en/ranch/2.2/guide/connection_draining/ see: https://github.com/ArweaveTeam/arweave-dev/issues/817
82c4542
to
c22dac9
Compare
@@ -664,6 +665,18 @@ parse_cli_args(["rocksdb_flush_interval", Seconds | Rest], C) -> | |||
parse_cli_args(["rocksdb_wal_sync_interval", Seconds | Rest], C) -> | |||
parse_cli_args(Rest, C#config{ rocksdb_wal_sync_interval_s = list_to_integer(Seconds) }); | |||
|
|||
%% shutdown procedure | |||
parse_cli_args(["shutdown_tcp_connection_timeout", Delay|Rest], C) -> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For now I think probably best to keep to the format of the other parse lines, eg.
parse_cli_args(["rocksdb_flush_interval", Seconds | Rest], C) ->
parse_cli_args(Rest, C#config{ rocksdb_flush_interval_s = list_to_integer(Seconds) });
This file is already monstrous enough, but keeping the styling/format the same helps a little.
Plus in general we do a hard/fast fail if any of the flags are set wrong rather than continue processing (e.g. if list_to_integer fails, we should allow the node to exit as we do for the other flags0
@@ -229,7 +229,8 @@ | |||
%% Undocumented/unsupported options | |||
chunk_storage_file_size = ?CHUNK_GROUP_SIZE, | |||
rocksdb_flush_interval_s = ?DEFAULT_ROCKSDB_FLUSH_INTERVAL_S, | |||
rocksdb_wal_sync_interval_s = ?DEFAULT_ROCKSDB_WAL_SYNC_INTERVAL_S | |||
rocksdb_wal_sync_interval_s = ?DEFAULT_ROCKSDB_WAL_SYNC_INTERVAL_S, | |||
shutdown_tcp_connection_timeout = 60_000 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This timeout is for all connections rather than per connection, right? Like, if we have a sequence of 10 requests queued up that historically the process would move through one at a time, are we waiting 60s per request (e.g. 10 min)? Or 60s total?
Filter = fun({Pid, Ref}) -> false; (_) -> true end, | ||
NewPids = lists:filter(Filter, Pids), | ||
terminate_listener(NewPids, Delay); | ||
_ -> terminate_listener(Pids, Delay) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could potentially loop, but the after AfterDelay
limits the looping to Delay*3
ms, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh wait, no it doesn't.
Any risk to infinite loop here?
|
||
%%-------------------------------------------------------------------- | ||
%% @doc terminate a ranch/cowboy connection. this follows the drain | ||
%% procedure explaining in ranch and cowboy documentation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
%% procedure explaining in ranch and cowboy documentation. | |
%% procedure explained in ranch and cowboy documentation: | |
%% https://ninenines.eu/docs/en/ranch/2.2/guide/connection_draining/ |
% remote peer will receive this information and will try to fetch | ||
% the last piece of data from the buffer. It does not guarantee the | ||
% socket will be closed, in particular if the connection is slow. | ||
logger:warning(#{ connection => Peer, socket => Connection, reason => "shutdown", msg => "shutdown (read-only)"}), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any way to use ?LOG_WARNING
/ ?LOG_ERROR
/ ?LOG_INFO
for now? to be consistent with the rest of the code
ranch_tcp:close(Socket), | ||
terminate_connection_loop(State); | ||
|
||
{'DOWN', _Ref, process, Connection, _} -> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this message comes in before one or both of the {timeout, xxx}
messages, that's okay, right? Nothing else needs to be cleaned up because the connection has already terminated, and the pending timeout messages will just be sent into the void
_ -> terminate_listener(Pids, Delay) | ||
after | ||
AfterDelay -> | ||
cowboy:stop_listener(ar_http_iface_listener) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We'll wait AfterDelay ms after all Pids have exited right? is that the intent? Or do we want to call cowboy:stop_listener
as soon as Pids
is []
?
When stopping a node, cowboy is waiting for all connections to be correctly closed. Usually, it waits for the buffer to be empty before closing the socket. In most of the cases, this behavior is acceptable, but in some cases, in particular when a connection is very slow, this can have a direct impact on the delay to shutdown a node.
This commit adds a shutdown procedure when stopping the client connections. It has 3 stages: (1) when a node is stopped, all sockets are set in read-only mode and all clients are noticed of this change, and they will try to fetch the data before closing the connection (2) if some nodes are still present, the procedure sends another closing requests and will wait for a specific delay (3) if the connection is still up, the connection is simply killed, linger option is set to {true, 0} and the socket is definitively closed.
This draining procedure was inspired by ranch documentation example, that can be found at: https://ninenines.eu/docs/en/ranch/2.2/guide/connection_draining/
see: https://github.com/ArweaveTeam/arweave-dev/issues/817