Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Action times out if client is listening #1579

Closed
chrisstaite-menlo opened this issue Feb 7, 2025 · 3 comments · Fixed by #1580
Closed

Action times out if client is listening #1579

chrisstaite-menlo opened this issue Feb 7, 2025 · 3 comments · Fixed by #1580

Comments

@chrisstaite-menlo
Copy link
Collaborator

The addition of client_action_timeout_s means that an action times out if a client isn't actively polling for it. However, the RBE WaitExecution allows passive polling of an action. This is not considered and therefore long actions time out an signal down this channel that they have timed out. If there is an active WaitExecution for an action then it should not be subject to the client action timeout.

@MarcusSorealheis
Copy link
Collaborator

Wow! You found the smoking gun (I think). Thank you so much Chris!

We will look into this right away.

@chrisstaite-menlo
Copy link
Collaborator Author

chrisstaite-menlo commented Feb 7, 2025

I think this is handled by the code in OperationSubscriber::changed for store_awaited_action_db, however this same logic doesn't apply in memory_awaited_action_db.

@chrisstaite-menlo
Copy link
Collaborator Author

This might fix it:

diff --git a/nativelink-scheduler/src/memory_awaited_action_db.rs b/nativelink-scheduler/src/memory_awaited_action_db.rs
index ef703329..55b797a0 100644
--- a/nativelink-scheduler/src/memory_awaited_action_db.rs
+++ b/nativelink-scheduler/src/memory_awaited_action_db.rs
@@ -452,11 +452,21 @@ impl<I: InstantWrapper, NowFn: Fn() -> I + Clone + Send + Sync> AwaitedActionDbI
                     }
                 }
                 ActionEvent::ClientKeepAlive(client_id) => {
-                    let maybe_size = self
+                    if let Some(client_awaited_action) = self
                         .client_operation_to_awaited_action
-                        .size_for_key(&client_id)
-                        .await;
-                    if maybe_size.is_none() {
+                        .get(&client_id)
+                        .await
+                    {
+                        if let Some(awaited_action_sender) = self
+                            .operation_id_to_awaited_action
+                            .get(&client_awaited_action.operation_id)
+                        {
+                            awaited_action_sender.send_if_modified(|awaited_action| {
+                                awaited_action.update_client_keep_alive((self.now_fn)().now());
+                                false
+                            });
+                        }
+                    } else {
                         event!(
                             Level::ERROR,
                             ?client_id,

chrisstaite-menlo added a commit to chrisstaite-menlo/turbo-cache that referenced this issue Feb 7, 2025
When the scheduler was updated to add the keep alive to the AwaitedAction
the MemoryAwaitedActionDb was not updated to set this when a ClientKeepAlive
was received.

Fix the test client_reconnect_keeps_action_alive which was not performing
the eviction due to optimisations in the filter_operations function which
then detected the issue.

Then update the ActionEvent::ClientKeepAlive event handler to update the
client keep alive timestamp in the AwaitedAction.

Fixes TraceMachina#1579.
chrisstaite-menlo added a commit to chrisstaite-menlo/turbo-cache that referenced this issue Feb 7, 2025
When the scheduler was updated to add the keep alive to the AwaitedAction
the MemoryAwaitedActionDb was not updated to set this when a ClientKeepAlive
was received.

Fix the test client_reconnect_keeps_action_alive which was not performing
the eviction due to optimisations in the filter_operations function which
then detected the issue.

Then update the ActionEvent::ClientKeepAlive event handler to update the
client keep alive timestamp in the AwaitedAction.

Fixes TraceMachina#1579.
chrisstaite-menlo added a commit to chrisstaite-menlo/turbo-cache that referenced this issue Feb 7, 2025
When the scheduler was updated to add the keep alive to the AwaitedAction
the MemoryAwaitedActionDb was not updated to set this when a ClientKeepAlive
was received.

Fix the test client_reconnect_keeps_action_alive which was not performing
the eviction due to optimisations in the filter_operations function which
then detected the issue.

Then update the ActionEvent::ClientKeepAlive event handler to update the
client keep alive timestamp in the AwaitedAction.

Fixes TraceMachina#1579.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants