RUST-1222 Cancel in-progress operations when SDAM heartbeats time out #1249

isabelatkinson · 2024-11-20T20:54:39Z

Adds a broadcast::Sender to the connection pool worker, which broadcasts a cancellation message whenever the pool is cleared due to a network timeout. Each time a connection is checked out, it obtains a corresponding broadcast::Receiver and listens on it for a cancellation message while the operation executes.

isabelatkinson · 2024-11-20T20:55:46Z

src/cmap/conn.rs

-        message: Message,
-        // This value is only read if a compression feature flag is enabled.
-        #[allow(unused_variables)] can_compress: bool,
+        message: impl TryInto<Message, Error = impl Into<Error>>,


Small refactor here to avoid having to add an equivalent send_command_with_cancellation for pending connections.
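For readers unfamiliar with the pattern, here is a minimal sketch of how a parameter typed `impl TryInto<Message, Error = impl Into<Error>>` lets one method accept both a ready-made message and something that still needs converting. `Message`, `Command`, `Error`, and `send_message` below are simplified stand-ins invented for illustration; they are not the driver's actual definitions.

```rust
use std::convert::{Infallible, TryFrom, TryInto};

// Hypothetical stand-ins for the driver's types; the real ones are richer.
#[derive(Debug, PartialEq)]
struct Message {
    payload: String,
}

#[derive(Debug, PartialEq)]
struct Error(String);

struct Command {
    name: String,
}

// Passing a `Message` directly goes through the blanket `TryFrom` impl, whose
// error type is `Infallible`, so `Error` must be constructible from it.
impl From<Infallible> for Error {
    fn from(never: Infallible) -> Self {
        match never {}
    }
}

// Converting a command into a message can fail (here: an empty name).
impl TryFrom<Command> for Message {
    type Error = Error;

    fn try_from(command: Command) -> Result<Self, Self::Error> {
        if command.name.is_empty() {
            return Err(Error("empty command name".to_string()));
        }
        Ok(Message {
            payload: format!("cmd:{}", command.name),
        })
    }
}

// One entry point accepts anything convertible into a `Message`.
fn send_message(
    message: impl TryInto<Message, Error = impl Into<Error>>,
) -> Result<String, Error> {
    // Name the target error type explicitly so inference is unambiguous.
    let message: Message = message.try_into().map_err(Into::<Error>::into)?;
    Ok(format!("sent {}", message.payload))
}

fn main() {
    let from_command = send_message(Command { name: "ping".to_string() });
    let from_message = send_message(Message { payload: "raw".to_string() });
    println!("{:?} {:?}", from_command, from_message);
}
```

With this shape, a caller that already holds a message and a caller that holds a command share the same method, which is presumably what removes the need for a second cancellation-aware variant.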

isabelatkinson · 2024-11-20T21:13:37Z

src/cmap/conn.rs

+            // A lagged error indicates that more heartbeats failed than the channel's capacity
+            // between checking out this connection and executing the operation. If this occurs,
+            // then proceed with cancelling the operation. RecvError::Closed can be ignored, as
+            // the sender (and by extension the connection pool) dropping does not indicate that
+            // the operation should be cancelled.


The lagged scenario I outlined here will probably never actually happen. The lifetime of a receiver is intentionally very short so that only relevant messages are received. The following would need to occur:

1. A connection is checked out for an operation, which constructs a new receiver. This receiver will only receive messages sent after its construction.

2. The execution path proceeds with the remaining steps between checkout and actually sending the message, which is primarily building the command. In the meantime...

3. An SDAM heartbeat times out, leading to a pool clear and a message being stored in the channel.

4. Another SDAM heartbeat times out after waiting the full heartbeat interval, and another cancellation message is sent.

5. The slow command construction finishes and send_message_with_cancellation is called; recv below immediately returns a lagged error because the receiver has two unseen messages from the two heartbeat timeouts, which exceeds the channel's capacity. In this case we still want to proceed with cancellation.

These receivers act somewhat like oneshots: they're created fresh for each checked-out connection and only call recv once (i.e. on the line below), so the important thing here is to determine whether anything was sent during their lifetime.
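The sequence above can be sketched with a small counter-based model. This is a deliberately simplified, hypothetical stand-in for a bounded broadcast channel of `()` messages, not tokio's implementation; the `Sender`, `Receiver`, `TryRecv`, and `should_cancel` names are invented here, and the capacity of 1 is an assumption inferred from "two unseen messages ... exceeds the channel's capacity". The closed-sender case, which the code comment says can be ignored, is not modeled.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

// Hypothetical model: only counts messages, since the payload is `()`.
struct Sender {
    sent: Arc<AtomicU64>,
    capacity: u64,
}

struct Receiver {
    sent: Arc<AtomicU64>,
    capacity: u64,
    next: u64, // index of the next message this receiver expects
}

#[derive(Debug, PartialEq)]
enum TryRecv {
    Message,     // a cancellation message was waiting
    Lagged(u64), // this many messages were overwritten before being read
    Empty,       // nothing was sent during the receiver's lifetime
}

impl Sender {
    fn new(capacity: u64) -> Self {
        Sender {
            sent: Arc::new(AtomicU64::new(0)),
            capacity,
        }
    }

    // Called at checkout: the receiver only sees messages sent afterwards.
    fn subscribe(&self) -> Receiver {
        Receiver {
            sent: Arc::clone(&self.sent),
            capacity: self.capacity,
            next: self.sent.load(Ordering::SeqCst),
        }
    }

    // Called when a heartbeat timeout clears the pool.
    fn send(&self) {
        self.sent.fetch_add(1, Ordering::SeqCst);
    }
}

impl Receiver {
    fn try_recv(&mut self) -> TryRecv {
        let sent = self.sent.load(Ordering::SeqCst);
        let unseen = sent - self.next;
        if unseen > self.capacity {
            // Older messages were overwritten; skip ahead and report the gap.
            let missed = unseen - self.capacity;
            self.next = sent - self.capacity;
            TryRecv::Lagged(missed)
        } else if unseen > 0 {
            self.next += 1;
            TryRecv::Message
        } else {
            TryRecv::Empty
        }
    }
}

// A delivered message and a lag both mean "something was sent while this
// connection was checked out", so both cancel the operation.
fn should_cancel(outcome: &TryRecv) -> bool {
    matches!(outcome, TryRecv::Message | TryRecv::Lagged(_))
}

fn main() {
    let pool = Sender::new(1); // assumed capacity
    let mut receiver = pool.subscribe(); // connection checked out
    pool.send(); // first heartbeat timeout
    pool.send(); // second heartbeat timeout
    let outcome = receiver.try_recv(); // slow command is finally sent
    println!("{:?} -> cancel: {}", outcome, should_cancel(&outcome));
}
```

In this model a receiver created after both timeouts sees nothing at all, which matches the point that only messages sent during a receiver's lifetime matter.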

isabelatkinson · 2024-11-20T21:21:44Z

src/cmap/test.rs

+    "Pool clear SHOULD schedule the next background thread run immediately \
+     (interruptInUseConnections = false)",


This language and the test I've skipped above were added in the PR for the spec work for this ticket:

A pool SHOULD allow immediate scheduling of the next background thread iteration after a clear is performed.

However, this doesn't necessarily seem relevant to interrupting in-use connections, and the test above covers behavior when interruptInUseConnections is false. I'm also not sure what it would take to "schedule" maintenance with our existing pool design, given that maintenance already happens on an interval when no other requests have been sent to the pool. I'm inclined to file a separate ticket for implementing this behavior to keep the scope of this work small -- LMK if that sounds good to you.

Yup, that makes sense.

abr-egn

LGTM! This is much less invasive than any solution I could see.

abr-egn · 2024-11-22T17:01:14Z

src/cmap/conn.rs

+        message: impl TryInto<Message, Error = impl Into<Error>>,
+        cancellation_receiver: &mut broadcast::Receiver<()>,
+    ) -> Result<RawCommandResponse> {
+        tokio::select! {


Nit: should this have a biased; clause to make error behavior deterministic?
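For context on the nit: by default `tokio::select!` picks at random which branch to poll first, while `biased;` makes it poll branches in the order they are written. The sketch below models that fixed order using only the standard library; `poll_cancellation_first` and `Outcome` are hypothetical names, and placing the cancellation branch first is an assumption about the intended ordering rather than something stated in this thread.

```rust
use std::future::{ready, Future};
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

// Waker that does nothing; enough for a single manual poll.
struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

#[derive(Debug, PartialEq)]
enum Outcome {
    Cancelled,
    Completed(u32),
    Pending,
}

// Hypothetical helper: polls the cancellation future first, then the
// operation, mirroring the top-to-bottom order that `biased;` requests.
fn poll_cancellation_first<C, O>(cancellation: &mut C, operation: &mut O) -> Outcome
where
    C: Future<Output = ()> + Unpin,
    O: Future<Output = u32> + Unpin,
{
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);

    // The first branch gets the first chance to complete.
    if let Poll::Ready(()) = Pin::new(cancellation).poll(&mut cx) {
        return Outcome::Cancelled;
    }
    match Pin::new(operation).poll(&mut cx) {
        Poll::Ready(value) => Outcome::Completed(value),
        Poll::Pending => Outcome::Pending,
    }
}

fn main() {
    // Both branches are ready at once: the cancellation branch always wins.
    let both_ready = poll_cancellation_first(&mut ready(()), &mut ready(7));
    println!("{:?}", both_ready);
}
```

With a random polling order, the case where both branches are ready at the same time could resolve either way from run to run; a fixed order makes the resulting error behavior repeatable.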

isabelatkinson added 3 commits November 20, 2024 13:52

cancellation 243cf65

cleanup 5c9926c

undo move 5f8fe40

isabelatkinson commented Nov 21, 2024


isabelatkinson marked this pull request as ready for review November 21, 2024 19:58

isabelatkinson requested a review from abr-egn November 21, 2024 19:58

abr-egn approved these changes Nov 22, 2024


isabelatkinson added 3 commits November 22, 2024 10:33

biased d353d26

add todo for skipped test 7379113

skip flaky test c830104

isabelatkinson merged commit e3df089 into mongodb:main Nov 22, 2024
16 checks passed

isabelatkinson deleted the cancel-ops branch December 6, 2024 17:02
