
RUST-360 Streaming monitoring protocol #721

Merged

Conversation

patrickfreed (Contributor, Author):

RUST-360

This PR updates the driver to make use of the streaming monitoring protocol (described here), which allows the driver to accept topology updates that are pushed to it rather than polling the server regularly. This should allow the driver to more quickly recover from failover events.
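For context, here is a conceptual sketch (not code from this PR) of the shape of the awaitable hello command that the streaming protocol relies on, per the SDAM spec: once the monitor has seen an initial topologyVersion, it sends a hello carrying that version plus maxAwaitTimeMS, and the server replies only when its state changes or the wait expires. The awaitable_hello helper name and the use of bson's doc! macro are illustrative assumptions.

use bson::{doc, Document};

/// Builds an awaitable `hello` command (illustrative only; field names per the SDAM spec).
fn awaitable_hello(topology_version: Document, max_await_time_ms: i64) -> Document {
    doc! {
        "hello": 1,
        // the last topologyVersion the monitor observed for this server
        "topologyVersion": topology_version,
        // how long the server may block before replying if nothing has changed
        "maxAwaitTimeMS": max_await_time_ms,
    }
}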

&self,
conn: &mut Connection,
topology: Option<&Topology>,
handler: &Option<Arc<dyn SdamEventHandler>>,
patrickfreed (Contributor, Author):
The monitors now handle emitting these events, so we don't need to pipe these event handlers down into the handshaker anymore.

Contributor:

nice, this may make things easier for SDAM logging too 😇

acknowledgment_receiver.wait_for_acknowledgment().await;
}
///
/// Since management requests are treated with the highest priority by the pool and will be
patrickfreed (Contributor, Author):

I updated the pool worker task to use a biased select!, which means that it will poll in a certain order. Specifically for this, I updated it to check for management requests (clearing, marking as ready, etc) before checking for check-out requests. This means that we don't actually have to wait for acknowledgment that the pool is ready before returning here, since we know that no check-out requests will be processed before this MarkAsReady is.
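For illustration, a minimal sketch of how a biased select! polls management requests ahead of check-out requests. The receiver names and request types here are hypothetical stand-ins, not the pool's real types.

use tokio::sync::mpsc;

enum ManagementRequest { Clear, MarkAsReady }
struct CheckoutRequest;

async fn worker_loop(
    mut management_rx: mpsc::UnboundedReceiver<ManagementRequest>,
    mut checkout_rx: mpsc::UnboundedReceiver<CheckoutRequest>,
) {
    loop {
        tokio::select! {
            // With `biased`, branches are polled top-to-bottom, so any pending
            // management request is handled before any check-out request.
            biased;

            Some(request) = management_rx.recv() => {
                // handle Clear / MarkAsReady, etc.
                let _ = request;
            }
            Some(request) = checkout_rx.recv() => {
                // check out a connection only once management requests have drained
                let _ = request;
            }
            else => break,
        }
    }
}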

patrickfreed (Contributor, Author):

As it turns out, this wasn't completely race-proof, so I reverted back to waiting for acknowledgment.

Ok(hello_reply)
}) {
Ok(hello_reply) => {
emit_event(topology, handler, |handler| {
patrickfreed (Contributor, Author):

this logic was moved into the monitor.

src/lib.rs Outdated
@@ -297,7 +297,8 @@
allow(
clippy::unreadable_literal,
clippy::cognitive_complexity,
clippy::float_cmp
clippy::float_cmp,
clippy::match_like_matches_macro
patrickfreed (Contributor, Author):

this lint forces you to write something like

match a {
    SomeEnum::B => false,
    _ => true
}

as

!matches!(a, SomeEnum::B)

which I personally find less readable (super easy to miss that first !). What do you all think?

@@ -18,24 +18,31 @@ impl WorkerHandle {
/// Listener used to determine when all handles have been dropped.
#[derive(Debug)]
pub(crate) struct WorkerHandleListener {
receiver: mpsc::Receiver<()>,
sender: watch::Sender<()>,
patrickfreed (Contributor, Author):

This was rewritten to use a watch channel which is more appropriate for this type and also allowed us to use &self everywhere instead of &mut self. The functionality is otherwise unchanged.

Contributor:

"which is more appropriate for this type"

could you explain the rationale for this? I've read the docs for both mpsc and watch and it's not super obvious to me why it makes more sense for the listener to be the "producer" rather than the "consumer" (the other way actually seems more intuitive to me)

patrickfreed (Contributor, Author):

Sorry, yeah this comment is pretty vague. The main value is to be able to use the "closing" functionality on the sender half without taking a mutable reference. Also, the channel isn't actually used for sending any values, so watch not needing to allocate a buffer or anything under the hood is nice (watch channels are basically glorified RwLock<T>s with nice async semantics).
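For reference, a minimal sketch of the watch-channel approach: handles hold receiver clones, the listener holds the sender, and Sender::closed() resolves once every receiver has been dropped, through &self. The struct and method names mirror the PR but the bodies are illustrative, not the driver's actual implementation.

use tokio::sync::watch;

#[derive(Debug, Clone)]
struct WorkerHandle {
    _receiver: watch::Receiver<()>,
}

#[derive(Debug)]
struct WorkerHandleListener {
    sender: watch::Sender<()>,
}

impl WorkerHandleListener {
    fn channel() -> (WorkerHandle, WorkerHandleListener) {
        let (sender, receiver) = watch::channel(());
        (
            WorkerHandle { _receiver: receiver },
            WorkerHandleListener { sender },
        )
    }

    /// Completes once every WorkerHandle (receiver clone) has been dropped.
    /// Unlike mpsc::Receiver::recv, this only needs &self.
    async fn wait_for_all_handles_to_drop(&self) {
        self.sender.closed().await;
    }
}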

Contributor:

ahh that all makes sense. thanks!

fn sync_hosts(&mut self, hosts: HashSet<ServerAddress>) -> bool {
let mut new_description = self.topology_description.clone();
new_description.sync_hosts(&hosts);
self.update_topology(new_description)
patrickfreed (Contributor, Author):

Everything that modifies the topology now goes through the following steps:

  1. clone the existing topology description
  2. modify it in some way
  3. pass it to update_topology

This cuts down on the one-off logic in these various methods and hopefully makes the worker easier to read / reason about.
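To illustrate, here is a toy sketch of that clone -> modify -> update_topology flow; the types and the update_topology body are simplified stand-ins, not the real worker.

use std::collections::HashSet;

type ServerAddress = String;

#[derive(Clone, PartialEq, Debug)]
struct TopologyDescription {
    hosts: HashSet<ServerAddress>,
}

impl TopologyDescription {
    fn sync_hosts(&mut self, hosts: &HashSet<ServerAddress>) {
        self.hosts = hosts.clone();
    }
}

struct TopologyWorker {
    topology_description: TopologyDescription,
}

impl TopologyWorker {
    /// 1. clone the current description, 2. modify the clone,
    /// 3. hand it to update_topology, which applies it if anything changed.
    fn sync_hosts(&mut self, hosts: HashSet<ServerAddress>) -> bool {
        let mut new_description = self.topology_description.clone();
        new_description.sync_hosts(&hosts);
        self.update_topology(new_description)
    }

    /// Toy stand-in: the real method also broadcasts the new description.
    fn update_topology(&mut self, new_description: TopologyDescription) -> bool {
        let changed = new_description != self.topology_description;
        if changed {
            self.topology_description = new_description;
        }
        changed
    }
}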

@@ -94,7 +94,13 @@ impl RunOnRequirement {
}
}
if let Some(ref topologies) = self.topologies {
if !topologies.contains(&client.topology().await) {
let client_topology = client.topology().await;
if !topologies.iter().any(|expected_topology| {
patrickfreed (Contributor, Author):

Per the spec, a "sharded" runOnRequirement should match both "sharded" and "sharded-replicaset" topologies. Since we were using direct equality, we may have been skipping some tests accidentally.
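A rough sketch of the relaxed matching, using a stand-in Topology enum rather than the test runner's actual type:

#[derive(Debug, PartialEq)]
enum Topology {
    Single,
    ReplicaSet,
    Sharded,
    ShardedReplicaSet,
    LoadBalanced,
}

fn requirement_matches(expected: &Topology, actual: &Topology) -> bool {
    match (expected, actual) {
        // per the spec, a "sharded" requirement also accepts "sharded-replicaset"
        (Topology::Sharded, Topology::ShardedReplicaSet) => true,
        _ => expected == actual,
    }
}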

@@ -505,9 +505,12 @@ impl EventSubscriber<'_> {
F: FnMut(&Event) -> bool,
{
let mut events = Vec::new();
while let Some(event) = self.wait_for_event(timeout, &mut filter).await {
patrickfreed (Contributor, Author):

This changes how this method works at a fundamental level, but I think in a way that makes much more sense. Before, it would collect events until an event was not received for the given timeout, so if there was a constant stream of events it would go on forever. The new version collects events for the given time and returns what it observed.
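A sketch of that collect-for-a-fixed-window behavior, assuming tokio and a stand-in broadcast receiver rather than the real EventSubscriber:

use std::time::Duration;
use tokio::{sync::broadcast, time::Instant};

/// Collects matching events for the full `timeout` window and then returns,
/// rather than resetting the clock after every received event (which could
/// loop forever under a constant stream of events).
async fn collect_events<T: Clone>(
    receiver: &mut broadcast::Receiver<T>,
    timeout: Duration,
    mut filter: impl FnMut(&T) -> bool,
) -> Vec<T> {
    let mut events = Vec::new();
    let deadline = Instant::now() + timeout;
    // stops at the deadline, or early if the channel closes
    while let Ok(Ok(event)) = tokio::time::timeout_at(deadline, receiver.recv()).await {
        if filter(&event) {
            events.push(event);
        }
    }
    events
}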

@@ -319,7 +228,21 @@ impl Topology {
#[derive(Debug, Clone)]
pub(crate) struct TopologyState {
pub(crate) description: TopologyDescription,
pub(crate) servers: HashMap<ServerAddress, Arc<Server>>,
servers: HashMap<ServerAddress, Weak<Server>>,
patrickfreed (Contributor, Author) on Aug 11, 2022:

Storing Arc<Server> here actually clones them for every receiver (aka TopologyWatcher), which led to a reference cycle wherein the Servers were never actually getting dropped because the monitors held watchers, but the monitors never exited because the Servers never dropped. Storing Weak pointers here fixes this.

One future area to explore would be to ditch Arc / Weak here altogether and just pass clones of Server around, but I didn't want to go too deep down the refactor rabbit hole.

2.3.0 (which has the Topology refactor) is not affected by this issue, but that's because the monitors exit once they detect the topology is dropped, which then causes the Servers to drop. They have no need for the Server to be alive in order to be dropped. This does highlight a separate issue, which is that monitors on 2.3.0 (and probably earlier) don't exit if a server is removed from the Topology. This is fixed now, however. (Filed RUST-1443 for it)
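A minimal sketch (stand-in types, not the driver's) of why Weak breaks the cycle: cloned topology states no longer keep servers alive, and lookups upgrade the pointer only while the server still exists.

use std::{
    collections::HashMap,
    sync::{Arc, Weak},
};

#[derive(Debug)]
struct Server {
    address: String,
}

#[derive(Debug, Clone)]
struct TopologyState {
    servers: HashMap<String, Weak<Server>>,
}

impl TopologyState {
    /// Returns the server if it is still alive; once the last Arc<Server> is
    /// dropped elsewhere, this yields None instead of keeping the server alive
    /// through every cloned state held by a watcher.
    fn server(&self, address: &str) -> Option<Arc<Server>> {
        self.servers.get(address).and_then(Weak::upgrade)
    }
}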

@@ -1,7 +1,7 @@
use std::{
patrickfreed (Contributor, Author):

This file has gotten kind of long and complicated, so once this PR is merged I think I'll go in and break it down into a few smaller ones. For the purposes of preserving the diff I didn't do that in this PR.

patrickfreed (Contributor, Author):

"I will say that I don't have a great mental model of how all of the components in the driver interact, so I think the parts of the code that involve coordination/communication between different components have been difficult for me to review and provide meaningful input on, especially in cases where things have been refactored.

"I'm going to try to spend some time reasoning through how it all fits together before doing another round of review, but in the meantime figured I would go ahead and give these comments."

Sounds good. And yeah, the various refactors + cross-component nature is part of what made this difficult from the implementation side too. Would be happy to do a walkthrough of some of the changes over Zoom if that would be helpful to reviewers. If not, no worries.

@@ -116,19 +112,7 @@ lazy_static! {
version: None,
},
platform: format!("{} with {}", rustc_version_runtime::version_meta().short_version_string, RUNTIME_NAME),
};

let info = os_info::get();
Contributor:

If we do need to reintroduce it, we could avoid the timeout issue by reading the lazy_static in client initialization.

patrickfreed requested a review from abr-egn on September 7, 2022 at 20:29.
abr-egn (Contributor):

LGTM!

/// Create a new `TlsConfig` from the provided options from the user.
/// This operation is expensive, so the resultant `TlsConfig` should be cached.
pub(crate) fn new(options: TlsOptions) -> Result<TlsConfig> {
let verify_hostname = options.allow_invalid_hostnames.unwrap_or(true);
Contributor:

I think this logic is wrong - if a user sets allow_invalid_hostnames to true, shouldn't verify_hostname be false?

patrickfreed (Contributor, Author):

Oh good catch, yeah it is. Updated to use a match to make the logic clearer here. As a side note, I'm a little concerned this didn't trigger any test failures. Filed RUST-1467 for introducing test coverage for this.
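A sketch of the corrected mapping (the exact match arms in the PR may differ slightly): allowing invalid hostnames turns hostname verification off, and the default keeps it on.

/// Whether to verify hostnames, given the user's allow_invalid_hostnames option.
fn verify_hostname(allow_invalid_hostnames: Option<bool>) -> bool {
    match allow_invalid_hostnames {
        // the user explicitly allowed invalid hostnames, so skip verification
        Some(true) => false,
        // verify by default or when invalid hostnames are explicitly disallowed
        Some(false) | None => true,
    }
}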

) -> Result<Self> {
let inner = AsyncTcpStream::connect(&address).await?;

// If there are TLS options, wrap the inner stream with rustls.
Contributor:

I know this is a copy-paste from below, but is this comment still accurate, or can this also be OpenSSL now?

patrickfreed (Contributor, Author):

fixed

kmahar (Contributor):

LGTM! great job with this 👏

isabelatkinson (Contributor):

looks like all the questions i had were asked & answered -- lgtm!
