RUST-1510 Implement connection pool tracing messages #766

kmahar · 2022-10-20T20:41:00Z

This is the Rust PR corresponding to the spec changes in mongodb/specifications#1324:

Summary:

- Sync new logging tests and reorganize yml/JSON CMAP files
- Emit tracing events corresponding to all existing CMAP events
  - Refactor connection pool code to pass around a new CmapEventEmitter type that emits both tracing events and the existing events
- Support running close operations on client entities in the unified test runner (previously only supported cursors)

isabelatkinson · 2022-10-25T16:40:04Z

src/trace/connection.rs

+}
+
+impl TracingRepresentation for ConnectionClosedReason {
+    fn tracing_representation(self) -> String {


It looks like we're immediately calling as_str on the values returned from this method, so it would be nice to be able to return &'static str here to avoid the intermediary allocation of a String. Perhaps we could introduce an associated type with the TracingRepresentation trait to represent the return type of this method? That would allow the flexibility to return both owned strings and slices from this method.

nice idea! I added an associated type and switched to using &'static str here and for ConnectionCheckoutFailedReason.

somewhat relatedly, I discovered that as of a recent tracing release (https://github.com/tokio-rs/tracing/releases/tag/tracing-core-0.1.28) adding the as_str calls to Strings when passing them to tracing::debug! is no longer necessary, so I have cleaned up some of the existing event emitting code where the TracingRepresentation::Representation type is a String to make it a bit less verbose.
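To make the associated-type suggestion concrete, here is a minimal sketch. The trait name, method name, and the `Representation` associated type come from this thread; the enum variants, the returned strings, and the `ServerAddress` type are hypothetical placeholders, not the driver's actual definitions.

```rust
/// Sketch: each implementor picks its own return type through an associated type.
trait TracingRepresentation {
    type Representation;

    fn tracing_representation(self) -> Self::Representation;
}

// Hypothetical variants and strings, for illustration only.
#[derive(Clone, Copy, Debug)]
enum ConnectionClosedReason {
    Stale,
    Idle,
    Error,
}

impl TracingRepresentation for ConnectionClosedReason {
    // A fixed set of names needs no allocation, so a static slice is enough.
    type Representation = &'static str;

    fn tracing_representation(self) -> &'static str {
        match self {
            ConnectionClosedReason::Stale => "stale",
            ConnectionClosedReason::Idle => "idle",
            ConnectionClosedReason::Error => "error",
        }
    }
}

// Hypothetical type whose representation has to be built at runtime.
struct ServerAddress {
    host: String,
    port: u16,
}

impl TracingRepresentation for ServerAddress {
    // Here an owned String is still the natural choice.
    type Representation = String;

    fn tracing_representation(self) -> String {
        format!("{}:{}", self.host, self.port)
    }
}
```

With this shape, callers that only need a borrowed name avoid the intermediate `String`, while implementors that must format a value keep returning an owned one.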

isabelatkinson · 2022-10-25T16:44:40Z

src/test/spec/unified_runner/operation.rs

+            match target_entity {
+                Entity::Client(client) => {
+                    client.client = ClientEntityState::Dropped;
+                    drop(entities);


I see that we already did this in the existing impl but why do we drop the entire entities map here?

you know, I'm not actually sure. I copied this and was thinking it would be necessary to make sure the reference count for the client we are dropping went to zero. but thinking about it more, since we mutate entities anyway I'm not actually sure why that would be the case as it shouldn't hold a reference anymore. the test seems to pass without it.
it looks like @patrickfreed added the drop(entities) for cursors via #712: https://github.com/mongodb/mongo-rust-driver/pull/712/files#diff-5a3e241f9938c810ad70bc17b905d8ed80e726e2548028ee7856ca0da36420f7R2093

Patrick, do you remember why that was needed?

Hmm, I can't think of a reason either. This might have just been a mistake. If the tests pass without it, I'd say leave it out for cursors.

For clients, I think we do need it (or something similar) in order to drop all descendant collections / databases to ensure the client actually gets dropped.

you raise a good point regarding the descendant databases and collections keeping a client alive.

that said, re-reading this code, all drop(entities) does is drop the RwLockWriteGuard we acquired, which gave exclusive write access to the entities. I'm not sure why that would have been needed for cursors, and I still don't see why we would need it for clients, but I'll try running a full patch with it removed for cursors to see if the test failures (or lack thereof) give any indication here.

in any case, I think we need to update the logic here to actually remove all descendant objects from the entity map so the client can be dropped, and I think I should update my spec PR to state that it is an error for a test to use any descendant object for an operation after its parent client is closed.
there is already a sentence to this effect re using the client which I can expand on:

Test files SHOULD NOT specify any operations for a client entity following a close operation on it, as driver behavior when an operation is attempted on a closed client is not consistent.

Ohh, I see now. I think it was dropped to avoid holding it across an await point in case it caused a deadlock, since the guard constitutes exclusive access. Probably would have been good if I left a comment about that.

in any case, I think we need to update the logic here to actually remove all descendant objects from the entity map so the client can be dropped

Sounds good 👍
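The effect of dropping the guard early can be sketched with the standard library alone. This is an illustration under assumptions: the test runner's entity map presumably uses an async lock, whereas the sketch uses `std::sync::RwLock` as a stand-in, and the `guard_demo` function and map contents are hypothetical.

```rust
use std::collections::HashMap;
use std::sync::RwLock;

/// Returns whether the lock could be acquired (a) while the write guard was
/// still alive and (b) after the guard was explicitly dropped.
fn guard_demo() -> (bool, bool) {
    let entities: RwLock<HashMap<String, String>> = RwLock::new(HashMap::new());

    // Acquire exclusive access, as the close operation does.
    let mut guard = entities.write().unwrap();
    guard.insert("client0".to_string(), "client".to_string());

    // While the guard is alive nobody else can lock the map.
    let available_while_held = entities.try_write().is_ok();

    // Dropping the guard releases exclusive access immediately instead of
    // at the end of the enclosing scope.
    drop(guard);
    let available_after_drop = entities.try_write().is_ok();

    (available_while_held, available_after_drop)
}
```

In async code the same idea applies before an `.await`: releasing the guard first means another task that needs the map is not blocked while this one is suspended.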

I think it was dropped to avoid holding it across an await point in case it caused a deadlock, since the guard constitutes exclusive access.

could you explain a little more how this might happen? since (I think) we are always executing test operations serially I'm wondering in what case we would be trying to access the entity map while this operation is still executing.

I've added new logic here to go through the entity map and remove all descendant entities from the map. this doesn't get exercised now since the test using close only has a client entity, but should hopefully correctly future-proof us against any cases where a closed client has descendants in the future.

Tests can run in parallel using the runOnThread operation introduced as part of the SDAM integration test port to the unified format. I'm not sure any test includes operations that would be problematic for this, but it's technically possible I think.
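Removing the descendants of a closed client could look roughly like the sketch below. The `Entity` variants, their parent-id fields, and the `close_client` function are hypothetical simplifications; the real entity map stores driver handles rather than ids, so this only illustrates the pruning logic.

```rust
use std::collections::HashMap;

// Hypothetical, simplified entities that only record their parent's id.
#[derive(Debug, PartialEq)]
enum Entity {
    Client,
    Database { client_id: String },
    Collection { database_id: String },
}

/// Removes the client and every database and collection created from it, so
/// that no remaining entity keeps the client alive.
fn close_client(entities: &mut HashMap<String, Entity>, client_id: &str) {
    entities.remove(client_id);

    // Collect the ids of databases that belong to the closed client.
    let database_ids: Vec<String> = entities
        .iter()
        .filter_map(|(id, entity)| match entity {
            Entity::Database { client_id: parent } if parent.as_str() == client_id => {
                Some(id.clone())
            }
            _ => None,
        })
        .collect();

    // Keep only entities that are not one of those databases or a collection
    // created from one of them.
    entities.retain(|id, entity| match &*entity {
        Entity::Database { .. } => !database_ids.contains(id),
        Entity::Collection { database_id } => !database_ids.contains(database_id),
        Entity::Client => true,
    });
}
```

Entities that belong to other clients are left untouched, which matters once a test defines more than one client.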

kmahar · 2022-10-27T15:39:57Z

also, just to make sure my refactoring and changing of serialization logic didn't mess anything up for the astrolabe integration I ran a patch there and it all passed.

isabelatkinson

looks good, just one more suggestion/question. gonna tag in the rest of the team

abr-egn

Looks good (modulo the entity map pruning), just a few minor comments.

abr-egn · 2022-11-02T17:23:20Z

src/cmap/test/event.rs


 #[derive(Clone, Debug)]
 pub struct EventHandler {
-    pub events: Arc<RwLock<Vec<Event>>>,
-    channel_sender: tokio::sync::broadcast::Sender<Event>,
+    pub(crate) events: Arc<RwLock<Vec<CmapEvent>>>,


I feel like I've asked this in other contexts but I don't remember - in a test mod, is there any difference between pub and pub(crate)?

in this case, I had to make the change because CmapEvent is defined in the driver and is pub(crate) there, so I wasn't allowed to leave this property as pub.

more generally, I think we decided at some point to use pub(crate) over pub going forward in tests, but I don't totally remember the justification. one reason pointed out in this Reddit thread seems like a good argument though: pub items do not get dead code warnings.

Another thing is that pub things (even in tests) can't expose pub(crate) things from the driver. In practice this doesn't matter for us because our tests don't get compiled into the actual library, but it will prevent compilation from succeeding.

abr-egn · 2022-11-02T17:42:57Z

src/event/cmap.rs

+        }
+
+        let event = generate_event();
+        if let (Some(user_handler), Some(tracing_emitter)) =


Style nit: this might be a little more readable as a match, i.e.

    match (&self.user_handler, tracing_emitter_to_use) {
        (Some(user_handler), Some(tracing_emitter)) => { ... }
        (Some(user_handler), None) => { ... }
        (None, Some(tracing_emitter)) => { ... }
    }

nice, updated.
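A self-contained sketch of the match-based dispatch follows. The `Event` type, the `emit` function, and the use of `Vec`s as stand-ins for the user handler and the tracing emitter are all assumptions for illustration; only the overall shape (matching on a pair of `Option`s and generating the event lazily) reflects the discussion above.

```rust
// Hypothetical event type for illustration.
#[derive(Clone, Debug, PartialEq)]
struct Event(String);

/// Delivers an event to whichever sinks are present. The event is generated
/// only when at least one sink exists, and cloned only when both do.
fn emit(
    user_handler: Option<&mut Vec<Event>>,
    tracing_emitter: Option<&mut Vec<String>>,
    generate_event: impl FnOnce() -> Event,
) {
    match (user_handler, tracing_emitter) {
        (Some(handler), Some(tracer)) => {
            let event = generate_event();
            tracer.push(event.0.clone());
            handler.push(event);
        }
        (Some(handler), None) => handler.push(generate_event()),
        (None, Some(tracer)) => tracer.push(generate_event().0),
        // Nothing is listening, so the event is never constructed.
        (None, None) => {}
    }
}
```

Compared with a chain of `if let` / `else if let`, the match makes every combination visible and lets the compiler check that none is forgotten.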

abr-egn · 2022-11-02T17:47:43Z

src/test/spec/unified_runner/entity.rs

@@ -48,9 +48,19 @@ pub(crate) enum Entity {
    None,
 }

+#[derive(Clone, Debug)]
+pub(crate) enum ClientEntityState {


Style nit: personally, I'd just use Option for this and a comment about when it can be None rather than introducing a new type, but I don't feel super strongly about that.

I think I did it this way because I initially didn't realize we had the Deref magic in place, and so I figured I was going to have to change a bunch of code to unwrap ClientEntity.client and that the enum would help make those more explicit. but, it turns out we only ever match on this in a couple places within this file 🙂 so I agree an option seems just fine here. switched
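The Option-based alternative combined with `Deref` might look like the sketch below. The `Client` stand-in, its `name` field, and the choice to panic when a closed entity is dereferenced are assumptions made for illustration; the thread does not say how the runner's `Deref` impl treats a closed client.

```rust
use std::ops::Deref;

// Hypothetical stand-in for the driver's client handle.
#[derive(Debug)]
struct Client {
    name: String,
}

struct ClientEntity {
    /// `None` once a `close` operation has been run against this entity.
    client: Option<Client>,
}

impl Deref for ClientEntity {
    type Target = Client;

    // Assumption: using a closed client is a test bug, so panic loudly.
    fn deref(&self) -> &Client {
        self.client.as_ref().expect("client entity was already closed")
    }
}

impl ClientEntity {
    /// Drops the underlying client while keeping the entity itself around.
    fn close(&mut self) {
        self.client = None;
    }

    fn is_closed(&self) -> bool {
        self.client.is_none()
    }
}
```

Because of `Deref`, existing call sites that use the entity as if it were a client keep compiling, and only the few places that care about the closed state need to look at the `Option`.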

kmahar · 2022-11-03T21:58:07Z

src/cmap/conn/mod.rs

-    /// Once the connection has received an error, it should not be used again or checked back
-    /// into a pool.
-    error: bool,
+    /// Stores a network error encountered while reading or writing. Once the connection has


the old comment seems a little wrong, because we check connections that have errored back into their pools and then allow the pool to handle closing them.

kmahar · 2022-11-03T21:58:42Z

src/cmap/conn/mod.rs

+    /// Stores a network error encountered while reading or writing. Once the connection has
+    /// received an error, it should not be used again and will be closed upon check-in to the
+    /// pool.
+    error: Option<Error>,


per conversation with @patrickfreed on the spec PR here we decided to add the errors that cause connections to be closed to "connection closed" tracing events when relevant, and similarly add the errors that cause checkout to fail to "connection checkout failed" tracing events. this required storing the error on the connection, so at the time the pool closes it we can access the error.

kmahar · 2022-11-03T21:59:40Z

src/cmap/conn/mod.rs

@@ -272,7 +275,9 @@ impl Connection {
            _ => message.write_to(&mut self.stream).await,
        };

-        self.error = write_result.is_err();
+        if let Err(ref err) = write_result {


Result::inspect_err would be nice here, but it's still unstable API. https://doc.rust-lang.org/std/result/enum.Result.html#method.inspect_err
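The pattern in this diff, remembering the error on the connection so the pool can report it when it closes the connection, can be sketched as follows. The `Error` and `Connection` types and the `record` and `check_in` functions are simplified, hypothetical stand-ins for the driver's own.

```rust
// Hypothetical clonable error type for illustration.
#[derive(Clone, Debug, PartialEq)]
struct Error(String);

struct Connection {
    /// Set once a read or write fails; the connection should not be reused.
    error: Option<Error>,
}

impl Connection {
    /// Stores a copy of the error, if any, and passes the result through.
    fn record<T>(&mut self, result: Result<T, Error>) -> Result<T, Error> {
        if let Err(ref err) = result {
            self.error = Some(err.clone());
        }
        result
    }
}

/// Pool-side sketch: at check-in, the stored error (if any) is what would be
/// attached to the "connection closed" tracing event.
fn check_in(conn: Connection) -> Option<Error> {
    conn.error
}
```

Borrowing with `ref err` leaves `result` intact, so the caller still receives the original `Result` after the error has been copied onto the connection.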

kmahar · 2022-11-03T22:04:06Z

src/event/cmap.rs

+
+    /// If the `reason` connection checkout failed was `Error`, the associated
+    /// error is contained here. This is attached so we can include it in log messages;
+    /// in future work we may add this to public API on the event itself. TODO: add


@patrickfreed was going to file a DRIVERS ticket about drivers adding these errors to their connection monitoring events, so once there is a ticket to link to I'll add that here. I figured for now we could just leave it out from the public API in case the eventual spec work leads to some different proposal to expose the error.

ideally, we would probably actually add the error to the ConnectionClosedReason::Error enum case but that would be a breaking change.

Ticket here: https://jira.mongodb.org/browse/DRIVERS-2495. Leaving it out until the RUST ticket for that sgtm

added reference to DRIVERS-2495 here and below

kmahar · 2022-11-03T22:11:39Z

src/event/cmap.rs

+    /// ticket link here.
+    #[cfg(feature = "tracing-unstable")]
+    #[serde(skip)]
+    #[derivative(PartialEq = "ignore")]


we rely on this PartialEq implementation in the CMAP tests. but since Error does not implement PartialEq right now and this field is currently just here as an implementation detail for tracing it seemed like we could leave it out for now when comparing events in tests.

that said, looking more closely, I'm not sure the PartialEq implementation is even needed. When I remove it from this type, the only thing that breaks is that the CmapEvent enum can't derive PartialEq. And if I remove the derived impl from that, the only thing that breaks is this line of code:

mongo-rust-driver/src/cmap/test/mod.rs

Line 224 in 6e3d43d

assert_eq!(subscriber.all(filter), Vec::new(), "{}", description);

and IIUC all that line is doing is checking that the length of subscriber.all(filter) is 0??? so maybe we should just remove the PartialEq stuff altogether.

Our Serialize / Deserialize impls are mostly there for test convenience, but I think something like PartialEq might be functionality that a user could rely on, even if inadvertently like we are via a Derive. For that reason, I'd be hesitant to remove any of those implementations until 3.0, even if they're largely unnecessary.

oh yeah, that's a good point. I wasn't thinking about this being a public type. I agree it seems best to keep it around then. filed RUST-1538 about considering a removal in 3.0
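For readers unfamiliar with the `derivative` attribute, the effect of ignoring a field in `PartialEq` is roughly equivalent to the hand-written impl below. The struct and its fields are hypothetical simplifications of the event type, and the error is modelled as an `Option<String>` only to keep the sketch dependency-free.

```rust
// Hypothetical, simplified event type for illustration.
#[derive(Clone, Debug)]
struct ConnectionClosedEvent {
    connection_id: u32,
    reason: String,
    /// Implementation detail for tracing; deliberately excluded from equality.
    error: Option<String>,
}

impl PartialEq for ConnectionClosedEvent {
    fn eq(&self, other: &Self) -> bool {
        // Compare every field except `error`.
        self.connection_id == other.connection_id && self.reason == other.reason
    }
}
```

Two events that differ only in the attached error therefore still compare equal, which keeps existing test expectations (and any user code relying on `PartialEq`) working.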

kmahar · 2022-11-03T22:16:50Z

src/cmap/worker.rs

-                };
-                handler.handle_connection_closed_event(event);
-            }
+                    error: Some(e.cause.clone()),


I just extracted the cause here since EstablishError is a different type than Error. AFAICT the extra info stored on the EstablishError is what phase of the handshake the error happened during. I wasn't sure if that info was worth including here, or if it exists more so the driver can decide what to do about the error. LMK what you think.

Yep, that's just implementation detail stuff for SDAM error handling purposes. Using cause here SGTM.

patrickfreed

Looks good, just have one small question

patrickfreed

LGTM!

isabelatkinson

lgtm!

abr-egn

LGTM!

kmahar added 2 commits October 20, 2022 16:17

CMAP logging

1182362

move cmap test files; wip; CMAP logging tests synced and passing; fix test building w/o tracing feature; sync test files

format, clippy

8bc5d96

kmahar commented Oct 20, 2022

kmahar added 2 commits October 20, 2022 17:30

sync tests

33f3520

reformat

56c6e37

kmahar marked this pull request as ready for review October 21, 2022 14:29

kmahar requested review from a team and isabelatkinson and removed request for a team October 21, 2022 14:29

isabelatkinson reviewed Oct 25, 2022

kmahar added 2 commits October 26, 2022 18:48

address Isabel comments

55d02e5

remove separate CMAP EventSubscriber type

cbea674

kmahar commented Oct 27, 2022

kmahar requested a review from isabelatkinson October 27, 2022 15:38

isabelatkinson reviewed Oct 27, 2022

isabelatkinson requested review from patrickfreed and abr-egn October 27, 2022 20:34

patrickfreed reviewed Oct 28, 2022

sync spec tests, fill in default port when it is being used

a6419a4

patrickfreed reviewed Nov 1, 2022

abr-egn reviewed Nov 2, 2022

kmahar added 2 commits November 2, 2022 14:40

add logic to remove all entities descended from a closed client

21f2d4a

address code review comments and add errors to relevant events

88269c9

kmahar commented Nov 3, 2022

kmahar added 2 commits November 3, 2022 18:19

always import derivative

e3d2dde

fix compilation without tracing-unstable feature

33fdc16

kmahar requested review from isabelatkinson and abr-egn November 4, 2022 17:16

kmahar requested a review from patrickfreed November 4, 2022 17:16

isabelatkinson mentioned this pull request Nov 7, 2022

RUST-1387 Execute CSFLE unified spec tests #770

Merged

patrickfreed reviewed Nov 7, 2022

patrickfreed approved these changes Nov 7, 2022

add ticket link

bd38746

isabelatkinson approved these changes Nov 7, 2022

abr-egn approved these changes Nov 7, 2022

kmahar merged commit e619a0c into mongodb:main Nov 7, 2022

kmahar deleted the RUST-1510/connection-logging branch November 7, 2022 22:04
