Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RUST-1510 Implement connection pool tracing messages #766

Merged
merged 12 commits into from
Nov 7, 2022

Conversation

kmahar
Copy link
Contributor

@kmahar kmahar commented Oct 20, 2022

This is the Rust PR corresponding to the spec changes in mongodb/specifications#1324:

Summary:

  • Sync new logging tests and reorganize yml/JSON CMAP files
  • Emit tracing events corresponding to all existing CMAP events
    • Refactor connection pool code to pass around a new CmapEventEmitter type that emits both tracing events and the existing events
  • Support running close operations on client entities in the unified test runner (previously only supported cursors)

move cmap test files

wip

CMAP logging tests synced and passing

fix test building w/o tracing feature

sync test files
@kmahar kmahar marked this pull request as ready for review October 21, 2022 14:29
@kmahar kmahar requested review from a team and isabelatkinson and removed request for a team October 21, 2022 14:29
}

impl TracingRepresentation for ConnectionClosedReason {
fn tracing_representation(self) -> String {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like we're immediately calling as_str on the values returned from this method, so it would be nice to be able to return &'static str here to avoid the intermediary allocation of a String. Perhaps we could introduce an associated type with the TracingRepresentation trait to represent the return type of this method? That would allow the flexibility to return both owned strings and slices from this method.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice idea! I added an associated type and switched to using &'static str here and for ConnectionCheckoutFailedReason.

somewhat relatedly, I discovered that as of a recent tracing release (https://github.com/tokio-rs/tracing/releases/tag/tracing-core-0.1.28) adding the as_str calls to Strings when passing them to tracing::debug! is no longer necessary, so I have cleaned up some of the existing event emitting code where the TracingRepresentation::Representation type is a String to make it a bit less verbose.

match target_entity {
Entity::Client(client) => {
client.client = ClientEntityState::Dropped;
drop(entities);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see that we already did this in the existing impl but why do we drop the entire entities map here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you know, I'm not actually sure. I copied this and was thinking it would be necessary to make sure the reference count for the client we are dropping went to zero. but thinking about it more, since we mutate entities anyway I'm not actually sure why that would be the case as it shouldn't hold a reference anymore. the test seems to pass without it.
it looks like @patrickfreed added the drop(entities) for cursors via #712: https://github.com/mongodb/mongo-rust-driver/pull/712/files#diff-5a3e241f9938c810ad70bc17b905d8ed80e726e2548028ee7856ca0da36420f7R2093

Patrick, do you remember why that was needed?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I can't think of a reason either. This might have just been a mistake. If the tests pass without it, I'd say leave it out for cursors.

For clients, I think we do need it (or something similar) in order to drop all descendant collections / databases to ensure the client actually gets dropped.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you raise a good point regarding the descendant databases and collections keeping a client alive.

that said, re-reading this code, all drop(entities) does is drop the RwLockWriteGuard we acquired which gave exclusive write access to the entities. I'm not sure why that would have been needed for cursors , and I don't see why we would need it for clients, still but I'll try running a full patch with it removed for cursors to see if the test failures (or lack thereof) give any indication here.

in any case, I think we need to update the logic here to actually remove all descendant objects from the entity map so the client can be dropped, and I think I should update my spec PR to state that it is an error for a test to use any descendant object for an operation after its parent client is closed.
there is already a sentence to this effect re using the client which I can expand on:

Test files SHOULD NOT
specify any operations for a client entity following a close operation on it, as
driver behavior when an operation is attempted on a closed client is not consistent.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ohh, I see now. I think it was dropped to avoid holding it across await point in case it caused a deadlock, since the guard constitutes exclusive access. Probably would have been good if I left a comment about that.

in any case, I think we need to update the logic here to actually remove all descendant objects from the entity map so the client can be dropped

Sounds good 👍

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it was dropped to avoid holding it across await point in case it caused a deadlock, since the guard constitutes exclusive access.

could you explain a little more how this might happen? since (I think) we are always executing test operations serially I'm wondering in what case we would be trying to access the entity map while this operation is still executing.

I've added new logic here to go through the entity map and remove all descendant entities from the map. this doesn't get exercised now since the test using close only has a client entity, but should hopefully correctly future-proof us against any cases where a closed client has descendants in the future.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tests can run in parallel using the runOnThread operation introduced as part of the SDAM integration test port to the unified format. I'm not sure any test includes operations that would be problematic for this, but it's technically possible I think.

@kmahar kmahar requested a review from isabelatkinson October 27, 2022 15:38
@kmahar
Copy link
Contributor Author

kmahar commented Oct 27, 2022

also, just to make sure my refactoring and changing of serialization logic didn't mess anything up for the astrolabe integration I ran a patch there and it all passed.

Copy link
Contributor

@isabelatkinson isabelatkinson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good, just one more suggestion/question. gonna tag in the rest of the team

match target_entity {
Entity::Client(client) => {
client.client = ClientEntityState::Dropped;
drop(entities);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I can't think of a reason either. This might have just been a mistake. If the tests pass without it, I'd say leave it out for cursors.

For clients, I think we do need it (or something similar) in order to drop all descendant collections / databases to ensure the client actually gets dropped.

match target_entity {
Entity::Client(client) => {
client.client = ClientEntityState::Dropped;
drop(entities);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ohh, I see now. I think it was dropped to avoid holding it across await point in case it caused a deadlock, since the guard constitutes exclusive access. Probably would have been good if I left a comment about that.

in any case, I think we need to update the logic here to actually remove all descendant objects from the entity map so the client can be dropped

Sounds good 👍

Copy link
Contributor

@abr-egn abr-egn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good (modulo the entity map pruning), just a few minor comments.


#[derive(Clone, Debug)]
pub struct EventHandler {
pub events: Arc<RwLock<Vec<Event>>>,
channel_sender: tokio::sync::broadcast::Sender<Event>,
pub(crate) events: Arc<RwLock<Vec<CmapEvent>>>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like I've asked this in other contexts but I don't remember - in a test mod, is there any difference between pub and pub(crate)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in this case, I had to make the change because CmapEvent is defined in the driver and is pub(crate) there, so I wasn't allowed to leave this property as pub.

more generally, I think we decided at some point to use pub(crate) over pub going forward in tests, but I don't totally remember the justification. one reason pointed out in this Reddit thread seems like a good argument though: pub items do not get dead code warnings.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another thing is that pub things (even in tests) can't expose pub(crate) things from the driver. In practice this doesn't matter for us because our tests don't get compiled in to the actual library, but it will prevent compilation from succeeding.

}

let event = generate_event();
if let (Some(user_handler), Some(tracing_emitter)) =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Style nit: this might be a little more readable as a match, i.e.

match (&self.user_handler, tracing_emitter_to_use) {
  (Some(user_handler), Some(tracing_emitter)) => { ... }
  (Some(user_handler), None) => { ... }
  (None, Some(tracing_emitter)) => { ... }
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice, updated.

@@ -48,9 +48,19 @@ pub(crate) enum Entity {
None,
}

#[derive(Clone, Debug)]
pub(crate) enum ClientEntityState {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Style nit: personally, I'd just use Option for this and a comment about when it can be None rather than introducing a new type, but I don't feel super strongly about that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I did it this way because I initially didn't realize we had the Deref magic in place, and so I figured I was going to have to change a bunch of code to unwrap ClientEntity.client and that the enum would help make those more explicit. but, it turns out we only ever match on this in a couple places within this file 🙂 so I agree an option seems just fine here. switched

match target_entity {
Entity::Client(client) => {
client.client = ClientEntityState::Dropped;
drop(entities);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it was dropped to avoid holding it across await point in case it caused a deadlock, since the guard constitutes exclusive access.

could you explain a little more how this might happen? since (I think) we are always executing test operations serially I'm wondering in what case we would be trying to access the entity map while this operation is still executing.

I've added new logic here to go through the entity map and remove all descendant entities from the map. this doesn't get exercised now since the test using close only has a client entity, but should hopefully correctly future-proof us against any cases where a closed client has descendants in the future.

@@ -48,9 +48,19 @@ pub(crate) enum Entity {
None,
}

#[derive(Clone, Debug)]
pub(crate) enum ClientEntityState {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I did it this way because I initially didn't realize we had the Deref magic in place, and so I figured I was going to have to change a bunch of code to unwrap ClientEntity.client and that the enum would help make those more explicit. but, it turns out we only ever match on this in a couple places within this file 🙂 so I agree an option seems just fine here. switched

}

let event = generate_event();
if let (Some(user_handler), Some(tracing_emitter)) =
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice, updated.


#[derive(Clone, Debug)]
pub struct EventHandler {
pub events: Arc<RwLock<Vec<Event>>>,
channel_sender: tokio::sync::broadcast::Sender<Event>,
pub(crate) events: Arc<RwLock<Vec<CmapEvent>>>,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in this case, I had to make the change because CmapEvent is defined in the driver and is pub(crate) there, so I wasn't allowed to leave this property as pub.

more generally, I think we decided at some point to use pub(crate) over pub going forward in tests, but I don't totally remember the justification. one reason pointed out in this Reddit thread seems like a good argument though: pub items do not get dead code warnings.

/// Once the connection has received an error, it should not be used again or checked back
/// into a pool.
error: bool,
/// Stores a network error encountered while reading or writing. Once the connection has
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the old comment seems a little wrong, because we check connections that have errored back into their pools and then allow the pool to handle closing them.

/// Stores a network error encountered while reading or writing. Once the connection has
/// received an error, it should not be used again and will be closed upon check-in to the
/// pool.
error: Option<Error>,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

per conversation with @patrickfreed on the spec PR here we decided to add the errors that cause connections to be closed to "connection closed" tracing events when relevant, and similarly add the errors that cause checkout to fail to "connection checkout failed" tracing events. this required storing the error on the connection, so at the time the pool closes it we can access the error.

@@ -272,7 +275,9 @@ impl Connection {
_ => message.write_to(&mut self.stream).await,
};

self.error = write_result.is_err();
if let Err(ref err) = write_result {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Result.err() would be nice here, but it's still unstable API. https://doc.rust-lang.org/std/result/enum.Result.html#method.err


/// If the `reason` connection checkout failed was `Error`,the associated
/// error is contained here. This is attached so we can include it in log messages;
/// in future work we may add this to public API on the event itself. TODO: add
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@patrickfreed was going to file a DRIVERS ticket about drivers adding these errors to their connection monitoring events, so once there is a ticket to link to I'll add that here. I figured for now we could just leave it out from the public API in case the eventual spec work leads to some different proposal to expose the error.

ideally, we would probably actually add the error to the ConnectionClosedReason::Error enum case but that would be a breaking change.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ticket here: https://jira.mongodb.org/browse/DRIVERS-2495. Leaving it out until the RUST ticket for that sgtm

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added reference to DRIVERS-2495 here and below

/// ticket link here.
#[cfg(feature = "tracing-unstable")]
#[serde(skip)]
#[derivative(PartialEq = "ignore")]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we rely on this PartialEq implementation in the CMAP tests. but since Error does not implement PartialEq right now and this field is currently just here as an implementation detail for tracing it seemed like we could leave it out for now when comparing events in tests.

that said, looking more closely, I'm not sure the PartialEq implementation is even needed. When I remove it from this type, the only thing that breaks is that the CmapEvent enum can't derive PartialEq. And if I remove the derived impl from that, the only thing that breaks is this line of code:

assert_eq!(subscriber.all(filter), Vec::new(), "{}", description);

and IIUC all that line is doing is checking that the length of subscriber.all(filter) is 0??? so maybe we should just remove the PartialEq stuff altogether.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Our Serialize / Deserialize impls are mostly there for test convenience, but I think something like PartialEq might be functionality that a user could rely on, even if inadvertently like we are via a Derive. For that reason, I'd be hesitant to remove any of those implementations until 3.0, even if they're largely unnecessary.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh yeah, that's a good point. I wasn't thinking about this being a public type. I agree it seems best to keep it around then. filed RUST-1538 about considering a removal in 3.0

};
handler.handle_connection_closed_event(event);
}
error: Some(e.cause.clone()),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just extracted the cause here since EstablishError is a different type than Error. AFAICT the extra info stored on the EstablishError is what phase of the handshake the error happened during. I wasn't sure if that info was worth including here or more exists so the driver can decide what to do about the error. LMK what you think.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, that's just implementation detail stuff for SDAM error handling purposes. Using cause here SGTM.

Copy link
Contributor

@patrickfreed patrickfreed left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, just have one small question

Copy link
Contributor

@patrickfreed patrickfreed left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Copy link
Contributor

@isabelatkinson isabelatkinson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm!

Copy link
Contributor

@abr-egn abr-egn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@kmahar kmahar merged commit e619a0c into mongodb:main Nov 7, 2022
@kmahar kmahar deleted the RUST-1510/connection-logging branch November 7, 2022 22:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants