-
Notifications
You must be signed in to change notification settings - Fork 173
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RUST-1510 Implement connection pool tracing messages #766
Conversation
move cmap test files wip CMAP logging tests synced and passing fix test building w/o tracing feature sync test files
src/trace/connection.rs
Outdated
} | ||
|
||
impl TracingRepresentation for ConnectionClosedReason { | ||
fn tracing_representation(self) -> String { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like we're immediately calling as_str
on the values returned from this method, so it would be nice to be able to return &'static str
here to avoid the intermediary allocation of a String
. Perhaps we could introduce an associated type with the TracingRepresentation
trait to represent the return type of this method? That would allow the flexibility to return both owned strings and slices from this method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice idea! I added an associated type and switched to using &'static str
here and for ConnectionCheckoutFailedReason
.
somewhat relatedly, I discovered that as of a recent tracing release (https://github.com/tokio-rs/tracing/releases/tag/tracing-core-0.1.28) adding the as_str
calls to String
s when passing them to tracing::debug!
is no longer necessary, so I have cleaned up some of the existing event emitting code where the TracingRepresentation::Representation
type is a String
to make it a bit less verbose.
match target_entity { | ||
Entity::Client(client) => { | ||
client.client = ClientEntityState::Dropped; | ||
drop(entities); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see that we already did this in the existing impl but why do we drop the entire entities map here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you know, I'm not actually sure. I copied this and was thinking it would be necessary to make sure the reference count for the client we are dropping went to zero. but thinking about it more, since we mutate entities
anyway I'm not actually sure why that would be the case as it shouldn't hold a reference anymore. the test seems to pass without it.
it looks like @patrickfreed added the drop(entities)
for cursors via #712: https://github.com/mongodb/mongo-rust-driver/pull/712/files#diff-5a3e241f9938c810ad70bc17b905d8ed80e726e2548028ee7856ca0da36420f7R2093
Patrick, do you remember why that was needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I can't think of a reason either. This might have just been a mistake. If the tests pass without it, I'd say leave it out for cursors.
For clients, I think we do need it (or something similar) in order to drop all descendant collections / databases to ensure the client actually gets dropped.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you raise a good point regarding the descendant databases and collections keeping a client alive.
that said, re-reading this code, all drop(entities)
does is drop the RwLockWriteGuard
we acquired which gave exclusive write access to the entities. I'm not sure why that would have been needed for cursors , and I don't see why we would need it for clients, still but I'll try running a full patch with it removed for cursors to see if the test failures (or lack thereof) give any indication here.
in any case, I think we need to update the logic here to actually remove all descendant objects from the entity map so the client can be dropped, and I think I should update my spec PR to state that it is an error for a test to use any descendant object for an operation after its parent client is closed.
there is already a sentence to this effect re using the client which I can expand on:
Test files SHOULD NOT
specify any operations for a client entity following aclose
operation on it, as
driver behavior when an operation is attempted on a closed client is not consistent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ohh, I see now. I think it was dropped to avoid holding it across await point in case it caused a deadlock, since the guard constitutes exclusive access. Probably would have been good if I left a comment about that.
in any case, I think we need to update the logic here to actually remove all descendant objects from the entity map so the client can be dropped
Sounds good 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it was dropped to avoid holding it across await point in case it caused a deadlock, since the guard constitutes exclusive access.
could you explain a little more how this might happen? since (I think) we are always executing test operations serially I'm wondering in what case we would be trying to access the entity map while this operation is still executing.
I've added new logic here to go through the entity map and remove all descendant entities from the map. this doesn't get exercised now since the test using close
only has a client entity, but should hopefully correctly future-proof us against any cases where a closed client has descendants in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tests can run in parallel using the runOnThread
operation introduced as part of the SDAM integration test port to the unified format. I'm not sure any test includes operations that would be problematic for this, but it's technically possible I think.
also, just to make sure my refactoring and changing of serialization logic didn't mess anything up for the astrolabe integration I ran a patch there and it all passed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good, just one more suggestion/question. gonna tag in the rest of the team
match target_entity { | ||
Entity::Client(client) => { | ||
client.client = ClientEntityState::Dropped; | ||
drop(entities); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I can't think of a reason either. This might have just been a mistake. If the tests pass without it, I'd say leave it out for cursors.
For clients, I think we do need it (or something similar) in order to drop all descendant collections / databases to ensure the client actually gets dropped.
match target_entity { | ||
Entity::Client(client) => { | ||
client.client = ClientEntityState::Dropped; | ||
drop(entities); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ohh, I see now. I think it was dropped to avoid holding it across await point in case it caused a deadlock, since the guard constitutes exclusive access. Probably would have been good if I left a comment about that.
in any case, I think we need to update the logic here to actually remove all descendant objects from the entity map so the client can be dropped
Sounds good 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good (modulo the entity map pruning), just a few minor comments.
|
||
#[derive(Clone, Debug)] | ||
pub struct EventHandler { | ||
pub events: Arc<RwLock<Vec<Event>>>, | ||
channel_sender: tokio::sync::broadcast::Sender<Event>, | ||
pub(crate) events: Arc<RwLock<Vec<CmapEvent>>>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like I've asked this in other contexts but I don't remember - in a test mod, is there any difference between pub
and pub(crate)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in this case, I had to make the change because CmapEvent
is defined in the driver and is pub(crate)
there, so I wasn't allowed to leave this property as pub
.
more generally, I think we decided at some point to use pub(crate)
over pub
going forward in tests, but I don't totally remember the justification. one reason pointed out in this Reddit thread seems like a good argument though: pub
items do not get dead code warnings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another thing is that pub
things (even in tests) can't expose pub(crate)
things from the driver. In practice this doesn't matter for us because our tests don't get compiled in to the actual library, but it will prevent compilation from succeeding.
src/event/cmap.rs
Outdated
} | ||
|
||
let event = generate_event(); | ||
if let (Some(user_handler), Some(tracing_emitter)) = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Style nit: this might be a little more readable as a match, i.e.
match (&self.user_handler, tracing_emitter_to_use) {
(Some(user_handler), Some(tracing_emitter)) => { ... }
(Some(user_handler), None) => { ... }
(None, Some(tracing_emitter)) => { ... }
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice, updated.
@@ -48,9 +48,19 @@ pub(crate) enum Entity { | |||
None, | |||
} | |||
|
|||
#[derive(Clone, Debug)] | |||
pub(crate) enum ClientEntityState { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Style nit: personally, I'd just use Option
for this and a comment about when it can be None
rather than introducing a new type, but I don't feel super strongly about that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I did it this way because I initially didn't realize we had the Deref
magic in place, and so I figured I was going to have to change a bunch of code to unwrap ClientEntity.client
and that the enum would help make those more explicit. but, it turns out we only ever match on this in a couple places within this file 🙂 so I agree an option seems just fine here. switched
match target_entity { | ||
Entity::Client(client) => { | ||
client.client = ClientEntityState::Dropped; | ||
drop(entities); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it was dropped to avoid holding it across await point in case it caused a deadlock, since the guard constitutes exclusive access.
could you explain a little more how this might happen? since (I think) we are always executing test operations serially I'm wondering in what case we would be trying to access the entity map while this operation is still executing.
I've added new logic here to go through the entity map and remove all descendant entities from the map. this doesn't get exercised now since the test using close
only has a client entity, but should hopefully correctly future-proof us against any cases where a closed client has descendants in the future.
@@ -48,9 +48,19 @@ pub(crate) enum Entity { | |||
None, | |||
} | |||
|
|||
#[derive(Clone, Debug)] | |||
pub(crate) enum ClientEntityState { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I did it this way because I initially didn't realize we had the Deref
magic in place, and so I figured I was going to have to change a bunch of code to unwrap ClientEntity.client
and that the enum would help make those more explicit. but, it turns out we only ever match on this in a couple places within this file 🙂 so I agree an option seems just fine here. switched
src/event/cmap.rs
Outdated
} | ||
|
||
let event = generate_event(); | ||
if let (Some(user_handler), Some(tracing_emitter)) = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice, updated.
|
||
#[derive(Clone, Debug)] | ||
pub struct EventHandler { | ||
pub events: Arc<RwLock<Vec<Event>>>, | ||
channel_sender: tokio::sync::broadcast::Sender<Event>, | ||
pub(crate) events: Arc<RwLock<Vec<CmapEvent>>>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in this case, I had to make the change because CmapEvent
is defined in the driver and is pub(crate)
there, so I wasn't allowed to leave this property as pub
.
more generally, I think we decided at some point to use pub(crate)
over pub
going forward in tests, but I don't totally remember the justification. one reason pointed out in this Reddit thread seems like a good argument though: pub
items do not get dead code warnings.
/// Once the connection has received an error, it should not be used again or checked back | ||
/// into a pool. | ||
error: bool, | ||
/// Stores a network error encountered while reading or writing. Once the connection has |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the old comment seems a little wrong, because we check connections that have errored back into their pools and then allow the pool to handle closing them.
/// Stores a network error encountered while reading or writing. Once the connection has | ||
/// received an error, it should not be used again and will be closed upon check-in to the | ||
/// pool. | ||
error: Option<Error>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
per conversation with @patrickfreed on the spec PR here we decided to add the errors that cause connections to be closed to "connection closed" tracing events when relevant, and similarly add the errors that cause checkout to fail to "connection checkout failed" tracing events. this required storing the error on the connection, so at the time the pool closes it we can access the error.
@@ -272,7 +275,9 @@ impl Connection { | |||
_ => message.write_to(&mut self.stream).await, | |||
}; | |||
|
|||
self.error = write_result.is_err(); | |||
if let Err(ref err) = write_result { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Result.err()
would be nice here, but it's still unstable API. https://doc.rust-lang.org/std/result/enum.Result.html#method.err
src/event/cmap.rs
Outdated
|
||
/// If the `reason` connection checkout failed was `Error`,the associated | ||
/// error is contained here. This is attached so we can include it in log messages; | ||
/// in future work we may add this to public API on the event itself. TODO: add |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@patrickfreed was going to file a DRIVERS ticket about drivers adding these errors to their connection monitoring events, so once there is a ticket to link to I'll add that here. I figured for now we could just leave it out from the public API in case the eventual spec work leads to some different proposal to expose the error.
ideally, we would probably actually add the error to the ConnectionClosedReason::Error
enum case but that would be a breaking change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ticket here: https://jira.mongodb.org/browse/DRIVERS-2495. Leaving it out until the RUST ticket for that sgtm
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added reference to DRIVERS-2495 here and below
/// ticket link here. | ||
#[cfg(feature = "tracing-unstable")] | ||
#[serde(skip)] | ||
#[derivative(PartialEq = "ignore")] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we rely on this PartialEq
implementation in the CMAP tests. but since Error
does not implement PartialEq
right now and this field is currently just here as an implementation detail for tracing it seemed like we could leave it out for now when comparing events in tests.
that said, looking more closely, I'm not sure the PartialEq
implementation is even needed. When I remove it from this type, the only thing that breaks is that the CmapEvent
enum can't derive PartialEq
. And if I remove the derived impl from that, the only thing that breaks is this line of code:
mongo-rust-driver/src/cmap/test/mod.rs
Line 224 in 6e3d43d
assert_eq!(subscriber.all(filter), Vec::new(), "{}", description); |
and IIUC all that line is doing is checking that the length of subscriber.all(filter)
is 0??? so maybe we should just remove the PartialEq
stuff altogether.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Our Serialize
/ Deserialize
impls are mostly there for test convenience, but I think something like PartialEq
might be functionality that a user could rely on, even if inadvertently like we are via a Derive
. For that reason, I'd be hesitant to remove any of those implementations until 3.0, even if they're largely unnecessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh yeah, that's a good point. I wasn't thinking about this being a public type. I agree it seems best to keep it around then. filed RUST-1538 about considering a removal in 3.0
}; | ||
handler.handle_connection_closed_event(event); | ||
} | ||
error: Some(e.cause.clone()), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just extracted the cause
here since EstablishError
is a different type than Error
. AFAICT the extra info stored on the EstablishError
is what phase of the handshake the error happened during. I wasn't sure if that info was worth including here or more exists so the driver can decide what to do about the error. LMK what you think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, that's just implementation detail stuff for SDAM error handling purposes. Using cause
here SGTM.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, just have one small question
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
This is the Rust PR corresponding to the spec changes in mongodb/specifications#1324:
Summary:
CmapEventEmitter
type that emits both tracing events and the existing eventsclose
operations on client entities in the unified test runner (previously only supported cursors)