Respond if term changed when installing snapshot #421
Note that the test that I added in this commit is not strictly necessary, and is not associated with a particular change in this PR. However it reinforces the assumptions described and checked in this other commit. |
Codecov Report
@@            Coverage Diff             @@
##           master     #421      +/-   ##
==========================================
- Coverage   76.72%   76.72%   -0.01%
==========================================
  Files          51       51
  Lines        9686     9681       -5
  Branches     2476     2474       -2
==========================================
- Hits         7432     7428       -4
  Misses       1088     1088
+ Partials     1166     1165       -1
|
There's a feature in the jepsen repo that allows running the tests against a custom raft branch, see canonical/jepsen.dqlite#82 |
Nice. I've browsed to https://github.com/canonical/jepsen.dqlite/actions/workflows/test-sanitize.yml?query=, but I can't find the A single or few runs would be nice, but I think we'll have to wait a day or two to see some statistical data. |
Conceptually I don't think I can agree with this, sending a response to a server that didn't send a request. |
I believe we shouldn't regard them as plainly "regular" requests in the traditional sense. These "requests" are actually "messages" or "events": for example, they could be duplicated (a receiver can get the same message multiple times, because the network duplicates it) and they are broadly idempotent. The "response" here is really just a "let the current leader know about my state" message. It's basically message-driven state synchronization. Thinking in terms of messages probably solves the conceptual side. In general, trying to minimize the state we track leads to logic that is simpler to reason about, since in most cases looking at the situation as it is now is less complicated than looking at the situation as it is now combined with what the situation was at some point in the past. I believe we have a few cases where things can be improved in that regard. |
I figured this out, I was looking at "Actions -> Dqlite Jepsen tests", but it's actually "Actions -> Callable Dqlite Jepsen tests". |
I don't have a strong opinion about sending responses to the "wrong" server, it's okay with me if it simplifies the code. Just need to make sure that #339 is not reintroduced in some form. |
I'll add a couple of additional observations here to possibly address the conceptual concern:
This PR basically applies the same |
I'm just a bit worried that by sending an AppendEntriesRPC to a node, the leader sets some state and that the processing of the AppendEntriesResultRPC assumes that particular state. So when we send a Result to a different leader, that leader might not expect that particular answer and things go wrong. I would need to look into it to ease my mind a bit. |
That's precisely the type of state-related complication this PR tries to reduce, and I think we have a few cases like this that we can incrementally simplify. The fewer assumptions and the less state we have and track, especially about the past, the less complex the resulting logic will be, and the easier it becomes to reason about the associated mechanics. Note that a node must be prepared to receive any message from any node at any time, because the network is unreliable (a TCP transport is an implementation detail, it could be, say, UDP) and can duplicate, drop or delay messages arbitrarily. In this particular case, the leader might indeed set some internal state associated with an AppendEntriesRPC message, but it needs to be prepared to receive any AppendEntriesResultRPC message from anyone at any time. For example, the AppendEntriesResultRPC it receives might be a delayed message, and in the meantime the leader might have stepped down, reset the internal state associated with that initial AppendEntriesRPC and then become leader again with brand new state, which is equivalent to the situation you describe, if I understand correctly. This is just an example to illustrate that no assumptions can be made that strongly couple messages. Since this is a deterministic event-driven state machine, reasoning on the current state only is generally both simpler and more robust, while trying to prevent this kind of situation from happening at all by using state to introduce constraints is going to be more fragile and more complicated to reason about. |
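To illustrate the point above, here is a minimal sketch in C of a leader handling an AppendEntries result by looking only at its current state. It is not the raft library's code: the types `struct server` and `struct append_result` and the function `handle_append_result` are hypothetical, a single follower is tracked, and rejections are simply ignored instead of triggering a retry.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified types: not the raft library's structures. */
enum role { FOLLOWER, CANDIDATE, LEADER };

struct append_result {
    uint64_t term;           /* Sender's current term */
    bool success;            /* Whether the sender accepted the entries */
    uint64_t last_log_index; /* Sender's last log index, a hint for the leader */
};

struct server {
    enum role role;
    uint64_t current_term;
    uint64_t match_index; /* Highest index known to be stored on one follower */
};

/* Handle a result using only the current state, with no memory of which
 * request (if any) it answers. Returns true if the message changed anything.
 * Duplicated, delayed or unsolicited messages are either applied
 * idempotently or dropped. */
static bool handle_append_result(struct server *s,
                                 const struct append_result *r)
{
    assert(s != NULL && r != NULL);
    if (r->term > s->current_term) {
        /* The sender knows a newer term: adopt it and step down. */
        s->current_term = r->term;
        s->role = FOLLOWER;
        return true;
    }
    if (s->role != LEADER || r->term < s->current_term) {
        return false; /* Not leader right now, or stale message: drop it */
    }
    if (!r->success) {
        return false; /* Rejections are ignored in this sketch */
    }
    if (r->last_log_index <= s->match_index) {
        return false; /* Duplicate or old information: nothing to do */
    }
    s->match_index = r->last_log_index; /* Progress only moves forward */
    return true;
}
```

Because the match index only moves forward and every decision depends on the current role and term, receiving the same result twice, or receiving one after stepping down and being re-elected, leaves the state consistent without any per-request bookkeeping.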
This is where we get into a bit of trouble with our implementation I think, original raft assumes an |
I'm not sure I understand exactly what you mean: given that you are sending a message to a remote server, you can't know whether it succeeded or failed (if you don't hear a reply, anything could have happened). Or perhaps you mean something else. |
Section 8.1 of the Raft dissertation (Formal specification and proof for basic Raft algorithm) might be an interesting read in this regard, in particular: The specification models an asynchronous system (it has no notion of time) with the following (the minor change referred to in this quoted text is adding the receiver's last log entry index, which real-world Raft implementations generally do anyway, as a hint for the leader, because it speeds up the process of finding the match index). What I meant above about Section 8.3 (Building correct implementations) also gives some interesting insights around this topic, for example when talking about ocaml-raft: Howard describes a nice design for building ocaml-raft correctly [37, 36]. It collects all the Raft |
Add a few new assertions to make sure that the follower is actually still installing the snapshot it has been sent when it receives a new term. Signed-off-by: Free Ekanayaka <free@ekanayaka.io>
It's okay to send an AppendEntries result to a leader that is not the same as the one that originally sent the snapshot, or that has bumped its term. This makes the logic a bit simpler, since we don't need to track what the term was at the time the InstallSnapshot RPC was first received. It can also speed up synchronization in case the new leader does not yet know the progress of the follower. Signed-off-by: Free Ekanayaka <free@ekanayaka.io>
This test exercises the case where a leader steps down and becomes a follower while a raft_io->append() request to write new entries to disk is in flight and completes once the leader has stepped down. Signed-off-by: Free Ekanayaka <free@ekanayaka.io>
Sanity check that will catch violations of our expectations. Signed-off-by: Free Ekanayaka <free@ekanayaka.io>
This change slightly simplifies the logic for handling completions of raft_io->snapshot_put() requests on followers. It stops tracking the term that was in place when the request started, and instead just looks at the current situation, since it's not only harmless to send an AppendEntries RPC result to a different leader, but it might actually be useful, since a new leader might get to know about the follower's state faster.
I don't expect this PR alone to fix #355, but it's an isolated change that I believe makes sense on its own. If merged, we can wait a bit to see how Jepsen behaves (I'd expect no improvements but also no regressions).
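As a rough illustration of the idea described above, the following C sketch shows a completion handler for a snapshot write that builds its reply from the follower's current term only. It is a simplified model under stated assumptions, not the code in this PR: `struct follower`, `struct result_msg` and `snapshot_put_done` are hypothetical names, and the only thing taken from the library is the notion that a raft_io->snapshot_put() request completes with a status.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified follower state: not the raft library's structs. */
struct follower {
    uint64_t current_term;      /* May be bumped while the write is in flight */
    bool snapshot_installing;   /* True while a snapshot write is pending */
    uint64_t last_stored_index; /* Last index known to be on disk */
};

/* Hypothetical reply, to be sent to whoever the current leader is. */
struct result_msg {
    uint64_t term;
    uint64_t last_log_index;
};

/* Completion of a snapshot write. Returns true if `out` should be sent.
 * There is deliberately no "term at request time" parameter: the reply is
 * built from the current state only. */
static bool snapshot_put_done(struct follower *f,
                              uint64_t snapshot_index,
                              int status,
                              struct result_msg *out)
{
    assert(f != NULL && out != NULL);
    assert(f->snapshot_installing); /* Sanity check: a write was pending */
    f->snapshot_installing = false;
    if (status != 0) {
        return false; /* Write failed: nothing new to report */
    }
    f->last_stored_index = snapshot_index;
    out->term = f->current_term; /* Current term, even if it changed */
    out->last_log_index = f->last_stored_index;
    return true;
}
```

If the term is bumped from 3 to 4 while the write is in flight, the reply carries term 4 and the new last index, so a newly elected leader learns the follower's progress from this message instead of having to discover it later.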
The next step would be to make a similar change to the equivalent logic that we have after completing a raft_io->append() request. Then I'd also like to allow converting to candidate state while installing a snapshot, for consistency with what we do when appending entries, and also because it will probably simplify the logic and avoid unnecessary delays in elections. We can push these changes slowly to observe Jepsen's reactions.