This repository was archived by the owner on Mar 4, 2024. It is now read-only.

Respond if term changed when installing snapshot #421

Closed
wants to merge 4 commits into from

Conversation

freeekanayaka
Contributor

This change slightly simplifies the logic for handling completions of raft_io->snapshot_put() requests on followers. It stops tracking the term that was in place when the request started and instead just looks at the current situation: it's not only harmless to send an AppendEntries RPC result to a different leader, it might actually be useful, since a new leader might get to know about the follower's state faster.

I don't expect this PR alone to fix #355, but it's an isolated change that I believe makes sense on its own. If merged, we can wait a bit to see how Jepsen behaves (I'd expect no improvements but also no regressions).

Next step would be to do a similar change for the equivalent logic that we have after completing a raft_io->append() request. Then I'd like also to allow converting to candidate state while installing a snapshot, for consistency with what we do when appending entries and also because it's probably going to simplify logic and avoid unnecessary delays in elections. We can push these changes slowly to observe jepsen's reactions.

@freeekanayaka
Contributor Author

Note that the test that I added in this commit is not strictly necessary, and is not associated with a particular change in this PR. However, it reinforces the assumptions described and checked in this other commit.

@codecov

codecov bot commented May 31, 2023

Codecov Report

Merging #421 (364c0b1) into master (0ea24be) will decrease coverage by 0.01%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master     #421      +/-   ##
==========================================
- Coverage   76.72%   76.72%   -0.01%     
==========================================
  Files          51       51              
  Lines        9686     9681       -5     
  Branches     2476     2474       -2     
==========================================
- Hits         7432     7428       -4     
  Misses       1088     1088              
+ Partials     1166     1165       -1     
Files Changed Coverage Δ
src/replication.c 68.76% <100.00%> (-0.07%) ⬇️

@MathieuBordere
Contributor

There's this feature in the jepsen repo that allows running the tests against a custom raft branch, see canonical/jepsen.dqlite#82

@freeekanayaka
Contributor Author

There's this feature in the jepsen repo that allows running the tests against a custom raft branch, see canonical/jepsen.dqlite#82

Nice.

I've browsed to https://github.com/canonical/jepsen.dqlite/actions/workflows/test-sanitize.yml?query=, but I can't find the Run workflow button. Here's what I see:

Screenshot from 2023-05-31 13-53-45

A single or few runs would be nice, but I think we'll have to wait a day or two to see some statistical data.

@MathieuBordere
Contributor

It's okay to send an AppendEntries result to a leader that was not the same as
the one that originally sent the snapshot, or that has bumped its term.

Conceptually I don't think I can agree with this: sending a response to a server that didn't send a request.

@freeekanayaka
Contributor Author

freeekanayaka commented May 31, 2023

It's okay to send an AppendEntries result to a leader that was not the same as
the one that originally sent the snapshot, or that has bumped its term.

Conceptually I don't think I can agree with this: sending a response to a server that didn't send a request.

I believe we shouldn't regard them as plainly "regular" requests in the traditional sense. These "requests" are actually "messages" or "events", for example they could be duplicated (a receiver can get the same message multiple times, because the network duplicates it) and they are broadly idempotent. The "response" here is really just a "let the current leader know about my state" message. It's basically message-driven state synchronization. Thinking about messages probably solves the conceptual side.

In general, trying to minimize the state we track leads to simpler logic to reason around, since in most cases looking at the situation as it is now is less complicated than looking at the situation as it is now combined with what the situation was at some point in the past. I believe we have a few cases where things can be improved in that regard.

@freeekanayaka
Contributor Author

I've browsed to https://github.com/canonical/jepsen.dqlite/actions/workflows/test-sanitize.yml?query=, but I can't find the Run workflow button. Here's what I see:

I figured this out, I was looking at "Actions -> Dqlite Jepsen tests", but it's actually "Actions -> Callable Dqlite Jepsen tests".

@cole-miller
Contributor

I don't have a strong opinion about sending responses to the "wrong" server, it's okay with me if it simplifies the code. Just need to make sure that #339 is not reintroduced in some form.

@freeekanayaka
Contributor Author

freeekanayaka commented Jun 1, 2023

I don't have a strong opinion about sending responses to the "wrong" server.

I'll add a couple of additional observations here to possibly address the conceptual concern:

  • The semantics of an AppendRequest message are basically: "follower, please append these entries and when done notify your progress to the leader". If the leader changes, the new leader is not the "wrong" server; it is precisely the server that the follower should notify, so it's the "right" server. Note that since this is a message/event-oriented synchronization protocol there's no mandatory coupling or pairing between messages (unlike, say, HTTP).

  • As we know, the message/event-oriented design supports unreliable networks where messages can be delayed, duplicated, or dropped. Imagine leader L1 sending entry E to follower F; then L1 steps down and leader L2 gets elected. L2 also sends the exact same entry E to F. Later, L2 receives a notification from F that it has appended E. That notification could have been triggered by the original message from L1 (with F sending the result to L2 instead of L1, because it finished persisting the entry after learning about L2), or by the new message that L2 sent. If one of the two notification messages was dropped by the network, L2 would have no way to tell which of the two it received, and rightfully so, because there actually is no right or wrong, there is just a follower fulfilling its expected behavior.

Just need to make sure that #339 is not reintroduced in some form.

This PR basically applies the same if (r->state == RAFT_FOLLOWER) check as #342, which is enough for #339, because we now send a message only if we are followers and hence r->follower_state.current_leader is not corrupted. There are also additional assertions now in this commit to signal any future violation that would lead to a regression with respect to #339.

@MathieuBordere
Contributor

It's okay to send an AppendEntries result to a leader that was not the same as
the one that originally sent the snapshot, or that has bumped its term.

Conceptually I don't think I can agree with this: sending a response to a server that didn't send a request.

I believe we shouldn't regard them as plainly "regular" requests in the traditional sense. These "requests" are actually "messages" or "events", for example they could be duplicated (a receiver can get the same message multiple times, because the network duplicates it) and they are broadly idempotent. The "response" here is really just a "let the current leader know about my state" message. It's basically message-driven state synchronization. Thinking about messages probably solves the conceptual side.

In general, trying to minimize the state we track leads to simpler logic to reason around, since in most cases looking at the situation as it is now is less complicated than looking at the situation as it is now combined with what the situation was at some point in the past. I believe we have a few cases where things can be improved in that regard.

I'm just a bit worried that by sending an AppendEntriesRPC to a node, the leader sets some state and that the processing of the AppendEntriesResultRPC assumes that particular state. So when we send a Result to a different leader, that leader might not expect that particular answer and things go wrong. I would need to look into it to ease my mind a bit.

@freeekanayaka
Contributor Author

freeekanayaka commented Jun 1, 2023

I'm just a bit worried that by sending an AppendEntriesRPC to a node, the leader sets some state and that the processing of the AppendEntriesResultRPC assumes that particular state. So when we send a Result to a different leader, that leader might not expect that particular answer and things go wrong. I would need to look into it to ease my mind a bit.

That's precisely the type of state-related complication this PR tries to reduce, and I think we have a few cases like this that we can incrementally simplify.

The fewer assumptions and the less state we have and track, especially about the past, the less complex the resulting logic will be, and the easier it becomes to reason about the associated mechanics.

Note that a node must be prepared to receive any message from any node at any time, because the network is unreliable (a TCP transport is an implementation detail, it could be, say, UDP) and can duplicate, drop or delay messages arbitrarily.

Essentially struct raft is a deterministic state machine that takes in an event (e.g. a message), looks at its current state, and outputs some other messages to be sent, or data to be written to disk.

In this particular case, the leader might indeed set some internal state associated with an AppendEntriesRPC message, but it needs to be prepared to receive any AppendEntriesResultRPC message from anyone at any time. For example, the AppendEntriesResultRPC it receives might be a delayed message: in the meantime the leader might have stepped down, reset the internal state associated with that initial AppendEntriesRPC, and then become leader again with a brand new state, which is equivalent to the situation you describe, if I understand correctly. This is just an example to illustrate that no assumptions can be made that strongly couple messages. Since this is a deterministic event-driven state machine, reasoning on the current state only is generally both simpler and more robust, while trying to prevent this kind of situation from happening at all by using state to introduce constraints is going to be more fragile and complicated to reason around.

@MathieuBordere
Contributor

I'm just a bit worried that by sending an AppendEntriesRPC to a node, the leader sets some state and that the processing of the AppendEntriesResultRPC assumes that particular state. So when we send a Result to a different leader, that leader might not expect that particular answer and things go wrong. I would need to look into it to ease my mind a bit.

That's precisely the type of state-related complication this PR tries to reduce, and I think we have a few cases like this that we can incrementally simplify.

The fewer assumptions and the less state we have and track, especially about the past, the less complex the resulting logic will be, and the easier it becomes to reason about the associated mechanics.

Note that a node must be prepared to receive any message from any node at any time, because the network is unreliable (a TCP transport is an implementation detail, it could be, say, UDP) and can duplicate, drop or delay messages arbitrarily.

Essentially struct raft is a deterministic state machine that takes in an event (e.g. a message), looks at its current state, and outputs some other messages to be sent, or data to be written to disk.

In this particular case, the leader might indeed set some internal state associated with an AppendEntriesRPC message, but it needs to be prepared to receive any AppendEntriesResultRPC message from anyone at any time. For example, the AppendEntriesResultRPC it receives might be a delayed message: in the meantime the leader might have stepped down, reset the internal state associated with that initial AppendEntriesRPC, and then become leader again with a brand new state, which is equivalent to the situation you describe, if I understand correctly. This is just an example to illustrate that no assumptions can be made that strongly couple messages. Since this is a deterministic event-driven state machine, reasoning on the current state only is generally both simpler and more robust, while trying to prevent this kind of situation from happening at all by using state to introduce constraints is going to be more fragile and complicated to reason around.

This is where we get into a bit of trouble with our implementation, I think: original Raft assumes an RPC that can succeed or fail, and we don't really have that luxury.

@freeekanayaka
Contributor Author

This is where we get into a bit of trouble with our implementation, I think: original Raft assumes an RPC that can succeed or fail, and we don't really have that luxury.

I'm not sure I understand exactly what you mean: given that you are sending a message to a remote server, you can't know whether it succeeded or failed (if you don't hear a reply, anything could have happened). Or perhaps you mean something else.

@freeekanayaka
Contributor Author

freeekanayaka commented Jun 1, 2023

This is where we get into a bit of trouble with our implementation, I think: original Raft assumes an RPC that can succeed or fail, and we don't really have that luxury.

I'm not sure I understand exactly what you mean: given that you are sending a message to a remote server, you can't know whether it succeeded or failed (if you don't hear a reply, anything could have happened). Or perhaps you mean something else.

Section 8.1 of the Raft dissertation (Formal specification and proof for basic Raft algorithm) might be an interesting read in this regard, in particular:

The specification models an asynchronous system (it has no notion of time) with the following assumptions:
• Messages may take an arbitrary number of steps (transitions) to arrive at a server. Sending a message enables a transition to occur (the receipt of the message) but with no particular timeliness.
• Servers fail by stopping and may later restart from stable storage on disk.
• The network may reorder, drop, and duplicate messages.
The formal specification is slightly more general than the Raft algorithm presented in Chapter 3. These differences make the formal specification applicable to a wider range of implementations and also make some of its state transitions more orthogonal, which simplifies the proof. One way in which the formal specification differs from the algorithm’s description is that it uses message-passing rather than RPC. This requires a minor change to the AppendEntries response format, but it eliminates the need to pair responses with requests.

(The minor change referred to in the quoted text is adding the receiver's last log entry index, which real-world Raft implementations generally do anyway, as a hint for the leader, because it speeds up the process of finding the match index.)

What I meant above about struct raft being a state machine whose transitions are deterministic and based on messages is basically that struct raft is logically equivalent to the code contained in the formal proof, for the part which covers the core Raft algorithm. Unfortunately the code in the formal proof does not contain all those real-world features that are needed in practice (snapshotting, pipelining, pre-vote, etc), but the conceptual event-driven state-machine model does not change.

Section 8.3 (Building correct implementations) also gives some interesting insights around this topic, for example when talking about ocaml-raft:

Howard describes a nice design for building ocaml-raft correctly [37, 36]. It collects all the Raft state transitions in one module, while all code for determining when transitions should occur is elsewhere. Each transition checks its pre-conditions using assertions and has no system-level code intermixed, so the code resembles the core of the Raft specification. Because all of the code that manipulates the state variables is collected in one place, it is easier to verify that state variables transition in restricted ways. A separate module invokes the transitions at the appropriate times. Moreover, ocaml-raft can simulate an entire cluster in a single process, which allows it to assert Raft’s invariants across virtual servers during execution. For example, it can check that there is at most one leader per term at runtime.

@freeekanayaka freeekanayaka deleted the respond-if-term-changed-when-installing-snapshot branch June 23, 2023 10:38
@freeekanayaka freeekanayaka restored the respond-if-term-changed-when-installing-snapshot branch June 23, 2023 10:42
@freeekanayaka freeekanayaka reopened this Jun 23, 2023
Add a few new assertions to make sure that the follower is actually still
installing the snapshot it has been sent when it receives a new term.

Signed-off-by: Free Ekanayaka <free@ekanayaka.io>
It's okay to send an AppendEntries result to a leader that was not the same as
the one that originally sent the snapshot, or that has bumped its term.

This makes the logic a bit simpler, since we don't need to track what the term
was at the time the InstallSnapshot RPC was first received.

It also can speed up synchronization in case the new leader does not yet know
the progress of the follower.

Signed-off-by: Free Ekanayaka <free@ekanayaka.io>
This test exercises the case where a leader steps down and becomes a follower
while a raft_io->append() request to write new entries to disk is in flight and
completes once the leader has stepped down.

Signed-off-by: Free Ekanayaka <free@ekanayaka.io>
Sanity check that will catch violations to our expectations.

Signed-off-by: Free Ekanayaka <free@ekanayaka.io>
@freeekanayaka freeekanayaka deleted the respond-if-term-changed-when-installing-snapshot branch August 18, 2023 17:32
Successfully merging this pull request may close these issues.

Assertion `r->last_applied <= r->commit_index' failed.