Skip to content
This repository was archived by the owner on Mar 4, 2024. It is now read-only.

Assertion `r->last_applied <= r->commit_index' failed. #355

Closed
MathieuBordere opened this issue Jan 9, 2023 · 11 comments · Fixed by #423
Closed

Assertion `r->last_applied <= r->commit_index' failed. #355

MathieuBordere opened this issue Jan 9, 2023 · 11 comments · Fixed by #423
Assignees

Comments

@MathieuBordere
Copy link
Contributor

MathieuBordere commented Jan 9, 2023

Observed during Jepsen runs:

https://github.com/canonical/jepsen.dqlite/actions/runs/3873600397/jobs/6603811458

https://github.com/canonical/jepsen.dqlite/actions/runs/3869648631/jobs/6595906665

https://github.com/canonical/jepsen.dqlite/actions/runs/3860783772/jobs/6581330548

Looks like it should have been introduced by recently merged PR's as it's occurring frequently and has only started recently. Based on the first run in which it occurs (the last one in the list aboive), it's possible it's introduced by #351, #352 or #353.

@cole-miller
Copy link
Contributor

I suspect that #352 is at fault here, it's the only one of those PRs that obviously affects log indices

@cole-miller cole-miller self-assigned this Jan 9, 2023
@cole-miller
Copy link
Contributor

Actually, based on looking at logs it seems more likely to be caused by #351.

@cole-miller
Copy link
Contributor

This should be fixed by #362.

@cole-miller
Copy link
Contributor

This has been showing up again when the "pause" nemesis is enabled (canonical/jepsen.dqlite#91), so re-opening.

@cole-miller cole-miller reopened this Mar 17, 2023
@cole-miller
Copy link
Contributor

Finally looking into this again. It seems that this is the problematic update to commit_index:

r->commit_index = min(args->leader_commit, r->last_stored);

I do not think it should be possible for that logic to decrease commit_index, which is what I see happening in the Jepsen tests. The culprit is that last_stored is less than commit_index -- now trying to trace how that happens...

@cole-miller
Copy link
Contributor

Hmm, I'm having trouble finding how/where last_stored gets updated. deleteConflictingEntries is not the culprit, nor snapshotRestore.

@cole-miller
Copy link
Contributor

Adding some printf(last_stored) before and after this call to deleteConflictingEntries seems to make the assertion failure go away:

rv = deleteConflictingEntries(r, args, &i);

That's pretty weird. I wonder whether last_stored is not getting initialized?

@freeekanayaka
Copy link
Contributor

freeekanayaka commented May 30, 2023

I didn't look at this in depth, but I feel that we might want to revisit #351 and #353 anyways, because they seem to introduce some unnecessary complexity.

For #351, I think the only thing we should check is if we have a leader at all set in r->follower_state.current_leader, and send it an AppendEntries RPC result if we have one. It does not matter if it's the same leader that originally sent us the AppendEntries RPC containing the entries that we just finished to persist. It's both easier (less bookkeeping) and useful (if it's a new leader, we might speed up syncing to it) to still send the result.

For #353, I think it's a bit similar, and it also creates an asymmetry between what we do for AppendEntries and what we do for InstallSnapshot. For AppendEntries we don't prevent a follower from converting to a candidate if there some entries are being written to disk. And InstallSnapshot is after all just a different version of AppendEntries (think of an AppendEntries that sends a lot of entries), there's not much reason to differentiate between the two and to prevent state transitions.

I had noticed this and had been thinking about this since a while, and I have also a small change that rectifies that. Perhaps I could make a PR of it? It might well solve this issue if it's true that it's a regression caused by those PRs.

@cole-miller
Copy link
Contributor

@freeekanayaka sure, makes sense to me.

@freeekanayaka
Copy link
Contributor

Finally looking into this again. It seems that this is the problematic update to commit_index:

r->commit_index = min(args->leader_commit, r->last_stored);

I do not think it should be possible for that logic to decrease commit_index, which is what I see happening in the Jepsen tests. The culprit is that last_stored is less than commit_index -- now trying to trace how that happens...

After digging a bit I actually found a case where a decrease it's possible, which is the source of this issue. Fixed that in #423.

@freeekanayaka
Copy link
Contributor

Observed during Jepsen runs:

https://github.com/canonical/jepsen.dqlite/actions/runs/3873600397/jobs/6603811458

https://github.com/canonical/jepsen.dqlite/actions/runs/3869648631/jobs/6595906665

https://github.com/canonical/jepsen.dqlite/actions/runs/3860783772/jobs/6581330548

Looks like it should have been introduced by recently merged PR's as it's occurring frequently and has only started recently. Based on the first run in which it occurs (the last one in the list aboive), it's possible it's introduced by #351, #352 or #353.

I didn't investigate this too much, but #351 #352 and #353 might actually be a red herring. Since the problem seems to be fixed by #423, I presume that the change that actually made jepsen start failing could be #336, where we add a new barrier at the beginning of the term. I suspect the bug was already there, but that change might have made it more likely to trigger.

cole-miller pushed a commit to cole-miller/raft that referenced this issue Jul 10, 2023
Add a failing test that reproduces the situation that triggered the assertion
failure described in canonical#355.

Signed-off-by: Free Ekanayaka <free@ekanayaka.io>
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants