Assertion `r->last_applied <= r->commit_index' failed. #355
Comments
I suspect that #352 is at fault here; it's the only one of those PRs that obviously affects log indices.
Actually, based on looking at logs it seems more likely to be caused by #351.
This should be fixed by #362.
This has been showing up again when the "pause" nemesis is enabled (canonical/jepsen.dqlite#91), so re-opening.
Finally looking into this again. It seems that this is the problematic update to commit_index: Line 1105 in 7cfccd3
I do not think it should be possible for that logic to decrease commit_index, which is what I see happening in the Jepsen tests. The culprit is that last_stored is less than commit_index -- now trying to trace how that happens...
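To make the suspected failure mode concrete, here is a minimal sketch of how a follower-side commit_index update of the form min(leader_commit, last_stored) can move commit_index backwards when last_stored is behind it. This is not the code at line 1105, nor the eventual fix: the struct, the helper names and the guarded variant are all hypothetical, and only the field names commit_index, last_applied and last_stored come from this thread.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t raft_index;

/* Hypothetical, simplified follower state. Only the field names are taken
 * from the discussion; the real struct in the library is much larger. */
struct follower
{
    raft_index commit_index; /* Highest index known to be committed */
    raft_index last_applied; /* Highest index applied to the FSM */
    raft_index last_stored;  /* Highest index persisted to disk */
};

static raft_index min_index(raft_index a, raft_index b)
{
    return a < b ? a : b;
}

/* Assumed shape of the problematic update: it only checks that the leader's
 * commit index is ahead, then clamps to last_stored. If last_stored is
 * behind the current commit_index, commit_index goes backwards. */
static void update_commit_unguarded(struct follower *f, raft_index leader_commit)
{
    if (leader_commit > f->commit_index) {
        f->commit_index = min_index(leader_commit, f->last_stored);
    }
}

/* One possible guard (an assumption, not necessarily what the fix does):
 * only accept the clamped value if it advances commit_index. */
static void update_commit_guarded(struct follower *f, raft_index leader_commit)
{
    raft_index candidate = min_index(leader_commit, f->last_stored);
    if (candidate > f->commit_index) {
        f->commit_index = candidate;
    }
}

/* The invariant checked by the failing assertion in the issue title. */
static int invariant_holds(const struct follower *f)
{
    return f->last_applied <= f->commit_index;
}
```

With commit_index = 5, last_applied = 5 and last_stored = 3, a leader commit of 7 drives the unguarded version to commit_index = 3 and breaks the invariant, while the guarded version leaves commit_index at 5.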
Hmm, I'm having trouble finding how/where last_stored gets updated. deleteConflictingEntries is not the culprit, nor snapshotRestore.
Adding some printf(last_stored) before and after this call to deleteConflictingEntries seems to make the assertion failure go away: Line 1082 in 7cfccd3
That's pretty weird. I wonder whether last_stored is not getting initialized?
I didn't look at this in depth, but I feel that we might want to revisit #351 and #353 anyway, because they seem to introduce some unnecessary complexity. For #351, I think the only thing we should check is if we have a leader at all set in [...]. For #353, I think it's a bit similar, and it also creates an asymmetry between what we do for AppendEntries and what we do for InstallSnapshot. For AppendEntries we don't prevent a follower from converting to a candidate if some entries are being written to disk. And InstallSnapshot is after all just a different version of AppendEntries (think of an AppendEntries that sends a lot of entries), so there's not much reason to differentiate between the two and to prevent state transitions. I had noticed this and have been thinking about it for a while, and I also have a small change that rectifies it. Perhaps I could make a PR of it? It might well solve this issue if it's true that it's a regression caused by those PRs.
@freeekanayaka sure, makes sense to me.
After digging a bit I actually found a case where a decrease is possible, which is the source of this issue. Fixed that in #423.
I didn't investigate this too much, but #351, #352 and #353 might actually be a red herring. Since the problem seems to be fixed by #423, I presume that the change that actually made Jepsen start failing could be #336, where we add a new barrier at the beginning of the term. I suspect the bug was already there, but that change might have made it more likely to trigger.
Add a failing test that reproduces the situation that triggered the assertion failure described in canonical#355. Signed-off-by: Free Ekanayaka <free@ekanayaka.io>
Observed during Jepsen runs:
https://github.com/canonical/jepsen.dqlite/actions/runs/3873600397/jobs/6603811458
https://github.com/canonical/jepsen.dqlite/actions/runs/3869648631/jobs/6595906665
https://github.com/canonical/jepsen.dqlite/actions/runs/3860783772/jobs/6581330548
Looks like it was introduced by recently merged PRs, as it's occurring frequently and only started recently. Based on the first run in which it occurs (the last one in the list above), it's possible it was introduced by #351, #352 or #353.