Assertion `r->last_applied <= r->commit_index' failed. #355

MathieuBordere · 2023-01-09T12:43:16Z

Observed during Jepsen runs:

https://github.com/canonical/jepsen.dqlite/actions/runs/3873600397/jobs/6603811458

https://github.com/canonical/jepsen.dqlite/actions/runs/3869648631/jobs/6595906665

https://github.com/canonical/jepsen.dqlite/actions/runs/3860783772/jobs/6581330548

Looks like it should have been introduced by recently merged PR's as it's occurring frequently and has only started recently. Based on the first run in which it occurs (the last one in the list aboive), it's possible it's introduced by #351, #352 or #353.

cole-miller · 2023-01-09T15:57:56Z

I suspect that #352 is at fault here, it's the only one of those PRs that obviously affects log indices

cole-miller · 2023-01-10T20:47:52Z

Actually, based on looking at logs it seems more likely to be caused by #351.

cole-miller · 2023-01-11T17:19:59Z

This should be fixed by #362.

cole-miller · 2023-03-17T16:57:25Z

This has been showing up again when the "pause" nemesis is enabled (canonical/jepsen.dqlite#91), so re-opening.

cole-miller · 2023-05-30T21:05:09Z

Finally looking into this again. It seems that this is the problematic update to commit_index:

raft/src/replication.c

Line 1105 in 7cfccd3

r->commit_index = min(args->leader_commit, r->last_stored);

I do not think it should be possible for that logic to decrease commit_index, which is what I see happening in the Jepsen tests. The culprit is that last_stored is less than commit_index -- now trying to trace how that happens...

cole-miller · 2023-05-30T21:24:02Z

Hmm, I'm having trouble finding how/where last_stored gets updated. deleteConflictingEntries is not the culprit, nor snapshotRestore.

cole-miller · 2023-05-30T22:02:28Z

Adding some printf(last_stored) before and after this call to deleteConflictingEntries seems to make the assertion failure go away:

raft/src/replication.c

Line 1082 in 7cfccd3

rv = deleteConflictingEntries(r, args, &i);

That's pretty weird. I wonder whether last_stored is not getting initialized?

freeekanayaka · 2023-05-30T22:03:07Z

I didn't look at this in depth, but I feel that we might want to revisit #351 and #353 anyways, because they seem to introduce some unnecessary complexity.

For #351, I think the only thing we should check is if we have a leader at all set in r->follower_state.current_leader, and send it an AppendEntries RPC result if we have one. It does not matter if it's the same leader that originally sent us the AppendEntries RPC containing the entries that we just finished to persist. It's both easier (less bookkeeping) and useful (if it's a new leader, we might speed up syncing to it) to still send the result.

For #353, I think it's a bit similar, and it also creates an asymmetry between what we do for AppendEntries and what we do for InstallSnapshot. For AppendEntries we don't prevent a follower from converting to a candidate if there some entries are being written to disk. And InstallSnapshot is after all just a different version of AppendEntries (think of an AppendEntries that sends a lot of entries), there's not much reason to differentiate between the two and to prevent state transitions.

I had noticed this and had been thinking about this since a while, and I have also a small change that rectifies that. Perhaps I could make a PR of it? It might well solve this issue if it's true that it's a regression caused by those PRs.

cole-miller · 2023-05-30T22:09:32Z

@freeekanayaka sure, makes sense to me.

freeekanayaka · 2023-06-01T10:31:06Z

Finally looking into this again. It seems that this is the problematic update to commit_index:

raft/src/replication.c

Line 1105 in 7cfccd3

r->commit_index = min(args->leader_commit, r->last_stored);

I do not think it should be possible for that logic to decrease commit_index, which is what I see happening in the Jepsen tests. The culprit is that last_stored is less than commit_index -- now trying to trace how that happens...

After digging a bit I actually found a case where a decrease it's possible, which is the source of this issue. Fixed that in #423.

freeekanayaka · 2023-06-01T10:39:27Z

Observed during Jepsen runs:

https://github.com/canonical/jepsen.dqlite/actions/runs/3873600397/jobs/6603811458

https://github.com/canonical/jepsen.dqlite/actions/runs/3869648631/jobs/6595906665

https://github.com/canonical/jepsen.dqlite/actions/runs/3860783772/jobs/6581330548

Looks like it should have been introduced by recently merged PR's as it's occurring frequently and has only started recently. Based on the first run in which it occurs (the last one in the list aboive), it's possible it's introduced by #351, #352 or #353.

I didn't investigate this too much, but #351 #352 and #353 might actually be a red herring. Since the problem seems to be fixed by #423, I presume that the change that actually made jepsen start failing could be #336, where we add a new barrier at the beginning of the term. I suspect the bug was already there, but that change might have made it more likely to trigger.

Add a failing test that reproduces the situation that triggered the assertion failure described in canonical#355. Signed-off-by: Free Ekanayaka <free@ekanayaka.io>

cole-miller self-assigned this Jan 9, 2023

This was referenced Jan 10, 2023

Revert "replication: Drop append entries result if a state transition has occurred" #362

Merged

Bad memory access (?) in uvClientInit #339

Closed

cole-miller closed this as completed Jan 11, 2023

nurturenature mentioned this issue Mar 17, 2023

Enable process pause nemesis canonical/jepsen.dqlite#91

Merged

cole-miller reopened this Mar 17, 2023

This was referenced May 31, 2023

Respond if term changed when installing snapshot #421

Closed

Fix last_applied failing assertion #423

Merged

MathieuBordere closed this as completed in #423 Jun 2, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Assertion `r->last_applied <= r->commit_index' failed. #355

Assertion `r->last_applied <= r->commit_index' failed. #355

MathieuBordere commented Jan 9, 2023 •

edited

Loading

cole-miller commented Jan 9, 2023

cole-miller commented Jan 10, 2023

cole-miller commented Jan 11, 2023

cole-miller commented Mar 17, 2023

cole-miller commented May 30, 2023

cole-miller commented May 30, 2023

cole-miller commented May 30, 2023

freeekanayaka commented May 30, 2023 •

edited

Loading

cole-miller commented May 30, 2023

freeekanayaka commented Jun 1, 2023

freeekanayaka commented Jun 1, 2023

Assertion `r->last_applied <= r->commit_index' failed. #355

Assertion `r->last_applied <= r->commit_index' failed. #355

Comments

MathieuBordere commented Jan 9, 2023 • edited Loading

cole-miller commented Jan 9, 2023

cole-miller commented Jan 10, 2023

cole-miller commented Jan 11, 2023

cole-miller commented Mar 17, 2023

cole-miller commented May 30, 2023

cole-miller commented May 30, 2023

cole-miller commented May 30, 2023

freeekanayaka commented May 30, 2023 • edited Loading

cole-miller commented May 30, 2023

freeekanayaka commented Jun 1, 2023

freeekanayaka commented Jun 1, 2023

MathieuBordere commented Jan 9, 2023 •

edited

Loading

freeekanayaka commented May 30, 2023 •

edited

Loading