We have a race condition that can happen when connected resources change. It might be possible with other changes too.
In syncMembers there's a sequence like this:

1. Do all processing for currently "ready" members, i.e. handle reconfigs by applying new configmeta and/or generating notifications.
2. If ANY members in ready, creating, or createPending states don't yet have the latest configmeta JSON (and they should), bail out.
3. Proceed to handle all non-ready-state members.
Checking whether a member has the latest configmeta is done by comparing the member's LastConfigDataGeneration to its SpecGenerationToProcess. If LastConfigDataGeneration is nil, that's fine; this member has no idea of any configmeta yet, so we don't need to worry about it being stale. If, however, it is non-nil and less than SpecGenerationToProcess, then a new configmeta JSON injection is needed.
LastConfigDataGeneration is initially nil for a member, but it is set to the SpecGenerationToProcess value in:

- appConfig, called from handleCreatingMembers (initial config case)
- handleReadyMembers (reconfig case)
And the value to compare against, SpecGenerationToProcess, is bumped (indicating the need for a new configmeta JSON injection) in syncCluster for two reasons:

- cluster membership is changing
- connected resources are changing
Therefore, we can get this sequence:

1. A member starts up and gets into creating state with some value for LastConfigDataGeneration.
2. While the member is chewing on its app configuration scripts (over multiple handler passes), the connected resources change.
3. So, SpecGenerationToProcess gets bumped up for that member.
4. Now we can't make any progress. In syncMembers we will always bail out before reaching handleCreatingMembers, so we never finish the app config and never move the member to a ready or error state.
This is unlikely to happen in "normal" operation but can certainly happen in the backup-restore case since things like create time and UID will change for the connected resources as they are restored, which will make the hash of those resources different from the last-known hash.
It's worth thinking about why this race hasn't arisen previously from cluster membership changes. I don't know yet! Maybe just lucky.
That "bail out if configmeta not updated" check came in PR #272 (handling member container re-creation).
I think the specific vulnerability here happens only when a member stays in creating state for multiple handler passes. The first pass sets its LastConfigDataGeneration to non-nil, which makes it then vulnerable to triggering the bailout case on subsequent passes.
My initial thoughts:

- There's no need to do the spec generation check for createPending members; they will always have LastConfigDataGeneration = nil.
- Members in creating state may have LastConfigDataGeneration = nil if they haven't been processed yet. Or, if their LastConfigDataGeneration is non-nil, they're already chewing on some version of the configmeta and we can't interrupt them anyway... we need to keep running their handlers.
So, to be concrete: the code currently looks like this:
Removing the checkGenOk stuff for createPending members is straightforward.
The other change I think we want to make: we always want to run handleCreatingMembers for any members whose LastConfigDataGeneration is already non-nil. The rough-and-ready way to do that would be to add a parameter to handleCreatingMembers indicating whether it is for members that need initial config (as opposed to ongoing config). Let's say that parameter is "true" for initial config; then this code block would look like:
Correction: createPending members can have a non-nil LastConfigDataGeneration if they are members with persistent storage that are being "rebooted" (container ID changed). That just makes it truly necessary that this check not block handleCreatingMembers processing for members that are in the middle of config.
It seems like that final new argument for handleCreatingMembers, then, is not specifically about whether LastConfigDataGeneration is nil. Instead it's about whether we're yet allowed to start a new initial config.
(With appropriate comments added, of course.)
I'll give it a whirl.