
Race in change processing can cause deadlock. #518

Closed
joel-bluedata opened this issue Oct 15, 2021 · 1 comment
Assignees: joel-bluedata
Labels: Priority: Blocker; Project: Cluster Reconcile beyond simple xlate of model to K8s spec; Type: Bug

Comments

@joel-bluedata (Member):

We have a race condition that can occur when connected resources change; it might be possible with other kinds of changes too.

In syncMembers there's a sequence like this:

  • Do all processing for currently "ready" members, i.e. handle reconfigs by applying new configmeta and/or generating notifications.
  • If ANY members in ready, creating, or createPending states don't yet have the latest configmeta JSON (and they should), bail out.
  • Proceed to handle all non-ready-state members.

Checking whether a member has the latest configmeta is done by comparing a member's LastConfigDataGeneration to its SpecGenerationToProcess. If LastConfigDataGeneration is nil, that's fine: this member has no idea of any configmeta yet, so we don't need to worry about it being stale. If, however, it is non-nil and less than SpecGenerationToProcess, then a new configmeta JSON injection is needed.
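
For reference, the staleness test described here boils down to something like the following sketch. The names and types follow this description rather than the real member status structs, and checkGenOkSketch just stands in for the actual checkGenOk helper:

    // Sketch only; real KubeDirector member status fields may differ in detail.
    type memberGenSketch struct {
        LastConfigDataGeneration *int64 // nil until configmeta is first injected
        SpecGenerationToProcess  int64  // bumped by syncCluster when new configmeta is needed
    }

    // checkGenOkSketch returns false if any member is working from stale configmeta.
    func checkGenOkSketch(members []memberGenSketch) bool {
        for _, m := range members {
            if m.LastConfigDataGeneration == nil {
                // This member has never seen any configmeta, so it can't be stale.
                continue
            }
            if *m.LastConfigDataGeneration < m.SpecGenerationToProcess {
                // Working from older configmeta than we now need; a new
                // configmeta JSON injection is required first.
                return false
            }
        }
        return true
    }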

LastConfigDataGeneration is initially nil for a member, but it is set to the SpecGenerationToProcess value in:

  • appConfig called from handleCreatingMembers (initial config case)
  • handleReadyMembers (reconfig case)

And the value-to-compare-to, SpecGenerationToProcess, is bumped (indicating the need for new configmeta JSON injection) in syncCluster for two reasons:

  • cluster membership is changing
  • connected resources are changing

Therefore, we can get this sequence:

  • A member starts up and gets into creating state, and its first handler pass sets LastConfigDataGeneration to some non-nil value.
  • While the member is chewing on its app configuration scripts (over multiple handler passes), the connected resources change.
  • So, SpecGenerationToProcess gets bumped up for that member.
  • Now we can't make any progress. In syncMembers we will always bail out before reaching handleCreatingMembers, which is what would finish the app config and move the member to a ready or error state.

This is unlikely to happen in "normal" operation but can certainly happen in the backup-restore case: things like create time and UID will change for the connected resources as they are restored, which makes the hash of those resources different from the last-known hash.
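
To make that concrete, here is a rough illustration (not KubeDirector's actual hashing code) of why a restore flips the hash even when the spec itself is unchanged:

    package sketch

    import (
        "crypto/sha256"
        "encoding/hex"
    )

    // resourceHashSketch is illustrative only. The point is that any hash
    // covering UID and creation time will change after a backup/restore even
    // if the spec is byte-for-byte identical, so the last-known hash no longer
    // matches and SpecGenerationToProcess gets bumped.
    func resourceHashSketch(uid, creationTimestamp, spec string) string {
        h := sha256.New()
        h.Write([]byte(uid))
        h.Write([]byte(creationTimestamp))
        h.Write([]byte(spec))
        return hex.EncodeToString(h.Sum(nil))
    }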

It's worth thinking about why this race hasn't arisen previously from cluster membership changes. I don't know yet! Maybe just lucky.

The "bail out if configmeta not updated" behavior came in with PR #272 (handling member container re-creation).

I think the specific vulnerability here happens only when a member stays in creating state for multiple handler passes. The first pass sets its LastConfigDataGeneration to non-nil, which then leaves it able to trigger the bailout case on subsequent passes.


My initial thoughts:

  • There's no need to do the spec generation check for createPending members; they will always have LastConfigDataGeneration = nil.
  • Members in creating state may have LastConfigDataGeneration = nil if they haven't been processed yet. Or, if their LastConfigDataGeneration is non-nil, they're already chewing on some version of the configmeta and we can't interrupt them anyway... we need to keep running their handlers.

So, to be concrete, the code currently looks like this:

    for _, r := range roles {
        if ready, readyOk := r.membersByState[memberReady]; readyOk {
            handleReadyMembers(reqLogger, cr, r, configmetaGenerator)
            if allMembersUpdated {
                allMembersUpdated = checkGenOk(ready)
            }
        }
        if allMembersUpdated {
            if createPending, createPendingOk := r.membersByState[memberCreatePending]; createPendingOk {
                allMembersUpdated = checkGenOk(createPending)
            }
        }
        if allMembersUpdated {
            if creating, creatingOk := r.membersByState[memberCreating]; creatingOk {
                allMembersUpdated = checkGenOk(creating)
            }
        }
    }
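    // If any ready, createPending, or creating member still needs a configmeta
    // update, bail out before handling any non-ready-state members.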
    if !allMembersUpdated {
        return nil
    }
    for _, r := range roles {
        if _, ok := r.membersByState[memberCreatePending]; ok {
            handleCreatePendingMembers(reqLogger, cr, r)
        }
        if _, ok := r.membersByState[memberCreating]; ok {
            handleCreatingMembers(reqLogger, cr, r, roles, configmetaGenerator)
        }
        if _, ok := r.membersByState[memberDeletePending]; ok {
            handleDeletePendingMembers(reqLogger, cr, r, roles)
        }
        if _, ok := r.membersByState[memberDeleting]; ok {
            handleDeletingMembers(reqLogger, cr, r)
        }
    }

Removing the checkGenOk stuff for createPending members is straightforward.

The other change I think we want to make is that we do always want to run handleCreatingMembers for any members whose LastConfigDataGeneration is already non-nil. The rough-and-ready way to do that would be to add a parameter to handleCreatingMembers indicating whether it is for members that need initial config (not ongoing config). Let's say that parameter is "true" for initial config; then this code block would look like:

    for _, r := range roles {
        if ready, readyOk := r.membersByState[memberReady]; readyOk {
            handleReadyMembers(reqLogger, cr, r, configmetaGenerator)
            if allMembersUpdated {
                allMembersUpdated = checkGenOk(ready)
            }
        }
        if allMembersUpdated {
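            // Run handleCreatingMembers here so members already mid-app-config keep
            // making progress; "false" means don't start any new initial configs yet.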
            handleCreatingMembers(reqLogger, cr, r, roles, configmetaGenerator, false)
            if creating, creatingOk := r.membersByState[memberCreating]; creatingOk {
                allMembersUpdated = checkGenOk(creating)
            }
        }
    }
    if !allMembersUpdated {
        return nil
    }
    for _, r := range roles {
        if _, ok := r.membersByState[memberCreatePending]; ok {
            handleCreatePendingMembers(reqLogger, cr, r)
        }
        if _, ok := r.membersByState[memberCreating]; ok {
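            // "true" means it's now OK to start initial config for members that
            // haven't begun app config yet.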
            handleCreatingMembers(reqLogger, cr, r, roles, configmetaGenerator, true)
        }
        if _, ok := r.membersByState[memberDeletePending]; ok {
            handleDeletePendingMembers(reqLogger, cr, r, roles)
        }
        if _, ok := r.membersByState[memberDeleting]; ok {
            handleDeletingMembers(reqLogger, cr, r)
        }
    }

(with appropriate comments added of course).

I'll give it a whirl.

@joel-bluedata joel-bluedata added this to the 0.7.0 milestone Oct 15, 2021
@joel-bluedata joel-bluedata self-assigned this Oct 15, 2021
@joel-bluedata (Member, Author):

Correction: createPending members can have a non-nil LastConfigDataGeneration if they are members with persistent storage that are being "rebooted" (container ID changed). That just makes it all the more necessary that this check not block handleCreatingMembers processing for members that are in the middle of config.

So that final new argument for handleCreatingMembers is not really about whether LastConfigDataGeneration is nil; it's about whether we're yet allowed to start a new initial config.
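
A minimal sketch of how that boolean might be used inside handleCreatingMembers, assuming the handler can tell which creating members haven't started app config yet (crudely represented here by a nil LastConfigDataGeneration). None of these names or helpers are the actual KubeDirector code:

    // Everything below is a sketch; allowInitialConfig, startAppConfig, and
    // continueAppConfig are illustrative names, not KubeDirector identifiers.
    type creatingMemberSketch struct {
        LastConfigDataGeneration *int64
    }

    func startAppConfig(m *creatingMemberSketch)    {} // kick off initial app config
    func continueAppConfig(m *creatingMemberSketch) {} // advance an in-progress config

    // The new boolean gates STARTING initial configs but never blocks members
    // that are already partway through configuration.
    func handleCreatingMembersSketch(members []*creatingMemberSketch, allowInitialConfig bool) {
        for _, m := range members {
            if m.LastConfigDataGeneration == nil {
                // No configmeta injected yet: only begin initial config when
                // the caller says it's OK to start one now.
                if allowInitialConfig {
                    startAppConfig(m)
                }
                continue
            }
            // Already chewing on some version of the configmeta; keep driving
            // its handlers regardless, since we can't interrupt it anyway.
            continueAppConfig(m)
        }
    }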

joel-bluedata added a commit to joel-bluedata/kubedirector that referenced this issue Oct 15, 2021
See issue bluek8s#518.

Stimulated the case and made sure the deadlock didn't happen.