Backup/restore support #508
Conversation
(Note that any incoming status stanza will be dumped because we are using a status subresource.)

Implements the following behaviors in the validator:
- if operation is create and the CR already has our finalizer, set the being-restored label and skip validation; accept
- if operation is update and the CR has the being-restored label, reject any spec change
- if operation is delete and the CR has the being-restored label, reject unless allow-force-delete is set

Because we're doing more early exits from the cluster CR validate function but may still need to apply patches in those cases, I moved the patch-apply code into a defer func so it always runs on function exit. This defer func also handles errors, and changes the allowed result to true if there are no errors.

The intended use for these labels:

The "being restored" label will, in an upcoming commit, be checked by the reconciliation code. Reconciliation will be skipped if this label is set. The user can manually delete this label to force reconciliation to restart; or, a later commit will add a check for whether all necessary resources have been restored so that KD can unset this label automatically.

A user cannot normally delete a kdcluster that is in the "being restored" state. This is because other resources may be restored that depend on the kdcluster as their "owner", for proper cleanup when the kdcluster is deleted. However, if the user is double-sure that they really do want to delete the kdcluster, and they don't want to remove the "being restored" label (because that would restart reconciliation), they can manually set another label to force the delete to be allowed.
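For reference, here's a rough Go sketch of that defer-based pattern -- illustrative only, with placeholder function and variable names rather than KubeDirector's actual validator identifiers:

```go
package validator

import (
	"encoding/json"

	admissionv1 "k8s.io/api/admission/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// patchOperation is a single JSON-patch entry to apply to the incoming CR.
type patchOperation struct {
	Op    string      `json:"op"`
	Path  string      `json:"path"`
	Value interface{} `json:"value,omitempty"`
}

// admitClusterCR shows the defer-on-exit pattern: every early return still gets
// its accumulated patches attached and its allowed/denied result set.
func admitClusterCR(ar *admissionv1.AdmissionReview) *admissionv1.AdmissionResponse {
	var patches []patchOperation
	var validationErr error
	response := &admissionv1.AdmissionResponse{UID: ar.Request.UID}

	defer func() {
		if validationErr != nil {
			response.Allowed = false
			response.Result = &metav1.Status{Message: validationErr.Error()}
			return
		}
		if len(patches) != 0 {
			patchBytes, _ := json.Marshal(patches)
			patchType := admissionv1.PatchTypeJSONPatch
			response.Patch = patchBytes
			response.PatchType = &patchType
		}
		response.Allowed = true
	}()

	// ...validation logic with early returns goes here; the being-restored
	// create/update/delete cases can append to patches (e.g. to set labels)
	// and simply return, without duplicating the patch-apply code...
	return response
}
```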
* populate the restoreProgress object in status
* don't reconcile until being-restored label is cleared

Next few commits will be about automatic determination of when/if we can clear the various "waiting" flags in restoreProgress (and if all get cleared, then auto-unset the being-restored label).
This is the easy one!
auto-delete it when kdcluster is deleted
Various advantages here in some corner cases; it's nice to make sure we get back the same pod and PVC names.
Put it in restoreProgress rather than top-level status. (We won't let the CR out of the restoring state while there is a validation error.)
Currently we bail out on the first not-found resource, rather than always trying to find and log all missing resources.
Triggers an immediate reconcile that will be discarded since it will have stale status. Instead, only do the label-clearing as the sole operation of a handler. Also mark restoreProgress as omitempty to look nicer.
We depend on owner references for lots of cleanup, and backup/restore will lose them. So let's put them back. Also fixed a bug in reconcile that could happen if we failed to update the status backup and had to retry.
I did a fairly late switch from having an explicit "this CR was previously created" annotation to instead relying on our finalizer. While this worked fine in manual "restore" tests, now that I've jumped back over to using Velero it looks like maybe Velero strips the finalizers on restore? Need to investigate that a bit more, but probably I'll need to go back to using an annotation. Edit: back to using an annotation; description above updated.
Backup solution might strip out finalizers. Use an explicit annotation instead. As a bonus, we can use the value of this annotation to memoize whether or not a status backup exists, to let the user know whether waiting is likely to resolve anything.

Note that if you change the global config setting -- so that status backups are created or deleted for all kdclusters when they reconcile -- every kdcluster will update its annotation as well as doing a no-op "status update" to create/delete the backup. This generates a bang-bang rapid sequence of two changes to the kdcluster, so we are guaranteed to generate the harmless-but-messy "stale UID" message in the reconciler when the handlers then run for those two changes (since the first will be stale). I think that's fine; probably not worth the effort to avoid the no-op status update in that situation.
Found a bug with the owner-ref repairing. The statefulset not only has its own owner-ref, but also an owner-ref in the volumeClaimTemplates. This means that after the sset is restored, any newly created PVCs will be automatically deleted, which is pretty puzzling when you see it in action. I'm kinda surprised Velero left that reference in there? Well in any case, even if they had stripped it out, we'd still need to repair/restore it. (Otherwise we'd get the opposite undesirable behavior of the PVC remaining after sset deletion.)
Hmm. We're not allowed to edit that part of the spec in an existing statefulset. There's no other way to indicate that PVCs should be auto-deleted, at least not until kubernetes/enhancements#2440 arrives in some future K8s version. I guess we could make the KD webhook fix up the owner ref on any PVCs created with the kubedirector.hpe.com/kdcluster label. Kind of annoying, but should work... it should even fix PVCs from statefulsets created by previous versions of KD and then restored, which have this problem of an incorrect owner ref. An alternative would be to have the webhook fix the statefulset's volumeClaimTemplates when it gets recreated by the restore. That would really work just as well, in the normal course of things. But I'm leaning toward the PVC-oriented solution for a few reasons:
The downside to the PVC solution is that we have to intervene on every PVC creation rather than just once at statefulset creation. That 3rd point above is pretty convincing for me though.
For PVCs to get cleaned up when the kdcluster is deleted, we have an owner reference on them. Up until now this was specified by the volumeClaimTemplates in the statefulset.

The problem with the volumeClaimTemplates approach is that that part of the statefulset spec is immutable; we can't edit it in an existing statefulset. So if the statefulset is backed up and then restored, the stale owner ref there can't be fixed in reconciliation. This means any PVC for this statefulset will instantly get deleted.

So let's not put the owner ref there. A different way to get owner refs onto the PVCs is to use our mutating webhook. We could choose to have the webhook either set/edit a statefulset's volumeClaimTemplates on statefulset creation, or else set a PVC's owner ref on PVC creation. I've gone with the latter approach, for reasons discussed in PR bluek8s#508 comments.

This changeset includes:
* Move the ownerReferences and ownerReferencesPresent functions from the executor package to the shared package, since our webhook code will need to deal with owner refs now. Also move the definition of the kdcluster label key into the shared package.
* Don't set owner references in the volume claim templates.
* Make the webhook service get deleted and re-created each time KD restarts. Also update the mutating webhook config each time KD restarts. This makes it simpler to upgrade KD if the webhook is changing -- also helps during development. (It's also a step toward better ways to deal with expiring certs in the webhook service, but that's just serendipity.)
* Add a webhook for PVCs. We don't want to prevent the creation of PVCs if KD is down, so this will be a "soft" webhook that is allowed to fail.
* Implement that PVC webhook so that it defines the owner ref for any PVC that has the kdcluster label.
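As a rough illustration of those last two bullets (the helper name and the cluster lookup details are assumptions, not the actual KubeDirector code), the PVC webhook's mutation boils down to something like:

```go
package webhook

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
)

// kdClusterLabelKey is the kdcluster label mentioned in this PR.
const kdClusterLabelKey = "kubedirector.hpe.com/kdcluster"

// setClusterOwnerRef rewrites a labeled PVC's owner references to point at the
// current kdcluster (whose name and UID the webhook has already looked up),
// replacing any stale reference left over from a backup/restore cycle.
func setClusterOwnerRef(
	pvc *corev1.PersistentVolumeClaim,
	clusterName string,
	clusterUID types.UID,
) {
	if _, managed := pvc.Labels[kdClusterLabelKey]; !managed {
		return // not a KubeDirector-managed PVC; leave it alone
	}
	isController := true
	pvc.OwnerReferences = []metav1.OwnerReference{{
		APIVersion: "kubedirector.hpe.com/v1beta1",
		Kind:       "KubeDirectorCluster",
		Name:       clusterName,
		UID:        clusterUID,
		Controller: &isController,
	}}
}
```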
Do we need to run --configure after kdcluster is restored?
That's handled by the normal "container has changed" logic already. When the container changes, if it's using persistent storage we don't re-run. If it's not persistent, we do.
Just a few questions, overall looking great!! Will start testing now
Implement the ability to wait for connections before resuming reconcile, and put a boolean in kd-global-config to choose whether to do this. By default we will now wait. Updated the docs for this, and also fixed an error in the docs where the "awaiting" flags were described as initially-false rather than as initially-true.
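A tiny sketch of how that config choice might gate the awaiting-connections flag (the type, field, and function names here are assumptions for illustration, not the real kd-global-config schema):

```go
package reconcile

// globalConfig stands in for the kd-global-config values KD has already read;
// the field name is an assumption for this sketch.
type globalConfig struct {
	WaitForConnections bool // defaults to true per the description above
}

// shouldClearAwaitingConnections decides whether the awaiting-connections flag
// in restoreProgress can be cleared for a restored kdcluster.
func shouldClearAwaitingConnections(cfg globalConfig, allConnectionsRestored bool) bool {
	if !cfg.WaitForConnections {
		// Operator opted out of waiting; don't block resuming reconcile.
		return true
	}
	return allConnectionsRestored
}
```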
I pushed the code change for wait-for-connections so you can see it in this PR (commit 2d2e48d). Not tested yet; I'll test it this evening once you're done with our test cluster.
Sounds good. Feel free to use the system in the eve. I have to be away in the afternoon. Will resume testing late evening after you are done.
🚢 🚢
OK, away we go! |
closes issue #500
Posting as a draft for starters while I do more testing and decide how to explain/doc it. (Also need to rebase to master.)
The individual commits are fairly self-contained. Might be a good idea to look at them instead of, or supplemental to, looking at the whole diff.
==========
Issue #500 covers the problem we're addressing here, but let me summarize.
KD has three challenges in a backup-and-restore situation:
1. The status stanza of the kdcluster is lost.
2. Owner references on various kdcluster-related components are lost (or made stale). Without these, deleting a kdcluster will not properly clean up everything.
3. KD can't make appropriate reconciliation decisions until all relevant resources are restored. For example, if a statefulset for a role is missing, KD doesn't know if it's actually missing or if the backup solution is about to get around to restoring it.
There are lots of other issues in doing backup-and-restore properly, especially with complex resource arrangements (e.g. statefulsets with pods that mount PVs), but these issues are not KD-specific. Any KD changes here assume that the backup-and-restore solution is handling all the "plain Kubernetes" issues appropriately.
We're specifically testing with Velero, so no assurances for any other solution, but we don't expect there to be big differences in what will be required of KD.
==========
The solution for challenge 1 is to (optionally) "back up" the status stanza for each kdcluster in a separate CR in the namespace. We're using a CR rather than e.g. a configmap because we want to make it easy to control its ACLs, and to not pollute any listings of configmaps. Really only KD should care about this resource.
There's a new KD config flag, backupClusterStatus, that defaults to false.
If it is set true, any write to a kdcluster status stanza will also write the same content to a kdstatusbackup CR of the same name as the kdcluster (creating that CR if necessary).
If it is set false, any existing kdstatusbackup CRs will be deleted.
Note that changing the backupClusterStatus value will only affect a kdstatusbackup CR on the next reconciliation handler for its kdcluster.
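To make the flag's behavior concrete, here is a minimal sketch (using unstructured objects and an assumed `spec.statusBackup` field name, not the actual KubeDirector code) of the create/update/delete logic tied to a kdcluster status write:

```go
package reconcile

import (
	"context"

	"k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// syncStatusBackup mirrors the kdcluster status into a same-named
// kdstatusbackup CR when backupClusterStatus is true, and deletes that CR when
// it is false.
func syncStatusBackup(
	ctx context.Context,
	c client.Client,
	namespace string,
	clusterName string,
	status map[string]interface{},
	backupEnabled bool,
) error {
	backup := &unstructured.Unstructured{}
	backup.SetGroupVersionKind(schema.GroupVersionKind{
		Group:   "kubedirector.hpe.com",
		Version: "v1beta1",
		Kind:    "KubeDirectorStatusBackup",
	})
	backup.SetNamespace(namespace)
	backup.SetName(clusterName) // same name as the kdcluster

	if !backupEnabled {
		// backupClusterStatus is false: remove any existing backup CR.
		if err := c.Delete(ctx, backup); err != nil && !errors.IsNotFound(err) {
			return err
		}
		return nil
	}

	// backupClusterStatus is true: create or update the backup CR so it carries
	// a copy of the kdcluster status stanza ("spec.statusBackup" is an assumed
	// field name for illustration).
	existing := backup.DeepCopy()
	err := c.Get(ctx, client.ObjectKeyFromObject(backup), existing)
	if errors.IsNotFound(err) {
		_ = unstructured.SetNestedMap(backup.Object, status, "spec", "statusBackup")
		return c.Create(ctx, backup)
	}
	if err != nil {
		return err
	}
	_ = unstructured.SetNestedMap(existing.Object, status, "spec", "statusBackup")
	return c.Update(ctx, existing)
}
```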
==========
The solution for challenge 2 is to include owner reference repair/enforcement in the reconciliation of all relevant resources.
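For illustration, a small helper along these lines (the function name is an assumption, not KubeDirector's actual code) can decide during reconcile whether a resource's kdcluster owner reference is missing or stale and needs to be re-applied:

```go
package reconcile

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// ownerRefNeedsRepair reports whether obj lacks an owner reference matching the
// current kdcluster; after a restore, a reference may exist but carry the old,
// now-stale UID.
func ownerRefNeedsRepair(obj metav1.Object, cluster metav1.Object) bool {
	for _, ref := range obj.GetOwnerReferences() {
		if ref.Kind == "KubeDirectorCluster" && ref.Name == cluster.GetName() {
			return ref.UID != cluster.GetUID()
		}
	}
	// No reference to this kdcluster at all; it was lost in backup/restore.
	return true
}
```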
==========
The solution for challenge 3 is somewhat more involved.
A kdcluster depends on 3 kinds of things: its kdapp CR, its kdstatusbackup CR, and any K8s-specific resources that are used to implement the kdcluster. For that last category, we really only care about the resources that KD will want to explicitly create/reconfigure/delete: per-role statefulsets, per-member services, and the cluster (headless) service. Anything created by the statefulset controller is a thing where KD already properly handles cases of things appearing/disappearing asynchronously to its control.
So if a kdcluster is restored, we want to wait for these things to appear before resuming reconciliation. In the case of the kdstatusbackup CR, when it reappears its contents can be copied back into the kdcluster's status stanza.
KD will detect a "restored" kdcluster by an annotation that was previously placed on the CR (if necessary) when the kdcluster status was updated. If this annotation is seen on a kdcluster that is being created, the validator skips validation of the incoming CR and sets the kubedirector.hpe.com/restoring label on the CR. (No label value.)
The presence of this label prevents the normal reconciliation from happening for this kdcluster. Instead, the handler will watch for those necessary resources (and set some status flags to let any observer know what it is still waiting on). If those resources all reappear, KD will remove the label and let reconciliation resume.
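Sketched in Go (with assumed flag and type names; the real restoreProgress fields may differ), the wait-and-clear logic looks roughly like:

```go
package reconcile

// restoringLabelKey is the label named in this description.
const restoringLabelKey = "kubedirector.hpe.com/restoring"

// restoreProgress mirrors the status flags an observer can watch; true means
// we are still waiting for that dependency to be restored. Field names here
// are assumptions for illustration.
type restoreProgress struct {
	AwaitingApp          bool
	AwaitingStatusBackup bool
	AwaitingResources    bool // statefulsets, member services, headless service
}

// maybeFinishRestore clears the restoring label if all awaited dependencies are
// back; otherwise it leaves the label in place so reconciliation stays paused.
func maybeFinishRestore(labels map[string]string, progress restoreProgress) bool {
	if progress.AwaitingApp || progress.AwaitingStatusBackup || progress.AwaitingResources {
		return false
	}
	// Everything has reappeared; removing the label lets normal reconciliation
	// resume on the next handler pass.
	delete(labels, restoringLabelKey)
	return true
}
```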
Various things to note about this process:
While the restoring label is in place, the spec of the kdcluster cannot be modified.
A non-KD client can also manually remove the restoring label, to force reconciliation to resume. However any attempt to remove the label (including by KD) will trigger validation of the kdcluster, and if validation fails the label cannot be removed.
While the restoring label is in place, the kdcluster normally cannot be deleted. If you were allowed to delete a kdcluster while its resources were still being restored, you might end up with orphaned resources floating around in your namespace.
However. If you absolutely do need to delete a kdcluster, and you don't want to (or can't, because of validation failures) remove the restoring label, then you can first add the kubedirector.hpe.com/allow-delete-while-restoring label to the CR (label value does not matter). Once this label is in place, deleting the kdcluster will be allowed even if it has the restoring label.
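A simplified sketch of that delete-time check (the helper name and signature are illustrative, not the actual validator code; the label keys are the ones named above):

```go
package validator

import "fmt"

const (
	restoringLabelKey   = "kubedirector.hpe.com/restoring"
	allowDeleteLabelKey = "kubedirector.hpe.com/allow-delete-while-restoring"
)

// validateClusterDelete rejects deletion of a kdcluster that is still being
// restored, unless the explicit override label has been added (its value is
// ignored; only its presence matters).
func validateClusterDelete(labels map[string]string) error {
	if _, restoring := labels[restoringLabelKey]; !restoring {
		return nil
	}
	if _, allowed := labels[allowDeleteLabelKey]; allowed {
		return nil
	}
	return fmt.Errorf(
		"cannot delete a kdcluster labeled %s; add the %s label to override",
		restoringLabelKey,
		allowDeleteLabelKey,
	)
}
```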