kdcluster backup/restore enablement #500

Closed
joel-bluedata opened this issue Jun 21, 2021 · 5 comments

@joel-bluedata
Member

Backup-and-restore for kdclusters currently has three issues.

  • A kdcluster needs to be restored AFTER the relevant kdapp.

  • Depending on how we attack this, the kdcluster probably also needs to be restored after the native K8s resources that make up the kdcluster.

  • We lose the kdcluster status stanza when that kdcluster resource is re-created. There's important stuff in there!

Thoughts about ordering w.r.t. the kdapp:

If we capture the specific distro ID and version of the kdapp in the kdcluster status, then we could do something like:

  • If we detect that this is a restored kdcluster rather than a new one, don't immediately reject a kdcluster that has a missing kdapp.
  • Once we're in a position to do reconciliation, if the kdapp is still missing then just post an Event describing the problem and skip reconciliation.
  • If/when the kdapp finally appears, reconciliation can proceed normally.

(To detect that we are restoring a kdcluster rather than creating a new one, we could, for example, check for the presence of KD's finalizer.)
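
A minimal sketch of that detection, assuming a Go controller working with the usual apimachinery types; the finalizer string here is a placeholder, not necessarily the identifier KD actually uses:

```go
package reconcile

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Hypothetical finalizer ID; the real KD finalizer string may differ.
const kdFinalizer = "kubedirector.hpe.com/cleanup"

// isRestoredCluster reports whether a kdcluster we have never reconciled
// before already carries KD's finalizer, a strong hint that it was
// re-created by a backup/restore tool rather than authored fresh by a user.
func isRestoredCluster(obj metav1.Object) bool {
	for _, f := range obj.GetFinalizers() {
		if f == kdFinalizer {
			return true
		}
	}
	return false
}
```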

It's important to track & honor the distro ID and version, to make sure we really are reconnecting with the same kdapp -- not some other kdapp resource, created in the interim between backup and restore, that happens to have the same name as the old one but different contents.
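
For illustration, a sketch of that check; the field names for the recorded distro ID/version and the kdapp identity are made up, not the actual KubeDirector types:

```go
package reconcile

// clusterAppRecord is what the kdcluster status would remember about the
// kdapp it originally reconciled against (hypothetical field names).
type clusterAppRecord struct {
	DistroID string
	Version  string
}

// appIdentity is the matching slice of a kdapp's spec (also hypothetical).
type appIdentity struct {
	DistroID string
	Version  string
}

// sameApp reports whether the kdapp present after a restore carries the same
// distro ID and version recorded in the kdcluster status. If it doesn't, the
// safe choice is to post an Event and skip reconciliation.
func sameApp(recorded clusterAppRecord, found appIdentity) bool {
	return recorded.DistroID == found.DistroID &&
		recorded.Version == found.Version
}
```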

Thoughts about ordering w.r.t. component K8s resources:

This one is hard to handle programmatically, as from KD's perspective there's no detectable difference between "the restore process hasn't restored my statefulset yet" vs. "my statefulset is missing and must be re-created".

We seem to have only two choices: either require that the kdcluster gets restored last, or have a post-restoration kdcluster NOT immediately start reconciliation until it is explicitly told to do so. (The "explicit tell" could take many forms.)
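
One possible shape for that explicit tell is a hold annotation that an operator (or a post-restore hook in the backup tooling) clears once the component resources are back. The annotation key below is just an example, not an existing KD feature:

```go
package reconcile

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Hypothetical annotation; removing it would release the kdcluster for
// reconciliation after a restore.
const restoreHoldAnnotation = "kubedirector.hpe.com/restore-hold"

// reconcileHeld reports whether the kdcluster is still holding off on
// reconciliation, so KD doesn't mistake "not restored yet" for "missing and
// must be re-created".
func reconcileHeld(obj metav1.Object) bool {
	_, held := obj.GetAnnotations()[restoreHoldAnnotation]
	return held
}
```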

Thoughts about losing the status stanza:

Ideally, perhaps, our important state would all be out there in the native K8s resources. Some thoughts about that (and why we aren't doing it currently) in this gist: https://gist.github.com/joel-bluedata/ce39dd74f960fe773ad20a011eb7086d

Another alternative to using the status stanza would be to use an actual database rather than storing stuff in etcd documents. This isn't currently under consideration, but we're aware that it's a possibility.

The track we're pursuing at the moment involves storing the state for a kdcluster in some other document. There would be some advantages to stashing these documents in their own special namespace, but that complicates some common backup/restore scenarios. So it looks like such a document would live in the same namespace as its kdcluster.

It would be somewhat natural to use a configmap or secret for this purpose. Drawbacks, however:

  • "polluting" the list of configmaps/secrets with a bunch of documents not interesting to the end-user
  • requiring careful RBAC control, or KD admission control for configmap/secret edits, to safeguard them
  • if we need to control when these are restored relative to other native K8s resources, this can be difficult (with the backup solutions we're looking at)

Using a CR for this purpose would solve the above issues, so probably we'll need to add a CR that is configmap-like but used only for storing kdcluster state.
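
A rough sketch of what such a CR's Go types might look like; the kind name ("KubeDirectorClusterState") and its fields are placeholders, not a committed API:

```go
package v1beta1

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// KubeDirectorClusterState (placeholder name) is a configmap-like CR that
// holds saved state for exactly one kdcluster, living in the same namespace
// as that kdcluster.
type KubeDirectorClusterState struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec KubeDirectorClusterStateSpec `json:"spec"`
}

// KubeDirectorClusterStateSpec carries the payload, modeled as an opaque
// string map much like a configmap's data section.
type KubeDirectorClusterStateSpec struct {
	// ClusterName links back to the owning kdcluster.
	ClusterName string `json:"clusterName"`
	// Data holds the serialized status/state documents.
	Data map[string]string `json:"data,omitempty"`
}
```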

The other issue to settle is the relationship between the status stanza and this new resource. A good end state could be to have this new resource be the authoritative state tracker for the kdcluster, with the status stanza only duplicating parts of that info as a convenience for the end-user. However, to tackle this work in stages, and to minimally disrupt existing stuff that looks at the kdcluster status stanza, as a first cut we're thinking of making this new resource just a mirror of the status stanza. The status stanza is still authoritative. The mirror is used to restore the status stanza when it goes missing.
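
To make that first-cut behavior concrete, here is a sketch of the mirror flow, reusing the hypothetical Data map from the CR sketch above; the status field shown and the JSON encoding are assumptions, not the real kdcluster status layout:

```go
package reconcile

import "encoding/json"

// kdClusterStatus stands in for the real kdcluster status stanza; only an
// illustrative field is shown.
type kdClusterStatus struct {
	PendingNotifies []string `json:"pendingNotifies,omitempty"`
}

// syncStatusMirror keeps the mirror document in step with an existing status
// stanza (normal path), or rebuilds a status stanza that was dropped when the
// kdcluster was re-created by a restore (restore path).
func syncStatusMirror(status **kdClusterStatus, mirrorData map[string]string) error {
	if *status != nil {
		// Status exists and remains authoritative: refresh the mirror from it.
		raw, err := json.Marshal(*status)
		if err != nil {
			return err
		}
		mirrorData["status"] = string(raw)
		return nil
	}
	// Status was lost in the restore: re-hydrate it from the mirror before
	// reconciliation continues.
	restored := &kdClusterStatus{}
	if err := json.Unmarshal([]byte(mirrorData["status"]), restored); err != nil {
		return err
	}
	*status = restored
	return nil
}
```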

@joel-bluedata joel-bluedata added this to the 0.7.0 milestone Jun 21, 2021
@joel-bluedata joel-bluedata self-assigned this Aug 13, 2021
@kmathur2
Member

"We lose the kdcluster status stanza, when that kdcluster resource is re-created. There's important stuff in there!"

Can you please elaborate on this line? What is the workflow you are talking about?

@joel-bluedata
Member Author

Backup/restore solutions like Velero don't have direct access to etcd; they just read a resource from the API, store it, and then later use the API to re-create that resource. When you create a resource you can't specify what its status stanza should look like. (And honestly it would often not be a good idea to do so anyway, as the status of the new resource could differ from the old one.)

In a kdcluster, the status stanza contains info about the process of reconciliation, for example whether there are pending "add" notifies that should be sent to a currently down container once it comes back up. So we have to restore or re-create that status.

@kmathur2
Member

Got it. When Velero backs up the kdcluster resource, it can get the status from the backed-up kdcluster resource when re-creating it, right? Why do we need a new CRD?

@kmathur2
Member

Got it. When Velero backs up the kdcluster resource, it can get the status from the backed-up kdcluster resource when re-creating it, right? Why do we need a new CRD?

Ignore this; KD needs a K8s-native resource, and you already explained that a configmap is a bad choice for various reasons.

@joel-bluedata
Member Author

Right. I should also add here that the (reasonable) approach by Velero is that it doesn't provide any hooks for any controllers to directly intervene in how it re-creates various resources -- this is true for native K8s resources as well as for CRs. Velero just re-creates the spec, and then it is up to the relevant controller to figure out how to go from there.
