
failure if fractional CPU resource requested for kdcluster #466

Closed
joel-bluedata opened this issue Apr 30, 2021 · 0 comments
Assignees
Labels
Priority: High · Project: Cluster Reconcile beyond simple xlate of model to K8s spec · Type: Bug

Comments

@joel-bluedata
Member

Observed on K8s 1.20.2.

If I try to create a kdcluster with a "0.5" CPU resource, it doesn't reach a final configured state. The process is blocked with these errors in the KD log:

{"level":"error","ts":1619800365.0318282,"logger":"controller_kubedirectorcluster","msg":"trying status update again in 16s; failed","Request.Namespace":"import-test","Request.Name":"centos8","error":"admission webhook "validate-cr.kubedirector.hpe.com" denied the request: \nChange to spec not allowed before previous spec change has been processed.", ... }

If I enable more detailed API server logging for kdcluster requests, I see that the problematic request includes a spec that changes the CPU resource from "0.5" to "500m". The two values are obviously quantitatively equivalent, but the difference triggers the "is the spec being changed" check in the validator, which is just doing an object comparison.

If I just use "500m" in the initial creation, everything works fine.
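
For illustration, here is a minimal standalone Go check of that equivalence (not KubeDirector code), using the k8s.io/apimachinery resource package:

```go
// Minimal sketch: "0.5" and "500m" parse to the same quantity, but
// serializing the parsed value yields the canonical "500m" form, so a
// plain object comparison of old vs. new spec sees a "change" even
// though nothing changed semantically.
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/resource"
)

func main() {
	half := resource.MustParse("0.5")
	fiveHundredMilli := resource.MustParse("500m")

	// Cmp compares numeric values; 0 means quantitatively equal.
	fmt.Println(half.Cmp(fiveHundredMilli)) // 0

	// String() emits the canonical form, not the original input string.
	fmt.Println(half.String()) // 500m
}
```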

A few questions at this point:

  • Is this behavior new to 1.20, or have we just not seen the issue before because people tend to request whole-number CPU resource values?

  • The KD code is writing directly to the CR's status subresource, so why is the API server receiving a PUT to the CR's spec? The rejection of that PUT is what the status write sees, so the spec update must be happening somewhere along that code path. Maybe in the controller-runtime client code?

  • Where is this resource value getting changed from "0.5" to "500m"? We're not doing it explicitly in the KD code; again, it may be something further down that status-update call stack.

We could work around this issue by adding logic to KD, probably in the block of code currently marked by "Shortcut out of here if the spec is not being changed" in admitClusterCR. We could specifically dig into the resource properties, see if they are quantitatively equivalent, and if so then copy the new properties into the old object before doing the object compare.
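
A rough sketch of that workaround (hypothetical helper name, and assuming the role resources are plain corev1.ResourceLists; not the actual KubeDirector change):

```go
// Hypothetical helper for admitClusterCR (illustrative only): before the
// bulk "has the spec changed" comparison, copy any quantity-equivalent
// resource values from the new spec into the old spec, so a pure format
// change such as "0.5" -> "500m" is not treated as a spec change.
package validator

import (
	corev1 "k8s.io/api/core/v1"
)

// normalizeEquivalentResources overwrites oldList entries with the
// corresponding newList value whenever the two quantities compare equal,
// so a later reflect.DeepEqual of the enclosing specs ignores
// formatting-only differences.
func normalizeEquivalentResources(oldList, newList corev1.ResourceList) {
	for name, newQty := range newList {
		oldQty, present := oldList[name]
		if present && oldQty.Cmp(newQty) == 0 {
			oldList[name] = newQty
		}
	}
}
```

Something like this would be run over each role's resource requests and limits just before the existing spec comparison, which is essentially what the commits referenced below describe.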

But it would be nice to understand a bit more about what is happening.

@joel-bluedata joel-bluedata added Priority: High Project: Cluster Reconcile beyond simple xlate of model to K8s spec Type: Bug labels Apr 30, 2021
@joel-bluedata joel-bluedata self-assigned this May 12, 2021
joel-bluedata added a commit to joel-bluedata/kubedirector that referenced this issue May 12, 2021
… check

There are situations where a client is not allowed to change a role spec: if the previous change is still being processed, or if the role has nonzero member count.

As per issue bluek8s#466 though, there are some cases where a resource quantity is getting its format changed, e.g. from "0.5" to "500m", without its quantity actually being changed. Sometimes this is even happening as part of a status write from KD itself. We're not explicitly asking for this format change -- something along the way in client-go or K8s is doing it -- but whatever the reason, we need to allow it to happen.

So before we compare old spec to new spec, we'll copy any quantity-equivalent resource values from new spec into old spec. That way the subsequent bulk compare of the two specs will not see those values as changing.
joel-bluedata added a commit to joel-bluedata/kubedirector that referenced this issue May 12, 2021
joel-bluedata added a commit to joel-bluedata/kubedirector that referenced this issue May 13, 2021
joel-bluedata added a commit to joel-bluedata/kubedirector that referenced this issue May 14, 2021
Motivated by issue bluek8s#466 initially, but really we should be doing this everywhere.

Along with general regression-test creations of existing kdapp types while monitoring the KD logs, did these specific tests:

- creating kdcluster with 0.5 CPU (works now)
- changing kdcluster role spec to something "semantically equivalent", for role with existing members (works now)
- changing kdcluster role spec to something non-equivalent, for role with existing members (blocked, as it should be)
- changing the members count for a scaleout kdcluster role (works, as it should)
- manually editing a kdcluster status stanza (blocked, as it should be)
- manually editing a kdconfig status stanza (blocked, as it should be)
joel-bluedata added a commit to joel-bluedata/kubedirector that referenced this issue May 14, 2021
aleshchynskyi pushed a commit to aleshchynskyi/kubedirector that referenced this issue Jun 10, 2021