[KEP-2400] swap updates, beta3 graduation and GA criterias #4701
Conversation
Please update https://github.com/kubernetes/enhancements/blob/master/keps/prod-readiness/sig-node/2400.yaml and fill in the missing bits of the PRR questionnaire.
Thanks @deads2k!
I see you're the assigned approver for alpha/beta.
Done! PTAL :)
/retitle [KEP-2400] Node swap ppdates, GA criterias and clarifications
D'oh /retitle [KEP-2400] Node swap updates, GA criterias and clarifications
Thanks for the PR.
Here's a mix of feedback; I hope it is all useful.
> how to identify when the node is under pressure and how to rank pods for eviction.
>
> The eviction manager will become swap aware by making the following changes to its memory pressure handling:
> - **How to identify pressure**: The eviction manager will consider the total sum of all running pods' accessible swap as additional memory capacity.
What if the memory pressure comes from pods that cannot use swap? If their memory usage continues to grow, we'd still want to evict pods, no? Does the current proposal take that into account?
This would occur according to their usage compared to their request (see: https://kubernetes.io/docs/concepts/scheduling-eviction/node-pressure-eviction/#pod-selection-for-kubelet-eviction). In general, eviction technically does not require any pod to use more than its request. In practice there should be as much memory available as can be requested, and pods should not be evicted before they exceed their request (it does happen, though, e.g. if your system reserve is not set properly). With this change, more memory becomes available to request (in the form of swap). Everything else stays the same, and eviction will use the same general order and logic.
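To make that ordering concrete, here is a minimal Go sketch of the "usage compared to request" ranking; names and numbers are hypothetical, and the real eviction manager also factors in pod priority (see the linked docs):

```go
package main

import (
	"fmt"
	"sort"
)

// podStats holds only what this sketch needs; the real eviction manager
// also factors in pod priority, per the linked docs.
type podStats struct {
	name    string
	usage   int64 // bytes currently in use
	request int64 // bytes requested
}

// rankForEviction sorts pods so the largest usage-over-request comes
// first, mirroring the "usage compared to request" ordering above.
func rankForEviction(pods []podStats) {
	sort.Slice(pods, func(i, j int) bool {
		return pods[i].usage-pods[i].request > pods[j].usage-pods[j].request
	})
}

func main() {
	pods := []podStats{
		{"a", 900, 500}, // 400 over request
		{"b", 400, 500}, // under request
		{"c", 800, 100}, // 700 over request
	}
	rankForEviction(pods)
	for _, p := range pods {
		fmt.Printf("%s: %+d\n", p.name, p.usage-p.request)
	}
	// Prints c, a, b: pods exceeding their request rank first for eviction.
}
```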
That's for deciding the pod eviction order, but my question is more about triggering the eviction in kubelet. Say, there's only one pod that can use swap, and the node still has plenty of swap space but is short on memory, would the usage be considered over the threshold and trigger eviction?
> Say, there's only one pod that can use swap, and the node still has plenty of swap space but is short on memory, would the usage be considered over the threshold and trigger eviction?
In the scenario you're describing, the desired outcome AFAICT is to start using swap before triggering eviction. Since swapping is a heavy operation from the kernel's perspective, the kernel does so only when memory is low enough and there's no other choice. If evictions were triggered whenever the node is short on memory, even while swap is plentiful, the kernel would never have the opportunity to start swapping pages, since eviction would always kill workloads before the kernel became pressured.
The idea behind the suggested design is that swap (at least the "accessible swap") needs to be used before evictions are triggered in order for it to become usable.
That being said, if there are no pods that have access to swap, and hence the "accessible swap" is zero, the eviction behavior stays the same as it was before this KEP.
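A minimal Go sketch of the accounting being proposed, with hypothetical names (the actual logic lives in the kubelet's eviction manager):

```go
package main

import "fmt"

// pod models just the field this sketch needs.
type pod struct {
	name           string
	accessibleSwap uint64 // bytes of swap the pod may use; 0 if it opted out
}

// underMemoryPressure counts the running pods' accessible swap as extra
// capacity, so eviction triggers only once that headroom is exhausted.
// If no pod has accessible swap, this reduces to the pre-KEP behavior.
func underMemoryPressure(memUsage, memCapacity, evictionThreshold uint64, pods []pod) bool {
	effectiveCapacity := memCapacity
	for _, p := range pods {
		effectiveCapacity += p.accessibleSwap
	}
	return memUsage > effectiveCapacity-evictionThreshold
}

func main() {
	pods := []pod{{"a", 1 << 30}, {"b", 0}} // 1 GiB accessible swap in total
	// 7.5 GiB used of 8 GiB RAM with a 1 GiB threshold: no eviction yet,
	// because the 1 GiB of accessible swap is counted as capacity.
	fmt.Println(underMemoryPressure(15<<29, 8<<30, 1<<30, pods)) // false
	// Without any accessible swap the same usage would trigger eviction.
	fmt.Println(underMemoryPressure(15<<29, 8<<30, 1<<30, nil)) // true
}
```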
> The idea behind the suggested design is that swap (at least the "accessible swap") needs to be used before evictions are triggered in order for it to become usable.
IIUC, a pod can be configured to have a large amount of accessible swap but use only very little of it due to its actually very small memory footprint. This means that the node will have "accessible swap" that no other pods are eligible to use. And the node can still be under memory pressure without triggering eviction.
Did I misunderstand this?
Thanks @haircommander for the clarification.
Firstly, the scenario you've outlined is possible only in situations where pods have a large amount of accessible but unused swap. Those pods can opt out of swap, even if they are of Burstable QoS, by setting memory limits that equal requests:
https://github.com/kubernetes/kubernetes/blob/ea50baedcd6f8e565bcd91ed78a554bbfac50e1c/pkg/kubelet/kuberuntime/kuberuntime_container_linux.go#L422.
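Roughly, the opt-out being referenced looks like this; a simplified Go sketch of the LimitedSwap calculation the linked code implements, not the verbatim kubelet code:

```go
package main

import "fmt"

// limitedSwapLimit sketches the per-container LimitedSwap calculation:
// swap is granted proportionally to the container's share of node
// memory, and a memory limit equal to the request yields no swap.
func limitedSwapLimit(memRequest, memLimit, nodeMemory, nodeSwap int64) int64 {
	if memLimit != 0 && memLimit == memRequest {
		return 0 // limits == requests: the container opts out of swap
	}
	return int64(float64(memRequest) / float64(nodeMemory) * float64(nodeSwap))
}

func main() {
	// A 512 MiB request on an 8 GiB node with 2 GiB of swap:
	fmt.Println(limitedSwapLimit(512<<20, 0, 8<<30, 2<<30))       // 134217728 (128 MiB)
	fmt.Println(limitedSwapLimit(512<<20, 512<<20, 8<<30, 2<<30)) // 0: opted out
}
```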
More generally, although I understand this hypothetical, I feel we always circle back to the customization front, which we've already agreed to defer to a follow-up. This KEP was designed to have minimal APIs and to serve as a first step of swap enablement. With such minimal APIs and no customization ability, it did not, and cannot, aim to fit advanced or esoteric use cases.
As I've written many times before, I think it would most benefit the ecosystem to let users experiment with basic swap enablement (which there's a high demand for) and have a proper and long discussion about how to tackle the controversial topic of changing APIs in a follow-up. The ground is set for that with the concept of kubelet's "swap behaviors".
My opinion is we should document this as a limitation and circle back to it in a follow-up.
@yujuhong WDYT?
Just a small addon: we have been running swap on all our clusters (15 clusters with ca. 10–150 nodes each) with memory eviction disabled. We found no issues whatsoever with memory pressure on our nodes as long as we set proper memory requests for Kubernetes components (i.e. kube-proxy or kubelet). That is probably generally good advice, but not super trivial to get right, as memory usage depends on the number of nodes and pods. Nevertheless, this feature is overall extremely stable and we did not encounter any issues in production over the last one and a half years.
@yujuhong is it vital we determine the exact scheme in the KEP process? IMO we have a pretty good outline, and we can discuss implementation details during implementation. Thoughts?
@iholder101 I'm definitely not advocating for changing APIs at this point. Allowing more customizability is out of the scope of this KEP. (And expecting users to set resource requests properly to avoid hitting scenarios like this also doesn't seem like the best course to me) :)
The only thing I'm trying to ensure here is that the memory eviction works reasonably given that swap behavior we defined today. The fact that we may not trigger memory eviction at all due to inaccessible swap space seems like a real issue. Could we make sure to consider all those cases?
@jabdoa2 disabling memory eviction is one way to solve this, but I don't have confidence in that working with a diverse set of workloads, and I believe this is not something SIG-Node wants to recommend at this point (please correct me if I'm wrong).
I think ultimately the worst a cluster admin would have to do given swap is lower the eviction threshold. If they want to prioritize eviction happening, that seems possible and reasonable. We can come up with a scheme showing what to set eviction thresholds to given certain swap/memory ratios, and then have e2e tests covering some of those scenarios?
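One way such a scheme could be expressed, as a Go sketch; the scaling formula here is purely illustrative and not an agreed-upon design:

```go
package main

import "fmt"

// swapAwareThreshold sketches one possible scheme: scale the base
// memory.available threshold by the node's swap/memory ratio so that
// eviction still fires early enough on swap-enabled nodes.
func swapAwareThreshold(baseThreshold, nodeMemory, nodeSwap int64) int64 {
	return baseThreshold + baseThreshold*nodeSwap/nodeMemory
}

func main() {
	// A 500 MiB base threshold on a node with 8 GiB RAM and 4 GiB swap
	// becomes 750 MiB; an e2e test could assert eviction fires there.
	fmt.Println(swapAwareThreshold(500<<20, 8<<30, 4<<30)>>20, "MiB")
}
```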
> whether the feature is enabled or disabled.
> Instead, the behaviour is automatic and implicit that requires minimum user intervention (see [proposal below](#steps-to-calculate-swap-limit) for more details).
> As mentioned above, in the very near future, follow-up KEPs would bring API extension
Per the feedback from the SIG-Node meeting today, it's good to have early exploration to make sure whatever we GA is compatible with the future changes. If you already have some thoughts on this, maybe you could add a few sentences here. @iholder101
I've reshaped this section, mentioning that currently we have only two on/off swap behaviors with no APIs or customizability (`NoSwap` and `LimitedSwap`), and that in the future we'll add customizability through more swap behaviors, which might lead to API changes, perhaps at the pod level.
It was agreed many times before that this KEP would revolve around basic swap enablement. I've intentionally left this vague, as API changes are a controversial, complex and serious topic, and I prefer to have proper discussions about them in a follow-up KEP. As agreed upon in yesterday's SIG-Node meeting, I'll start working on a sketch document to initiate these conversations, which will serve as a first step for a follow-up KEP.
PTAL
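For reference, the two behaviors mentioned above are selected through the kubelet configuration; a minimal sketch using the v1beta1 config types (module dependency on k8s.io/kubelet assumed):

```go
package main

import (
	"fmt"

	kubeletv1beta1 "k8s.io/kubelet/config/v1beta1"
)

func main() {
	// NoSwap (the default) keeps workloads off swap entirely;
	// LimitedSwap grants Burstable pods a proportional share of swap.
	cfg := kubeletv1beta1.KubeletConfiguration{
		MemorySwap: kubeletv1beta1.MemorySwapConfiguration{
			SwapBehavior: "LimitedSwap",
		},
	}
	fmt.Println("swapBehavior:", cfg.MemorySwap.SwapBehavior)
}
```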
I plan to approve this KEP in principle to unblock progress on the fundamental node swap enablement. We recognize the importance of moving forward with this feature, and this KEP provides a solid base. However, I have significant concerns regarding the proposed eviction management logic. Some of my concerns were raised by @yujuhong and @haircommander at #4701 (comment), but there are more:
As we discussed in the SIG Node meeting this week, we can punt certain implementation details to the implementation phase. However, we must reach a clear consensus on these critical decisions before GA in 1.33 release. Specifically:
I want to emphasize that we will reject GA unless we have reached a clear consensus on these points. We need to make sure that the implemented test cases cover all of the edge cases. cc @yujuhong @haircommander @mrunalp WDYT?
I agree! Let's move forward with the enhancements phase to unblock, and plan on reaching consensus during implementation. /lgtm
+1 on getting clear consensus.
/approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: deads2k, iholder101, mrunalp.
Thanks to everyone working on this.
Add updates, GA criteria and clarifications
This PR updates the KEP in the following ways:
Emphasize that this KEP is about basic swap enablement
The original KEP indicated that pod-level swap APIs are out of scope: see enhancements/keps/sig-node/2400-node-swap/README.md, lines 163 to 166 and lines 142 to 144 (at 155a949).
However, the lack of APIs and the implicit nature of the current implementation sometimes bring suggestions to extend the API under this KEP.
This KEP focuses on basic swap enablement. Follow-up KEPs on several topics (e.g. customization, zram/zswap support, and more) will be introduced in the near future, in which we will be able to design and implement each extension in a focused way.
This PR updates the KEP to emphasize this approach.
Swap-aware evictions
This KEP also details how the eviction manager will be extended.
Implementation PR is available here: kubernetes/kubernetes#129578.
beta3 and GA criteria
The PR adds beta3 and GA criteria, alongside the intent to graduate to beta3 in version 1.33 and to GA in 1.34.
Make sure PRR is ready
Updates
Since the last KEP updates, many improvements were made and many concerns were addressed. For example:
This PR updates the KEP to reflect these updates.