
[KEP-2400] swap updates, beta3 graduation and GA criterias #4701

Merged
13 commits merged into kubernetes:master from kep2400/post_beta2 on Feb 14, 2025

Conversation

iholder101
Contributor

@iholder101 commented Jun 6, 2024

  • One-line PR description:
    Add updates, GA criteria, and clarifications
  • Other comments:

This PR updates the KEP in the following ways:

Emphasize that this KEP is about basic swap enablement
The original KEP indicated that pod-level swap APIs are out of scope:

- Allocating swap on a per-workload basis with accounting (e.g. pod-level
specification of swap). If desired, this should be designed and implemented
as part of a follow-up KEP. This KEP is a prerequisite for that work. Hence,
swap will be an overcommitted resource in the context of this KEP.

This KEP will be limited in scope to the first two scenarios. The third can be
addressed in a follow-up KEP. The enablement work that is in scope for this KEP
will be necessary to implement the third scenario.

However, the lack of APIs and the implicit nature of the current implementation sometimes prompts suggestions to extend the API under this KEP.

This KEP focuses on basic swap enablement. Follow-up KEPs on several topics (e.g. customization, zram/zswap support, and more) will be introduced in the near future, in which each extension can be designed and implemented in a focused way.

This PR updates the KEP to emphasize this approach.

Swap-aware evictions
This KEP also details how the eviction manager will be extended.
Implementation PR is available here: kubernetes/kubernetes#129578.

beta3 and GA criteria
The PR adds beta3 and GA criteria, alongside the intent to graduate to beta3 in version 1.33 and to GA in 1.34.

Make sure PRR is ready

Updates
Since the last KEP updates, many improvements were made and many concerns were addressed. For example:

  • Memory-backed volumes
  • Added metrics
  • Kubelet Configuration examples
  • and more

This PR updates the KEP to reflect these updates.

@k8s-ci-robot added the cncf-cla: yes, kind/kep, and sig/node labels on Jun 6, 2024
@k8s-ci-robot added the size/L label on Jun 6, 2024
@iholder101
Contributor Author

@iholder101 force-pushed the kep2400/post_beta2 branch 2 times, most recently from 16c4878 to b3f9708 on June 9, 2024, 09:17
@deads2k
Contributor

deads2k commented Jun 10, 2024

Please update https://github.com/kubernetes/enhancements/blob/master/keps/prod-readiness/sig-node/2400.yaml and update the missing bits of the PRR questionnaire.

@iholder101 force-pushed the kep2400/post_beta2 branch from b3f9708 to 443db7f on June 18, 2024, 10:47
@iholder101
Contributor Author

Thanks @deads2k!

Please update master/keps/prod-readiness/sig-node/2400.yaml

I see you're the assigned approver for alpha/beta.
Is it OK to also assign you as the approver for GA?

and update the missing bits of the PRR questionnaire.

Done! PTAL :)

@iholder101 force-pushed the kep2400/post_beta2 branch 2 times, most recently from f4af444 to 4124649 on June 18, 2024, 10:55
@sftim
Contributor

sftim commented Jun 24, 2024

/retitle [KEP-2400] Node swap ppdates, GA criterias and clarifications

@k8s-ci-robot changed the title from "[KEP-2400] Updates, GA criterias and clarifications" to "[KEP-2400] Node swap ppdates, GA criterias and clarifications" on Jun 24, 2024
@sftim
Contributor

sftim commented Jun 24, 2024

D'oh

/retitle [KEP-2400] Node swap updates, GA criterias and clarifications

@k8s-ci-robot changed the title from "[KEP-2400] Node swap ppdates, GA criterias and clarifications" to "[KEP-2400] Node swap updates, GA criterias and clarifications" on Jun 24, 2024
@sftim left a comment


Thanks for the PR.

Here's a mix of feedback; I hope it is all useful.

@iholder101 force-pushed the kep2400/post_beta2 branch 2 times, most recently from a85dd1c to 8e73c9e on July 10, 2024, 14:04
@k8s-ci-robot added the lgtm label on Feb 11, 2025
how to identify when the node is under pressure and how to rank pods for eviction.

The eviction manager will become swap aware by making the following changes to its memory pressure handling:
- **How to identify pressure**: The eviction manager will consider the total sum of all running pods' accessible swap as additional memory capacity.
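To make the quoted proposal concrete, here is a minimal Go sketch of the idea; the type and names (nodeStats, underMemoryPressure) are invented for illustration and are not the kubelet's actual implementation:

```go
package main

import "fmt"

// nodeStats is a stand-in for the eviction manager's inputs, in bytes.
// The type and field names are invented for this sketch.
type nodeStats struct {
	memoryCapacity      int64 // total node memory
	memoryWorkingSet    int64 // memory currently in use
	totalAccessibleSwap int64 // sum of all running pods' accessible swap
	swapUsed            int64 // swap currently in use
}

// underMemoryPressure folds accessible swap into the capacity side of the
// comparison, so eviction triggers only after swap has had a chance to
// absorb the pressure.
func underMemoryPressure(s nodeStats, evictionThreshold int64) bool {
	available := (s.memoryCapacity + s.totalAccessibleSwap) - (s.memoryWorkingSet + s.swapUsed)
	return available < evictionThreshold
}

func main() {
	s := nodeStats{
		memoryCapacity:      8 << 30, // 8Gi of RAM
		memoryWorkingSet:    7 << 30, // 7Gi in use
		totalAccessibleSwap: 2 << 30, // 2Gi of pod-accessible swap
		swapUsed:            0,
	}
	// With swap counted as capacity, 3Gi is still available, so a 500Mi
	// hard threshold does not trigger eviction until swap fills up too.
	fmt.Println(underMemoryPressure(s, 500<<20)) // false
}
```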
Contributor

What if the memory pressure comes from pods that cannot use swap? If their memory usage continues to grow, we'd still want to evict pods, no? Does the current proposal take that into account?

@jabdoa2 commented Feb 12, 2025

This would occur according to their usage compared to their request (see: https://kubernetes.io/docs/concepts/scheduling-eviction/node-pressure-eviction/#pod-selection-for-kubelet-eviction). In general, eviction technically does not require any pod to even exceed its request. In practice there should be as much memory as can be requested, and pods should not be evicted before they exceed their request (it does happen, though, e.g. if your system reserve is not set properly). With this change, more memory becomes available to request (in the form of swap). Everything else stays the same, and eviction will use the same general order and logic.
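To illustrate that documented ranking, here is a minimal Go sketch (podInfo and rankForEviction are invented names; this is not kubelet's actual code):

```go
package main

import (
	"fmt"
	"sort"
)

// podInfo is a stand-in for what kubelet derives from pod specs and stats;
// the type is invented for this sketch.
type podInfo struct {
	name     string
	usage    int64 // memory usage in bytes (swap usage would count here too)
	request  int64 // memory request in bytes
	priority int32 // pod .spec.priority
}

// rankForEviction sorts pods in the documented node-pressure order:
// pods exceeding their request first, then lower priority first, then by
// how far usage exceeds the request.
func rankForEviction(pods []podInfo) {
	sort.SliceStable(pods, func(i, j int) bool {
		iOver, jOver := pods[i].usage > pods[i].request, pods[j].usage > pods[j].request
		if iOver != jOver {
			return iOver // exceeding the request evicts first
		}
		if pods[i].priority != pods[j].priority {
			return pods[i].priority < pods[j].priority
		}
		return pods[i].usage-pods[i].request > pods[j].usage-pods[j].request
	})
}

func main() {
	pods := []podInfo{
		{"within-request", 1 << 30, 2 << 30, 0},
		{"far-over-request", 3 << 30, 1 << 30, 0},
		{"slightly-over-request", (1 << 30) + 1, 1 << 30, 0},
	}
	rankForEviction(pods)
	for _, p := range pods {
		fmt.Println(p.name) // far-over-request, slightly-over-request, within-request
	}
}
```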

Contributor

That's for deciding the pod eviction order, but my question is more about triggering the eviction in kubelet. Say there's only one pod that can use swap, and the node still has plenty of swap space but is short on memory: would the usage be considered over the threshold and trigger eviction?

@iholder101 (Contributor Author)

Say there's only one pod that can use swap, and the node still has plenty of swap space but is short on memory: would the usage be considered over the threshold and trigger eviction?

In the scenario you're describing, the desired outcome AFAICT is to start using swap before triggering eviction. Since swapping is a heavy operation from the kernel's perspective, the kernel swaps only when memory is low enough and there's no other choice. If evictions were triggered whenever the node is short on memory but not short on swap, the kernel would never have the opportunity to start swapping pages, since the eviction would always kill workloads before the kernel became pressured.

The idea behind the suggested design is that swap (at least the "accessible swap") needs to be used before evictions are triggered in order for it to ever be utilized.

That being said, if there are no pods that have access to swap, and hence the "accessible swap" is zero, the eviction behavior stays the same as before this KEP.

Contributor

The idea behind the suggested design is that swap (at least the "accessible swap") needs to be used before evictions are triggered in order for it to ever be utilized.

IIUC, a pod can be configured to have a large amount of accessible swap but use only very little of it, due to its actually very small memory footprint. This means that the node will have "accessible swap" that no other pods are eligible to use, and the node can still be under memory pressure without triggering eviction.

Did I misunderstand this?

@iholder101 (Contributor Author)

Thanks @haircommander for the clarification.

Firstly, the scenario you've outlined is possible only in situations where pods have a large amount of accessible but unused swap. Those pods can opt out of swap, even if they are of Burstable QoS, by setting memory limits equal to their requests:
https://github.com/kubernetes/kubernetes/blob/ea50baedcd6f8e565bcd91ed78a554bbfac50e1c/pkg/kubelet/kuberuntime/kuberuntime_container_linux.go#L422.
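As a simplified sketch of that opt-out rule (illustrative only; optsOutOfSwap is an invented name, not the actual kuberuntime check):

```go
package main

import "fmt"

// optsOutOfSwap sketches the check behind the linked kuberuntime code:
// a container whose memory limit equals its (non-zero) memory request gets
// no swap, even in a Burstable pod. Name and signature are illustrative.
func optsOutOfSwap(memRequestBytes, memLimitBytes int64) bool {
	return memLimitBytes != 0 && memLimitBytes == memRequestBytes
}

func main() {
	fmt.Println(optsOutOfSwap(1<<30, 1<<30)) // true: limits == requests, no swap
	fmt.Println(optsOutOfSwap(1<<30, 2<<30)) // false: Burstable with headroom, swap allowed
}
```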

More generally, although I understand this hypothetical, I feel we always circle back to the customization front, which we've already agreed to defer to a follow-up. This KEP was designed with minimal APIs, to serve as a first step of swap enablement. With such minimal APIs and no customization ability, it did not, and cannot, aim to fit advanced or esoteric use cases.

As I've written many times before, I think it would most benefit the ecosystem to let users experiment with basic swap enablement (which there's a high demand for) and have a proper and long discussion about how to tackle the controversial topic of changing APIs in a follow-up. The ground is set for that with the concept of kubelet's "swap behaviors".

My opinion is we should document this as a limitation and circle back to it in a follow-up.
@yujuhong WDYT?

@jabdoa2

Just a small add-on: we have been running swap on all our clusters (15 clusters with ca. 10-150 nodes each) with memory eviction disabled. We found no issues whatsoever with memory pressure on our nodes, as long as we set proper memory requests for Kubernetes components (i.e. kube-proxy or kubelet). That is probably generally good advice, but not super trivial to get right, as memory usage depends on the number of nodes and pods. Nevertheless, this feature is overall extremely stable and we did not encounter any issues in production over the last one and a half years.

Contributor

@yujuhong is it vital that we determine the exact scheme in the KEP process? IMO we have a pretty good outline and can discuss implementation details during implementation. Thoughts?

Contributor

@iholder101 I'm definitely not advocating for changing APIs at this point. Allowing more customizability is out of the scope of this KEP. (And expecting users to set resource requests properly to avoid hitting scenarios like this also doesn't seem like the best course to me) :)

The only thing I'm trying to ensure here is that the memory eviction works reasonably given the swap behavior we defined today. The fact that we may not trigger memory eviction at all due to inaccessible swap space seems like a real issue. Could we make sure to consider all those cases?

@jabdoa2 disabling memory eviction is one way to solve this, but I don't have confidence in that working with a diverse set of workloads, and I believe this is not something SIG Node wants to recommend at this point (please correct me if I'm wrong).

Contributor

I think ultimately the worst a cluster admin would have to do given swap is lower the eviction threshold. If they want to prioritize eviction happening, that seems possible and reasonable. We can come up with a scheme showing what to set eviction thresholds to given certain swap/memory ratios, and then have e2e tests covering some of those scenarios.
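No concrete scheme is given in the thread; purely as a hypothetical illustration of what such guidance could look like (suggestedThreshold is an invented name, not from the KEP), one might scale a baseline threshold by the node's swap-to-memory ratio:

```go
package main

import "fmt"

// suggestedThreshold is a purely hypothetical scheme (not from the KEP or
// this thread): scale a baseline hard-eviction threshold by the node's
// swap-to-memory ratio, so eviction still fires early enough when slow
// swap delays reclaim.
func suggestedThreshold(baseThreshold, nodeMemory, nodeSwap int64) int64 {
	ratio := float64(nodeSwap) / float64(nodeMemory)
	return int64(float64(baseThreshold) * (1 + ratio))
}

func main() {
	// A 100Mi base threshold on a node with 16Gi memory and 8Gi swap
	// (ratio 0.5) becomes 150Mi.
	fmt.Println(suggestedThreshold(100<<20, 16<<30, 8<<30)) // 157286400
}
```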

whether the feature is enabled or disabled.
Instead, the behaviour is automatic and implicit, requiring minimal user
intervention (see [proposal below](#steps-to-calculate-swap-limit) for more details).
As mentioned above, in the very near future, follow-up KEPs would bring API extension
Contributor

Per the feedback from the SIG Node meeting today, it's good to have early exploration to make sure whatever we GA is compatible with future changes. If you already have some thoughts on this, maybe you could add a few sentences here. @iholder101

@iholder101 (Contributor Author)

I've reshaped this section, mentioning that we currently have only two on/off swap behaviors with no APIs or customizability (NoSwap and LimitedSwap), and that in the future we'll add customizability by introducing more swap behaviors, which might lead to API changes, perhaps at the pod level.

It was agreed many times before that this KEP would revolve around basic swap enablement. I've intentionally left this vague, as API changes are a controversial, complex and serious topic, and I prefer to have proper discussions about it in a follow-up KEP. As agreed upon in yesterday's SIG Node meeting, I'll start working on a sketch document to initiate these conversations, which will serve as a first step for a follow-up KEP.

PTAL
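For background on the two current behaviors: under LimitedSwap, each container's swap limit is derived proportionally from its memory request. Here is a minimal Go sketch of that calculation as described in the KEP (variable and function names are illustrative, not kubelet's):

```go
package main

import "fmt"

// limitedSwapLimit sketches the LimitedSwap behavior's per-container swap
// limit as described in the KEP: swap is distributed in proportion to the
// container's memory request relative to the node's total memory.
func limitedSwapLimit(containerMemRequest, nodeTotalMemory, nodeTotalSwap int64) int64 {
	ratio := float64(containerMemRequest) / float64(nodeTotalMemory)
	return int64(ratio * float64(nodeTotalSwap))
}

func main() {
	// A container requesting 4Gi on a 16Gi node with 8Gi of swap
	// gets a quarter of the swap: 2Gi.
	fmt.Println(limitedSwapLimit(4<<30, 16<<30, 8<<30)) // 2147483648 (2Gi)
}
```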

@k8s-ci-robot removed the lgtm label on Feb 12, 2025
iholder101 added 13 commits, each signed off by Itamar Holder <iholder@redhat.com>.
@dchen1107
Member

dchen1107 commented Feb 13, 2025

I plan to approve this KEP in principle to unblock progress on the fundamental node swap enablement. We recognize the importance of moving forward with this feature, and this KEP provides a solid base.

However, I have significant concerns regarding the proposed eviction management logic. Some of my concerns were raised by @yujuhong and @haircommander at #4701 (comment), but there are more:

  • I/O issues could slow down swap, which can lead to a cascading failure where multiple services become unresponsive or crash. If swap is extremely slow, the system could reach a point where it runs out of memory before the kernel can effectively swap pages. This would result in an Out-Of-Memory (OOM) kill, which can lead to application crashes and data loss. The current design of the eviction management doesn't consider this at all. I like @jabdoa2's proposal of introducing a soft trigger. But even with a soft memory trigger, if the system is relying on swap to free memory and the swap is slow, the eviction process will be delayed. This will lead to the node becoming unstable. This requires urgent and thorough re-evaluation.
  • I have strong reservations about the currently proposed eviction order, which prioritizes non-swap pods for eviction. This preserves pods' ability to leverage swap, but not all workloads should enable swap, such as latency-sensitive workloads or some high-throughput services.

As we discussed in the SIG Node meeting this week, we can punt certain implementation details to the implementation phase. However, we must reach a clear consensus on these critical decisions before GA in the 1.33 release. Specifically:

  • The eviction trigger logic and thresholds.
  • The eviction order, particularly concerning the handling of "accessible swap" and potential I/O bottlenecks.
  • Clear documentation on I/O monitoring and mitigation.

I want to emphasize that we will reject GA unless we have reached a clear consensus on these points. We need to make sure that the test cases that are implemented cover all of the edge cases.

cc/ @yujuhong @haircommander @mrunalp WDYT?

@haircommander
Contributor

I agree! Let's move forward with the enhancements phase to unblock, and plan on reaching consensus during implementation.

/lgtm

@k8s-ci-robot added the lgtm label on Feb 13, 2025
@yujuhong
Contributor

I want to emphasize that we will reject GA unless we have reached a clear consensus on these points. We need to make sure that the test cases that are implemented cover all of the edge cases.

+1 on getting clear consensus.

@mrunalp
Contributor

mrunalp commented Feb 14, 2025

/approve

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: deads2k, iholder101, mrunalp


@k8s-ci-robot added the approved label on Feb 14, 2025
@k8s-ci-robot merged commit 12f3adf into kubernetes:master on Feb 14, 2025
4 checks passed
@dchen1107
Member

Thanks to everyone working on this.
