KEP-2400: Node system swap support

Release Signoff Checklist
Summary
Motivation
Proposal
Design Details
Production Readiness Review Questionnaire
Implementation History
Drawbacks
Alternatives
- Just set --fail-swap-on=false
- Restrict swap usage at the cgroup level
Infrastructure Needed (Optional)

Release Signoff Checklist

Items marked with (R) are required prior to targeting to a milestone / release.

(R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
(R) KEP approvers have approved the KEP status as implementable
(R) Design details are appropriately documented
(R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
(R) Graduation criteria is in place
(R) Production readiness review completed
(R) Production readiness review approved
"Implementation History" section is up-to-date for milestone
User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

Summary

Kubernetes currently does not support the use of swap memory on Linux, as it is difficult to provide guarantees and account for pod memory utilization when swap is involved. As part of Kubernetes’ earlier design, swap support was considered out of scope.

However, there are a number of use cases that would benefit from Kubernetes nodes supporting swap. Hence, this proposal aims to add swap support to nodes in a controlled, predictable manner so that Kubernetes users can perform testing and provide data to continue building cluster capabilities on top of swap.

Motivation

There are two distinct types of user for swap, who may overlap:

node administrators, who may want swap available for node-level performance tuning and stability/reducing noisy neighbour issues
application developers, who have written applications that would benefit from using swap memory

There are hence a number of possible ways that one could envision swap use on a node.

Scenarios

Swap is enabled on a node's host system, but the kubelet does not permit Kubernetes workloads to use swap. (This scenario is a prerequisite for the following use cases.)
Swap is enabled at the node level. The kubelet can permit Kubernetes workloads scheduled on the node to use some quantity of swap, depending on the configuration.
Swap is set on a per-workload basis. The kubelet sets swap limits for each individual workload.

This KEP will be limited in scope to the first two scenarios. The third can be addressed in a follow-up KEP. The enablement work that is in scope for this KEP will be necessary to implement the third scenario.

Goals

On Linux systems, when swap is provisioned and available, Kubelet can start up with swap on.
Configuration is available for kubelet to set swap utilization available to Kubernetes workloads, defaulting to 0 swap.
Cluster administrators can enable and configure kubelet swap utilization on a per-node basis.
Use of swap memory with both cgroupsv1 and cgroupsv2 is supported.

Non-Goals

Addressing non-Linux operating systems. Swap support will only be available for Linux.
Provisioning swap. Swap must already be available on the system.
Setting swappiness. This can already be set on a system-wide level outside of Kubernetes.
Allocating swap on a per-workload basis with accounting (e.g. pod-level specification of swap). If desired, this should be designed and implemented as part of a follow-up KEP. This KEP is a prerequisite for that work. Hence, swap will be an overcommitted resource in the context of this KEP.
Supporting zram, zswap, or other memory types like SGX EPC. These could be addressed in a follow-up KEP, and are out of scope.

Proposal

We propose that, when swap is provisioned and available on a node, cluster administrators can configure the kubelet such that:

It can start with swap on.
It will direct the CRI to allocate Kubernetes workloads 0 swap by default.
It will have configuration options to configure swap utilization for the entire node.

This proposal enables scenarios 1 and 2 above, but not 3.

User Stories

Improved Node Stability

cgroupsv2 improved memory management algorithms, such as oomd, strongly recommend the use of swap. Hence, having a small amount of swap available on nodes could improve better resource pressure handling and recovery.

https://man7.org/linux/man-pages/man8/systemd-oomd.service.8.html
https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html#id1
https://chrisdown.name/2018/01/02/in-defence-of-swap.html
https://media.ccc.de/v/ASG2018-175-oomd
https://github.com/facebookincubator/oomd/blob/master/docs/production_setup.md#swap

This user story is addressed by scenario 1 and 2, and could benefit from 3.

Long-running applications that swap out startup memory

Applications such as the Java and Node runtimes rely on swap for optimal performance kubernetes/kubernetes#53533 (comment)
Initialization logic of applications can be safely swapped out without affecting long-running application resource usage kubernetes/kubernetes#53533 (comment)

This user story is addressed by scenario 2, and could benefit from 3.

Memory Flexibility

This user story addresses cases in which cost of additional memory is prohibitive, or elastic scaling is impossible (e.g. on-premise/bare metal deployments).

Occasional cron job with high memory usage and lack of swap support means cloud nodes must always be allocated for maximum possible memory utilization, leading to overprovisioning/high costs kubernetes/kubernetes#53533 (comment)
Lack of swap support would require provisioning 3x the amount of memory as required with swap kubernetes/kubernetes#53533 (comment)
On-premise deployment can’t horizontally scale available memory based on load kubernetes/kubernetes#53533 (comment)
Scaling resources is technically feasible but cost-prohibitive, swap provides flexibility at lower cost kubernetes/kubernetes#53533 (comment)

This user story is addressed by scenario 2, and could benefit from 3.

Local development and systems with fast storage

Local development or single-node clusters and systems with fast storage may benefit from using available swap (e.g. NVMe swap partitions, one-node clusters).

Single node, local Kubernetes deployment on laptop kubernetes/kubernetes#53533 (comment)
Linux has optimizations for swap on SSD, allowing for performance boosts kubernetes/kubernetes#53533 (comment)

This user story is addressed by scenarios 1 and 2, and could benefit from 3.

Low footprint systems

For example, edge devices with limited memory.

Edge compute systems/devices with small memory footprints (<2Gi) kubernetes/kubernetes#53533 (comment)
Clusters with nodes <4Gi memory kubernetes/kubernetes#53533 (comment)

This user story is addressed by scenario 2, and could benefit from 3.

Virtualization management overhead

This would apply to virtualized Kubernetes workloads such as VMs launched by kubevirt.

Every VM comes with a management related overhead which can sporadically be pretty significant (memory streaming, SRIOV attachment, gpu attachment, virtio-fs, …). Swap helps to not request much more memory to deal with short term worst-case scenarios.

With virtualization, clusters are typically provisioned based on the workloads’ memory consumption, and any infrastructure container overhead is overcommitted. This overhead could be safely swapped out.

Required for live migration of VMs kubernetes/kubernetes#53533 (comment)

This user story is addressed by scenario 2, and could benefit from 3.

Notes/Constraints/Caveats (Optional)

In updating the CRI, we must ensure that container runtime downstreams are able to support the new configurations.

We considered adding parameters for both per-workload memory-swap and swappiness. These are documented as part of the Open Containers runtime specification for Linux memory configuration. Since memory-swap is a per-workload parameter, and swappiness is optional and can be set globally, we are choosing to only expose memory-swap which will adjust swap available to workloads.

Since we are not currently setting memory-swap in the CRI, the current default behaviour when --fail-swap-on=false is set is to allocate the same amount of swap for a workload as memory requested. We will update the default to not permit the use of swap by setting memory-swap equal to limit.

Risks and Mitigations

Having swap available on a system reduces predictability. Swap's performance is worse than regular memory, sometimes by many orders of magnitude, which can cause unexpected performance regressions. Furthermore, swap changes a system's behaviour under memory pressure, and applications cannot directly control what portions of their memory usage are swapped out. Since enabling swap permits greater memory usage for workloads in Kubernetes that cannot be predictably accounted for, it also increases the risk of noisy neighbours and unexpected packing configurations, as the scheduler cannot account for swap memory usage.

This risk is mitigated by preventing any workloads from using swap by default, even if swap is enabled and available on a system. This will allow a cluster administrator to test swap utilization just at the system level without introducing unpredictability to workload resource utilization.

Additionally, we will mitigate this risk by determining a set of metrics to quantify system stability and then gathering test and production data to determine if system stability changes when swap is available to the system and/or workloads in a number of different scenarios.

Since swap provisioning is out of scope of this proposal, this enhancement poses low risk to Kubernetes clusters that will not enable swap.

Design Details

We summarize the implementation plan as following:

Add a feature gate NodeSwap to enable swap support.
Leave the default value of kubelet flag --fail-on-swap to true, to avoid changing default behaviour.
Introduce a new kubelet config parameter, MemorySwap, which configures how much swap Kubernetes workloads can use on the node.
Introduce a new CRI parameter, memory_swap_limit_in_bytes.
Ensure container runtimes are updated so they can make use of the new CRI configuration.
Based on the behaviour set in the kubelet config, the kubelet will instruct the CRI on the amount of swap to allocate to each container. The container runtime will then write the swap settings to the container level cgroup.

Enabling swap as an end user

Swap can be enabled as follows:

Provision swap on the target worker nodes,
Enable the NodeMemorySwap feature flag on the kubelet,
Set --fail-on-swap flag to false, and
(Optional) Allow Kubernetes workloads to use swap by setting MemorySwap.SwapBehavior=UnlimitedSwap in the kubelet config.

API Changes

KubeConfig addition

We will add an optional MemorySwap value to the KubeletConfig struct in pkg/kubelet/apis/config/types.go as follows:

// KubeletConfiguration contains the configuration for the Kubelet
type KubeletConfiguration struct {
	metav1.TypeMeta
...
	// Configure swap memory available to container workloads.
	// +featureGate=NodeSwap
	// +optional
	MemorySwap MemorySwapConfiguration
}

type MemorySwapConfiguration struct {
	// Configure swap memory available to container workloads. May be one of
	// "", "LimitedSwap": workload combined memory and swap usage cannot exceed pod memory limit
	// "UnlimitedSwap": workloads can use unlimited swap, up to the allocatable limit.
	SwapBehavior string
}

We want to expose common swap configurations based on the Docker and open container specification for the --memory-swap flag. Thus, the MemorySwapConfiguration.SwapBehavior setting will have the following effects:

If SwapBehavior is not set or set to "LimitedSwap", containers do not have access to swap beyond their memory limit. This value prevents a container from using swap in excess of their memory limit, even if it is enabled on a system.
- With cgroups v1, it is possible for a container to use some swap if its combined memory and swap usage do not exceed the memory.memsw.limit_in_bytes limit.
- With cgroups v2, swap is configured independently from memory. Thus, the container runtimes can set memory.swap.max to 0 in this case, and no swap usage will be permitted.
If SwapBehavior is set to "UnlimitedSwap", the container is allowed to use unlimited swap, up to the maximum amount available on the host system.

CRI Changes

The CRI requires a corresponding change in order to allow the kubelet to set swap usage in container runtimes. We will introduce a parameter memory_swap_limit_in_bytes to the CRI API (found in k8s.io/cri-api/pkg/apis/runtime/v1/api.proto):

// LinuxContainerResources specifies Linux specific configuration for
// resources.
message LinuxContainerResources {
...
    // Memory + swap limit in bytes. Default: 0 (not specified).
    int64 memory_swap_limit_in_bytes = 9;
...
}

Test Plan

For alpha:

Swap scenarios are enabled in test-infra for at least two Linux distributions. e2e suites will be run against them.
- Container runtimes must be bumped in CI to use the new CRI.
Data should be gathered from a number of use cases to guide beta graduation and further development efforts.
- Focus should be on supported user stories as listed above.

For beta:

Add e2e tests that exercise all available swap configurations via the CRI.
Add e2e tests that verify pod-level control of swap utilization.
Add e2e tests that verify swap performance with pods using a tmpfs.
Verify new system-reserved settings for swap memory.

Graduation Criteria

Alpha

Kubelet can be started with swap enabled and will support two configurations for Kubernetes workloads: LimitedSwap and UnlimitedSwap.
Kubelet can configure CRI to allocate swap to Kubernetes workloads. By default, workloads will not be allocated any swap.
e2e test jobs are configured for Linux systems with swap enabled.

Beta

Add support for controlling swap consumption at the pod level via cgroups.
- Handle usage of swap during container restart boundaries for writes to tmpfs (which may require pod cgroup change beyond what container runtime will do at container cgroup boundary).
Add the ability to set a system-reserved quantity of swap from what kubelet detects on the host.
Consider introducing new configuration modes for swap, such as a node-wide swap limit for workloads.
Determine a set of metrics for node QoS in order to evaluate the performance of nodes with and without swap enabled.
- Better understand relationship of swap with memory QoS in cgroup v2 (particularly memory.high usage).
Collect feedback from test user cases.
Improve coverage for appropriate scenarios in testgrid.

GA

(Tentative.)

Test a wide variety of scenarios that may be affected by swap support.
Remove feature flag.

Upgrade / Downgrade Strategy

No changes are required on upgrade to maintain previous behaviour.

It is possible to downgrade a kubelet on a node that was using swap, but this would require disabling the use of swap and setting swapoff on the node.

Version Skew Strategy

Feature flag will apply to kubelet only, so version skew strategy is N/A.

Production Readiness Review Questionnaire

Feature Enablement and Rollback

How can this feature be enabled / disabled in a live cluster?

Feature gate (also fill in values in kep.yaml)
- Feature gate name: NodeSwap
- Components depending on the feature gate: API Server, Kubelet
Other
- Describe the mechanism: --fail-swap-on=false flag for kubelet must also be set at kubelet start
- Will enabling / disabling the feature require downtime of the control plane? Yes. Flag must be set on kubelet start. To disable, kubelet must be restarted. Hence, there would be brief control component downtime on a given node.
- Will enabling / disabling the feature require downtime or reprovisioning of a node? (Do not assume Dynamic Kubelet Config feature is enabled). Yes. See above; disabling would require brief node downtime.

Does enabling the feature change any default behavior?

No. If the feature flag is enabled, the user must still set --fail-swap-on=false to adjust the default behaviour.

A node must have swap provisioned and available for this feature to work. If there is no swap available, but the feature flag is set to true, there will still be no change in existing behaviour.

Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

No. The feature flag can be disabled while the --fail-swap-on=false flag is set, but this would result in undefined behaviour.

To turn this off, the kubelet would need to be restarted. If a cluster admin wants to disable swap on the node without repartitioning the node, they could stop the kubelet, set swapoff on the node, and restart the kubelet with --fail-swap-on=true. The setting of the feature flag will be ignored in this case.

What happens if we reenable the feature if it was previously rolled back?

N/A

Are there any tests for feature enablement/disablement?

N/A. This should be tested separately for scenarios with the flag enabled and disabled.

Rollout, Upgrade and Rollback Planning

How can a rollout fail? Can it impact already running workloads?

If a new node with swap memory fails to come online, it will not impact any running components.

It is possible that if a cluster administrator adds swap memory to an already running node, and then performs an in-place upgrade, the new kubelet could fail to start unless the configuration was modified to tolerate swap. However, we would expect that if a cluster admin is adding swap to the node, they will also update the kubelet's configuration to not fail with swap present.

Generally, it is considered best practice to add a swap memory partition at node image/boot time and not provision it dynamically after a kubelet is already running and reporting Ready on a node.

What specific metrics should inform a rollback?

Workload churn or performance degradations on nodes. The metrics will be application/use-case specific, but we can provide some suggestions, based on the stability metrics identified earlier.

Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

N/A because swap support lacks a runtime upgrade/downgrade path; kubelet must be restarted with or without swap support.

Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

No.

Monitoring Requirements

How can an operator determine if the feature is in use by workloads?

KubeletConfiguration has set failOnSwap: false.

The prometheus node_exporter will also export stats on swap memory utilization.

What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

TBD. We will determine a set of metrics as a requirement for beta graduation. We will need more production data; there is not a single metric or set of metrics that can be used to generally quantify node performance.

This section to be updated before the feature can be marked as graduated, and to be worked on during 1.23 development.

Metrics
- Metric name:
- [Optional] Aggregation method:
- Components exposing the metric:
Other (treat as last resort)
- Details:

What are the reasonable SLOs (Service Level Objectives) for the above SLIs?

N/A

Are there any missing metrics that would be useful to have to improve observability of this feature?

N/A

Dependencies

Does this feature depend on any specific services running in the cluster?

No.

Scalability

Will enabling / using this feature result in any new API calls?

No.

Will enabling / using this feature result in introducing new API types?

No.

Will enabling / using this feature result in any new calls to the cloud provider?

No.

Will enabling / using this feature result in increasing size or count of the existing API objects?

The KubeletConfig API object may slightly increase in size due to new config fields.

Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

It is possible for this feature to affect performance of some worker node-level SLIs/SLOs. We will need to monitor for differences, particularly during beta testing, when evaluating this feature for beta and graduation.

Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?

Yes. It will permit the utilization of swap memory (i.e. disk) on nodes. This is expected, as this enhancement is enabling cluster administrators to access this resource.

Troubleshooting

How does this feature react if the API server and/or etcd is unavailable?

No change. Feature is specific to individual nodes.

What are other known failure modes?

Individual nodes with swap memory enabled may experience performance degradations under load. This could potentially cause a cascading failure on nodes without swap: if nodes with swap fail Ready checks, workloads may be rescheduled en masse.

Thus, cluster administrators should be careful while enabling swap. To minimize disruption, you may want to taint nodes with swap available to protect against this problem. Taints will ensure that workloads which tolerate swap will not spill onto nodes without swap under load.

What steps should be taken if SLOs are not being met to determine the problem?

It is suggested that if nodes with swap memory enabled cause performance or stability degradations, those nodes are cordoned, drained, and replaced with nodes that do not use swap memory.

Implementation History

2015-04-24: Discussed in #7294.
2017-10-06: Discussed in #53533.
2021-01-05: Initial design discussion document for swap support and use cases.
2021-04-05: Alpha KEP drafted for initial node-level swap support and implementation (KEP-2400).

Drawbacks

When swap is enabled, particularly for workloads, the kubelet’s resource accounting may become much less accurate. This may make cluster administration more difficult and less predictable.

Currently, there exists an unsupported workaround, which is setting the kubelet flag --fail-swap-on to false.

Alternatives

Just set `--fail-swap-on=false`

This is insufficient for most use cases because there is inconsistent control over how swap will be used by various container runtimes. Dockershim currently sets swap available for workloads to 0. The CRI does not restrict it at all. This inconsistency makes it difficult or impossible to use swap in production, particularly if a user wants to restrict workloads from using swap when using the CRI rather than dockershim.

Restrict swap usage at the cgroup level

Setting a swap limit at the cgroup level would allow us to restrict the usage of swap on a pod-level, rather than container-level basis.

For alpha, we are opting for the container-level basis to simplify the implementation (as the container runtimes already support configuration of swap with the memory-swap-limit parameter). This will also provide the necessary plumbing for container-level accounting of swap, if that is proposed in the future.

In beta, we may want to revisit this.

See the Pod Resource Management design proposal for more background on the cgroup limits the kubelet currently sets based on each QoS class.

Infrastructure Needed (Optional)

We may need Linux VM images built with swap partitions for e2e testing in CI.

Files

README.md

Latest commit

History

README.md

File metadata and controls

KEP-2400: Node system swap support

Release Signoff Checklist

Summary

Motivation

Scenarios

Goals

Non-Goals

Proposal

User Stories

Improved Node Stability

Long-running applications that swap out startup memory

Memory Flexibility

Local development and systems with fast storage

Low footprint systems

Virtualization management overhead

Notes/Constraints/Caveats (Optional)

Risks and Mitigations

Design Details

Enabling swap as an end user

API Changes

KubeConfig addition

CRI Changes

Test Plan

Graduation Criteria

Alpha

Beta

GA

Upgrade / Downgrade Strategy

Version Skew Strategy

Production Readiness Review Questionnaire

Feature Enablement and Rollback

How can this feature be enabled / disabled in a live cluster?

Does enabling the feature change any default behavior?

Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

What happens if we reenable the feature if it was previously rolled back?

Are there any tests for feature enablement/disablement?

Rollout, Upgrade and Rollback Planning

How can a rollout fail? Can it impact already running workloads?

What specific metrics should inform a rollback?

Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

Monitoring Requirements

How can an operator determine if the feature is in use by workloads?

What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

What are the reasonable SLOs (Service Level Objectives) for the above SLIs?

Are there any missing metrics that would be useful to have to improve observability of this feature?

Dependencies

Does this feature depend on any specific services running in the cluster?

Scalability

Will enabling / using this feature result in any new API calls?

Will enabling / using this feature result in introducing new API types?

Will enabling / using this feature result in any new calls to the cloud provider?

Will enabling / using this feature result in increasing size or count of the existing API objects?

Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?

Troubleshooting

How does this feature react if the API server and/or etcd is unavailable?

What are other known failure modes?

What steps should be taken if SLOs are not being met to determine the problem?

Implementation History

Drawbacks

Alternatives

Just set --fail-swap-on=false

Restrict swap usage at the cgroup level

Infrastructure Needed (Optional)

Just set `--fail-swap-on=false`