- Release Signoff Checklist
- Summary
- Motivation
- Proposal
- Design Details
- Production Readiness Review Questionnaire
- Implementation History
- Drawbacks
- Alternatives
- Infrastructure Needed (Optional)
Items marked with (R) are required prior to targeting to a milestone / release.
- (R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
- (R) KEP approvers have approved the KEP status as `implementable`
- (R) Design details are appropriately documented
- (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- (R) Graduation criteria is in place
- (R) Production readiness review completed
- (R) Production readiness review approved
- "Implementation History" section is up-to-date for milestone
- User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
- Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
Kubernetes currently does not support the use of swap memory on Linux, as it is difficult to provide guarantees and account for pod memory utilization when swap is involved. As part of Kubernetes’ earlier design, swap support was considered out of scope.
However, there are a number of use cases that would benefit from Kubernetes nodes supporting swap. Hence, this proposal aims to add swap support to nodes in a controlled, predictable manner so that Kubernetes users can perform testing and provide data to continue building cluster capabilities on top of swap.
There are two distinct types of user for swap, who may overlap:
- node administrators, who may want swap available for node-level performance tuning and stability/reducing noisy neighbour issues
- application developers, who have written applications that would benefit from using swap memory
There are hence a number of possible ways that one could envision swap use on a node.
- Swap is enabled on a node's host system, but the kubelet does not permit Kubernetes workloads to use swap. (This scenario is a prerequisite for the following use cases.)
- Swap is enabled at the node level. The kubelet can permit Kubernetes workloads scheduled on the node to use some quantity of swap, depending on the configuration.
- Swap is set on a per-workload basis. The kubelet sets swap limits for each individual workload.
This KEP will be limited in scope to the first two scenarios. The third can be addressed in a follow-up KEP. The enablement work that is in scope for this KEP will be necessary to implement the third scenario.
- On Linux systems, when swap is provisioned and available, Kubelet can start up with swap on.
- Configuration is available for kubelet to set swap utilization available to Kubernetes workloads, defaulting to 0 swap.
- Cluster administrators can enable and configure kubelet swap utilization on a per-node basis.
- Use of swap memory with both cgroups v1 and cgroups v2 is supported.
- Addressing non-Linux operating systems. Swap support will only be available for Linux.
- Provisioning swap. Swap must already be available on the system.
- Setting swappiness. This can already be set on a system-wide level outside of Kubernetes.
- Allocating swap on a per-workload basis with accounting (e.g. pod-level specification of swap). If desired, this should be designed and implemented as part of a follow-up KEP. This KEP is a prerequisite for that work. Hence, swap will be an overcommitted resource in the context of this KEP.
- Supporting zram, zswap, or other memory types like SGX EPC. These could be addressed in a follow-up KEP, and are out of scope.
We propose that, when swap is provisioned and available on a node, cluster administrators can configure the kubelet such that:
- It can start with swap on.
- It will direct the CRI to allocate Kubernetes workloads 0 swap by default.
- It will have configuration options to configure swap utilization for the entire node.
This proposal enables scenarios 1 and 2 above, but not 3.
Improved memory management algorithms for cgroups v2, such as oomd, strongly recommend the use of swap. Hence, having a small amount of swap available on nodes could enable better resource pressure handling and recovery.
- https://man7.org/linux/man-pages/man8/systemd-oomd.service.8.html
- https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html#id1
- https://chrisdown.name/2018/01/02/in-defence-of-swap.html
- https://media.ccc.de/v/ASG2018-175-oomd
- https://github.com/facebookincubator/oomd/blob/master/docs/production_setup.md#swap
This user story is addressed by scenarios 1 and 2, and could benefit from 3.
- Applications such as the Java and Node runtimes rely on swap for optimal performance kubernetes/kubernetes#53533 (comment)
- Initialization logic of applications can be safely swapped out without affecting long-running application resource usage kubernetes/kubernetes#53533 (comment)
This user story is addressed by scenario 2, and could benefit from 3.
This user story addresses cases in which cost of additional memory is prohibitive, or elastic scaling is impossible (e.g. on-premise/bare metal deployments).
- Occasional cron job with high memory usage and lack of swap support means cloud nodes must always be allocated for maximum possible memory utilization, leading to overprovisioning/high costs kubernetes/kubernetes#53533 (comment)
- Lack of swap support would require provisioning 3x the amount of memory as required with swap kubernetes/kubernetes#53533 (comment)
- On-premise deployment can’t horizontally scale available memory based on load kubernetes/kubernetes#53533 (comment)
- Scaling resources is technically feasible but cost-prohibitive, swap provides flexibility at lower cost kubernetes/kubernetes#53533 (comment)
This user story is addressed by scenario 2, and could benefit from 3.
Local development or single-node clusters and systems with fast storage may benefit from using available swap (e.g. NVMe swap partitions, one-node clusters).
- Single node, local Kubernetes deployment on laptop kubernetes/kubernetes#53533 (comment)
- Linux has optimizations for swap on SSD, allowing for performance boosts kubernetes/kubernetes#53533 (comment)
This user story is addressed by scenarios 1 and 2, and could benefit from 3.
For example, edge devices with limited memory.
- Edge compute systems/devices with small memory footprints (<2Gi) kubernetes/kubernetes#53533 (comment)
- Clusters with nodes <4Gi memory kubernetes/kubernetes#53533 (comment)
This user story is addressed by scenario 2, and could benefit from 3.
This would apply to virtualized Kubernetes workloads such as VMs launched by kubevirt.
Every VM comes with management-related overhead, which can sporadically be quite significant (memory streaming, SR-IOV attachment, GPU attachment, virtio-fs, …). Swap helps avoid requesting much more memory than needed just to handle short-term worst-case scenarios.
With virtualization, clusters are typically provisioned based on the workloads’ memory consumption, and any infrastructure container overhead is overcommitted. This overhead could be safely swapped out.
- Required for live migration of VMs kubernetes/kubernetes#53533 (comment)
This user story is addressed by scenario 2, and could benefit from 3.
In updating the CRI, we must ensure that container runtime downstreams are able to support the new configurations.
We considered adding parameters for both per-workload `memory-swap` and `swappiness`. These are documented as part of the Open Containers runtime specification for Linux memory configuration. Since `memory-swap` is a per-workload parameter, and `swappiness` is optional and can be set globally, we are choosing to only expose `memory-swap`, which will adjust the swap available to workloads.

Since we are not currently setting `memory-swap` in the CRI, the current default behaviour when `--fail-swap-on=false` is set is to allocate a workload the same amount of swap as the memory it requested. We will update the default to not permit the use of swap by setting `memory-swap` equal to `limit`.
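The default change can be sketched as follows. This is an illustrative function (not the actual kubelet code path): since runtimes interpret the memory-swap value as a combined memory+swap limit, setting it equal to the memory limit leaves zero bytes of swap, while the old behaviour effectively allowed swap equal to the memory requested.

```go
package main

import "fmt"

// defaultMemorySwapLimit sketches the combined memory+swap limit (in bytes)
// that would be handed to the runtime under the old and new defaults.
func defaultMemorySwapLimit(memoryLimitBytes int64, swapDisallowed bool) int64 {
	if swapDisallowed {
		// New default: memory-swap == memory limit, i.e. no swap usable.
		return memoryLimitBytes
	}
	// Old behaviour: swap equal to the memory requested, i.e. the
	// combined limit is twice the memory limit.
	return 2 * memoryLimitBytes
}

func main() {
	limit := int64(1 << 30) // a 1Gi memory limit
	fmt.Println(defaultMemorySwapLimit(limit, true))  // 1073741824
	fmt.Println(defaultMemorySwapLimit(limit, false)) // 2147483648
}
```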
Having swap available on a system reduces predictability. Swap's performance is worse than regular memory, sometimes by many orders of magnitude, which can cause unexpected performance regressions. Furthermore, swap changes a system's behaviour under memory pressure, and applications cannot directly control what portions of their memory usage are swapped out. Since enabling swap permits greater memory usage for workloads in Kubernetes that cannot be predictably accounted for, it also increases the risk of noisy neighbours and unexpected packing configurations, as the scheduler cannot account for swap memory usage.
This risk is mitigated by preventing any workloads from using swap by default, even if swap is enabled and available on a system. This will allow a cluster administrator to test swap utilization just at the system level without introducing unpredictability to workload resource utilization.
Additionally, we will mitigate this risk by determining a set of metrics to quantify system stability and then gathering test and production data to determine if system stability changes when swap is available to the system and/or workloads in a number of different scenarios.
Since swap provisioning is out of scope of this proposal, this enhancement poses low risk to Kubernetes clusters that will not enable swap.
We summarize the implementation plan as follows:
- Add a feature gate `NodeSwap` to enable swap support.
- Leave the default value of the kubelet flag `--fail-swap-on` as `true`, to avoid changing default behaviour.
- Introduce a new kubelet config parameter, `MemorySwap`, which configures how much swap Kubernetes workloads can use on the node.
- Introduce a new CRI parameter, `memory_swap_limit_in_bytes`.
- Ensure container runtimes are updated so they can make use of the new CRI configuration.
- Based on the behaviour set in the kubelet config, the kubelet will instruct the CRI on the amount of swap to allocate to each container. The container runtime will then write the swap settings to the container-level cgroup.
Swap can be enabled as follows:
- Provision swap on the target worker nodes.
- Enable the `NodeSwap` feature gate on the kubelet.
- Set the `--fail-swap-on` flag to `false`.
- (Optional) Allow Kubernetes workloads to use swap by setting `MemorySwap.SwapBehavior=UnlimitedSwap` in the kubelet config.
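The enablement steps above amount to a set of preconditions, which can be sketched as a check like the following (a hypothetical helper for illustration, not part of the kubelet):

```go
package main

import (
	"errors"
	"fmt"
)

// checkSwapEnablement mirrors the enablement steps: swap must be provisioned,
// --fail-swap-on must be false, and the NodeSwap feature gate must be enabled
// before workloads can be granted swap.
func checkSwapEnablement(swapProvisioned, nodeSwapGate, failSwapOn bool) error {
	if !swapProvisioned {
		return errors.New("no swap provisioned on the node")
	}
	if failSwapOn {
		return errors.New("--fail-swap-on must be set to false")
	}
	if !nodeSwapGate {
		return errors.New("NodeSwap feature gate must be enabled")
	}
	return nil
}

func main() {
	fmt.Println(checkSwapEnablement(true, true, false)) // <nil>
	fmt.Println(checkSwapEnablement(true, false, true))
}
```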
We will add an optional `MemorySwap` value to the `KubeletConfiguration` struct in `pkg/kubelet/apis/config/types.go` as follows:
// KubeletConfiguration contains the configuration for the Kubelet
type KubeletConfiguration struct {
metav1.TypeMeta
...
// Configure swap memory available to container workloads.
// +featureGate=NodeSwap
// +optional
MemorySwap MemorySwapConfiguration
}
type MemorySwapConfiguration struct {
// Configure swap memory available to container workloads. May be one of
// "", "LimitedSwap": workload combined memory and swap usage cannot exceed pod memory limit
// "UnlimitedSwap": workloads can use unlimited swap, up to the allocatable limit.
SwapBehavior string
}
We want to expose common swap configurations based on the Docker and Open Containers specification for the `--memory-swap` flag. Thus, the `MemorySwapConfiguration.SwapBehavior` setting will have the following effects:
- If `SwapBehavior` is not set, or set to `"LimitedSwap"`, containers do not have access to swap beyond their memory limit. This prevents a container from using swap in excess of its memory limit, even if swap is enabled on the system.
  - With cgroups v1, it is possible for a container to use some swap if its combined memory and swap usage does not exceed the `memory.memsw.limit_in_bytes` limit.
  - With cgroups v2, swap is configured independently from memory. Thus, the container runtime can set `memory.swap.max` to 0 in this case, and no swap usage will be permitted.
- If `SwapBehavior` is set to `"UnlimitedSwap"`, the container is allowed to use unlimited swap, up to the maximum amount available on the host system.
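As an illustration of these semantics, the mapping from `SwapBehavior` to per-container cgroup settings could look like the sketch below (assumed helper names; -1 stands in for cgroup v1's unlimited value and cgroup v2's `"max"`; not the actual runtime code):

```go
package main

import "fmt"

// swapSetting returns the cgroup file and value a runtime could write for a
// given SwapBehavior. memoryLimitBytes is the container's memory limit.
func swapSetting(behavior string, cgroupV2 bool, memoryLimitBytes int64) (string, int64) {
	switch behavior {
	case "", "LimitedSwap":
		if cgroupV2 {
			// cgroup v2 accounts swap separately: 0 forbids any swap use.
			return "memory.swap.max", 0
		}
		// cgroup v1 accounts memory+swap together: a combined limit equal
		// to the memory limit leaves no room for swap.
		return "memory.memsw.limit_in_bytes", memoryLimitBytes
	case "UnlimitedSwap":
		if cgroupV2 {
			return "memory.swap.max", -1 // i.e. "max": unlimited
		}
		return "memory.memsw.limit_in_bytes", -1 // unlimited
	}
	return "", 0
}

func main() {
	file, value := swapSetting("LimitedSwap", true, 1<<30)
	fmt.Println(file, value) // memory.swap.max 0
}
```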
The CRI requires a corresponding change in order to allow the kubelet to set swap usage in container runtimes. We will introduce a parameter `memory_swap_limit_in_bytes` to the CRI API (found in `k8s.io/cri-api/pkg/apis/runtime/v1/api.proto`):
// LinuxContainerResources specifies Linux specific configuration for
// resources.
message LinuxContainerResources {
...
// Memory + swap limit in bytes. Default: 0 (not specified).
int64 memory_swap_limit_in_bytes = 9;
...
}
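A sketch of how the kubelet could populate this new field is shown below. The struct mirrors only the relevant fields of the proto message, and the mapping is a hypothetical illustration: with `LimitedSwap` (or unset) the combined memory+swap limit equals the memory limit, so no swap is usable; with `UnlimitedSwap` the field is left at 0 ("not specified"), deferring to the runtime default.

```go
package main

import "fmt"

// LinuxContainerResources mirrors the relevant fields of the proposed CRI
// message, for illustration only.
type LinuxContainerResources struct {
	MemoryLimitInBytes     int64
	MemorySwapLimitInBytes int64
}

// resourcesFor fills the new CRI field from the configured SwapBehavior.
func resourcesFor(swapBehavior string, memoryLimitBytes int64) LinuxContainerResources {
	r := LinuxContainerResources{MemoryLimitInBytes: memoryLimitBytes}
	if swapBehavior == "" || swapBehavior == "LimitedSwap" {
		// Combined memory+swap limit equals the memory limit: no swap.
		r.MemorySwapLimitInBytes = memoryLimitBytes
	}
	return r
}

func main() {
	r := resourcesFor("LimitedSwap", 256*1024*1024)
	fmt.Println(r.MemorySwapLimitInBytes) // 268435456
}
```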
For alpha:
- Swap scenarios are enabled in test-infra for at least two Linux
distributions. e2e suites will be run against them.
- Container runtimes must be bumped in CI to use the new CRI.
- Data should be gathered from a number of use cases to guide beta graduation
and further development efforts.
- Focus should be on supported user stories as listed above.
For beta:
- Add e2e tests that exercise all available swap configurations via the CRI.
- Add e2e tests that verify pod-level control of swap utilization.
- Add e2e tests that verify swap performance with pods using a tmpfs.
- Verify new system-reserved settings for swap memory.
- Kubelet can be started with swap enabled and will support two configurations for Kubernetes workloads: `LimitedSwap` and `UnlimitedSwap`.
- Kubelet can configure the CRI to allocate swap to Kubernetes workloads. By default, workloads will not be allocated any swap.
- e2e test jobs are configured for Linux systems with swap enabled.
- Add support for controlling swap consumption at the pod level via cgroups.
- Handle usage of swap during container restart boundaries for writes to tmpfs (which may require pod cgroup change beyond what container runtime will do at container cgroup boundary).
- Add the ability to set a system-reserved quantity of swap from what kubelet detects on the host.
- Consider introducing new configuration modes for swap, such as a node-wide swap limit for workloads.
- Determine a set of metrics for node QoS in order to evaluate the performance of nodes with and without swap enabled.
- Better understand the relationship of swap with memory QoS in cgroup v2 (particularly `memory.high` usage).
- Collect feedback from test use cases.
- Improve coverage for appropriate scenarios in testgrid.
(Tentative.)
- Test a wide variety of scenarios that may be affected by swap support.
- Remove feature flag.
No changes are required on upgrade to maintain previous behaviour.
It is possible to downgrade a kubelet on a node that was using swap, but this would require disabling the use of swap and running `swapoff` on the node.
Feature flag will apply to kubelet only, so version skew strategy is N/A.
- Feature gate (also fill in values in `kep.yaml`)
  - Feature gate name: `NodeSwap`
  - Components depending on the feature gate: API Server, Kubelet
- Other
  - Describe the mechanism: the `--fail-swap-on=false` flag must also be set at kubelet start.
  - Will enabling / disabling the feature require downtime of the control plane? Yes. The flag must be set at kubelet start; to disable it, the kubelet must be restarted. Hence, there would be brief control-component downtime on a given node.
  - Will enabling / disabling the feature require downtime or reprovisioning of a node? (Do not assume the Dynamic Kubelet Config feature is enabled.) Yes. See above; disabling would require brief node downtime.
No. If the feature flag is enabled, the user must still set `--fail-swap-on=false` to adjust the default behaviour.
A node must have swap provisioned and available for this feature to work. If there is no swap available, but the feature flag is set to true, there will still be no change in existing behaviour.
No. The feature flag can be disabled while the `--fail-swap-on=false` flag is set, but this would result in undefined behaviour.
To turn this off, the kubelet would need to be restarted. If a cluster admin wants to disable swap on the node without repartitioning the node, they could stop the kubelet, run `swapoff` on the node, and restart the kubelet with `--fail-swap-on=true`. The setting of the feature flag will be ignored in this case.
N/A
N/A. This should be tested separately for scenarios with the flag enabled and disabled.
If a new node with swap memory fails to come online, it will not impact any running components.
It is possible that if a cluster administrator adds swap memory to an already running node, and then performs an in-place upgrade, the new kubelet could fail to start unless the configuration was modified to tolerate swap. However, we would expect that if a cluster admin is adding swap to the node, they will also update the kubelet's configuration to not fail with swap present.
Generally, it is considered best practice to add a swap memory partition at node image/boot time and not provision it dynamically after a kubelet is already running and reporting Ready on a node.
Workload churn or performance degradations on nodes. The metrics will be application/use-case specific, but we can provide some suggestions, based on the stability metrics identified earlier.
N/A because swap support lacks a runtime upgrade/downgrade path; kubelet must be restarted with or without swap support.
Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
No.
The KubeletConfiguration has `failOnSwap: false` set.
The Prometheus `node_exporter` will also export stats on swap memory utilization.
What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
TBD. We will determine a set of metrics as a requirement for beta graduation. We will need more production data; there is not a single metric or set of metrics that can be used to generally quantify node performance.
This section to be updated before the feature can be marked as graduated, and to be worked on during 1.23 development.
- Metrics
- Metric name:
- [Optional] Aggregation method:
- Components exposing the metric:
- Other (treat as last resort)
- Details:
N/A
Are there any missing metrics that would be useful to have to improve observability of this feature?
N/A
No.
No.
No.
No.
The KubeletConfig API object may slightly increase in size due to new config fields.
Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
It is possible for this feature to affect performance of some worker node-level SLIs/SLOs. We will need to monitor for differences, particularly during beta testing, when evaluating this feature for beta and graduation.
Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
Yes. It will permit the utilization of swap memory (i.e. disk) on nodes. This is expected, as this enhancement is enabling cluster administrators to access this resource.
No change. Feature is specific to individual nodes.
Individual nodes with swap memory enabled may experience performance degradations under load. This could potentially cause a cascading failure on nodes without swap: if nodes with swap fail Ready checks, workloads may be rescheduled en masse.
Thus, cluster administrators should be careful while enabling swap. To minimize disruption, you may want to taint nodes with swap available to protect against this problem. Taints will ensure that workloads which tolerate swap will not spill onto nodes without swap under load.
It is suggested that if nodes with swap memory enabled cause performance or stability degradations, those nodes are cordoned, drained, and replaced with nodes that do not use swap memory.
- 2015-04-24: Discussed in #7294.
- 2017-10-06: Discussed in #53533.
- 2021-01-05: Initial design discussion document for swap support and use cases.
- 2021-04-05: Alpha KEP drafted for initial node-level swap support and implementation (KEP-2400).
When swap is enabled, particularly for workloads, the kubelet’s resource accounting may become much less accurate. This may make cluster administration more difficult and less predictable.
Currently, there exists an unsupported workaround: setting the kubelet flag `--fail-swap-on` to `false`.
This is insufficient for most use cases because there is inconsistent control over how swap will be used by various container runtimes. Dockershim currently sets swap available for workloads to 0. The CRI does not restrict it at all. This inconsistency makes it difficult or impossible to use swap in production, particularly if a user wants to restrict workloads from using swap when using the CRI rather than dockershim.
Setting a swap limit at the cgroup level would allow us to restrict swap usage on a per-pod, rather than per-container, basis.
For alpha, we are opting for the container-level basis to simplify the implementation (as the container runtimes already support configuration of swap with the `memory-swap-limit` parameter). This will also provide the necessary plumbing for container-level accounting of swap, if that is proposed in the future.
In beta, we may want to revisit this.
See the Pod Resource Management design proposal for more background on the cgroup limits the kubelet currently sets based on each QoS class.
We may need Linux VM images built with swap partitions for e2e testing in CI.