Commit f4af444

Complete PRR questionaire
Signed-off-by: Itamar Holder <iholder@redhat.com>
1 parent 017778c commit f4af444

1 file changed

+70
-54
lines changed

keps/sig-node/2400-node-swap/README.md

@@ -62,7 +62,6 @@
6262
- [Drawbacks](#drawbacks)
6363
- [Alternatives](#alternatives)
6464
- [Just set <code>--fail-swap-on=false</code>](#just-set---fail-swap-onfalse)
65-
- [Restrict swap usage at the cgroup level](#restrict-swap-usage-at-the-cgroup-level)
6665
- [Infrastructure Needed (Optional)](#infrastructure-needed-optional)
6766
<!-- /toc -->
6867

@@ -927,6 +926,10 @@ automations, so be extremely careful here.
927926

928927
No. If the feature flag is enabled, the user must still set
929928
`--fail-swap-on=false` to adjust the default behaviour.
929+
In addition, since the default "swap behavior" is "NoSwap",
930+
by default containers would not be able to access swap. Instead,
931+
the administrator would need to set a non-default behavior in order
932+
for swap to be accessible.
930933

931934
A node must have swap provisioned and available for this feature to work. If
932935
there is no swap available, but the feature flag is set to true, there will
@@ -959,7 +962,8 @@ for workloads.
959962

960963
###### What happens if we reenable the feature if it was previously rolled back?
961964

962-
N/A
965+
As described above, swap can be turned on and off, although kubelet would need to be
966+
restarted.
963967

964968
###### Are there any tests for feature enablement/disablement?
965969

@@ -970,8 +974,18 @@ with and without the feature, are necessary. At the very least, think about
970974
conversion tests if API types are being modified.
971975
-->
972976

973-
N/A. This should be tested separately for scenarios with the flag enabled and
974-
disabled.
977+
There are extensive tests to ensure that the swap feature works as expected.
978+
979+
Unit tests are in place to test that this feature operates as expected with
980+
cgroup v1/v2, the feature gate being on/off, and different swap behaviors defined.
981+
982+
In addition, node e2e tests are added and run as part of the node-conformance
983+
suite. These tests ensure that the underlying cgroup knobs are being configured
984+
as expected.
985+
986+
Furthermore, "swap-conformance" periodic lanes have been introduced for the purpose of
987+
testing swap on a stressed environment. These tests ensure that swap kicks in when
988+
expected, tested while stressing both on the node-level and container-level.
975989

976990
### Rollout, Upgrade and Rollback Planning
977991

@@ -1037,9 +1051,8 @@ This section must be completed when targeting beta to a release.
10371051

10381052
###### How can someone using this feature know that it is working for their instance?
10391053

1040-
See #swap-metrics
1041-
1042-
1. Kubelet stats API will be extended to show swap usage details.
1054+
See #swap-metrics: available by both Summary API (/stats/summary) and Prometheus (/metrics/resource)
1055+
which provide how and if swap is utilized in the node, pod and container level.
10431056

10441057
###### How can an operator determine if the feature is in use by workloads?
10451058

@@ -1049,6 +1062,9 @@ checking if there are objects with field X set) may be a last resort. Avoid
10491062
logs or events for this purpose.
10501063
-->
10511064

1065+
See #swap-metrics: available by both Summary API (/stats/summary) and Prometheus (/metrics/resource)
1066+
which provide how and if swap is utilized in the node, pod and container level.
1067+
10521068
KubeletConfiguration has set `failOnSwap: false`.
10531069

10541070
The prometheus `node_exporter` will also export stats on swap memory
@@ -1060,19 +1076,22 @@ utilization.
10601076
Pick one more of these and delete the rest.
10611077
-->
10621078

1063-
TBD. We will determine a set of metrics as a requirement for beta graduation.
1064-
We will need more production data; there is not a single metric or set of
1065-
metrics that can be used to generally quantify node performance.
1066-
1067-
This section to be updated before the feature can be marked as graduated, and
1068-
to be worked on during 1.23 development.
1069-
1070-
We will also add swap memory utilization to the Kubelet stats API, to provide a means of monitoring this beyond cadvisor Prometheus stats.
1071-
1072-
- [ ] Metrics
1073-
- Metric name:
1074-
- [Optional] Aggregation method:
1075-
- Components exposing the metric:
1079+
See #swap-metrics: available by both Summary API (/stats/summary) and Prometheus (/metrics/resource)
1080+
which provide how and if swap is utilized in the node, pod and container level.
1081+
1082+
- [X] Metrics
1083+
- Metric names:
1084+
- `container_swap_usage_bytes`
1085+
- `pod_swap_usage_bytes`
1086+
- `node_swap_usage_bytes`
1087+
Components exposing the metric: `/metrics/resource` endpoint
1088+
- Metric names:
1089+
- `node.swap.swapUsageBytes`
1090+
- `node.swap.swapAvailableBytes`
1091+
- `node.systemContainers.swap.swapUsageBytes`
1092+
- `pods[i].swap.swapUsageBytes`
1093+
- `pods[i].containers[i].swap.swapUsageBytes`
1094+
Components exposing the metric: `/stats/summary` endpoint
10761095
- [ ] Other (treat as last resort)
10771096
- Details:
10781097

@@ -1088,7 +1107,14 @@ high level (needs more precise definitions) those may be things like:
10881107
- 99,9% of /health requests per day finish with 200 code
10891108
-->
10901109

1091-
N/A
1110+
Swap is being managed by the kernel, depends on many factors and configurations
1111+
that are outside of kubelet's reach like the nature of the workloads running on the node,
1112+
swap capacity, memory capacity and other distro-specific configurations. However, generally:
1113+
1114+
- Nodes with swap enabled -> `node.swap.swapAvailableBytes` should be non-zero.
1115+
- Nodes with memory pressure -> `node.swap.swapUsageBytes` should be non-zero.
1116+
- Containers that reach their memory limit threshold -> `pods[i].containers[i].swap.swapUsageBytes` should be non-zero.
1117+
- Pods with containers that reach their memory limit threshold -> `pods[i].swap.swapUsageBytes` should be non-zero.
10921118

10931119
###### Are there any missing metrics that would be useful to have to improve observability of this feature?
10941120

@@ -1203,9 +1229,11 @@ Think about adding additional work or introducing new steps in between
12031229
-->
12041230

12051231
Yes, enabling swap can affect performance of other critical daemons on the system.
1206-
Any scenario where swap memory gets utilized is a result of system running out of physical RAM.
1232+
Any scenario where swap memory gets utilized is a result of system running out of physical RAM,
1233+
or a container reaching its memory limit threshold.
12071234
Hence, to maintain the SLIs/SLOs of critical daemons on the node we highly recommend to disable the swap for the system.slice
1208-
along with reserving adequate enough system reserved memory.
1235+
along with reserving adequate enough system reserved memory, giving io latency precedence to the system.slice, and more.
1236+
See #best practices for more info.
12091237

12101238
The SLI that could potentially be impacted is [pod startup latency](https://github.com/kubernetes/community/blob/master/sig-scalability/slos/pod_startup_latency.md).
12111239
If the container runtime or kubelet are performing slower than expected, pod startup latency would be impacted.
@@ -1283,6 +1311,8 @@ nodes that do not use swap memory.
12831311
- **2023-04-17:** KEP update for beta1 [#3957](https://github.com/kubernetes/enhancements/pull/3957).
12841312
- **2023-08-15:** Beta1 released in kubernetes 1.28
12851313
- **2024-01-12:** Updates to Beta2 KEP.
1314+
- **2024-01-08:** Beta2 released in kubernetes 1.30.
1315+
- **2024-06-18:** Updates to KEP, GA requirements and intention to release in version 1.32.
12861316

12871317
## Drawbacks
12881318

@@ -1294,41 +1324,24 @@ When swap is enabled, particularly for workloads, the kubelet’s resource
12941324
accounting may become much less accurate. This may make cluster administration
12951325
more difficult and less predictable.
12961326

1297-
Currently, there exists an unsupported workaround, which is setting the kubelet
1298-
flag `--fail-swap-on` to false.
1327+
In general, swap is less predictable and might cause performance degradation.
1328+
It also might be hard in certain scenarios to understand why certain workloads
1329+
are the chosen candidates for swapping, which could occur for reasons external
1330+
to the workload.
1331+
1332+
In addition, containers with memory limits would be killed less frequently
1333+
since with swap enabled the kernel can usually reclaim a lot more memory.
1334+
While this can help to avoid crashes, it could also "hide a problem" of a container
1335+
reaching its memory limits.
12991336

13001337
## Alternatives
13011338

13021339
### Just set `--fail-swap-on=false`
13031340

1304-
This is insufficient for most use cases because there is inconsistent control
1305-
over how swap will be used by various container runtimes. Dockershim currently
1306-
sets swap available for workloads to 0. The CRI does not restrict it at all.
1307-
This inconsistency makes it difficult or impossible to use swap in production,
1308-
particularly if a user wants to restrict workloads from using swap when using
1309-
the CRI rather than dockershim.
1310-
1311-
This is also a breaking change.
1312-
Users have used --fail-swap-on=false to allow for kubernetes to run
1313-
on a swap enabled node.
1314-
1315-
### Restrict swap usage at the cgroup level
1316-
1317-
Setting a swap limit at the cgroup level would allow us to restrict the usage
1318-
of swap on a pod-level, rather than container-level basis.
1319-
1320-
For alpha, we are opting for the container-level basis to simplify the
1321-
implementation (as the container runtimes already support configuration of swap
1322-
with the `memory-swap-limit` parameter). This will also provide the necessary
1323-
plumbing for container-level accounting of swap, if that is proposed in the
1324-
future.
1325-
1326-
In beta, we may want to revisit this.
1327-
1328-
See the [Pod Resource Management design proposal] for more background on the
1329-
cgroup limits the kubelet currently sets based on each QoS class.
1330-
1331-
[Pod Resource Management design proposal]: https://github.com/kubernetes/design-proposals-archive/blob/master/node/pod-resource-management.md#pod-level-cgroups
1341+
When `--fail-swap-on=false` is provided to Kubelet but swap is not configured
1342+
otherwise, it is guaranteed that, by default, no Kubernetes workloads would
1343+
be able to utilize swap. However, everything outside of kubelet's reach
1344+
(e.g. system daemons, kubelet, etc) would be able to use swap.
13321345

13331346
## Infrastructure Needed (Optional)
13341347

@@ -1338,4 +1351,7 @@ new subproject, repos requested, or GitHub details. Listing these here allows a
13381351
SIG to get the process for these resources started right away.
13391352
-->
13401353

1341-
We may need Linux VM images built with swap partitions for e2e testing in CI.
1354+
Added the "swap-conformance" lane for extensive swap testing under node pressure: [kubelet-swap-conformance-fedora-serial](https://testgrid.k8s.io/sig-node-kubelet#kubelet-swap-conformance-fedora-serial),
1355+
[kubelet-swap-conformance-ubuntu-serial](https://testgrid.k8s.io/sig-node-kubelet#kubelet-swap-conformance-ubuntu-serial).
1356+
1357+
See #e2e tests above for more information
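
The `/stats/summary` fields listed in the metrics hunk above (`node.swap.swapUsageBytes`, `pods[i].swap.swapUsageBytes`, and so on) could be read roughly as in the sketch below. It is an illustration only, not part of the commit: the `swap_usage` helper and the `SAMPLE` document are hypothetical, and the `podRef` and container `name` keys are assumed to match what the Summary API returns.

```python
import json


def swap_usage(summary):
    """Collect swap byte counts at node, pod and container level.

    `summary` is a parsed /stats/summary document (a dict). Missing
    fields are reported as 0 rather than raising.
    """
    node_swap = summary.get("node", {}).get("swap", {})
    report = {
        "node_usage": node_swap.get("swapUsageBytes", 0),
        "node_available": node_swap.get("swapAvailableBytes", 0),
        "pods": {},
        "containers": {},
    }
    for pod in summary.get("pods", []):
        # Assumption: each pod entry carries a podRef with namespace/name.
        ref = pod.get("podRef", {})
        pod_key = f"{ref.get('namespace', '?')}/{ref.get('name', '?')}"
        report["pods"][pod_key] = pod.get("swap", {}).get("swapUsageBytes", 0)
        for container in pod.get("containers", []):
            key = f"{pod_key}/{container.get('name', '?')}"
            usage = container.get("swap", {}).get("swapUsageBytes", 0)
            report["containers"][key] = usage
    return report


# Hypothetical sample document; the values are made up for illustration.
SAMPLE = json.loads("""
{
  "node": {"swap": {"swapUsageBytes": 1048576,
                    "swapAvailableBytes": 4294967296}},
  "pods": [
    {"podRef": {"namespace": "default", "name": "demo"},
     "swap": {"swapUsageBytes": 524288},
     "containers": [
       {"name": "app", "swap": {"swapUsageBytes": 524288}}
     ]}
  ]
}
""")

if __name__ == "__main__":
    print(json.dumps(swap_usage(SAMPLE), indent=2))
```

With a report like this, the rough expectations stated in the diff can be checked directly: a swap-enabled node should show a non-zero `node_available`, and a container that has hit its memory limit should show non-zero usage.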
