@@ -927,6 +926,10 @@ automations, so be extremely careful here.
 
 No. If the feature flag is enabled, the user must still set
 `--fail-swap-on=false` to adjust the default behaviour.
+In addition, since the default "swap behavior" is "NoSwap",
+by default containers would not be able to access swap. Instead,
+the administrator would need to set a non-default behavior in order
+for swap to be accessible.
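To make the configuration concrete, below is a minimal, hypothetical sketch of the relevant `KubeletConfiguration` fields (the file path, the surrounding fields, and the choice of `LimitedSwap` are illustrative assumptions, not part of this KEP's text):

```yaml
# Hypothetical kubelet configuration fragment for a node that opts containers in to swap.
# Merge into the node's kubelet config file and restart the kubelet.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  NodeSwap: true             # the feature gate alone does not grant containers swap access
failSwapOn: false            # still required so the kubelet starts on a swap-enabled node
memorySwap:
  swapBehavior: LimitedSwap  # the default, NoSwap, keeps swap inaccessible to containers
```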
 
 A node must have swap provisioned and available for this feature to work. If
 there is no swap available, but the feature flag is set to true, there will
@@ -959,7 +962,8 @@ for workloads.
 
 ###### What happens if we reenable the feature if it was previously rolled back?
 
-N/A
+As described above, swap can be turned on and off, although kubelet would need to be
+restarted.
 
 ###### Are there any tests for feature enablement/disablement?
 
@@ -970,8 +974,18 @@ with and without the feature, are necessary. At the very least, think about
 conversion tests if API types are being modified.
 -->
 
-N/A. This should be tested separately for scenarios with the flag enabled and
-disabled.
+There are extensive tests to ensure that the swap feature works as expected.
+
+Unit tests are in place to test that this feature operates as expected with
+cgroup v1/v2, the feature gate being on/off, and different swap behaviors defined.
+
+In addition, node e2e tests are added and run as part of the node-conformance
+suite. These tests ensure that the underlying cgroup knobs are being configured
+as expected.
+
+Furthermore, "swap-conformance" periodic lanes have been introduced for the purpose
+of testing swap in a stressed environment. These tests ensure that swap kicks in when
+expected, tested while stressing at both the node level and the container level.
 
 ### Rollout, Upgrade and Rollback Planning
 
@@ -1037,9 +1051,8 @@ This section must be completed when targeting beta to a release.
 
 ###### How can someone using this feature know that it is working for their instance?
 
-See #swap-metrics
-
-1. Kubelet stats API will be extended to show swap usage details.
+See #swap-metrics: swap usage is exposed by both the Summary API (/stats/summary) and Prometheus (/metrics/resource),
+which show how and whether swap is utilized at the node, pod and container level.
 
 ###### How can an operator determine if the feature is in use by workloads?
 
@@ -1049,6 +1062,9 @@ checking if there are objects with field X set) may be a last resort. Avoid
 logs or events for this purpose.
 -->
 
+See #swap-metrics: swap usage is exposed by both the Summary API (/stats/summary) and Prometheus (/metrics/resource),
+which show how and whether swap is utilized at the node, pod and container level.
+
 KubeletConfiguration has set `failOnSwap: false`.
 
 The prometheus `node_exporter` will also export stats on swap memory
@@ -1060,19 +1076,22 @@ utilization.
 Pick one more of these and delete the rest.
 -->
 
-TBD. We will determine a set of metrics as a requirement for beta graduation.
-We will need more production data; there is not a single metric or set of
-metrics that can be used to generally quantify node performance.
-
-This section to be updated before the feature can be marked as graduated, and
-to be worked on during 1.23 development.
-
-We will also add swap memory utilization to the Kubelet stats API, to provide a means of monitoring this beyond cadvisor Prometheus stats.
-
-- [ ] Metrics
-  - Metric name:
-  - [Optional] Aggregation method:
-  - Components exposing the metric:
+See #swap-metrics: swap usage is exposed by both the Summary API (/stats/summary) and Prometheus (/metrics/resource),
+which show how and whether swap is utilized at the node, pod and container level.
+
+- [X] Metrics
+  - Metric names:
+    - `container_swap_usage_bytes`
+    - `pod_swap_usage_bytes`
+    - `node_swap_usage_bytes`
+  - Components exposing the metric: `/metrics/resource` endpoint
+  - Metric names:
+    - `node.swap.swapUsageBytes`
+    - `node.swap.swapAvailableBytes`
+    - `node.systemContainers.swap.swapUsageBytes`
+    - `pods[i].swap.swapUsageBytes`
+    - `pods[i].containers[i].swap.swapUsageBytes`
+  - Components exposing the metric: `/stats/summary` endpoint
 - [ ] Other (treat as last resort)
   - Details:
 
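For illustration, a rough sketch of how these fields surface in a `/stats/summary` response is shown below (rendered as YAML for brevity; the endpoint returns JSON, and all names and values here are made-up examples):

```yaml
# Illustrative sketch only; pod/container names and byte values are invented.
node:
  swap:
    swapAvailableBytes: 4294967296   # non-zero on nodes with swap provisioned
    swapUsageBytes: 0                # grows once the kernel starts swapping
  systemContainers:
    - name: kubelet
      swap:
        swapUsageBytes: 0
pods:
  - podRef:
      name: example-pod
      namespace: default
    swap:
      swapUsageBytes: 1048576
    containers:
      - name: example-container
        swap:
          swapUsageBytes: 1048576
```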
@@ -1088,7 +1107,14 @@ high level (needs more precise definitions) those may be things like:
 - 99,9% of /health requests per day finish with 200 code
 -->
 
-N/A
+Swap is managed by the kernel and depends on many factors and configurations
+that are outside of kubelet's reach, such as the nature of the workloads running on the node,
+swap capacity, memory capacity and other distro-specific configurations. However, generally:
+
+- Nodes with swap enabled -> `node.swap.swapAvailableBytes` should be non-zero.
+- Nodes with memory pressure -> `node.swap.swapUsageBytes` should be non-zero.
+- Containers that reach their memory limit threshold -> `pods[i].containers[i].swap.swapUsageBytes` should be non-zero.
+- Pods with containers that reach their memory limit threshold -> `pods[i].swap.swapUsageBytes` should be non-zero.
 
 ###### Are there any missing metrics that would be useful to have to improve observability of this feature?
 
@@ -1203,9 +1229,11 @@ Think about adding additional work or introducing new steps in between
 -->
 
 Yes, enabling swap can affect performance of other critical daemons on the system.
-Any scenario where swap memory gets utilized is a result of system running out of physical RAM.
+Any scenario where swap memory gets utilized is a result of the system running out of physical RAM,
+or of a container reaching its memory limit threshold.
 Hence, to maintain the SLIs/SLOs of critical daemons on the node we highly recommend to disable the swap for the system.slice
-along with reserving adequate enough system reserved memory.
+along with reserving adequate system-reserved memory, giving I/O latency precedence to the system.slice, and more.
+See #best-practices for more info.
 
 The SLI that could potentially be impacted is [pod startup latency](https://github.com/kubernetes/community/blob/master/sig-scalability/slos/pod_startup_latency.md).
 If the container runtime or kubelet are performing slower than expected, pod startup latency would be impacted.
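As a rough illustration of the memory-reservation part of that recommendation, a kubelet configuration fragment could look like the sketch below (the reservation sizes are placeholder assumptions; disabling swap for the system.slice itself is a systemd/cgroup-level setting, e.g. cgroup v2's `memory.swap.max`, and is outside the kubelet's configuration):

```yaml
# Hypothetical kubelet configuration fragment; tune the reservations to the node's
# actual system daemons so they rarely need to fall back to swap.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
systemReserved:
  memory: 2Gi    # RAM held back from pods for system daemons
  cpu: 500m
enforceNodeAllocatable:
  - pods         # enforce the reservation against pod usage (the default)
```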
@@ -1283,6 +1311,8 @@ nodes that do not use swap memory.
 - **2023-04-17:** KEP update for beta1 [#3957](https://github.com/kubernetes/enhancements/pull/3957).
 - **2023-08-15:** Beta1 released in kubernetes 1.28
 - **2024-01-12:** Updates to Beta2 KEP.
+- **2024-01-08:** Beta2 released in kubernetes 1.30.
+- **2024-06-18:** Updates to KEP, GA requirements and intention to release in version 1.32.
 
 ## Drawbacks
 
@@ -1294,41 +1324,24 @@ When swap is enabled, particularly for workloads, the kubelet’s resource
 accounting may become much less accurate. This may make cluster administration
 more difficult and less predictable.
 
-Currently, there exists an unsupported workaround, which is setting the kubelet
-flag `--fail-swap-on` to false.
+In general, swap is less predictable and might cause performance degradation.
+It also might be hard in certain scenarios to understand why certain workloads
+are the chosen candidates for swapping, which could occur for reasons external
+to the workload.
+
+In addition, containers with memory limits would be killed less frequently,
+since with swap enabled the kernel can usually reclaim a lot more memory.
+While this can help to avoid crashes, it could also "hide a problem" of a container
+reaching its memory limits.
 
 ## Alternatives
 
 ### Just set `--fail-swap-on=false`
 
-This is insufficient for most use cases because there is inconsistent control
-over how swap will be used by various container runtimes. Dockershim currently
-sets swap available for workloads to 0. The CRI does not restrict it at all.
-This inconsistency makes it difficult or impossible to use swap in production,
-particularly if a user wants to restrict workloads from using swap when using
-the CRI rather than dockershim.
-
-This is also a breaking change.
-Users have used --fail-swap-on=false to allow for kubernetes to run
-on a swap enabled node.
-
-### Restrict swap usage at the cgroup level
-
-Setting a swap limit at the cgroup level would allow us to restrict the usage
-of swap on a pod-level, rather than container-level basis.
-
-For alpha, we are opting for the container-level basis to simplify the
-implementation (as the container runtimes already support configuration of swap
-with the `memory-swap-limit` parameter). This will also provide the necessary
-plumbing for container-level accounting of swap, if that is proposed in the
-future.
-
-In beta, we may want to revisit this.
-
-See the [Pod Resource Management design proposal] for more background on the
-cgroup limits the kubelet currently sets based on each QoS class.
+When `--fail-swap-on=false` is provided to Kubelet but swap is not configured
+otherwise, it is guaranteed that, by default, no Kubernetes workloads would
+be able to utilize swap. However, everything outside of kubelet's reach
+(e.g. system daemons, kubelet, etc.) would be able to use swap.
 
 ## Infrastructure Needed (Optional)
 
@@ -1338,4 +1351,7 @@ new subproject, repos requested, or GitHub details. Listing these here allows a
 SIG to get the process for these resources started right away.
 -->
 
-We may need Linux VM images built with swap partitions for e2e testing in CI.
+Added the "swap-conformance" lane for extensive swap testing under node pressure: [kubelet-swap-conformance-fedora-serial](https://testgrid.k8s.io/sig-node-kubelet#kubelet-swap-conformance-fedora-serial),