@@ -98,10 +98,6 @@ tags, and then generate with `hack/update-toc.sh`.
- [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
- [Version Skew Strategy](#version-skew-strategy)
- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
- - [Feature enablement and rollback](#feature-enablement-and-rollback)
- - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning)
- - [Monitoring requirements](#monitoring-requirements)
- - [Dependencies](#dependencies)
- [Scalability](#scalability)
- [Troubleshooting](#troubleshooting)
- [Implementation History](#implementation-history)
@@ -312,11 +308,18 @@ to leave the PVCs as is during the StatefulSet deletion.

If `VolumeReclaimPolicy` is set to `RemoveOnScaledown`, the Pod is set as the owner of the PVCs created
from the `VolumeClaimTemplates`. When a Pod is deleted, the PVC owned by the Pod is
- also deleted. During scale-up, if a PVC has an OwnerRef that does not match the Pod, it
+ also deleted.
+
+ During scale-up, if a PVC has an OwnerRef that does not match the Pod, it
potentially indicates that the PVC is referred to by the deleted Pod and is in the process of
getting deleted. The controller will exit the current reconcile loop and attempt to reconcile in the
next iteration. This avoids a race with PVC deletion.

+ The current StatefulSet controller implementation ensures that manually deleted Pods are restored
+ before the scale-down logic runs. The Pod owner reference is added to the PVC only just before
+ the controller scales down, so manual Pod deletions do not automatically delete the PVCs in
+ question.
+
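+ For illustration, under `RemoveOnScaledown` a PVC created from a `VolumeClaimTemplate` could carry
+ an owner reference to its Pod along the lines of the following sketch. This is an assumption-laden
+ example: the claim and Pod names, the UID, and the exact shape of the owner reference are
+ placeholders, not the final API or controller behavior.
+
+ ```yaml
+ # Sketch only: a PVC owned by its Pod under RemoveOnScaledown, so garbage
+ # collection removes the claim once the Pod is deleted during scale-down.
+ apiVersion: v1
+ kind: PersistentVolumeClaim
+ metadata:
+   name: www-web-1                # hypothetical claim for pod web-1
+   ownerReferences:
+   - apiVersion: v1
+     kind: Pod
+     name: web-1                  # hypothetical owning Pod
+     uid: <pod-uid>               # placeholder; must match the live Pod's UID
+ spec:
+   accessModes: ["ReadWriteOnce"]
+   resources:
+     requests:
+       storage: 1Gi
+ ```
+
+ Under `RemoveOnStatefulSetDeletion`, described below, the owner reference would instead point at
+ the StatefulSet rather than at an individual Pod.
+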
When `VolumeReclaimPolicy` is set to `RemoveOnStatefulSetDeletion`, the owner reference in the
PVC points to the StatefulSet. When a scale up or down occurs, the PVC remains unchanged.
PVCs previously in use before a scale down will be used again when the scale up occurs. The PVC deletion
@@ -334,6 +337,28 @@ In order to update the PVC owner reference, the `buildControllerRoles` will be updated

### Test Plan

+ 1. Unit tests
+
+ 1. e2e tests (the two-pod fixture these cases assume is sketched after this list)
+    - RemoveOnScaledown
+      1. Create 2 pod stateful set, scale to 1 pod, confirm PV deleted
+      1. Create 2 pod stateful set, add data to PVs, scale to 1 pod, scale back to 2, confirm PV empty
+      1. Create 2 pod stateful set, delete stateful set, confirm PVs deleted
+      1. Create 2 pod stateful set, add data to PVs, manually delete one pod, confirm pod comes back and PV has data (PV not deleted)
+      1. As above, but manually delete all pods in stateful set
+      1. Create 2 pod stateful set, add data to PVs, manually delete one pod, immediately scale down to one pod, confirm PV is deleted
+      1. Create 2 pod stateful set, add data to PVs, manually delete one pod, immediately scale down to one pod, scale back to two pods, confirm PV is empty
+    - RemoveOnStatefulSetDeletion
+      1. Create 2 pod stateful set, scale to 1 pod, confirm PV still exists
+      1. Create 2 pod stateful set, add data to PVs, scale to 1 pod, scale back to 2, confirm PV has data (PV not deleted)
+      1. Create 2 pod stateful set, delete stateful set, confirm PVs deleted
+      1. Create 2 pod stateful set, add data to PVs, manually delete one pod, confirm pod comes back and PV has data (PV not deleted)
+      1. As above, but manually delete all pods in stateful set
+      1. Create 2 pod stateful set, add data to PVs, manually delete one pod, immediately scale down to one pod, confirm PV exists
+      1. Create 2 pod stateful set, add data to PVs, manually delete one pod, immediately scale down to one pod, scale back to two pods, confirm PV has data
+    - Retain
+      1. Same tests as above, but PVs not removed in any case
+
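+ The e2e cases above assume a small two-replica fixture. A minimal sketch is shown below; the
+ `volumeReclaimPolicy` field name and placement follow this KEP draft and do not exist in any
+ released API, and the object names and image are placeholders.
+
+ ```yaml
+ # Sketch of the hypothetical 2 pod StatefulSet fixture used by the e2e cases.
+ apiVersion: apps/v1
+ kind: StatefulSet
+ metadata:
+   name: web                                  # placeholder name
+ spec:
+   replicas: 2
+   serviceName: web                           # assumes a matching headless Service
+   selector:
+     matchLabels:
+       app: web
+   volumeReclaimPolicy: RemoveOnScaledown     # proposed field; swap per test group
+   template:
+     metadata:
+       labels:
+         app: web
+     spec:
+       containers:
+       - name: web
+         image: registry.k8s.io/nginx-slim:0.8   # placeholder image
+         volumeMounts:
+         - name: www
+           mountPath: /usr/share/nginx/html
+   volumeClaimTemplates:
+   - metadata:
+       name: www
+     spec:
+       accessModes: ["ReadWriteOnce"]
+       resources:
+         requests:
+           storage: 1Gi
+ ```
+
+ Scaling this fixture from two replicas to one and checking whether `www-web-1` (and its PV)
+ survives exercises the first scale-down case in each policy group.
+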
<!--
**Note:** *Not required until targeted at a release.*
@@ -463,186 +488,6 @@ you need any help or guidance.

-->

- ### Feature enablement and rollback
-
- _This section must be completed when targeting alpha to a release._
-
- * **How can this feature be enabled / disabled in a live cluster?**
-   - [ ] Feature gate (also fill in values in `kep.yaml`)
-     - Feature gate name:
-     - Components depending on the feature gate:
-   - [ ] Other
-     - Describe the mechanism:
-     - Will enabling / disabling the feature require downtime of the control
-       plane?
-     - Will enabling / disabling the feature require downtime or reprovisioning
-       of a node? (Do not assume `Dynamic Kubelet Config` feature is enabled).
-
- * **Does enabling the feature change any default behavior?**
-   Any change of default behavior may be surprising to users or break existing
-   automations, so be extremely careful here.
-
- * **Can the feature be disabled once it has been enabled (i.e. can we roll back
-   the enablement)?**
-   Also set `disable-supported` to `true` or `false` in `kep.yaml`.
-   Describe the consequences on existing workloads (e.g. if this is a runtime
-   feature, can it break the existing applications?).
-
- * **What happens if we reenable the feature if it was previously rolled back?**
-
- * **Are there any tests for feature enablement/disablement?**
-   The e2e framework does not currently support enabling and disabling feature
-   gates. However, unit tests in each component dealing with managing data created
-   with and without the feature are necessary. At the very least, think about
-   conversion tests if API types are being modified.
-
- ### Rollout, Upgrade and Rollback Planning
-
- _This section must be completed when targeting beta graduation to a release._
-
- * **How can a rollout fail? Can it impact already running workloads?**
-   Try to be as paranoid as possible - e.g. what if some components will restart
-   in the middle of rollout?
-
- * **What specific metrics should inform a rollback?**
-
- * **Were upgrade and rollback tested? Was upgrade->downgrade->upgrade path tested?**
-   Describe manual testing that was done and the outcomes.
-   Longer term, we may want to require automated upgrade/rollback tests, but we
-   are missing a bunch of machinery and tooling and cannot do that now.
-
- * **Is the rollout accompanied by any deprecations and/or removals of features,
-   APIs, fields of API types, flags, etc.?**
-   Even if applying deprecation policies, they may still surprise some users.
-
- ### Monitoring requirements
-
- _This section must be completed when targeting beta graduation to a release._
-
- * **How can an operator determine if the feature is in use by workloads?**
-   Ideally, this should be a metric. Operations against the Kubernetes API (e.g.
-   checking if there are objects with field X set) may be a last resort. Avoid
-   logs or events for this purpose.
-
- * **What are the SLIs (Service Level Indicators) an operator can use to
-   determine the health of the service?**
-   - [ ] Metrics
-     - Metric name:
-     - [Optional] Aggregation method:
-     - Components exposing the metric:
-   - [ ] Other (treat as last resort)
-     - Details:
-
- * **What are the reasonable SLOs (Service Level Objectives) for the above SLIs?**
-   At a high level this will usually be in the form of "high percentile of SLI
-   per day <= X". It's impossible to provide comprehensive guidance, but at a very
-   high level (these need more precise definitions) those may be things like:
-   - per-day percentage of API calls finishing with 5XX errors <= 1%
-   - 99th percentile over a day of the absolute value of (job creation time minus expected
-     job creation time) for cron job <= 10%
-   - 99.9% of /health requests per day finish with a 200 code
-
- * **Are there any missing metrics that would be useful to have to improve
-   observability of this feature?**
-   Describe the metrics themselves and the reason they weren't added (e.g. cost,
-   implementation difficulties, etc.).
-
- ### Dependencies
-
- _This section must be completed when targeting beta graduation to a release._
-
- * **Does this feature depend on any specific services running in the cluster?**
-   Think about both cluster-level services (e.g. metrics-server) as well
-   as node-level agents (e.g. specific version of CRI). Focus on external or
-   optional services that are needed. For example, if this feature depends on
-   a cloud provider API, or upon an external software-defined storage or network
-   control plane.
-
-   For each of these, fill in the following, thinking both about running user workloads
-   and creating new ones, as well as about cluster-level services (e.g. DNS):
-   - [Dependency name]
-     - Usage description:
-     - Impact of its outage on the feature:
-     - Impact of its degraded performance or high error rates on the feature:
-
-
- ### Scalability
-
- _For alpha, this section is encouraged: reviewers should consider these questions
- and attempt to answer them._
-
- _For beta, this section is required: reviewers must answer these questions._
-
- _For GA, this section is required: approvers should be able to confirm the
- previous answers based on experience in the field._
-
- * **Will enabling / using this feature result in any new API calls?**
-   Describe them, providing:
-   - API call type (e.g. PATCH pods)
-   - estimated throughput
-   - originating component(s) (e.g. Kubelet, Feature-X-controller)
-   focusing mostly on:
-   - components listing and/or watching resources they didn't before
-   - API calls that may be triggered by changes of some Kubernetes resources
-     (e.g. update of object X triggers new updates of object Y)
-   - periodic API calls to reconcile state (e.g. periodic fetching state,
-     heartbeats, leader election, etc.)
-
- * **Will enabling / using this feature result in introducing new API types?**
-   Describe them, providing:
-   - API type
-   - Supported number of objects per cluster
-   - Supported number of objects per namespace (for namespace-scoped objects)
-
- * **Will enabling / using this feature result in any new calls to the cloud
-   provider?**
-
- * **Will enabling / using this feature result in increasing size or count
-   of the existing API objects?**
-   Describe them, providing:
-   - API type(s):
-   - Estimated increase in size: (e.g. new annotation of size 32B)
-   - Estimated amount of new objects: (e.g. new Object X for every existing Pod)
-
- * **Will enabling / using this feature result in increasing time taken by any
-   operations covered by [existing SLIs/SLOs][]?**
-   Think about adding additional work or introducing new steps in between
-   (e.g. need to do X to start a container), etc. Please describe the details.
-
- * **Will enabling / using this feature result in a non-negligible increase of
-   resource usage (CPU, RAM, disk, IO, ...) in any components?**
-   Things to keep in mind include: additional in-memory state, additional
-   non-trivial computations, excessive access to disks (including increased log
-   volume), significant amount of data sent and/or received over the network, etc.
-   Think through this both in small and large cases, again with respect to the
-   [supported limits][].
-
- ### Troubleshooting
-
- The Troubleshooting section serves the `Playbook` role as of now. We may consider
- splitting it into a dedicated `Playbook` document (potentially with some monitoring
- details). For now we leave it here though.
-
- _This section must be completed when targeting beta graduation to a release._
-
- * **How does this feature react if the API server and/or etcd is unavailable?**
-
- * **What are other known failure modes?**
-   For each of them fill in the following information by copying the below template:
-   - [Failure mode brief description]
-     - Detection: How can it be detected via metrics? Stated another way:
-       how can an operator troubleshoot without logging into a master or worker node?
-     - Mitigations: What can be done to stop the bleeding, especially for already
-       running user workloads?
-     - Diagnostics: What are the useful log messages and their required logging
-       levels that could help debugging the issue?
-       Not required until feature graduated to Beta.
-     - Testing: Are there any tests for failure mode? If not, describe why.
-
- * **What steps should be taken if SLOs are not being met to determine the problem?**
-
- [supported limits]: https://git.k8s.io/community/sig-scalability/configs-and-limits/thresholds.md
- [existing SLIs/SLOs]: https://git.k8s.io/community/sig-scalability/slos/slos.md#kubernetes-slisslos

## Implementation History
@@ -658,23 +503,15 @@ Major milestones might include
-->

## Drawbacks
-
+ An update to the StatefulSet API (the new `VolumeReclaimPolicy` field) is required.
<!--
Why should this KEP _not_ be implemented?
-->

## Alternatives
-
+ Users can continue to delete the PVCs manually; the burden of doing so is the motivation for this KEP.
<!--
What other approaches did you consider and why did you rule them out? These do
not need to be as detailed as the proposal, but should include enough
information to express the idea and why it was not acceptable.
-->
-
- ## Infrastructure Needed (optional)
-
- <!--
- Use this section if you need things from the project/SIG. Examples include a
- new subproject, repos requested, github details. Listing these here allows a
- SIG to get the process for these resources started right away.
- -->