Also, running conformance tests separately - will hold off on merging until those pass too
lgtm
Tested PXE-booted bare-metal setups with both. This didn't cover controller reboot survival (i.e. checkpointing), which seems to be where tests are failing.
Reboots were ok for me, maybe a testing artifact
Conformance has 2 failures I need to check out:
@dghubble I think we might have to bump the vendoring after all (or update the tests). Not super clear about the interaction of using "extensions/v1beta1" in the tests vs. creating the objects with the v1.8 "apps/v1beta2".
Any idea if my suggestion will unblock the e2e tests? Might be worth a shot. I hadn't looked at conformance tests yet.
I believe the intention there is to remove the nodeSelector altogether (so the daemonset ends up on all nodes) - we're not using node affinity in this case. I have a feeling the actual failure is due to some versioning interplay (e.g. using
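For context, removing the nodeSelector so the DaemonSet schedules onto every node looks roughly like this (an illustrative fragment only; the name, labels, and image are hypothetical, not the manifest from this PR):

```yaml
# Illustrative apps/v1beta2 DaemonSet: with spec.template.spec.nodeSelector
# omitted, pods are scheduled onto all nodes (subject to taints/tolerations).
apiVersion: apps/v1beta2
kind: DaemonSet
metadata:
  name: example-agent
spec:
  selector:
    matchLabels:
      app: example-agent
  template:
    metadata:
      labels:
        app: example-agent
    spec:
      # nodeSelector intentionally omitted so the DaemonSet runs on all nodes
      containers:
      - name: agent
        image: example/agent:latest
```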
coreosbot run e2e
@dghubble do you know how to re-trigger the e2e-calico test?
coreosbot run e2e calico
Happy to change it; it's just the first phrase I defined.
@dghubble that's fine. Mind adding it to https://github.com/kubernetes-incubator/bootkube/blob/master/Documentation/development.md#running-pr-tests
@aaronlevy I see e2e and self-hosted etcd e2e are using "coreosbot run e2e", so I've changed calico-e2e to use the same. Docs are already correct.
My changes wouldn't be a fix for those failures. I believe I've seen those failures before; seems like a timing failure.
I don't think it's related to the checkpointer itself, but rather behavior that touches the checkpointer tests. It seems like something related to removing pods has changed (I've similarly gotten the same conformance test failures twice now, so these aren't outright flakes).
Looking at the tests, we're rebooting masters, starting workers, and verifying "pod-checkpointer" is present on a worker. I'm guessing this is done to test the idea that once an apiserver can be contacted, the checkpointer doesn't need to be running on that worker per the docs, but I'm not 100% sure. cc @yifan-gu
Added a sleep and the tests pass. I have a feeling removing a pod might have gotten slower? One option might just be to increase the retry window on
Increasing the retries allowed does seem nicer than a fixed sleep.
```diff
@@ -161,7 +169,7 @@ spec:
         - /var/lock/api-server.lock
         - /hyperkube
         - apiserver
-        - --admission-control=NamespaceLifecycle,LimitRanger,ServiceAccount,DefaultStorageClass,ResourceQuota
+        - --admission-control=NamespaceLifecycle,LimitRanger,ServiceAccount,PersistentVolumeLabel,DefaultStorageClass,ResourceQuota,DefaultTolerationSeconds
```
Quick question: do we have to carry these admission-control setting changes over to the Tectonic installer as well?
I cc'd you on an internal issue and will send a follow-up email.
Looks like that cluster apiserver failed to come back and subsequent tests failed. Hm, known or new?
The self-hosted etcd is actually still occasionally flaky - and not really maintained. I'll re-run for now (last test with the sleep passed, and this is functionally the same change).
On bare-metal with
On CI, TestReboot failed once and TestCheckpointerUnscheduleCheckpointer failed once, on separate runs. So it's a flakiness issue rather than the tests being broken outright. I'll point out that the reboot sometimes seems to fail at the testing code's call to node.Reboot, not from cluster behavior itself.
coreosbot run e2e
Tested on a cloud provider as well
e2e-selfetcd failed on Terraform creation of AWS resources.
coreosbot run e2e
Any thoughts on how we might make this less flaky?
At a higher level and longer term, I'd like to strip out functionality we don't intend to go forward with, such as the self-hosted kubelet (#735, but not an issue here) and self-hosted etcd. Most of these issues boil down to crash loops being a normal part of self-hosted control planes, our control plane growing more complex, and there being no precise definition of when setup is complete and e2e/conformance testing should begin. It'd also be valuable to have a better way to show how long a step normally takes; we mostly go by instinct, and maybe these values need re-tuning.
coreosbot run e2e
I'm really hesitant to merge this, even if tests finally pass, without addressing some of the underlying issues. Otherwise every subsequent PR will have to run tests 5 times to pass (which isn't the case prior to this PR). That might mean we need to increase timeouts.
Agreed. At this point it's mainly useful to see where we're hitting failures. It'd be different if it were the same error every time.
Sorry for hijacking this issue, but why do you want to drop self-hosted etcd? Isn't it working? Too unstable? I can see it was just removed from the Tectonic UI.
This isn't the place for that discussion, it would hijack this. I'll open an issue later to communicate the plan more broadly. EDIT: See #738
Summarizing here, in the 5 builds on v1.8.1 (rather than v1.8.0)
All conformance tests passed once bumped to v1.8.1
* Latest license-bill-of-materials reformats list
* Increase timeout allowed for the TestSmoke nginx workload
Added some fixes which should address the TestReboot and TestSmoke flakes. I don't have anything for the Terraform or occasional self-hosted etcd TestDeleteAPI failures; I haven't been able to reproduce them on my clusters running v1.8.1 with self-hosted etcd. I'll note the flakes are in e2e-calico and e2e-selfetcd, which are experimental, and we haven't seen any in e2e (standard options). Not that that justifies them.
Ok, so we've seen flaky failures in TestDeleteAPI ("non-checkpoint apiserver did not come back") on both e2e-calico and e2e-selfetcd.
Looks like all passed. Good to merge @dghubble?
I'll keep trying to repro the flakes in experimental features. Yes, good to merge.
I'll wait on bumping vendor libs to v1.8 until client-go has a more unified v1.8 targeted release.
cc @rphillips @dghubble