Containers don't restart after CAPD install #832
Thanks for the feedback! There was some outside discussion on this, and it looks like Cluster API does provide the "restart" annotations, but we aren't using them yet. Will leave this open to keep track of the issue.
Is this the same as #770?
Did some digging on this (and thanks to @stmcginnis for all the context!) and there is a path forward. Docker containers can be restarted automatically, and kind has supported this for the last year. The problem is with the Cluster API Docker provider (CAPD) here. It needs to be added to the createNode method, but kubernetes-sigs/cluster-api#4413 will enable us to add additional properties to the configuration using the Docker SDK (instead of calling out to the docker CLI).

So, TL;DR, this is a problem with CAPD and will be configurable soon. We can use this issue as a high-level tracker for getting this to eventually work in TCE.
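For anyone who wants the behavior today, a restart policy can be applied to the existing node containers after the fact. This is only a stopgap sketch, not what CAPD does; the `tce-mgmt` name filter is an assumed example, and (as the later comments show) bringing the containers back up does not by itself solve the IP/certificate problems.

```sh
# Stopgap sketch: tell the Docker engine to bring existing node containers
# back up after a reboot. The "tce-mgmt" name prefix is an assumed example.
for c in $(docker ps -a --filter "name=tce-mgmt" --format '{{.Names}}'); do
  docker update --restart=unless-stopped "$c"
done

# What CAPD would eventually do at create time is the equivalent of
# "docker run --restart=unless-stopped ...", set via the Docker SDK's
# HostConfig instead of the CLI.
```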
This should be fixed with kubernetes-sigs/cluster-api#5021 merged into CAPD. We still need to update our dependency to CAPI v0.4.1 or later, though. Since there are some major changes between our current v0.3.x dependency and v0.4.x, this may take some time.
Depends on #1431
Note that even after manually starting the clusters (…)
Certificate errors are likely to be related to IP changes, as per kubernetes-sigs/kind#1689.
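A quick way to confirm that theory on a rebooted cluster is to compare the address the control-plane container came back with against the SANs in the kubeadm-generated API server certificate. The container name below is an assumed example, and this assumes openssl is available on the host.

```sh
# The IP the control-plane container has now (name is an assumed example):
docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' tce-mgmt-control-plane

# The SANs baked into the API server certificate at build time:
docker cp tce-mgmt-control-plane:/etc/kubernetes/pki/apiserver.crt /tmp/apiserver.crt
openssl x509 -in /tmp/apiserver.crt -noout -text | grep -A1 'Subject Alternative Name'
```

If the two no longer match, kubectl fails with x509 errors even though the containers are running.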
You may also see kubernetes-sigs/cluster-api#4874 (comment).
I suspect this is in the scope of this issue, but ideally, we should be able to stop and start Docker-based clusters as well. I'm personally not interested in having clusters start automatically when I reboot my laptop, but I'd like to be able to start them manually. Let me know if this needs to be tracked in a separate issue and I will create one.
I think it's the same underlying problem, but you might want to open an issue for enabling the use case via the CLI/UX. After looking into this some more, I think we discovered the root of the problem in etcd-io/etcd#13340. There's still some debate about the path forward, but we're dependent on some k8s & etcd changes upstream.
Just to note, I restarted the Docker containers (TCE management cluster) in the order that leaves the IP addresses for each container the same as before the Docker shutdown. I had to create a dummy container to take the .2 address that kind used during the build. This solves the certificate issue, but the cluster still does not respond to tanzu or kubectl commands.
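A minimal sketch of that restart sequence, assuming the default kind network on 172.18.0.0/16 and example container names (yours will differ):

```sh
# 1. Occupy the .2 address the bootstrap cluster consumed during the build,
#    so the real nodes don't shift down by one address.
docker run -d --name ip-placeholder --network kind --ip 172.18.0.2 registry.k8s.io/pause:3.9

# 2. Start the node containers in their original creation order so each
#    reclaims its previous address (names are assumed examples).
docker start tce-mgmt-lb
docker start tce-mgmt-control-plane
docker start tce-mgmt-md-0-xxxxx

# 3. Verify the layout matches the pre-shutdown one.
docker network inspect kind -f '{{range .Containers}}{{.Name}} {{.IPv4Address}}{{println}}{{end}}'
```

As noted above, this only preserves the addresses; the cluster can still stay unresponsive because of the etcd behavior referenced later in the thread.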
Yeah, I'm pretty certain this is the etcd issue. Given the work being done on local clusters, I'm not sure how important this is anymore from a TCE perspective.
Leaving this open regardless of the standalone cluster overhaul: we should investigate how this works for the new standalone cluster model, adjust our kind provider as needed, and gather community feedback.

For reference, here is the new standalone-cluster proposal that uses a different model with a much lighter-weight methodology. Please look at the proposal, try it out, and give your feedback there!
I take it this is still an issue? I spent the last hour trying to figure out why my newly-created TCE cluster seems to be completely broken after I restarted my laptop. I get the reasons why this happens (sort of), but I thought one of the primary motivations behind TCE was to provide an experience akin to kind or minikube for local environments?
Well, yes and no. With our standalone clusters this is still a problem. @grtrout, were you using the standalone cluster flow? There are issues with the way those clusters come back up after a Docker restart. If you use the newer unmanaged-cluster plugin instead, this works much better, so that is the command I'd recommend looking at.
Hey @stmcginnis, thanks for the quick reply. I've used Tanzu/TKG at a previous job, and since I want to continue keeping up with the Tanzu ecosystem at my new job, I thought using TCE in my local environment might be smart. I'm also thinking about using it in a homelab, but that's a different thing... In any case, yes, I just followed along with the "getting started" docs and created a standalone cluster. I had to hack a few things to get it working (e.g., toggling the deprecatedCgroupv1 value to true, restarting the process 3 or 4 times, etc.), but ultimately it was up and running and functional...until I restarted my laptop. It looks like I should check out unmanaged-cluster, but until now I was not aware of that. I would rather use Calico over Antrea anyway, so I'm good with that. I'll try this out later today. Thanks again!
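For reference, the unmanaged-cluster flow mentioned above is driven by the `unmanaged-cluster` plugin. The exact flags vary by TCE release, so treat this as a sketch:

```sh
# Create, list, and delete a local unmanaged cluster (flags vary by release).
tanzu unmanaged-cluster create my-cluster
tanzu unmanaged-cluster list
tanzu unmanaged-cluster delete my-cluster
```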
Looks like the kapp-controller is crashing after the reboot.
@seemiller, any tips on troubleshooting kapp-controller? Or someone we can pull in from the Carvel project to take a look?
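A few generic commands for gathering that information; the tkg-system namespace and the app=kapp-controller label are assumptions, so adjust them to wherever your cluster runs kapp-controller:

```sh
# Find where kapp-controller is running (namespace/labels may differ).
kubectl get deploy -A | grep kapp-controller

# Inspect the pod, its events, and the logs from the crashed container.
kubectl -n tkg-system get pods -l app=kapp-controller
kubectl -n tkg-system describe pod -l app=kapp-controller
kubectl -n tkg-system logs deploy/kapp-controller --previous
```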
@butch7903, when you say it's crashing, can you help us understand:
Looks like the events are cut off, as I don't see the Error in here.
Also, if you want to join us in Slack, we can probably help you troubleshoot there. Thanks!
Looks like I should have tested a bit more. I blew that tanzu unmanaged-cluster away, simply rebuilt it, waited for all of it to completely come up, and then rebooted. The kapp-controller is no longer restarting, so all I can think is that I must have originally rebooted before it had time to finish setting up, leaving it in a bad state that persisted across later reboots.
Since this is referring to standalone rather than unmanaged, perhaps we should close this one.
Yes. Reference #3564 for further details.
@jpmcb This issue is still relevant, as it affects managed clusters with CAPD: https://kubernetes.slack.com/archives/C02GY94A8KT/p1648732902925359
Agreed. Reopening. To be entirely transparent, I don’t foresee CAPD-based restart support being in our near-term future.
Then I would make it more prominent at the top of the doc page and not at the very bottom: https://tanzucommunityedition.io/docs/v0.11/docker-install-mgmt/
It appears that the issue with a management cluster reboot is that when Docker comes back up, the containers do not start in the same order or do not get the same IP addresses.

My build process for a management cluster in Docker today:

```sh
tanzu management-cluster create -f tce-mgmt.yaml --cni=calico
tanzu management-cluster kubeconfig get tce-mgmt --admin
docker network inspect kind
kubectl config use-context tce-mgmt-admin@tce-mgmt
kubectl get nodes
kubectl get po -A
```

Is there a way we could force the tanzu management cluster Docker containers to retain their IPs? If so, this might be an easy fix.
If I remember right, this was the case when I looked closer, and it matches what I've heard from other upstream projects.
Looks like they already have that...
Just need to have VMware TCE update their tanzu management cluster creation to include the specific IPs.
Yeah, the capability exists to set an address. The missing piece is an IPAM to manage what to set.
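Concretely, the capability being referenced is Docker's static IP assignment on a user-defined network. A minimal, self-contained demonstration (network name, subnet, and image are arbitrary examples, unrelated to what CAPD creates):

```sh
# Create a user-defined network with an explicit subnet, then pin a container
# to a fixed address on it.
docker network create --subnet 172.30.0.0/16 demonet
docker run -d --name pinned-node --network demonet --ip 172.30.0.10 nginx:alpine

# The address survives a stop/start cycle because it is part of the
# container's network configuration.
docker stop pinned-node && docker start pinned-node
docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' pinned-node
```

The open question from the comment above is who decides which address each CAPD node gets, which is the IPAM piece.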
I don't think IPAM is needed; you just need to set the IPs and network during the build process somehow. Where is the tanzu management yaml file stored after cluster creation?
My temporary workaround for this, until I can find a better method, is a restart process that causes the load balancer to always grab the .3 IP out of the IP pool instead of something else. The better thing long term will be to set the IP somehow for each container so that they come back up with the right IP on reboot. I will look into this further.
Hi @RussellHamker, is there another way to run tanzu management and workload clusters within a Linux server? I only found the Docker option in the TCE version... Thanks in advance.
Bug Report
After a local standalone install, if the computer is shut down or restarted, the containers for the kube cluster don't come back up.
Expected Behavior
By default, the TCE containers in the standalone kube cluster should start every time the Docker engine starts.
There should be a command or flag to turn off that default behavior.
Steps to Reproduce the Bug
Install TCE standalone, turn off the computer, start it again, and wait for Docker to boot. There will be no cluster.
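A short sketch of the post-reboot check (the output comments reflect the behavior described above):

```sh
# After the reboot, the node containers exist but are stopped:
docker ps -a --format 'table {{.Names}}\t{{.Status}}'

# ...and the cluster no longer answers:
kubectl get nodes    # connection refused / i/o timeout
```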
Environment Details
tanzu version