This repository was archived by the owner on Jun 28, 2023. It is now read-only.

Containers don't restart after CAPD install #832

Closed
thesteve0 opened this issue Jun 22, 2021 · 35 comments
Labels
kind/feature A request for a new feature owner/upstream Work executed in an upstream project (not Carvel or tanzu-framework)

@thesteve0

Bug Report

After a local standalone install, if the computer is shut down or restarted, the containers for the Kubernetes cluster don't come back up.

Expected Behavior

By default, the TCE containers in the standalone Kubernetes cluster should start every time the Docker engine starts.
There should be a command or flag to turn off that default behavior.

Steps to Reproduce the Bug

Install TCE standalone, shut down the computer, start it again, and wait for Docker to come up. There will be no cluster.

Environment Details

  • Build version (tanzu version):
  • Operating System (client):
@thesteve0 thesteve0 added kind/bug A bug in an existing capability triage/needs-triage Needs triage by TCE maintainers labels Jun 22, 2021
@jpmcb
Contributor

jpmcb commented Jun 24, 2021

Thanks for the feedback!

There was some outside discussion on this. It looks like Cluster API does provide the "restart" annotations, but we aren't using those in the core library. I think there's an opportunity to make this better in the future by bringing restart support into core.

Will leave this open to keep track of the issue

@karuppiah7890
Contributor

Is this the same as #770 ?

@jpmcb
Contributor

jpmcb commented Jun 30, 2021

Did some digging on this (and thanks to @stmcginnis for all the context!) and there is a path forward

Docker containers can be restarted automatically, and kind has supported this for the last year.
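
For illustration, a minimal sketch of applying such a restart policy by hand to existing node containers. This is not something the TCE tooling does for you, and the kind node label in the filter is an assumption; substitute explicit container names if it doesn't match your setup.

# Mark the existing cluster node containers to restart with the Docker engine.
docker update --restart=unless-stopped $(docker ps -aq --filter "label=io.x-k8s.kind.cluster")
# Confirm the policy was applied to a given node container.
docker inspect --format '{{.HostConfig.RestartPolicy.Name}}' <node-container-name>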

The problem is with the Cluster API Docker provider (CAPD) here. The restart policy needs to be added in the createNode method, and kubernetes-sigs/cluster-api#4413 will enable us to add additional properties to the configuration using the Docker SDK (instead of calling out to the docker CLI).

So, TLDR, this is a problem with CAPD and will be configurable soon. We can use this issue as a high level tracker for getting this to eventually work in TCE

@stmcginnis
Contributor

This should be fixed with kubernetes-sigs/cluster-api#5021 merged to CAPD. We still need to update our dependency to CAPI v0.4.1 or later though. Since there are some major changes between our current v0.3.x dependency and v0.4.x, this may take some time.

@joshrosso joshrosso added kind/feature A request for a new feature owner/core-eng Work executed by TCE's core engineering team area/cluster-lifecycle and removed kind/bug A bug in an existing capability triage/needs-triage Needs triage by TCE maintainers owner/core-eng Work executed by TCE's core engineering team labels Aug 26, 2021
@joshrosso joshrosso added this to the icebox milestone Aug 26, 2021
@thesteve0
Author

Depends on #1431

@jbeda

jbeda commented Sep 7, 2021

Note that even after manually starting the clusters (docker start $(docker ps -qa)), there are still certificate errors when talking to workload clusters.

@randomvariable
Contributor

Certificate errors are likely to be related to IP changes, as per kubernetes-sigs/kind#1689
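
As a quick check, this shows which addresses the node containers came back with after a restart (assuming the default "kind" Docker network these clusters attach to):

docker network inspect kind --format '{{range .Containers}}{{.Name}} {{.IPv4Address}}{{"\n"}}{{end}}'

If those addresses differ from the ones the cluster was created with, certificate errors like the ones described above are consistent with that change.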

@randomvariable
Contributor

You may also see kubernetes-sigs/cluster-api#4874 (comment)

@joshrosso joshrosso added owner/core-eng Work executed by TCE's core engineering team and removed area/cluster-lifecycle labels Sep 19, 2021
@nimbusscale

I suspect this is in the scope of this issue, but ideally, we should be able to stop and start Docker-based clusters as well. I'm personally not interested in having clusters start automatically when I reboot my laptop, but I'd like to be able to start them manually. Let me know if this needs to be tracked in a separate issue and I will create one.

@randomvariable
Contributor

randomvariable commented Oct 18, 2021

Let me know if this needs to be tracked in a separate issue and I will create one.

I think it's the same underlying problem, but you might want to open an issue for enabling the use case via the CLI/UX. After looking into this some more, I think we discovered the root of the problem in etcd-io/etcd#13340. There's still some debate about the path forward, but we're dependent on some k8s & etcd changes upstream.

@bradwinfield

Just to note, I restarted the Docker containers (TCE management cluster) in the order that leaves the IP addresses for each container the same as before the Docker shutdown. I had to create a dummy container to take the .2 address that kind used during the build. This solves the certificate issue, but the cluster still does not respond to tanzu or kubectl commands.
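
For anyone wanting to try that approach, a rough sketch follows. The "kind" network name, its default 172.18.0.0/16 subnet, the .2 address, and the <cluster-name>-lb container name are assumptions/placeholders for a default local install; check docker network inspect kind for the real values. As noted above, this addressed the certificate errors but did not make the cluster fully responsive.

# Occupy the low address first so the real node containers come back on their previous IPs.
docker run -d --name ip-placeholder --network kind --ip 172.18.0.2 alpine tail -f /dev/null
# Then start the containers in their original order, load balancer first.
docker start <cluster-name>-lb
docker start $(docker ps -aq)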

@randomvariable
Contributor

Yeah, I'm pretty certain these are the etcd issues. Given the work being done on local clusters, I'm not sure how important this is anymore from a TCE perspective.

@jpmcb
Contributor

jpmcb commented Nov 18, 2021

Leaving this open regardless of the standalone cluster overhaul: we should investigate how this works for the new standalone cluster model, adjust our kind provider as needed, and gather community feedback

For reference, here is the new standalone-cluster proposal that uses a different model with a much lighter-weight methodology. Please look at the proposal, try it out, and give your feedback there!

@grtrout

grtrout commented Feb 16, 2022

I take it this is still an issue? I spent the last hour trying to figure out why my newly-created TCE cluster seems to be completely broken after I restarted my laptop. I get the reasons why this happens (sort of), but I thought one of the primary motivations behind TCE was to provide an experience akin to kind or minikube for local environments?

@stmcginnis
Contributor

I thought one of the primary motivations behind TCE was to provide an experience akin to kind or minikube for local environments?

Well, yes and no. With our standalone-cluster and unmanaged-cluster implementations, we are targeting the ability to deploy local clusters for development. But TCE as a whole, no.

@grtrout were you using the standalone-cluster command, or the new unmanaged-cluster command in the v0.10.0 release candidate?

There are issues with the way standalone-cluster works, so there is not a solution there. That command is being deprecated and replaced by unmanaged-cluster.

If you use the unmanaged-cluster command, this can work, but there are a couple of gotchas. When deploying the cluster you will need to specify calico for the CNI. The version of antrea we use by default is not able to handle the restart. On restart, the containers usually get assigned new IP addresses that cause problems there.

So the command tanzu unmanaged-cluster create --cni=calico foo should get a working cluster that (in most cases) should be able to survive a reboot. There may be some other conditions that cause problems with this, but so far in my experience it has been working fine.

@grtrout

grtrout commented Feb 16, 2022

Hey @stmcginnis, thanks for the quick reply. I've used Tanzu/TKG at a previous job and since I want to continue keeping up with the Tanzu ecosystem at my new job, I thought using TCE in my local environment might be smart. I'm also thinking about using it in a homelab, but that's a different thing...

In any case, yes, I just followed along with the "getting started" docs and created a standalone cluster. I had to hack a few things to get it working (e.g., toggling the deprecatedCgroupv1 value to true, restarting the process 3 or 4 times, etc.), but ultimately it was up and running and functional...until I restarted my laptop.

It looks like I should check out the unmanaged-cluster, but until now I was not aware of that. I would rather use Calico over Antrea anyway, so I'm good with that. I'll try this out later today. Thanks again!

@jpmcb jpmcb self-assigned this Feb 24, 2022
@butch7903

So the command tanzu unmanaged-cluster create --cni=calico foo should get a working cluster that (in most cases) should be able to survive a reboot. There may be some other conditions that cause problems with this, but so far in my experience it has been working fine.

Looks like the tanzu unmanaged-cluster create --cni=calico foo command is the fix. The issue I am seeing now is that the kapp-controller is crashing every so often post reboot. Any ideas why that would be, or how we could troubleshoot it further?

@stmcginnis
Contributor

@seemiller any tips on troubleshooting kapp-controller? Or someone we can pull in from the carvel project to take a look?

@joshrosso
Contributor

@butch7903 when you say it's crashing, can you help us understand:

  • what symptoms are you observing?
    • For example, are you seeing the pod go in CrashLoopBackOff?
    • Does this only occur shortly after the cluster reboots, or is it something that occurs hours after the cluster has been running healthily?
    • If you kubectl describe ... the kapp-controller pod, are any crash events displayed?

@butch7903

butch7903 commented Mar 8, 2022

Yes, having CrashLoopBackOff. The crashes begin right after reboot and continue infinitely. So far at 211 restarts.
(screenshots of the kapp-controller pod in CrashLoopBackOff and the truncated kubectl describe output)

@joshrosso
Contributor

Looks like the events are cut off, as I don't see the Error in here.

(screenshot of the truncated events output)

  • Can you copy and paste the output of the describe command?
  • Can you copy and paste the output of kubectl logs -n tkg-system $KAPP_CONTROLLER_POD_NAME -p
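
For convenience, one way to fill in the pod name used above; the app=kapp-controller label selector is an assumption, and kubectl get pods -n tkg-system will show the name directly if it differs:

KAPP_CONTROLLER_POD_NAME=$(kubectl get pods -n tkg-system -l app=kapp-controller -o jsonpath='{.items[0].metadata.name}')
kubectl describe pod -n tkg-system "$KAPP_CONTROLLER_POD_NAME"
kubectl logs -n tkg-system "$KAPP_CONTROLLER_POD_NAME" -p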

Also, if you want to join us in Slack, we can probably help you troubleshoot there.

Thanks!

@butch7903

Looks like I should have tested a bit more. I blew that tanzu unmanaged-cluster away and simply rebuilt it, waited for all of it to completely come up, and then rebooted. The kapp-controller is no longer restarting, so all I can think is that I must have rebooted before it had time to complete its setup, which put it into a bad state on every subsequent reboot.

@thesteve0
Author

Since this is referring to standalone rather than unmanaged perhaps we should close this one.

@jpmcb
Contributor

jpmcb commented Mar 21, 2022

Since this is referring to standalone rather than unmanaged perhaps we should close this one.

Yes. unmanaged-cluster doesn't suffer from this original issue. At the time of this writing, users on the antrea CNI may still experience issues when restarting their clusters.

Reference #3564 for further details

@jpmcb jpmcb closed this as completed Mar 21, 2022
@jorgemoralespou
Contributor

@jpmcb This issue is still relevant as it affects managed-clusters with CAPD. https://kubernetes.slack.com/archives/C02GY94A8KT/p1648732902925359

@joshrosso
Contributor

@jpmcb This issue is still relevant as it affects managed-clusters with CAPD. https://kubernetes.slack.com/archives/C02GY94A8KT/p1648732902925359

Agreed. Reopening.

To be entirely transparent, I don’t foresee CAPD-based restart support being in our near-term future.

@joshrosso joshrosso reopened this Mar 31, 2022
@joshrosso joshrosso added owner/upstream Work executed in an upstream project (not Carvel or tanzu-framework) and removed owner/core-eng Work executed by TCE's core engineering team labels Mar 31, 2022
@joshrosso joshrosso changed the title Containers don't restart after standalone CAPD install Containers don't restart after CAPD install Mar 31, 2022
@jorgemoralespou
Contributor

Then I would make it more prominent at the top of the doc page, not at the very bottom: https://tanzucommunityedition.io/docs/v0.11/docker-install-mgmt/

@RussellHamker

It appears that the issue with a management cluster reboot is that when Docker comes back up, the containers do not start in the same order or do not get the same IP addresses.

My Build Process for a management cluster today in Docker:
cat <<EOF > tce-mgmt.yaml
CLUSTER_CIDR: 100.96.0.0/11
CLUSTER_NAME: tce-mgmt
CLUSTER_PLAN: dev
ENABLE_MHC: "false"
IDENTITY_MANAGEMENT_TYPE: none
INFRASTRUCTURE_PROVIDER: docker
LDAP_BIND_DN: ""
LDAP_BIND_PASSWORD: ""
LDAP_GROUP_SEARCH_BASE_DN: ""
LDAP_GROUP_SEARCH_FILTER: ""
LDAP_GROUP_SEARCH_GROUP_ATTRIBUTE: ""
LDAP_GROUP_SEARCH_NAME_ATTRIBUTE: cn
LDAP_GROUP_SEARCH_USER_ATTRIBUTE: DN
LDAP_HOST: ""
LDAP_ROOT_CA_DATA_B64: ""
LDAP_USER_SEARCH_BASE_DN: ""
LDAP_USER_SEARCH_FILTER: ""
LDAP_USER_SEARCH_NAME_ATTRIBUTE: ""
LDAP_USER_SEARCH_USERNAME: userPrincipalName
OIDC_IDENTITY_PROVIDER_CLIENT_ID: ""
OIDC_IDENTITY_PROVIDER_CLIENT_SECRET: ""
OIDC_IDENTITY_PROVIDER_GROUPS_CLAIM: ""
OIDC_IDENTITY_PROVIDER_ISSUER_URL: ""
OIDC_IDENTITY_PROVIDER_NAME: ""
OIDC_IDENTITY_PROVIDER_SCOPES: ""
OIDC_IDENTITY_PROVIDER_USERNAME_CLAIM: ""
OS_ARCH: ""
OS_NAME: ""
OS_VERSION: ""
SERVICE_CIDR: 100.64.0.0/13
TKG_HTTP_PROXY_ENABLED: "false"
EOF

tanzu management-cluster create -f tce-mgmt.yaml --cni=calico

tanzu management-cluster kubeconfig get tce-mgmt --admin

docker network inspect kind

kubectl config use-context tce-mgmt-admin@tce-mgmt

kubectl get nodes

kubectl get po -A

Prior to reboot and post reboot: (screenshots comparing the container IP assignments and cluster state)

Is there a way we could force the tanzu mgmt docker containers to retain their IPs possibly? If so, this might be an easy fix.

@stmcginnis
Contributor

the containers do not start in the same order or do not get the same IP addresses

If I remember right, this matches both what I saw when I looked closer and what I've heard from other upstream projects like kind. There really needs to be some sort of IPAM integration in Docker that would allow assigning persistent IP addresses to containers for multi-node clusters to reliably survive Docker engine restarts. Without that, it's possible you can restart and get your full cluster back, but it's not very likely.

@RussellHamker

@RussellHamker

RussellHamker commented Mar 31, 2022

Just need to have VMware TCE update their tanzu mgmt creation to include the specific IPs
Or
Offer flags to allow us to set the IPs/network for the 3 distinct docker containers....

@stmcginnis
Contributor

Looks like they already have that...

Yeah, the capability exists to set an address. The missing piece is an IPAM to manage what to set.
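
For context, a sketch of the Docker-level primitive being referred to, using an illustrative network name, subnet, and address; CAPD would have to pass the equivalent settings when it creates the node containers.

# A user-defined network with an explicit subnet allows a fixed --ip per container.
docker network create --subnet 172.19.0.0/16 demonet
docker run -d --name demo-node --network demonet --ip 172.19.0.10 alpine tail -f /dev/null
docker inspect demo-node --format '{{.NetworkSettings.Networks.demonet.IPAddress}}'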

@RussellHamker

I don't think IPAM is needed; you just need to set the IPs and network during the build process somehow. Where is the Tanzu management YAML file stored after cluster creation?

@RussellHamker

RussellHamker commented Mar 31, 2022

My temporary workaround for this, until I can find a better method, is this:

# Restart process
CLUSTER="tce-mgmt"
docker stop $(docker ps -a -q)        # stop everything so the containers come back in a known order
docker start "$CLUSTER-lb"            # start the load balancer first so it reclaims the .3 address
docker start $(docker ps -a -q)       # then start the remaining node containers
docker network inspect kind           # confirm the IP assignments
kubectl config use-context $CLUSTER-admin@$CLUSTER
kubectl get nodes
kubectl get po -A

This causes the load balancer to always grab the .3 IP out of the IP pool instead of something else. The better long-term fix will be to set the IP for each container so that they come back up with the right IP on reboot. I will look into this further.

@jdumars jdumars closed this as completed Oct 31, 2022
@opsline-jvarelas

opsline-jvarelas commented Apr 1, 2023

Hi @RussellHamker
I've been trying to execute your workaround without success. Do you remember the exact order for all containers? I had a management cluster and workload clusters running, but after the reboot they are not working.

or

is there another way to run Tanzu management and workload clusters on a Linux server? I found the Docker option only in the TCE version. Thanks in advance.
