This repository was archived by the owner on Jun 28, 2023. It is now read-only.

Containers don't restart after CAPD install #832

Closed
thesteve0 opened this issue Jun 22, 2021 · 35 comments
Labels
kind/feature A request for a new feature owner/upstream Work executed in an upstream project (not Carvel or tanzu-framework)

@thesteve0

Bug Report

After a local standalone install, if the computer is shut down or restarted, the containers for the Kubernetes cluster don't come back up.

Expected Behavior

By default, the TCE containers in the standalone Kubernetes cluster should start every time the Docker engine starts.
There should be a command or flag to turn off that default behavior.

Steps to Reproduce the Bug

Install TCE standalone, shut down the computer, start it again, and wait for Docker to come up. There will be no cluster.

Environment Details

  • Build version (tanzu version):
  • Operating System (client):
@thesteve0 thesteve0 added kind/bug A bug in an existing capability triage/needs-triage Needs triage by TCE maintainers labels Jun 22, 2021
@jpmcb
Contributor

jpmcb commented Jun 24, 2021

Thanks for the feedback!

There was some outside discussion on this. It looks like Cluster API does provide the "restart" annotations, but we aren't using those in the core library. I think there's an opportunity to make this better in the future by bringing restart support into core.

Will leave this open to keep track of the issue

@karuppiah7890
Contributor

Is this the same as #770 ?

@jpmcb
Contributor

jpmcb commented Jun 30, 2021

Did some digging on this (and thanks to @stmcginnis for all the context!) and there is a path forward

Docker containers can be restarted automatically, and kind has supported this for the last year.
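
For illustration, a minimal sketch of applying such a restart policy by hand to existing node containers. This is not something the TCE tooling does for you, and the kind node label in the filter is an assumption; substitute explicit container names if it doesn't match your setup.

# Mark the existing cluster node containers to restart with the Docker engine.
docker update --restart=unless-stopped $(docker ps -aq --filter "label=io.x-k8s.kind.cluster")
# Confirm the policy was applied to a given node container.
docker inspect --format '{{.HostConfig.RestartPolicy.Name}}' <node-container-name>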

The problem is with the Cluster API Docker provider (CAPD) here. The restart policy needs to be added in the createNode method, and kubernetes-sigs/cluster-api#4413 will enable us to add additional properties to the configuration using the Docker SDK (instead of calling out to the docker CLI).

So, TLDR, this is a problem with CAPD and will be configurable soon. We can use this issue as a high level tracker for getting this to eventually work in TCE

@stmcginnis
Contributor

This should be fixed with kubernetes-sigs/cluster-api#5021 merged to CAPD. We still need to update our dependency to CAPI v0.4.1 or later though. Since there are some major changes between our current v0.3.x dependency and v0.4.x, this may take some time.

@joshrosso joshrosso added kind/feature A request for a new feature owner/core-eng Work executed by TCE's core engineering team area/cluster-lifecycle and removed kind/bug A bug in an existing capability triage/needs-triage Needs triage by TCE maintainers owner/core-eng Work executed by TCE's core engineering team labels Aug 26, 2021
@joshrosso joshrosso added this to the icebox milestone Aug 26, 2021
@thesteve0
Author

Depends on #1431

@jbeda

jbeda commented Sep 7, 2021

Note that even after manually starting the clusters (docker start $(docker ps -qa)), there are still certificate errors when talking to workload clusters.

@randomvariable
Contributor

Certificate errors are likely to be related to IP changes, as per kubernetes-sigs/kind#1689
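
As a quick check, this shows which addresses the node containers came back with after a restart (assuming the default "kind" Docker network these clusters attach to):

docker network inspect kind --format '{{range .Containers}}{{.Name}} {{.IPv4Address}}{{"\n"}}{{end}}'

If those addresses differ from the ones the cluster was created with, certificate errors like the ones described above are consistent with that change.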

@randomvariable
Contributor

You may also see kubernetes-sigs/cluster-api#4874 (comment)

@joshrosso joshrosso added owner/core-eng Work executed by TCE's core engineering team and removed area/cluster-lifecycle labels Sep 19, 2021
@nimbusscale

I suspect this is in the scope of this issue, but ideally, we should be able to stop and start Docker-based clusters as well. I'm personally not interested in having clusters start automatically when I reboot my laptop, but I'd like to be able to start them manually. Let me know if this needs to be tracked in a separate issue and I will create one.

@randomvariable
Contributor

randomvariable commented Oct 18, 2021

Let me know if this needs to be tracked in a separate issue and I will create one.

I think it's the same underlying problem, but you might want to open an issue for enabling the use case via the CLI/UX. After looking into this some more, I think we discovered the root of the problem in etcd-io/etcd#13340. There's still some debate about the path forward, but we're dependent on some k8s & etcd changes upstream.

@bradwinfield

Just to note, I restarted the Docker containers (TCE management cluster) in the order that leaves the IP addresses for each container the same as before the Docker shutdown. I had to create a dummy container to take the .2 address that kind used during the build. This solves the certificate issue, but the cluster still does not respond to tanzu or kubectl commands.
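
For anyone wanting to try that approach, a rough sketch follows. The "kind" network name, its default 172.18.0.0/16 subnet, the .2 address, and the <cluster-name>-lb container name are assumptions/placeholders for a default local install; check docker network inspect kind for the real values. As noted above, this addressed the certificate errors but did not make the cluster fully responsive.

# Occupy the low address first so the real node containers come back on their previous IPs.
docker run -d --name ip-placeholder --network kind --ip 172.18.0.2 alpine tail -f /dev/null
# Then start the containers in their original order, load balancer first.
docker start <cluster-name>-lb
docker start $(docker ps -aq)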

@randomvariable
Contributor

Yeah, I'm pretty certain these are the etcd issues. Given the work being done on local clusters, I'm not sure how important this is anymore from a TCE perspective.

@jpmcb
Contributor

jpmcb commented Nov 18, 2021

Leaving this open regardless of the standalone cluster overhaul: we should investigate how this works for the new standalone cluster model, adjust our kind provider as needed, and gather community feedback

For reference, here is the new standalone-cluster proposal that uses a different model with a much lighter-weight methodology. Please look at the proposal, try it out, and give your feedback there!

@grtrout

grtrout commented Feb 16, 2022

I take it this is still an issue? I spent the last hour trying to figure out why my newly-created TCE cluster seems to be completely broken after I restarted my laptop. I get the reasons why this happens (sort of), but I thought one of the primary motivations behind TCE was to provide an experience akin to kind or minikube for local environments?

@stmcginnis
Contributor

I thought one of the primary motivations behind TCE was to provide an experience akin to kind or minikube for local environments?

Well, yes and no. With our standalone-cluster and unmanaged-cluster implementations, we are targeting the ability to deploy local clusters for development. But TCE as a whole, no.

@grtrout were you using the standalone-cluster command, or the new unmanaged-cluster command in the v0.10.0 release candidate?

There are issues with the way standalone-cluster works, so there is not a solution there. That command is being deprecated and replaced by unmanaged-cluster.

If you use the unmanaged-cluster command, this can work, but there are a couple of gotchas. When deploying the cluster you will need to specify calico for the CNI. The version of antrea we use by default is not able to handle the restart. On restart, the containers usually get assigned new IP addresses that cause problems there.

So the command tanzu unmanaged-cluster create --cni=calico foo should get a working cluster that (in most cases) should be able to survive a reboot. There may be some other conditions that cause problems with this, but so far in my experience it has been working fine.

@grtrout

grtrout commented Feb 16, 2022

Hey @stmcginnis, thanks for the quick reply. I've used Tanzu/TKG at a previous job and since I want to continue keeping up with the Tanzu ecosystem at my new job, I thought using TCE in my local environment might be smart. I'm also thinking about using it in a homelab, but that's a different thing...

In any case, yes, I just followed along with the "getting started" docs and created a standalone cluster. I had to hack a few things to get it working (e.g., toggling the deprecatedCgroupv1 value to true, restarting the process 3 or 4 times, etc.), but ultimately it was up and running and functional...until I restarted my laptop.

It looks like I should check out the unmanaged-cluster, but until now I was not aware of that. I would rather use Calico over Antrea anyway, so I'm good with that. I'll try this out later today. Thanks again!

@jpmcb jpmcb self-assigned this Feb 24, 2022
@butch7903

So the command tanzu unmanaged-cluster create --cni=calico foo should get a working cluster that (in most cases) should be able to survive a reboot. There may be some other conditions that cause problems with this, but so far in my experience it has been working fine.

Looks like the tanzu unmanaged-cluster create --cni=calico foo command is the fix. The issue I am seeing now is that the kapp-controller is crashing every so often post reboot. Any ideas why that would be, or how we could troubleshoot it further?

@stmcginnis
Contributor

@seemiller any tips on troubleshooting kapp-controller? Or someone we can pull in from the carvel project to take a look?

@joshrosso
Contributor

@butch7903 when you say it's crashing, can you help us understand:

  • what symptoms are you observing?
    • For example, are you seeing the pod go in CrashLoopBackOff?
    • Does this only occur shortly after the cluster reboots, or is it something that occurs hours after the cluster has been running healthily?
    • If you kubectl describe ... the kapp-controller pod, are any crash events displayed?

@butch7903

butch7903 commented Mar 8, 2022

Yes, having CrashLoopBackOff. The crashes begin right after reboot and continue infinitely. So far at 211 restarts.
(screenshots of the kapp-controller pod in CrashLoopBackOff and the truncated kubectl describe output)

@joshrosso
Contributor

Looks like the events are cut off, as I don't see the Error in here.

(screenshot of the truncated events output)

  • Can you copy and paste the output of the describe command?
  • Can you copy and paste the output of kubectl logs -n tkg-system $KAPP_CONTROLLER_POD_NAME -p
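
For convenience, one way to fill in the pod name used above; the app=kapp-controller label selector is an assumption, and kubectl get pods -n tkg-system will show the name directly if it differs:

KAPP_CONTROLLER_POD_NAME=$(kubectl get pods -n tkg-system -l app=kapp-controller -o jsonpath='{.items[0].metadata.name}')
kubectl describe pod -n tkg-system "$KAPP_CONTROLLER_POD_NAME"
kubectl logs -n tkg-system "$KAPP_CONTROLLER_POD_NAME" -p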

Also, if you want to join us in Slack, we can probably help you troubleshoot there.

Thanks!

@butch7903

Looks like I should have tested a bit more. I blew that tanzu unmanaged-cluster away and simply rebuilt it, waited for all of it to completely come up, and then rebooted. The kapp-controller is no longer restarting, so all I can think is that I must have rebooted before it had time to complete its setup, which put it into a bad state on every subsequent reboot.

@thesteve0
Author

Since this is referring to standalone rather than unmanaged perhaps we should close this one.

@jpmcb
Contributor

jpmcb commented Mar 21, 2022

Since this is referring to standalone rather than unmanaged perhaps we should close this one.

Yes. unmanaged-cluster doesn't suffer from this original issue. At the time of this writing, users on the antrea CNI may still experience issues when restarting their clusters.

Reference #3564 for further details

@jpmcb jpmcb closed this as completed Mar 21, 2022
@jorgemoralespou
Contributor

@jpmcb This issue is still relevant as it affects managed-clusters with CAPD. https://kubernetes.slack.com/archives/C02GY94A8KT/p1648732902925359

@joshrosso
Contributor

@jpmcb This issue is still relevant as it affects managed-clusters with CAPD. https://kubernetes.slack.com/archives/C02GY94A8KT/p1648732902925359

Agreed. Reopening.

To be entirely transparent, I don’t foresee CAPD-based restart support being in our near-term future.

@joshrosso joshrosso reopened this Mar 31, 2022
@joshrosso joshrosso added owner/upstream Work executed in an upstream project (not Carvel or tanzu-framework) and removed owner/core-eng Work executed by TCE's core engineering team labels Mar 31, 2022
@joshrosso joshrosso changed the title Containers don't restart after standalone CAPD install Containers don't restart after CAPD install Mar 31, 2022
@jorgemoralespou
Contributor

Then I would make it more prominent at the top of the doc page, not at the very bottom: https://tanzucommunityedition.io/docs/v0.11/docker-install-mgmt/

@RussellHamker

It appears that the issue with a management cluster reboot is that when Docker comes back up, the containers do not start in the same order or do not get the same IP addresses.

My Build Process for a management cluster today in Docker:
cat <<EOF > tce-mgmt.yaml
CLUSTER_CIDR: 100.96.0.0/11
CLUSTER_NAME: tce-mgmt
CLUSTER_PLAN: dev
ENABLE_MHC: "false"
IDENTITY_MANAGEMENT_TYPE: none
INFRASTRUCTURE_PROVIDER: docker
LDAP_BIND_DN: ""
LDAP_BIND_PASSWORD: ""
LDAP_GROUP_SEARCH_BASE_DN: ""
LDAP_GROUP_SEARCH_FILTER: ""
LDAP_GROUP_SEARCH_GROUP_ATTRIBUTE: ""
LDAP_GROUP_SEARCH_NAME_ATTRIBUTE: cn
LDAP_GROUP_SEARCH_USER_ATTRIBUTE: DN
LDAP_HOST: ""
LDAP_ROOT_CA_DATA_B64: ""
LDAP_USER_SEARCH_BASE_DN: ""
LDAP_USER_SEARCH_FILTER: ""
LDAP_USER_SEARCH_NAME_ATTRIBUTE: ""
LDAP_USER_SEARCH_USERNAME: userPrincipalName
OIDC_IDENTITY_PROVIDER_CLIENT_ID: ""
OIDC_IDENTITY_PROVIDER_CLIENT_SECRET: ""
OIDC_IDENTITY_PROVIDER_GROUPS_CLAIM: ""
OIDC_IDENTITY_PROVIDER_ISSUER_URL: ""
OIDC_IDENTITY_PROVIDER_NAME: ""
OIDC_IDENTITY_PROVIDER_SCOPES: ""
OIDC_IDENTITY_PROVIDER_USERNAME_CLAIM: ""
OS_ARCH: ""
OS_NAME: ""
OS_VERSION: ""
SERVICE_CIDR: 100.64.0.0/13
TKG_HTTP_PROXY_ENABLED: "false"
EOF

tanzu management-cluster create -f tce-mgmt.yaml --cni=calico

tanzu management-cluster kubeconfig get tce-mgmt --admin

docker network inspect kind

kubectl config use-context tce-mgmt-admin@tce-mgmt

kubectl get nodes

kubectl get po -A

Prior to reboot and post reboot: (screenshots comparing the container IP assignments and cluster state)

Is there a way we could force the tanzu mgmt docker containers to retain their IPs possibly? If so, this might be an easy fix.

@stmcginnis
Contributor

the containers do not start in the same order or do not get the same IP addresses

If I remember right, this matches both what I saw when I looked closer and what I've heard from other upstream projects like kind. There really needs to be some sort of IPAM integration in Docker that would allow assigning persistent IP addresses to containers for multi-node clusters to reliably survive Docker engine restarts. Without that, it's possible you can restart and get your full cluster back, but it's not very likely.

@RussellHamker

@RussellHamker

RussellHamker commented Mar 31, 2022

Just need to have VMware TCE update their tanzu mgmt creation to include the specific IPs
Or
Offer flags to allow us to set the IPs/network for the 3 distinct docker containers....

@stmcginnis
Contributor

Looks like they already have that...

Yeah, the capability exists to set an address. The missing piece is an IPAM to manage what to set.
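
For context, a sketch of the Docker-level primitive being referred to, using an illustrative network name, subnet, and address; CAPD would have to pass the equivalent settings when it creates the node containers.

# A user-defined network with an explicit subnet allows a fixed --ip per container.
docker network create --subnet 172.19.0.0/16 demonet
docker run -d --name demo-node --network demonet --ip 172.19.0.10 alpine tail -f /dev/null
docker inspect demo-node --format '{{.NetworkSettings.Networks.demonet.IPAddress}}'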

@RussellHamker

I don't think IPAM is needed; you just need to set the IPs and network during the build process somehow. Where is the Tanzu management YAML file stored after cluster creation?

@RussellHamker

RussellHamker commented Mar 31, 2022

My temporary workaround for this, until I can find a better method, is this:

# Restart process
CLUSTER="tce-mgmt"
docker stop $(docker ps -a -q)        # stop everything so the containers come back in a known order
docker start "$CLUSTER-lb"            # start the load balancer first so it reclaims the .3 address
docker start $(docker ps -a -q)       # then start the remaining node containers
docker network inspect kind           # confirm the IP assignments
kubectl config use-context $CLUSTER-admin@$CLUSTER
kubectl get nodes
kubectl get po -A

This causes the load balancer to always grab the .3 IP out of the IP pool instead of something else. The better long-term fix will be to set the IP for each container so that they come back up with the right IP on reboot. I will look into this further.

@jdumars jdumars closed this as completed Oct 31, 2022
@opsline-jvarelas

opsline-jvarelas commented Apr 1, 2023

Hi @RussellHamker
I've been trying to execute your workaround without success. Do you remember the exact order for all containers? I had a management cluster and workload clusters running, but after the reboot they are not working.

or

is there another way to run Tanzu management and workload clusters on a Linux server? I found the Docker option only in the TCE version. Thanks in advance.
