
[WIP] Add restart cluster command #408

Closed

Conversation

Member

@tao12345666333 tao12345666333 commented Mar 27, 2019

xref: #148

  • Add restart cluster command
  • The containers' IPs will change when the nodes (containers) are restarted. We need to update admin.conf and other files.
    • Update HAProxy's config file so it forwards requests to the correct control plane.
    • Make the control plane work correctly, including the api-server, etcd, and some certs.
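As a sketch of the HAProxy point above, the load balancer's config carries the control-plane container IPs in its backend, and those are the entries that would need rewriting after a restart. The names and addresses below are hypothetical:

```
# haproxy.cfg sketch -- names/IPs are hypothetical
frontend control-plane
    bind *:6443
    mode tcp
    default_backend kube-apiservers

backend kube-apiservers
    mode tcp
    # one "server" line per control-plane node; these IPs go stale
    # whenever the node containers restart and get new addresses
    server kind-control-plane  172.17.0.2:6443 check
    server kind-control-plane2 172.17.0.3:6443 check
```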

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Mar 27, 2019
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: tao12345666333
To fully approve this pull request, please assign additional approvers.
We suggest the following additional approver: bentheelder

If they are not already assigned, you can assign the PR to them by writing /assign @bentheelder in a comment when ready.

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Mar 27, 2019
Member

@neolit123 neolit123 left a comment


@tao12345666333 thanks for working on this!

The containers' IPs will change when the nodes (containers) are restarted. We need to update the admin.conf.

this seems tricky. if the IP of a load balancer or a control plane node changes, this means that all workers have to rejoin the cluster. is there a way to make docker restart keep the old IP/port?

@tao12345666333
Member Author

is there a way to make docker restart keep the old IP/port?

if we use docker run with the --ip flag, that may resolve it.

Signed-off-by: Jintao Zhang <zhangjintao9020@gmail.com>
@tao12345666333
Member Author

after #461 is merged we can make this simpler, although the core issue to be addressed is still the IP changes.

I will complete this PR as soon as possible after the end of this holiday.

I still need to complete the following steps:

  • Change all old IPs to the new IPs under /etc/kubernetes.
  • Regenerate the apiserver and etcd peer certs.
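The first step could be sketched as a plain search-and-replace pass, shown here against a throwaway copy of admin.conf rather than the real /etc/kubernetes. The IPs and the file are placeholder values:

```shell
# Sketch: rewrite the old node IP to the new one, the way admin.conf and
# the static pod manifests under /etc/kubernetes would need rewriting.
# OLD_IP/NEW_IP and the temp file are hypothetical placeholders.
OLD_IP=172.17.0.2
NEW_IP=172.17.0.4
dir=$(mktemp -d)
printf 'server: https://%s:6443\n' "$OLD_IP" > "$dir/admin.conf"
# a real implementation would loop over every file in /etc/kubernetes
sed "s/$OLD_IP/$NEW_IP/g" "$dir/admin.conf" > "$dir/admin.conf.new"
mv "$dir/admin.conf.new" "$dir/admin.conf"
cat "$dir/admin.conf"
```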

@aojea
Contributor

aojea commented Apr 30, 2019

just sharing my thoughts, what's the best option regarding the IP address problem?

  1. Try to keep the same ip addresses on all the nodes after restart?
  2. Regenerate all certificates after restart matching the new ip addresses?

I guess that 1 is the option closest to real user scenarios, and looking at docker networking it seems possible to assign static IP addresses to the nodes, so a possible solution could be:

  1. Get node ip address
  2. Stop node
  3. Start node with address obtained in 1
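Those three steps might look like the dry-run sketch below. The `run` helper only prints the docker invocations instead of executing them; the container/network names and the IP are hypothetical, and note that static --ip assignment only works on a user-defined network, not the default bridge:

```shell
# Dry-run sketch of: 1) get node IP, 2) stop node, 3) start with same IP.
# `run` echoes instead of executing; swap it for run() { "$@"; } to run
# the commands for real.
run() { echo "+ $*"; }
NODE=kind-control-plane
NET=kind
# step 1: in reality the IP would come from `docker inspect`
IP=172.17.0.2
# step 2
run docker stop "$NODE"
# step 3: reattach with the saved address, then start
run docker network disconnect "$NET" "$NODE"
run docker network connect --ip "$IP" "$NET" "$NODE"
run docker start "$NODE"
```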

@tao12345666333
Member Author

Thanks for your thoughts.

My current approach is the first one.

The scenario we first encountered (#148) was caused by "docker restart", which does not guarantee that the original IP is not already taken. So it may not be suitable to start the node with its original static IP.

Of course, if the restart is done as in 1, then we should select higher IPs from the address pool at startup to avoid them being taken.

@tao12345666333
Member Author

@BenTheElder @neolit123 do you have any suggestions? I think it would be easier to do this by using kubeadm.

return err
}

if !node.WaitForDocker(time.Now().Add(time.Second * 60)) {
Member


i think this PR will be affected by the changes in:
#461

Member Author


yes. after that PR has been merged, this code will be removed.

Member


I need to rework HAproxy in the new world ™️ as well, so those parts will be getting a PR soon and also will need updating... sorry about that!

Member Author


no worries. as long as we can push things forward, that will suffice.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 1, 2019
@k8s-ci-robot
Contributor

@tao12345666333: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@aojea
Contributor

aojea commented May 2, 2019

@tao12345666333 I think the problem is that we are using docker restart, which adds the complexity of dealing with IP assignment.
IIUIC the goal is to simulate a node restart. Since the nodes are using systemd, I think it could be possible to implement a "simulated" restart by just restarting all services inside the container (systemctl restart kubelet containerd ...), getting rid of the docker restart command.
@neolit123 @fabriziopandini @BenTheElder any thoughts?
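The simulated restart described above might be a single exec per node, as in this dry-run sketch (the `run` helper echoes instead of executing, and the node names and service list are assumptions):

```shell
# Dry-run sketch of the "simulated restart": restart the systemd services
# inside each node container, so the container (and its IP) never goes away.
run() { echo "+ $*"; }   # swap for run() { "$@"; } to execute for real
for node in kind-control-plane kind-worker; do
  run docker exec "$node" systemctl restart kubelet containerd
done
```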

@neolit123
Member

the original issue #148 (comment)
talks about docker on the host being restarted which renders the cluster broken.

with this comment:
#148 (comment)
we are one step further, but i don't think the IP update is avoidable here.

can we just use container networking instead of IPs (e.g. run --network)?
kubeadm should be fine with that...

@BenTheElder
Member

  • we should explore networks with --network
  • the primary goal AFAIK is to survive container restarts, mostly from users on docker for mac after restarting the daemon i'd guess.
  • after we sort out networking, this should just be starting the containers with matching labels, fixmounts etc. is gone now :-)
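The --network idea could be sketched as below. On a user-defined docker network, containers resolve one another by name, so kubeconfigs and the load balancer could point at stable names instead of IPs that change across restarts. This is a dry run (`run` echoes rather than executes), and the network, container names, and image are hypothetical:

```shell
# Dry-run sketch: put nodes on a user-defined docker network so they can
# address one another by container name instead of by (unstable) IP.
run() { echo "+ $*"; }
run docker network create kind
run docker run -d --network kind --name kind-control-plane kindest/node
# a second container on the same network can reach
# https://kind-control-plane:6443 by name, whatever IP it gets
run docker run -d --network kind --name kind-worker kindest/node
```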

@tao12345666333
Member Author

I will open another PR to handle the network problem (using --network instead of IPs). After that is done, I will continue with this PR.

Thanks all. 👍

@tao12345666333 tao12345666333 mentioned this pull request May 5, 2019
@BenTheElder BenTheElder added this to the 0.4 milestone May 15, 2019
@BenTheElder BenTheElder modified the milestones: v0.4.0, v0.5.0 Jun 20, 2019
@BenTheElder BenTheElder removed this from the v0.5.0 milestone Aug 20, 2019
@Ilyes512

Ilyes512 commented Sep 15, 2019


It would be really awesome if it were possible to restart kind cluster(s) after, for example, rebooting your (macOS) host. I (want to) use kind for local development. My end goal is to replace minikube.

Luckily recreating the cluster(s) is really fast, but it would still be nice to just (re)start them. I am currently looking at Velero to see if I can use it to restore the state after recreation.

@k8s-ci-robot
Contributor

@tao12345666333: The following tests failed, say /retest to rerun them all:

Test name Commit Details Rerun command
pull-kind-conformance-parallel-1-14 0947f8d link /test pull-kind-conformance-parallel-1-14
pull-kind-conformance-parallel-1-15 0947f8d link /test pull-kind-conformance-parallel-1-15
pull-kind-unit 0947f8d link /test pull-kind-unit
pull-kind-conformance-parallel-1-16 0947f8d link /test pull-kind-conformance-parallel-1-16
pull-kind-e2e-kubernetes 0947f8d link /test pull-kind-e2e-kubernetes

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@BenTheElder
Member

this will need to be revisited on top of the provider work after we figure out the network issues

thanks again for the PR, we'll surely reference this when we've resolved the network part and come to revive restart support.

Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

7 participants