Implement a proper topology stop strategy #1091

Closed
LucioFranco opened this issue Oct 25, 2019 · 4 comments · Fixed by #1994 or #2098
Assignees
Labels
domain: topology (Anything related to Vector's topology code)
type: bug (A code related bug.)
type: enhancement (A value-adding code change that enhances its existing functionality.)
type: tech debt (A code change that does not add user value.)

Comments

@LucioFranco
Contributor

LucioFranco commented Oct 25, 2019

Motivation

Currently, our topology stop method attempts to shut down the topology tasks by selecting against them and dropping any resources. This is incorrect: we should instead notify each task that we need to shut down and let it shut down gracefully.

Prior art:

https://docs.rs/tokio-evacuate/
https://github.com/linkerd/linkerd2-proxy/blob/master/linkerd/app/core/src/serve.rs#L13
https://github.com/sfackler/futures-shutdown

Proposal

Our codebase currently does this https://github.com/timberio/vector/blob/master/src/topology/builder.rs#L128, which force-cancels a task. In the context of TCP this is fine, because we create a second tripwire that fires when the original task gets dropped and thus cancels all of its child tasks.

This does not work, though, for something like hyper, which wants a signal to shut down, as shown here: https://docs.rs/hyper/0.13.0-alpha.4/hyper/server/struct.Server.html#method.with_graceful_shutdown. What this means is that we were never "gracefully" shutting down in the first place; we were forcing the task to quit.
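
For context, here is a minimal sketch of the graceful-shutdown pattern that hyper link describes, written against the later hyper 0.14 API (the issue links 0.13.0-alpha.4, but with_graceful_shutdown works the same way) and using a Ctrl-C trigger purely for illustration. The server is handed a future and, once that future resolves, stops accepting connections and drains in-flight ones instead of having its task force-cancelled:

use std::convert::Infallible;

use hyper::service::{make_service_fn, service_fn};
use hyper::{Body, Request, Response, Server};

async fn hello(_req: Request<Body>) -> Result<Response<Body>, Infallible> {
    Ok(Response::new(Body::from("hello")))
}

#[tokio::main]
async fn main() {
    // The sender side is the "shutdown signal"; hyper drains in-flight
    // connections once the receiver future resolves.
    let (shutdown_tx, shutdown_rx) = tokio::sync::oneshot::channel::<()>();

    tokio::spawn(async move {
        // For this sketch, begin the graceful shutdown on Ctrl-C.
        let _ = tokio::signal::ctrl_c().await;
        let _ = shutdown_tx.send(());
    });

    let make_svc = make_service_fn(|_conn| async { Ok::<_, Infallible>(service_fn(hello)) });

    let server = Server::bind(&([127, 0, 0, 1], 3000).into())
        .serve(make_svc)
        .with_graceful_shutdown(async {
            shutdown_rx.await.ok();
        });

    if let Err(e) = server.await {
        eprintln!("server error: {}", e);
    }
}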

My proposal is to invert this by providing each spawned task with a handle that can produce a future which resolves once the shutdown process has started. Each handle is correlated with one task, and that task is considered part of the group of tasks connected to that one instance of the topology. When you reload the topology, we should shut down all tasks that were spawned from that config. The shutdown should first signal each task so it can start its cleanup; once a task has finished its cleanup, its handle gets dropped. Once all handles are dropped, the main shutdown future can complete. That main future, driven from Topology::stop, can be selected against a timer to produce a graceful shutdown period followed by a forced shutdown if tasks hang.
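
To make the proposal concrete, here is a rough sketch of that coordinator/handle split. It is illustration only, not Vector's implementation: the type names ShutdownCoordinator and ShutdownHandle are made up, and it uses today's tokio primitives rather than the futures 0.1 APIs Vector used at the time. A watch channel carries the "shutdown has started" signal, and an mpsc sender held by each handle lets the coordinator observe when every handle has been dropped:

use std::time::Duration;

use tokio::sync::{mpsc, watch};
use tokio::time::timeout;

// Hypothetical types for illustration; not Vector's actual API.
struct ShutdownCoordinator {
    signal_tx: watch::Sender<bool>,
    done_tx: mpsc::Sender<()>,
    done_rx: mpsc::Receiver<()>,
}

struct ShutdownHandle {
    signal_rx: watch::Receiver<bool>,
    _done: mpsc::Sender<()>, // dropped when the owning task finishes its cleanup
}

impl ShutdownCoordinator {
    fn new() -> Self {
        let (signal_tx, _) = watch::channel(false);
        let (done_tx, done_rx) = mpsc::channel(1);
        Self { signal_tx, done_tx, done_rx }
    }

    // One handle per task spawned for this instance of the topology.
    fn handle(&self) -> ShutdownHandle {
        ShutdownHandle {
            signal_rx: self.signal_tx.subscribe(),
            _done: self.done_tx.clone(),
        }
    }

    // Signal every task, then wait up to `grace` for all handles to be dropped.
    async fn shutdown_with_timeout(self, grace: Duration) -> Result<(), ()> {
        let ShutdownCoordinator { signal_tx, done_tx, mut done_rx } = self;
        let _ = signal_tx.send(true); // tell every handle that shutdown has started
        drop(done_tx);                // only the task-held sender clones remain now
        match timeout(grace, done_rx.recv()).await {
            // recv() yields None once every sender (handle) has been dropped.
            Ok(None) => Ok(()),
            // Timed out: the caller should force-cancel the stragglers.
            _ => Err(()),
        }
    }
}

impl ShutdownHandle {
    // Resolves once the shutdown process has started.
    async fn signaled(&mut self) {
        let already_signaled = *self.signal_rx.borrow();
        if !already_signaled {
            let _ = self.signal_rx.changed().await;
        }
    }
}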

Here is an example I threw together last night that shows how we might be able to expose this concept to users that want to implement sources, transforms, and sinks. https://github.com/LucioFranco/iquit/blob/master/src/work.rs#L24

One big reason I think we should provide a better API, as well as implement this properly, is that getting shutdown right is hard. Because it's hard, it's unrealistic to expect contributors to implement it correctly by hand the way the TCP source does. With a nice API and good documentation, we can ensure that all of our sources, transforms, and sinks shut down properly in the future.

Ideally, we could provide something like this:

let shutdown = Shutdown::new();

// This will use _some_ executor to spawn a task.
//
// Internally this registers a new task and gets a handle to that
// registration, then calls this closure with the handle so that the
// inner task future can do something with it. It will also wrap the
// Future returned from `SourceConfig::build` with a select against a
// oneshot channel, like we do now, so that we can force shut it down
// if it does not shut down gracefully.
shutdown.spawn(|handle| source.build(name, globals, out, handle));

// Now let's attempt to shut it down.
if let Err(_) = rt.block_on(shutdown.shutdown_with_timeout(Duration::from_secs(60))) {
    // Graceful shutdown failed, so let's force shutdown everything.
    rt.block_on(shutdown.force());
}
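
Continuing the same hypothetical sketch (again, not Vector's actual API), the consumer side might look like the following: a source's event loop selects on the handle's shutdown future, does its cleanup, and then simply returns, and dropping the handle on the way out is what tells the coordinator that this task has finished:

use tokio::sync::mpsc;

// Placeholder event type and processing step, purely for the example.
struct Event;
fn process(_event: Event) {}

async fn run_source(mut handle: ShutdownHandle, mut events: mpsc::Receiver<Event>) {
    loop {
        tokio::select! {
            // Resolves once Topology::stop (or a reload) starts the shutdown.
            _ = handle.signaled() => {
                // Flush buffers, close listeners, etc., then fall through.
                break;
            }
            maybe_event = events.recv() => match maybe_event {
                Some(event) => process(event),
                None => break, // upstream closed
            },
        }
    }
    // `handle` is dropped here, which is what tells the coordinator
    // that this task has finished shutting down.
}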

cc @lukesteensen @a-rodin

@LucioFranco LucioFranco added type: bug A code related bug. type: enhancement A value-adding code change that enhances its existing functionality. domain: topology Anything related to Vector's topology code type: tech debt A code change that does not add user value. needs: approval Needs review & approval before work can begin. labels Oct 25, 2019
@binarylogic
Contributor

Nice write-up! So it sounds like we should couple this with the upcoming contributing project, correct?

@lukesteensen
Member

👍 Your example looks good to me. Definitely want to keep the API as simple and hard to misuse as possible.

@stbrody
Contributor

stbrody commented Feb 22, 2020

I'm trying to understand how shutdown in the TCP source works currently. I see here we create a Trigger, Tripwire pair, and we later pass the Tripwire to the FramedRead's take_until method, but we seem to drop the Trigger on the floor without ever signaling it. My understanding of how the Trigger, Tripwire pairs are supposed to work is that the Tripwire Future is activated when the Trigger is signaled. Since the Trigger here is never signaled, doesn't that mean the Tripwire should never fire and take_until would run forever?

@LucioFranco @lukesteensen

@stbrody
Contributor

stbrody commented Feb 24, 2020

Ugh, nevermind. I had assumed that Trigger.cancel() did what Trigger.disable() actually does. I now see that this line is what signals the trigger to fire. I assume the future from the for_each line resolves and the inspect line activates when the socket is closed or errors.
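
For anyone else who trips over this: the key detail is that a stream-cancel Trigger fires its Tripwire when it is dropped, unless disable() was called first. Here is a tiny illustration of that "fire on drop" behaviour using a plain tokio oneshot channel rather than the stream-cancel API itself; the receiver resolves as soon as the sender is dropped, much like a Tripwire fires once its still-armed Trigger goes out of scope:

use tokio::sync::oneshot;

#[tokio::main]
async fn main() {
    // Stand-ins for the Trigger/Tripwire pair; not the stream-cancel API.
    let (trigger, tripwire) = oneshot::channel::<()>();

    tokio::spawn(async move {
        // ... do some work, then let `trigger` go out of scope without sending ...
        drop(trigger);
    });

    // Resolves (with Err, since nothing was sent) as soon as `trigger` is dropped.
    let _ = tripwire.await;
    println!("tripwire fired");
}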

stbrody pushed a commit to stbrody/vector that referenced this issue Mar 6, 2020

Phase one is signaling to all sources to shut down; phase two is waiting for them
to finish shutting down gracefully, or forcing shutdown if they don't finish within
a time limit (currently 30 seconds)

Signed-off-by: Spencer T Brody <spencer.t.brody@gmail.com>
stbrody pushed a commit to stbrody/vector that referenced this issue Mar 9, 2020
…ordotdev#1091)

Phase one is signaling to all sources to shut down; phase two is waiting for them
to finish shutting down gracefully, or forcing shutdown if they don't finish within
a time limit (currently 30 seconds)

Signed-off-by: Spencer T Brody <spencer.t.brody@gmail.com>
stbrody added a commit that referenced this issue Mar 18, 2020
* chore(topology): Refactor source shutdown and make it two-phase (#1091)

Phase one is signaling to all sources to shut down; phase two is waiting for them
to finish shutting down gracefully, or forcing shutdown if they don't finish within
a time limit (currently 3 seconds)

Signed-off-by: Spencer T Brody <spencer.t.brody@gmail.com>
@Hoverbear Hoverbear reopened this Mar 18, 2020
@stbrody stbrody reopened this Mar 18, 2020