should unwinding instance-start sagas put the instance in Stopped or Failed? #7727

Open
hawkw opened this issue Mar 4, 2025 · 0 comments

hawkw commented Mar 4, 2025

When an instance-start saga unwinds, its compensating actions transition the instance back to the Stopped state. This makes sense from the perspective of "an unwinding saga node should put things back Exactly The Way They Were Before".[1] However, it's a bit weird with regard to the user's intent: arguably, an unsuccessful attempt to start an instance is equivalent, in terms of the desired state for that instance, to successfully starting it and then having it fail. The user asked us to start it, so perhaps we should continue trying to start it.[2]

This could maybe be achieved by transitioning the instance to Failed instead. It's also potentially solvable by the idea @gjcolombo and I have discussed, where we store a target instance state alongside the current instance state and attempt to reconcile them when Something Happens in the Real World. That would also solve stuff like #6809, which is kind of the same weird behavior in the opposite direction (we were asked to stop an instance, upon doing so we discover it's gone away, and then we immediately try to restart it, which is Goofy).
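To make the target-state idea concrete, here is a minimal sketch of what such a reconciliation step might look like. Every name in it (`InstanceState`, `TargetState`, `Action`, `reconcile`) is hypothetical and invented for illustration; none of it is taken from the actual codebase, and the real state machine has more states than the three shown.

```rust
/// Hypothetical, simplified view of what the instance is actually doing.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum InstanceState {
    Stopped,
    Running,
    Failed,
}

/// Hypothetical record of what the user most recently asked for.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TargetState {
    Stopped,
    Running,
}

/// What a reconciler would do next (also hypothetical).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Action {
    None,
    Start,
    Stop,
}

/// Compare the current state with the stored target and pick an action.
/// `auto_restart` is assumed to be a per-instance setting.
pub fn reconcile(
    current: InstanceState,
    target: TargetState,
    auto_restart: bool,
) -> Action {
    match (current, target) {
        // The user wants it stopped and it is still running: stop it.
        (InstanceState::Running, TargetState::Stopped) => Action::Stop,
        // The user wants it stopped and it is already gone (stopped or
        // failed): do nothing. This is the #6809 case, where restarting
        // would be the Goofy behavior.
        (_, TargetState::Stopped) => Action::None,
        // Already where the user wants it.
        (InstanceState::Running, TargetState::Running) => Action::None,
        // The user wants it running but it is not: keep trying, but
        // only if auto-restart is enabled.
        (InstanceState::Stopped | InstanceState::Failed, TargetState::Running) => {
            if auto_restart {
                Action::Start
            } else {
                Action::None
            }
        }
    }
}
```

With something like this, an unwound instance-start saga could leave the instance in either Stopped or Failed without changing the outcome: the stored target of `Running` is what would drive the retry, not the state the unwind happened to pick.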

On the other hand, the question of what the Right Thing is in this situation is complicated by some of the reasons an instance-start saga may fail. In particular, it might fail because insufficient resources are available for the instance, in which case we probably should not retry starting it immediately, but should maybe try again in a little while if sufficient resources become available? Or perhaps not: maybe retrying the start of an instance that we couldn't find resources for should be an explicit user action? I feel a bit conflicted about that.
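One way to frame the open question is as a policy function that picks the post-unwind state from the failure reason. The sketch below is purely illustrative: `StartFailure`, `UnwindOutcome`, `unwind_outcome`, the `retry_on_capacity` flag, and the 60-second delay are all assumptions made up for this example, and the flag exists only so that both answers to the "should capacity failures retry?" question can be written down side by side.

```rust
use std::time::Duration;

/// Hypothetical coarse classification of why the start saga failed.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum StartFailure {
    InsufficientResources,
    Other,
}

/// Hypothetical result of unwinding: where the instance ends up.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum UnwindOutcome {
    /// Put it back in Stopped; starting again is an explicit user action.
    Stopped,
    /// Mark it Failed so that auto-restart may pick it up, optionally
    /// only after a delay.
    Failed { retry_after: Option<Duration> },
}

/// Arbitrary placeholder delay before retrying a capacity failure.
pub const CAPACITY_RETRY_DELAY: Duration = Duration::from_secs(60);

pub fn unwind_outcome(
    reason: StartFailure,
    auto_restart: bool,
    retry_on_capacity: bool,
) -> UnwindOutcome {
    // Without auto-restart there is nothing to retry, so restore the
    // pre-saga state, as the saga does today.
    if !auto_restart {
        return UnwindOutcome::Stopped;
    }
    match reason {
        // Open question from the issue: retry later, or require the
        // user to ask again? The flag selects between the two.
        StartFailure::InsufficientResources if retry_on_capacity => {
            UnwindOutcome::Failed { retry_after: Some(CAPACITY_RETRY_DELAY) }
        }
        StartFailure::InsufficientResources => UnwindOutcome::Stopped,
        // Any other failure: treat it like an instance that started
        // and then failed.
        StartFailure::Other => UnwindOutcome::Failed { retry_after: None },
    }
}
```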

Footnotes

  1. (more or less, modulo generation numbers &c)

  2. At least, if auto-restart is enabled.
