-
Notifications
You must be signed in to change notification settings - Fork 917
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Eager workflow dispatch #3835
Eager workflow dispatch #3835
Conversation
// The current workflow task is not inflight or not the first task or we exceeded the first attempt and fell back to | ||
// matching based dispatch. | ||
if !mutableStateInfo.hasInflight || mutableStateInfo.workflowTaskInfo.StartedEventID != 3 || mutableStateInfo.workflowTaskInfo.Attempt > 1 { | ||
return nil, serviceerror.NewWorkflowExecutionAlreadyStarted( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why return error here? Start workflow is still a success.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because we cannot return an eager task.
Caller should be notified with an error and handle this case.
I think it's clearer than returning nil task.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm on the fence whether this should result in an error or not.
In any case what the caller should do is get a handle to the workflow and wait for its completion.
With the error approach there's at least a way to let the caller know what happened.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Spikhalskiy @cretz I could use your opinion here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Personally, I think it's clearer/safer to consider eagerness to be a request of the server, not a requirement. The server is allowed to use whatever heuristics it wants to deny the request and do a non-eager and it still be a successful task. The absence of a task in the response I think is enough to tell the caller the server denied the request.
I understand your use case is a "require to be eager" but that use case isn't supported for activities either. If we want a require-to-be-eager to be a thing, it should be a thing on both. I could support a server-side/namespace option of "fail if eager requested but cannot be given".
(my opinion is not super strong here...if we agree that eager workflow tasks are a server requirement not a request, we can just doc clearly and then error if it cannot be granted)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hrmm...k.
So if I call this start, may a workflow start because of it? If so, IMO that should never error even if something post-start can't happen. To me success means "workflow started because of this request" and failure means "workflow did not start because of this request". Unsure if that's related. Sorry, not familiar w/ details here so I don't have a big opinion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If a client gets this error it means that a retried request came in too late and the workflow task cannot be dispatched eagerly (likely due to it already being dispatched via the standard, matching based, path).
The only thing an error gives you over the alternative where the server responds successfully but omits the inline task is that additional piece of information.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Simply - if success means workflow started by this request and failure means workflow not started by request, works for me. If failure can mean workflow still started by this request, that's confusing to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here the workflow was started from a previous attempt of the same request. So it seems like you're saying it should not be an error.
I tend to agree but I will add a log to avoid losing some of this information.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ended up returning successful result and recording a metric saying eager execution was denied with a reason tag.
1e7342a
to
593a669
Compare
TaskToken: serializedToken, | ||
WorkflowExecution: &commonpb.WorkflowExecution{WorkflowId: workflowID, RunId: runID}, | ||
WorkflowType: request.GetWorkflowType(), | ||
PreviousStartedEventId: 0, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: ideally we should get the value from mutable state.
return nil, err | ||
} | ||
|
||
// If first workflow task should back off (e.g. cron or workflow retry) a workflow task will not be scheduled. | ||
if requestEagerExecution && scheduledEventID != 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: for readability use newMutableState.HasPendingWorkflowTask()?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I find it more readable but I'm fine with making this change
49d58ce
to
5b10c6c
Compare
5b10c6c
to
9a3a16e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we abstract an ExecutionStrategy
interface here in a separate PR to keep things SOLID and make the review easier? The code looks good to me, but it's hard to review with all of the modifications to the existing code.
common/metrics/tags.go
Outdated
|
||
// ReasonTag is a generic tag can be used anywhere a reason is needed. | ||
// Make sure that the value is of limited cardinality. | ||
func ReasonTag(value string) Tag { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's take in an opaque enum type here to prevent misuse. I think the safety benefit outweighs the cost of the tediousness involved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I follow, do you mean defining a type like:
// ReasonString is just a string but used to remind anyone using ReasonTag to limit the cardinality of the possible reasons.
type ReasonString string
If that's the case, I see little benefit to that over documenting the ask to limit the cardinality of values.
I already have a custom string enum type where this is used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I ended up adding this.
@@ -99,16 +99,32 @@ func NewWorkflowWithSignal( | |||
return nil, err | |||
} | |||
} | |||
|
|||
requestEagerExecution := startRequest.StartRequest.GetRequestEagerExecution() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the first usage of startRequest.StartRequest
that I see in this function, so I'm worried about potential NPEs. How do we know this is always non-nil?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't, we count on this request to be validated before this method is called.
@@ -109,7 +109,7 @@ type ( | |||
AddWorkflowTaskCompletedEvent(int64, int64, *workflowservice.RespondWorkflowTaskCompletedRequest, int) (*historypb.HistoryEvent, error) | |||
AddWorkflowTaskFailedEvent(scheduledEventID int64, startedEventID int64, cause enumspb.WorkflowTaskFailedCause, failure *failurepb.Failure, identity, binChecksum, baseRunID, newRunID string, forkEventVersion int64) (*historypb.HistoryEvent, error) | |||
AddWorkflowTaskScheduleToStartTimeoutEvent(int64) (*historypb.HistoryEvent, error) | |||
AddFirstWorkflowTaskScheduled(*historypb.HistoryEvent) error | |||
AddFirstWorkflowTaskScheduled(event *historypb.HistoryEvent, bypassTaskGeneration bool) (int64, error) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd split this into two separate methods instead of adding a flag argument
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The interface already has way too many methods IMHO.
} | ||
|
||
// prepare applies request overrides, validates the request, and records eager execution metrics. | ||
func (s *Starter) prepare(ctx context.Context) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider parsing the proto into a new StartRequest type instead of modifying the request: https://lexi-lambda.github.io/blog/2019/11/05/parse-don-t-validate/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this something that is commonly don't in Go?
If this was JS or a functional language or a language that has the concept of immutable data I would totally clone the request.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm thinking of the same pattern we use for tasks here: https://github.com/temporalio/temporal/blob/66db3aebeead81869dcc864e0f174063b167ee10/common/persistence/serialization/task_serializer.go
Basically taking the proto and parsing it into a plain Go object with the fields already validated and parsed into more structured types.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I conceptually agree here but since the raw protos are used in the internal APIs, parsing does not make sense in this case.
|
||
// creationContext is a container for all information obtained from creating the uncommitted execution. | ||
// The information is later used to create a new execution and handle conflicts. | ||
type creationContext struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I'd rename this to creationParams to avoid having another somethingCtx param floating around because devs will be unsure whether it embeds an actual Context or not
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SGTM
metricsHandler.Counter(metrics.WorkflowEagerExecutionDeniedCounter.GetMetricName()).Record( | ||
1, | ||
metrics.NamespaceTag(s.namespace.Name().String()), | ||
metrics.TaskQueueTag(s.request.StartRequest.TaskQueue.Name), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is another place where using a parsed type instead of the raw StartRequest is better because it prevents Law of Demeter violations here and elsewhere
if err == nil { | ||
return s.generateResponse(creationCtx.runID, creationCtx.workflowTaskInfo, extractHistoryEvents(creationCtx.workflowEventBatches)) | ||
} | ||
t, ok := err.(*persistence.CurrentWorkflowConditionFailedError) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use Errors.As
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I copied this from the original implementation but yes, As is better here.
// The history and mutable state we generated above should be deleted by a background process. | ||
return s.handleConflict(ctx, creationCtx, t) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens if we crash before reaching this line?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand the question
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For this comment:
// The history and mutable state we generated above should be deleted by a background process.
Is handleConflict
the method that deletes them?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
WT state machine changes are not in conflict with workflow update changes.
opTag := tag.WorkflowActionWorkflowTaskScheduled | ||
if err := ms.checkMutability(opTag); err != nil { | ||
return err | ||
return 0, err |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is common.EmptyEventID
(which is 0
) and it fits perfectly here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
service/history/handler.go
Outdated
if err != nil { | ||
return nil, h.convertError(err) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This error check needs to be inside if
block.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It probably doesn't matter much but you're right.
16445ad
to
e0a69c2
Compare
Implement eager workflow dispatch for
StartWorkflowExecution
.I've only added a single unit test to the repo, the rest are in this PR in the
features
repo.Added a counter metric
workflow_eager_execution
to count the number of eager execution requests per namespace + task queue.I've added a TODO to add support for eager signal with start, I figured that can be added later.