-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: Add skaffold internal error and return that instead of user cancelled #6846
fix: Add skaffold internal error and return that instead of user cancelled #6846
Conversation
Codecov Report
@@ Coverage Diff @@
## main #6846 +/- ##
==========================================
- Coverage 70.48% 69.12% -1.36%
==========================================
Files 515 547 +32
Lines 23150 25078 +1928
==========================================
+ Hits 16317 17335 +1018
- Misses 5776 6577 +801
- Partials 1057 1166 +109
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
e02bf37
to
d860795
Compare
d860795
to
ea7d1b8
Compare
ea7d1b8
to
a991781
Compare
@@ -243,8 +246,7 @@ func (s *monitor) statusCheck(ctx context.Context, out io.Writer) (proto.StatusC | |||
|
|||
// Wait for all deployment statuses to be fetched | |||
wg.Wait() | |||
cancel() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a defer cancel
at L223
==== Testing Notes =====
|
Code changed after @aprindle reviewed
|
||
for _, d := range resources { | ||
wg.Add(1) | ||
go func(r *resource.Resource) { | ||
defer wg.Done() | ||
// keep updating the resource status until it fails/succeeds/times out | ||
pollResourceStatus(ctx, s.cfg, r) | ||
rcCopy := c.markProcessed(r.Status().Error()) | ||
rcCopy := c.markProcessed(ctx, r.StatusCode()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we move the logic here and in getSkaffoldDeployStatus()
into counter
? This scattering of logic is awkward:
markProcessed
can check ifctx.Err() == context.Canceled
to determine if the user cancelled- have
markProcessed
return(counter, failed bool)
to replace theresourceFailed()
check on line 235 - move the exitStatusCode determination into
counter
- line 249 would
return c.result()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can't use markProcessed can check if ctx.Err() == context.Canceled to determine if the user cancelled
since, we fail fast and cancel context if a deployment failed.
So at this point, its difficult to detect if a user hit cntrl C Vs, code cancelling all the status check for all resources.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had suggestions 2-4 implemented but then the unit tests were becoming more complex. Maybe the counter
need to be renamed to deployState
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, so i remember, why we cant have counter store the exitStatusCode
. The counter gets updated in a go routine.
The exit code is stored in resources
. It makes sense to declare another mutex to update the exit status code safely.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok,
I am addressed 1 and 2 from the suggestions.
For 3, 4:
- I have declared a exitStatusCode in
checkStatus
and updated the code to only capture first failure. - kept the
getDeployStatus
as is.
if r.Status().Error() != nil && r.StatusCode() != proto.StatusCode_STATUSCHECK_USER_CANCELLED { | ||
// if a resource fails, cancel status checks for all resources to fail fast | ||
// and capture the first failed exit code. | ||
if failed { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's odd that cancel is not considered a failure — worth documenting here.
return r.StatusCode(), err | ||
} | ||
if sc == proto.StatusCode_STATUSCHECK_SUCCESS || sc == 0 { | ||
log.Entry(ctx).Debugf("found statuscode %s. setting skaffold deploy status to STATUSCHECK_INTERNAL_ERROR.", sc) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we should have a panicIfDev()
function that will fail for non-release builds.
Relates to #5424
VSC nightly job is seeing a bunch of errors where status check returns "USER_CANCELLED" when the deployment failed.
The deployments could have failed due to
In this case, the logic returns the exit code seen by
Resource.resource
i.e. pods in case of deployments and statefulset, KCC resource Instance in case of KCC.Previously, the code set
USER_CANCELLED
as failure code. In this PR, I have done the following changesCode change to only return
USER_CANCELLED
if all resources status check was cancelled.To detect this condition add
cancelled
to the counter.Add a new error code
STATUSCHECK_INTERNAL_ERROR
when resource fails but we don't find an offending error code.