fix: Add skaffold internal error and return that instead of user cancelled #6846

Merged
merged 10 commits into from
Nov 15, 2021

Conversation

tejal29
Contributor

@tejal29 tejal29 commented Nov 11, 2021

Relates to #5424

The VSC nightly job is seeing a bunch of errors where the status check returns "USER_CANCELLED" when the deployment failed.
The deployments could have failed because:

  1. A resource failed to stabilize within the given time.
    In this case, the logic returns the exit code seen by Resource.resource, i.e. pods in the case of deployments and statefulsets, or the KCC resource instance in the case of KCC.

Previously, the code set USER_CANCELLED as the failure code. In this PR, I have made the following changes:

  1. Only return USER_CANCELLED if the status checks for all resources were cancelled.
    To detect this condition, add cancelled to the counter.

  2. Add a new error code, STATUSCHECK_INTERNAL_ERROR, for when a resource fails but we don't find an offending error code.
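
The two changes above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `StatusCode`, its constants, and `deployStatus` are hypothetical stand-ins for skaffold's `proto.StatusCode` values and the real status-check logic, and treating a mix of successful and cancelled resources as an internal error is an assumption made for the example.

```go
package main

import "fmt"

// StatusCode is a simplified, hypothetical stand-in for skaffold's proto.StatusCode.
type StatusCode int

const (
	Success StatusCode = iota
	UserCancelled
	ContainerRestarting
	InternalError
)

// deployStatus derives one overall exit code from the per-resource codes.
// (Hypothetical helper; the real logic lives in the status monitor.)
func deployStatus(codes []StatusCode) StatusCode {
	cancelled := 0
	for _, c := range codes {
		switch c {
		case Success:
			// nothing to record
		case UserCancelled:
			cancelled++
		default:
			// the first offending error code wins
			return c
		}
	}
	switch {
	case cancelled == 0:
		return Success
	case cancelled == len(codes):
		// every status check was cancelled: treat it as a user cancel
		return UserCancelled
	default:
		// assumption: something was cancelled, but no resource carries
		// an offending code, so report an internal error instead
		return InternalError
	}
}

func main() {
	// all resources cancelled
	fmt.Println(deployStatus([]StatusCode{UserCancelled, UserCancelled}) == UserCancelled)
	// one real failure, the rest cancelled by fail-fast
	fmt.Println(deployStatus([]StatusCode{UserCancelled, ContainerRestarting}) == ContainerRestarting)
	// cancelled, but no offending code found
	fmt.Println(deployStatus([]StatusCode{Success, UserCancelled}) == InternalError)
}
```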

@tejal29 tejal29 requested a review from a team as a code owner November 11, 2021 21:31
@google-cla google-cla bot added the cla: yes label Nov 11, 2021
codecov bot commented Nov 11, 2021

Codecov Report

Merging #6846 (05a8b52) into main (290280e) will decrease coverage by 1.35%.
The diff coverage is 64.65%.

@@            Coverage Diff             @@
##             main    #6846      +/-   ##
==========================================
- Coverage   70.48%   69.12%   -1.36%     
==========================================
  Files         515      547      +32     
  Lines       23150    25078    +1928     
==========================================
+ Hits        16317    17335    +1018     
- Misses       5776     6577     +801     
- Partials     1057     1166     +109     
Impacted Files Coverage Δ
cmd/skaffold/app/cmd/deploy.go 52.00% <ø> (-1.85%) ⬇️
cmd/skaffold/app/cmd/dev.go 84.61% <0.00%> (ø)
cmd/skaffold/app/cmd/flags.go 91.00% <0.00%> (+0.18%) ⬆️
cmd/skaffold/app/cmd/render.go 36.66% <0.00%> (-4.72%) ⬇️
cmd/skaffold/skaffold.go 0.00% <0.00%> (ø)
cmd/skaffold/app/cmd/inspect_tests.go 62.50% <14.28%> (-1.14%) ⬇️
cmd/skaffold/app/cmd/lint.go 42.85% <42.85%> (ø)
cmd/skaffold/app/cmd/find_configs.go 48.88% <50.00%> (+0.24%) ⬆️
cmd/skaffold/app/skaffold.go 76.19% <70.00%> (-8.43%) ⬇️
cmd/skaffold/app/cmd/inspect_build_env.go 65.11% <75.00%> (+6.39%) ⬆️
... and 161 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 01b8d5c...05a8b52.

aaron-prindle
aaron-prindle previously approved these changes Nov 11, 2021
Contributor

@aaron-prindle aaron-prindle left a comment

LGTM

@tejal29 tejal29 force-pushed the fix_user_cancelled_sc branch 2 times, most recently from e02bf37 to d860795 Compare November 12, 2021 03:13
@tejal29 tejal29 force-pushed the fix_user_cancelled_sc branch from d860795 to ea7d1b8 Compare November 12, 2021 03:25
@tejal29 tejal29 force-pushed the fix_user_cancelled_sc branch from ea7d1b8 to a991781 Compare November 12, 2021 03:26
@@ -243,8 +246,7 @@ func (s *monitor) statusCheck(ctx context.Context, out io.Writer) (proto.StatusC

// Wait for all deployment statuses to be fetched
wg.Wait()
cancel()
Contributor Author

There is a defer cancel at L223

@tejal29
Contributor Author

tejal29 commented Nov 12, 2021

==== Testing Notes =====

  1. Return status code other than success for unstable deployment
$ LOCAL=true make && cd integration/testdata/unstable-deployment
$ ../../../out/skaffold dev -d gcr.io/tejal-gke1 -v=debug
DEBU[0010] Fetching logs for container unstable-deployment-7cbcd465c7-z898k/incorrect-example  subtask=-1 task=DevLoop
 - deployment/unstable-deployment: BackOff: Back-off restarting failed container
    - pod/unstable-deployment-7cbcd465c7-z898k: BackOff: Back-off restarting failed container
DEBU[0011] marking resource failed due to error code STATUSCHECK_CONTAINER_RESTARTING  subtask=-1 task=Deploy
 - deployment/unstable-deployment: container incorrect-example is backing off waiting to restart
    - pod/unstable-deployment-7cbcd465c7-z898k: container incorrect-example is backing off waiting to restart
 - deployment/unstable-deployment failed. Error: container incorrect-example is backing off waiting to restart.
DEBU[0011] setting skaffold deploy status to STATUSCHECK_CONTAINER_RESTARTING.  subtask=-1 task=Deploy

  2. Return status code cancelled when you hit Ctrl+C in the status check phase.
$ ../../../out/skaffold dev -d gcr.io/tejal-gke1 -v=debug

INFO[0004] Deploy completed in 1.661 second              subtask=-1 task=Deploy
Waiting for deployments to stabilize...
DEBU[0004] getting client config for kubeContext: `gke_tejal-gke1_us-central1-c_dump-apis`  subtask=-1 task=DevLoop
DEBU[0004] getting client config for kubeContext: `gke_tejal-gke1_us-central1-c_dump-apis`  subtask=-1 task=DevLoop
DEBU[0005] checking status deployment/unstable-deployment  subtask=-1 task=Deploy
^CDEBU[0005] marking resource status check cancelledSTATUSCHECK_USER_CANCELLED  subtask=-1 task=Deploy
DEBU[0005] set

@tejal29 tejal29 dismissed aaron-prindle’s stale review November 12, 2021 21:41

Code changed after @aprindle reviewed


 for _, d := range resources {
 	wg.Add(1)
 	go func(r *resource.Resource) {
 		defer wg.Done()
 		// keep updating the resource status until it fails/succeeds/times out
 		pollResourceStatus(ctx, s.cfg, r)
-		rcCopy := c.markProcessed(r.Status().Error())
+		rcCopy := c.markProcessed(ctx, r.StatusCode())
Member

Can we move the logic here and in getSkaffoldDeployStatus() into counter? This scattering of logic is awkward:

  • markProcessed can check if ctx.Err() == context.Canceled to determine if the user cancelled
  • have markProcessed return (counter, failed bool) to replace the resourceFailed() check on line 235
  • move the exitStatusCode determination into counter
  • line 249 would return c.result()

Contributor Author

We can't have markProcessed check ctx.Err() == context.Canceled to determine whether the user cancelled, since we fail fast and cancel the context if a deployment fails.
So at this point, it's difficult to tell whether the user hit Ctrl+C or the code cancelled the status checks for all resources.

Contributor Author

I had suggestions 2-4 implemented, but then the unit tests were becoming more complex. Maybe the counter needs to be renamed to deployState.

Contributor Author

OK, so I remember why we can't have the counter store the exitStatusCode: the counter gets updated in a goroutine.
The exit code is stored in resources. It makes sense to declare another mutex to update the exit status code safely.

Contributor Author

OK,
I have addressed 1 and 2 from the suggestions.
For 3 and 4:

  • I have declared an exitStatusCode in checkStatus and updated the code to capture only the first failure.
  • I kept getDeployStatus as is.
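
Capturing only the first failure from several goroutines could look roughly like the sketch below. It is an illustration under assumptions, not the code merged here: `firstFailure`, `record`, and `get` are hypothetical names, and `StatusCode` is a simplified stand-in for `proto.StatusCode`.

```go
package main

import (
	"fmt"
	"sync"
)

// StatusCode is a simplified, hypothetical stand-in for skaffold's proto.StatusCode.
type StatusCode int

const (
	Success StatusCode = iota
	ContainerRestarting
	ImagePullErr
)

// firstFailure keeps the first non-success code reported by any goroutine.
type firstFailure struct {
	mu   sync.Mutex
	code StatusCode
	set  bool
}

// record stores c only if it is a failure and nothing was stored before.
func (f *firstFailure) record(c StatusCode) {
	if c == Success {
		return
	}
	f.mu.Lock()
	defer f.mu.Unlock()
	if !f.set {
		f.code, f.set = c, true
	}
}

// get returns the captured code and whether any failure was recorded.
func (f *firstFailure) get() (StatusCode, bool) {
	f.mu.Lock()
	defer f.mu.Unlock()
	return f.code, f.set
}

func main() {
	var f firstFailure
	var wg sync.WaitGroup
	for _, c := range []StatusCode{Success, ContainerRestarting, ImagePullErr} {
		wg.Add(1)
		go func(c StatusCode) {
			defer wg.Done()
			f.record(c) // safe to call concurrently
		}(c)
	}
	wg.Wait()
	code, failed := f.get()
	// Which failure is recorded first depends on goroutine scheduling.
	fmt.Println(code, failed)
}
```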

if r.Status().Error() != nil && r.StatusCode() != proto.StatusCode_STATUSCHECK_USER_CANCELLED {
// if a resource fails, cancel status checks for all resources to fail fast
// and capture the first failed exit code.
if failed {
Member

It's odd that cancel is not considered a failure — worth documenting here.

return r.StatusCode(), err
}
if sc == proto.StatusCode_STATUSCHECK_SUCCESS || sc == 0 {
log.Entry(ctx).Debugf("found statuscode %s. setting skaffold deploy status to STATUSCHECK_INTERNAL_ERROR.", sc)
Member

I wonder if we should have a panicIfDev() function that will fail for non-release builds.
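
A very rough sketch of what such a helper might look like, purely as an assumption: `panicIfDev` and `buildType` are hypothetical names, nothing like this exists in the PR, and the idea of overriding `buildType` at link time with `-ldflags "-X main.buildType=release"` is just one way a release build could be marked.

```go
package main

import "fmt"

// buildType is a hypothetical build marker; a release build could override it
// at link time, e.g. with -ldflags "-X main.buildType=release".
var buildType = "dev"

// panicIfDev panics in non-release builds so that unexpected states are caught
// early, and only reports the problem in release builds.
func panicIfDev(format string, args ...interface{}) {
	msg := fmt.Sprintf(format, args...)
	if buildType != "release" {
		panic(msg)
	}
	fmt.Println("internal error:", msg)
}

func main() {
	// Pretend this is a release build so the example does not panic.
	buildType = "release"
	panicIfDev("found status code %d for a failed resource", 0)
}
```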

@tejal29 tejal29 merged commit 9e8762f into GoogleContainerTools:main Nov 15, 2021
3 participants