fix: Add skaffold internal error and return that instead of user cancelled #6846

tejal29 · 2021-11-11T21:31:01Z

Relates to #5424

The VSC nightly job is seeing a number of errors where the status check returns "USER_CANCELLED" even though the deployment failed.
The deployments could have failed because

a resource failed to stabilize within the given time.
In this case, the logic returns the exit code seen by Resource.resource, i.e. pods for deployments and statefulsets, and the KCC resource instance for KCC.

Previously, the code set USER_CANCELLED as the failure code. In this PR, I have made the following changes:

Only return USER_CANCELLED if the status checks for all resources were cancelled.
To detect this condition, add a cancelled count to the counter.
Add a new error code, STATUSCHECK_INTERNAL_ERROR, returned when a resource fails but we don't find an offending error code.
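
The decision rules described above can be sketched roughly as follows. This is only an illustration, not the code in this PR: the StatusCode type, its values, the result struct, and deployStatus are hypothetical stand-ins for Skaffold's proto.StatusCode and the real status-check logic.

```go
package main

import "fmt"

// StatusCode stands in for Skaffold's proto.StatusCode (assumption);
// the values below are illustrative only.
type StatusCode int

const (
	Success StatusCode = iota
	UserCancelled
	ContainerRestarting
	InternalError
)

// result is a hypothetical per-resource outcome of the status check.
type result struct {
	failed bool       // the resource did not stabilize
	code   StatusCode // the code reported for the resource, if any
}

// deployStatus derives one overall code from the per-resource results.
func deployStatus(results []result) StatusCode {
	cancelled := 0
	for _, r := range results {
		if r.code == UserCancelled {
			cancelled++
		}
	}
	// Report a cancellation only when every resource check was cancelled.
	if len(results) > 0 && cancelled == len(results) {
		return UserCancelled
	}
	for _, r := range results {
		if !r.failed {
			continue
		}
		// A failure without an offending code becomes an internal error.
		if r.code == Success {
			return InternalError
		}
		return r.code // first failure in slice order
	}
	return Success
}

func main() {
	mixed := []result{{true, ContainerRestarting}, {false, UserCancelled}}
	fmt.Println(deployStatus(mixed) == ContainerRestarting) // prints true
}
```

With one failed resource and one cancelled resource, the sketch reports the failure code rather than the cancellation, which is the behaviour the description asks for.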


codecov · 2021-11-11T21:56:50Z

Codecov Report

Merging #6846 (05a8b52) into main (290280e) will decrease coverage by 1.35%.
The diff coverage is 64.65%.

@@            Coverage Diff             @@
##             main    #6846      +/-   ##
==========================================
- Coverage   70.48%   69.12%   -1.36%     
==========================================
  Files         515      547      +32     
  Lines       23150    25078    +1928     
==========================================
+ Hits        16317    17335    +1018     
- Misses       5776     6577     +801     
- Partials     1057     1166     +109

Impacted Files	Coverage Δ
cmd/skaffold/app/cmd/deploy.go	`52.00% <ø> (-1.85%)`	⬇️
cmd/skaffold/app/cmd/dev.go	`84.61% <0.00%> (ø)`
cmd/skaffold/app/cmd/flags.go	`91.00% <0.00%> (+0.18%)`	⬆️
cmd/skaffold/app/cmd/render.go	`36.66% <0.00%> (-4.72%)`	⬇️
cmd/skaffold/skaffold.go	`0.00% <0.00%> (ø)`
cmd/skaffold/app/cmd/inspect_tests.go	`62.50% <14.28%> (-1.14%)`	⬇️
cmd/skaffold/app/cmd/lint.go	`42.85% <42.85%> (ø)`
cmd/skaffold/app/cmd/find_configs.go	`48.88% <50.00%> (+0.24%)`	⬆️
cmd/skaffold/app/skaffold.go	`76.19% <70.00%> (-8.43%)`	⬇️
cmd/skaffold/app/cmd/inspect_build_env.go	`65.11% <75.00%> (+6.39%)`	⬆️
... and 161 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

pkg/skaffold/kubernetes/status/status_check.go

aaron-prindle

LGTM

tejal29 · 2021-11-12T03:27:31Z

pkg/skaffold/kubernetes/status/status_check.go

@@ -243,8 +246,7 @@ func (s *monitor) statusCheck(ctx context.Context, out io.Writer) (proto.StatusC

 	// Wait for all deployment statuses to be fetched
 	wg.Wait()
-	cancel()


There is already a defer cancel() at L223, so this explicit call is redundant.

tejal29 · 2021-11-12T03:51:27Z

==== Testing Notes =====

Return a status code other than success for an unstable deployment

$ LOCAL=true make && cd integration/testdata/unstable-deployment

$ ../../../out/skaffold dev -d gcr.io/tejal-gke1 -v=debug
DEBU[0010] Fetching logs for container unstable-deployment-7cbcd465c7-z898k/incorrect-example  subtask=-1 task=DevLoop
 - deployment/unstable-deployment: BackOff: Back-off restarting failed container
    - pod/unstable-deployment-7cbcd465c7-z898k: BackOff: Back-off restarting failed container
DEBU[0011] marking resource failed due to error code STATUSCHECK_CONTAINER_RESTARTING  subtask=-1 task=Deploy
 - deployment/unstable-deployment: container incorrect-example is backing off waiting to restart
    - pod/unstable-deployment-7cbcd465c7-z898k: container incorrect-example is backing off waiting to restart
 - deployment/unstable-deployment failed. Error: container incorrect-example is backing off waiting to restart.
DEBU[0011] setting skaffold deploy status to STATUSCHECK_CONTAINER_RESTARTING.  subtask=-1 task=Deploy

Return the cancelled status code when you hit Ctrl+C during the status check phase.

$ ../../../out/skaffold dev -d gcr.io/tejal-gke1 -v=debug

INFO[0004] Deploy completed in 1.661 second              subtask=-1 task=Deploy
Waiting for deployments to stabilize...
DEBU[0004] getting client config for kubeContext: `gke_tejal-gke1_us-central1-c_dump-apis`  subtask=-1 task=DevLoop
DEBU[0004] getting client config for kubeContext: `gke_tejal-gke1_us-central1-c_dump-apis`  subtask=-1 task=DevLoop
DEBU[0005] checking status deployment/unstable-deployment  subtask=-1 task=Deploy
^CDEBU[0005] marking resource status check cancelledSTATUSCHECK_USER_CANCELLED  subtask=-1 task=Deploy
DEBU[0005] set


@aprindle

Code changed after @aprindle reviewed

briandealwis · 2021-11-12T22:29:05Z

pkg/skaffold/kubernetes/status/status_check.go


 	for _, d := range resources {
 		wg.Add(1)
 		go func(r *resource.Resource) {
 			defer wg.Done()
 			// keep updating the resource status until it fails/succeeds/times out
 			pollResourceStatus(ctx, s.cfg, r)
-			rcCopy := c.markProcessed(r.Status().Error())
+			rcCopy := c.markProcessed(ctx, r.StatusCode())


Can we move the logic here and in getSkaffoldDeployStatus() into counter? This scattering of logic is awkward:

1. markProcessed can check if ctx.Err() == context.Canceled to determine if the user cancelled

2. have markProcessed return (counter, failed bool) to replace the resourceFailed() check on line 235

3. move the exitStatusCode determination into counter

4. line 249 would return c.result()

We can't have markProcessed check ctx.Err() == context.Canceled to determine whether the user cancelled, since we fail fast and cancel the context ourselves when a deployment fails.
So at this point, it's difficult to tell whether the user hit Ctrl+C or the code cancelled the status checks for all resources.

I had suggestions 2-4 implemented, but then the unit tests became more complex. Maybe the counter needs to be renamed to deployState.

OK, so I remember why we can't have the counter store the exitStatusCode. The counter gets updated in a goroutine.
The exit code is stored in the resources. It makes sense to declare another mutex to update the exit status code safely.

OK,
I have addressed 1 and 2 from the suggestions.
For 3 and 4:

I have declared an exitStatusCode in checkStatus and updated the code to only capture the first failure.

I kept getDeployStatus as is.
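
As a rough illustration of the shape discussed in this thread, the sketch below has markProcessed return a snapshot plus a failed flag, and records the first failure code under its own mutex. All names here (snapshot, counter, firstFailure, run, and the StatusCode values) are assumptions made for the example and do not mirror the merged code.

```go
package main

import (
	"fmt"
	"sync"
)

// StatusCode stands in for proto.StatusCode (assumption).
type StatusCode int

const (
	Success StatusCode = iota
	UserCancelled
	ContainerRestarting
)

// snapshot is a lock-free copy of the counter's fields.
type snapshot struct{ total, pending, failed, cancelled int }

type counter struct {
	mu sync.Mutex
	snapshot
}

// markProcessed records one result and reports whether it was a failure.
// A cancelled check is counted separately and is not treated as a failure.
func (c *counter) markProcessed(code StatusCode) (snapshot, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.pending--
	failed := false
	switch code {
	case Success:
	case UserCancelled:
		c.cancelled++
	default:
		c.failed++
		failed = true
	}
	return c.snapshot, failed
}

// firstFailure keeps only the first recorded failure code.
type firstFailure struct {
	mu   sync.Mutex
	code StatusCode
	set  bool
}

func (f *firstFailure) record(code StatusCode) {
	f.mu.Lock()
	defer f.mu.Unlock()
	if !f.set {
		f.code, f.set = code, true
	}
}

// run processes each code in its own goroutine, as the status check does
// per resource, and returns the final counts and the first failure code.
func run(codes []StatusCode) (snapshot, StatusCode) {
	c := &counter{snapshot: snapshot{total: len(codes), pending: len(codes)}}
	first := &firstFailure{}
	var wg sync.WaitGroup
	for _, code := range codes {
		wg.Add(1)
		go func(code StatusCode) {
			defer wg.Done()
			if _, failed := c.markProcessed(code); failed {
				first.record(code)
			}
		}(code)
	}
	wg.Wait()
	return c.snapshot, first.code
}

func main() {
	snap, code := run([]StatusCode{Success, ContainerRestarting, UserCancelled})
	fmt.Printf("%+v first=%d\n", snap, code)
}
```

When several resources fail concurrently, which code is recorded first depends on goroutine scheduling, so a test against this sketch should use a single failing resource.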

briandealwis · 2021-11-15T18:15:11Z

pkg/skaffold/kubernetes/status/status_check.go

-			if r.Status().Error() != nil && r.StatusCode() != proto.StatusCode_STATUSCHECK_USER_CANCELLED {
+			// if a resource fails, cancel status checks for all resources to fail fast
+			// and capture the first failed exit code.
+			if failed {


It's odd that cancel is not considered a failure; worth documenting here.

briandealwis · 2021-11-15T18:23:08Z

pkg/skaffold/kubernetes/status/status_check.go

-			return r.StatusCode(), err
-		}
+	if sc == proto.StatusCode_STATUSCHECK_SUCCESS || sc == 0 {
+		log.Entry(ctx).Debugf("found statuscode %s. setting skaffold deploy status to STATUSCHECK_INTERNAL_ERROR.", sc)


I wonder if we should have a panicIfDev() function that will fail for non-release builds.
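
One possible shape for such a helper, as a sketch only: buildType is a hypothetical package-level string that a release build could override at link time (for example with the Go linker's -X flag), and panicIfDev is the name floated in the comment above, not an existing Skaffold function.

```go
package main

import (
	"fmt"
	"log"
)

// buildType is hypothetical; a release build could set it with
// -ldflags "-X main.buildType=release".
var buildType = "dev"

// panicIfDev panics on non-release builds so that internal errors surface
// during development, and only logs them on release builds.
func panicIfDev(msg string) {
	if buildType != "release" {
		panic(msg)
	}
	log.Printf("internal error: %s", msg)
}

func main() {
	buildType = "release"
	panicIfDev("unexpected success status code on failed deployment")
	fmt.Println("release build kept running")
}
```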

proto/enums/enums.proto

fix: Add skaffold internal error and return that instead of user cancelled

fda2a58

tejal29 requested a review from a team as a code owner November 11, 2021 21:31

tejal29 requested a review from aaron-prindle November 11, 2021 21:31

pull-request-size bot added the size/XL label Nov 11, 2021

google-cla bot added the cla: yes label Nov 11, 2021

tejal29 commented Nov 11, 2021

aaron-prindle previously approved these changes Nov 11, 2021

tejal29 force-pushed the fix_user_cancelled_sc branch 2 times, most recently from e02bf37 to d860795 Compare November 12, 2021 03:13

change logic to detect skaffold deplpy status

3bf856f

tejal29 force-pushed the fix_user_cancelled_sc branch from d860795 to ea7d1b8 Compare November 12, 2021 03:25

fix lint

a991781

tejal29 force-pushed the fix_user_cancelled_sc branch from ea7d1b8 to a991781 Compare November 12, 2021 03:26

tejal29 commented Nov 12, 2021

rever intended change

f6cdeed

tejal29 added 3 commits November 11, 2021 19:53

fix tests

5e5a547

fix race condition between deployment recorded as successful and pod statuses being updated

60127d3

fix lint

0b072af

fix struct lint

1d2c886

briandealwis reviewed Nov 12, 2021

tejal29 added 2 commits November 12, 2021 15:56

code review comments

221c085

fix lint size of struct

05a8b52

briandealwis approved these changes Nov 15, 2021

tejal29 merged commit 9e8762f into GoogleContainerTools:main Nov 15, 2021

gsquared94 mentioned this pull request Nov 16, 2021

Status Check misleading error #5424

Closed
