Don't fail immediately on missing resources. #3385

Merged
merged 3 commits into from
Oct 20, 2020

Conversation

mattmoor
Member

Changes

As the PipelineRun reconciler executes, it resolves resources using the informer's lister cache. Currently, when that cache is behind, the PipelineRun fails immediately. This change builds in a grace period, resources.MinimumAge, and a helper, resources.IsYoung, that skip the immediate failure for young runs and instead return the error to the controller framework, which requeues the key for later processing (with backoff).

Fixes: #3378

Submitter Checklist

These are the criteria that every PR should meet, please check them off as you
review them:

  • Includes tests (if functionality changed/added)
  • Includes docs (if user facing)
  • Commit messages follow commit message best practices
  • Release notes block has been filled in or deleted (only if no user facing changes)

See the contribution guide for more details.

Double check this list of stuff that's easy to miss:

Reviewer Notes

If API changes are included, additive changes must be approved by at least two OWNERS and backwards incompatible changes must be approved by more than 50% of the OWNERS, and they must first be added in a backwards compatible way.

Release Notes

Add a grace period for resources to appear before failing *Runs

@tekton-robot tekton-robot added the release-note Denotes a PR that will be considered when it comes time to generate release notes. label Oct 14, 2020
@tekton-robot tekton-robot requested review from afrittoli and a user October 14, 2020 16:21
@tekton-robot tekton-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Oct 14, 2020
@tekton-robot
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

| File | Old Coverage | New Coverage | Delta |
| --- | --- | --- | --- |
| pkg/reconciler/resources.go | Do not exist | 100.0% | |

@@ -196,7 +196,7 @@ func GetResourcesFromBindings(pr *v1beta1.PipelineRun, getResource resources.Get
for _, resource := range pr.Spec.Resources {
r, err := resources.GetResourceFromBinding(resource, getResource)
if err != nil {
return rs, fmt.Errorf("error following resource reference for %s: %w", resource.Name, err)
Member Author

This wrapping wasn't adding anything over the not found error, and it was obscuring our ability to use kerrors.IsNotFound. I added logic to the unit test to confirm our ability to check that.

@@ -67,7 +66,7 @@ func ResolveTaskResources(ts *v1beta1.TaskSpec, taskName string, kind v1beta1.Ta
rr, err := GetResourceFromBinding(r.PipelineResourceBinding, gr)

if err != nil {
return nil, fmt.Errorf("couldn't retrieve referenced output PipelineResource: %w", err)
Member Author

The same applies here as above.

@mattmoor
Member Author

/kind bug

@tekton-robot tekton-robot added the kind/bug Categorizes issue or PR as related to a bug. label Oct 14, 2020
@mattmoor
Member Author

/test check-pr-has-kind-label

Member

@imjasonh imjasonh left a comment

/lgtm

@tekton-robot tekton-robot added the lgtm Indicates that a PR is ready to be merged. label Oct 14, 2020
@mattmoor
Member Author

Member

@afrittoli afrittoli left a comment

Thanks for this, it sounds like a good idea.

This benefits use cases where the task / taskrun (as well as the pipeline / pipelinerun) are created at the same time, and it adds a bit of pain when authoring pipelines, as typos and other similar issues will take 30s longer to emerge; it's probably a fair trade.

Eventually we might want to expose the definition of "young" in a config or start-up flag, so that it can be set to zero in development environments.

I wonder how this will interact with tekton bundles? @pierretasci @vdemeester

@tekton-robot tekton-robot removed the lgtm Indicates that a PR is ready to be merged. label Oct 14, 2020
@vdemeester
Member

I wonder how this will interact with tekton bundles? @pierretasci @vdemeester

It might, in a good way, as we could re-use that "young" state when pulling the bundle takes time. Other than this, I think it's ok 😉

/lgtm

@tekton-robot tekton-robot added the lgtm Indicates that a PR is ready to be merged. label Oct 15, 2020
@imjasonh
Member

/lgtm
/approve

@tekton-robot
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ImJasonH

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tekton-robot tekton-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 15, 2020
@bobcatfish
Collaborator

This benefits use cases where the task / taskrun (as well as the pipeline / pipelinerun) are created at the same time, and it adds a bit of pain when authoring pipelines, as typos and other similar issues will take 30s longer to emerge; it's probably a fair trade.

Can we explore this a bit more? I think 30 seconds might be a bit generous here - maybe we can make it significantly shorter, e.g. 1 second?

Otherwise this is a pretty big shift in expectations imo; before this change, making a mistake in a Run would fail immediately, now you need to know to check back in 30 seconds. Additionally, I don't think the status will reflect what is going on, is that right? Would you be able to tell the difference b/w this being re-queued and a reconciler being down?

I'm also wondering: when will IsYoungResource realistically return false for a PipelineRun? Since PipelineRuns are always executed when created, it seems like they will always be "young"? And in fact, what happens if a Pipeline is executing, and a resource it needs is deleted within the first 30 seconds of execution? This would result in the already running PipelineRun being "suspended", i.e. would be re-queued, which is probably not what we want?

I think there is a bigger discussion to have here about what behavior we want when referenced resources are missing.

/hold

@tekton-robot tekton-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 15, 2020
@mattmoor
Member Author

@bobcatfish my expectation is sort of that working pipelines do not fail, which is a bigger violation of the principle of least surprise IMO 😉

Ultimately, I think any non-zero value here is an improvement, so I'm happy to adjust it. There isn't going to be a "right" answer here, it'll be trade-offs based on SLAs, and I don't know if/what SLA exists for when K8s responds to watches (nor am I sure that even that's a perfect approximation).

@bobcatfish
Collaborator

What about alternatively changing the PipelineResource fetching to use Get instead of a Lister? afaik that's what we've been doing for all other fetching, e.g. Task specs (https://github.com/tektoncd/pipeline/blob/1710b688c0238d21ebf43cd637b79e3cb627f054/pkg/reconciler/taskrun/resources/taskref.go#L37:32) to combat this problem

@mattmoor
Member Author

Outside of very narrow use cases, using the client for gets is basically guaranteed to become a problem at scale.

resources.GetResourcesFromBindings is called on the core path through reconcile, which means that if you had 1000 resources and a global resync (e.g. rolling out a new controller, failover), you would need to eat 1000 API calls before your workqueue is free to do any real work. The default client-side rate limit is 5 QPS, which means it'd take roughly 200s to do nothing.

I'm not sure how stressful the Tekton e2e tests are, but running the "chaos duck" during e2e tests in Knative uncovered a bunch of these. It kills controllers every 30s, so if you choke on failovers you'll never survive.

@bobcatfish
Collaborator

Outside of very narrow use cases, using the client for gets is basically guaranteed to become a problem at scale.

That's a good point; however, we are already using gets for Tasks and Pipelines. If we're seeing a performance problem there, I suggest we tackle that performance problem directly.

In the meantime it seems reasonable to me that we update the PipelineResource fetching logic to match how we're retrieving Pipelines and Tasks?

@mattmoor
Member Author

we are already using gets for Tasks and Pipelines

Those gets are done O(once in the lifecycle of a *Run) to snapshot the Task/Pipeline, where this is O(per reconcile).

"Just do a Get through the client" is literally asking me to introduce what I know is a scale bug, so I'm not going to do it, sorry.

If you want to make a more invasive change to bring the order of API calls more in line with Task/Pipeline, then be my guest, but it's a bigger change than I signed up for, and I still think that all of the API traffic for reads is worrisome. Feel free to close if you want something else entirely, as I don't think that change would benefit from this.
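
The difference between reading once per *Run lifecycle and reading on every reconcile can be sketched by counting calls. Everything here (`fakeAPI`, `run`, and the two reconcile functions) is hypothetical and only models the call counts, not Tekton's reconciler.

```go
package main

import "fmt"

// fakeAPI counts how many Get calls reach a (hypothetical) API server.
type fakeAPI struct{ gets int }

func (a *fakeAPI) Get(name string) string {
	a.gets++
	return "spec-of-" + name
}

// run is a hypothetical *Run that can keep a resolved spec in its status.
type run struct{ snapshot string }

// reconcileWithSnapshot fetches the spec only when no snapshot is stored yet.
func reconcileWithSnapshot(api *fakeAPI, r *run) string {
	if r.snapshot == "" {
		r.snapshot = api.Get("task")
	}
	return r.snapshot
}

// reconcileWithGet fetches the referenced object on every pass.
func reconcileWithGet(api *fakeAPI, r *run) string {
	return api.Get("resource")
}

// countGets runs `passes` reconciles of one run and returns the API read count.
func countGets(passes int, reconcile func(*fakeAPI, *run) string) int {
	api, r := &fakeAPI{}, &run{}
	for i := 0; i < passes; i++ {
		reconcile(api, r)
	}
	return api.gets
}

func main() {
	fmt.Println(countGets(10, reconcileWithSnapshot)) // 1
	fmt.Println(countGets(10, reconcileWithGet))      // 10
}
```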

@bobcatfish
Collaborator

Those gets are done O(once in the lifecycle of a *Run) to snapshot the Task/Pipeline, where this is O(per reconcile).

Ah, because we started storing the Task / Pipeline spec in the Run itself, so we can grab it from there afterward.

I wonder if that means that the most consistent approach here would be to do the same with the Resource spec? (retrieve with Get, store in the Run) I'm surprised we didn't do that already, I'm guessing it's because of the uncertain state of PipelineResources.

In that case, if we treat this as more of a stop gap until we have decided the fate of PipelineResources that seems reasonable to me - I also misunderstood the PR initially and thought we were doing this for all "resource" retrieval, not just PipelineResources. Maybe we can lower the tolerance significantly? Talking with @imjasonh he had suggested 5 seconds, i wonder if we could get away with something even shorter but 5 seconds seems okay.

Unless we think it's reasonable to start dereferencing the PipelineResource spec into the status now ( @vdemeester @afrittoli )

@bobcatfish
Collaborator

Oh and one more question: would it make sense to update the Run status in this case? Esp. if we go with the 30 second window, if I'm understanding correctly, a user wouldn't be able to tell the difference between a Run that is being retried b/c its PipelineResources don't exist, and a Run that isn't being reconciled at all?

@vdemeester
Member

Those gets are done O(once in the lifecycle of a *Run) to snapshot the Task/Pipeline, where this is O(per reconcile).

Ah, because we started storing the Task / Pipeline spec in the Run itself, so we can grab it from there afterward.

I wonder if that means that the most consistent approach here would be to do the same with the Resource spec? (retrieve with Get, store in the Run) I'm surprised we didn't do that already, I'm guessing it's because of the uncertain state of PipelineResources.

In that case, if we treat this as more of a stop gap until we have decided the fate of PipelineResources that seems reasonable to me - I also misunderstood the PR initially and thought we were doing this for all "resource" retrieval, not just PipelineResources. Maybe we can lower the tolerance significantly? Talking with @imjasonh he had suggested 5 seconds, i wonder if we could get away with something even shorter but 5 seconds seems okay.

I am fine starting with 5s and seeing if we need to refine this later on (and having a configuration for it so that users/integrations can customize that period of time).

Unless we think it's reasonable to start dereferencing the PipelineResource spec into the status now ( @vdemeester @afrittoli )

I would rather not do it, at least not until we are sure of what we are doing with PipelineResource.

As the PipelineRun reconciler executes, it resolves resources using the informer's lister cache. Currently, when that cache is behind, the PipelineRun fails immediately. This change builds in a grace period, `resources.MinimumAge`, and a helper, `resources.IsYoung`, that skip the immediate failure for young runs and instead return the error to the controller framework, which requeues the key for later processing (with backoff).

Fixes: tektoncd#3378
Drop to 5s.

Update status (`Succeeded: Unknown`) with information about why things aren't progressing.
@mattmoor
Member Author

Alright, back from long weekend 🏝️ 😎

would it make sense to update the Run status in this case

Sure, I added calls that toggle the reason/message on Succeeded: Unknown for this case.

I am fine starting with 5s and see if we need to refine this later on

Changed to 5s, hopefully this covers the majority of cases 🤞

I rebased, but put all the meaningful changes in a separate commit.
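
A rough sketch of what surfacing the wait in the status could look like. The `Condition` struct is a simplified stand-in for a Knative-style condition, and the helper name, reason string, and message wording are all hypothetical; the actual calls added in this PR are not shown in the conversation.

```go
package main

import "fmt"

// Condition is a simplified, hypothetical stand-in for a status condition.
type Condition struct {
	Type    string
	Status  string // "True", "False", or "Unknown"
	Reason  string
	Message string
}

// markWaitingForResource keeps Succeeded at Unknown while recording why the
// run is not progressing. The reason string is an assumption for this sketch.
func markWaitingForResource(c *Condition, resource string) {
	c.Type = "Succeeded"
	c.Status = "Unknown"
	c.Reason = "WaitingForResource"
	c.Message = fmt.Sprintf("resource %q not found yet; retrying within the grace period", resource)
}

func main() {
	var c Condition
	markWaitingForResource(&c, "my-git-resource")
	fmt.Printf("%s=%s (%s): %s\n", c.Type, c.Status, c.Reason, c.Message)
}
```

With a reason and message set, a user can tell a run that is being retried apart from one that is not being reconciled at all, which was the concern raised earlier in the thread.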

@tekton-robot tekton-robot removed the lgtm Indicates that a PR is ready to be merged. label Oct 20, 2020
@bobcatfish
Collaborator

Welcome back!

Sure, I added calls that toggle the reason/message on Succeeded: Unknown for this case.

Great, I think that helps!!

Changed to 5s, hopefully this covers the majority of cases 🤞

Sounds reasonable to me! There's the last question of making it configurable as @vdemeester suggested - I am slightly inclined to go-ahead as is since this only applies to PipelineResources and I think it's likely we'd be removing any config option we added later on (either b/c we don't have PipelineResources or b/c we'd be embedding the spec like we do with the other types). But I wouldn't block adding it if you feel strongly @vdemeester

/hold cancel

@tekton-robot tekton-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 20, 2020
@tekton-robot
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

| File | Old Coverage | New Coverage | Delta |
| --- | --- | --- | --- |
| pkg/apis/pipeline/v1beta1/taskrun_types.go | 81.8% | 80.4% | -1.5 |
| pkg/reconciler/resources.go | Do not exist | 100.0% | |

@mattmoor
Member Author

Yeah, +1 to delaying the addition of a knob until the future is clearer.

@vdemeester
Member

/lgtm

@tekton-robot tekton-robot added the lgtm Indicates that a PR is ready to be merged. label Oct 20, 2020
@mattmoor
Member Author

This is the flake @afrittoli has been chasing:

Expected 2 number of successful events from pipelinerun and taskrun but got 3; list of received events

/retest

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. kind/bug Categorizes issue or PR as related to a bug. lgtm Indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

PipelineRun fails too eagerly on missing resources
6 participants