Don't fail immediately on missing resources. #3385
Conversation
@@ -196,7 +196,7 @@ func GetResourcesFromBindings(pr *v1beta1.PipelineRun, getResource resources.Get
	for _, resource := range pr.Spec.Resources {
		r, err := resources.GetResourceFromBinding(resource, getResource)
		if err != nil {
			return rs, fmt.Errorf("error following resource reference for %s: %w", resource.Name, err)
This wrapping wasn't adding anything over the not-found error, and it was obscuring our ability to use `kerrors.IsNotFound`. I added logic to the unit test to confirm our ability to check that.
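For reference, a minimal sketch of the kind of check this unblocks, assuming the lister's not-found error is returned without extra wrapping. Whether `kerrors.IsNotFound` sees through `fmt.Errorf(..., %w)` wrapping depends on the apimachinery version in use, which is why returning the error as-is is the safer option:

```go
package main

import (
	"fmt"

	kerrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/runtime/schema"
)

func main() {
	// Simulate the "not found" error a lister or client returns for a
	// missing PipelineResource.
	gr := schema.GroupResource{Group: "tekton.dev", Resource: "pipelineresources"}
	err := kerrors.NewNotFound(gr, "source-repo")

	// With the extra fmt.Errorf wrapping removed, this check works reliably
	// and the caller can decide to requeue instead of failing the run.
	if kerrors.IsNotFound(err) {
		fmt.Println("resource missing: candidate for requeue rather than a hard failure")
	}
}
```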
@@ -67,7 +66,7 @@ func ResolveTaskResources(ts *v1beta1.TaskSpec, taskName string, kind v1beta1.Ta
	rr, err := GetResourceFromBinding(r.PipelineResourceBinding, gr)
	if err != nil {
		return nil, fmt.Errorf("couldn't retrieve referenced output PipelineResource: %w", err)
The same applies here as above.
/kind bug
/test check-pr-has-kind-label
/lgtm
Thanks for this, it sounds like a good idea.
This benefits use cases where a task / taskrun (as well as pipeline / pipelinerun) are created at the same time, and it adds a bit of pain when authoring pipelines, since typos and other similar issues will take 30s longer to surface; it's probably a fair trade.
Eventually we might want to expose the definition of "young" in a config or start-up flag, so that it can be set to zero in development environments.
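A hypothetical sketch of the kind of knob mentioned above, assuming a start-up flag; the flag name and default are illustrative, not part of this PR:

```go
package main

import (
	"flag"
	"fmt"
	"time"
)

// resourceGracePeriod is a hypothetical flag for the "young" window; a dev
// environment could set it to 0s to restore fail-fast behaviour.
var resourceGracePeriod = flag.Duration("resource-grace-period", 30*time.Second,
	"how long a freshly created Run waits for missing PipelineResources before failing")

func main() {
	flag.Parse()
	fmt.Printf("treating runs younger than %s as young\n", *resourceGracePeriod)
}
```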
I wonder how this will interact with tekton bundles? @pierretasci @vdemeester
It might interact in a good way, since we could re-use that "young" state when pulling the bundle takes time. Other than this, I think it's ok 😉
/lgtm
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: ImJasonH. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
Can we explore this a bit more? I think 30 seconds might be a bit generous here - maybe we can make it significantly shorter, e.g. 1 second? Otherwise this is a pretty big shift in expectations imo; before this change, a mistake in a Run would cause it to fail immediately, whereas now you need to know to check back in 30 seconds.

Additionally, I don't think the status will reflect what is going on, is that right? Would you be able to tell the difference b/w this being re-queued and a reconciler being down?

I'm also wondering: when will IsYoungResource realistically return false for a PipelineRun? Since PipelineRuns are always executed when created, it seems like they will always be "young"? And in fact, what happens if a Pipeline is executing, and a resource it needs is deleted within the first 30 seconds of execution? This would result in the already running PipelineRun being "suspended", i.e. being re-queued, which is probably not what we want?

I think there is a bigger discussion to have here about what behavior we want when referenced resources are missing.

/hold
@bobcatfish my expectation is sort of that working pipelines do not fail, which is a bigger violation of the principle of least surprise IMO 😉 Ultimately, I think any non-zero value here is an improvement, so I'm happy to adjust it. There isn't going to be a "right" answer here, it'll be trade-offs based on SLAs, and I don't know if/what SLA exists for when K8s responds to watches (nor am I sure that even that's a perfect approximation).
What about alternatively changing the PipelineResource fetching to use Get instead of a Lister? afaik that's what we've been doing for all other fetching, e.g. Task specs (https://github.com/tektoncd/pipeline/blob/1710b688c0238d21ebf43cd637b79e3cb627f054/pkg/reconciler/taskrun/resources/taskref.go#L37:32), to combat this problem.
Outside of very narrow use cases, using the client for gets is basically guaranteed to become a problem at scale.
I'm not sure how stressful the Tekton e2e tests are, but running the "chaos duck" during e2e tests in Knative uncovered a bunch of these. It kills controllers every 30s, so if you choke on failovers you'll never survive.
That's a good point; however, we are already using gets for Tasks and Pipelines. If we're seeing a performance problem there, I suggest we tackle that performance problem directly. In the meantime it seems reasonable to me that we update the PipelineResource fetching logic to match how we're retrieving Pipelines and Tasks?
Those gets are done O(once in the lifecycle of a *Run) to snapshot the Task/Pipeline, whereas this is O(per reconcile). "Just do a Get through the client" is literally asking me to introduce what I know is a scale bug, so I'm not going to do it, sorry.

If you want to make a more invasive change to bring the order of API calls more in line with Task/Pipeline, then be my guest, but it's a bigger change than I signed up for, and I still think that all of the API traffic for reads is worrisome. Feel free to close this if you want something else entirely, as I don't think that change would benefit from it.
Ah, that's because we started storing the Task / Pipeline spec in the Run itself, so we can grab it from there afterward. I wonder if that means the most consistent approach here would be to do the same with the Resource spec (retrieve with Get, store in the Run)? I'm surprised we didn't do that already; I'm guessing it's because of the uncertain state of PipelineResources.

In that case, if we treat this as more of a stop-gap until we have decided the fate of PipelineResources, that seems reasonable to me. I also misunderstood the PR initially and thought we were doing this for all "resource" retrieval, not just PipelineResources.

Maybe we can lower the tolerance significantly? Talking with @imjasonh, he suggested 5 seconds; I wonder if we could get away with something even shorter, but 5 seconds seems okay. Unless we think it's reasonable to start dereferencing the PipelineResource spec into the status now (@vdemeester @afrittoli)?
Oh and one more question: would it make sense to update the Run status in this case? Esp. if we go with the 30 second window, if I'm understanding correctly, a user wouldn't be able to tell the difference between a Run that is being retried b/c its PipelineResources don't exist, and a Run that isn't being reconciled at all?
I am fine starting with 5s and seeing if we need to refine this later on (and having a configuration for it so that users/integrations can customize that period of time).
I would rather not do it, at least not until we are sure of what we are doing with PipelineResources.
As the PipelineRun reconciler executes, it resolves resources using the informer's lister cache. Currently, when that cache is behind, the pipeline run will immediately fail. This change builds in a buffer of `resources.MinimumAge` and a helper `resources.IsYoung` that elide this check, returning the error to the controller framework to requeue the key for later processing (with backoff). Fixes: tektoncd#3378
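A minimal sketch of what the two helpers named above could look like; the real code lives in the reconciler's resources package, so the exact signatures here are assumptions for illustration only:

```go
package resources

import (
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// MinimumAge is how long a Run is given for its referenced resources to show
// up in the informer's lister cache before a missing resource is treated as
// fatal (30s as originally proposed; lowered to 5s later in this review).
const MinimumAge = 30 * time.Second

// IsYoung reports whether the object was created within MinimumAge, i.e. a
// "not found" from the lister may still just be cache lag, so the key should
// be requeued instead of failing the run.
func IsYoung(obj metav1.Object) bool {
	return time.Since(obj.GetCreationTimestamp().Time) < MinimumAge
}
```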
Drop to 5s. Update status (`Succeeded: Unknown`) with information about why things aren't progressing.
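For concreteness, a hedged sketch of the kind of condition this surfaces while waiting, using the standard knative `apis.Condition` shape; the Reason and Message strings are assumptions, not the exact ones added in the PR:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"knative.dev/pkg/apis"
)

func main() {
	// While the run is "young" and a referenced resource hasn't shown up in
	// the cache yet, Succeeded stays Unknown with an explanatory message.
	cond := apis.Condition{
		Type:    apis.ConditionSucceeded,
		Status:  corev1.ConditionUnknown,
		Reason:  "CouldntGetResource",                                // hypothetical reason
		Message: "waiting for referenced PipelineResources to exist", // hypothetical message
	}
	fmt.Printf("%s=%s (%s): %s\n", cond.Type, cond.Status, cond.Reason, cond.Message)
}
```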
Alright, back from long weekend 🏝️ 😎
Sure, I added calls that toggle the reason/message on the `Succeeded` condition.
Changed to 5s, hopefully this covers the majority of cases 🤞 I rebased, but put all the meaningful changes in a separate commit.
Welcome back!
Great, I think that helps!!
Sounds reasonable to me! There's the last question of making it configurable as @vdemeester suggested - I am slightly inclined to go ahead as is, since this only applies to PipelineResources and I think it's likely we'd be removing any config option we added later on (either b/c we don't have PipelineResources or b/c we'd be embedding the spec like we do with the other types). But I wouldn't block on adding it if you feel strongly @vdemeester

/hold cancel
Yeah, +1 to delaying the addition of a knob until the future is clearer.
/lgtm
This is the flake @afrittoli has been chasing:
/retest |
Changes
As the PipelineRun reconciler executes, it resolves resources using the informer's lister cache. Currently, when that cache is behind, the pipeline run will immediately fail. This change builds in a buffer of `resources.MinimumAge` and a helper `resources.IsYoung` that elide this check, returning the error to the controller framework to requeue the key for later processing (with backoff).

Fixes: #3378
Submitter Checklist
These are the criteria that every PR should meet, please check them off as you review them:
See the contribution guide for more details.
Double check this list of stuff that's easy to miss:
If there are changes under the `cmd` dir, please update the release Task to build and release this image.
Reviewer Notes
If API changes are included, additive changes must be approved by at least two OWNERS and backwards incompatible changes must be approved by more than 50% of the OWNERS, and they must first be added in a backwards compatible way.
Release Notes