TEP-0127: Larger Results via Sidecar Logs

jerop · jerop · commit a8f11c4f7f64 · 2022-11-30T15:41:00.000-05:00
Today, `Results` have a size limit of 4KB per `Step` and 12KB per `TaskRun` in the best case - see [issue][4012]. The goal of [TEP-0086][tep-0086] is to support larger `Results` beyond the current size limits. TEP-0086 has many alternatives but no proposal. This TEP is proposes experimenting with one of the alternatives - `Sidecar` logs. This allows us to support larger `Results` that are stored within `TaskRun` CRDs. [4012]: tektoncd/pipeline#4012 [4808]: tektoncd/pipeline#4808 [tep-0086]: https://github.com/tektoncd/community/blob/main/teps/0086-changing-the-way-result-parameters-are-stored.md
diff --git a/teps/0127-larger-results-via-sidecar-logs.md b/teps/0127-larger-results-via-sidecar-logs.md
@@ -0,0 +1,244 @@
+---
+status: implementable
+title: Larger Results via Sidecar Logs
+creation-date: '2022-11-30'
+last-updated: '2022-11-30'
+authors:
+- '@chitrangpatel'
+- '@jerop'
+see-also:
+- TEP-0086
+---
+
+# TEP-0127: Larger Results via Sidecar Logs
+
+<!-- toc -->
+- [Summary](#summary)
+- [Motivation](#motivation)
+  - [Goals](#goals)
+  - [Non-Goals](#non-goals)
+  - [Use Cases](#use-cases)
+- [Proposal](#proposal)
+  - [Feature Gate](#feature-gate)
+  - [Logs Access](#logs-access)
+  - [Size Limit](#size-limit)
+  - [PipelineRun Status](#pipelinerun-status)
+- [Design Details](#design-details)
+- [Design Evaluation](#design-evaluation)
+  - [Reusability](#reusability)
+  - [Simplicity](#simplicity)
+  - [Performance](#performance)
+  - [Security](#security)
+- [Experiment Questions](#experiment-questions)
+- [References](#references)
+<!-- /toc -->
+
+This TEP builds on the hard work of many people who have been tackling the problem over the past couple years,
+including but not limited to:
+- '@abayer'
+- '@afrittoli'
+- '@bobcatfish'
+- '@dibyom'
+- '@imjasonh'
+- '@pritidesai'
+- '@ScrapCodes'
+- '@skaegi'
+- '@tlawrie'
+- '@tomcli'
+- '@vdemeester'
+- '@wlynch'
+
+## Summary
+
+Today, `Results` have a size limit of 4KB per `Step` and 12KB per `TaskRun` in the best case - see [issue][4012]. 
+
+The goal of [TEP-0086][tep-0086] is to support larger `Results` beyond the current size limits. TEP-0086 has many
+alternatives but no proposal. This TEP is proposes experimenting with one of the alternatives - `Sidecar` logs.
+This allows us to support larger `Results` which are stored within `TaskRun` CRDs.
+
+## Motivation
+
+`Results` are too small - see [issue][4012]. The current implementation of `Results` involves parsing from disk and
+storing as part of the `Termination Message` which has a limit of 4KB per `Container` and 12KB per `Pod`. As such,
+the size limit of `Results` is 12KB per `TaskRun` and 4KB per `Step` at best.
+
+To make matters worse, the limit is divided equally among all `Containers` in a `Pod` - see [issue][4808]. Therefore,
+the more the `Steps` in a `Task` the less the size limit for `Results`. For example, if there are 12 `Steps` then each
+has 1KB in `Termination Message` storage to produce `Results`.
+
+[TEP-0086][tep-0086] aims to support larger `Results`. [TEP-0086][tep-0086] has many [alternatives][tep-0086-alts] with
+no proposal because there's no obvious "best" solution that would meet all the requirements.
+
+This TEP proposes experimenting with `Sidecar` logs to support larger `Results` that are stored with `TaskRun` CRDs.
+This allows us to provide an immediate solution to the urgent needs of users, while not blocking pursuit of the other
+alternatives.
+
+### Goals
+
+The main goal is to support larger `Results` that are limited by the size of a `TaskRun` CRD (1.5MB) via `Sidecar` logs.
+
+### Non-Goals
+
+The following are out of scope for this TEP:
+
+1. Solving use cases that requires really large `Results` beyond the size limit of a CRD (1.5MB).
+
+2. Addressing other [alternatives][tep-0086-alts] for larger `Results` that are listed in [TEP-0086][tep-0086].
+
+### Use Cases
+
+1. Store larger `Results`, such as JSON payloads from an HTTP call or a Cloud Event, that can be inspected later or
+   passed to other `Tasks` without the need to understand storage mechanisms.
+
+3. The [documented guidance][docs] is that `Results` are used for outputs less than 1KB while `Workspaces` are used
+   for larger data.
+
+   > As a general rule-of-thumb, if a `Result` needs to be larger than a kilobyte, you should likely use a `Workspace`
+     to store and pass it between `Tasks` within a `Pipeline`.
+   
+   Supporting larger `Results` upto the CRD limit allows users to reuse `Tasks` in more scenarios without having to
+   change the specification to use `Workspaces` upon hitting the current low size limit of 4KB per `TaskRun`.
+
+## Proposal
+
+We propose using stdout logs from a dedicated `Sidecar` to return a json `Result` object to support larger `Results`.
+The `Pipeline` controller would wait for the `Sidecar` to exit and then read the logs based on a particular query and
+append `Results` to the `TaskRun`.
+
+We propose injecting the dedicated `Sidecar` alongside other `Steps`. The `Sidecar` will watch the `Results` paths of
+the `Steps`. This `Sidecar` will output the name of the `Result` and its contents to stdout in a parsable pattern. The
+`TaskRun` controller will access the stdout logs of the `Sidecar` then extract the `Results` and its contents.
+
+After the `Steps` have terminated, the `TaskRun` controller will write the `Results` from the logs of the `Sidecar`
+instead of using the `Termination Message`, hence bypassing the 4KB limit. This method keeps the rest of the current
+functionality identical and does not require any external storage mechanism.
+
+For further details, see the [demonstration][demo] and [proof of concept][poc].
+
+This provides an opportunity to experiment with this solution to provide `Results` within the CRDs as we continue to
+explore other alternatives, including those that involve external storage.
+
+### Feature Gate
+
+This feature will be gated using a `result-extraction-method` feature flag. 
+
+Users have to set `result-extraction-method` to `"sidecar-logs"` to enable the feature:
+```shell
+kubectl patch cm feature-flags -n tekton-pipelines -p '{"data":{"result-extraction-method":"sidecar-logs"}}'
+```
+
+This feature is disabled by default. When disabled, `Results` continue to pass through `Termination Message`.
+
+### Logs Access
+
+This feature requires that the `Pipeline` controller has access to `Pod` logs. 
+
+Users have to grant `get` access to all `pods/log` to the `Pipeline` controller:
+```shell
+kubectl apply -f config/enable-log-access-to-controller/
+```
+
+### Size Limit
+
+The size limit per `Result` can be `max-result-size` feature flag, which takes the integer value of the bytes.
+
+The `max-result-size` feature flag defaults to 4096 bytes. This ensures that we support existing `Tasks` with only one
+`Result` that uses up the whole size limit of the `Termination Message`.
+
+If users want to set the size limit per `Result` to be something other than 4096 bytes, they can set `max-result-size`
+by setting `max-result-size: <VALUE-IN-BYTES>`. The value set here cannot exceed the CRD size limit of 1.5MB.
+
+```shell
+kubectl patch cm feature-flags -n tekton-pipelines -p '{"data":{"max-result-size":"<VALUE-IN-BYTES>"}}'
+```
+
+Even though the size limit per `Result` is configurable, the size of `Results` are limited by CRD size limit of 1.5MB.
+If the size of `Results` exceeds this limit, then the `TaskRun` will fail with a message indicating the size limit has
+been exceeded.
+
+### PipelineRun Status
+
+In [TEP-0100][tep-0100], we proposed changes to `PipelineRun` status to reduce the amount of information stored about
+the status of `TaskRuns` and `Runs` to improve performance and reduce memory bloat. The `PipelineRun` status is set up
+to handle larger `Results` without exacerbating the performance and storage issues that were there before.
+
+For `ChildReferences` to be populated, the `embedded-status` must be set to `"minimal"`. We recommend that the minimal
+embedded status - `ChildReferences` - is enabled while migration is ongoing until it becomes the only supported status.
+
+## Design Details
+
+The `Sidecar` will run a binary that:
+- receives argument for `Results`' paths and names identified from `taskSpec.results` field. This allows the `Sidecar`
+  to know the `Results` it needs to read.
+- has `/tekton/run` volume mounted where status of each `Step` is written.
+- periodically checks for `Step` status in the path `/tekton/run`.
+- when all `Steps` have completed, it immediately parses all the `Results` in paths and prints them to stdout in a
+  parsable pattern.
+
+For further details, see the [demonstration][demo] of the proof of concept.
+
+## Design Evaluation
+
+### Reusability
+
+This proposal does not introduce any API changes to specification `Results`, the changes are in implementation details
+of `Results`. The existing `Tasks` will continue to function as they are, only that they can support larger `Results`.
+
+Even more, this supporting larger `Results` upto the CRD limit allows users to reuse `Tasks` in more scenarios without
+having to change the specification to use `Workspaces` upon hitting the current low size limit of 4KB per `TaskRun`.
+This allows users to control execution as needed by their context without having to modify `Tasks` and `Pipelines`.
+
+### Simplicity
+
+This proposal provides a simple solution that solves most use cases:
+- Users don't need additional infrastructure, such as server or object storage, to support larger `Results`.
+- Existing `Tasks` will continue to function as they are, while supporting larger `Results`, without any API changes.
+
+### Performance
+
+Performance benchmarking with 20-30 `PipelineRuns`, each with 3 `TaskRuns` each with two `Steps`:
+- Average `Pipeline` controller's CPU difference during `PipelineRun` execution: 1%
+- Average `Pipeline` controller's Memory usage difference during `PipelineRun` execution: 0.2%
+- Average `Pod` startup time (time to get to running state) difference: 3s per `TaskRun`
+
+In the experiment, we will continue to measure the startup overhead and explore ways to improve it.
+
+For further details, see the [performance metrics][performance].
+
+### Security
+
+This approach requires that the `Pipeline` controller has access to `Pod` logs. The `Pipeline` controller already has
+extensive permissions in the cluster, such as read access to `Secrets`. Expanding the access even further is a concern
+for some users, but is also acceptable for some users given the advantages. We will document the extended permissions
+so that users can make the right choice for their own use cases and requirements.
+
+## Experiment Questions
+
+These are some questions we hope to answer in the experiment:
+
+- What impact does this change have on the start-up time and execution time of `TaskRuns` and `PipelineRuns`? Can we
+  improve the performance impact? 
+
+- How reliable is using `Sidecar` logs to process `Results`?
+
+- How many users adopt this solution? How many are satisfied with it given the advantages and disadvantages? We will
+  conduct a survey soon after the feature has been released.
+
+## References
+
+- [TEP-0086: Larger Results][tep-0086]
+- [TEP-0086: Larger Results via Sidecar Logs][745]
+- [Implementation Pull Request][poc]
+- [Demonstration by Chitrang][demo]
+
+[docs]: https://tekton.dev/docs/pipelines/tasks/#emitting-results
+[4012]: https://github.com/tektoncd/pipeline/issues/4012
+[4808]: https://github.com/tektoncd/pipeline/issues/4808
+[tep-0086]: ./0086-changing-the-way-result-parameters-are-stored.md
+[tep-0086-alts]: ./0086-changing-the-way-result-parameters-are-stored.md#alternatives
+[crd-size]: https://github.com/kubernetes/kubernetes/issues/82292
+[tep-0100]: ./0100-embedded-taskruns-and-runs-status-in-pipelineruns.md
+[demo]: https://drive.google.com/file/d/1NrWudE_XBqweomiY24DP2Txnl1yN0yD9/view
+[poc]: https://github.com/tektoncd/pipeline/pull/5695
+[performance]: https://github.com/tektoncd/community/pull/745#issuecomment-1206668381
+[745]: https://github.com/tektoncd/community/pull/745
diff --git a/teps/README.md b/teps/README.md
@@ -292,3 +292,4 @@ This is the complete list of Tekton teps:
 |[TEP-0123](0123-specifying-on-demand-retry-in-pipelinetask.md) | Specifying on-demand-retry in a PipelineTask | proposed | 2022-10-11 |
 |[TEP-0124](0124-distributed-tracing-for-tasks-and-pipelines.md) | Distributed tracing for Tasks and Pipelines | implementable | 2022-10-16 |
 |[TEP-0125](0125-add-credential-filter-to-entrypoint-logger.md) | Add credential filter to entrypoint logger | proposed | 2022-10-27 |
+|[TEP-0127](0127-larger-results-via-sidecar-logs.md) | Larger Results via Sidecar Logs | implementable | 2022-11-30 |