|
| 1 | +--- |
| 2 | +status: implementable |
| 3 | +title: Larger Results via Sidecar Logs |
| 4 | +creation-date: '2022-11-30' |
| 5 | +last-updated: '2022-11-30' |
| 6 | +authors: |
| 7 | +- '@chitrangpatel' |
| 8 | +- '@jerop' |
| 9 | +see-also: |
| 10 | +- TEP-0086 |
| 11 | +--- |
| 12 | + |
| 13 | +# TEP-0127: Larger Results via Sidecar Logs |
| 14 | + |
| 15 | +<!-- toc --> |
| 16 | +- [Summary](#summary) |
| 17 | +- [Motivation](#motivation) |
| 18 | + - [Goals](#goals) |
| 19 | + - [Non-Goals](#non-goals) |
| 20 | + - [Use Cases](#use-cases) |
| 21 | +- [Proposal](#proposal) |
| 22 | + - [Feature Gate](#feature-gate) |
| 23 | + - [Logs Access](#logs-access) |
| 24 | + - [Size Limit](#size-limit) |
| 25 | + - [PipelineRun Status](#pipelinerun-status) |
| 26 | +- [Design Details](#design-details) |
| 27 | +- [Design Evaluation](#design-evaluation) |
| 28 | + - [Reusability](#reusability) |
| 29 | + - [Simplicity](#simplicity) |
| 30 | + - [Performance](#performance) |
| 31 | + - [Security](#security) |
| 32 | +- [Experiment Questions](#experiment-questions) |
| 33 | +- [References](#references) |
| 34 | +<!-- /toc --> |
| 35 | + |
| 36 | +This TEP builds on the hard work of many people who have been tackling the problem over the past couple years, |
| 37 | +including but not limited to: |
| 38 | +- '@abayer' |
| 39 | +- '@afrittoli' |
| 40 | +- '@bobcatfish' |
| 41 | +- '@dibyom' |
| 42 | +- '@imjasonh' |
| 43 | +- '@pritidesai' |
| 44 | +- '@ScrapCodes' |
| 45 | +- '@skaegi' |
| 46 | +- '@tlawrie' |
| 47 | +- '@tomcli' |
| 48 | +- '@vdemeester' |
| 49 | +- '@wlynch' |
| 50 | + |
| 51 | +## Summary |
| 52 | + |
| 53 | +Today, `Results` have a size limit of 4KB per `Step` and 12KB per `TaskRun` in the best case - see [issue][4012]. |
| 54 | + |
| 55 | +The goal of [TEP-0086][tep-0086] is to support larger `Results` beyond the current size limits. TEP-0086 has many |
| 56 | +alternatives but no proposal. This TEP is proposes experimenting with one of the alternatives - `Sidecar` logs. |
| 57 | +This allows us to support larger `Results` which are stored within `TaskRun` CRDs. |
| 58 | + |
| 59 | +## Motivation |
| 60 | + |
| 61 | +`Results` are too small - see [issue][4012]. The current implementation of `Results` involves parsing from disk and |
| 62 | +storing as part of the `Termination Message` which has a limit of 4KB per `Container` and 12KB per `Pod`. As such, |
| 63 | +the size limit of `Results` is 12KB per `TaskRun` and 4KB per `Step` at best. |
| 64 | + |
| 65 | +To make matters worse, the limit is divided equally among all `Containers` in a `Pod` - see [issue][4808]. Therefore, |
| 66 | +the more the `Steps` in a `Task` the less the size limit for `Results`. For example, if there are 12 `Steps` then each |
| 67 | +has 1KB in `Termination Message` storage to produce `Results`. |
| 68 | + |
| 69 | +[TEP-0086][tep-0086] aims to support larger `Results`. [TEP-0086][tep-0086] has many [alternatives][tep-0086-alts] with |
| 70 | +no proposal because there's no obvious "best" solution that would meet all the requirements. |
| 71 | + |
| 72 | +This TEP proposes experimenting with `Sidecar` logs to support larger `Results` that are stored with `TaskRun` CRDs. |
| 73 | +This allows us to provide an immediate solution to the urgent needs of users, while not blocking pursuit of the other |
| 74 | +alternatives. |
| 75 | + |
| 76 | +### Goals |
| 77 | + |
| 78 | +The main goal is to support larger `Results` that are limited by the size of a `TaskRun` CRD (1.5MB) via `Sidecar` logs. |
| 79 | + |
| 80 | +### Non-Goals |
| 81 | + |
| 82 | +The following are out of scope for this TEP: |
| 83 | + |
| 84 | +1. Solving use cases that requires really large `Results` beyond the size limit of a CRD (1.5MB). |
| 85 | + |
| 86 | +2. Addressing other [alternatives][tep-0086-alts] for larger `Results` that are listed in [TEP-0086][tep-0086]. |
| 87 | + |
| 88 | +### Use Cases |
| 89 | + |
| 90 | +1. Store larger `Results`, such as JSON payloads from an HTTP call or a Cloud Event, that can be inspected later or |
| 91 | + passed to other `Tasks` without the need to understand storage mechanisms. |
| 92 | + |
| 93 | +3. The [documented guidance][docs] is that `Results` are used for outputs less than 1KB while `Workspaces` are used |
| 94 | + for larger data. |
| 95 | + |
| 96 | + > As a general rule-of-thumb, if a `Result` needs to be larger than a kilobyte, you should likely use a `Workspace` |
| 97 | + to store and pass it between `Tasks` within a `Pipeline`. |
| 98 | + |
| 99 | + Supporting larger `Results` upto the CRD limit allows users to reuse `Tasks` in more scenarios without having to |
| 100 | + change the specification to use `Workspaces` upon hitting the current low size limit of 4KB per `TaskRun`. |
| 101 | + |
| 102 | +## Proposal |
| 103 | + |
| 104 | +We propose using stdout logs from a dedicated `Sidecar` to return a json `Result` object to support larger `Results`. |
| 105 | +The `Pipeline` controller would wait for the `Sidecar` to exit and then read the logs based on a particular query and |
| 106 | +append `Results` to the `TaskRun`. |
| 107 | + |
| 108 | +We propose injecting the dedicated `Sidecar` alongside other `Steps`. The `Sidecar` will watch the `Results` paths of |
| 109 | +the `Steps`. This `Sidecar` will output the name of the `Result` and its contents to stdout in a parsable pattern. The |
| 110 | +`TaskRun` controller will access the stdout logs of the `Sidecar` then extract the `Results` and its contents. |
| 111 | + |
| 112 | +After the `Steps` have terminated, the `TaskRun` controller will write the `Results` from the logs of the `Sidecar` |
| 113 | +instead of using the `Termination Message`, hence bypassing the 4KB limit. This method keeps the rest of the current |
| 114 | +functionality identical and does not require any external storage mechanism. |
| 115 | + |
| 116 | +For further details, see the [demonstration][demo] and [proof of concept][poc]. |
| 117 | + |
| 118 | +This provides an opportunity to experiment with this solution to provide `Results` within the CRDs as we continue to |
| 119 | +explore other alternatives, including those that involve external storage. |
| 120 | + |
| 121 | +### Feature Gate |
| 122 | + |
| 123 | +This feature will be gated using a `result-extraction-method` feature flag. |
| 124 | + |
| 125 | +Users have to set `result-extraction-method` to `"sidecar-logs"` to enable the feature: |
| 126 | +```shell |
| 127 | +kubectl patch cm feature-flags -n tekton-pipelines -p '{"data":{"result-extraction-method":"sidecar-logs"}}' |
| 128 | +``` |
| 129 | + |
| 130 | +This feature is disabled by default. When disabled, `Results` continue to pass through `Termination Message`. |
| 131 | + |
| 132 | +### Logs Access |
| 133 | + |
| 134 | +This feature requires that the `Pipeline` controller has access to `Pod` logs. |
| 135 | + |
| 136 | +Users have to grant `get` access to all `pods/log` to the `Pipeline` controller: |
| 137 | +```shell |
| 138 | +kubectl apply -f config/enable-log-access-to-controller/ |
| 139 | +``` |
| 140 | + |
| 141 | +### Size Limit |
| 142 | + |
| 143 | +The size limit per `Result` can be `max-result-size` feature flag, which takes the integer value of the bytes. |
| 144 | + |
| 145 | +The `max-result-size` feature flag defaults to 4096 bytes. This ensures that we support existing `Tasks` with only one |
| 146 | +`Result` that uses up the whole size limit of the `Termination Message`. |
| 147 | + |
| 148 | +If users want to set the size limit per `Result` to be something other than 4096 bytes, they can set `max-result-size` |
| 149 | +by setting `max-result-size: <VALUE-IN-BYTES>`. The value set here cannot exceed the CRD size limit of 1.5MB. |
| 150 | + |
| 151 | +```shell |
| 152 | +kubectl patch cm feature-flags -n tekton-pipelines -p '{"data":{"max-result-size":"<VALUE-IN-BYTES>"}}' |
| 153 | +``` |
| 154 | + |
| 155 | +Even though the size limit per `Result` is configurable, the size of `Results` are limited by CRD size limit of 1.5MB. |
| 156 | +If the size of `Results` exceeds this limit, then the `TaskRun` will fail with a message indicating the size limit has |
| 157 | +been exceeded. |
| 158 | + |
| 159 | +### PipelineRun Status |
| 160 | + |
| 161 | +In [TEP-0100][tep-0100], we proposed changes to `PipelineRun` status to reduce the amount of information stored about |
| 162 | +the status of `TaskRuns` and `Runs` to improve performance and reduce memory bloat. The `PipelineRun` status is set up |
| 163 | +to handle larger `Results` without exacerbating the performance and storage issues that were there before. |
| 164 | + |
| 165 | +For `ChildReferences` to be populated, the `embedded-status` must be set to `"minimal"`. We recommend that the minimal |
| 166 | +embedded status - `ChildReferences` - is enabled while migration is ongoing until it becomes the only supported status. |
| 167 | + |
| 168 | +## Design Details |
| 169 | + |
| 170 | +The `Sidecar` will run a binary that: |
| 171 | +- receives argument for `Results`' paths and names identified from `taskSpec.results` field. This allows the `Sidecar` |
| 172 | + to know the `Results` it needs to read. |
| 173 | +- has `/tekton/run` volume mounted where status of each `Step` is written. |
| 174 | +- periodically checks for `Step` status in the path `/tekton/run`. |
| 175 | +- when all `Steps` have completed, it immediately parses all the `Results` in paths and prints them to stdout in a |
| 176 | + parsable pattern. |
| 177 | + |
| 178 | +For further details, see the [demonstration][demo] of the proof of concept. |
| 179 | + |
| 180 | +## Design Evaluation |
| 181 | + |
| 182 | +### Reusability |
| 183 | + |
| 184 | +This proposal does not introduce any API changes to specification `Results`, the changes are in implementation details |
| 185 | +of `Results`. The existing `Tasks` will continue to function as they are, only that they can support larger `Results`. |
| 186 | + |
| 187 | +Even more, this supporting larger `Results` upto the CRD limit allows users to reuse `Tasks` in more scenarios without |
| 188 | +having to change the specification to use `Workspaces` upon hitting the current low size limit of 4KB per `TaskRun`. |
| 189 | +This allows users to control execution as needed by their context without having to modify `Tasks` and `Pipelines`. |
| 190 | + |
| 191 | +### Simplicity |
| 192 | + |
| 193 | +This proposal provides a simple solution that solves most use cases: |
| 194 | +- Users don't need additional infrastructure, such as server or object storage, to support larger `Results`. |
| 195 | +- Existing `Tasks` will continue to function as they are, while supporting larger `Results`, without any API changes. |
| 196 | + |
| 197 | +### Performance |
| 198 | + |
| 199 | +Performance benchmarking with 20-30 `PipelineRuns`, each with 3 `TaskRuns` each with two `Steps`: |
| 200 | +- Average `Pipeline` controller's CPU difference during `PipelineRun` execution: 1% |
| 201 | +- Average `Pipeline` controller's Memory usage difference during `PipelineRun` execution: 0.2% |
| 202 | +- Average `Pod` startup time (time to get to running state) difference: 3s per `TaskRun` |
| 203 | + |
| 204 | +In the experiment, we will continue to measure the startup overhead and explore ways to improve it. |
| 205 | + |
| 206 | +For further details, see the [performance metrics][performance]. |
| 207 | + |
| 208 | +### Security |
| 209 | + |
| 210 | +This approach requires that the `Pipeline` controller has access to `Pod` logs. The `Pipeline` controller already has |
| 211 | +extensive permissions in the cluster, such as read access to `Secrets`. Expanding the access even further is a concern |
| 212 | +for some users, but is also acceptable for some users given the advantages. We will document the extended permissions |
| 213 | +so that users can make the right choice for their own use cases and requirements. |
| 214 | + |
| 215 | +## Experiment Questions |
| 216 | + |
| 217 | +These are some questions we hope to answer in the experiment: |
| 218 | + |
| 219 | +- What impact does this change have on the start-up time and execution time of `TaskRuns` and `PipelineRuns`? Can we |
| 220 | + improve the performance impact? |
| 221 | + |
| 222 | +- How reliable is using `Sidecar` logs to process `Results`? |
| 223 | + |
| 224 | +- How many users adopt this solution? How many are satisfied with it given the advantages and disadvantages? We will |
| 225 | + conduct a survey soon after the feature has been released. |
| 226 | + |
| 227 | +## References |
| 228 | + |
| 229 | +- [TEP-0086: Larger Results][tep-0086] |
| 230 | +- [TEP-0086: Larger Results via Sidecar Logs][745] |
| 231 | +- [Implementation Pull Request][poc] |
| 232 | +- [Demonstration by Chitrang][demo] |
| 233 | + |
| 234 | +[docs]: https://tekton.dev/docs/pipelines/tasks/#emitting-results |
| 235 | +[4012]: https://github.com/tektoncd/pipeline/issues/4012 |
| 236 | +[4808]: https://github.com/tektoncd/pipeline/issues/4808 |
| 237 | +[tep-0086]: ./0086-changing-the-way-result-parameters-are-stored.md |
| 238 | +[tep-0086-alts]: ./0086-changing-the-way-result-parameters-are-stored.md#alternatives |
| 239 | +[crd-size]: https://github.com/kubernetes/kubernetes/issues/82292 |
| 240 | +[tep-0100]: ./0100-embedded-taskruns-and-runs-status-in-pipelineruns.md |
| 241 | +[demo]: https://drive.google.com/file/d/1NrWudE_XBqweomiY24DP2Txnl1yN0yD9/view |
| 242 | +[poc]: https://github.com/tektoncd/pipeline/pull/5695 |
| 243 | +[performance]: https://github.com/tektoncd/community/pull/745#issuecomment-1206668381 |
| 244 | +[745]: https://github.com/tektoncd/community/pull/745 |
0 commit comments