status | title | creation-date | last-updated | authors
---|---|---|---|---
implemented | Simplify metrics | 2021-06-23 | 2022-02-28 |
- Summary
- Motivation
- Requirements
- Proposal
- Design Details
- Test Plan
- Design Evaluation
- Drawbacks
- Alternatives
- Infrastructure Needed (optional)
- Upgrade & Migration Strategy (optional)
- Implementation Pull request(s)
- References (optional)
Tekton Pipelines provides metrics, but users are having trouble using them. The main reason is that the current metrics are both too numerous and too fine-grained.
This TEP proposes to simplify metrics by reducing their volume and allowing users to focus on the metrics they really care about.
Users have complained that their monitoring systems crashed because of too many metrics (see this issue). There are two main reasons behind the problem: large metrics volume and fine granularity.
Here is an extract of an issue raised by the App-SRE team at Red Hat (see here):
> We found that the cluster's prometheus instance was under heavy load and tracked it down to the top 2 heavy queries in the cluster. These were:
>
> - `tekton_pipelines_controller_pipelinerun_taskrun_duration_seconds_bucket`
> - `tekton_pipelines_controller_pipelinerun_duration_seconds_bucket`
>
> We trigger a lot of pipelines, so within a few days we hit ~8k PipelineRun CRs on a single cluster. For `tekton_pipelines_controller_pipelinerun_taskrun_duration_seconds_bucket` we currently have ~200k metrics published and ~100k for `tekton_pipelines_controller_pipelinerun_duration_seconds_bucket`. It looks like some labels that are used are causing cardinality explosion: `pipelinerun`, `taskrun`.
Every time a `PipelineRun` or `TaskRun` is executed, it creates a new metric or time series because we use `pipelinerun` and `taskrun` labels/tags. This causes unbounded cardinality, which isn't recommended for systems like Prometheus.
While we can expect `Pipeline` or `Task` objects to remain fairly constant, `PipelineRun` and `TaskRun` objects will continue to increase, which leads to cardinality explosion. In systems where `Pipeline` and `Task` counts are expected to grow, labels based on them can also cause cardinality explosion.
Currently, there are too many metrics per `TaskRun`/`PipelineRun`. According to metrics.md, the metrics count of a `PipelineRun` with `n` `TaskRun`s is approximately `15*(n+1)+n`. The majority of these metrics comes from histograms.
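For instance, a single `PipelineRun` with 10 `TaskRun`s (a purely illustrative number) would produce roughly:

```
15 * (10 + 1) + 10 = 175 metric series
```

Clusters that run thousands of `PipelineRun`s therefore accumulate a very large number of series, as in the issue quoted above.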
The amount of metrics is huge while the cluster is under stress, and it will cause severe cluster performance degradation if users run their monitoring system unthrottled.
Large metrics volume may also cause metrics loss. Some monitoring systems limit the number of metrics they will ingest. For example, the maximum number of Prometheus metrics that the agent can consume from a target is 10000, according to Prometheus Metrics Limitation. Once the metrics count reaches that limit, new metrics are dropped and can no longer be received.
Currently, Tekton Pipelines collects metrics at the `TaskRun` and `PipelineRun` level. Users usually find this too noisy and want high-level, aggregated metrics.
- Reduce the metrics volume
- Allow users to configure the metrics granularity as they want
- Simplify metrics so that users can see what they care about easily
- Change the behavior of existing metrics ingestors (Prometheus, Stackdriver, etc.)
- Add support for a new metrics ingestor to handle a large amount of metrics
- Build a new metrics ingestor to handle a large amount of metrics
Cluster administrators can configure Tekton Pipelines to produce high-level metrics and focus on the overall status to make sure Tekton works fine, while individual users who care about statistics of a single `TaskRun` can still keep the fine-grained metrics.
Coarse-grained metrics should be provided to satisfy users' needs. This can be implemented in the following ways:

- changing `PipelineRun` or `TaskRun` level metrics to `Task` or `Pipeline` level
- changing `PipelineRun` or `TaskRun` level metrics to namespace level
- changing the metrics type from histogram to gauge for `TaskRun` or `PipelineRun` level metrics

The latter is mutually exclusive with the former two. If users care about the performance of individual `TaskRun`s or `PipelineRun`s, they can change the `duration_seconds` type from histogram to gauge. If users care about overall performance, they can collect metrics at namespace level.
We can add a `config-observability` option to switch between `TaskRun` and `PipelineRun` level metrics, `Task` and `Pipeline` level, or namespace level. The `metrics.taskrun.level` and `metrics.pipelinerun.level` fields will indicate at what level to aggregate metrics.
- When they are set to `namespace`, they will remove the `task` and `taskrun` labels, and the `pipeline` and `pipelinerun` labels, respectively, from the metrics.
- When set to `task` or `pipeline`, they will remove the `taskrun` and `pipelinerun` label respectively.
- When set to `taskrun` or `pipelinerun`, the current behaviour will be kept.
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: config-observability
  namespace: tekton-pipelines
  labels:
    app.kubernetes.io/instance: default
    app.kubernetes.io/part-of: tekton-pipelines
data:
  metrics.taskrun.level: "task"
```
Take the following `TaskRun`-level metrics, produced by runs of two `Task`s (`anonymous` and `test`), as an example:

```
tekton_taskrun_duration_seconds_bucket{namespace="arendelle-nsfqw",status="success",task="anonymous",taskrun="duplicate-pod-task-run-ytqrdxja",le="10"} 1
tekton_taskrun_duration_seconds_bucket{namespace="arendelle-nsfqw",status="success",task="anonymous",taskrun="duplicate-pod-task-run-ymelobwl",le="10"} 1
tekton_taskrun_duration_seconds_bucket{namespace="arendelle-nsfqw",status="success",task="anonymous",taskrun="duplicate-pod-task-run-xnuasulj",le="10"} 1
tekton_taskrun_duration_seconds_bucket{namespace="arendelle-nsfqw",status="success",task="test",taskrun="duplicate-pod-task-run-tqerstbj",le="10"} 1
tekton_taskrun_duration_seconds_bucket{namespace="arendelle-nsfqw",status="success",task="test",taskrun="duplicate-pod-task-run-alcdjfnk",le="10"} 1
tekton_taskrun_duration_seconds_bucket{namespace="arendelle-nsfqw",status="success",task="test",taskrun="duplicate-pod-task-run-rtyjsdfm",le="10"} 1
tekton_taskrun_duration_seconds_bucket{namespace="arendelle-nsfqw",status="success",task="test",taskrun="duplicate-pod-task-run-iytyhksd",le="10"} 1
```
When the option is `task`, these metrics will be merged into two, based on the `task` label.
```
tekton_taskrun_duration_seconds_bucket{namespace="arendelle-nsfqw",task="anonymous",status="success",le="10"} 3
tekton_taskrun_duration_seconds_bucket{namespace="arendelle-nsfqw",task="test",status="success",le="10"} 4
```
When the option is `namespace`, these metrics will be merged into one.
```
tekton_taskrun_duration_seconds_bucket{namespace="arendelle-nsfqw",status="success",le="10"} 7
```
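The same knob applies to `PipelineRun` metrics through the `metrics.pipelinerun.level` field described above. As a sketch, a ConfigMap that aggregates `TaskRun` metrics at namespace level and `PipelineRun` metrics at `Pipeline` level could look like this (the values follow the levels listed above):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: config-observability
  namespace: tekton-pipelines
  labels:
    app.kubernetes.io/instance: default
    app.kubernetes.io/part-of: tekton-pipelines
data:
  metrics.taskrun.level: "namespace"     # drop both the task and taskrun labels
  metrics.pipelinerun.level: "pipeline"  # keep the pipeline label, drop pipelinerun
```

With these values, the `task` and `taskrun` labels are dropped from `TaskRun` metrics, while only the `pipelinerun` label is dropped from `PipelineRun` metrics.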
The large metrics volume is mostly caused by histogram metrics: a single histogram produces 15 series (13 buckets plus a sum and a count). Changing the type to gauge would reduce those 15 series to one.
For example, metrics like `tekton_pipelinerun_duration_seconds`, `taskrun_duration_seconds`, and `tekton_pipelinerun_taskrun_duration_seconds` are histograms, but a histogram does not provide much extra information at the `TaskRun` level, so they can be changed to gauges.
Before
```
tekton_taskrun_duration_seconds_bucket{namespace="arendelle-nsfqw",status="success",task="anonymous",taskrun="duplicate-pod-task-run-wnigeayt",le="10"} 0
tekton_taskrun_duration_seconds_bucket{namespace="arendelle-nsfqw",status="success",task="anonymous",taskrun="duplicate-pod-task-run-wnigeayt",le="30"} 0
tekton_taskrun_duration_seconds_bucket{namespace="arendelle-nsfqw",status="success",task="anonymous",taskrun="duplicate-pod-task-run-wnigeayt",le="60"} 0
tekton_taskrun_duration_seconds_bucket{namespace="arendelle-nsfqw",status="success",task="anonymous",taskrun="duplicate-pod-task-run-wnigeayt",le="300"} 1
tekton_taskrun_duration_seconds_bucket{namespace="arendelle-nsfqw",status="success",task="anonymous",taskrun="duplicate-pod-task-run-wnigeayt",le="900"} 1
tekton_taskrun_duration_seconds_bucket{namespace="arendelle-nsfqw",status="success",task="anonymous",taskrun="duplicate-pod-task-run-wnigeayt",le="1800"} 1
tekton_taskrun_duration_seconds_bucket{namespace="arendelle-nsfqw",status="success",task="anonymous",taskrun="duplicate-pod-task-run-wnigeayt",le="3600"} 1
tekton_taskrun_duration_seconds_bucket{namespace="arendelle-nsfqw",status="success",task="anonymous",taskrun="duplicate-pod-task-run-wnigeayt",le="5400"} 1
tekton_taskrun_duration_seconds_bucket{namespace="arendelle-nsfqw",status="success",task="anonymous",taskrun="duplicate-pod-task-run-wnigeayt",le="10800"} 1
tekton_taskrun_duration_seconds_bucket{namespace="arendelle-nsfqw",status="success",task="anonymous",taskrun="duplicate-pod-task-run-wnigeayt",le="21600"} 1
tekton_taskrun_duration_seconds_bucket{namespace="arendelle-nsfqw",status="success",task="anonymous",taskrun="duplicate-pod-task-run-wnigeayt",le="43200"} 1
tekton_taskrun_duration_seconds_bucket{namespace="arendelle-nsfqw",status="success",task="anonymous",taskrun="duplicate-pod-task-run-wnigeayt",le="86400"} 1
tekton_taskrun_duration_seconds_bucket{namespace="arendelle-nsfqw",status="success",task="anonymous",taskrun="duplicate-pod-task-run-wnigeayt",le="+Inf"} 1
tekton_taskrun_duration_seconds_sum{namespace="arendelle-nsfqw",status="success",task="anonymous",taskrun="duplicate-pod-task-run-wnigeayt"} 122
tekton_taskrun_duration_seconds_count{namespace="arendelle-nsfqw",status="success",task="anonymous",taskrun="duplicate-pod-task-run-wnigeayt"} 1
```
After
```
tekton_taskrun_duration_seconds{namespace="arendelle-nsfqw",status="success",task="anonymous",taskrun="duplicate-pod-task-run-wnigeayt"} 122
```
We can add a `config-observability` option to switch the `duration_seconds` type. The default value is `histogram`.
It can be set to `gauge` if users only care about the execution time of individual `TaskRun`s.
It can't be set to `gauge` when the metrics level is `namespace`, because a gauge can't be aggregated at the namespace level.
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: config-observability
  namespace: tekton-pipelines
  labels:
    app.kubernetes.io/instance: default
    app.kubernetes.io/part-of: tekton-pipelines
data:
  metrics.duration-seconds-type: gauge
```
The ultimate goal of this TEP is to provide the metrics that users truly need, so the `config-observability` option is not planned to exist indefinitely. The plan is to add a `config-observability` option to only report at the namespace level, default it to false for at least one release, and mention this in the release notes. If there's no significant user pushback, default it to true for at least one release, then remove it entirely. Any user feedback received along the way can be used to guide future decisions.
End users can also be allowed to customize which metrics they want to collect and at what level of granularity. We can add a flag to specify whether users want to customize the metrics.
If it is set to `true`, Tekton Pipelines will only report the metrics users configured.
The configuration could look like this:
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: config-observability
  namespace: tekton-pipelines
  labels:
    app.kubernetes.io/instance: default
    app.kubernetes.io/part-of: tekton-pipelines
data:
  metrics.customize-metrics: "true"
  metrics.customize-metrics-spec: |
    {
      "metrics": [{
        "name": "pipelinerun_duration_seconds",
        "labels": ["status", "namespace"]
      }, {
        "name": "taskrun_duration_seconds",
        "labels": ["status", "namespace"]
      }, {
        "name": "taskrun_count"
      }]
    }
```
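Assuming the customization works as described, only the listed metrics would be reported, and each would carry just the configured labels (or its defaults when none are given). A purely illustrative excerpt of the resulting output might look like:

```
tekton_pipelinerun_duration_seconds_bucket{namespace="arendelle-nsfqw",status="success",le="300"} 3
tekton_taskrun_duration_seconds_bucket{namespace="arendelle-nsfqw",status="success",le="60"} 7
tekton_taskrun_count{status="success"} 7
```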
If a user doesn't specify the labels, the default labels for that metric are used. Since specifying every desired metric might be tedious, we could also restrict customization to metrics whose type is histogram.
Drawbacks include:

- It might be laborious for users to configure what they want, which hurts the user experience
- It takes some effort to validate user input
Prometheus and Stackdriver can both filter metrics by label. Users can customize their filter configuration to choose what they need. We can provide a filter sample and let users configure their monitoring system as they wish.
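For example, with Prometheus one possible filter drops the two heavy histogram metrics named in the issue above before they are stored. This is only a sketch: the job name and target are placeholders for whatever already scrapes the Tekton controller.

```yaml
scrape_configs:
  - job_name: tekton-pipelines-controller   # placeholder: reuse your existing scrape job
    static_configs:
      - targets: ["tekton-pipelines-controller.tekton-pipelines:9090"]
    metric_relabel_configs:
      # Drop the heavy per-run histogram series before they are ingested.
      - source_labels: [__name__]
        regex: "tekton_pipelines_controller_pipelinerun_duration_seconds_.*|tekton_pipelines_controller_pipelinerun_taskrun_duration_seconds_.*"
        action: drop
```

Comparable label-based filters exist in Stackdriver, as noted above.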
Drawbacks include:

- It might slow down the monitoring system and cause new performance problems
- Some monitoring systems may not support label filters
- Add Configuration for Metrics Cardinality Simplification
- Change Default Metrics Level for Taskrun and Pipelinerun
Additional context for this TEP can be found in the following links: