---
status: implemented
title: Simplify metrics
creation-date: '2021-06-23'
last-updated: '2022-02-28'
authors:
  - '@vdemeester'
  - '@yaoxiaoqi'
  - '@khrm'
---
# TEP-0073: Simplify metrics

## Summary

Tekton Pipelines provides metrics, but users have trouble using them. The main reason is that the current metrics are both too voluminous and too fine-grained.

This TEP proposes to simplify metrics by reducing their volume and allowing users to focus on the metrics they really care about.

## Motivation

Users report that their monitoring systems crashed because of too many metrics (see this issue). There are two main reasons behind the problem: large metrics volume and fine granularity.

Here is an extract of an issue raised by the App-SRE team at Red Hat (see here).

> We found that the cluster's prometheus instance was under heavy load and tracked it down to the top 2 heavy queries in the cluster. These were:
>
> - `tekton_pipelines_controller_pipelinerun_taskrun_duration_seconds_bucket`
> - `tekton_pipelines_controller_pipelinerun_duration_seconds_bucket`
>
> We trigger a lot of pipelines, so within a few days we hit ~8k PipelineRun CRs on a single cluster.
>
> For `tekton_pipelines_controller_pipelinerun_taskrun_duration_seconds_bucket` we currently have ~200k metrics published and ~100k for `tekton_pipelines_controller_pipelinerun_duration_seconds_bucket`.
>
> It looks like some of the labels that are used are causing cardinality explosion: `pipelinerun`, `taskrun`.

Every time a PipelineRun or TaskRun is executed, it creates a new metric or time series, because we use `pipelinerun` and `taskrun` labels/tags. This causes unbounded cardinality, which isn't recommended for systems like Prometheus. While we can expect the number of Pipeline or Task objects to remain fairly constant, the number of PipelineRun or TaskRun objects will keep increasing, which leads to cardinality explosion. In systems where Pipelines and Tasks are also expected to grow, labels based on them can cause cardinality explosion as well.

### Large metrics volume

Currently, there are too many metrics per TaskRun/PipelineRun. According to metrics.md, the metrics count for a PipelineRun with n TaskRuns is approximately 15*(n+1)+n; for example, a PipelineRun with 10 TaskRuns yields roughly 15*11+10 = 175 time series. The majority of these come from histograms. The volume of metrics becomes huge when the cluster is under stress, and it will cause severe cluster performance degradation if users run their monitoring system unthrottled.

A large metrics volume may also cause metrics loss. Some monitoring systems have a limit on the metrics count. For example, the maximum number of Prometheus metrics that the agent can consume from a target is 10,000, according to Prometheus Metrics Limitation. Once the metrics count reaches that limit, new metrics are dropped and we cannot receive them anymore.

### Fine granularity

Currently, Tekton Pipelines collects metrics at the TaskRun and PipelineRun level. Users usually find this too noisy and want high-level, aggregated metrics.

### Goals

- Reduce the metrics volume
- Allow users to configure the metrics granularity as they want
- Simplify metrics so that users can see what they care about easily

### Non-Goals

- Change the behavior of existing metrics ingestors (Prometheus, Stackdriver, etc.)
- Add support for a new metrics ingestor to handle a large amount of metrics
- Build a new metrics ingestor to handle a large amount of metrics

### Use Cases (optional)

Cluster admins can configure Tekton Pipelines to produce high-level metrics and focus on the overall status to make sure Tekton works fine, while individual users who care about the statistics of a single TaskRun can still keep the fine-grained metrics.

## Requirements

## Proposal

Coarse-grained metrics should be provided to satisfy users' needs. This can be implemented in the following ways:

- changing PipelineRun or TaskRun level metrics to Pipeline or Task level
- changing PipelineRun or TaskRun level metrics to namespace level
- changing the metrics type from histogram to gauge at the TaskRun or PipelineRun level

The last option is mutually exclusive with the first two: if users care about the performance of an individual TaskRun or PipelineRun, they can change the `duration_seconds` metrics type from histogram to gauge; if they care about overall performance, they can collect metrics at the namespace level.

### Setting Level of Metrics for TaskRun or PipelineRun

We can add config-observability options to switch between TaskRun/PipelineRun level, Task/Pipeline level, and namespace level metrics. The `metrics.taskrun.level` and `metrics.pipelinerun.level` fields will indicate at what level to aggregate metrics.

- When set to `namespace`, they will remove the `task` and `taskrun` labels, and the `pipeline` and `pipelinerun` labels, respectively, from the metrics.
- When set to `task` or `pipeline`, they will remove the `taskrun` and `pipelinerun` label respectively.
- When set to `taskrun` or `pipelinerun`, the current behaviour is kept.

For example:
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: config-observability
  namespace: tekton-pipelines
  labels:
    app.kubernetes.io/instance: default
    app.kubernetes.io/part-of: tekton-pipelines
data:
  metrics.taskrun.level: "task"
```

Take these TaskRun-level metrics as an example:
```
tekton_taskrun_duration_seconds_bucket{namespace="arendelle-nsfqw",status="success",task="anonymous",taskrun="duplicate-pod-task-run-ytqrdxja",le="10"} 1
tekton_taskrun_duration_seconds_bucket{namespace="arendelle-nsfqw",status="success",task="anonymous",taskrun="duplicate-pod-task-run-ymelobwl",le="10"} 1
tekton_taskrun_duration_seconds_bucket{namespace="arendelle-nsfqw",status="success",task="anonymous",taskrun="duplicate-pod-task-run-xnuasulj",le="10"} 1
tekton_taskrun_duration_seconds_bucket{namespace="arendelle-nsfqw",status="success",task="test",taskrun="duplicate-pod-task-run-tqerstbj",le="10"} 1
tekton_taskrun_duration_seconds_bucket{namespace="arendelle-nsfqw",status="success",task="test",taskrun="duplicate-pod-task-run-alcdjfnk",le="10"} 1
tekton_taskrun_duration_seconds_bucket{namespace="arendelle-nsfqw",status="success",task="test",taskrun="duplicate-pod-task-run-rtyjsdfm",le="10"} 1
tekton_taskrun_duration_seconds_bucket{namespace="arendelle-nsfqw",status="success",task="test",taskrun="duplicate-pod-task-run-iytyhksd",le="10"} 1
```

When the option is `task`, these metrics will be merged into two series based on the `task` label:

```
tekton_taskrun_duration_seconds_bucket{namespace="arendelle-nsfqw",task="anonymous",status="success",le="10"} 3
tekton_taskrun_duration_seconds_bucket{namespace="arendelle-nsfqw",task="test",status="success",le="10"} 4
```

When the option is `namespace`, these metrics will be merged into one:

```
tekton_taskrun_duration_seconds_bucket{namespace="arendelle-nsfqw",status="success",le="10"} 7
```
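For comparison, the same collapsing can be expressed on the monitoring side today. A sketch of Prometheus recording rules (assuming a Prometheus instance already scraping the controller) that pre-aggregates the existing TaskRun-level series to task and namespace level:

```yaml
groups:
  - name: tekton-taskrun-aggregation
    rules:
      # Task-level view: collapse the high-cardinality taskrun label
      - record: task:tekton_taskrun_duration_seconds_bucket:sum
        expr: sum without (taskrun) (tekton_taskrun_duration_seconds_bucket)
      # Namespace-level view: collapse both the task and taskrun labels
      - record: namespace:tekton_taskrun_duration_seconds_bucket:sum
        expr: sum without (task, taskrun) (tekton_taskrun_duration_seconds_bucket)
```

The proposed option performs this aggregation in the controller itself, so the high-cardinality series are never exported in the first place.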

### Change metrics type

A large metrics volume is mostly caused by histogram metrics: one single histogram metric produces 15 time series, so changing these metrics to gauges would reduce 15 series to one. For example, metrics like `tekton_pipelinerun_duration_seconds`, `tekton_taskrun_duration_seconds`, and `tekton_pipelinerun_taskrun_duration_seconds` are histograms; at the level of an individual TaskRun a histogram provides little extra information, so these can be changed to gauges.

Before

```
tekton_taskrun_duration_seconds_bucket{namespace="arendelle-nsfqw",status="success",task="anonymous",taskrun="duplicate-pod-task-run-wnigeayt",le="10"} 0
tekton_taskrun_duration_seconds_bucket{namespace="arendelle-nsfqw",status="success",task="anonymous",taskrun="duplicate-pod-task-run-wnigeayt",le="30"} 0
tekton_taskrun_duration_seconds_bucket{namespace="arendelle-nsfqw",status="success",task="anonymous",taskrun="duplicate-pod-task-run-wnigeayt",le="60"} 0
tekton_taskrun_duration_seconds_bucket{namespace="arendelle-nsfqw",status="success",task="anonymous",taskrun="duplicate-pod-task-run-wnigeayt",le="300"} 1
tekton_taskrun_duration_seconds_bucket{namespace="arendelle-nsfqw",status="success",task="anonymous",taskrun="duplicate-pod-task-run-wnigeayt",le="900"} 1
tekton_taskrun_duration_seconds_bucket{namespace="arendelle-nsfqw",status="success",task="anonymous",taskrun="duplicate-pod-task-run-wnigeayt",le="1800"} 1
tekton_taskrun_duration_seconds_bucket{namespace="arendelle-nsfqw",status="success",task="anonymous",taskrun="duplicate-pod-task-run-wnigeayt",le="3600"} 1
tekton_taskrun_duration_seconds_bucket{namespace="arendelle-nsfqw",status="success",task="anonymous",taskrun="duplicate-pod-task-run-wnigeayt",le="5400"} 1
tekton_taskrun_duration_seconds_bucket{namespace="arendelle-nsfqw",status="success",task="anonymous",taskrun="duplicate-pod-task-run-wnigeayt",le="10800"} 1
tekton_taskrun_duration_seconds_bucket{namespace="arendelle-nsfqw",status="success",task="anonymous",taskrun="duplicate-pod-task-run-wnigeayt",le="21600"} 1
tekton_taskrun_duration_seconds_bucket{namespace="arendelle-nsfqw",status="success",task="anonymous",taskrun="duplicate-pod-task-run-wnigeayt",le="43200"} 1
tekton_taskrun_duration_seconds_bucket{namespace="arendelle-nsfqw",status="success",task="anonymous",taskrun="duplicate-pod-task-run-wnigeayt",le="86400"} 1
tekton_taskrun_duration_seconds_bucket{namespace="arendelle-nsfqw",status="success",task="anonymous",taskrun="duplicate-pod-task-run-wnigeayt",le="+Inf"} 1
tekton_taskrun_duration_seconds_sum{namespace="arendelle-nsfqw",status="success",task="anonymous",taskrun="duplicate-pod-task-run-wnigeayt"} 122
tekton_taskrun_duration_seconds_count{namespace="arendelle-nsfqw",status="success",task="anonymous",taskrun="duplicate-pod-task-run-wnigeayt"} 1
```

After

```
tekton_taskrun_duration_seconds{namespace="arendelle-nsfqw",status="success",task="anonymous",taskrun="duplicate-pod-task-run-wnigeayt"} 122
```

We can add a config-observability option to switch the `duration_seconds` metrics type. The default value is histogram. It can be set to gauge if users only care about the execution time of individual TaskRuns. It can't be set to gauge when the level is set to `namespace`, because a gauge can't be aggregated at the namespace level.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: config-observability
  namespace: tekton-pipelines
  labels:
    app.kubernetes.io/instance: default
    app.kubernetes.io/part-of: tekton-pipelines
data:
  metrics.duration-seconds-type: gauge
```
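The trade-off between the two types shows up at query time. A sketch of Prometheus recording rules (metric and label names taken from the examples above) illustrating what each type still supports:

```yaml
groups:
  - name: tekton-duration-queries
    rules:
      # Histogram type: percentiles can be derived from the buckets
      - record: task:tekton_taskrun_duration_seconds:p90
        expr: >
          histogram_quantile(0.9,
            sum by (task, le) (rate(tekton_taskrun_duration_seconds_bucket[5m])))
      # Gauge type: only point-in-time aggregates such as the average remain
      - record: task:tekton_taskrun_duration_seconds:avg
        expr: avg by (task) (tekton_taskrun_duration_seconds)
```

Choosing gauge therefore trades percentile queries for a roughly 15x reduction in series count.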

### Notes/Caveats (optional)

The ultimate goal of this TEP is to provide the metrics that users truly need, so the config-observability option is not planned to exist indefinitely. The plan is to add a config-observability option to report only at the namespace level, default it to false for at least one release, and mention it in the release notes. If there is no significant user pushback, default it to true for at least one release, then remove the option entirely. Any user feedback received along the way can guide future decisions.

### Risks and Mitigations

### User Experience (optional)

### Performance (optional)

## Design Details

## Test Plan

## Design Evaluation

## Drawbacks

## Alternatives

### Alternative 1: Customize metrics

End users could also be allowed to customize which metrics they want to collect and at what level of granularity. We can add a flag specifying whether users want to customize the metrics; if it is set to true, Tekton Pipelines would only report the metrics that users configured.

The configuration could be like this:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: config-observability
  namespace: tekton-pipelines
  labels:
    app.kubernetes.io/instance: default
    app.kubernetes.io/part-of: tekton-pipelines
data:
  metrics.customize-metrics: "true"
  metrics.customize-metrics-spec: |
    {
      "metrics": [{
          "name": "pipelinerun_duration_seconds",
          "labels": ["status", "namespace"]
        }, {
          "name": "taskrun_duration_seconds",
          "labels": ["status", "namespace"]
        }, {
          "name": "taskrun_count"
        }
      ]
    }
```

If the user doesn't specify the labels, the default labels for the metric are used. Also, specifying every desired metric might be tedious, so we could allow users to customize only the metrics whose type is histogram, as sketched below.
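A hypothetical spec under those two rules, reusing the keys from the example above (the metric names are the histogram metrics discussed earlier; omitted labels mean the defaults apply):

```yaml
data:
  metrics.customize-metrics: "true"
  metrics.customize-metrics-spec: |
    {
      "metrics": [
        {"name": "pipelinerun_duration_seconds"},
        {"name": "taskrun_duration_seconds"},
        {"name": "pipelinerun_taskrun_duration_seconds"}
      ]
    }
```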

Drawbacks include:

- Might be laborious for users to configure what they want, which hurts the user experience
- Takes some effort to validate user input

### Alternative 2: Configure monitoring system filter

Prometheus and Stackdriver can both filter metrics by label. Users can write filter rules to choose what they need. We can provide a filter sample, such as the sketch below, and let users configure their monitoring system as they wish.
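A minimal sketch of such a sample for Prometheus, dropping the two heaviest histogram metrics before ingestion (the job name and target address are assumptions, not part of this proposal):

```yaml
scrape_configs:
  - job_name: tekton-pipelines-controller
    static_configs:
      # Assumed address of the controller's metrics endpoint
      - targets: ["tekton-pipelines-controller.tekton-pipelines:9090"]
    metric_relabel_configs:
      # Drop the heaviest histogram series before they are stored
      - source_labels: [__name__]
        regex: "tekton_pipelines_controller_pipelinerun(_taskrun)?_duration_seconds_.*"
        action: drop
```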

Drawbacks include:

- Might slow down the monitoring system and cause new performance problems
- Some monitoring systems may not support label filters

## Infrastructure Needed (optional)

## Upgrade & Migration Strategy (optional)

## Implementation Pull request(s)

## References (optional)

Additional context for this TEP can be found in the following links: