Commit dbd333f

Add Node Affinity for TaskRuns that share PVC workspace
TaskRuns within a PipelineRun may share files using a workspace volume. The typical case is files from a git-clone operation. Tasks in a CI pipeline often perform operations on the filesystem, e.g. generate files or analyze files, so the workspace abstraction is very useful.

The Kubernetes way of using file volumes is [PersistentVolumeClaims](https://kubernetes.io/docs/concepts/storage/persistent-volumes/#persistentvolumeclaims). PersistentVolumeClaims use PersistentVolumes with different [access modes](https://kubernetes.io/docs/concepts/storage/persistent-volumes/#access-modes). The most commonly available PV access mode is ReadWriteOnce; volumes with this access mode can only be mounted on one Node at a time.

When using parallel Tasks in a Pipeline, the pods for the TaskRuns are scheduled to any Node, most likely not to the same Node in a cluster. Since volumes with the commonly available ReadWriteOnce access mode cannot be used by multiple Nodes at a time, these "parallel" pods are forced to execute sequentially, because the volume is only available on one Node at a time. This may cause your TaskRuns to time out.

Clusters are often _regional_, e.g. they are deployed across 3 Availability Zones, but Persistent Volumes are often _zonal_, e.g. they are only available to the Nodes within a single zone. Some cloud providers offer regional PVs, but sometimes a regional PV is only replicated to one additional zone, i.e. not to all 3 zones within a region. This works fine for most typical stateful applications, but Tekton uses storage in a different way: it is designed so that multiple pods access the same volume, in sequence or in parallel. This makes it difficult to design a Pipeline that starts with parallel tasks, each using its own PVC, followed by a common task that mounts the volumes from the earlier tasks - because if those tasks were scheduled to different zones, the common task cannot mount PVCs that are now located in different zones, and the PipelineRun is deadlocked.

There are a few technical solutions that offer parallel execution of Tasks even when sharing a PVC workspace:

- Using PVC access mode ReadWriteMany. But this access mode is not widely available, and is typically backed by an NFS server or another not so "cloud native" solution.
- An alternative is to use storage that is tied to a specific node, e.g. a local volume, and then configure pods so they are scheduled to that node. But this is not commonly available and has drawbacks, e.g. the pod may need to consume and mount a whole disk of several hundred GB.

Consequently, it would be good to find a way so that TaskRun pods that share a workspace are scheduled to the same Node - and thereby make it easy to use parallel tasks with a workspace, while executing concurrently, on widely available Kubernetes cluster and storage configurations.

A few alternative solutions have been considered, as documented in tektoncd#2586. However, they all have major drawbacks, e.g. major API and contract changes.

This commit introduces an "Affinity Assistant" - a minimal placeholder pod - so that it is possible to use [Kubernetes inter-pod affinity](https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#inter-pod-affinity-and-anti-affinity) for TaskRun pods that need to be scheduled to the same Node. This solution has several benefits: it does not introduce any API changes, it does not break or change any existing Tekton concepts, and it is implemented with very few changes. Additionally, it can be disabled with a feature flag.
**How it works:** When a PipelineRun is initiated, an "Affinity Assistant" is created for each PVC workspace volume. TaskRun pods that share a workspace volume are configured with podAffinity to the "Affinity Assistant" pod that was created for the volume. The "Affinity Assistant" lives until the PipelineRun is completed or deleted. "Affinity Assistant" pods are configured with podAntiAffinity to repel other "Affinity Assistants" - in a Best Effort fashion.

The Affinity Assistant is a _singleton_ workload, since it acts as a placeholder pod and TaskRun pods with affinity must be scheduled to the same Node. It is implemented with [QoS class Guaranteed](https://kubernetes.io/docs/tasks/configure-pod-container/quality-service-pod/#create-a-pod-that-gets-assigned-a-qos-class-of-guaranteed) but with minimal resource requests, since it does not do any work other than being a placeholder.

Singleton workloads can be implemented in multiple ways, and they differ in behavior when the Node becomes unreachable:

- as a Pod - the Pod is not managed, so it will not be recreated.
- as a Deployment - the Pod will be recreated, putting Availability before the singleton property.
- as a StatefulSet - the Pod will be recreated, putting the singleton property before Availability.

Therefore the Affinity Assistant is implemented as a StatefulSet.

Essentially this commit provides an effortless way to use functional task parallelism with any Kubernetes cluster that has any PVC-based storage.

Solves tektoncd#2586

/kind feature
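For illustration, here is a minimal sketch of the podAffinity a `TaskRun` pod sharing a PVC workspace ends up with, based on `nodeAffinityUsingAffinityAssistant` in `pkg/pod/pod.go` below. The instance value `affinity-assistant-0fe12345` and the component value `affinity-assistant` are illustrative assumptions; the real values come from the `app.kubernetes.io/instance` and `app.kubernetes.io/component` labels set on the Affinity Assistant.

```yaml
# Sketch of the affinity injected into a TaskRun pod spec (not generated output).
# The label values below are assumptions for illustration only.
affinity:
  podAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app.kubernetes.io/instance: affinity-assistant-0fe12345   # assumed generated name
            app.kubernetes.io/component: affinity-assistant           # assumed component value
        topologyKey: kubernetes.io/hostname
```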
1 parent 1fbac2a commit dbd333f

13 files changed, +810 −40 lines

config/200-clusterrole.yaml

+1-1
```diff
@@ -52,7 +52,7 @@ rules:
   # Unclear if this access is actually required. Simply a hold-over from the previous
   # incarnation of the controller's ClusterRole.
 - apiGroups: ["apps"]
-  resources: ["deployments"]
+  resources: ["deployments", "statefulsets"]
   verbs: ["get", "list", "create", "update", "delete", "patch", "watch"]
 - apiGroups: ["apps"]
   resources: ["deployments/finalizers"]
```

docs/install.md

+12
```diff
@@ -268,6 +268,18 @@ file lists the keys you can customize along with their default values.
 
 To customize the behavior of the Pipelines Controller, modify the ConfigMap `feature-flags` as follows:
 
+- `disable-affinity-assistant` - set this flag to `true` to disable the [Affinity Assistant](./workspaces.md#affinity-assistant-and-specifying-workspace-order-in-a-pipeline)
+  that is used to provide Node Affinity for `TaskRun` pods that share a workspace volume. The Affinity Assistant
+  is incompatible with `NodeSelector` and other affinity rules configured for `TaskRun` pods.
+
+  **Note:** The Affinity Assistant uses [Inter-pod affinity and anti-affinity](https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#inter-pod-affinity-and-anti-affinity),
+  which requires a substantial amount of processing and can slow down scheduling in large clusters
+  significantly. We do not recommend using it in clusters larger than several hundred nodes.
+
+  **Note:** Pod anti-affinity requires nodes to be consistently labelled; in other words, every
+  node in the cluster must have an appropriate label matching `topologyKey`. If some or all nodes
+  are missing the specified `topologyKey` label, it can lead to unintended behavior.
+
 - `disable-home-env-overwrite` - set this flag to `true` to prevent Tekton
   from overriding the `$HOME` environment variable for the containers executing your `Steps`.
   The default is `false`. For more information, see the [associated issue](https://github.com/tektoncd/pipeline/issues/2013).
```
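As a usage sketch, disabling the Affinity Assistant could look like the following, assuming the `feature-flags` ConfigMap lives in the default `tekton-pipelines` namespace:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: feature-flags
  namespace: tekton-pipelines   # assumption: default Tekton install namespace
data:
  disable-affinity-assistant: "true"   # opt out of the Affinity Assistant
```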

docs/labels.md

+2
```diff
@@ -58,6 +58,8 @@ The following labels are added to resources automatically:
   reference a `ClusterTask` will also receive `tekton.dev/task`.
 - `tekton.dev/taskRun` is added to `Pods`, and contains the name of the
   `TaskRun` that created the `Pod`.
+- `app.kubernetes.io/instance` and `app.kubernetes.io/component` are added to
+  Affinity Assistant `StatefulSets` and `Pods`. These are used for Pod Affinity for `TaskRuns`.
 
 ## Examples
 
```
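For illustration, the labels on an Affinity Assistant `StatefulSet` and its `Pod` could look roughly like this; the instance name and the `affinity-assistant` component value are assumptions for the sketch, not taken from the diff:

```yaml
metadata:
  labels:
    app.kubernetes.io/instance: affinity-assistant-0fe12345   # assumed generated name
    app.kubernetes.io/component: affinity-assistant           # assumed component value
```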

docs/tasks.md

+2-1
````diff
@@ -363,7 +363,8 @@ steps:
 ### Specifying `Workspaces`
 
 [`Workspaces`](workspaces.md#using-workspaces-in-tasks) allow you to specify
-one or more volumes that your `Task` requires during execution. For example:
+one or more volumes that your `Task` requires during execution. It is recommended that `Tasks` use **at most**
+one writable `Workspace`. For example:
 
 ```yaml
 spec:
````

docs/workspaces.md

+22-20
```diff
@@ -15,7 +15,7 @@ weight: 5
 - [Mapping `Workspaces` in `Tasks` to `TaskRuns`](#mapping-workspaces-in-tasks-to-taskruns)
 - [Examples of `TaskRun` definitions using `Workspaces`](#examples-of-taskrun-definitions-using-workspaces)
 - [Using `Workspaces` in `Pipelines`](#using-workspaces-in-pipelines)
-  - [Specifying `Workspace` order in a `Pipeline`](#specifying-workspace-order-in-a-pipeline)
+  - [Affinity Assistant and specifying `Workspace` order in a `Pipeline`](#affinity-assistant-and-specifying-workspace-order-in-a-pipeline)
 - [Specifying `Workspaces` in `PipelineRuns`](#specifying-workspaces-in-pipelineruns)
 - [Example `PipelineRun` definition using `Workspaces`](#example-pipelinerun-definitions-using-workspaces)
 - [Specifying `VolumeSources` in `Workspaces`](#specifying-volumesources-in-workspaces)
@@ -89,7 +89,8 @@ To configure one or more `Workspaces` in a `Task`, add a `workspaces` list with
 
 Note the following:
 
-- A `Task` definition can include as many `Workspaces` as it needs.
+- A `Task` definition can include as many `Workspaces` as it needs. It is recommended that `Tasks` use
+  **at most** one _writable_ `Workspace`.
 - A `readOnly` `Workspace` will have its volume mounted as read-only. Attempting to write
   to a `readOnly` `Workspace` will result in errors and failed `TaskRuns`.
 - `mountPath` can be either absolute or relative. Absolute paths start with `/` and relative paths
@@ -244,26 +245,27 @@ Include a `subPath` in the workspace binding to mount different parts of the same volume.
 
 The `subPath` specified in a `Pipeline` will be appended to any `subPath` specified as part of the `PipelineRun` workspace declaration. So a `PipelineRun` declaring a Workspace with `subPath` of `/foo` for a `Pipeline` who binds it to a `Task` with `subPath` of `/bar` will end up mounting the `Volume`'s `/foo/bar` directory.
 
-#### Specifying `Workspace` order in a `Pipeline`
+#### Affinity Assistant and specifying `Workspace` order in a `Pipeline`
 
 Sharing a `Workspace` between `Tasks` requires you to define the order in which those `Tasks`
-will be accessing that `Workspace` since different classes of storage have different limits
-for concurrent reads and writes. For example, a `PersistentVolumeClaim` with
-[access mode](https://kubernetes.io/docs/concepts/storage/persistent-volumes/#access-modes)
-`ReadWriteOnce` only allow `Tasks` on the same node writing to it at once.
-
-Using parallel `Tasks` in a `Pipeline` will work with `PersistentVolumeClaims` configured with
-[access mode](https://kubernetes.io/docs/concepts/storage/persistent-volumes/#access-modes)
-`ReadWriteMany` or `ReadOnlyMany` but you must ensure that those are available for your storage class.
-When using `PersistentVolumeClaims` with access mode `ReadWriteOnce` for parallel `Tasks`, you can configure a
-workspace with it's own `PersistentVolumeClaim` for each parallel `Task`.
-
-Use the `runAfter` field in your `Pipeline` definition to define when a `Task` should be executed. For more
-information, see the [`runAfter` documentation](pipelines.md#runAfter).
-
-**Warning:** You *must* ensure that this order is compatible with the configured access modes for your `PersistentVolumeClaim`.
-Parallel `Tasks` using the same `PersistentVolumeClaim` with access mode `ReadWriteOnce`, may execute on
-different nodes and be forced to execute sequentially which may cause `Tasks` to time out.
+write to or read from that `Workspace`. Use the `runAfter` field in your `Pipeline` definition
+to define when a `Task` should be executed. For more information, see the [`runAfter` documentation](pipelines.md#runAfter).
+
+When a `PersistentVolumeClaim` is used as a volume source for a `Workspace` in a `PipelineRun`,
+an Affinity Assistant will be created. The Affinity Assistant acts as a placeholder for `TaskRun` pods
+sharing the same `Workspace`. All `TaskRun` pods within the `PipelineRun` that share the `Workspace`
+will be scheduled to the same Node as the Affinity Assistant pod. This means that the Affinity Assistant is incompatible
+with e.g. `NodeSelector` or other affinity rules configured for the `TaskRun` pods. The Affinity Assistant
+is deleted when the `PipelineRun` is completed. The Affinity Assistant can be disabled by setting the
+[disable-affinity-assistant](install.md#customizing-basic-execution-parameters) feature gate to `true`.
+
+**Note:** The Affinity Assistant uses [Inter-pod affinity and anti-affinity](https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#inter-pod-affinity-and-anti-affinity),
+which requires a substantial amount of processing and can slow down scheduling in large clusters
+significantly. We do not recommend using it in clusters larger than several hundred nodes.
+
+**Note:** Pod anti-affinity requires nodes to be consistently labelled; in other words, every
+node in the cluster must have an appropriate label matching `topologyKey`. If some or all nodes
+are missing the specified `topologyKey` label, it can lead to unintended behavior.
 
 #### Specifying `Workspaces` in `PipelineRuns`
 
```

New example file (+202 lines): a Pipeline with both sequential and parallel Tasks sharing a PersistentVolumeClaim workspace.

```yaml
# This example shows how both sequential and parallel Tasks can share data
# using a PersistentVolumeClaim as a workspace. The TaskRun pods that share
# a workspace will be scheduled to the same Node in your cluster with an
# Affinity Assistant (unless it is disabled). The REPORTER task does not
# use a workspace so it does not get affinity to the Affinity Assistant
# and can be scheduled to any Node. If multiple concurrent PipelineRuns are
# executed, their Affinity Assistant pods will repel each other to different
# Nodes in a Best Effort fashion.
#
# A PipelineRun will pass a message parameter to the Pipeline in this example.
# The STARTER task will write the message to a file in the workspace. The UPPER
# and LOWER tasks will execute in parallel and process the message written by
# the STARTER, and transform it to upper case and lower case. The REPORTER task
# will use the Task Result from the UPPER task and print it - it is intended
# to mimic a Task that sends data to an external service and shows a Task that
# doesn't use a workspace. The VALIDATOR task will validate the result from
# UPPER and LOWER.
#
# Use the runAfter property in a Pipeline to configure that a task depends on
# another task. Output can be shared both via Task Results (e.g. like the REPORTER task)
# or via files in a workspace.
#
#                  -- (upper) -- (reporter)
#                 /                         \
#   (starter)                                 (validator)
#                 \                          /
#                  -- (lower) --------------

apiVersion: tekton.dev/v1beta1
kind: Pipeline
metadata:
  name: parallel-pipeline
spec:
  params:
    - name: message
      type: string

  workspaces:
    - name: ws

  tasks:
    - name: starter          # Tasks that do not declare a runAfter property
      taskRef:               # will start execution immediately
        name: persist-param
      params:
        - name: message
          value: $(params.message)
      workspaces:
        - name: task-ws
          workspace: ws
          subPath: init

    - name: upper
      runAfter:              # Note the use of runAfter here to declare that this task
        - starter            # depends on a previous task
      taskRef:
        name: to-upper
      params:
        - name: input-path
          value: init/message
      workspaces:
        - name: w
          workspace: ws

    - name: lower
      runAfter:
        - starter
      taskRef:
        name: to-lower
      params:
        - name: input-path
          value: init/message
      workspaces:
        - name: w
          workspace: ws

    - name: reporter
      runAfter:
        - upper
      taskRef:
        name: result-reporter
      params:
        - name: result-to-report
          value: $(tasks.upper.results.message)

    - name: validator        # This task validates the output from the upper and lower Tasks.
      runAfter:              # It does not strictly depend on the reporter Task,
        - reporter           # but you may want to skip this task if the reporter Task fails.
        - lower
      taskRef:
        name: validator
      workspaces:
        - name: files
          workspace: ws
---
apiVersion: tekton.dev/v1beta1
kind: Task
metadata:
  name: persist-param
spec:
  params:
    - name: message
      type: string
  results:
    - name: message
      description: A result message
  steps:
    - name: write
      image: ubuntu
      script: echo $(params.message) | tee $(workspaces.task-ws.path)/message $(results.message.path)
  workspaces:
    - name: task-ws
---
apiVersion: tekton.dev/v1beta1
kind: Task
metadata:
  name: to-upper
spec:
  description: |
    This task reads and processes a file from the workspace and writes the result
    both to a file in the workspace and as a Task Result.
  params:
    - name: input-path
      type: string
  results:
    - name: message
      description: Input message in upper case
  steps:
    - name: to-upper
      image: ubuntu
      script: cat $(workspaces.w.path)/$(params.input-path) | tr '[:lower:]' '[:upper:]' | tee $(workspaces.w.path)/upper $(results.message.path)
  workspaces:
    - name: w
---
apiVersion: tekton.dev/v1beta1
kind: Task
metadata:
  name: to-lower
spec:
  description: |
    This task reads and processes a file from the workspace and writes the result
    both to a file in the workspace and as a Task Result.
  params:
    - name: input-path
      type: string
  results:
    - name: message
      description: Input message in lower case
  steps:
    - name: to-lower
      image: ubuntu
      script: cat $(workspaces.w.path)/$(params.input-path) | tr '[:upper:]' '[:lower:]' | tee $(workspaces.w.path)/lower $(results.message.path)
  workspaces:
    - name: w
---
apiVersion: tekton.dev/v1beta1
kind: Task
metadata:
  name: result-reporter
spec:
  params:
    - name: result-to-report
      type: string
  steps:
    - name: report-result
      image: ubuntu
      script: echo $(params.result-to-report)
---
apiVersion: tekton.dev/v1beta1
kind: Task
metadata:
  name: validator
spec:
  steps:
    - name: validate-upper
      image: ubuntu
      script: cat $(workspaces.files.path)/upper | grep HELLO\ TEKTON
    - name: validate-lower
      image: ubuntu
      script: cat $(workspaces.files.path)/lower | grep hello\ tekton
  workspaces:
    - name: files
---
apiVersion: tekton.dev/v1beta1
kind: PipelineRun
metadata:
  generateName: parallel-pipelinerun-
spec:
  params:
    - name: message
      value: Hello Tekton
  pipelineRef:
    name: parallel-pipeline
  workspaces:
    - name: ws
      volumeClaimTemplate:
        spec:
          accessModes:
            - ReadWriteOnce
          resources:
            requests:
              storage: 1Gi
```

pkg/pod/pod.go

+32-1
```diff
@@ -26,6 +26,7 @@ import (
 	"github.com/tektoncd/pipeline/pkg/names"
 	"github.com/tektoncd/pipeline/pkg/system"
 	"github.com/tektoncd/pipeline/pkg/version"
+	"github.com/tektoncd/pipeline/pkg/workspace"
 	corev1 "k8s.io/api/core/v1"
 	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
 	"k8s.io/apimachinery/pkg/runtime/schema"
@@ -219,6 +220,17 @@ func MakePod(images pipeline.Images, taskRun *v1beta1.TaskRun, taskSpec v1beta1.
 		return nil, err
 	}
 
+	// Using node affinity on TaskRuns that share a PVC workspace, with an Affinity Assistant,
+	// is mutually exclusive with other affinity on TaskRun pods. If other
+	// affinity is wanted, it should be added to the Affinity Assistant pod instead, unless the
+	// assistant is disabled. When the Affinity Assistant is disabled, affinityAssistantName is not set.
+	var affinity *corev1.Affinity
+	if affinityAssistantName := taskRun.Annotations[workspace.AnnotationAffinityAssistantName]; affinityAssistantName != "" {
+		affinity = nodeAffinityUsingAffinityAssistant(affinityAssistantName)
+	} else {
+		affinity = podTemplate.Affinity
+	}
+
 	mergedPodContainers := stepContainers
 
 	// Merge sidecar containers with step containers.
@@ -265,7 +277,7 @@ func MakePod(images pipeline.Images, taskRun *v1beta1.TaskRun, taskSpec v1beta1.
 			Volumes:                      volumes,
 			NodeSelector:                 podTemplate.NodeSelector,
 			Tolerations:                  podTemplate.Tolerations,
-			Affinity:                     podTemplate.Affinity,
+			Affinity:                     affinity,
 			SecurityContext:              podTemplate.SecurityContext,
 			RuntimeClassName:             podTemplate.RuntimeClassName,
 			AutomountServiceAccountToken: podTemplate.AutomountServiceAccountToken,
@@ -296,6 +308,25 @@ func MakeLabels(s *v1beta1.TaskRun) map[string]string {
 	return labels
 }
 
+// nodeAffinityUsingAffinityAssistant achieves Node Affinity for TaskRun pods
+// sharing a PVC workspace by setting PodAffinity so that the TaskRun pods are
+// scheduled to the Node where the Affinity Assistant pod is scheduled.
+func nodeAffinityUsingAffinityAssistant(affinityAssistantName string) *corev1.Affinity {
+	return &corev1.Affinity{
+		PodAffinity: &corev1.PodAffinity{
+			RequiredDuringSchedulingIgnoredDuringExecution: []corev1.PodAffinityTerm{{
+				LabelSelector: &metav1.LabelSelector{
+					MatchLabels: map[string]string{
+						workspace.LabelInstance:  affinityAssistantName,
+						workspace.LabelComponent: workspace.ComponentNameAffinityAssistant,
+					},
+				},
+				TopologyKey: "kubernetes.io/hostname",
+			}},
+		},
+	}
+}
+
 // getLimitRangeMinimum gets all LimitRanges in a namespace and
 // searches for if a container minimum is specified. Due to
 // https://github.com/kubernetes/kubernetes/issues/79496, the
```
