Add tpu-dra-driver and nvidia-dra-driver-gpu helm charts #1028

Merged 3 commits on Mar 28, 2025
Changes from 2 commits
24 changes: 24 additions & 0 deletions charts/tpu-dra-driver/Chart.yaml
@@ -0,0 +1,24 @@
apiVersion: v2
name: tpu-dra-driver
description: An example Helm chart for a Dynamic Resource Allocation (DRA) resource driver

# A chart can be either an 'application' or a 'library' chart.
#
# Application charts are a collection of templates that can be packaged into versioned archives
# to be deployed.
#
# Library charts provide useful utilities or functions for the chart developer. They're included as
# a dependency of application charts to inject those utilities and functions into the rendering
# pipeline. Library charts do not define any templates and therefore cannot be deployed.
type: application

# This is the chart version. This version number should be incremented each time you make changes
# to the chart and its templates, including the app version.
# Versions are expected to follow Semantic Versioning (https://semver.org/)
version: 0.1.0

# This is the version number of the application being deployed. This version number should be
# incremented each time you make changes to the application. Versions are not expected to
# follow Semantic Versioning. They should reflect the version the application is using.
# It is recommended to use it with quotes.
appVersion: "0.1.0"
8 changes: 8 additions & 0 deletions charts/tpu-dra-driver/README.md
@@ -0,0 +1,8 @@
# TPU DRA Driver

This Helm chart installs the Private Preview version of the TPU DRA driver on GKE.

## Overview

Run `./install-tpu-dra-driver.sh` to install `tpu-dra-driver` on the nodes of your
GKE cluster that have TPU resources.
28 changes: 28 additions & 0 deletions charts/tpu-dra-driver/install-tpu-dra-driver.sh
@@ -0,0 +1,28 @@
#!/bin/bash

CURRENT_DIR="$(cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd)"

set -o pipefail

# The name of the example driver
: ${DRIVER_NAME:=tpu-dra-driver}

# The registry, image and tag for the example driver
# Please update DRIVER_IMAGE_REGISTRY
: ${DRIVER_IMAGE_REGISTRY:="gcr.io/gke-release-staging"}
: ${DRIVER_IMAGE_NAME:="${DRIVER_NAME}"}
: ${DRIVER_IMAGE_TAG:="master.0"}
: ${DRIVER_IMAGE_PLATFORM:="ubuntu22.04"}

# The derived name of the driver image to build
: ${DRIVER_IMAGE:="${DRIVER_IMAGE_REGISTRY}/${DRIVER_IMAGE_NAME}:${DRIVER_IMAGE_TAG}"}

helm upgrade -i --create-namespace --namespace tpu-dra-driver tpu-dra-driver "${CURRENT_DIR}" \
  --set image.repository="${DRIVER_IMAGE_REGISTRY}/${DRIVER_IMAGE_NAME}" \
  --set image.tag="${DRIVER_IMAGE_TAG}" \
  --set image.pullPolicy=Always \
  --set cdi.enabled=true \
  --set cdi.default=true \
  --set controller.priorityClassName="" \
  --set kubeletPlugin.priorityClassName="" \
  --set deviceClasses="{tpu}"
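The `: ${VAR:=default}` lines in the script use the shell's assign-default expansion: a value already set in the environment wins, otherwise the default is assigned. The following is a minimal sketch of that behavior, not part of the chart. `resolve_image` is a hypothetical helper and `example.registry.local/my-project` is a placeholder registry.

```shell
# Sketch of the ": ${VAR:=default}" idiom used by install-tpu-dra-driver.sh.
# resolve_image is a hypothetical helper; its body runs in a subshell
# (note the parentheses) so the defaults never leak into the caller.
resolve_image() (
  # Each default is assigned only when the variable is unset or empty.
  : "${DRIVER_NAME:=tpu-dra-driver}"
  : "${DRIVER_IMAGE_REGISTRY:=gcr.io/gke-release-staging}"
  : "${DRIVER_IMAGE_NAME:=${DRIVER_NAME}}"
  : "${DRIVER_IMAGE_TAG:=master.0}"
  echo "${DRIVER_IMAGE_REGISTRY}/${DRIVER_IMAGE_NAME}:${DRIVER_IMAGE_TAG}"
)

# With nothing exported, the script's defaults apply.
resolve_image

# Override only the registry (placeholder value) for one invocation.
DRIVER_IMAGE_REGISTRY="example.registry.local/my-project" resolve_image
```

The same pattern applies to the real script: prefix the invocation with the variables to override, for example `DRIVER_IMAGE_TAG=<tag> ./install-tpu-dra-driver.sh`.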
97 changes: 97 additions & 0 deletions charts/tpu-dra-driver/templates/_helpers.tpl
@@ -0,0 +1,97 @@
{{/*
Expand the name of the chart.
*/}}
{{- define "tpu-dra-driver.name" -}}
{{- default .Chart.Name .Values.nameOverride | trunc 63 | trimSuffix "-" }}
{{- end }}

{{/*
Create a default fully qualified app name.
We truncate at 63 chars because some Kubernetes name fields are limited to this (by the DNS naming spec).
If release name contains chart name it will be used as a full name.
*/}}
{{- define "tpu-dra-driver.fullname" -}}
{{- if .Values.fullnameOverride -}}
{{- .Values.fullnameOverride | trunc 63 | trimSuffix "-" -}}
{{- else -}}
{{- $name := default .Chart.Name .Values.nameOverride -}}
{{- if contains $name .Release.Name -}}
{{- .Release.Name | trunc 63 | trimSuffix "-" -}}
{{- else -}}
{{- printf "%s-%s" .Release.Name $name | trunc 63 | trimSuffix "-" -}}
{{- end -}}
{{- end -}}
{{- end -}}

{{/*
Allow the release namespace to be overridden for multi-namespace deployments in combined charts
*/}}
{{- define "tpu-dra-driver.namespace" -}}
{{- if .Values.namespaceOverride -}}
{{- .Values.namespaceOverride -}}
{{- else -}}
{{- .Release.Namespace -}}
{{- end -}}
{{- end -}}

{{/*
Create chart name and version as used by the chart label.
*/}}
{{- define "tpu-dra-driver.chart" -}}
{{- $name := default .Chart.Name .Values.nameOverride -}}
{{- printf "%s-%s" $name .Chart.Version | replace "+" "_" | trunc 63 | trimSuffix "-" }}
{{- end }}

{{/*
Common labels
*/}}
{{- define "tpu-dra-driver.labels" -}}
helm.sh/chart: {{ include "tpu-dra-driver.chart" . }}
{{ include "tpu-dra-driver.templateLabels" . }}
{{- if .Chart.AppVersion }}
app.kubernetes.io/version: {{ .Chart.AppVersion | quote }}
{{- end }}
app.kubernetes.io/managed-by: {{ .Release.Service }}
{{- end }}

{{/*
Template labels
*/}}
{{- define "tpu-dra-driver.templateLabels" -}}
app.kubernetes.io/name: {{ include "tpu-dra-driver.name" . }}
app.kubernetes.io/instance: {{ .Release.Name }}
{{- if .Values.selectorLabelsOverride }}
{{ toYaml .Values.selectorLabelsOverride }}
{{- end }}
{{- end }}

{{/*
Selector labels
*/}}
{{- define "tpu-dra-driver.selectorLabels" -}}
{{- if .Values.selectorLabelsOverride -}}
{{ toYaml .Values.selectorLabelsOverride }}
{{- else -}}
{{ include "tpu-dra-driver.templateLabels" . }}
{{- end }}
{{- end }}

{{/*
Full image name with tag
*/}}
{{- define "tpu-dra-driver.fullimage" -}}
{{- $tag := printf "v%s" .Chart.AppVersion }}
{{- .Values.image.repository -}}:{{- .Values.image.tag | default $tag -}}
{{- end }}

{{/*
Create the name of the service account to use
*/}}
{{- define "tpu-dra-driver.serviceAccountName" -}}
{{- $name := printf "%s-service-account" (include "tpu-dra-driver.fullname" .) }}
{{- if .Values.serviceAccount.create }}
{{- default $name .Values.serviceAccount.name }}
{{- else }}
{{- default "default" .Values.serviceAccount.name }}
{{- end }}
{{- end }}
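The naming rules in `tpu-dra-driver.fullname` and `tpu-dra-driver.fullimage` are easy to misread in template form. Below is a rough shell rendition of the same logic for experimenting with inputs. The function names and arguments are hypothetical, and the sketch assumes ASCII names (truncation is done by `printf`).

```shell
# Hypothetical shell rendition of the "tpu-dra-driver.fullname" helper.
# $1: release name, $2: chart name (or nameOverride), $3: optional fullnameOverride
fullname() {
  release="$1"
  name="$2"
  override="${3:-}"
  if [ -n "$override" ]; then
    result="$override"
  else
    case "$release" in
      # Release name already contains the chart name: use it as-is.
      *"$name"*) result="$release" ;;
      # Otherwise join release and chart name with a dash.
      *) result="${release}-${name}" ;;
    esac
  fi
  result="$(printf '%.63s' "$result")"   # trunc 63
  printf '%s\n' "${result%-}"            # trimSuffix "-"
}

# Hypothetical rendition of "tpu-dra-driver.fullimage".
# $1: image.repository, $2: image.tag (may be empty), $3: chart appVersion
fullimage() {
  repo="$1"
  tag="$2"
  app_version="$3"
  # An empty tag falls back to "v" + appVersion, as in the template.
  printf '%s:%s\n' "$repo" "${tag:-v${app_version}}"
}

fullname tpu-dra-driver tpu-dra-driver
fullname prod tpu-dra-driver
fullimage gcr.io/gke-release-staging/tpu-dra-driver "" 0.1.0
```

Because the install script uses `tpu-dra-driver` as the release name, the first branch applies and the resource names carry no extra prefix.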
16 changes: 16 additions & 0 deletions charts/tpu-dra-driver/templates/clusterrole.yaml
@@ -0,0 +1,16 @@
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: {{ include "tpu-dra-driver.fullname" . }}-role
  namespace: {{ include "tpu-dra-driver.namespace" . }}
rules:
- apiGroups: ["resource.k8s.io"]
  resources: ["resourceclaims"]
  verbs: ["get"]
- apiGroups: [""]
  resources: ["nodes"]
  verbs: ["get"]
- apiGroups: ["resource.k8s.io"]
  resources: ["resourceslices"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
14 changes: 14 additions & 0 deletions charts/tpu-dra-driver/templates/clusterrolebinding.yaml
@@ -0,0 +1,14 @@
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: {{ include "tpu-dra-driver.fullname" . }}-role-binding
  namespace: {{ include "tpu-dra-driver.namespace" . }}
subjects:
- kind: ServiceAccount
  name: {{ include "tpu-dra-driver.serviceAccountName" . }}
  namespace: {{ include "tpu-dra-driver.namespace" . }}
roleRef:
  kind: ClusterRole
  name: {{ include "tpu-dra-driver.fullname" . }}-role
  apiGroup: rbac.authorization.k8s.io
8 changes: 8 additions & 0 deletions charts/tpu-dra-driver/templates/deviceclass.yaml
@@ -0,0 +1,8 @@
apiVersion: resource.k8s.io/v1beta1
kind: DeviceClass
metadata:
  name: tpu.google.com
spec:
  selectors:
  - cel:
      expression: device.driver == "tpu.google.com"
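Workloads would consume this DeviceClass through a ResourceClaim that names it. The sketch below prints such a manifest, assuming the same `resource.k8s.io/v1beta1` API version the chart uses. The claim name `single-tpu`, the request name `tpu`, and the `render_claim` helper are hypothetical and not part of this PR.

```shell
# Print a ResourceClaim manifest requesting a device from a DeviceClass.
# $1: DeviceClass name (defaults to the class defined by this chart).
# The claim and request names are placeholders.
render_claim() {
  device_class="${1:-tpu.google.com}"
  cat <<EOF
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaim
metadata:
  name: single-tpu
spec:
  devices:
    requests:
    - name: tpu
      deviceClassName: ${device_class}
EOF
}

render_claim
# To create the claim in a cluster with DRA enabled (requires kubectl):
#   render_claim | kubectl apply -f -
```

The claim only selects the class. Which devices actually match depends on the ResourceSlices the driver publishes for each node.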
123 changes: 123 additions & 0 deletions charts/tpu-dra-driver/templates/kubeletplugin.yaml
@@ -0,0 +1,123 @@
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: {{ include "tpu-dra-driver.fullname" . }}-kubeletplugin
  namespace: {{ include "tpu-dra-driver.namespace" . }}
  labels:
    {{- include "tpu-dra-driver.labels" . | nindent 4 }}
spec:
  selector:
    matchLabels:
      {{- include "tpu-dra-driver.selectorLabels" . | nindent 6 }}
  {{- with .Values.kubeletPlugin.updateStrategy }}
  updateStrategy:
    {{- toYaml . | nindent 4 }}
  {{- end }}
  template:
    metadata:
      {{- with .Values.kubeletPlugin.podAnnotations }}
      annotations:
        {{- toYaml . | nindent 8 }}
      {{- end }}
      labels:
        {{- include "tpu-dra-driver.templateLabels" . | nindent 8 }}
    spec:
      hostNetwork: true
      {{- if .Values.kubeletPlugin.priorityClassName }}
      priorityClassName: {{ .Values.kubeletPlugin.priorityClassName }}
      {{- end }}
      {{- with .Values.imagePullSecrets }}
      imagePullSecrets:
        {{- toYaml . | nindent 8 }}
      {{- end }}
      serviceAccountName: {{ include "tpu-dra-driver.serviceAccountName" . }}
      securityContext:
        {{- toYaml .Values.kubeletPlugin.podSecurityContext | nindent 8 }}
      containers:
      - name: plugin
        securityContext:
          {{- toYaml .Values.kubeletPlugin.containers.plugin.securityContext | nindent 10 }}
        image: {{ include "tpu-dra-driver.fullimage" . }}
        imagePullPolicy: {{ .Values.image.pullPolicy }}
        command: ["tpu-dra-kubeletplugin"]
        resources:
          {{- toYaml .Values.kubeletPlugin.containers.plugin.resources | nindent 10 }}
        env:
        - name: CDI_ROOT
          value: /var/run/cdi
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
        - name: POD_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: NODE_IP
          valueFrom:
            fieldRef:
              fieldPath: status.hostIP
        - name: DEVICE_CLASSES
          value: {{ .Values.deviceClasses | join "," }}
        volumeMounts:
        - name: plugins-registry
          mountPath: /var/lib/kubelet/plugins_registry
        - name: plugins
          mountPath: /var/lib/kubelet/plugins
        - name: cdi
          mountPath: /var/run/cdi
        - name: sys
          mountPath: /sys
        - name: proc
          mountPath: /proc
      volumes:
      - name: plugins-registry
        hostPath:
          path: /var/lib/kubelet/plugins_registry
      - name: plugins
        hostPath:
          path: /var/lib/kubelet/plugins
      - name: cdi
        hostPath:
          path: /var/run/cdi
      - name: dev
        hostPath:
          path: /dev
          type: DirectoryOrCreate
      - name: pod-resources
        hostPath:
          path: /var/lib/kubelet/pod-resources
          type: DirectoryOrCreate
      - name: tpu-env
        hostPath:
          path: /etc/tpu
          type: DirectoryOrCreate
      - name: tpu-logs
        hostPath:
          path: /tmp/tpu_logs
          type: DirectoryOrCreate
      - name: sys
        hostPath:
          path: /sys
          type: Directory
      - name: proc
        hostPath:
          path: /proc
          type: Directory
      {{- with .Values.kubeletPlugin.nodeSelector }}
      nodeSelector:
        {{- toYaml . | nindent 8 }}
      {{- end }}
      {{- with .Values.kubeletPlugin.affinity }}
      affinity:
        {{- toYaml . | nindent 8 }}
      {{- end }}
      {{- with .Values.kubeletPlugin.tolerations }}
      tolerations:
        {{- toYaml . | nindent 8 }}
      {{- end }}
13 changes: 13 additions & 0 deletions charts/tpu-dra-driver/templates/serviceaccount.yaml
@@ -0,0 +1,13 @@
{{- if .Values.serviceAccount.create -}}
apiVersion: v1
kind: ServiceAccount
metadata:
  name: {{ include "tpu-dra-driver.serviceAccountName" . }}
  namespace: {{ include "tpu-dra-driver.namespace" . }}
  labels:
    {{- include "tpu-dra-driver.labels" . | nindent 4 }}
  {{- with .Values.serviceAccount.annotations }}
  annotations:
    {{- toYaml . | nindent 4 }}
  {{- end }}
{{- end }}
31 changes: 31 additions & 0 deletions charts/tpu-dra-driver/templates/validatingadmissionpolicy.yaml
@@ -0,0 +1,31 @@
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicy
metadata:
  name: resourceslices-policy-{{ include "tpu-dra-driver.fullname" . }}
spec:
  failurePolicy: Fail
  matchConstraints:
    resourceRules:
    - apiGroups: ["resource.k8s.io"]
      apiVersions: ["v1beta1"]
      operations: ["CREATE", "UPDATE", "DELETE"]
      resources: ["resourceslices"]
  matchConditions:
  - name: isRestrictedUser
    expression: >-
      request.userInfo.username == "system:serviceaccount:{{ include "tpu-dra-driver.namespace" . }}:{{ include "tpu-dra-driver.serviceAccountName" . }}"
  variables:
  - name: userNodeName
    expression: >-
      request.userInfo.extra[?'authentication.kubernetes.io/node-name'][0].orValue('')
  - name: objectNodeName
    expression: >-
      (request.operation == "DELETE" ? oldObject : object).spec.?nodeName.orValue("")
  validations:
  - expression: variables.userNodeName != ""
    message: >-
      no node association found for user, this user must run in a pod on a node and ServiceAccountTokenPodNodeInfo must be enabled
  - expression: variables.userNodeName == variables.objectNodeName
    messageExpression: >-
      "this user running on node '"+variables.userNodeName+"' may not modify " +
      (variables.objectNodeName == "" ? "cluster resourceslices" : "resourceslices on node '"+variables.objectNodeName+"'")
@@ -0,0 +1,8 @@
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicyBinding
metadata:
  name: resourceslices-policy-{{ include "tpu-dra-driver.fullname" . }}
spec:
  policyName: resourceslices-policy-{{ include "tpu-dra-driver.fullname" . }}
  validationActions: [Deny]
  # All ResourceSlices are matched.
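The ValidatingAdmissionPolicy above applies only to requests made by the driver's service account, and it then checks two things in order: the requester's token must carry a node name, and that node must match the `spec.nodeName` of the ResourceSlice being changed. The shell function below is a hypothetical stand-in for that decision, useful for reasoning about which requests are denied. It does not talk to a cluster, and the real evaluation is done by the API server in CEL.

```shell
# Hypothetical stand-in for the policy's two validations.
# $1: node name from the requester's token ("" if absent)
# $2: spec.nodeName of the ResourceSlice ("" for a cluster-wide slice)
check_slice_access() {
  user_node="$1"
  object_node="$2"
  # First validation: the requester must be associated with a node.
  if [ -z "$user_node" ]; then
    echo "deny: no node association found for user"
    return 1
  fi
  # Second validation: the requester may only touch slices of its own node.
  if [ "$user_node" != "$object_node" ]; then
    if [ -z "$object_node" ]; then
      target="cluster resourceslices"
    else
      target="resourceslices on node '${object_node}'"
    fi
    echo "deny: this user running on node '${user_node}' may not modify ${target}"
    return 1
  fi
  echo "allow"
}

check_slice_access node-a node-a
check_slice_access node-a node-b || true
check_slice_access node-a "" || true
```

One consequence worth noting: a kubelet plugin pod can never create or delete a cluster-wide ResourceSlice (one without `spec.nodeName`), because its own node name is never empty.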