This project shows how to add a GPU-enabled node pool to an existing AKS cluster and how to autoscale and monitor GPU-enabled worker nodes.

Graphical processing units (GPUs) are often used for compute-intensive workloads such as graphics and visualization workloads. AKS supports the creation of GPU-enabled node pools to run these compute-intensive workloads in Kubernetes. For more information on GPUs in AKS, see Use GPUs for compute-intensive workloads on Azure Kubernetes Service (AKS). For more information on available GPU-enabled virtual machines, see GPU optimized VM sizes in Azure.

Azure supports creating a brand new AKS cluster with a GPU-enabled default node pool, as well as adding one or more GPU-enabled node pools to an existing cluster using, for example, the az aks nodepool add command. Before the GPUs in the nodes can be used, you must deploy a DaemonSet for the NVIDIA device plugin. This DaemonSet runs a pod on each node to provide the required drivers for the GPUs. As an alternative to these steps, AKS provides a specialized GPU image that already contains the NVIDIA device plugin for Kubernetes.

AKS Cluster Autoscaling

To keep up with application demands in Azure Kubernetes Service (AKS), you may need to adjust the number of GPU nodes that run your compute-intensive workloads. The AKS cluster autoscaler component can watch for pods in your cluster that can't be scheduled because of resource constraints. When issues are detected, the number of nodes in a node pool is increased to meet the application demand. Nodes are also regularly checked for a lack of running pods, with the number of nodes then decreased as needed. This ability to automatically scale up or down the number of nodes in your AKS cluster lets you run an efficient, cost-effective cluster.

To adjust to changing application demands, such as between the workday and evening or on a weekend, or when running on-demand, compute-intensive jobs, clusters often need a way to automatically scale out the number of nodes to schedule the increased number of pods. AKS clusters can scale in one of two ways:

  • The cluster autoscaler watches for pods that can't be scheduled on nodes because of resource constraints. The cluster then automatically increases the number of nodes.

  • The horizontal pod autoscaler uses the Metrics Server in a Kubernetes cluster to monitor the resource demand of pods. If an application needs more resources, the number of pods is automatically increased to meet the demand.

The cluster autoscaler and horizontal pod autoscaler often work together to support the required application demands.

Both the horizontal pod autoscaler and cluster autoscaler can also decrease the number of pods and nodes as needed. The cluster autoscaler decreases the number of nodes when there has been unused capacity for a period of time. Pods on a node to be removed by the cluster autoscaler are safely scheduled elsewhere in the cluster. For more information, see Automatically scale a cluster to meet application demands on Azure Kubernetes Service (AKS). Note: when the AKS cluster is composed of multiple node pools, the autoscaler needs to be activated separately for each node pool.
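As a minimal sketch of the note above, the cluster autoscaler can be turned on for one node pool at a time with az aks nodepool update. The helper names (run, enable_nodepool_autoscaler) and the DRY_RUN switch are hypothetical and not part of the scripts in the zip file; by default the helper only prints the command so you can review it first.

```shell
#!/bin/sh
# Hypothetical helper: print the command unless DRY_RUN=0, in which case run it.
run() {
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    else
        printf '%s\n' "$*"
    fi
}

# Enable the cluster autoscaler on a single node pool of an existing cluster.
# $1 = resource group, $2 = cluster name, $3 = node pool name,
# $4 = minimum node count, $5 = maximum node count
enable_nodepool_autoscaler() {
    run az aks nodepool update \
        --resource-group "$1" \
        --cluster-name "$2" \
        --name "$3" \
        --enable-cluster-autoscaler \
        --min-count "$4" \
        --max-count "$5"
}
```

For example, enable_nodepool_autoscaler MyResourceGroup MyAksCluster gpu 1 3 prints the command for a node pool called gpu (all three names are placeholders); repeat the call for every node pool that should scale, and set DRY_RUN=0 to actually execute it.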

Accelerated Networking

For AKS nodes, we recommend a minimum size of Standard_NC6. I strongly recommend using a GPU-enabled VM SKU that supports accelerated networking.

Accelerated networking greatly improves networking performance when accessing PaaS services such as Azure SQL Database, Azure Cosmos DB, or Storage Accounts by increasing throughput and reducing latency, jitter, and CPU utilization. Accelerated networking is particularly indicated for demanding network workloads on supported VM types.

Data Center GPU Manager

Monitoring stacks usually consist of a metrics collector, a time-series database to store metrics, and a visualization layer. A popular open-source stack is Prometheus, used along with Grafana as the visualization tool to create rich dashboards. Prometheus also includes Alertmanager to create and manage alerts. Prometheus is deployed along with kube-state-metrics and node_exporter to expose cluster-level metrics for Kubernetes API objects and node-level metrics such as CPU utilization. The figure below shows a sample architecture with Prometheus and Grafana.

Image showing the various components of a Prometheus + Grafana architecture for gathering telemetry, including the server, Alertmanager, and UI components.

Kubernetes includes experimental support for managing AMD and NVIDIA GPUs across several nodes. DCGM-Exporter is a tool based on the Go APIs to NVIDIA DCGM that allows users to gather GPU metrics and understand workload behavior or monitor GPUs in clusters. DCGM-Exporter is written in Go and exposes GPU metrics at an HTTP endpoint (/metrics) for monitoring solutions such as Prometheus. DCGM-Exporter is also configurable. You can customize the GPU metrics to be collected by DCGM by using an input configuration file in the .csv format. For more information on available metrics, see here and here.

DCGM-Exporter collects metrics for all available GPUs on a node. However, in Kubernetes, you might not necessarily know which GPUs in a node would be assigned to a pod when it requests GPU resources. Starting in v1.13, kubelet has added a device monitoring feature that lets you find out the devices assigned to a pod (pod name, pod namespace, and device ID) using a pod-resources socket. The HTTP server in DCGM-Exporter connects to the kubelet pod-resources server (/var/lib/kubelet/pod-resources) to identify the GPU devices running on a pod and appends the pod information of those GPU devices to the metrics collected.

Image showing the architecture of dcgm-exporter for gathering telemetry with Prometheus with the node-exporter, dcgm-exporter components, and service monitor components.

For more information on the NVIDIA Data Center GPU Manager, see the NVIDIA Data Center GPU Manager GitHub repo. For more information on the DCGM-Exporter, see the NVIDIA GPU Monitoring Tools GitHub repo. For information on the profiling metrics available from DCGM, refer to this section in the documentation. As an alternative to the DCGM-Exporter, you can use the NVIDIA GPU Operator.

DCGM Installation

To install DCGM and Prometheus + Grafana, follow the instructions at Integrating GPU Telemetry into Kubernetes. During the installation, I ran into the issues described in this post on Stack Overflow.

  • DCGM-Exporter is not configured to track the DCGM_FI_DEV_GPU_UTIL metric by default. This metric captures the GPU utilization. I solved the problem by creating a Dockerfile to build a custom Docker image based on the latest base image of the DCGM-Exporter. The Dockerfile uncomments the DCGM_FI_DEV_GPU_UTIL line in the configuration csv that contains the metrics collected by the DaemonSet on GPU nodes. You can build the Docker image with one script and push it to your Azure Container Registry (ACR) with another. Another approach is to create a new csv file containing the metrics the DCGM-Exporter should collect and export from GPU-enabled nodes.


# Variables

# Login to ACR
echo "Logging in to [$acrName] Azure Container Registry..."
az acr login --name ${acrName,,}

# Retrieve ACR login server. Each container image needs to be tagged with the loginServer name of the registry.
loginServer=$(az acr show --name $acrName --query loginServer --output tsv)

# Tag the local image with the loginServer of ACR
docker tag $imageName:$tag $loginServer/$imageName:$tag

# Push local container image to ACR
docker push $loginServer/$imageName:$tag

# Show the repository
echo "This is the [$imageName:$tag] container image in the [$acrName] Azure Container Registry:"
az acr repository show --name $acrName \
                       --image $imageName:$tag 
  • The DCGM-Exporter pod keeps restarting because the initial delay of its probes is too short. Since the initialDelaySeconds of the livenessProbe and readinessProbe is not parametrized in the original Helm chart of the DCGM-Exporter, you cannot override the value that is hardcoded in the template to increase the delay when deploying the chart. Hence, I downloaded and customized the chart that you can find in the zip file under the dcgm-exporter folder. The chart is also parametrized to use the custom image above. The Docker image has been registered in an Azure Container Registry used by the AKS cluster. You can deploy the customized Helm chart by using the script below, or install the original Helm chart instead.

# For more information, see 
# Also look at for metrics

# check if namespace exists in the cluster
result=$(kubectl get ns -o jsonpath="{.items[?(@.metadata.name=='$namespace')].metadata.name}")

if [[ -n $result ]]; then
    echo "$namespace namespace already exists in the cluster"
else
    echo "$namespace namespace does not exist in the cluster"
    echo "creating $namespace namespace in the cluster..."
    kubectl create namespace $namespace
fi

# Install Helm chart only if the release does not exist yet
result=$(helm list -n $namespace | grep $releaseName | awk '{print $1}')

if [[ -n $result ]]; then
    echo "[$releaseName] already exists in the [$namespace] namespace"
else
    # Install the Helm chart
    echo "Deploying [$releaseName] to the [$namespace] namespace..."
    helm install $releaseName $chartName \
        --namespace $namespace \
        --values values.yaml
fi

# List pods
kubectl get pods -n $namespace -o wide
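The fix for the first issue above boils down to uncommenting one line in the metrics csv. The following is a minimal sketch of that edit, not the Dockerfile from the zip file: the function name enable_dcgm_metric is hypothetical, and it assumes the csv uses "name, type, help" lines where a leading # marks a disabled metric.

```shell
#!/bin/sh
# Uncomment one metric in a DCGM-Exporter style counters csv file.
# $1 = metric name, for example DCGM_FI_DEV_GPU_UTIL
# $2 = path to the csv file (assumption: disabled metrics start with "#")
enable_dcgm_metric() {
    metric=$1
    file=$2
    tmp="${file}.tmp"
    # Strip the leading "#" and any whitespace only on the line for this metric.
    # A temporary file is used instead of "sed -i" to stay portable.
    sed "s/^#[[:space:]]*\(${metric},\)/\1/" "$file" > "$tmp" && mv "$tmp" "$file"
}
```

In a Dockerfile, an equivalent sed command could run in a RUN step against the csv shipped in the base image; the path of that file depends on the image version, so check it before relying on this.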

The Bash script in the zip file can be used to add a GPU-enabled node pool to an existing AKS cluster. The script:


# Variables

az aks nodepool show \
    --name $nodePoolName \
    --cluster-name $aksClusterName \
    --resource-group $resourceGroupName &>/dev/null

if [[ $? == 0 ]]; then
    echo "A node pool called [$nodePoolName] already exists in the [$aksClusterName] AKS cluster"
else
    echo "No node pool called [$nodePoolName] actually exists in the [$aksClusterName] AKS cluster"
    echo "Creating [$nodePoolName] node pool in the [$aksClusterName] AKS cluster..."

    if [[ -z $useAksCustomHeaders ]]; then
        # Standard VM image: the NVIDIA device plugin must be deployed separately
        az aks nodepool add \
            --name $nodePoolName \
            --cluster-name $aksClusterName \
            --resource-group $resourceGroupName \
            --enable-cluster-autoscaler \
            --node-vm-size $vmSize \
            --node-count $nodeCount \
            --min-count $minCount \
            --max-count $maxCount \
            --max-pods $maxPods \
            --node-taints $taints 1>/dev/null
    else
        # Specialized GPU image that already contains the NVIDIA device plugin
        echo "Using [UseGPUDedicatedVHD=true] AKS custom header..."
        az aks nodepool add \
            --name $nodePoolName \
            --cluster-name $aksClusterName \
            --resource-group $resourceGroupName \
            --enable-cluster-autoscaler \
            --node-vm-size $vmSize \
            --node-count $nodeCount \
            --min-count $minCount \
            --max-count $maxCount \
            --max-pods $maxPods \
            --node-taints $taints \
            --aks-custom-headers UseGPUDedicatedVHD=true 1>/dev/null
    fi

    if [[ $? == 0 ]]; then
        echo "[$nodePoolName] node pool successfully created in the [$aksClusterName] AKS cluster"
    else
        echo "Failed to create the [$nodePoolName] node pool in the [$aksClusterName] AKS cluster"
    fi
fi
  • Enables the cluster autoscaler on the new node pool, with a minimum of 1 node and a maximum of 3 nodes.

  • Uses the new specialized GPU image that already contains the NVIDIA device plugin for Kubernetes.

  • Adds a taint to GPU-enabled nodes: sku=gpu:NoSchedule. Hence, in order to run pods and jobs on this node pool, their definition needs to contain the following toleration:

- key: "sku"
  operator: "Equal"
  value: "gpu"
  effect: "NoSchedule"
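To make the relationship between the taint and the toleration explicit, here is a small sketch that turns a taint string of the form key=value:effect into the matching toleration. The function name taint_to_toleration is hypothetical, and the sketch assumes the taint always contains a value and an effect.

```shell
#!/bin/sh
# Print the toleration that matches a taint of the form key=value:effect.
taint_to_toleration() {
    taint=$1
    key=${taint%%=*}      # text before the first "="
    rest=${taint#*=}      # text after the first "="
    value=${rest%%:*}     # text before the first ":"
    effect=${rest#*:}     # text after the first ":"
    printf '%s\n' "- key: \"$key\""
    printf '%s\n' "  operator: \"Equal\""
    printf '%s\n' "  value: \"$value\""
    printf '%s\n' "  effect: \"$effect\""
}
```

Calling taint_to_toleration "sku=gpu:NoSchedule" prints the four lines shown above, which go under the tolerations key of the pod spec.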

If you deploy the GPU-enabled node pool using the standard VM image, the zip file also contains a script to deploy the NVIDIA device plugin defined in the nvidia-device-plugin-ds.yaml manifest.

As described at Integrating GPU Telemetry into Kubernetes, you can install Prometheus and Grafana using the kube-prometheus-stack Helm chart. If you already installed this Helm chart in your AKS cluster, you can use the script and prometheus.stack.values.yaml values file in the zip file to make the necessary changes in the current setup and in particular to configure Prometheus to scrape GPU metrics. Note: make sure to specify the namespace that hosts the DCGM-Exporter DaemonSet, in my case dcgm-exporter.

# AdditionalScrapeConfigs allows specifying additional Prometheus scrape configurations. Scrape configurations
# are appended to the configurations generated by the Prometheus Operator. Job configurations must have the form
# as specified in the official Prometheus documentation:
# As scrape configs are
# appended, the user is responsible to make sure it is valid. Note that using this feature may expose the possibility
# to break upgrades of Prometheus. It is advised to review Prometheus release notes to ensure that no incompatible
# scrape configs are going to break Prometheus after the upgrade.
# The scrape configuration example below will find master nodes, provided they have the name .*mst.*, relabel the
# port to 2379 and allow etcd scraping provided it is running on all Kubernetes master nodes
- job_name: gpu-metrics
  scrape_interval: 1s
  metrics_path: /metrics
  scheme: http
  kubernetes_sd_configs:
  - role: endpoints
    namespaces:
      names:
      - dcgm-exporter
  relabel_configs:
  - source_labels: [__meta_kubernetes_pod_node_name]
    action: replace
    target_label: kubernetes_node

Finally, you need to install the Prometheus Adapter for Kubernetes Metrics APIs, which provides an implementation of the Kubernetes resource metrics, custom metrics, and external metrics APIs. This adapter is therefore suitable for use with the autoscaling/v2 Horizontal Pod Autoscaler in Kubernetes 1.6+. You can use the script below to install the Prometheus Adapter for Kubernetes Metrics APIs.


# For more information, see 

# check if namespace exists in the cluster
result=$(kubectl get ns -o jsonpath="{.items[?(@.metadata.name=='$namespace')].metadata.name}")

if [[ -n $result ]]; then
    echo "$namespace namespace already exists in the cluster"
else
    echo "$namespace namespace does not exist in the cluster"
    echo "creating $namespace namespace in the cluster..."
    kubectl create namespace $namespace
fi

# Check if the repository is not already added
result=$(helm repo list | grep $repoName | awk '{print $1}')

if [[ -n $result ]]; then
    echo "[$repoName] Helm repo already exists"
else
    # Add the Helm repository
    echo "Adding [$repoName] Helm repo..."
    helm repo add $repoName $repoUrl
fi

# Update your local Helm chart repository cache
echo 'Updating Helm repos...'
helm repo update

# Install Helm chart only if the release does not exist yet
result=$(helm list -n $namespace | grep $releaseName | awk '{print $1}')

if [[ -n $result ]]; then
    echo "[$releaseName] already exists in the [$namespace] namespace"
else
    # Install the Helm chart
    echo "Deploying [$releaseName] to the [$namespace] namespace..."
    helm install $releaseName $repoName/$chartName \
        --namespace $namespace \
        --set rbac.create=true \
        --set prometheus.url=http://kube-prometheus-stack-prometheus.kube-prometheus-stack.svc.cluster.local \
        --set prometheus.port=9090
fi

# List pods
kubectl get pods -n $namespace -o wide

# After a few minutes you should be able to list metrics using the following command(s):
# kubectl get --raw /apis/
# Use grafana dashboard to see GPU metrics

In Kubernetes, to scale an application and provide a reliable service, you need to understand how the application behaves when it is deployed. You can examine application performance in a Kubernetes cluster by examining the containers, pods, services, and the characteristics of the overall cluster. Kubernetes provides detailed information about an application's resource usage at each of these levels. This information allows you to evaluate your application's performance and where bottlenecks can be removed to improve overall performance. In Kubernetes, application monitoring does not depend on a single monitoring solution. On new clusters, you can use resource metrics or full metrics pipelines to collect monitoring statistics. You can use the following commands to access the CPU and memory metrics via the API natively supported by Kubernetes.

# Get pods CPU and memory metrics
kubectl get --raw /apis/ | jq .

# Get nodes CPU and memory metrics
kubectl get --raw /apis/ | jq .

# Get CPU and memory metrics of the pods running the contoso namespace
kubectl get --raw /apis/ | jq .

Custom metrics can be accessed by invoking the custom metrics API:

# Get custom metrics
kubectl get --raw /apis/ | jq -r . 

# Get DCGM_FI_DEV_GPU_UTIL metrics
kubectl get --raw /apis/  | jq -r '.resources[] | select(.name | contains("DCGM_FI_DEV_GPU_UTIL"))'
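The API paths in the commands above are cut short in this copy of the document. As an assumption, not a restoration of the original commands, the sketch below builds paths for the standard Kubernetes resource metrics API group (metrics.k8s.io/v1beta1) and for the custom metrics API group (custom.metrics.k8s.io/v1beta1) that an adapter such as the Prometheus Adapter typically serves. The function names are hypothetical, and the API versions available in your cluster may differ.

```shell
#!/bin/sh
# Build a path under the resource metrics API (CPU and memory usage).
# $1 = "pods" or "nodes", $2 = optional namespace (only meaningful for pods)
resource_metrics_path() {
    base="/apis/metrics.k8s.io/v1beta1"
    if [ -n "${2:-}" ]; then
        printf '%s/namespaces/%s/%s\n' "$base" "$2" "$1"
    else
        printf '%s/%s\n' "$base" "$1"
    fi
}

# Root of the custom metrics API (assumed group and version).
custom_metrics_path() {
    printf '%s\n' "/apis/custom.metrics.k8s.io/v1beta1"
}
```

For example, kubectl get --raw "$(resource_metrics_path pods contoso)" | jq . would query the pod metrics of the contoso namespace, and kubectl get --raw "$(custom_metrics_path)" | jq . would list the custom metrics, provided these API groups are registered in the cluster.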

To review the metrics collected by the DCGM-Exporter, you can use the script that connects to one of the pods of the DaemonSet and retrieves the metrics from the HTTP server (http://localhost:9400/metrics) of DCGM-Exporter.
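Once you have the raw text from the /metrics endpoint, a small filter makes the GPU utilization per pod easier to read. This is a sketch, not the script from the zip file: the function name gpu_util_by_pod is hypothetical, and it assumes each DCGM_FI_DEV_GPU_UTIL sample carries a pod label, which depends on how the exporter is configured.

```shell
#!/bin/sh
# Read Prometheus text-format metrics on stdin and print "<pod> <value>"
# for every DCGM_FI_DEV_GPU_UTIL sample that carries a pod label.
gpu_util_by_pod() {
    grep '^DCGM_FI_DEV_GPU_UTIL{' |
        sed -n 's/.*pod="\([^"]*\)".*} *\([0-9.]*\)$/\1 \2/p'
}
```

For example, after forwarding port 9400 of one exporter pod to your machine, curl -s http://localhost:9400/metrics | gpu_util_by_pod prints one line per GPU sample.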

Finally, make sure to install the DCGM dashboard in Grafana.


You can proceed as follows to check that the GPU-enabled node pool autoscaling works as expected. You can run the script specifying the number of jobs that you want to run. The job is defined in the samples-tf-mnist-demo.yaml manifest and uses a toleration to run on a worker node in the GPU-enabled node pool.

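apiVersion: batch/v1
kind: Job
metadata:
  labels:
    app: samples-tf-mnist-demo
  name: samples-tf-mnist-demo
spec:
  template:
    metadata:
      labels:
        app: samples-tf-mnist-demo
    spec:
      containers:
      - name: samples-tf-mnist-demo
        # Image used by the AKS GPU documentation sample
        image: mcr.microsoft.com/azuredocs/samples-tf-mnist-demo:gpu
        args: ["--max_steps", "500"]
        imagePullPolicy: IfNotPresent
        resources:
          limits:
            nvidia.com/gpu: 1
      restartPolicy: OnFailure
      tolerations:
      - key: "sku"
        operator: "Equal"
        value: "gpu"
        effect: "NoSchedule"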

As specified under the resources section, the samples-tf-mnist-demo container requires a GPU. At the time of this writing, each container can request one or more GPUs. It is not possible to request a fraction of a GPU. For more information, see Schedule GPUs in the Kubernetes documentation.
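Running many copies of the job requires a unique name for each Job object. The following sketch shows one way to do this; it is not the test script from the zip file. The function name render_jobs is hypothetical, and it assumes the Job name appears in the manifest as a line with exactly two leading spaces ("  name: <job name>").

```shell
#!/bin/sh
# Print N copies of a Job manifest, each with a unique metadata name.
# $1 = manifest file, $2 = job name used in the manifest, $3 = number of copies
render_jobs() {
    manifest=$1
    name=$2
    count=$3
    i=1
    while [ "$i" -le "$count" ]; do
        # Rename only the top-level metadata name; container names stay as they are
        sed "s/^  name: ${name}\$/  name: ${name}-${i}/" "$manifest"
        printf '%s\n' '---'
        i=$((i + 1))
    done
}
```

For example, render_jobs samples-tf-mnist-demo.yaml samples-tf-mnist-demo 20 | kubectl apply -f - would submit 20 jobs in one call.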

Before starting the test, take note of the number of GPU nodes. In our case, the initial number of nodes was 1 and SKU was Standard_NC6.
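If you prefer the command line to the portal for taking note of the node count, a small filter over the kubectl output is enough. This is a sketch under assumptions: the function name count_ready_nodes is hypothetical, and it expects the default column layout of kubectl get nodes --no-headers, where the second column is the node status.

```shell
#!/bin/sh
# Count the lines on stdin whose second column is exactly "Ready".
# Nodes reported as "NotReady" or "Ready,SchedulingDisabled" are not counted.
count_ready_nodes() {
    awk '$2 == "Ready" { n++ } END { print n + 0 }'
}
```

For example, kubectl get nodes -l agentpool="$nodePoolName" --no-headers | count_ready_nodes counts the ready nodes of one node pool, assuming the nodes carry the agentpool label with the node pool name. Running it before, during, and after the test shows the node pool scaling out and back in.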


Run the script with a large number of jobs, such as 20-50. If the current worker nodes do not have enough capacity for that many jobs, the job pods remain in the Pending state.

If you configured the GPU-enabled node pool for autoscaling, the autoscaler will increase the number of nodes.

You can use Prometheus UI to see the GPU utilization (DCGM_FI_DEV_GPU_UTIL metric) of individual job containers.


The GPU metrics are also visible in either the NVIDIA DCGM Exporter Grafana dashboard or the Prometheus dashboard, which show the GPU utilization and the memory allocated while the application is running on the GPU.

After the jobs complete and the GPU nodes are no longer used by any workload, the number of GPU nodes scales back to the minimum.


Starting with agent version ciprod03022019, the Container insights integrated agent supports monitoring GPU (graphical processing unit) usage on GPU-aware Kubernetes cluster nodes, and monitoring pods and containers that request and use GPU resources.

Container insights supports monitoring GPU clusters from the following GPU vendors: NVIDIA and AMD.

Container insights automatically starts monitoring GPU usage on nodes, as well as pods and workloads that request GPUs, by collecting the following metrics at 60-second intervals and storing them in the InsightsMetrics table.

Metric name | Metric dimensions (tags) | Description
containerGpuDutyCycle | container.azm.ms/clusterId, container.azm.ms/clusterName, containerName, gpuId, gpuModel, gpuVendor | Percentage of time over the past sample period (60 seconds) during which the GPU was busy or actively processing for a container. Duty cycle is a number between 1 and 100.
containerGpuLimits | container.azm.ms/clusterId, container.azm.ms/clusterName, containerName | Each container can specify limits as one or more GPUs. It is not possible to request or limit a fraction of a GPU.
containerGpuRequests | container.azm.ms/clusterId, container.azm.ms/clusterName, containerName | Each container can request one or more GPUs. It is not possible to request or limit a fraction of a GPU.
containerGpumemoryTotalBytes | container.azm.ms/clusterId, container.azm.ms/clusterName, containerName, gpuId, gpuModel, gpuVendor | Amount of GPU memory in bytes available to use for a specific container.
containerGpumemoryUsedBytes | container.azm.ms/clusterId, container.azm.ms/clusterName, containerName, gpuId, gpuModel, gpuVendor | Amount of GPU memory in bytes used by a specific container.
nodeGpuAllocatable | container.azm.ms/clusterId, container.azm.ms/clusterName, gpuVendor | Number of GPUs in a node that can be used by Kubernetes.
nodeGpuCapacity | container.azm.ms/clusterId, container.azm.ms/clusterName, gpuVendor | Total number of GPUs in a node.

For example, the following Kusto Query:

let startDatetime = todatetime("2021-07-01 09:30:00.0");
let endDatetime = todatetime("2021-07-01 09:55:00.0");
let interval = 60s;
InsightsMetrics
| where Name == "containerGpuDutyCycle"
  and TimeGenerated  between(startDatetime .. endDatetime)
| summarize ["Average Container Gpu Duty Cycle"] = avg(Val) by bin(TimeGenerated, interval)
| render timechart

Returns a time chart of the percentage of time over the past sample period during which the GPU was busy or actively processing for a container.
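The same query shape works for the other GPU metrics that Container insights collects. As a sketch, the hypothetical helper below prints a Kusto query for a given metric name and time window; it assumes the metrics are stored in the InsightsMetrics table with the Name, Val, and TimeGenerated columns used in the query above.

```shell
#!/bin/sh
# Print a Kusto query that charts the average of one GPU metric over time.
# $1 = metric name, $2 = start time, $3 = end time, $4 = bin size (for example 60s)
gpu_metric_query() {
    cat <<EOF
InsightsMetrics
| where Name == "$1"
  and TimeGenerated between (todatetime("$2") .. todatetime("$3"))
| summarize avg(Val) by bin(TimeGenerated, $4)
| render timechart
EOF
}
```

For example, gpu_metric_query containerGpumemoryUsedBytes "2021-07-01 09:30:00" "2021-07-01 09:55:00" 60s prints a query you can paste into the Logs blade of the Log Analytics workspace.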


For more information, see Configure GPU monitoring with Container insights.


KEDA is a Kubernetes-based Event Driven Autoscaler. With KEDA, you can drive the scaling of any container in Kubernetes based on the number of events needing to be processed. KEDA is a single-purpose and lightweight component that can be added into any Kubernetes cluster. KEDA works alongside standard Kubernetes components like the Horizontal Pod Autoscaler and can extend functionality without overwriting or duplication. With KEDA you can explicitly map the apps you want to use event-driven scale, with other apps continuing to function. This makes KEDA a flexible and safe option to run alongside any number of other Kubernetes applications or frameworks. KEDA also supports Azure Monitor, which in turn supports GPU monitoring on AKS via Container insights. Using these two features together, it should be possible to use KEDA to scale out a GPU-enabled AKS cluster.


