This repository was archived by the owner on Nov 2, 2021. It is now read-only.

dcgm-exporter missing metrics for A100 GPU #166

Open
anaconda2196 opened this issue Mar 15, 2021 · 5 comments

Comments

@anaconda2196

GPU Machine: A100-PCIE-40GB.
[gpu-monitoring-tools-2.3.1]

I am using the latest release of dcgm-exporter (2.1.4-2.3.1-ubuntu18.04).

kubectl get pods -A
NAMESPACE              NAME                                                              READY   STATUS    RESTARTS   AGE
default                dcgm-exporter-1615787551-qc8dm                                    1/1     Running   0          86s

When running queries in Prometheus, I found that a few metrics are missing: DCGM_FI_DEV_GPU_UTIL, DCGM_FI_DEV_MEM_COPY_UTIL, DCGM_FI_DEV_ENC_UTIL, and DCGM_FI_DEV_DEC_UTIL.

I do see them enabled in default-counters.csv inside my running pod, though. Is this a bug, or are these metrics not supported on the A100 GPU?
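
One way to confirm which of these metrics are actually absent, independent of Prometheus, is to inspect the text the exporter serves and look for sample lines by name. The following is a minimal Python sketch, assuming the exporter output has been saved as text in the Prometheus exposition format. The function names and the sample lines (including their labels and values) are hypothetical and made up for illustration; they are not taken from a real A100 pod.

```python
import re

# The four metrics reported missing in this issue.
EXPECTED = [
    "DCGM_FI_DEV_GPU_UTIL",
    "DCGM_FI_DEV_MEM_COPY_UTIL",
    "DCGM_FI_DEV_ENC_UTIL",
    "DCGM_FI_DEV_DEC_UTIL",
]


def exposed_metric_names(exposition_text):
    """Return the set of metric names that have at least one sample line."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # The metric name is the leading identifier, before "{" or whitespace.
        match = re.match(r"[a-zA-Z_:][a-zA-Z0-9_:]*", line)
        if match:
            names.add(match.group(0))
    return names


def missing_metrics(exposition_text, expected=EXPECTED):
    """Return the expected metric names that have no sample in the text."""
    present = exposed_metric_names(exposition_text)
    return [name for name in expected if name not in present]


# Hypothetical exporter output, for illustration only.
SAMPLE = """\
# HELP DCGM_FI_DEV_SM_CLOCK SM clock frequency (in MHz).
# TYPE DCGM_FI_DEV_SM_CLOCK gauge
DCGM_FI_DEV_SM_CLOCK{gpu="0"} 1410
DCGM_FI_DEV_ENC_UTIL{gpu="0"} 0
"""

if __name__ == "__main__":
    for name in missing_metrics(SAMPLE):
        print("missing:", name)
```

If a metric is missing here as well, the gap is on the exporter side rather than in the Prometheus scrape configuration.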

I have checked other GPU machines (4 Tesla, V100), and everything looks good there; I am able to get all metrics.

Thank you in advance.

@crinavar

Hi Anaconda,
The metrics are working here on a DGX A100 we have. By chance, did you subdivide the GPUs as MIG devices? MIG GPUs are currently not detected for some metrics.

@dualvtable
Contributor

Hi guys - yes, we are working on adding MIG support into dcgm-exporter so we can do metric attribution to MIG devices. We hope to make a release in the next couple of weeks.

@supertetelman

Any update on this? I am also not seeing DCGM_FI_DEV_GPU_UTIL show up on the latest dcgm-exporter release. I am seeing this on DGX Stations with V100 and A100.

I am however seeing the other three metrics mentioned here.

This is running with version nvcr.io/nvidia/k8s/dcgm-exporter:2.1.8-2.4.0-rc.2-ubuntu20.04.

@jjhidalgar

Same issue here with the latest versions and all types of GPUs.
I just tried a previous version, "nvcr.io/nvidia/k8s/dcgm-exporter:2.0.13-2.1.2-ubuntu20.04", and got the GPU_UTIL metric back on all servers.

@jfolz

jfolz commented Jun 9, 2021

Is this maybe related to #143?
I.e., these metrics were turned off by default a while ago.
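
If that is the cause, one way to check is to look at whether the fields are commented out in the counters file the exporter is actually using. The following is a Python sketch under stated assumptions: that the file uses the `DCGM FIELD, Prometheus metric type, help message` layout, and that a disabled field appears as a line starting with `#`. The function name `split_counters` and the sample CSV content are hypothetical and not copied from any specific release.

```python
def split_counters(csv_text):
    """Split a counters CSV into (enabled, disabled) DCGM field names.

    Assumption: a disabled field is a line commented out with a leading "#".
    """
    enabled, disabled = [], []
    for raw in csv_text.splitlines():
        line = raw.strip()
        if not line:
            continue
        is_comment = line.startswith("#")
        # Drop the comment marker (if any) and take the first CSV column.
        body = line.lstrip("#").strip()
        field = body.split(",", 1)[0].strip()
        # Ignore headers and free-text comments that are not field names.
        if not field.startswith("DCGM_FI_"):
            continue
        (disabled if is_comment else enabled).append(field)
    return enabled, disabled


# Hypothetical file content, for illustration only.
SAMPLE_CSV = """\
# Format,,
# DCGM FIELD, Prometheus metric type, help message

# Utilization
# DCGM_FI_DEV_GPU_UTIL, gauge, GPU utilization (in %).
DCGM_FI_DEV_MEM_COPY_UTIL, gauge, Memory utilization (in %).
"""

if __name__ == "__main__":
    on, off = split_counters(SAMPLE_CSV)
    print("enabled: ", on)
    print("disabled:", off)
```

Running this against the file copied out of the pod would show whether DCGM_FI_DEV_GPU_UTIL is listed as enabled or commented out in the image in use.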
