Enable DCGM_FI_DEV_CLOCK_EVENTS_ACTIVE #254

Merged
merged 3 commits into from
Feb 21, 2024

Conversation

nvvfedorov

This PR adds a new DCGM Exporter metric: DCGM_FI_EXP_CLOCK_THROTTLE_REASONS_COUNT.

The new metric counts occurrences of DCGM_FI_DEV_CLOCK_THROTTLE_REASONS, broken down by reason, within a configurable time window.

Here is an example:

DCGM returns DCGM_FI_DEV_CLOCK_THROTTLE_REASONS with the value 96. The value 96 is the bitwise OR of two reasons: DCGM_CLOCKS_THROTTLE_REASON_SW_THERMAL|DCGM_CLOCKS_THROTTLE_REASON_HW_THERMAL (32 | 64). In this case, dcgm-exporter produces the following output:

DCGM_FI_EXP_CLOCK_THROTTLE_REASONS_COUNT{gpu="0",UUID="GPU-b9f9e81b-bee7-34bc-af17-132ef6592740",device="nvidia0",modelName="NVIDIA T400 4GB",Hostname="localhost",DCGM_FI_DRIVER_VERSION="545.29.02",throttle_reason="sw_thermal",window_size_in_ms="300000"} 1
DCGM_FI_EXP_CLOCK_THROTTLE_REASONS_COUNT{gpu="0",UUID="GPU-b9f9e81b-bee7-34bc-af17-132ef6592740",device="nvidia0",modelName="NVIDIA T400 4GB",Hostname="localhost",DCGM_FI_DRIVER_VERSION="545.29.02",throttle_reason="hw_thermal",window_size_in_ms="300000"} 1

The output reads as follows: during the last 5 minutes, two throttle reasons were detected: sw_thermal and hw_thermal.

The throttle_reason label may take the following values:

  • "gpu_idle"
  • "clocks_setting"
  • "power_cap"
  • "hw_slowdown"
  • "sync_boost"
  • "sw_thermal"
  • "hw_thermal"
  • "hw_power_brake"
  • "display_clocks"

You can find more details here: https://docs.nvidia.com/datacenter/dcgm/1.7/dcgm-api/group__dcgmFieldScope.html
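
The mapping from a raw DCGM_FI_DEV_CLOCK_THROTTLE_REASONS bitmask to the throttle_reason labels listed above can be sketched as follows. This is an illustration, not the exporter's actual code: the helper name decodeThrottleReasons is hypothetical, and the bit values are assumed to match the DCGM_CLOCKS_THROTTLE_REASON_* constants in the DCGM headers, so verify them against your DCGM version.

```go
package main

import "fmt"

// throttleReasons pairs each bit with the label used in this PR description.
// Assumption: bit values follow the DCGM_CLOCKS_THROTTLE_REASON_* constants.
var throttleReasons = []struct {
	bit   uint64
	label string
}{
	{0x1, "gpu_idle"},
	{0x2, "clocks_setting"},
	{0x4, "power_cap"},
	{0x8, "hw_slowdown"},
	{0x10, "sync_boost"},
	{0x20, "sw_thermal"},
	{0x40, "hw_thermal"},
	{0x80, "hw_power_brake"},
	{0x100, "display_clocks"},
}

// decodeThrottleReasons (hypothetical helper) returns the label of every
// reason bit that is set in mask, in ascending bit order.
func decodeThrottleReasons(mask uint64) []string {
	var labels []string
	for _, r := range throttleReasons {
		if mask&r.bit != 0 {
			labels = append(labels, r.label)
		}
	}
	return labels
}

func main() {
	// 96 = 0x20 | 0x40, the value used in the example above.
	fmt.Println(decodeThrottleReasons(96)) // prints [sw_thermal hw_thermal]
}
```

A mask of 0 yields no labels, which would correspond to no throttle_reason series being emitted for that sample.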

The time window can be configured with the "clock-throttle-reasons-count-window-size" parameter. The default value is 5 minutes.

Test steps:

  1. Enable DCGM_FI_EXP_CLOCK_THROTTLE_REASONS_COUNT in the ./etc/default-counters.csv file.

  2. Run dcgm-exporter against DCGM in standalone mode:

go run cmd/dcgm-exporter/main.go -f ./etc/default-counters.csv -r localhost:5555

  3. Inject CLOCK_THROTTLE_REASONS errors as described here: https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/dcgm-error-injection.html

dcgmi test --inject --gpuid 0 -f 112 -v 96

where 96 is DCGM_CLOCKS_THROTTLE_REASON_SW_THERMAL|DCGM_CLOCKS_THROTTLE_REASON_HW_THERMAL.

  4. Request the dcgm-exporter /metrics endpoint:

curl -v http://localhost:9400/metrics

Expected result:

You should see a response similar to the example below:

# TYPE DCGM_FI_EXP_CLOCK_THROTTLE_REASONS_COUNT gauge
DCGM_FI_EXP_CLOCK_THROTTLE_REASONS_COUNT{gpu="0",UUID="GPU-b9f9e81b-bee7-34bc-af17-132ef6592740",device="nvidia0",modelName="NVIDIA T400 4GB",Hostname="localhost",DCGM_FI_DRIVER_VERSION="545.29.02",throttle_reason="sw_thermal",window_size_in_ms="300000"} 1
DCGM_FI_EXP_CLOCK_THROTTLE_REASONS_COUNT{gpu="0",UUID="GPU-b9f9e81b-bee7-34bc-af17-132ef6592740",device="nvidia0",modelName="NVIDIA T400 4GB",Hostname="localhost",DCGM_FI_DRIVER_VERSION="545.29.02",throttle_reason="hw_thermal",window_size_in_ms="300000"} 1
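
To check the expected result programmatically, the response body can be scanned for the new metric. The sketch below is illustrative only: throttleReasonsIn is a hypothetical helper, it uses a simple regular expression rather than a full Prometheus text-format parser, and the sample body is shortened from the output above.

```go
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"strings"
)

const metricName = "DCGM_FI_EXP_CLOCK_THROTTLE_REASONS_COUNT"

// reasonLabel captures the value of the throttle_reason label on a metric line.
var reasonLabel = regexp.MustCompile(`throttle_reason="([^"]+)"`)

// throttleReasonsIn (hypothetical helper) returns the throttle_reason label
// of every DCGM_FI_EXP_CLOCK_THROTTLE_REASONS_COUNT sample line in body.
func throttleReasonsIn(body string) []string {
	var reasons []string
	sc := bufio.NewScanner(strings.NewReader(body))
	for sc.Scan() {
		line := sc.Text()
		// Skip comments such as "# TYPE ..." and any other metrics.
		if !strings.HasPrefix(line, metricName+"{") {
			continue
		}
		if m := reasonLabel.FindStringSubmatch(line); m != nil {
			reasons = append(reasons, m[1])
		}
	}
	return reasons
}

func main() {
	// Shortened sample of the /metrics response shown above.
	body := `# TYPE DCGM_FI_EXP_CLOCK_THROTTLE_REASONS_COUNT gauge
DCGM_FI_EXP_CLOCK_THROTTLE_REASONS_COUNT{gpu="0",throttle_reason="sw_thermal",window_size_in_ms="300000"} 1
DCGM_FI_EXP_CLOCK_THROTTLE_REASONS_COUNT{gpu="0",throttle_reason="hw_thermal",window_size_in_ms="300000"} 1
`
	fmt.Println(throttleReasonsIn(body)) // prints [sw_thermal hw_thermal]
}
```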

nvvfedorov force-pushed the enable-dcgm_fi_dev_clock_events_active branch from 33257c4 to 9e03913 on February 19, 2024 14:19
Signed-off-by: Vadym Fedorov <vfedorov@nvidia.com>
Signed-off-by: Vadym Fedorov <vfedorov@nvidia.com>
nvvfedorov force-pushed the enable-dcgm_fi_dev_clock_events_active branch from 9e03913 to fc3a561 on February 19, 2024 22:13

rohit-arora-dev left a comment

Thanks for addressing all the review comments. I have made 2-3 suggestions for the new file file_entity_group_system.

Signed-off-by: Vadym Fedorov <vfedorov@nvidia.com>

rohit-arora-dev left a comment

Thanks @nvvfedorov for addressing the review comments.

/LGTM

nvvfedorov merged commit 543d648 into main on Feb 21, 2024
1 check passed
nvvfedorov deleted the enable-dcgm_fi_dev_clock_events_active branch on February 21, 2024 18:42