Enable DCGM_FI_DEV_CLOCK_EVENTS_ACTIVE #254
Signed-off-by: Vadym Fedorov <vfedorov@nvidia.com>
Thanks for addressing all the review comments. I have made 2-3 suggestions for the new file file_entity_group_system.
Thanks @nvvfedorov for addressing the review comments.
/LGTM
This PR adds a new DCGM Exporter metric: DCGM_FI_EXP_CLOCK_THROTTLE_REASONS_COUNT.
The new metric counts the throttle reasons reported by DCGM_FI_DEV_CLOCK_THROTTLE_REASONS during a defined time window, broken down by reason.
Here is an example:
DCGM returns DCGM_FI_DEV_CLOCK_THROTTLE_REASONS with the value 96. The value 96 is the bitwise OR of the following reasons:
DCGM_CLOCKS_THROTTLE_REASON_SW_THERMAL|DCGM_CLOCKS_THROTTLE_REASON_HW_THERMAL
In this case, the DCGM-exporter will produce the following output:
The output can be read as follows: in the last 5 minutes, 2 throttle reasons were detected: sw_thermal and hw_thermal.
The throttle_reason labels may have the following values:
You can find more details here: https://docs.nvidia.com/datacenter/dcgm/1.7/dcgm-api/group__dcgmFieldScope.html
The time window can be configured with the "clock-throttle-reasons-count-window-size" parameter. The default value is 5 minutes.
Test steps:
Enable DCGM_FI_EXP_CLOCK_THROTTLE_REASONS_COUNT in the ./etc/default-counters.csv file.
Run dcgm-exporter with DCGM in standalone mode.
Where 96 is
DCGM_CLOCKS_THROTTLE_REASON_SW_THERMAL|DCGM_CLOCKS_THROTTLE_REASON_HW_THERMAL
Expected result:
You should see a response similar to the example below: