
Support for exposing PSI metrics #3083

Open: dqminh wants to merge 3 commits into master from the psi branch
Conversation

@dqminh (Contributor) commented Mar 23, 2022

Fix #3052

This depends on opencontainers/runc#3358, so it should not be merged as-is, but we can review the structure and how metrics are exposed.

@k8s-ci-robot (Collaborator) commented:
Hi @dqminh. Thanks for your PR.

I'm waiting for a google member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@dqminh force-pushed the psi branch 2 times, most recently from a870125 to 0a22793 on May 17, 2022 12:33
dqminh added 2 commits May 17, 2022 13:42
Signed-off-by: Daniel Dao <dqminh89@gmail.com>
This adds 2 new sets of metrics:
- `psi_total`: the total number of seconds a resource has been under pressure
- `psi_avg`: the ratio of time a resource was under pressure over a
  sliding time window.

For more details about these definitions, see:
- https://www.kernel.org/doc/html/latest/accounting/psi.html
- https://facebookmicrosites.github.io/psi/docs/overview

Signed-off-by: Daniel Dao <dqminh89@gmail.com>
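For reference (illustration only, not code from this PR): a minimal Go sketch of parsing the cgroup v2 pressure-file format these metrics are read from, assuming the standard `some`/`full` line layout shown later in this thread. The file path and sample values are hypothetical.

```
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// psiLine holds one parsed line ("some" or "full") of a cgroup v2
// pressure file, e.g. /sys/fs/cgroup/<path>/cpu.pressure.
type psiLine struct {
	Kind                 string  // "some" or "full"
	Avg10, Avg60, Avg300 float64 // pressure ratios over sliding windows (percent)
	TotalMicroseconds    uint64  // cumulative stalled time in microseconds
}

// parsePressure parses content such as:
//   some avg10=0.00 avg60=0.05 avg300=0.20 total=68139509
//   full avg10=0.00 avg60=0.00 avg300=0.00 total=40148380
func parsePressure(content string) ([]psiLine, error) {
	var out []psiLine
	for _, line := range strings.Split(strings.TrimSpace(content), "\n") {
		fields := strings.Fields(line)
		if len(fields) != 5 {
			return nil, fmt.Errorf("unexpected line: %q", line)
		}
		p := psiLine{Kind: fields[0]}
		for _, kv := range fields[1:] {
			parts := strings.SplitN(kv, "=", 2)
			if len(parts) != 2 {
				continue
			}
			switch parts[0] {
			case "avg10":
				p.Avg10, _ = strconv.ParseFloat(parts[1], 64)
			case "avg60":
				p.Avg60, _ = strconv.ParseFloat(parts[1], 64)
			case "avg300":
				p.Avg300, _ = strconv.ParseFloat(parts[1], 64)
			case "total":
				p.TotalMicroseconds, _ = strconv.ParseUint(parts[1], 10, 64)
			}
		}
		out = append(out, p)
	}
	return out, nil
}

func main() {
	sample := "some avg10=0.00 avg60=0.05 avg300=0.20 total=68139509\n" +
		"full avg10=0.00 avg60=0.00 avg300=0.00 total=40148380"
	lines, _ := parsePressure(sample)
	for _, l := range lines {
		// psi_total is exposed in seconds, so convert from microseconds.
		fmt.Printf("%s: %.3fs under pressure in total\n", l.Kind, float64(l.TotalMicroseconds)/1e6)
	}
}
```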
@dqminh force-pushed the psi branch 3 times, most recently from 4c5c036 to 57ac945 on May 20, 2022 11:08
This adds support for reading PSI metrics via Prometheus. We expose the
following for `psi_total`:

```
container_cpu_psi_total_seconds
container_memory_psi_total_seconds
container_io_psi_total_seconds
```

And for `psi_avg`:

```
container_cpu_psi_avg10_ratio
container_cpu_psi_avg60_ratio
container_cpu_psi_avg300_ratio

container_memory_psi_avg10_ratio
container_memory_psi_avg60_ratio
container_memory_psi_avg300_ratio

container_io_psi_avg10_ratio
container_io_psi_avg60_ratio
container_io_psi_avg300_ratio
```

Signed-off-by: Daniel Dao <dqminh89@gmail.com>
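For illustration only (not part of this PR): a minimal Go sketch of what a scrape of the `psi_total` counters could look like, assuming the `kind` label with values `some`/`full` used elsewhere in this change. The numeric values are made up.

```
package main

import "fmt"

func main() {
	// Hypothetical PSI "total" values in seconds, already converted from
	// the kernel's microsecond counters.
	totals := []struct {
		name string
		kind string
		secs float64
	}{
		{"container_cpu_psi_total_seconds", "some", 68.139},
		{"container_cpu_psi_total_seconds", "full", 40.148},
		{"container_memory_psi_total_seconds", "some", 1.204},
		{"container_memory_psi_total_seconds", "full", 0.512},
		{"container_io_psi_total_seconds", "some", 3.377},
		{"container_io_psi_total_seconds", "full", 2.001},
	}
	// Print Prometheus text-exposition style lines with the "kind" label.
	for _, t := range totals {
		fmt.Printf("%s{kind=%q} %.3f\n", t.name, t.kind, t.secs)
	}
}
```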
@szuecs commented Nov 4, 2022

@bobbypage does this need to wait longer? Is anything missing?
I ask because there is an internal request to support this feature. :)

@dqminh it likely needs a rebase; maybe you can do it :)

@bobbypage (Collaborator) commented:
We are still waiting for opencontainers/runc#3358 to be merged in runc...

@szuecs commented Nov 14, 2022

@bobbypage not sure; should I take over the PR?
@dqminh thoughts?

john-liuqiming pushed a commit to cloud-native-observability/cadvisor that referenced this pull request Apr 4, 2023
Signed-off-by: liuqiming.lqm <liuqiming.lqm@alibaba-inc.com>
@SuperQ (Contributor) commented Apr 17, 2023

FYI, it's unnecessary, and not a best practice, to expose the pre-computed averages in Prometheus.

Prometheus can compute arbitrary averages from the Total data.
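To illustrate that point (a sketch, not code from this PR): the ratio the kernel reports as avg10/avg60/avg300 can be recovered from two samples of the total counter, which is what `rate()` over the `*_psi_total_seconds` series does in Prometheus.

```
package main

import (
	"fmt"
	"time"
)

// pressureRatio returns the fraction of wall-clock time spent under
// pressure between two samples of a PSI "total" counter (in seconds).
// This mirrors what rate() computes from the exposed counters.
func pressureRatio(prevTotal, curTotal float64, prev, cur time.Time) float64 {
	window := cur.Sub(prev).Seconds()
	if window <= 0 {
		return 0
	}
	return (curTotal - prevTotal) / window
}

func main() {
	t0 := time.Date(2022, 5, 17, 12, 0, 0, 0, time.UTC)
	t1 := t0.Add(60 * time.Second)
	// 1.5 stalled seconds accumulated over a 60s window -> 0.025 (2.5%).
	fmt.Printf("%.3f\n", pressureRatio(10.0, 11.5, t0, t1))
}
```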

@nmlc commented Aug 23, 2023

Any news here? It seems like the runc PSI PR got merged.

if includedMetrics.Has(container.PSITotalMetrics) {
c.containerMetrics = append(c.containerMetrics, []containerMetric{
{
name: "container_cpu_psi_total_seconds",

Review comment (Contributor). Suggested change:
- name: "container_cpu_psi_total_seconds",
+ name: "container_pressure_cpu_seconds_total",

return getPSIValues(s, &s.Cpu.PSI, "total")
},
}, {
name: "container_memory_psi_total_seconds",

Review comment (Contributor). Suggested change:
- name: "container_memory_psi_total_seconds",
+ name: "container_pressure_memory_seconds_total",

return getPSIValues(s, &s.Memory.PSI, "total")
},
}, {
name: "container_io_psi_total_seconds",

Review comment (Contributor). Suggested change:
- name: "container_io_psi_total_seconds",
+ name: "container_pressure_io_seconds_total",

Comment on lines +1801 to +1827
if includedMetrics.Has(container.PSIAvgMetrics) {
	makePSIAvgMetric := func(controller, window string) containerMetric {
		return containerMetric{
			name:        fmt.Sprintf("container_%s_psi_avg%s_ratio", controller, window),
			help:        fmt.Sprintf("Ratio of time spent under %s pressure over time window of %s seconds", controller, window),
			valueType:   prometheus.GaugeValue,
			extraLabels: []string{"kind"},
			getValues: func(s *info.ContainerStats) metricValues {
				switch controller {
				case "cpu":
					return getPSIValues(s, &s.Cpu.PSI, "avg"+window)
				case "memory":
					return getPSIValues(s, &s.Memory.PSI, "avg"+window)
				case "io":
					return getPSIValues(s, &s.DiskIo.PSI, "avg"+window)
				default:
					return nil
				}
			},
		}
	}
	for _, controller := range []string{"cpu", "memory", "io"} {
		for _, window := range []string{"10", "60", "300"} {
			c.containerMetrics = append(c.containerMetrics, makePSIAvgMetric(controller, window))
		}
	}
}
Review comment (Contributor):

These metrics are unnecessary in Prometheus as we can compute averages from the counters. Please remove them to avoid excess metric cardinality.

Suggested change: remove the entire PSIAvgMetrics block above.

Review comment:
I believe this may be resolved since the 10/60/300 averages are inherent to the exposed PSI data?

case "total":
// total is reported in microseconds; convert it to seconds for the metric value
v = append(v, metricValue{value: float64(time.Duration(psi.Some.Total)*time.Microsecond) / float64(time.Second), timestamp: s.Timestamp, labels: []string{"some"}})
v = append(v, metricValue{value: float64(time.Duration(psi.Full.Total)*time.Microsecond) / float64(time.Second), timestamp: s.Timestamp, labels: []string{"full"}})
Review comment (Contributor):

Note that for CPU, we don't need to expose "full". In practice, I've only found the "some" metrics to be useful. The "some" value is a superset of "full". IMO we should just include "some" to reduce the cardinality.

See the PSI docs:

CPU full is undefined at the system level, but has been reported since 5.13, so it is set to zero for backward compatibility.

@ein-stein-chen commented Oct 31, 2023

I would like to make the case for exposing "CPU full" metrics at the container (cgroup) level.

While "CPU full" is undefined at the system level, values are reported for individual cgroups:

$ cat /sys/fs/cgroup/user.slice/cpu.pressure
some avg10=0.00 avg60=0.05 avg300=0.20 total=68139509
full avg10=0.00 avg60=0.00 avg300=0.00 total=40148380

Exposing both might be useful to differentiate between cgroups that try to execute more work than the compute time available to them (indicated by "CPU some") and cgroups that are fully blocked/stalled (perhaps because of other cgroups and/or process priority, …; indicated by "CPU full").

Review comment (Contributor):

Interesting, thanks for this info. Are there kernel docs that document the per-cgroup behavior?

IMO this should be a separate metric name, rather than a label. The reason is that since some is inclusive of full, doing something like sum(rate(container_pressure_cpu_seconds_total[1m])) would be confusing.

I would suggest these two metric names:

- container_pressure_cpu_seconds_total
- container_pressure_cpu_full_seconds_total
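A small Go sketch (illustration only, with made-up numbers) of why keeping "some" and "full" under one metric name with a label can mislead aggregations, since "some" already includes the fully-stalled time:

```
package main

import "fmt"

func main() {
	// Hypothetical stalled seconds over a 1m window for one container.
	some := 1.5 // at least one task stalled
	full := 0.4 // all tasks stalled (already counted within "some")

	// Summing across a "kind" label, as sum(rate(...)) would, double-counts
	// the fully-stalled time because it is included in "some".
	fmt.Printf("sum over kinds: %.1fs (misleading)\n", some+full)
	fmt.Printf("separate series: some=%.1fs, full=%.1fs\n", some, full)
}
```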

@szuecs commented Oct 31, 2023

@bobbypage can you set ok-to-test and do a review?
It would be great to see this in cAdvisor.

@bobbypage (Collaborator) commented:
Waiting for a new runc release to include opencontainers/runc@1aa7ca8

@MathieuCesbron commented:
Can we have an update on this? I would like to access the PSI interface. Thanks, guys.

@dims (Collaborator) commented Feb 23, 2024

@MathieuCesbron did you see the previous comment (the one before yours)?

google.golang.org/grpc v1.33.2
k8s.io/klog/v2 v2.4.0
k8s.io/utils v0.0.0-20211116205334-6203023598ed
)

replace github.com/opencontainers/runc => github.com/dqminh/runc v0.0.0-20220513155811-6414629ada8a
Review comment (Collaborator):

Please remove this and use the upstream runc as-is.

@zouyee (Contributor) commented Apr 7, 2024

runc has released 1.2.0-rc.1

@zouyee (Contributor) commented Apr 17, 2024

@bobbypage not sure should I take over the PR? @dqminh thoughts?

@szuecs take over?

@SuperQ (Contributor) commented Apr 17, 2024

I would be happy to take over this change; I would like to make sure it aligns well with best practices. As it is, it does not.

@szuecs commented Apr 17, 2024

There is only a release candidate, so I would wait until there's a proper release.
I don't mind if someone else does it.

@akgoel18 commented Jun 6, 2024

@szuecs Any update on this? I'm asking because this is a heavily requested feature internally. :)

@szuecs commented Jun 6, 2024

@akgoel18 please check the upstream project's release cycle, thanks.

@pacoxu (Contributor) commented Jul 18, 2024

opencontainers/runc#3900 was merged and will be released with runc 1.2.0

@pacoxu (Contributor) commented Oct 22, 2024

runc v1.2.0 was released today: https://github.com/opencontainers/runc/releases/tag/v1.2.0

Do you have time to bump runc?

@dims (Collaborator) commented Oct 22, 2024

@pacoxu when containerd 2.0 gets out, the runc used with it will be 1.2.0, so as long as we only have containerd 1.6/1.7 with k8s, we should stick to the older runc(s), both as a binary and for vendoring.

@haircommander (Contributor) commented:
@dims why do cadvisor/k8s and containerd need to have in-sync runc versions?

@dims (Collaborator) commented Oct 22, 2024

@dims why do cadvisor/k8s and containerd need to have in-sync runc versions?

Bad things have happened before, @haircommander; for example, see opencontainers/runc#3849

@alexandremahdhaoui commented:
@dims @haircommander to address and prevent the issue of "out of sync" runc versions, we could add a CI check.

NB: I wrote a tool (usage) in kubernetes-sigs/container-runtime to verify and ensure that specified Go modules stay in sync with some upstream modules. I'd be glad to open a PR here or in k/k to add such a check.

@rexagod commented Nov 12, 2024

Since the backports are in now, I think we can go ahead with this (incorporating @SuperQ's suggestions above)?

@SuperQ (Contributor) commented Nov 12, 2024

I might go ahead and make a fork/alternative to this specific PR.

@enp0s3 commented Jan 18, 2025

@SuperQ Hi, do you plan to take this over?

@SuperQ (Contributor) commented Jan 18, 2025

I was hoping I would have time over the holidays, but I ended up working on other projects. I could take a look soon. But as-is, I don't think this PR should be merged.

@xinau (Contributor) commented Jan 20, 2025

@SuperQ I would be happy to help with the development of this feature. Is it basically a rebase of this PR plus resolving the review comments? Or is there more to be done?

@rexagod commented Jan 21, 2025

@xinau That sounds good for now. I'd suggest opening a PR with those suggestions; any additional changes can be made incrementally upon review.

xinau added a commit to xinau/cadvisor that referenced this pull request Jan 26, 2025
issues: google#3052, google#3083, kubernetes/enhancements#4205

This change adds metrics for pressure stall information, which indicate
why some or all tasks of a cgroup v2 have waited due to resource
congestion (cpu, memory, io). The change exposes this information by
including the _PSIStats_ of each controller in its stats, i.e.
_CPUStats.PSI_, _MemoryStats.PSI_ and _DiskStats.PSI_.

The information is additionally exposed as Prometheus metrics. The
metrics follow the naming outlined by the prometheus/node-exporter,
where stalled corresponds to full and waiting to some.

```
container_pressure_cpu_stalled_seconds_total
container_pressure_cpu_waiting_seconds_total
container_pressure_memory_stalled_seconds_total
container_pressure_memory_waiting_seconds_total
container_pressure_io_stalled_seconds_total
container_pressure_io_waiting_seconds_total
```

Signed-off-by: Felix Ehrenpfort <felix@ehrenpfort.de>
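A tiny Go sketch (illustration only) of the naming scheme described above, mapping each controller and PSI kind to the node-exporter style metric name:

```
package main

import "fmt"

// pressureMetricName maps a PSI controller and kind to the
// node-exporter style metric name, where "full" is exposed as
// "stalled" and "some" as "waiting".
func pressureMetricName(controller, kind string) string {
	suffix := map[string]string{"full": "stalled", "some": "waiting"}[kind]
	return fmt.Sprintf("container_pressure_%s_%s_seconds_total", controller, suffix)
}

func main() {
	for _, controller := range []string{"cpu", "memory", "io"} {
		for _, kind := range []string{"some", "full"} {
			fmt.Println(pressureMetricName(controller, kind))
		}
	}
}
```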
@xinau (Contributor) commented Jan 26, 2025

I've opened #3649 with the requested rebase and addressed the review comments.


Successfully merging this pull request may close these issues.

Expose Pressure Stall Information as metrics