Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support fetching PSI stats for cgroupv2 containers [dnm / carry] #3627

Closed
wants to merge 9 commits into from

Conversation

thaJeztah
Copy link
Member

Testing my suggestions from #3358 / dqminh#35

@kolyshkin
Copy link
Contributor

Retrieving https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
Preparing...                          ########################################
	package epel-release-7-14.noarch is already installed

Ah! they fixed it and now our CI is b0rked again.

@kolyshkin
Copy link
Contributor

Retrieving https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
Preparing...                          ########################################
	package epel-release-7-14.noarch is already installed

Ah! they fixed it and now our CI is b0rked again.

Fixed by #3628

return nil, fmt.Errorf("invalid PSI value: %q", f)
return nil, fmt.Errorf("invalid PSI value (%s): %q", kv[0], f)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is redundant, since f already contains both the key and the value. So,

before:

invalid PSI value: "avg10=foo"

after:

invalid PSI value (avg10): "avg10=foo"

So, I'd remove this commit entirely.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤦‍♂️ you're right! Yup, will remove

I'll also squash all commits if things progress, but thought it would help review describing my steps

@@ -21,6 +24,11 @@ type Stats struct {
NetworkInterfaces []*NetworkInterface `json:"network_interfaces"`
}

type (
PSIData = cgroups.PSIData
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, not sure if we need the aliases at all; I need to look into history to understand why we did so for the other ones (separate types from libcontainer); perhaps we don't need those either.

It may have been to allow external consumer to use the "types" package without requiring libcontainer, but if that was the design, we already broke that.

If, on the other hand, nobody is expected to consume "types", we may not need all the duplicated definitions and we could just alias or use them directly

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We expect types to be consistent i.e. not depending on libcontainer internals, i..e consumer of runc events exect the format to be consistent even though we may have change libcontainer/cgroups for some reasons.

So some of the duplication is intentional, and some differences are also intentional i.e. camelCase vs snake_case.

dqminh and others added 7 commits October 8, 2022 12:50
We read output from the following files if they exists:
- cpu.pressure
- memory.pressure
- io.pressure

Each are in format:

```
some avg10=0.00 avg60=0.00 avg300=0.00 total=0
full avg10=0.00 avg60=0.00 avg300=0.00 total=0
```

Signed-off-by: Daniel Dao <dqminh89@gmail.com>
Signed-off-by: Daniel Dao <dqminh89@gmail.com>
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
So that no conversion is needed

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
@thaJeztah
Copy link
Member Author

thaJeztah commented Oct 8, 2022

Failure on centos stream 9;
https://github.com/opencontainers/runc/pull/3627/checks?check_run_id=8779692013

# runc events --stats test_busybox (status=1):
# time="2022-10-08T10:57:34Z" level=error msg="unable to get container cgroup stats: error while statting cgroup v2: [unable to parse /sys/fs/cgroup/system.slice/runc-test_busybox.scope/cpu.pressure: read /sys/fs/cgroup/system.slice/runc-test_busybox.scope/cpu.pressure: operation not supported unable to parse /sys/fs/cgroup/system.slice/runc-test_busybox.scope/memory.pressure: read /sys/fs/cgroup/system.slice/runc-test_busybox.scope/memory.pressure: operation not supported unable to parse /sys/fs/cgroup/system.slice/runc-test_busybox.scope/io.pressure: read /sys/fs/cgroup/system.slice/runc-test_busybox.scope/io.pressure: operation not supported]"
ok 55 events --stats with psi data # skip test requires psi

Wrapped for readability:

unable to get container cgroup stats:
error while statting cgroup v2:

[
    unable to parse /sys/fs/cgroup/system.slice/runc-test_busybox.scope/cpu.pressure:
    read /sys/fs/cgroup/system.slice/runc-test_busybox.scope/cpu.pressure:
    operation not supported
    
    unable to parse /sys/fs/cgroup/system.slice/runc-test_busybox.scope/memory.pressure:
    read /sys/fs/cgroup/system.slice/runc-test_busybox.scope/memory.pressure:
    operation not supported
    
    unable to parse /sys/fs/cgroup/system.slice/runc-test_busybox.scope/io.pressure:
    read /sys/fs/cgroup/system.slice/runc-test_busybox.scope/io.pressure: operation not supported
 ]

So, the interesting bit may be that the error is on read; could it be that the cgroups.OpenFile() works, but the scan fails?

func statPSI(dirPath string, file string, stats *cgroups.PSIStats) error {
	f, err := cgroups.OpenFile(dirPath, file, os.O_RDONLY)
	if err != nil {
		if errors.Is(err, os.ErrNotExist) || errors.Is(err, syscall.ENOTSUP) {
			// open *.pressure file returns
			// - ErrNotExist when kernel < 4.20 or CONFIG_PSI is disabled
			// - ENOTSUP when we requires psi=1 in kernel command line to enable PSI support
			return nil
		}
		return err
	}
	defer f.Close()

	sc := bufio.NewScanner(f)
	for sc.Scan() {
		// etc
	}

@thaJeztah
Copy link
Member Author

Still not sure what's failing on Fedora (tests seem to pass, but then it marks it as failed (exit status 1))

Moves error handling to this function, so that the caller doesn't
have to be aware of which errors to ignore.

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
@kolyshkin
Copy link
Contributor

=== RUN   TestNilResources
--- FAIL: TestNilResources (0.00s)
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
	panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x59c2b6]
goroutine 6 [running]:
testing.tRunner.func1.2({0x5cc6e0, 0x75b020})
	/usr/lib/golang/src/testing/testing.go:1389 +0x24e
testing.tRunner.func1()
	/usr/lib/golang/src/testing/testing.go:1392 +0x39f
panic({0x5cc6e0, 0x75b020})
	/usr/lib/golang/src/runtime/panic.go:838 +0x207
github.com/opencontainers/runc/libcontainer/cgroups/fs2.statPSI({0xc00001a180, 0x29}, {0x6009ca, 0xc}, 0x0)
	/vagrant/libcontainer/cgroups/fs2/psi.go:33 +0x1f6
github.com/opencontainers/runc/libcontainer/cgroups/fs2.(*manager).GetStats(0xc000054060)
	/vagrant/libcontainer/cgroups/fs2/fs2.go:122 +0x5fe
github.com/opencontainers/runc/libcontainer/cgroups/manager.TestNilResources(0xc00007b1e0)
	/vagrant/libcontainer/cgroups/manager/manager_test.go:40 +0x1ad
testing.tRunner(0xc00007b1e0, 0x60e8d0)
	/usr/lib/golang/src/testing/testing.go:1439 +0x102
created by testing.(*T).Run
	/usr/lib/golang/src/testing/testing.go:1486 +0x35f
FAIL	github.com/opencontainers/runc/libcontainer/cgroups/manager	0.014s

This is the test case I've added in commit 57edce4659725c7c4a9727e7cca5e9bc27898572to make sure we don't panic if resources were not provided.

@kolyshkin
Copy link
Contributor

if err := statPSI(m.dirPath, "cpu.pressure", st.CpuStats.PSI); err != nil {

Apparently the PSI pointer is never initialized so it's nil.

Perhaps it's better to revert the struct to pointer-to-struct conversion patch (and ditch the omitempty as it does not make sense).

@kolyshkin
Copy link
Contributor

@thaJeztah PTAL ^^^ (prev comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants