fix systemd-notify when using a different PID namespace #1308

giuseppe · 2017-02-03T14:29:03Z

The current support of systemd-notify has a race condition as the
message send to the systemd notify socket might be dropped if the sender
process is not running by the time systemd checks for the sender of the
datagram. A proper fix of this in systemd would require changes to the
kernel to maintain the cgroup of the sender process when it is dead (but
it is not probably going to happen...)
Generally, the solution to this issue is to specify the PID in the
message itself so that systemd has not to guess the sender, but this
wouldn't work when running in a PID namespace as the container will pass
the PID known in its namespace (something like PID=1,2,3..) and systemd
running on the host is not able to map it to the runc service.

The proposed solution is to have a proxy in runc that forwards the
messages to the host systemd.

Example of this issue:

projectatomic/atomic-system-containers#24

Signed-off-by: Giuseppe Scrivano gscrivan@redhat.com

crosbymichael · 2017-02-04T00:10:48Z

How does sd_notify work when you have Type=forking in your service file? Wouldn't this be the same type of problem? Couldn't you have runc write the pid for the container process to fix the issue?

Type=forking
ExecStart=/usr/local/sbin/runc run -d --pid-file /run/runc/%i.pid %i
PIDFile=/run/runc/%i.pid

giuseppe · 2017-02-04T00:49:49Z

systemd doesn't wait for the notify message before it considers the service running when Type=forking is used. To verify it in my test container I have a sleep before calling sd_notify, using Type=forking and runc run -d, I get:

# time systemctl restart test
real	0m0.035s
user	0m0.004s
sys	0m0.010s

Differently, if I use Type=notify:

# time systemctl restart test

real	0m1.498s
user	0m0.004s
sys	0m0.013s

This is with my patched version, with the current runc I hit the race condition almost every time, so systemd misses the message and the systemctl start hangs and eventually time outs

crosbymichael · 2017-02-04T00:54:30Z

Ok, thanks. I cannot think of any other ways to solve this problem other systemd fixing their issues.

giuseppe · 2017-02-04T00:57:37Z

an alternative is to mount /run/runc/%i.pid inside the container so that sd_notify could use that PID (I have not tried but I can't see why it shouldn't work). But this would be a quite ugly hack...

rhatdan · 2017-02-04T11:26:00Z

@shishir-a412ed PTAL

@giuseppe Have you talked to systemd maintainers about this issue?

giuseppe · 2017-02-04T20:43:16Z

@rhatdan, yes, I've discussed this issue with @poettering

cyphar · 2017-02-06T08:46:54Z

Yeah, this is way too tightly integrating systemd into runC's container creation code. Surely there's a way to get systemd to fix their method of handling Type=forking management.

giuseppe · 2017-02-06T09:35:30Z

well, the advantage of fixing it here is that we won't need to change programs that are already using sd_notify. Any other fix will probably require changes to the systemd notify protocol.

poettering · 2017-02-06T11:59:44Z

I have no clue on runc, but note that on the systemd side, due to kernel limitations we can't unconditionally and securely match all sd_notify() messages back to the services they come from, except if they are sent from the "main" process of a service.

Specifically, if a process exits immediately after enqueing the message, and it is not the "main" process of the service, then systemd will have trouble to use the SCM_CREDENTIALS data (specifically: the PID field of it) the kernel attaches to the message for looking up the service in /proc/$PID, because that directory might already have been removed.

This problem does not exist for the "main" process of a service, as systemd tracks that process explicitly, and always knows which service it belongs to.

I don't know runc, but as I understand it wants to permit processes running inside a container to notify the init system about readiness. To make this reliable there a two options:

Forward the notification in runc: make runc take the message from the container payload, validate it, maybe enhance it, and propagate it up. This is what has been proposed here, afaics.
Make the container payload process the "main" process of the service. There are multiple ways to do this. For example, you could write out the PID in in a PID file, and use systemd's PIDFile= setting to make it read that. You can also dynamically change the main PID of a service by issuing an sd_notify() message with the MAINPID= field. Note that doing this is kinda problematic however: if you do this you should also reparent the new main process to PID 1, so that it will properly get SIGCHLD for it (UNIX...). systemd cannot properly track processes that aren't its immediate children.

I personally would recommend option 1. If you do this, you could even pass interesting additional information to systemd, which would show this in "systemctl status", for example a STATUS= text, augmenting the information that the container sends.

But anway, I have no stake in this, I am just giving background, why we can't "fix our method of handling sd_notify()".

And yeah, this is a well-known issue, and there have been at least two attempts to fix this in the kernel, but no success...

poettering · 2017-02-06T17:24:33Z

(BTW, one more thing: for security reasons, you really want to go the proxy way btw. I figure your container payload is less trusted than your container manager, right? In that case you really shouldn't permit it to send MAINPID= with arbitrary PIDs via sd_notify(), because you can do evil shit with that. That means you should sanitize whatever the container wants to send, and only propagate safe stuff)

crosbymichael · 2017-02-06T18:20:30Z

The best way to handle this is some type of proxy. We don't want to be giving a container an external pid so the question is, should this be handled by runc or should you have some type of shim process that launches runc -d and does the proxy?

giuseppe · 2017-02-07T10:30:47Z

I've done some changes so that the new code gets less in the way of the existing signals handling.

Also, I added a new patch to address the issues @poettering reported, now I send back to the host only READY= and STATUS= messages.

@crosbymichael IMHO, there is not really much new code/logic to deserve a shim process. These patches simply fix the existing implementation of the sdnotify integration which is already in runc.

yuqi-zhang · 2017-02-07T22:17:33Z

I tested a bit with some system containers and the functionality looks good.

crosbymichael · 2017-02-08T18:59:42Z

@giuseppe it maybe "a little code" to fix it but it changes a lot of things. If you want to use sd_notify you have to keep runc running at all times during the container's execution and you cannot use detach.

I was just asking because runc has a larger memory footprint and didn't know if you are fine with 100s of runcs running all the time for this functionality or if a shim is a better place for this where it could be written in C to reduce memory usage. Also maybe having an sd_notify proxy in a container runtime is not the right place for this type of functionality.

rhatdan · 2017-02-08T19:21:36Z

No chance of having runc exist after it delivers the sd_notify?

giuseppe · 2017-02-08T21:04:30Z

we could exit after we receive READY=1 but doing that we miss the possibility of setting customs STATUS= messages, which is probably fine since we already provide a subset of the functionalities (with the second patch).

Please keep in mind though that this functionality is affecting only containers that are using Type=notify (where systemd sets the NOTIFY_SOCKET env variable), other containers won't have a running proxy running all the time. I expect only system services that are already using it, such as Flannel in the case of Atomic system containers, to keep doing it in the same way in a container.

My suggestion is:

change the second patch to support only READY=1 and drop STATUS=
exit runc when READY=1 is received

Do you agree with this?

giuseppe · 2017-02-08T22:49:55Z

I had a look at this, and it seems that it will never work like I was suggesting in my previous comment.
Type=notify expects the service to not fork itself, so we need to keep the runc process alive.

We should still keep suggesting to use Type=forking and runc -d, the notify stuff must be the exception used only when necessary.

poettering · 2017-02-08T23:09:31Z

I had a look at this, and it seems that it will never work like I was suggesting in my previous comment.
Type=notify expects the service to not fork itself, so we need to keep the runc process alive.

You can change the main PID of a service at runtime by sending the PID= message, and first making sure the kid is properly reparented to PID 1.

giuseppe · 2017-02-09T15:46:07Z

pushed a patch to support detach also when using Type=notify and exit runc once we receive READY=1

crosbymichael · 2017-02-09T17:57:55Z

@giuseppe nice! i'll test today

giuseppe · 2017-02-14T17:49:06Z

@crosbymichael have you had a chance to enjoy the notifications to systemd from a runc container? :)

crosbymichael · 2017-02-14T18:04:37Z

signals.go

+		var out bytes.Buffer
+		for _, line := range bytes.Split(buf[0:r], []byte{'\n'}) {
+			if bytes.HasPrefix(line, []byte("READY=")) {
+				out.Write(line)


We should probably be checking errors in this entire func

crosbymichael · 2017-02-14T18:05:22Z

signals.go

@@ -30,7 +35,9 @@ func newSignalHandler(enableSubreaper bool) *signalHandler {
 	// handle all signals for the process.
 	signal.Notify(s)
 	return &signalHandler{
-		signals: s,
+		signals:          s,
+		notifySocket:     notifySocket,


can you create a struct for this, we have the two fields and are passing them around everywhere. A struct will clean up the code alot and you can add methods for the socket there.

crosbymichael

Overall looks good. I think if you add a struct with methods for the notify socket it will clean up the code a lot as its currently spread everywhere.

poettering · 2017-02-14T18:13:12Z

(btw, in case you are looking for more integration of runc and systemd: it probably makes sense to set $container properly for your containers, so that systemd can properly recognize it, see the table here:
https://github.com/systemd/systemd/blob/master/src/basic/virt.c#L390 — i'd be happy to add the matching counterpart to systemd.)

crosbymichael · 2017-02-14T18:18:02Z

@poettering I think since runc is meant to be used by a higher level system, the caller can inject the $container env var for their system.

giuseppe · 2017-02-15T00:16:46Z

@crosbymichael thanks for the review. I've implemented the requested changes in this new revision ⬆️

mrunalp · 2017-02-17T16:14:06Z

@poettering Going by what's already in the code, I guess systemd can probably recognize runc containers when VIRTUALIZATION_RUNC is set?
@crosbymichael Are we okay with just documenting something like that in the runc documentation for systemd integration? It is up to the caller to set it in the config.

poettering · 2017-02-17T16:42:59Z

@mrunalp hmm? what do you mean? on the systemd side we won't recognize arbitrary container managers, and there's no specific entry for "runc" defined either. I was proposing adding that.

mrunalp · 2017-02-17T17:42:28Z

@poettering I meant a +1 to your proposal to add code to systemd to recognize runc.
Also, I propose that we add that information to the runc documentation on systemd integration, so runc users know what to set in their config.json.

crosbymichael · 2017-02-22T19:11:07Z

The code is still spread out across many different files. Half of the implementation is in one file and the other half is in another. Can you please clean this up? You can create a new file for the notify socket implementation, have a single constructor that returns it, then have proxy method on it that handles the proxying of data.

The current support of systemd-notify has a race condition as the message send to the systemd notify socket might be dropped if the sender process is not running by the time systemd checks for the sender of the datagram. A proper fix of this in systemd would require changes to the kernel to maintain the cgroup of the sender process when it is dead (but it is not probably going to happen...) Generally, the solution to this issue is to specify the PID in the message itself so that systemd has not to guess the sender, but this wouldn't work when running in a PID namespace as the container will pass the PID known in its namespace (something like PID=1,2,3..) and systemd running on the host is not able to map it to the runc service. The proposed solution is to have a proxy in runc that forwards the messages to the host systemd. Example of this issue: projectatomic/atomic-system-containers#24 Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>

Accept only READY= notify messages from the container. Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>

let runc run until READY= is received and then proceed with detaching the process. Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>

giuseppe · 2017-02-22T21:36:07Z

@crosbymichael I've pushed a new version with all the new code is in notify_socket.go

crosbymichael · 2017-02-22T21:39:29Z

@giuseppe nice, Thank you.

LGTM

giuseppe · 2017-02-22T22:48:57Z

@mrunalp, does it look fine to you?

mrunalp · 2017-02-24T19:05:10Z

LGTM

giuseppe mentioned this pull request Feb 3, 2017

Update flannel to use sd-notify projectatomic/atomic-system-containers#24

Closed

giuseppe force-pushed the fix-systemd-notify branch 2 times, most recently from ab91bd6 to 8dd4b6c Compare February 3, 2017 21:05

poettering mentioned this pull request Feb 6, 2017

man: document that sd_notify() is racy in some cases systemd/systemd#5239

Merged

giuseppe force-pushed the fix-systemd-notify branch from 8dd4b6c to 5d1aa7e Compare February 7, 2017 10:19

giuseppe force-pushed the fix-systemd-notify branch 2 times, most recently from 29b8ab7 to 00dffbc Compare February 9, 2017 12:28

crosbymichael reviewed Feb 14, 2017

View reviewed changes

crosbymichael requested changes Feb 14, 2017

View reviewed changes

giuseppe force-pushed the fix-systemd-notify branch from 00dffbc to ef6b29c Compare February 15, 2017 00:15

giuseppe force-pushed the fix-systemd-notify branch from ef6b29c to 7d6060a Compare February 22, 2017 21:25

giuseppe added 3 commits February 22, 2017 22:27

sanitize systemd-notify message

c8593c4

Accept only READY= notify messages from the container. Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>

signals: support detach and notify socket together

d5026f0

let runc run until READY= is received and then proceed with detaching the process. Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>

giuseppe force-pushed the fix-systemd-notify branch from 7d6060a to d5026f0 Compare February 22, 2017 21:28

crosbymichael approved these changes Feb 22, 2017

View reviewed changes

mrunalp merged commit 899b074 into opencontainers:master Feb 24, 2017

kolyshkin mentioned this pull request Dec 8, 2021

notify_socket.go: use sd_notify_barrier mechanism #3291

Merged

fix systemd-notify when using a different PID namespace #1308

fix systemd-notify when using a different PID namespace #1308

Conversation

giuseppe commented Feb 3, 2017

crosbymichael commented Feb 4, 2017

giuseppe commented Feb 4, 2017

crosbymichael commented Feb 4, 2017

giuseppe commented Feb 4, 2017

rhatdan commented Feb 4, 2017

giuseppe commented Feb 4, 2017

cyphar commented Feb 6, 2017

giuseppe commented Feb 6, 2017

poettering commented Feb 6, 2017 • edited Loading

poettering commented Feb 6, 2017

crosbymichael commented Feb 6, 2017

giuseppe commented Feb 7, 2017

yuqi-zhang commented Feb 7, 2017

crosbymichael commented Feb 8, 2017

rhatdan commented Feb 8, 2017

giuseppe commented Feb 8, 2017

giuseppe commented Feb 8, 2017

poettering commented Feb 8, 2017

giuseppe commented Feb 9, 2017

crosbymichael commented Feb 9, 2017

giuseppe commented Feb 14, 2017

crosbymichael Feb 14, 2017 • edited Loading

Choose a reason for hiding this comment

crosbymichael Feb 14, 2017

Choose a reason for hiding this comment

crosbymichael left a comment

Choose a reason for hiding this comment

poettering commented Feb 14, 2017

crosbymichael commented Feb 14, 2017

giuseppe commented Feb 15, 2017

mrunalp commented Feb 17, 2017

poettering commented Feb 17, 2017

mrunalp commented Feb 17, 2017

crosbymichael commented Feb 22, 2017

giuseppe commented Feb 22, 2017

crosbymichael commented Feb 22, 2017 • edited by caniszczyk Loading

giuseppe commented Feb 22, 2017

mrunalp commented Feb 24, 2017 • edited by caniszczyk Loading

poettering commented Feb 6, 2017 •

edited

Loading

crosbymichael Feb 14, 2017 •

edited

Loading

crosbymichael commented Feb 22, 2017 •

edited by caniszczyk

Loading

mrunalp commented Feb 24, 2017 •

edited by caniszczyk

Loading