Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix systemd-notify when using a different PID namespace #1308

Merged
merged 3 commits into from
Feb 24, 2017

Conversation

giuseppe
Copy link
Member

@giuseppe giuseppe commented Feb 3, 2017

The current support of systemd-notify has a race condition as the
message send to the systemd notify socket might be dropped if the sender
process is not running by the time systemd checks for the sender of the
datagram. A proper fix of this in systemd would require changes to the
kernel to maintain the cgroup of the sender process when it is dead (but
it is not probably going to happen...)
Generally, the solution to this issue is to specify the PID in the
message itself so that systemd has not to guess the sender, but this
wouldn't work when running in a PID namespace as the container will pass
the PID known in its namespace (something like PID=1,2,3..) and systemd
running on the host is not able to map it to the runc service.

The proposed solution is to have a proxy in runc that forwards the
messages to the host systemd.

Example of this issue:

projectatomic/atomic-system-containers#24

Signed-off-by: Giuseppe Scrivano gscrivan@redhat.com

@crosbymichael
Copy link
Member

How does sd_notify work when you have Type=forking in your service file? Wouldn't this be the same type of problem? Couldn't you have runc write the pid for the container process to fix the issue?

Type=forking
ExecStart=/usr/local/sbin/runc run -d --pid-file /run/runc/%i.pid %i
PIDFile=/run/runc/%i.pid

@giuseppe
Copy link
Member Author

giuseppe commented Feb 4, 2017

systemd doesn't wait for the notify message before it considers the service running when Type=forking is used. To verify it in my test container I have a sleep before calling sd_notify, using Type=forking and runc run -d, I get:

# time systemctl restart test
real	0m0.035s
user	0m0.004s
sys	0m0.010s

Differently, if I use Type=notify:

# time systemctl restart test

real	0m1.498s
user	0m0.004s
sys	0m0.013s

This is with my patched version, with the current runc I hit the race condition almost every time, so systemd misses the message and the systemctl start hangs and eventually time outs

@crosbymichael
Copy link
Member

Ok, thanks. I cannot think of any other ways to solve this problem other systemd fixing their issues.

@giuseppe
Copy link
Member Author

giuseppe commented Feb 4, 2017

an alternative is to mount /run/runc/%i.pid inside the container so that sd_notify could use that PID (I have not tried but I can't see why it shouldn't work). But this would be a quite ugly hack...

@rhatdan
Copy link
Contributor

rhatdan commented Feb 4, 2017

@shishir-a412ed PTAL

@giuseppe Have you talked to systemd maintainers about this issue?

@giuseppe
Copy link
Member Author

giuseppe commented Feb 4, 2017

@rhatdan, yes, I've discussed this issue with @poettering

@cyphar
Copy link
Member

cyphar commented Feb 6, 2017

Yeah, this is way too tightly integrating systemd into runC's container creation code. Surely there's a way to get systemd to fix their method of handling Type=forking management.

@giuseppe
Copy link
Member Author

giuseppe commented Feb 6, 2017

well, the advantage of fixing it here is that we won't need to change programs that are already using sd_notify. Any other fix will probably require changes to the systemd notify protocol.

@poettering
Copy link

poettering commented Feb 6, 2017

I have no clue on runc, but note that on the systemd side, due to kernel limitations we can't unconditionally and securely match all sd_notify() messages back to the services they come from, except if they are sent from the "main" process of a service.

Specifically, if a process exits immediately after enqueing the message, and it is not the "main" process of the service, then systemd will have trouble to use the SCM_CREDENTIALS data (specifically: the PID field of it) the kernel attaches to the message for looking up the service in /proc/$PID, because that directory might already have been removed.

This problem does not exist for the "main" process of a service, as systemd tracks that process explicitly, and always knows which service it belongs to.

I don't know runc, but as I understand it wants to permit processes running inside a container to notify the init system about readiness. To make this reliable there a two options:

  1. Forward the notification in runc: make runc take the message from the container payload, validate it, maybe enhance it, and propagate it up. This is what has been proposed here, afaics.
  2. Make the container payload process the "main" process of the service. There are multiple ways to do this. For example, you could write out the PID in in a PID file, and use systemd's PIDFile= setting to make it read that. You can also dynamically change the main PID of a service by issuing an sd_notify() message with the MAINPID= field. Note that doing this is kinda problematic however: if you do this you should also reparent the new main process to PID 1, so that it will properly get SIGCHLD for it (UNIX...). systemd cannot properly track processes that aren't its immediate children.

I personally would recommend option 1. If you do this, you could even pass interesting additional information to systemd, which would show this in "systemctl status", for example a STATUS= text, augmenting the information that the container sends.

But anway, I have no stake in this, I am just giving background, why we can't "fix our method of handling sd_notify()".

And yeah, this is a well-known issue, and there have been at least two attempts to fix this in the kernel, but no success...

@poettering
Copy link

(BTW, one more thing: for security reasons, you really want to go the proxy way btw. I figure your container payload is less trusted than your container manager, right? In that case you really shouldn't permit it to send MAINPID= with arbitrary PIDs via sd_notify(), because you can do evil shit with that. That means you should sanitize whatever the container wants to send, and only propagate safe stuff)

@crosbymichael
Copy link
Member

The best way to handle this is some type of proxy. We don't want to be giving a container an external pid so the question is, should this be handled by runc or should you have some type of shim process that launches runc -d and does the proxy?

@giuseppe
Copy link
Member Author

giuseppe commented Feb 7, 2017

I've done some changes so that the new code gets less in the way of the existing signals handling.

Also, I added a new patch to address the issues @poettering reported, now I send back to the host only READY= and STATUS= messages.

@crosbymichael IMHO, there is not really much new code/logic to deserve a shim process. These patches simply fix the existing implementation of the sdnotify integration which is already in runc.

@yuqi-zhang
Copy link

I tested a bit with some system containers and the functionality looks good.

@crosbymichael
Copy link
Member

@giuseppe it maybe "a little code" to fix it but it changes a lot of things. If you want to use sd_notify you have to keep runc running at all times during the container's execution and you cannot use detach.

I was just asking because runc has a larger memory footprint and didn't know if you are fine with 100s of runcs running all the time for this functionality or if a shim is a better place for this where it could be written in C to reduce memory usage. Also maybe having an sd_notify proxy in a container runtime is not the right place for this type of functionality.

@rhatdan
Copy link
Contributor

rhatdan commented Feb 8, 2017

No chance of having runc exist after it delivers the sd_notify?

@giuseppe
Copy link
Member Author

giuseppe commented Feb 8, 2017

we could exit after we receive READY=1 but doing that we miss the possibility of setting customs STATUS= messages, which is probably fine since we already provide a subset of the functionalities (with the second patch).

Please keep in mind though that this functionality is affecting only containers that are using Type=notify (where systemd sets the NOTIFY_SOCKET env variable), other containers won't have a running proxy running all the time. I expect only system services that are already using it, such as Flannel in the case of Atomic system containers, to keep doing it in the same way in a container.

My suggestion is:

  1. change the second patch to support only READY=1 and drop STATUS=
  2. exit runc when READY=1 is received

Do you agree with this?

@giuseppe
Copy link
Member Author

giuseppe commented Feb 8, 2017

I had a look at this, and it seems that it will never work like I was suggesting in my previous comment.
Type=notify expects the service to not fork itself, so we need to keep the runc process alive.

We should still keep suggesting to use Type=forking and runc -d, the notify stuff must be the exception used only when necessary.

@poettering
Copy link

I had a look at this, and it seems that it will never work like I was suggesting in my previous comment.
Type=notify expects the service to not fork itself, so we need to keep the runc process alive.

You can change the main PID of a service at runtime by sending the PID= message, and first making sure the kid is properly reparented to PID 1.

@giuseppe giuseppe force-pushed the fix-systemd-notify branch 2 times, most recently from 29b8ab7 to 00dffbc Compare February 9, 2017 12:28
@giuseppe
Copy link
Member Author

giuseppe commented Feb 9, 2017

pushed a patch to support detach also when using Type=notify and exit runc once we receive READY=1

@crosbymichael
Copy link
Member

@giuseppe nice! i'll test today

@giuseppe
Copy link
Member Author

@crosbymichael have you had a chance to enjoy the notifications to systemd from a runc container? :)

signals.go Outdated
var out bytes.Buffer
for _, line := range bytes.Split(buf[0:r], []byte{'\n'}) {
if bytes.HasPrefix(line, []byte("READY=")) {
out.Write(line)
Copy link
Member

@crosbymichael crosbymichael Feb 14, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably be checking errors in this entire func

signals.go Outdated
@@ -30,7 +35,9 @@ func newSignalHandler(enableSubreaper bool) *signalHandler {
// handle all signals for the process.
signal.Notify(s)
return &signalHandler{
signals: s,
signals: s,
notifySocket: notifySocket,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you create a struct for this, we have the two fields and are passing them around everywhere. A struct will clean up the code alot and you can add methods for the socket there.

Copy link
Member

@crosbymichael crosbymichael left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good. I think if you add a struct with methods for the notify socket it will clean up the code a lot as its currently spread everywhere.

@poettering
Copy link

(btw, in case you are looking for more integration of runc and systemd: it probably makes sense to set $container properly for your containers, so that systemd can properly recognize it, see the table here:
https://github.com/systemd/systemd/blob/master/src/basic/virt.c#L390 — i'd be happy to add the matching counterpart to systemd.)

@crosbymichael
Copy link
Member

@poettering I think since runc is meant to be used by a higher level system, the caller can inject the $container env var for their system.

@giuseppe
Copy link
Member Author

@crosbymichael thanks for the review. I've implemented the requested changes in this new revision ⬆️

@mrunalp
Copy link
Contributor

mrunalp commented Feb 17, 2017

@poettering Going by what's already in the code, I guess systemd can probably recognize runc containers when VIRTUALIZATION_RUNC is set?
@crosbymichael Are we okay with just documenting something like that in the runc documentation for systemd integration? It is up to the caller to set it in the config.

@poettering
Copy link

@mrunalp hmm? what do you mean? on the systemd side we won't recognize arbitrary container managers, and there's no specific entry for "runc" defined either. I was proposing adding that.

@mrunalp
Copy link
Contributor

mrunalp commented Feb 17, 2017

@poettering I meant a +1 to your proposal to add code to systemd to recognize runc.
Also, I propose that we add that information to the runc documentation on systemd integration, so runc users know what to set in their config.json.

@crosbymichael
Copy link
Member

The code is still spread out across many different files. Half of the implementation is in one file and the other half is in another. Can you please clean this up? You can create a new file for the notify socket implementation, have a single constructor that returns it, then have proxy method on it that handles the proxying of data.

The current support of systemd-notify has a race condition as the
message send to the systemd notify socket might be dropped if the sender
process is not running by the time systemd checks for the sender of the
datagram.  A proper fix of this in systemd would require changes to the
kernel to maintain the cgroup of the sender process when it is dead (but
it is not probably going to happen...)
Generally, the solution to this issue is to specify the PID in the
message itself so that systemd has not to guess the sender, but this
wouldn't work when running in a PID namespace as the container will pass
the PID known in its namespace (something like PID=1,2,3..) and systemd
running on the host is not able to map it to the runc service.

The proposed solution is to have a proxy in runc that forwards the
messages to the host systemd.

Example of this issue:

projectatomic/atomic-system-containers#24

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Accept only READY= notify messages from the container.

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
let runc run until READY= is received and then proceed with
detaching the process.

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
@giuseppe
Copy link
Member Author

@crosbymichael I've pushed a new version with all the new code is in notify_socket.go

@crosbymichael
Copy link
Member

crosbymichael commented Feb 22, 2017

@giuseppe nice, Thank you.

LGTM

Approved with PullApprove

@giuseppe
Copy link
Member Author

@mrunalp, does it look fine to you?

@mrunalp
Copy link
Contributor

mrunalp commented Feb 24, 2017

LGTM

Approved with PullApprove

@mrunalp mrunalp merged commit 899b074 into opencontainers:master Feb 24, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants