fix systemd-notify when using a different PID namespace #1308
ab91bd6
to
8dd4b6c
Compare
How does sd_notify work when you have Type=forking
ExecStart=/usr/local/sbin/runc run -d --pid-file /run/runc/%i.pid %i
PIDFile=/run/runc/%i.pid |
systemd doesn't wait for the notify message before it considers the service running when
Differently, if I use Type=notify:
This is with my patched version. With the current runc I hit the race condition almost every time, so systemd misses the message and systemctl start hangs and eventually times out. |
Ok, thanks. I cannot think of any other way to solve this problem other than systemd fixing their issues. |
an alternative is to mount |
@shishir-a412ed PTAL @giuseppe Have you talked to systemd maintainers about this issue? |
@rhatdan, yes, I've discussed this issue with @poettering |
Yeah, this is way too tightly integrating systemd into runC's container creation code. Surely there's a way to get systemd to fix their method of handling |
well, the advantage of fixing it here is that we won't need to change programs that are already using |
I have no clue on runc, but note that on the systemd side, due to kernel limitations we can't unconditionally and securely match all sd_notify() messages back to the services they come from, except if they are sent from the "main" process of a service. Specifically, if a process exits immediately after enqueuing the message, and it is not the "main" process of the service, then systemd will have trouble using the SCM_CREDENTIALS data (specifically: the PID field of it) the kernel attaches to the message for looking up the service in /proc/$PID, because that directory might already have been removed. This problem does not exist for the "main" process of a service, as systemd tracks that process explicitly, and always knows which service it belongs to. I don't know runc, but as I understand it, it wants to permit processes running inside a container to notify the init system about readiness. To make this reliable there are two options:
I personally would recommend option 1. If you do this, you could even pass interesting additional information to systemd, which would show this in "systemctl status", for example a STATUS= text, augmenting the information that the container sends. But anyway, I have no stake in this, I am just giving background on why we can't "fix our method of handling sd_notify()". And yeah, this is a well-known issue, and there have been at least two attempts to fix this in the kernel, but no success... |
(BTW, one more thing: for security reasons, you really want to go the proxy way btw. I figure your container payload is less trusted than your container manager, right? In that case you really shouldn't permit it to send MAINPID= with arbitrary PIDs via sd_notify(), because you can do evil shit with that. That means you should sanitize whatever the container wants to send, and only propagate safe stuff) |
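For readers less familiar with the mechanism discussed above: on the wire, a notification is a single datagram written to the unix socket that systemd names in $NOTIFY_SOCKET. The following is a minimal sketch of a sender in Go, not runc's code. The function name sdNotify and its parameters are hypothetical, and it assumes the caller passes in the socket path itself.

```go
package main

import (
	"errors"
	"net"
)

// sdNotify (hypothetical helper) sends state, for example "READY=1", as one
// datagram to the unix datagram socket at socketPath. In a real service the
// path would come from the NOTIFY_SOCKET environment variable.
func sdNotify(socketPath, state string) error {
	if socketPath == "" {
		return errors.New("notify socket path is empty")
	}
	addr := &net.UnixAddr{Name: socketPath, Net: "unixgram"}
	conn, err := net.DialUnix("unixgram", nil, addr)
	if err != nil {
		return err
	}
	defer conn.Close()

	// The kernel attaches the sender's credentials to the datagram. If the
	// sender exits right after this write, the receiver may no longer be
	// able to look the PID up, which is the race described in this thread.
	_, err = conn.Write([]byte(state))
	return err
}
```

A caller would typically use it as sdNotify(os.Getenv("NOTIFY_SOCKET"), "READY=1") and treat an error as non-fatal when the process is not running under systemd.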
The best way to handle this is some type of proxy. We don't want to be giving a container an external pid so the question is, should this be handled by runc or should you have some type of shim process that launches runc -d and does the proxy? |
8dd4b6c
to
5d1aa7e
Compare
I've made some changes so that the new code gets less in the way of the existing signal handling. Also, I added a new patch to address the issues @poettering reported, now I send back to the host only @crosbymichael IMHO, there is not really enough new code/logic to deserve a shim process. These patches simply fix the existing implementation of the sdnotify integration which is already in runc. |
I tested a bit with some system containers and the functionality looks good. |
@giuseppe it may be "a little code" to fix it but it changes a lot of things. If you want to use sd_notify you have to keep runc running at all times during the container's execution and you cannot use detach. I was just asking because runc has a larger memory footprint and I didn't know if you are fine with 100s of runcs running all the time for this functionality, or if a shim is a better place for this, where it could be written in C to reduce memory usage. Also, maybe a container runtime is not the right place for an sd_notify proxy. |
No chance of having runc exit after it delivers the sd_notify? |
we could exit after we receive Please keep in mind though that this functionality is affecting only containers that are using Type=notify (where systemd sets the My suggestion is:
Do you agree with this? |
I had a look at this, and it seems that it will never work like I was suggesting in my previous comment. We should still keep suggesting to use |
You can change the main PID of a service at runtime by sending the MAINPID= message, after first making sure the kid is properly reparented to PID 1. |
29b8ab7
to
00dffbc
Compare
pushed a patch to support detach also when using Type=notify and exit runc once we receive |
@giuseppe nice! i'll test today |
@crosbymichael have you had a chance to enjoy the notifications to systemd from a runc container? :) |
signals.go
Outdated
	var out bytes.Buffer
	for _, line := range bytes.Split(buf[0:r], []byte{'\n'}) {
		if bytes.HasPrefix(line, []byte("READY=")) {
			out.Write(line)
We should probably be checking errors in this entire func
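One way to read this request is sketched below: the filtering step pulled into its own function that returns an error instead of discarding it. This is an illustration only, not the code that was merged; the name filterReady is hypothetical.

```go
package main

import (
	"bytes"
	"fmt"
)

// filterReady (hypothetical helper) keeps only the READY= lines of a raw
// notify datagram and reports any failure to the caller.
func filterReady(msg []byte) ([]byte, error) {
	var out bytes.Buffer
	for _, line := range bytes.Split(msg, []byte{'\n'}) {
		if !bytes.HasPrefix(line, []byte("READY=")) {
			// Drop everything else, MAINPID= and STATUS= included.
			continue
		}
		if _, err := out.Write(line); err != nil {
			return nil, fmt.Errorf("buffering notify line: %w", err)
		}
		if err := out.WriteByte('\n'); err != nil {
			return nil, fmt.Errorf("buffering line terminator: %w", err)
		}
	}
	return out.Bytes(), nil
}
```

Given the input "MAINPID=1\nREADY=1\nSTATUS=up", this returns "READY=1\n". The same pattern would apply to the socket reads and writes around it.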
signals.go
Outdated
@@ -30,7 +35,9 @@ func newSignalHandler(enableSubreaper bool) *signalHandler {
 	// handle all signals for the process.
 	signal.Notify(s)
 	return &signalHandler{
-		signals: s,
+		signals:      s,
+		notifySocket: notifySocket,
Can you create a struct for this? We have the two fields and are passing them around everywhere. A struct will clean up the code a lot, and you can add methods for the socket there.
Overall looks good. I think if you add a struct with methods for the notify socket it will clean up the code a lot, as it's currently spread everywhere.
(btw, in case you are looking for more integration of runc and systemd: it probably makes sense to set $container properly for your containers, so that systemd can properly recognize it, see the table here: |
@poettering I think since runc is meant to be used by a higher level system, the caller can inject the |
00dffbc
to
ef6b29c
Compare
@crosbymichael thanks for the review. I've implemented the requested changes in this new revision ⬆️ |
@poettering Going by what's already in the code, I guess systemd can probably recognize runc containers when VIRTUALIZATION_RUNC is set? |
@mrunalp hmm? what do you mean? on the systemd side we won't recognize arbitrary container managers, and there's no specific entry for "runc" defined either. I was proposing adding that. |
@poettering I meant a +1 to your proposal to add code to systemd to recognize runc. |
The code is still spread out across many different files. Half of the implementation is in one file and the other half is in another. Can you please clean this up? You can create a new file for the notify socket implementation, have a single constructor that returns it, then have proxy method on it that handles the proxying of data. |
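The shape requested here (one type, one constructor, one proxy method) could look roughly like the sketch below. It is an illustration under assumptions, not the code that was merged: the names notifySocket, newNotifySocket, proxyUntilReady and demo are hypothetical, the socket paths are temporary files, and Linux is assumed. The demo function plays both the container and the host so the sketch can be exercised without systemd.

```go
package main

import (
	"bytes"
	"fmt"
	"net"
	"os"
	"path/filepath"
)

// notifySocket (hypothetical) pairs the host's notify socket with the
// socket that is exposed to the container.
type notifySocket struct {
	hostAddr *net.UnixAddr // systemd's socket on the host
	conn     *net.UnixConn // socket the container writes to
}

// newNotifySocket is the single constructor: it binds containerPath and
// remembers where the host socket lives.
func newNotifySocket(hostPath, containerPath string) (*notifySocket, error) {
	addr := &net.UnixAddr{Name: containerPath, Net: "unixgram"}
	conn, err := net.ListenUnixgram("unixgram", addr)
	if err != nil {
		return nil, fmt.Errorf("listening on %s: %w", containerPath, err)
	}
	host := &net.UnixAddr{Name: hostPath, Net: "unixgram"}
	return &notifySocket{hostAddr: host, conn: conn}, nil
}

func (n *notifySocket) Close() error { return n.conn.Close() }

// proxyUntilReady forwards the first READY= line sent by the container to
// the host socket, drops everything else, and then returns.
func (n *notifySocket) proxyUntilReady() error {
	host, err := net.DialUnix("unixgram", nil, n.hostAddr)
	if err != nil {
		return fmt.Errorf("connecting to host notify socket: %w", err)
	}
	defer host.Close()

	buf := make([]byte, 4096)
	for {
		r, err := n.conn.Read(buf)
		if err != nil {
			return fmt.Errorf("reading container notification: %w", err)
		}
		for _, line := range bytes.Split(buf[:r], []byte{'\n'}) {
			if !bytes.HasPrefix(line, []byte("READY=")) {
				continue
			}
			// Copy the line so appending does not write into buf.
			msg := append(append([]byte{}, line...), '\n')
			if _, err := host.Write(msg); err != nil {
				return fmt.Errorf("forwarding to host: %w", err)
			}
			return nil
		}
	}
}

// demo wires the proxy between two temporary sockets and returns what the
// stand-in "host" socket received.
func demo() (string, error) {
	dir, err := os.MkdirTemp("", "notify")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)

	hostPath := filepath.Join(dir, "host.sock")
	containerPath := filepath.Join(dir, "container.sock")
	host, err := net.ListenUnixgram("unixgram", &net.UnixAddr{Name: hostPath, Net: "unixgram"})
	if err != nil {
		return "", err
	}
	defer host.Close()

	n, err := newNotifySocket(hostPath, containerPath)
	if err != nil {
		return "", err
	}
	defer n.Close()

	// Play the container: send an untrusted datagram with an extra field.
	c, err := net.DialUnix("unixgram", nil, &net.UnixAddr{Name: containerPath, Net: "unixgram"})
	if err != nil {
		return "", err
	}
	defer c.Close()
	if _, err := c.Write([]byte("MAINPID=1\nREADY=1")); err != nil {
		return "", err
	}
	if err := n.proxyUntilReady(); err != nil {
		return "", err
	}
	buf := make([]byte, 4096)
	r, err := host.Read(buf)
	if err != nil {
		return "", err
	}
	return string(buf[:r]), nil
}
```

Keeping the filter inside the proxy method means the MAINPID= field sent by the container never reaches the host, which matches the sanitizing advice given earlier in the thread. Returning after the first READY= is one possible way to let the caller detach afterwards.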
ef6b29c
to
7d6060a
Compare
The current support of systemd-notify has a race condition, as the message sent to the systemd notify socket might be dropped if the sender process is no longer running by the time systemd checks for the sender of the datagram. A proper fix of this in systemd would require changes to the kernel to maintain the cgroup of the sender process after it is dead (but that is probably not going to happen...)
Generally, the solution to this issue is to specify the PID in the message itself so that systemd does not have to guess the sender, but this wouldn't work when running in a PID namespace, as the container will pass the PID known in its namespace (something like PID=1,2,3..) and systemd running on the host is not able to map it to the runc service.
The proposed solution is to have a proxy in runc that forwards the messages to the host systemd.
Example of this issue: projectatomic/atomic-system-containers#24
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Accept only READY= notify messages from the container. Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Let runc run until READY= is received and then proceed with detaching the process. Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
7d6060a
to
d5026f0
Compare
@crosbymichael I've pushed a new version where all the new code is in |
@giuseppe nice, Thank you. LGTM |
@mrunalp, does it look fine to you? |