cloud_controller_ng fails to start successfully, even though bosh thinks it has #125

Comments
We have created an issue in Pivotal Tracker to manage this: https://www.pivotaltracker.com/story/show/163859284 The labels on this github issue will be updated when the story is started.
Logs attached for a newly created instance (note these are for different times than the log snippets shown in the original report): cloud_controller_ng.log

Let me know if additional logs would be helpful.
I continue to think the biggest issue here is that bosh believed the job had started successfully when it had not (fixing the monit check alone wouldn't have addressed that). Looking at the job templates: although a post-start script is defined (https://github.com/cloudfoundry/capi-release/blob/develop/jobs/cloud_controller_ng/templates/post-start.sh.erb), it does not wait for the Cloud Controller to actually come up.
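For illustration, a post-start that actually waits could look roughly like this (a sketch only, not the contents of the PR; the endpoint and timeout are placeholders):

#!/bin/bash
# Poll a local Cloud Controller endpoint until it answers, or give up after a
# deadline so bosh fails the deploy instead of moving on to the next canary.
# 127.0.0.1:9022/healthz is a placeholder for whatever CC exposes locally.
for _ in $(seq 1 30); do
  if curl -sf http://127.0.0.1:9022/healthz > /dev/null; then
    exit 0
  fi
  sleep 2
done
echo "cloud_controller_ng did not become healthy in time" >&2
exit 1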
Hi @aeijdenberg, this is the first I've ever heard of a Cloud Controller failing to deploy in this way -- especially the fact that it reported healthy at one point. Can you clarify what happened here?
Are you saying that as the Diego cells deployed and apps were evacuated on to the new cells they couldn't start because the APIs were down? Under normal operating circumstances control plane downtime shouldn't cause app downtime, but during an upgrade I think what I described above is possible and I want to confirm that's what you observed.
My understanding, though, is that BOSH's idea of process healthiness should be the same as Monit's - Monit is simply invoking the job's start program and then watching the process. If you're able to get them, we'd appreciate some of the other Cloud Controller logs as well. Thanks!
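One quick way to compare the two views (bosh flags as used elsewhere in this thread; the monit path is the standard BOSH location):

# What the director thinks each process is doing:
bosh -d cf instances --ps
# What monit on the VM itself thinks:
bosh -d cf ssh api/0 -c 'sudo /var/vcap/bosh/bin/monit summary'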
I've just made a simple PR that adds a health check to the post-start script.

@tcdowney - yes, this was an upgrade from v7.2.0 to v7.3.0 of cf-deployment. And yes, as you said, the control plane outage did not immediately cause damage to the rest of the system; however, as the diego-cells continued updating and evacuating apps, those apps could not come back while the API was down. Once we figured out how to jerry-rig the monit config, things recovered.

Happy to post those additional logs now.
Note also that the bosh docs hint that so long as the start command stays alive for long enough, the job is treated as started (from https://bosh.io/docs/job-lifecycle/). The BPM docs I linked above then seem to say we need a post-start script for any deeper readiness check.
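For context, the bare bpm-style monit stanza (roughly what the BPM docs describe; the pidfile path here is illustrative) only watches the pid - so any real readiness signal has to come from either a connection test in the job's monit file or from post-start:

check process cloud_controller_ng
  with pidfile /var/vcap/sys/run/bpm/cloud_controller_ng/cloud_controller_ng.pid
  start program "/var/vcap/jobs/bpm/bin/bpm start cloud_controller_ng"
  stop program "/var/vcap/jobs/bpm/bin/bpm stop cloud_controller_ng"
  group vcap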
post-start.stderr.log
Hmm, didn't see anything in the logs that would explain why it took so long to start listening on that socket... talked it through a bit with some of the folks on BPM and they agree that doing some health-checking in the post-start script is reasonable. We'll try to repro using the incorrect socket approach you mentioned in #126 and review that PR!
I came up with this hacky command to print out how long it takes between monit requesting the start and the socket being created:

bosh ssh -r -d cf api -c "echo \"cloud_controller_ng took \$((\$(stat /var/vcap/data/cloud_controller_ng/cloud_controller.sock --printf %Y)-\$(date --date=\"\$(sudo grep \"start service 'cloud_controller_ng' on user request\" /var/vcap/monit/monit.log | tail -n 1 | cut -c 2-20)\" +%s))) seconds to become healthy\""

Our prod environment (still running cf-deployment v7.2.0) shows:

Our users' staging environment (cf-deployment v7.3.0):

(I think that outlier is due to me disabling the monit health-check partway through to create the logs above.)

Our dev environment (cf-deployment v7.4.0):

In any case, it seems that 30-40 seconds is not unusual in our environments.
I don't think that check behaves the way it reads (five failures, 60 seconds apart, before a restart). From our experimentation, if you want the health check to be retried within a time period until the test passes, you need to express the retry differently in the monit config. We made the check fail deliberately to see what monit actually does, and it ended up restarting on every cycle. I'm also not sure the changes to the post-start script address that part.
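For anyone wanting to poke at the same behaviour on an api VM, one rough way (as root; the monit file path is the one quoted in the original report, and what exactly you break in the check is up to you):

# Break the connection test on purpose (e.g. point it at a port nothing is
# listening on), reload monit, and watch how quickly restarts get triggered.
vi /var/vcap/jobs/cloud_controller_ng/monit
monit reload
tail -f /var/vcap/monit/monit.log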
@ericpromislow - glad that you've found an improved health-checking configuration for monit. :) I disagree that this removes the need for the changes to post-start, though. The post-start check is what makes bosh notice a failed start and stop the canary; that is separate from the other changes that are necessary (such as adjustments to the monit health checking) to fix this particular reason why it failed - i.e. both are needed, as they address different issues. You're right that your fix would likely make the deployment succeed, and that mine specifically wouldn't (instead it would make it fail before nuking the whole deployment), as they address different layers of the swiss cheese model.
During some more experimentation I note that when I run:

monit stop cloud_controller_ng
... wait a bit for it to stop ...
monit start cloud_controller_ng

then the script above shows a startup time of 10 seconds before becoming healthy. If instead I run:

monit stop all
... wait a bit for all to stop ...
monit start all

then it takes 25-27 seconds before becoming healthy.
The post-start change seems like the right direction in terms of process orchestration. Confirming that the pid is up is not sufficient to move on with the deployment; we should confirm that it passes a basic healthcheck as well. I think there's better script tooling in monit_utils.sh to handle that than the raw curl in @aeijdenberg's PR, but that's really an implementation detail. The behavior where diego loses apps occurs if the cells are evacuating and CC is down. @goonzoid brought this up in slack at some point, but I'm struggling to recall details of how that goes down.

Poking further at @ericpromislow's experiment, though, the endless-restart situation does seem pretty troubling. Reading monit docs, I think the check is more easily understood as: run the connection test once per monit cycle, count an attempt as failed if it doesn't succeed within the timeout, and restart after 5 consecutive failed attempts.

The crazy part that leads to infinite restarts is that, after 5 attempts and a restart, monit doesn't seem to give it another 5 attempts, and instead triggers a restart for each successive failed GET request. The man page for 5.2.5 says it should reset the counter on restart.

This is a pretty awkward pair of bugs; thanks for all your help tracking them down, @aeijdenberg. I wish we had seen either of them in isolation before you had to go seeing both of them in tandem.
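As a sketch of that reading (illustrative only - not the capi-release stanza; the port, request path, and the cycle qualifier are assumptions, and I'm not certain monit 5.2.5 accepts this exact form):

# restart only after the HTTP check has failed on 5 consecutive cycles,
# with each connection attempt allowed up to 60 seconds
if failed host 127.0.0.1 port 9022 protocol http
   request "/healthz"
   with timeout 60 seconds
   for 5 cycles
then restart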
We changed our AWS instance types to larger ones. Clearly that's not an actual fix to the root problems, but I think it's enough to stabilise that environment for us.
And I think we figured out why we didn't hit this issue in our development environment: that env defaults to really small instance types (...).
We confirmed today in isolation that the restart loop seems to happen without any of the bpm or cc bits, but on monit 5.2.5 there's no other obvious way to get that network healthcheck failing over to restart behavior. Honestly I'm still a little confounded by the startup times you're seeing, so if you could dig there more that'd be interesting!
I did a bit more poking around on one of our dev API servers today.

$ bosh ssh -d cf api/0
$ sudo su vcap
$ source /var/vcap/jobs/cloud_controller_ng/bin/ruby_version.sh
$ cd /var/vcap/packages/cloud_controller_ng/cloud_controller_ng
$ time /var/vcap/packages/cloud_controller_ng/cloud_controller_ng/bin/cloud_controller -c bogusconffile
ERROR: Failed loading config from file 'bogusconffile': No such file or directory @ rb_sysopen - bogusconffile

real 0m8.977s
user 0m7.056s
sys 0m0.969s

Looks like starting cloud_controller takes around 9 seconds even before it manages to read a config file. When I run it again under strace:

$ strace -o $HOME/mystrace -tt -T /var/vcap/packages/cloud_controller_ng/cloud_controller_ng/bin/cloud_controller -c bogusconffile

and take a look at the output:

$ cat $HOME/mystrace | grep "open(" | wc -l
230234
$ cat $HOME/mystrace | grep "open(" | grep "No such file or directory" | wc -l
223695

I note that during those 9 seconds, the Ruby process attempts to open 230K files - of which 223K do not exist! I'm not sure if this is normal or not for Ruby, but my guess is that a search path is perhaps borked, and as such perhaps it's looking for files in a very inefficient manner? I'll keep digging, but thought this worthy of a note.
Here's a more concise command to demonstrate the potential issue:

bosh ssh -d cf api/0 'cd /var/vcap/packages/cloud_controller_ng/cloud_controller_ng && PATH=/var/vcap/packages/ruby-2.4-r5/bin:$PATH strace -e trace=open -c /var/vcap/packages/cloud_controller_ng/cloud_controller_ng/bin/cloud_controller -c bogusconffile'

Gives output:

I realise it only claims 0.4 seconds of total system time, but this still feels very suspicious of a lot of unnecessary churn.
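If the churn is coming from require walking the load path, the length of $LOAD_PATH is the number to look at - each require probes those directories in order until it finds a match, which is where the failed open() calls come from. A rough way to see it on the VM (the -rbundler/setup flag is an assumption about how the app sets up its gems):

$ cd /var/vcap/packages/cloud_controller_ng/cloud_controller_ng
$ # one lib directory per gem ends up on the load path via bundler/setup
$ /var/vcap/packages/ruby-2.4-r5/bin/ruby -rbundler/setup -e 'puts $LOAD_PATH.size'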
@aeijdenberg we had the same problem on our staging environment. After upgrading our api VM from single-core to dual-core CPU the problem was resolved.
Yeah, our "realistic" CI environment takes ~15 seconds to start up from a monit perspective, double-checked today. Ruby is not the quickest language to start up, and perhaps our Ruby load process is globbing a bit aggressively, but it does sound like your environment was under-provisioned.
@cwlbraa - couldn't disagree more. :) Objectively we've had no issues with performance on our API servers and they were consistently sitting at less than 30% CPU utilisation, thus they were not under-provisioned. The only "performance" issue we've had is that this particular job has a misconfigured health check which fails to account for a slow but successful startup time. i.e. we should be sizing these VMs to optimise capacity to handle peak loads of actual traffic - it makes no sense to optimise simply for a fast startup time due to a poorly designed health check.

And I am truly blown away by the inefficiency of how Ruby handles file dependencies - but given that limitation, and that it seems to get worse in roughly O(n^2) as more gems are added to the project, we can't keep bumping VM sizes (that we pay for by the hour) forever. Perhaps this health check should simply be removed from the monit config.
You're not the first person to suggest this IRT this issue, and I've been on board with removing it from the moment we confirmed that monit doesn't restart 5 times in a row! :-) We even went and searched for other ways to implement that behaviour and came up empty-handed.

The main thing holding us back here is the implications this will have on CCs shipped and deployed in environments with less experienced or attentive operators. A lot of our monit health checking was added with the intent of allowing the CC to self-heal to some degree, unattended, and that's the intent of the network health check as well. The obvious kicker here is that, in this scenario, "self-healing" ends up being a sort of "self-malpractice" situation. We don't have a good way to find out if that health check is actually healing environments successfully in the wild, and discovering that it was after removing it would be pretty painful. Truthfully though, I suspect that this health check is doing more harm than good.
Why the "delivered" status? Only 1 of the 4 issues originally identified has been resolved - and now a deployment will fail fast instead of bricking a CF, but it will still fail.
@aeijdenberg I think this was delivered because the story created by the PR for the first issue (PR #126) has made it through our CI and is out for acceptance, and it's tied to this issue. It's difficult on our end to track multiple problems with a single Tracker story, so we're going to make separate stories in our backlog for each of these and we'll link them here. @cwlbraa is going to log one to cover the monit health check issue. As far as I can tell that's the next most important one here, but we'll need to do more research on that one.

I'll log the other two, but just to be clear the third issue you found is the Ruby process trying to open a large number of files on startup, which we think is responsible for the slow startup times. What's the fourth one?
Also, the issue as titled is solved by the PR. Here's the monit issue in our tracker: https://www.pivotaltracker.com/story/show/164263056
@cwlbraa - fair call. The four parts that I reported are:

1. bosh believed the process had started correctly when it had not. This is now fixed.

2. The monit health check restarts after a single failure instead of the 5 failures it reads as requiring. This is now understood, however it isn't yet mitigated: once in a failing state, a single failed cycle will attempt a restart, and given cycles are every 10 seconds, which is roughly how long it takes to start, there's a very good chance this will occur. But let's call that tracked in: https://www.pivotaltracker.com/n/projects/966314/stories/164263056

3. The circular dependency between cloud_controller_ng and nginx_cc for health checking. This does still feel wrong to me. I think it's OK for the post-start check, which is essentially an integration test of all of the components that it started, but it still feels flaky for the ongoing health check on cloud_controller_ng itself.

4. The ~26 seconds it takes to write out a PID and open a listening socket. So long as the healthcheck that runs every 10 seconds is a thing, this is still a problem. Bumping machine specs just for the purpose of an already flaky health check during startup is a workaround, not a proper fix.
We tried to reproduce bug #4 by taking the following steps:

We then exited the ...
* ancient bosh monit will continuously restart us at 5+1 monit cycles whenever CC might be having a hard time getting up within 10 seconds, so this produces more predictable behavior where it only restarts every 5 failed curls. it also doesn't interfere with initial startup. [fixes #125] [fixes #164263056]

Co-authored-by: Connor Braa <cbraa@pivotal.io>
Co-authored-by: Tim Downey <tdowney@pivotal.io>
Issue
We updated our CF staging environment to version v7.3.0 today (CAPI 176.0) and while the bosh deployment claimed to have succeeded, when we ran bosh is -d cf afterwards it showed our api instances in the starting state, and much breakage and alerts across our system as both instances of our cloud_controller_ng were hard down, which, as the diego-cells were also updating, eventually caused all apps on the system to go hard down too.

Upon investigation we noted that /var/vcap/jobs/cloud_controller_ng/monit contained:

Monit logs showed the following output:

Note that only 13 seconds after the start request, monit has decided that cloud_controller_ng has failed the protocol test and attempts to restart it.

When we commented out the if failed host healthcheck above and tried again, we noted the following:

Note that it takes about 26 seconds between the process starting, and the socket being open such that the healthcheck may be able to pass.
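For anyone reproducing this, a rough way to time how long the socket takes to appear after a monit start (run as root on an api instance; the socket path is the one used by the timing command later in this thread, adjust if your data dir differs):

sock=/var/vcap/data/cloud_controller_ng/cloud_controller.sock
monit restart cloud_controller_ng
start=$(date +%s)
# Wait until the socket exists and has been (re)created since the restart.
while [ ! -S "$sock" ] || [ "$(stat --printf %Y "$sock")" -lt "$start" ]; do
  sleep 1
done
echo "socket (re)created after $(( $(date +%s) - start )) seconds"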
Steps to Reproduce
Unclear - this did not affect our smaller environment, but did affect our staging (approx 300 apps).
Expected result
The bosh canary process should have stopped this propagating to the 2nd api instance - I don't understand why bosh believed that the process started correctly, when it did not?

Without understanding the monit health check syntax in depth - it reads as though it should require 5 failures, 60 seconds apart, before attempting a restart. Instead a single failure 13 seconds after startup is triggering a restart.

It seems dangerous that cloud_controller_ng is depending on nginx_cc, which in turn depends on cloud_controller_ng, for a health-check. Having a circular dependency feels like it's asking for trouble.

Taking 26 seconds to write out a PID and open a listening socket feels like a long time? Unclear why this is slow in this particular environment.
Current result
Our users' staging environment went hard down.

We think we have made a partial recovery by editing the monit files as above, and continue to monitor the situation.

We're likely offline until tomorrow Sydney time - wanted to get this report in now, even if incomplete, due to the potential seriousness of the issue, in case it affects others, and in case it's related to the latest CF 7.3.0 / CAPI 176.0 releases (I can't see an obvious culprit?).