
cloud_controller_ng fails to start successfully, even though bosh thinks it has #125

Closed
aeijdenberg opened this issue Feb 11, 2019 · 27 comments


@aeijdenberg
Contributor

Issue

We updated our CF staging environment to cf-deployment v7.3.0 today (CAPI 1.76.0). While the bosh deployment claimed to have succeeded, running bosh is -d cf afterwards showed our api instances in the starting state, and we saw breakage and alerts across our system: both instances of cloud_controller_ng were hard down, and because the diego-cells were also updating, this eventually caused all apps on the system to go hard down too.

Upon investigation we noted that /var/vcap/jobs/cloud_controller_ng/monit contained:

check process cloud_controller_ng
  with pidfile /var/vcap/sys/run/bpm/cloud_controller_ng/cloud_controller_ng.pid
  start program "/var/vcap/jobs/bpm/bin/bpm start cloud_controller_ng"
  stop program "/var/vcap/jobs/bpm/bin/bpm stop cloud_controller_ng"
  group vcap
  
  ...

  if failed host 10.24.20.9 port 9022 protocol http
      and request '/v2/info'
      with timeout 60 seconds for 5 cycles
  then restart

Monit logs showed the following output:

[UTC Feb 11 08:53:41] info     : start service 'cloud_controller_ng' on user request
[UTC Feb 11 08:53:41] info     : monit daemon at 11162 awakened
[UTC Feb 11 08:53:41] info     : Awakened by User defined signal 1
[UTC Feb 11 08:53:41] info     : 'cloud_controller_ng' start: /var/vcap/jobs/bpm/bin/bpm
[UTC Feb 11 08:53:42] info     : 'nginx_cc' start: /var/vcap/jobs/bpm/bin/bpm

[UTC Feb 11 08:53:43] nginx.pid is observed to be created, no pid for cloud_controller_ng is observed

[UTC Feb 11 08:53:44] info     : 'cloud_controller_ng' start action done
[UTC Feb 11 08:53:54] error    : HTTP error: Server returned status 502
[UTC Feb 11 08:53:54] error    : 'cloud_controller_ng' failed protocol test [HTTP] at INET[10.24.20.9:9022/v2/info] via TCP
[UTC Feb 11 08:53:54] info     : 'cloud_controller_ng' trying to restart
[UTC Feb 11 08:53:54] info     : 'nginx_cc' stop: /var/vcap/jobs/bpm/bin/bpm
[UTC Feb 11 08:53:55] info     : 'cloud_controller_ng' stop: /var/vcap/jobs/bpm/bin/bpm
... cycle repeats indefinitely ...

Note that only 13 seconds after the start request, monit has decided that cloud_controller_ng failed the protocol test and is attempting to restart it.

When we commented out the if failed host healthcheck above, and tried again, we noted the following:

[UTC Feb 11 08:57:14] info     : start service 'cloud_controller_ng' on user request
[UTC Feb 11 08:57:14] info     : monit daemon at 11162 awakened
[UTC Feb 11 08:57:14] info     : Awakened by User defined signal 1
[UTC Feb 11 08:57:14] info     : 'cloud_controller_ng' start: /var/vcap/jobs/bpm/bin/bpm
[UTC Feb 11 08:57:15] info     : 'nginx_cc' start: /var/vcap/jobs/bpm/bin/bpm
[UTC Feb 11 08:57:16] info     : 'cloud_controller_ng' start action done

[UTC Feb 11 08:57:16] nginx.pid is observed to be created
[UTC Feb 11 08:57:38] cloud_controller_ng.pid is observed to be created
[UTC Feb 11 08:57:40] cloud_controller.sock is observed to be created

Note that it takes about 26 seconds between the process starting and the socket being open, at which point the healthcheck is able to pass.

Steps to Reproduce

Unclear - this did not affect our smaller environment, but did affect our staging (approx 300 apps).

Expected result

  1. The bosh canary process should have stopped this propagating to the 2nd api instance - I don't understand why bosh believed that the process started correctly, when it did not?

  2. Without understanding the monit health check syntax, it reads as though it should require 5 failures, 60 seconds apart, before attempting a restart. Instead, a single failure 13 seconds after startup is triggering a restart.

  3. It seems dangerous that cloud_controller_ng is depending on nginx_cc which in turn depends on cloud_controller_ng for a health-check. Having a circular dependency feels like it's asking for trouble.

  4. Taking 26 seconds to write out a PID and open a listening socket feels like a long time? Unclear why this is slow in this particular environment.

Current result

Our users' staging environment went hard down.

We think we have made a partial recovery by editing the monit files as above, and we continue to monitor the situation.

We're likely offline until tomorrow Sydney time, but wanted to get this report in now, even if incomplete, due to the potential seriousness of the issue, in case it affects others, and in case it's related to the latest cf-deployment v7.3.0 / CAPI 1.76.0 releases (I can't see an obvious culprit?).

@cf-gitbot

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/163859284

The labels on this github issue will be updated when the story is started.

@aeijdenberg changed the title from "cloud_controller_ng fails to start successfully, even though thinks it has" to "cloud_controller_ng fails to start successfully, even though bosh thinks it has" on Feb 11, 2019
@aeijdenberg
Contributor Author

Logs attached for a new instance created via bosh recreate (which claims to succeed, even though cloud_controller_ng keeps getting killed before it has time to start):

(note these are for different times than the log snippets shown in the original report)

$ bosh recreate -d cf api/e4e79851-959a-4400-b1e9-e9d08731085a
Using environment 'director.bosh.cld.internal' as client 'admin'

Using deployment 'cf'

Continue? [yN]: y

Task 493247

Task 493247 | 20:48:28 | Preparing deployment: Preparing deployment (00:00:08)
Task 493247 | 20:48:36 | Preparing deployment: Rendering templates (00:00:51)
Task 493247 | 20:49:31 | Preparing package compilation: Finding packages to compile (00:00:02)
Task 493247 | 20:49:35 | Updating instance api: api/e4e79851-959a-4400-b1e9-e9d08731085a (0) (canary) (00:07:15)

Task 493247 Started  Mon Feb 11 20:48:28 UTC 2019
Task 493247 Finished Mon Feb 11 20:56:50 UTC 2019
Task 493247 Duration 00:08:22
Task 493247 done

Succeeded

cloud_controller_ng.log
monit.log
nginx-error.log

Let me know if additional logs would be helpful.

@aeijdenberg
Contributor Author

I continue to think the biggest issue here is that bosh thought that cloud_controller_ng was up and running, but the monit scripts felt otherwise.

(That wouldn't have stopped cloud_controller_ng from failing to start in time, but it would have left us with a failed deploy and a still-running system, instead of hard downtime for all users.)

Looking at the bpm docs, they state that:

Your process is started and has an unlimited amount of time to start up. You should use a post-start script and a health check if you want your job to only say it has completed deploying after it has started up.

Although a post-start script is defined: https://github.com/cloudfoundry/capi-release/blob/develop/jobs/cloud_controller_ng/templates/post-start.sh.erb

It does not wait for cloud_controller_ng to report health via the same check that monit is using.
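For illustration, a minimal post-start sketch along those lines might look like the following (hypothetical: the port and path come from the monit stanza above, while the host variable, retry budget, and script shape are assumptions, not the contents of any actual PR):

#!/bin/bash
# Hypothetical post-start sketch: poll the same endpoint monit checks
# (port 9022, /v2/info) and fail the deploy if it never becomes healthy.
set -u

HOST="${CC_HEALTHCHECK_HOST:-127.0.0.1}"  # assumption; monit checks the VM's own IP
PORT=9022
RETRIES=30         # assumption: roughly 5 minutes at 10-second intervals
SLEEP_SECONDS=10

for ((i = 1; i <= RETRIES; i++)); do
  if curl --fail --silent --max-time 5 "http://${HOST}:${PORT}/v2/info" > /dev/null; then
    echo "cloud_controller_ng is responding on /v2/info"
    exit 0
  fi
  echo "attempt ${i}/${RETRIES}: cloud_controller_ng not healthy yet, sleeping ${SLEEP_SECONDS}s"
  sleep "${SLEEP_SECONDS}"
done

echo "cloud_controller_ng never became healthy; failing post-start so the deploy stops here" >&2
exit 1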

@tcdowney
Member

Hi @aeijdenberg, this is the first I've ever heard of a Cloud Controller failing to deploy in this way -- especially the fact that it reported healthy at one point.

Can you clarify what happened here?

which as the diego-cells were also updating, eventually caused all apps on the system to go hard down too

Are you saying that as the Diego cells deployed and apps were evacuated on to the new cells they couldn't start because the APIs were down? Under normal operating circumstances control plane downtime shouldn't cause app downtime, but during an upgrade I think what I described above is possible and I want to confirm that's what you observed.

I continue to think the biggest issue here is that bosh thought that cloud_controller_ng was up and running, but the monit scripts felt otherwise.

My understanding though is that BOSH's idea of process healthiness should be the same as Monit's. Monit is simply invoking bpm to run the Cloud Controller jobs so it should be contained within it. You're probably on to something though with that bpm doc you linked. Maybe bpm exiting successfully was enough to cause Monit to consider it successfully running for a brief moment -- we will follow up with some of the BOSH and bpm folks to understand this interaction better.

If you're able to get them, we'd appreciate some of the other Cloud Controller logs as well in /var/vcap/sys/log/cloud_controller_ng/, particularly the pre-start.* and post-start.* logs. I'm wondering if there is anything interesting in those regarding the slow startup.

Thanks!

@aeijdenberg
Contributor Author

I've just made a simple PR that adds a check to the post-start to ensure it fails if the same check that monit uses fails: #126

@tcdowney - yes. v7.2.0 -> v7.3.0 of cf-deployment included a stemcell update as well as the CAPI bump from 1.74.0 to 1.76.0. As such all instances were replaced - and as far as bosh was concerned, it was a success!

Task 491978 | 04:44:28 | Updating instance adapter: adapter/f24f3d05-f888-4cc0-aa34-a6aec220db7e (0) (canary)
Task 491978 | 04:44:28 | Updating instance nats: nats/1f5c32f4-56c7-4e44-ab0a-076e0b1ba686 (0) (canary) (00:03:41)
Task 491978 | 04:48:09 | Updating instance nats: nats/c726e52d-c4db-4029-b7f3-240e65253080 (1)
Task 491978 | 04:48:29 | Updating instance adapter: adapter/f24f3d05-f888-4cc0-aa34-a6aec220db7e (0) (canary) (00:04:01)
Task 491978 | 04:48:29 | Updating instance adapter: adapter/c2213620-0cb6-43bc-a5e1-eaa52f2e32e7 (1)
Task 491978 | 04:51:21 | Updating instance nats: nats/c726e52d-c4db-4029-b7f3-240e65253080 (1) (00:03:12)
Task 491978 | 04:52:10 | Updating instance adapter: adapter/c2213620-0cb6-43bc-a5e1-eaa52f2e32e7 (1) (00:03:41)
Task 491978 | 04:52:10 | Updating instance diego-api: diego-api/c3c93431-988f-4242-8d99-c14821e17c38 (0) (canary) (00:04:45)
Task 491978 | 04:56:55 | Updating instance diego-api: diego-api/75ef3c96-4c7e-48fd-ae89-6479d3590e97 (1) (00:02:57)
Task 491978 | 04:59:52 | Updating instance tls-credhub: tls-credhub/f35d472b-042e-4aae-b99a-0ef4e03afe3a (0) (canary)
Task 491978 | 04:59:53 | Updating instance uaa: uaa/2a346cc8-5751-43d3-8f4f-9913985159ef (0) (canary)
Task 491978 | 04:59:53 | Updating instance cc-worker: cc-worker/01744c0b-2b56-46a5-974b-d72294c219e2 (0) (canary)
Task 491978 | 04:59:53 | Updating instance api: api/e4e79851-959a-4400-b1e9-e9d08731085a (0) (canary)
Task 491978 | 05:04:18 | Updating instance tls-credhub: tls-credhub/f35d472b-042e-4aae-b99a-0ef4e03afe3a (0) (canary) (00:04:26)
Task 491978 | 05:04:41 | Updating instance cc-worker: cc-worker/01744c0b-2b56-46a5-974b-d72294c219e2 (0) (canary) (00:04:48)
Task 491978 | 05:04:41 | Updating instance cc-worker: cc-worker/bd686fed-576c-4253-ab36-86fcf6636ad6 (1)
Task 491978 | 05:05:54 | Updating instance uaa: uaa/2a346cc8-5751-43d3-8f4f-9913985159ef (0) (canary) (00:06:01)
Task 491978 | 05:05:54 | Updating instance uaa: uaa/8788ddbc-0abc-428b-9d81-378161c4c141 (1)
Task 491978 | 05:07:12 | Updating instance api: api/e4e79851-959a-4400-b1e9-e9d08731085a (0) (canary) (00:07:19)
Task 491978 | 05:07:12 | Updating instance api: api/205d4033-3705-4957-a0f4-9f0e89de2fc0 (1)
Task 491978 | 05:07:39 | Updating instance cc-worker: cc-worker/bd686fed-576c-4253-ab36-86fcf6636ad6 (1) (00:02:58)
Task 491978 | 05:11:34 | Updating instance uaa: uaa/8788ddbc-0abc-428b-9d81-378161c4c141 (1) (00:05:40)
Task 491978 | 05:11:59 | Updating instance api: api/205d4033-3705-4957-a0f4-9f0e89de2fc0 (1) (00:04:47)
Task 491978 | 05:11:59 | Updating instance router: router/afef4495-635a-4025-822c-cc942635e003 (0) (canary) (00:10:06)
Task 491978 | 05:22:05 | Updating instance router: router/bc4f362a-0368-45c3-8d77-62a0b749f4fb (2) (00:10:12)
Task 491978 | 05:32:17 | Updating instance router: router/5145c5a0-6dd2-461f-8b96-97ab688b0cc7 (1) (00:10:11)
Task 491978 | 05:42:28 | Updating instance router: router/59b012bb-0b87-48b4-b414-a83d5fc90b38 (3) (00:09:31)
Task 491978 | 05:52:00 | Updating instance scheduler: scheduler/5de9507b-183d-45cd-90a4-e4be53c0b2ec (0) (canary) (00:04:29)
Task 491978 | 05:56:29 | Updating instance scheduler: scheduler/bf22d5bd-74d7-4192-afa2-a2c60700fb90 (1) (00:09:53)
Task 491978 | 06:06:22 | Updating instance doppler: doppler/7c8c8dfc-ab3d-4228-9e36-f4a627a37438 (0) (canary)
Task 491978 | 06:06:22 | Updating instance diego-cell: diego-cell/51fd965b-d39b-4523-9fd8-f48606afb85f (0) (canary)
Task 491978 | 06:10:48 | Updating instance doppler: doppler/7c8c8dfc-ab3d-4228-9e36-f4a627a37438 (0) (canary) (00:04:26)
Task 491978 | 06:10:48 | Updating instance doppler: doppler/4318ea1e-cd50-4595-90f7-100b695054b8 (2)
Task 491978 | 06:11:34 | Updating instance diego-cell: diego-cell/51fd965b-d39b-4523-9fd8-f48606afb85f (0) (canary) (00:05:12)
Task 491978 | 06:11:34 | Updating instance diego-cell: diego-cell/0218c6b5-1d46-4166-9fe7-991f43aa952a (2)
Task 491978 | 06:15:08 | Updating instance doppler: doppler/4318ea1e-cd50-4595-90f7-100b695054b8 (2) (00:04:20)
Task 491978 | 06:15:08 | Updating instance doppler: doppler/7d3d5da9-0a68-473f-aac2-ad8bdc1118ea (1)
Task 491978 | 06:16:14 | Updating instance diego-cell: diego-cell/0218c6b5-1d46-4166-9fe7-991f43aa952a (2) (00:04:40)
Task 491978 | 06:16:14 | Updating instance diego-cell: diego-cell/8eb25663-c41d-4bd0-98de-51bb43d55a0a (4)
Task 491978 | 06:19:12 | Updating instance doppler: doppler/7d3d5da9-0a68-473f-aac2-ad8bdc1118ea (1) (00:04:04)
Task 491978 | 06:19:12 | Updating instance doppler: doppler/d91aa2bc-087a-4b93-a1cb-eb5c7c91fcdc (3)
Task 491978 | 06:21:24 | Updating instance diego-cell: diego-cell/8eb25663-c41d-4bd0-98de-51bb43d55a0a (4) (00:05:10)
Task 491978 | 06:21:24 | Updating instance diego-cell: diego-cell/b283bd27-a20c-4a50-affa-14609d6fd46e (1)
Task 491978 | 06:24:27 | Updating instance doppler: doppler/d91aa2bc-087a-4b93-a1cb-eb5c7c91fcdc (3) (00:05:15)
Task 491978 | 06:26:00 | Updating instance diego-cell: diego-cell/b283bd27-a20c-4a50-affa-14609d6fd46e (1) (00:04:36)
Task 491978 | 06:26:00 | Updating instance diego-cell: diego-cell/b688f6b5-0dd5-4ac2-9da8-aa3b635443f7 (3) (00:04:03)
Task 491978 | 06:30:03 | Updating instance diego-cell: diego-cell/ceedf0f8-d3c2-41aa-9ddf-71989413fb78 (5) (00:05:15)
Task 491978 | 06:35:18 | Updating instance log-api: log-api/cacbdb3e-e78b-47bb-abfe-f6fa4383e30a (0) (canary) (00:04:40)
Task 491978 | 06:39:58 | Updating instance log-api: log-api/dbdb98ae-647e-462f-b691-bed3f7109456 (1) (00:04:37)
Task 491978 | 06:44:35 | Updating instance credhub: credhub/4b645c56-7804-4014-8766-2a95268bf478 (0) (canary) (00:04:22)
Task 491978 | 06:48:57 | Updating instance credhub: credhub/fa9aed58-35c2-43a6-9cff-c77f0ae875fa (1) (00:03:38)

And yes, as you said, the control plane outage did not immediately cause damage to the rest of the system; however, as the diego-cells started being replaced, we eventually lost access to all application instances. I.e. I'm not 100% sure whether they stopped running or not, but we were unable to reach them via the Gorouters - possibly just the route registration went down?

Once we figured out how to jerry-rig cloud_controller_ng into starting again, all of our apps eventually came back without further tinkering.

Happy to post those additional logs now.

@aeijdenberg
Contributor Author

Note also that while bosh thinks the job started, as far as I could tell (by repeatedly running stat /var/vcap/data/cloud_controller_ng/*), cloud_controller_ng never actually finished starting.

The docs also hint that so long as the start command stays alive for long enough, bosh thinks it's healthy:

  1. monit start is called for each process in no particular order
  2. each job can specify zero or more processes
  3. times out based on canary_watch_time/update_watch_time settings

(from https://bosh.io/docs/job-lifecycle/)

Then the bpm docs I linked above seem to say we need a post-start script to do the kind of health check that's being used here.

I also noticed that, once running, the pid files both showed a pid of 1. I'm guessing this is due to containerisation - but if monit is monitoring that PID 1 from the host, then maybe it's monitoring the wrong one? (I don't know much about how either monit or bpm work.)

@aeijdenberg
Contributor Author

cloud_controller_ng pre- and post-start logs.

post-start.stderr.log
post-start.stdout.log
pre-start.stderr.log
pre-start.stdout.log

@tcdowney
Member

Hmm, didn't see anything in the logs that would explain why it took so long to start listening on that socket... talked it through a bit with some of the folks on BPM and they agree that doing some health-checking in the post-start is a good practice in general. (See this convo.)

We'll try to repro using the incorrect socket approach you mentioned in #126 and review that PR!

@aeijdenberg
Contributor Author

I came up with this hacky command to print out how long it takes between monit trying to start cloud_controller_ng and it becoming healthy.

bosh ssh -r -d cf api -c "echo \"cloud_controller_ng took \$((\$(stat /var/vcap/data/cloud_controller_ng/cloud_controller.sock --printf %Y)-\$(date --date=\"\$(sudo grep \"start service 'cloud_controller_ng' on user request\" /var/vcap/monit/monit.log | tail -n 1 | cut -c 2-20)\" +%s))) seconds to become healthy\""
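(For readability, here is the same measurement unrolled, as a sketch only; it uses exactly the same paths and timestamp slicing as the one-liner above:)

# Sketch: the one-liner above, unrolled. It takes the mtime of the unix socket
# cloud_controller_ng eventually opens and subtracts the timestamp of monit's
# most recent start request, as recorded in the monit log.
sock_mtime=$(stat /var/vcap/data/cloud_controller_ng/cloud_controller.sock --printf %Y)
start_line=$(sudo grep "start service 'cloud_controller_ng' on user request" /var/vcap/monit/monit.log | tail -n 1)
start_epoch=$(date --date="$(echo "$start_line" | cut -c 2-20)" +%s)
echo "cloud_controller_ng took $((sock_mtime - start_epoch)) seconds to become healthy"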

Our prod environment (still running cf-deployment v7.2.0) shows:

Instance   api/aa77c8f2-8395-4957-98de-088d98095e2f  
Stdout     cloud_controller_ng took 34 seconds to become healthy  
             
Instance   api/dae7f344-9bec-47e6-b115-ea45e52ccdcc  
Stdout     cloud_controller_ng took 32 seconds to become healthy  

Our users staging environment (cf-deployment v7.3.0):

Instance   api/1f1334a7-9bee-4ce4-a3f3-e9d493398436  
Stdout     cloud_controller_ng took 26 seconds to become healthy  
         
Instance   api/e4e79851-959a-4400-b1e9-e9d08731085a  
Stdout     cloud_controller_ng took 492 seconds to become healthy  

(I think that outlier is due to me disabling the monit health-check partway through to create the logs above.)

Our dev environment (cf-deployment v7.4.0):

Instance   api/67d7a896-d497-4180-b9b3-eeab4d11491f  
Stdout     cloud_controller_ng took 27 seconds to become healthy  

In any case, it seems that 30-40 seconds is not unusual in our environments (we're using m3.medium instance types for these).

@ericpromislow
Contributor

I don't think that timeout 60 in the monit config file does what the original author thought it did. It seems to have no effect at all.

From our experimentation, if you want the health check to be retried within a time period until the test passes, you need to use for N cycles where N = (time period in seconds) / 10, since the monit cycle here is 10 seconds.

I inserted a sleep 240 before the cloud controller starts the thin server and changed the monit config to for 30 cycles and it now works. The part I don't like is that if the cloud controller hangs it will now be 5 minutes before monit finally restarts it.

I'm not sure the changes to the post_start and /spec file are needed if the monit syntax is adjusted this way.
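A rough sketch of that arithmetic, assuming the 10-second monit cycle seen in these environments (variable names are illustrative):

# How many monit cycles are needed to tolerate a given worst-case startup time,
# assuming a 10-second cycle length.
worst_case_startup_seconds=240   # e.g. the artificial sleep in the experiment above
cycle_seconds=10
cycles=$(( (worst_case_startup_seconds + cycle_seconds - 1) / cycle_seconds ))  # ceiling division
echo "use 'for ${cycles} cycles' (or more, for margin) in the monit check"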

@aeijdenberg
Contributor Author

@ericpromislow - glad that you've found an improved health-checking configuration for monit; that will help address issue 2 above.

I disagree that this removes the need for the changes to post_start, though. Those changes ensure that cloud_controller_ng is verified to have completed startup before allowing a deployment to proceed further (i.e. they address issue 1 above).

That is separate from other changes that are necessary (such as adjustments to the monit health checking) to fix this particular reason why cloud_controller_ng may fail to complete starting correctly (issue 2 above).

i.e. both are needed as they address different issues. You're right that your fix would likely make the deployment succeed, and that mine specifically wouldn't (instead it would make it fail before nuking the whole deployment), as they address different layers of the swiss cheese model.

@aeijdenberg
Contributor Author

During some more experimentation I note that when I run:

monit stop cloud_controller_ng
... wait a bit for it to stop ...
monit start cloud_controller_ng

then the script above shows a startup time of 10 seconds before becoming healthy.

If instead I run:

monit stop all
... wait a bit for all to stop ...
monit start all

then it takes 25-27 seconds before becoming healthy.

Running top during this period shows 3 Ruby processes each competing for CPU.

@cwlbraa
Contributor

cwlbraa commented Feb 14, 2019

The post-start change seems like the right direction in terms of process orchestration. Confirming that the pid is up is not sufficient to move on with the deployment, we should confirm that it passes a basic healthcheck as well. I think there's better script tooling in monit_utils.sh to handle that than the raw curl in @aeijdenberg 's PR, but that's really an implementation detail. The behavior where diego loses apps occurs if the cells are evacuating and CC is down. @goonzoid brought this up in slack at some point, but I'm struggling to recall details of how that goes down.

Poking further at @ericpromislow 's experiment, though, the endless-restart situation does seem pretty troubling. Reading monit docs, I think

 if failed host 10.24.20.9 port 9022 protocol http
      and request '/v2/info'
      with timeout 60 seconds for 5 cycles
  then restart

is more easily understood written like this:

 if failed
      host 10.24.20.9
      port 9022
      protocol http
      request '/v2/info' # GET  http://10.24.20.9:9022/v2/info
      timeout 60 seconds # wait a max of 60 seconds for a response each time
      for 5 cycles # if it fails 5 times,
  then restart #  then restart

The crazy part that leads to infinite restarts is that, after 5 attempts and a restart, monit doesn't seem to give it another 5 attempts, and instead triggers a restart for each successive failed GET request. The man page for 5.2.5 says it should reset the counter on restart, but I still can't tell if this is a monit bug or is somehow related to our misuse of bpm / lack of a post-start healthcheck. Nailing that down will require additional investigation.

This is a pretty awkward pair of bugs; thanks for all your help tracking them down, @aeijdenberg. I wish we had seen either of them in isolation before you had to hit both of them in tandem.

@aeijdenberg
Contributor Author

We changed our AWS instance types from m3.medium to m4.large today, and that was enough to mask these issues again in the environment that originally broke. We observed startup times of 18-21 seconds (so 1-3 monit health failures - not enough to get into the crash loop previously seen).

Clearly that's not an actual fix to the root problems, but I think enough to stabilise that environment for us.

@aeijdenberg
Contributor Author

And I think we figured out why we didn't hit this issue in our development environment. That env defaults to really small instance types (t2.small) which, probably due to their burstable nature, are actually faster than the more expensive m3.mediums that we were using in our other environments.

@cwlbraa
Contributor

cwlbraa commented Feb 20, 2019

We confirmed today, in isolation, that the restart loop happens without any of the bpm or cc bits involved, but on monit 5.2.5 there's no other obvious way to wire that network healthcheck up to restart behavior. Honestly I'm still a little confounded by the startup times you're seeing, so if you could dig there more, that'd be interesting!

@aeijdenberg
Contributor Author

I did a bit more poking around on one of our dev API servers today.

$ bosh ssh -d cf api/0
$ sudo su vcap
$ source /var/vcap/jobs/cloud_controller_ng/bin/ruby_version.sh
$ cd /var/vcap/packages/cloud_controller_ng/cloud_controller_ng
$ time /var/vcap/packages/cloud_controller_ng/cloud_controller_ng/bin/cloud_controller -c bogusconffile
ERROR: Failed loading config from file 'bogusconffile': No such file or directory @ rb_sysopen - bogusconffile

real	0m8.977s
user	0m7.056s
sys	0m0.969s

Looks like starting cloud_controller_ng takes nearly 9 seconds to discover that the configuration file doesn't exist.

When I run it again, with strace:

$ strace -o $HOME/mystrace -tt -T /var/vcap/packages/cloud_controller_ng/cloud_controller_ng/bin/cloud_controller -c bogusconffile

and take a look at the output:

$ cat $HOME/mystrace | grep "open(" | wc -l
230234
$ cat $HOME/mystrace | grep "open(" | grep "No such file or directory" | wc -l
223695

I note during those 9 seconds, the Ruby process attempts to open 230K files - of which 223K do not exist!

I'm not sure if this is normal for Ruby, but my guess is that a search path is borked, and as such it's looking for files in a very inefficient manner?
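If it helps, here is one way to see which directories those failed open() calls are hitting, grouped and counted (a sketch; it assumes the same strace output file as above):

# Group the failed open() calls from the strace output by directory, to see
# where Ruby spends its time probing for files that don't exist.
grep 'open(' "$HOME/mystrace" \
  | grep 'No such file or directory' \
  | sed -E 's/.*open\("([^"]+)".*/\1/' \
  | sed -E 's,/[^/]*$,,' \
  | sort | uniq -c | sort -rn | head -20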

I'll keep digging, but thought this worthy of a note.

@aeijdenberg
Contributor Author

Here's a more concise command to demonstrate the potential issue:

bosh ssh -d cf api/0 'cd /var/vcap/packages/cloud_controller_ng/cloud_controller_ng && PATH=/var/vcap/packages/ruby-2.4-r5/bin:$PATH strace -e trace=open -c /var/vcap/packages/cloud_controller_ng/cloud_controller_ng/bin/cloud_controller -c bogusconffile'

Gives output:

Using environment 'director.bosh.cld.internal' as client 'admin'

Using deployment 'cf'

Task 522034. Done
api/67d7a896-d497-4180-b9b3-eeab4d11491f: stderr | Unauthorized use is strictly prohibited. All access and activity
api/67d7a896-d497-4180-b9b3-eeab4d11491f: stderr | is subject to logging and monitoring.
api/67d7a896-d497-4180-b9b3-eeab4d11491f: stdout | ERROR: Failed loading config from file 'bogusconffile': No such file or directory @ rb_sysopen - bogusconffile
api/67d7a896-d497-4180-b9b3-eeab4d11491f: stdout | % time     seconds  usecs/call     calls    errors syscall
api/67d7a896-d497-4180-b9b3-eeab4d11491f: stdout | ------ ----------- ----------- --------- --------- ----------------
api/67d7a896-d497-4180-b9b3-eeab4d11491f: stdout | 100.00    0.421282           2    230234    223710 open
api/67d7a896-d497-4180-b9b3-eeab4d11491f: stdout | ------ ----------- ----------- --------- --------- ----------------
api/67d7a896-d497-4180-b9b3-eeab4d11491f: stdout | 100.00    0.421282                230234    223710 total
api/67d7a896-d497-4180-b9b3-eeab4d11491f: stderr | Connection to 10.3.16.14 closed.

I realise it only claims 0.4 seconds of total system time, but this still looks like a lot of unnecessary churn.

@gdenn

gdenn commented Feb 21, 2019

@aeijdenberg we had the same problem on our staging environment. After upgrading our api VM from a single-core to a dual-core CPU, the problem was resolved.

@cwlbraa
Contributor

cwlbraa commented Feb 22, 2019

Yeah, our "realistic" CI environment takes ~15 seconds to start up from a monit perspective, double checked today. Ruby is not the quickest language to start up, and perhaps our ruby load process is globbing a bit aggressively, but it does sound like your environment was under-provisioned.

@aeijdenberg
Contributor Author

@cwlbraa - couldn't disagree more. :) Objectively we've had no issues with performance on our API servers and they were consistently sitting at less than 30% CPU utilisation, thus they were not under-provisioned.

The only "performance" issue we've had is that this particular job has a misconfigured health check which fails to account for a slow but successful startup time.

I.e. we should be sizing these VMs to optimise capacity to handle peak loads of actual traffic - it makes no sense to optimise simply for a fast startup time due to a poorly designed health check.

And I am truly blown away by the inefficiency of how Ruby handles file dependencies - but given that limitation, and that it seems to get worse roughly O(n^2) as more gems are added to the project, we can't keep bumping VM sizes (that we pay for by the hour) forever.

Perhaps this health check should simply be removed from the monit configuration, if we can't rewrite it to allow for realistic start-up times for this process?

@cwlbraa
Contributor

cwlbraa commented Feb 22, 2019

Perhaps this health check should simply be removed from the monit configuration, if we can't rewrite it to allow for realistic start-up times for this process?

You're not the first person to suggest this with regard to this issue, and I've been on board with removing it from the moment we confirmed that monit doesn't wait another 5 failed cycles before restarting again! :-) We even went and searched for other ways to implement that behaviour and came up empty-handed.

The main thing holding us back here is the implications this will have on CCs shipped and deployed in environments with less experienced or attentive operators. A lot of our monit health checking was added with the intent of allowing the CC to self-heal to some degree, unattended, and that's the intent of the network health check as well. The obvious kicker here is that, in this scenario, "self-healing" ends up being a sort of "self-malpractice" situation. We don't have a good way to find out if that health check is actually healing environments successfully in the wild, and discovering that it was after removing it would be pretty painful. Truthfully though, I suspect that this health check is doing more harm than good.

@aeijdenberg
Contributor Author

Why the delivered label? (is there a commit I missed?)

Only 1 of the 4 issues originally identified has been resolved - and now a deployment will fail fast instead of bricking a CF, but it will still fail.

@tcdowney
Member

@aeijdenberg I think this was delivered because the story created by the PR for the first issue (PR #126) has made it through our CI and is out for acceptance and it's tied to this issue. It's difficult on our end to track multiple problems with a single Tracker story, so we're going to make separate stories in our backlog for each of these and we'll link them here.

@cwlbraa is going to log one to cover the monit health check issue. As far as I can tell that's the next most important one here, but we'll need to do more research on that one.

I'll log the other two, but just to be clear the third issue you found is the Ruby process trying to open a large number of files on startup which we think is responsible for the slow startup times. What's the fourth one?

@cwlbraa
Contributor

cwlbraa commented Feb 27, 2019

Also, the issue described in the title is solved by the PR.

Here's the monit issue in our tracker: https://www.pivotaltracker.com/story/show/164263056

@aeijdenberg
Contributor Author

@cwlbraa - fair call, bosh no longer thinks that cloud_controller_ng has started successfully, so yes the title of the issue is addressed. :)

The four parts that I reported are:

  • The bosh canary process should have stopped this propagating to the 2nd api instance - I don't understand why bosh believed that the process started correctly, when it did not?

This is now fixed.

  • Without understanding the monit health check syntax - it reads as though it should require 5 failures, 60 second apart before attempting a restart. Instead a single failure 13 seconds after startup is triggering a restart.

This is now understood; however, it isn't yet mitigated: once in a failing state, a single failed cycle will attempt a restart, and given that cycles are every 10 seconds, which is roughly how long it takes to start, there's a very good chance this will occur. But let's call that tracked in: https://www.pivotaltracker.com/n/projects/966314/stories/164263056

  • It seems dangerous that cloud_controller_ng is depending on nginx_cc which in turn depends on cloud_controller_ng for a health-check. Having a circular dependency feels like it's asking for trouble.

This does still feel wrong to me. I think it's OK for the post-start check, which is essentially an integration test of all of the components that it started, but it still feels flaky for the health check on cloud_controller_ng itself.

  • Taking 26 seconds to write out a PID and open a listening socket feels like a long time? Unclear why this is slow in this particular environment.

So long as the healthcheck that runs every 10 seconds is a thing, then this is still a problem. Bumping machine specs just for the purpose of an already flaky health check during startup is a workaround, not a proper fix.

@ericpromislow
Contributor

We tried to reproduce bug #4 by taking the following steps:

  1. Deploy a bosh-lite with 2 VMs. Easiest way: cd to some bosh-lite dir and run deploy -o ~/workspace/capi-ci/cf-deployment-operations/scale-up-cloud-controller.yml
  2. bosh ssh api/1
  3. "Hide" the gem directory from ruby. We aren't sure this is the exact underlying cause of bug No support for Quota definition updates #4, but it might have the same symptoms:
cd /var/vcap/data/packages/ruby-2.4-r5/SOME-HASH/lib/ruby/gems/2.4.0/
mv gems hide-gems
cd /var/vcap/sys/log/cloud_controller_ng
monit restart cloud_controller_ng

tail -F cloud_controller*.log shows plenty of error messages trying to find the gems, now that the directory is hidden.

We then exited the bosh ssh and ran bosh restart api/1 and got lots of error messages:

    Error: Unable to render instance groups for deployment. Errors are:
  - Unable to render jobs for instance group 'nats'. Errors are:
    - Unable to render templates for job 'nats'. Errors are:
      - Error filling in template 'nats.conf.erb' (line 35: Can't find link 'nats')
    - Unable to render templates for job 'loggregator_agent'. Errors are:
      - Error filling in template 'bpm.yml.erb' (line 7: Can't find link 'doppler')
...
and bosh gives up.

However, `monit status` on the node is still cycling between 'initializing', 'Does not exist', and 'not monitored'. Should bosh tell monit to stop trying to bring up the cloud controller when it hits those errors in the `Unable to render instance groups for deployment` section?

@ericpromislow  and @sannidhi 
@charleshansen -- would you have any suggestions on this behavior?

cwlbraa added a commit that referenced this issue May 17, 2019
* ancient bosh monit will continuously restart us at 5+1 monit cycles
  whenever CC might be having a hard time getting up within 10 seconds,
  so this produces more predictable behavior where it only restarts
  every 5 failed curls. it also doesn't interfere with initial startup.

[fixes #125]
[fixes #164263056]

Co-authored-by: Connor Braa <cbraa@pivotal.io>
Co-authored-by: Tim Downey <tdowney@pivotal.io>