Does APM contextPropagationOnly limit server capacity? #129585
There was a suggestion to disable APM instrumentation for the static asset routes. See the old issue.
This suggestion was revoked on today's call with the APM client team. @elastic/kibana-core, there are no other actionable items until the APM team finishes their investigation.
I think that there's still one actionable issue on the @elastic/kibana-core side: make sure that we can completely turn off APM with configs if we need to (i.e. not call init at all). On the @elastic/apm-agent-node-js team's side, they mentioned they will look at the performance of the
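To make the "not call init at all" idea concrete, here is a minimal sketch. It is not Kibana's actual init code: `initApm` and `createFakeAgent` are hypothetical names, and treating an explicit `active: false` as "skip `start()` entirely" is an assumption. The only real agent surface relied on is `start(options)` with the `active` and `contextPropagationOnly` options of elastic-apm-node; a stand-in agent is used so the sketch runs without the package installed.

```javascript
// Hypothetical helper (not Kibana's real code): decide whether to start the agent at all.
// `agent` is anything exposing start(options), e.g. require('elastic-apm-node').
function initApm(agent, apmConfig = {}) {
  // Assumption: an explicit `active: false` means "never call start()".
  if (apmConfig.active === false) {
    return { started: false, options: null };
  }
  // Default described in this issue: the agent runs with contextPropagationOnly
  // when no APM config is provided.
  const options = { contextPropagationOnly: true, ...apmConfig };
  agent.start(options);
  return { started: true, options };
}

// Stand-in agent that records start() calls, so the sketch is self-contained.
function createFakeAgent() {
  const calls = [];
  return {
    calls,
    start(options) {
      calls.push(options);
      return this;
    },
  };
}

const offAgent = createFakeAgent();
const off = initApm(offAgent, { active: false, contextPropagationOnly: false });
console.log(off.started, offAgent.calls.length); // false 0

const defaultAgent = createFakeAgent();
const on = initApm(defaultAgent, {});
console.log(on.started, defaultAgent.calls[0].contextPropagationOnly); // true true
```

With a guard like this, the two settings used in the load test in this thread (`elastic.apm.active: false` and `elastic.apm.contextPropagationOnly: false`) would leave the agent untouched rather than started in a reduced mode.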
I think it depends. If it turns out the Node.js APM agent overhead is mostly due to its
Let's see what comes up and then consider our alternatives?
FWIW, in the scope of #123737, we also have a performance check with noAPM vs contextPropagationOnly planned: elastic/kibana-load-testing#245
I performed some load testing for this issue:
Test env
kibana instances: GCP
reference instance: vanilla
test: vanilla, elastic.apm.contextPropagationOnly: false, elastic.apm.active: false
Summary
My observations confirm @lizozom's initial benchmarking: we have a significant difference in performance under heavy load when APM is enabled in
Raw results
reference branch (apm in
@pgayvallet is there a way for you to capture a CPU profile with and without the agent for a scenario that sees a particularly high impact, such as
Have you conducted testing under lighter load, too? If so, is the impact of the agent as significant?
Btw, which version of the Node.js agent was used in the tests? Version 3.31.0 comes with performance improvements.
How is the run-to-run variance on these tests? In other words, how reproducible are the results? I noticed that the standard deviation is about as high as the mean.
Great questions, thank you! @pgayvallet is out for a few days, so we'll follow up next week. Anyway, let's wait for Pierre to get back for the rest of the questions.
Sorry, should definitely have mentioned the version used. Tests were run with the version of the agent currently used by Kibana's
I did not, but I sure can. I agree with your assumption btw, and my guess is that the impact should be way less significant under light load, given the
Pretty low. I ran the suites 3 times for each scenario, with variance below 10% for both. FWIW, the standard deviation being incredibly high seems 'normal' to me given the large difference between the low and high percentile metrics.
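As a rough illustration of why a standard deviation close to the mean is compatible with reproducible runs, consider the sketch below. All numbers are made up for the example and are not taken from the load tests; `relativeSpread` is a hypothetical way of reading "variance below 10%" as the spread between per-run means.

```javascript
// Mean and population standard deviation of a sample.
function meanAndStdDev(values) {
  const mean = values.reduce((sum, v) => sum + v, 0) / values.length;
  const variance =
    values.reduce((sum, v) => sum + (v - mean) ** 2, 0) / values.length;
  return { mean, stdDev: Math.sqrt(variance) };
}

// Nearest-rank percentile, with p between 0 and 100.
function percentile(values, p) {
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}

// Run-to-run variation: (max - min) of the per-run means, relative to their average.
function relativeSpread(runMeans) {
  const avg = runMeans.reduce((sum, v) => sum + v, 0) / runMeans.length;
  return (Math.max(...runMeans) - Math.min(...runMeans)) / avg;
}

// Made-up response times in ms: 90 fast requests and 10 slow ones (a long tail).
const latencies = [...Array(90).fill(100), ...Array(10).fill(600)];
const { mean, stdDev } = meanAndStdDev(latencies);
console.log(mean, stdDev); // 150 150
console.log(percentile(latencies, 50), percentile(latencies, 95)); // 100 600

// Made-up mean response times of three runs of the same suite.
const spread = relativeSpread([150, 158, 146]);
console.log(spread < 0.1); // true
```

The within-run standard deviation is driven by the gap between the median and the tail, while the run-to-run spread only compares aggregate results across runs, so the two can differ widely.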
I actually never performed CPU profiling on a Kibana instance, but I see @lizozom generated a flamegraph in the issue's initial description, so she can probably provide me with some insight here. FWIW, we can't use
@pgayvallet and I synced up today. Wanted to add a few more highlights:
Pinging @elastic/apm-ui (Team:apm)
@stratoula this is not an issue for
Chatted with @pgayvallet
Removing self-assignment as I'm not actively working on it atm
There is now an elastic-apm-node@3.37.0 that includes elastic/apm-agent-nodejs#2786 which should significantly reduce overhead from the APM agent for Kibana. @lizozom or others: What do you think are best next steps for this issue? How about:
@trentm your proposed next steps seem like the logical way to progress here.
Renovate already took care of that: #136657
Ha thanks. I hadn't yet managed to bootstrap a kibana clone. :)
I'm so glad to see that this issue ended up yielding some performance improvements. Thank you so much for investing the time into this. I'm OOO in the next couple of weeks, but maybe @pgayvallet could run the benchmarks to verify? More broadly speaking, we want to start running some capacity benchmarks in the foreseeable future, so we will obviously track this as well.
While benchmarking our server capacity and running a node profiler, we saw that our server-side APM node integration (defined in src/cli/apm.js, and toggled on by src/cli/dist.js) is active, even if no APM configs are provided (it is enabled with contextPropagationOnly).
This enabled integration reduces server capacity (how many concurrent requests we can handle) by a factor of about 3-4 for static routes and by 25-50% for other dynamic routes.
According to the @elastic/apm-agent-node-js team, this is understandable behavior, especially for simpler routes (such as static routes), where the cost of APM is large relative to the work the route itself does.
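The relative-impact argument can be sketched with a toy model in which every request pays a fixed instrumentation cost on top of the route's own work. The timings below are hypothetical, chosen only to land in the same ballpark as the figures reported here; they are not measurements, and `capacityFactor` and `capacityReduction` are illustrative names.

```javascript
// Toy model: each request costs `routeUs` of route work plus a fixed
// `overheadUs` of instrumentation (both in microseconds of CPU time).
// Returns how many times more requests fit in the same CPU time without the overhead.
function capacityFactor(routeUs, overheadUs) {
  return (routeUs + overheadUs) / routeUs;
}

// The same effect expressed as the fraction of capacity lost to the overhead.
function capacityReduction(routeUs, overheadUs) {
  return 1 - routeUs / (routeUs + overheadUs);
}

// Hypothetical numbers, not measurements.
const overheadUs = 300;
const staticFactor = capacityFactor(100, overheadUs); // cheap static route
const dynamicFactor = capacityFactor(600, overheadUs); // heavier dynamic route

console.log(staticFactor, dynamicFactor); // 4 1.5
console.log(capacityReduction(100, overheadUs).toFixed(2)); // 0.75
console.log(capacityReduction(600, overheadUs).toFixed(2)); // 0.33
```

Under these assumptions the same fixed overhead costs a cheap static route about three quarters of its capacity, but a heavier dynamic route only about a third.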
Attached is a [FlameGraph](https://www.brendangregg.com/flamegraphs.html) from a session where we serve static files from a kibana server, in which you can search for APM execution (APM wasn't explicitly enabled on that server). To search in the file, you need to download it, open it in a new tab, and then hit Ctrl+F.