Default APM Configurations for monitoring of kibana clusters #117492
@tylersmalley once you get that subdomain for our server url, mind updating it here? https://github.com/elastic/infra/pull/32868
I believe we need to have …
@joshdover what does …
@trentm can probably comment in more detail, but in performance testing we did earlier this year, we saw a pretty dramatic increase in overhead when this is enabled. I believe it creates stack traces for every span and sub-span, which has memory, network, and CPU overhead.
Pinging @elastic/apm-ui (Team:apm)
@lizozom Approving from the RUM agent side. Looks good in terms of perf and data concerns.
@vigneshshanmugam can you help me understand the parts marked with … (read the docs, still not sure)? Also, what is the intent behind the …
It reduces the network communication overhead at the cost of higher memory consumption. Are you going to use …
Thanks @mshustov, I will add … However, I still would love a real clarification of what each thing does. For example, I think that the existing docs require a large amount of context I don't have ATM.
Yes, I've had the same problem, so it took some time to dig through the APM repo, but I found this doc describing this specific setting.
This is a value that we can tune up and down between 0.0 and 1.0 based on a balance between lower measured load (on the traced Kibana and on the receiving kibana-cloud-apm.elastic.dev infra) and having more complete trace data.
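As a concrete illustration of tuning that trade-off, the knob looks like this in kibana.yml (the 0.1 value here is just an example, not an agreed default):

```yaml
# Sample 10% of transactions. Raise toward 1.0 for more complete trace data,
# lower toward 0.0 to reduce load on the traced Kibana and on the receiving
# kibana-cloud-apm.elastic.dev infra.
elastic.apm.transactionSampleRate: 0.1
```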
@mshustov Found the original design document. When we were doing perf work on the Node.js agent at the start of 2021, breakdown metrics calculation was ~2.5% of CPU time in a simulated app that did nothing but tracing, i.e. approaching the worst case. I don't know the RUM agent that well, but its https://www.elastic.co/guide/en/apm/agent/rum-js/current/breakdown-metrics-docs.html document suggests it might be more valuable data for front-end analysis. As @mshustov pointed out above, @vigneshshanmugam commented earlier that setting this …
This setting is only relevant for the Node.js APM agent. When the initial perf work was being done, this setting was the biggest contributor to APM perf impact. Since then, Node.js APM agent version 3.16.0 fixed a few perf issues that were likely the major contributors to this impact. However, we have also just stuck with …
This is relevant only to the RUM agent. When support for the 'tracestate' header was added to our agents, the RUM agent could not just start sending it, because that could break some customer usage because of CORS. In our case, we would need to make sure that https://kibana-cloud-apm.elastic.dev is configured to accept CORS requests including the tracestate header from any Kibana frontends using the RUM agent. https://www.elastic.co/guide/en/apm/agent/rum-js/current/distributed-tracing-guide.html#enable-tracestate discusses this. I don't have experience configuring this myself.
This is relevant only to the Node.js agent. It tells the agent to only gather, serialize and send metrics to APM server every 120s, rather than the default of 30s. I don't believe we have particular data that shows whether and how much perf impact this has. The majority of metrics perf impact was in the breakdown metrics calculation, and we have separately disabled that.
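Put into kibana.yml form, the two Node.js-agent settings discussed above would look roughly like this (a sketch of what the thread proposes, not verified benchmark-backed values):

```yaml
# Skip capturing a stack trace for every span and sub-span
# (the big memory/network/CPU cost noted earlier in the thread)
elastic.apm.captureSpanStackTraces: false
# Gather, serialize, and send metrics every 120s instead of the default 30s
elastic.apm.metricsInterval: 120s
```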
@joshdover expressed a preference for leaving this out. The APM configuration code in Kibana will set this to one of "development", "production" or "ci". There had been recent discussion on the related Telemetry issue about setting …
@trentm Has covered most of the RUM stuff as well.
Breakdown metrics calculation depends entirely on the number of different types of spans a given transaction contains when it comes to the RUM agent. Kibana being heavy on the number of network requests, it's good to keep it simple and not do that. Page load transactions are a different category, as we don't rely on the number of spans and instead use the metrics from the browser's Navigation Timing API. We do heavy CPU and memory benchmarks on the RUM agent; if we ever need to check any RUM-related stuff, we can always use this: https://github.com/elastic/apm-agent-rum-js/blob/master/packages/rum/test/benchmarks/BENCHMARKS.md
As a side note: Kibana allows users to configure CORS via `server.cors.enabled`, `server.cors.allowCredentials`, and `server.cors.allowOrigin`; see https://www.elastic.co/guide/en/kibana/current/settings.html for reference. Note that CORS is disabled by default.
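For reference, those Kibana-side CORS settings would look roughly like this in kibana.yml (the origin value is a placeholder assumption; the real value would be wherever the RUM agent's requests originate):

```yaml
server.cors.enabled: true
server.cors.allowCredentials: false
# Placeholder origin: replace with the actual Kibana frontend origin(s)
server.cors.allowOrigin: ["https://example-kibana-frontend.example.com"]
```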
@alex-fedotyev care to comment about the configuration proposed here?
I am updating the APM server in #117749 and checking that our settings match, which looks like they do. I think we should avoid setting things that are the same as the default. With that PR, we should be able to just set:
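(The config block itself did not survive in this copy of the thread. A plausible minimal version, inferred from the pre-7.16 block quoted later in the discussion and therefore an assumption rather than the original content, would keep only the non-default values:)

```yaml
elastic.apm.active: true
elastic.apm.serverUrl: "https://kibana-cloud-apm.apm.us-east-1.aws.found.io"
elastic.apm.secretToken: "JpBCcOQxN81D5yucs2"
elastic.apm.globalLabels.deploymentId: <ESS Cluster UUID>
elastic.apm.globalLabels.deploymentName: <ESS Cluster Name>
```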
v7.16+ will be able to drop …
I was going to add …
A clarification: https://kibana-cloud-apm.elastic.dev/ is the Kibana instance for us to access the data, while https://kibana-cloud-apm-server.elastic.dev is the underlying APM server we will configure to collect the data.
@tylersmalley I updated the issue with the smaller block for 7.16+. I saw that your PR includes the … @mshustov @trentm thanks for the explanations! Let me know if you think the configs above are satisfactory.
@lizozom, I also checked the pre-7.16 (e.g. 7.11, 7.14, 7.15) configs, and it looks like we only need to set …

Pre 7.16:

```yaml
elastic.apm.active: true
# Report APM info to central location
elastic.apm.serverUrl: "https://kibana-cloud-apm.apm.us-east-1.aws.found.io"
elastic.apm.secretToken: "JpBCcOQxN81D5yucs2"
# Labeling and discoverability
elastic.apm.globalLabels.deploymentId: <ESS Cluster UUID>
elastic.apm.globalLabels.deploymentName: <ESS Cluster Name>
# Performance fine tuning
elastic.apm.transactionSampleRate: 0.1
elastic.apm.captureSpanStackTraces: false
```
As of now, Kibana supports a merged configuration only. We can revisit this decision later, I'm going to create an issue to discuss.
From these docs https://www.elastic.co/guide/en/apm/agent/rum-js/current/distributed-tracing-guide.html#enable-cors it seems that if the Kibana browser app is served from the same domain where it sends …
See also https://github.com/elastic/telemetry/issues/1319 regarding perhaps wanting to set …
I've added …
My main concern (if I read the code correctly) is confusion on whether and when the … (see kibana/packages/kbn-apm-config-loader/src/config.ts, lines 26 to 48 at 19706fb).
I'm not sure the best place to discuss it (perhaps on #117749?), but what about moving all the …
Until Kibana is updated to a future Node.js APM agent with a fix for elastic/apm-agent-nodejs#2427, the "Post 7.16" config should also set …
@trentm thanks for the clarification. |
Yes, I think so. |
|
As an experiment I tried to analyze why hitting Kibana's telemetry endpoint might cause performance degradation (a scenario we've seen happen in a few SDHs). APM's traces give us good detail about why a certain API might take long (e.g. there are many serial requests to Elasticsearch), but it's not easy to see why the event loop is blocked and the entire server is slow. After some experiments with Søren we got some promising data from …

We got a recommendation to disable this setting in #78792 (comment), but that was based on the assumption that it could improve the performance of the RUM agent, which we don't use. We also seem to think that the performance impact has been lowered since our initial testing (#78792 (comment)).

Finding the reason for a slow event loop is probably the number one performance diagnosis problem we're struggling to solve at the moment, so I think we should use …
If the APM team doesn't own it, maybe …
Could you provide an example of …
Another option might be to give access to Kibana developers to adjust these settings. It would require some additional work from the Infra team to set up access to the APM settings only, so I doubt we'd implement this option.
It was my impression that this was owned by the Kibana Operations team, since they are codeowners of that part of the codebase and would have the most knowledge. The APM team (UI, server, and agents) is of course ready to help with anything.
This is an interesting question. I think that there is a level of uncertainty still around the performance implications of some of these configs. If we had up to date benchmarks, we could probably own this.
Currently multiple teams are integrating with APM (see issue in dev repo), wondering if we can have some guidelines.
This requires …
Benchmarking is owned by the apm agent devs (this specifically falls in the lap of the nodejs agent devs). |
@mshustov can Kibana own these settings? If so, which team would it be? At the moment, Kibana-core owns the integration, right?
I'd suggest that the (virtual) team working on …
If we don't have any obvious owners, APM UI can take this on. Is this just a matter of updating the default settings in …
@sqren Also, I think dealing with possible confusion with … I'm happy to help here on establishing a reasonable base config.
Yes, I remember that comment and agree 100%. I was also having a hard time following what settings would apply for which environment. Having it less dynamic would help. |
@sqren the main part is not only to update the defaults in the loader, but to establish the values that cloud will use to prepopulate cloud settings. I don't think there's much technical work to be done here, it's just about signing off these values as good enough and sufficient. @mshustov no dedicated virtual team for this yet. It's being worked on as part of each team's engineering efforts at the moment. Guess we can discuss this when we have the sync (2022). |
While we might end up changing the sampling rate, so far these defaults seem to serve us well.
We are in the process of enabling APM monitoring on some of our own internal clusters (such as telemetry, gh-stats, ci-builds etc).
To clarify, enabling APM in this context means that data collected by apm-rum and apm-node is sent out to another Kibana cluster ("the monitoring cluster") where we can review the actual performance of each monitored cluster.
We want to have a well-defined, clear, and uniform set of default configurations for these clusters.
APM Agent
Post 7.16
Once #117749 is merged, we can use a minimal config and only define the unique default configs:
Full configuration
Once we make sure the defaults in Kibana are good, we will be able to use a more minimal configuration:
APM Server
Following a chat with @felixbarny, we should set keep_unsampled: false on the APM servers so we don't save unsampled transactions.
In 8.0 this will be the default behavior and APM agents will stop sending unsampled transactions, helping us better control the volume of ingested data.
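In apm-server.yml terms, the setting discussed above would look roughly like this (a sketch; the exact key path should be double-checked against the 7.x APM Server docs):

```yaml
apm-server:
  sampling:
    # Drop unsampled transactions at ingest instead of storing them;
    # becomes the default behavior in 8.0
    keep_unsampled: false
```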
Read More
- breakdownMetrics
- propagateTracestate
- elastic.apm.environment: leave empty, and use it to filter based on production / qa / dev