# Propose OpenTelemetry profiling vision

The following are high-level items that define the long-term vision we aspire
to achieve for profiling support in the OpenTelemetry project.

While this vision document reflects our current desires, it is meant to be a
guide towards a collectively agreed-upon set of objectives rather than a
checklist of requirements. A group of OpenTelemetry community members has
participated in a series of bi-weekly meetings over two months. The group
represents a cross-section of industry and domain expertise and has found
common cause in the creation of this document. It is our shared intention to
continue to ensure alignment moving forward. As our vision evolves and matures,
we intend to incorporate our learnings to facilitate an optimal outcome.

This document and the efforts thus far are motivated by:

- This [long-standing issue](https://github.com/open-telemetry/oteps/issues/139)
  created in October 2020
- A conversation about priorities at the in-person OpenTelemetry meeting at
  KubeCon EU 2022
- Increasing community interest in profiling as an observability signal
  alongside logs, metrics, and traces

## What is profiling

While the terms "profile" and "profiling" can have slightly different meanings
depending on the context, for the purposes of this OTEP we define the two
terms as follows:

- Profile: A collection of stack traces with some metric associated with each
  stack trace, typically representing the number of times that stack trace was
  encountered (see the sketch after this list)
- Profiling: The process of collecting profiles from a running program,
  application, or the system
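
To make the first definition concrete, the following is a minimal sketch of the
information a profile carries. It is purely illustrative: the type and field
names are assumptions made for this example and do not represent the
OpenTelemetry data model.

```go
// profilesketch illustrates, in the simplest possible terms, what a profile
// contains: a set of stack traces, each paired with a metric value.
package profilesketch

// Frame identifies a single location in code.
type Frame struct {
	Function string // fully qualified function name
	File     string // source file, if known
	Line     int    // line number, if known
}

// Sample pairs one stack trace with the metric observed for it, for example
// the number of times the stack was seen or the CPU time attributed to it.
type Sample struct {
	Stack []Frame // call stack, leaf frame first
	Value int64
}

// Profile is a collection of samples plus the unit of their values.
type Profile struct {
	Samples []Sample
	Unit    string // e.g. "samples" or "nanoseconds"
}
```

Real-world formats typically add symbol tables, mappings, timestamps, and
labels on top of this core shape, but the stack-plus-value pairing is the
common denominator.
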
## How profiling aligns with the OpenTelemetry vision

The [OpenTelemetry
vision](https://opentelemetry.io/mission/#vision-mdash-the-world-we-imagine-for-otel-end-users)
states:

_Effective observability is powerful because it enables developers to innovate
faster while maintaining high reliability. But effective observability
absolutely requires high-quality telemetry – and the performant, consistent
instrumentation that makes it possible._

While existing OpenTelemetry signals fit all of these criteria, until recently
no effort had been explicitly geared towards creating performant and consistent
instrumentation based upon profiling data.

## Making a well-rounded observability suite by adding profiling

Currently Logs, Metrics, and Traces are widely accepted as the main “pillars” of
observability, each providing a different set of data that a user can query to
answer questions about their system or application.

Profiling data can help further this goal by answering certain questions about a
system or application that logs, metrics, and traces are less equipped to
answer. We aim to facilitate implementations capable of best-in-class support
for collecting, processing, and transporting this profiling data.

Our goals for profiling align with those of OpenTelemetry as a whole:

- **Profiling should be easy**: profiling offers fast time-to-value, since users
  can often drop in a minimal amount of code and instantly get details about
  application resource utilization
- **Profiling should be universal**: profiling currently differs slightly across
  languages, but with a little effort the representation of profiling data can
  be standardized so that it is not only consistent across languages but also
  consistent with the other observability signals
- **Profiling should be vendor neutral**: from one profiling agent, users should
  be able to send data to whichever vendor they like (or no vendor at all) and
  interoperate with other OSS projects

## Current state of profilers

As it currently stands, the method for collecting profiles for an application
and the format of the profiles collected vary greatly depending on several
factors, such as:

- Language (and language runtime)
- Profiler type
- Data type being profiled (e.g. CPU, memory)
- Availability or utilization of symbolic information

A fairly comprehensive taxonomy of various profiling formats can be found on the
[Profilerpedia website](https://profilerpedia.markhansen.co.nz/formats/).

As a result of this variation, the tooling and collection of profiling data fall
short in exactly the areas that OpenTelemetry has established as its core
engineering values:

- Profiling currently lacks compatibility: each vendor, open source project, and
  language has different ways of collecting, sending, and storing profiling
  data, often with no regard for linking it to other signals
- Profiling currently lacks consistency: profiling agents and formats can change
  arbitrarily, with no unified criteria for how to take end users into account

## Making profiling compatible with other signals

Profiles are particularly useful in the context of other signals. For example,
having a profile for a particular “slow” span in a trace yields more actionable
information than simply knowing that the span was slow.

OpenTelemetry will define how profiles will be correlated with logs, traces, and
metrics, and how this correlation information will be stored.

Correlation will work across two major dimensions:

- To correlate telemetry emitted for the same request (also known as request or
  trace context correlation; see the sketch after this list)
- To correlate telemetry emitted from the same source (also known as resource
  context correlation)
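
As one hedged illustration of the first dimension using tools that exist today,
Go's `runtime/pprof` labels can carry the active trace and span IDs while a
unit of work runs, so CPU samples collected during that work can later be
joined with the corresponding span. This is only a sketch of the idea; it is
not the correlation mechanism OpenTelemetry will specify.

```go
// A sketch of request-context correlation using existing tools: attach the
// current trace and span IDs as pprof labels for the duration of a unit of
// work, so samples taken while it runs carry those IDs.
package correlation

import (
	"context"
	"runtime/pprof"

	"go.opentelemetry.io/otel/trace"
)

// DoWithSpanLabels runs fn with the calling goroutine labeled by the trace and
// span IDs found in ctx. Profilers that capture pprof labels can then group
// samples by span.
func DoWithSpanLabels(ctx context.Context, fn func(context.Context)) {
	sc := trace.SpanFromContext(ctx).SpanContext()
	labels := pprof.Labels(
		"trace_id", sc.TraceID().String(),
		"span_id", sc.SpanID().String(),
	)
	pprof.Do(ctx, labels, fn)
}
```

Resource-context correlation, by contrast, would lean on the same resource
attributes (service name, host, container, and so on) that the other signals
already share.
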
## Standardize profiling data model for industry-wide sharing and reuse

We will design a profiling data model that aims to represent the vast majority
of profiling data, with the following goals in mind:

- Profiling formats should be as compact as possible
- Profiling data should be transferred as efficiently as possible, and the model
  should be lossless, with an intentional bias towards efficient marshaling,
  transcoding (to and from other formats), and analysis
- Existing profiling formats (e.g. collapsed, pprof, JFR) should map
  unambiguously to the standardized data model
- Profiling formats should contain mechanisms for representing relationships
  with other telemetry signals, such as linking call stacks with spans (see the
  sketch after this list)
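
As a purely illustrative example of what the last goal could imply for a data
model, the sketch below extends the earlier sample shape with identifiers that
tie a stack trace to a span. The field names and layout are assumptions made
for this discussion, not the design of the standardized model.

```go
// linksketch hints at how linking call stacks to spans might surface in a
// profiling data model. Illustrative only; not the standardized model.
package linksketch

// Frame identifies a single location in code (as in the earlier sketch).
type Frame struct {
	Function string
	File     string
	Line     int
}

// LinkedSample augments a stack-trace sample with identifiers that tie it to a
// span, enabling "show me the profile behind this slow span" workflows.
type LinkedSample struct {
	Stack   []Frame           // call stack, leaf frame first
	Value   int64             // metric for this stack (count, nanoseconds, ...)
	TraceID string            // hex trace ID of the request being served
	SpanID  string            // hex span ID active while the samples were taken
	Attrs   map[string]string // free-form attributes, e.g. "thread.name"
}
```
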
## Supporting legacy profiling formats

For existing profilers, we will provide instructions on how profiles in these
legacy formats can be emitted in a manner that makes them compatible with
OpenTelemetry’s approach and enables telemetry data correlation.

In particular, for popular profilers such as the ones native to Go (pprof) and
Java (JFR), we will help them produce OpenTelemetry-compatible profiles with
minimal overhead, as sketched below.
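
As a small, hedged sketch of what such a conversion can look like in practice,
the program below reads a Go pprof file with the
`github.com/google/pprof/profile` package and re-emits each sample as a folded
("collapsed") stack plus its first value. The input file name and the folded
output are illustrative stand-ins for whatever OpenTelemetry-compatible
representation is eventually specified.

```go
// Reads a pprof profile and prints each sample as a folded stack and value,
// illustrating that legacy formats can be mapped into a common shape.
package main

import (
	"fmt"
	"log"
	"os"
	"strings"

	"github.com/google/pprof/profile"
)

func main() {
	f, err := os.Open("cpu.pprof") // illustrative input file name
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	p, err := profile.Parse(f) // handles both gzipped and raw pprof input
	if err != nil {
		log.Fatal(err)
	}

	for _, s := range p.Sample {
		var frames []string
		// pprof stores locations (and inlined lines) leaf-first; walk them in
		// reverse so the folded stack reads root-first.
		for i := len(s.Location) - 1; i >= 0; i-- {
			lines := s.Location[i].Line
			for j := len(lines) - 1; j >= 0; j-- {
				if lines[j].Function != nil {
					frames = append(frames, lines[j].Function.Name)
				}
			}
		}
		fmt.Printf("%s %d\n", strings.Join(frames, ";"), s.Value[0])
	}
}
```

For JFR, an equivalent converter would walk its sampling events; the point is
only that each legacy format needs a well-defined, lossless mapping.
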
## Performance considerations

Profiling agents can be architected in a variety of ways, with reasonable
trade-offs made that may impact performance, completeness, accuracy, and so on.
Similarly, the manner in which such a profiler might produce or consume
OpenTelemetry-compatible data could vary significantly. As such, in our
standardization effort it is not feasible to be prescriptive on the matter of
resource usage for profilers.

However, the output of OpenTelemetry's standardization effort must take into
account that some existing profilers are designed to be low overhead and high
performance. For example, they may operate in a whole-datacenter, always-on
manner, and/or in environments where they must guarantee low CPU/RAM/network
usage. The standardization effort should take this into account and strive to
produce a format that is usable by profilers of this nature without sacrificing
their performance guarantees.

Similar to the other OpenTelemetry signals, we target production environments.
Thus, the profiling signal must be implementable with low overhead and must
conform to OpenTelemetry-wide requirements for runtime overhead, intrusiveness,
and wire data size.

## Promoting cloud-native best practices with profiling

The CNCF’s mission states: _Cloud native technologies empower organizations to
build and run scalable applications in modern, dynamic environments such as
public, private, and hybrid clouds._

We will have best-in-class support for profiles emitted in cloud native
environments (e.g. Kubernetes, serverless), including legacy applications
running in those environments. As we aim to achieve this goal, we will center
our efforts around making profiling applications resilient, manageable, and
observable. This is in line with the missions of the Cloud Native Computing
Foundation and OpenTelemetry, and will allow us to expand and leverage those
communities to further the respective missions.

## Profiling use cases

- Tracking resource utilization of an application over time to understand how
  code changes, hardware configuration changes, and ephemeral environmental
  issues influence performance
- Understanding what code is responsible for consuming resources (e.g. CPU, RAM,
  disk, network)
- Planning resource allotment for a group of services running in production
- Comparing profiles of different versions of code to understand how the code
  has improved or degraded over time
- Detecting frequently used and "dead" code in production
- Breaking a trace span down to code-level granularity (i.e. function call and
  line of code) to understand the performance of that particular unit
