Proposal: Expand and Improve Span Events #69
Changes from all commits
481ca8d
c9c2c69
baf5a64
5def710
22bf3f6
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,150 @@ | ||
# Typed Events Specification | ||
|
||
Add an enumerated subtype and subtype-specific data structures to the existing generic tracing `TimedEvent`. The initially supported subtypes would be `ANNOTATION` (attributes only), `MESSAGE_SENT`, `MESSAGE_RECEIVED`, and `ERROR`. | ||
|
||
## Motivation | ||
|
||
The current tracing span event specification does not provide adequate data structures for recording information from errors, faults and exceptions that affect the status of a span. A number of backend tracing systems (e.g. Stackdriver, X-Ray) support detailed error info. It is important for the OpenTelemetry API to support recording this information so operators can take full advantage of backend system capabilities. | ||
|
||
A high percentage of low-volume enterprise applications log all message content. In addition, even if there are no plans to log or otherwise store message content in production, it is often helpful to do so during the early stages of an application's development. Providing the ability to record message content will improve the utility and adoption of the OpenTelemetry API. It is anticipated that logging-only exporters may be the only exporters which ever use this data. | ||
|
||
## Explanation | ||
|
||
Resolving issues quickly is often aided by knowing exactly where in the code execution path the error occurred. In addition, it is often helpful to know the actual contents of local variables when edge cases trigger errors. | ||
|
||
However, this can result in a substantial amount of data. Often an exception is triggered by a different exception, which is in turn triggered by another exception. Much of the resulting stack traces is not helpful because it comes from framework methods which bear no relevance to the offending source code. | ||
|
||
Therefore the `ERROR` event provides the flexibility to record as much or as little detail as makes sense in a particular situation, with metadata to indicate what data was omitted. With each populated event linked to the current span, root cause analysis becomes easier and quicker. | ||
|
||
## Internal details | ||
|
||
This section specifies data format in Protocol Buffers for `TimedEvent` messages within the overall OTLP. It follows and expands on the flattened data structure in [#59 OTLP Trace Data Format](https://github.com/open-telemetry/oteps/pull/59). | ||
Review comment: The data format has evolved after OTEP acceptance. The current entity is named |
||
|
||
### Resource | ||
|
||
```protobuf | ||
// TimedEvent is a time-stamped annotation of the span, consisting of either | ||
// user-supplied key-value pairs, details of a message sent/received between Spans or | ||
// information about an error, fault or exception | ||
message TimedEvent { | ||
enum Type { | ||
// Unknown event type. | ||
TYPE_UNSPECIFIED = 0; | ||
// Contains only timestamp and attributes. | ||
ANNOTATION = 1; | ||
// Indicates a sent message. | ||
MESSAGE_SENT = 2; | ||
// Indicates a received message. | ||
MESSAGE_RECEIVED = 3; | ||
// Indicates an error, fault or exception occurred. | ||
ERROR = 4; | ||
} | ||
|
||
// The type of the event. Determines which of the type-specific fields | ||
// below are populated. | ||
Type type = 1; | ||
|
||
// time_unixnano is the time the event occurred. | ||
int64 time_unixnano = 2; | ||
|
||
// name is a user-supplied description of the event. | ||
string name = 3; | ||
|
||
// attributes is a collection of attribute key/value pairs on the event. | ||
repeated AttributeKeyValue attributes = 4; | ||
|
||
// dropped_attributes_count is the number of dropped attributes. If the value is 0, | ||
// then no attributes were dropped. | ||
int32 dropped_attributes_count = 5; | ||
|
||
//// Fields for use only by MESSAGE_SENT and MESSAGE_RECEIVED //// | ||
|
||
// An identifier for the MessageEvent's message that can be used to match | ||
// SENT and RECEIVED MessageEvents. For example, this field could | ||
// represent a sequence ID for a streaming RPC. It is recommended to be | ||
// unique within a Span. | ||
uint64 message_id = 6; | ||
|
||
// The number of uncompressed bytes sent or received. | ||
uint64 uncompressed_size = 7; | ||
|
||
// The number of compressed bytes sent or received. If zero, assumed to | ||
// be the same size as uncompressed. | ||
uint64 compressed_size = 8; | ||
|
||
// The content or body of the message. | ||
bytes message_content = 9; | ||
|
||
//// Fields for use only by ERROR //// | ||
|
||
message Exception { | ||
// Unique identifier within a parent span for the exception. | ||
bytes id = 1; | ||
// The exception message. | ||
string message = 2; | ||
// The exception class or type. | ||
string type = 3; | ||
// Exception ID of the exception's parent, that is, the exception that caused this exception. | ||
bytes cause = 4; | ||
// The stack. | ||
StackTrace stack = 5; | ||
} | ||
|
||
// Collection of exceptions which triggered the error or fault. | ||
Review comment: It's not clear what a "collection" means semantically. Usually exceptions are nested. Reply: Good point. I just copied the AWS X-Ray approach with parent ids denoting nesting. I will look at improving comprehensibility. |
||
repeated Exception exceptions = 10; | ||
|
||
// Method argument values in use when the error occurred. | ||
repeated AttributeKeyValue arguments = 11; | ||
Review comment: Shouldn't this be a part of StackFrame? I remember in Sentry for Python I could inspect the exception and look at local variables (not just method args) at every frame. Reply: Good idea to support by stack frame. In many cases, the exception may be caught and recorded a substantial distance from where it was thrown. In those cases, the only values available may be parameter values submitted by the external system. Those values need to be recorded also because oftentimes they prove valuable in diagnosing issues. |
||
|
||
} | ||
|
||
// The full details of a call stack. | ||
message StackTrace { | ||
// A single stack frame in a stack trace. | ||
message StackFrame { | ||
// The fully-qualified name that uniquely identifies the function or | ||
// method that is active in this frame. | ||
string function_name = 1; | ||
// The name of the source file where the function call appears. | ||
string file_name = 3; | ||
// The line number in `file_name` where the function call appears. | ||
int64 line_number = 4; | ||
// The column number where the function call appears, if available. | ||
// This is important in JavaScript because of its anonymous functions. | ||
int64 column_number = 5; | ||
// The binary module from where the code was loaded. | ||
string load_module = 6; | ||
// The version of the deployed source code. | ||
string source_version = 7; | ||
} | ||
|
||
// Stack frames in this call stack. | ||
repeated StackFrame frames = 1; | ||
// The number of stack frames that were dropped because there | ||
// were too many stack frames. | ||
// If this value is 0, then no stack frames were dropped. | ||
int32 dropped_frames_count = 2; | ||
|
||
// The hash ID is used to conserve network bandwidth for duplicate | ||
// stack traces within a single trace. | ||
// | ||
// Often multiple spans will have identical stack traces. | ||
// The first occurrence of a stack trace should contain both | ||
// `frames` and a value in `stack_trace_hash_id`. | ||
// | ||
// Subsequent spans within the same request can refer | ||
// to that stack trace by setting only `stack_trace_hash_id`. | ||
// | ||
uint64 stack_trace_hash_id = 3; | ||
} | ||
``` | ||
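
The relationship between the `exceptions` collection and nesting can be illustrated with a small sketch. It assumes, as in the AWS X-Ray approach mentioned in the review thread, that a nested exception chain is flattened into a list in which each record's `cause` holds the `id` of the exception that caused it, and an empty `cause` marks the root. The Python names below (`ExceptionRecord`, `flatten_exception_chain`, the two-byte ids) are hypothetical and only mirror the proposed fields; they are not part of any OpenTelemetry SDK.

```python
import traceback
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class StackFrame:
    function_name: str
    file_name: str
    line_number: int


@dataclass
class ExceptionRecord:
    id: bytes                      # unique within the parent span
    message: str
    type: str
    cause: bytes = b""             # id of the causing exception; empty for the root
    frames: List[StackFrame] = field(default_factory=list)
    dropped_frames_count: int = 0


def _next_in_chain(exc: BaseException) -> Optional[BaseException]:
    """Return the exception that caused `exc`, following Python's chaining rules."""
    if exc.__cause__ is not None:
        return exc.__cause__
    if exc.__suppress_context__:
        return None
    return exc.__context__


def _record_id(index: int) -> bytes:
    # Assumption: a position-based id is enough to be unique within one span.
    return index.to_bytes(2, "big")


def flatten_exception_chain(exc: BaseException, max_frames: int = 32) -> List[ExceptionRecord]:
    """Flatten a nested exception chain into a list linked by `cause` ids."""
    chain: List[BaseException] = []
    seen = set()
    current: Optional[BaseException] = exc
    while current is not None and id(current) not in seen:
        seen.add(id(current))      # guard against cyclic chains
        chain.append(current)
        current = _next_in_chain(current)

    records = []
    for index, item in enumerate(chain):
        summary = traceback.extract_tb(item.__traceback__)
        frames = [StackFrame(f.name, f.filename, f.lineno or 0) for f in summary]
        kept = frames[max(0, len(frames) - max_frames):]   # keep the innermost frames
        has_cause = index + 1 < len(chain)
        records.append(
            ExceptionRecord(
                id=_record_id(index),
                message=str(item),
                type=type(item).__name__,
                cause=_record_id(index + 1) if has_cause else b"",
                frames=kept,
                dropped_frames_count=len(frames) - len(kept),
            )
        )
    return records
```

With this layout the first record is the exception that was caught, and a reader can walk `cause` ids until reaching a record whose `cause` is empty. `dropped_frames_count` carries the "what was omitted" metadata described in the Explanation section.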
|
||
## Trade-offs and mitigations | ||
|
||
Exception info and message content can be substantial in size, which may impact performance as well as application memory usage. Exporters to backend systems which do not use this info can drop it to reduce network traffic. SDKs may also want to provide configuration flags that control whether this information is recorded when it is received from API calls. | ||
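
A minimal sketch of the exporter-side mitigation follows, assuming an exporter receives events as immutable records and strips the large fields before serialization. The `TimedEvent` dataclass and the `strip_payloads` function with its two flags are hypothetical stand-ins for whatever an SDK would actually expose.

```python
from dataclasses import dataclass, field, replace
from typing import Dict, Tuple


@dataclass(frozen=True)
class TimedEvent:
    type: str                                  # e.g. "MESSAGE_SENT" or "ERROR"
    name: str
    attributes: Dict[str, str] = field(default_factory=dict)
    uncompressed_size: int = 0
    message_content: bytes = b""               # potentially large
    exceptions: Tuple[object, ...] = ()        # potentially large
    arguments: Dict[str, str] = field(default_factory=dict)


def strip_payloads(
    event: TimedEvent,
    keep_message_content: bool = False,
    keep_error_details: bool = False,
) -> TimedEvent:
    """Return a copy of `event` without the fields a backend does not use."""
    changes = {}
    if not keep_message_content:
        changes["message_content"] = b""
    if not keep_error_details:
        changes["exceptions"] = ()
        changes["arguments"] = {}
    # Size counters and attributes are kept so the event remains useful.
    return replace(event, **changes)
```

An exporter for a backend that ignores message bodies could call `strip_payloads(event)` on every event, while a logging-only exporter would pass both flags as `True` and keep everything.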
|
||
## Prior art and alternatives | ||
|
||
The error reporting APIs of AWS X-Ray, Google Stackdriver Error Reporting and Rollbar were analyzed to determine the data supported by each. This specification includes structures for providing most or all of the data these systems support. The AWS X-Ray SDKs for various languages provide an example of how this data can be populated by SDK implementations. | ||
|
||
The annotation and sent/received message type data structures are from the OpenCensus data protocol. | ||
Review comment: When proposing an Events data model I believe it is important to consider a wider range of prior art, including legacy formats and protocols (e.g. SYSLOG RFC5424, SNMP, Windows Events, etc). In my opinion it should be a goal to unambiguously represent legacy formats. This is important in order to build tools that understand all telemetry data types and can correlate telemetry data generated in legacy formats with telemetry data generated according to newer standards. This will make the data model applicable to a wider range of real-world systems. |
Review comment: The choice of subtypes is not clear. Why this specific list? What is the reasoning? Do these subtypes unambiguously map to event subtypes commonly used by event generation, transmission and recording systems and protocols? What is the proposed mapping approach for subtypes that do not fit one of these?