Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Semantic conventions for when to record exceptions #521

Closed
wants to merge 3 commits into from
Closed
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
5 changes: 3 additions & 2 deletions specification/data-error-semantic-conversions.md
Original file line number Diff line number Diff line change
Expand Up @@ -3,7 +3,8 @@
This document defines semantic conventions for recording logs and errors.

## Recording an Error
Errors and exceptions SHOULD be recorded any time an unhandled exception or error leaves the boundary of a span.

An error event created by an API call SHOULD be associated with the current span from which the API call is made.
Errors and exceptions SHOULD be recorded any time an unhandled exception or error leaves the boundary of a span.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can only happen one time per span though? Then we can use span attributes instead of events as currently suggested in most PRs. I would agree with that approach 👍 We can add semantic conventions for unhandled errors and handled errors separately though.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not arguing, but why storing attributes in a span is better than storing same attributes in an event? It looks the same for me except that

  • there can be multiple events (errors/logs)
  • event has a time

Copy link
Member

@Oberon00 Oberon00 Mar 19, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is simpler/faster to process. If you want to know if your span has an error on the backend, you can check mySpan.hasAttribute('error.whatever') instead of any(e.name == 'error' for e in mySpan.events).
EDIT: On second thought, you could also have an hasEventNamed('error') method on your backend span, but that would still most likely have a more complex implementation than a hash map lookup.

Copy link
Author

@murphpdx murphpdx Mar 19, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While it's true we would not run into double errors when we auto instrument, a user can still call an API to record an error. Error attributes don't handle if the user calls the API twice on the same span or if they call it once when we auto instrumented an error on the same span.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel this tension very strongly; my personal preference has been one error recorded per span, but in go we went in the direction of having separate (potentially multiple) spanevents recorded, one per error. I still do feel that in general it is a useful optimization to flag the spans that have spanevent children indicating error, so that they're more easily searchable (rather than requiring parent HAVING child.error = true type search syntax)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still do feel that in general it is a useful optimization to flag the spans that have spanevent children indicating error, so that they're more easily searchable (rather than requiring parent HAVING child.error = true type search syntax)

@lizthegrey Do you mean marking parents of a failed child as failed?
This won't be doable since a child can end (and thus fail) after the parent has ended in async scenarious and at that time the parent might have been exported already (or is at least no longer accessible to the instrumentation). This would be an optimization the backend would have to take care of when ingesting and analyzing the tree of spans.
Furthermore, a parent can also gracefully handle failures, e.g., by retrying the failed child operation or falling back (to alternative/cached data, for example), and thus succeed as a whole despite failed child spans.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, it's possible that a parent span can succeed despite a child span failing; however, our current schema is using SpanEvents (not child Spans, and not storing the information on the span itself) to record error statuses. I guess that's what the overall status of the span in terms of gRPC code is for though.

Copy link

@vmihailenco vmihailenco Mar 19, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about nested spans and error propagating? E.g.

span1
    span2
        span3 <- error happens here and propagated up to span1

Should the same error be recorded 3 times in span3, span2, and span1? It seems wasteful, but it is also not clear how to prevent/deduplicate it...

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then the user will know if/where it was handled.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is worth including in the PR if this is your intention.

But:

  • it is not true in my case - errors tend to be recorded several times, but propagation stops at instrumentation boundaries (it is fixable for sure), not when error is handled
  • it is expensive and slightly annoying - especially if/when user stores backtraces

E.g. in Go call depth of 10-20 frames is pretty common and errors are usually handled at the very top level - duplicating error information with backtraces 10-20 times is a no go...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In e.g. Java the following situation is quite common:

  • span3 throws an exception, let us call it "low level error". E.g. that database call has failed
  • span2 catches it and wraps it into another exception object. This can be some sort of "business process" exception, e.g. "failed to update inventory"
  • span1 catches that and wraps it again into yet another exception object. This time more user-oriented one, e.g. "checkout failed".

One can argue that all three exceptions should be recorded and propagated to the backend. And all three spans should be marked as failed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While I agree mostly with this spec, I would prefer to also specify:

Errors and exceptions MUST NOT be recorded unless an unhandled exception or error leaves the boundary of a span.

If we want to support any such situations, we would need to add something to distinguish them from unhandled (in the above sense) execptions, probably a simplified variant of what is suggested in #747.


An error event created by an API call SHOULD be associated with the current span from which the API call is made.