Skip to content
This repository was archived by the owner on Aug 30, 2022. It is now read-only.

Fix deadlock if steady_clock::now() returns the same value twice #132

Merged
merged 2 commits into from
Sep 18, 2018

Conversation

tudor
Copy link
Contributor

@tudor tudor commented Sep 15, 2018

Span::isFinished() returns true iff the span's duration is zero. So, if steady_clock::now() returns the same value at span creation as at Finish() (unlikely, but possible, especially in virtualized environments), the following scenario can happen:

  • Span::FinishWithOptions sets _duration to zero
  • and then calls reportSpan, which calls RemoteReporter::report
  • which acquires the reporter mutex and
  • adds the span to the reporter's _queue

Later:

  • RemoteReporter::sweepQueue acquires the reporter mutex
  • RemoteReporter::sweepQueue makes a copy of the span:
    const auto span = _queue.front();
  • The span is serialized in RemoteReporter::sendSpan
  • The span (the copy made above) needs to be destroyed because the block in sweepQueue ends.
  • The span's destructor is called, which calls FinishWithOptions again
  • ... which sees that _duration is zero, so isFinished() returns false:
    if (isFinished()) {
  • so it tries to call reportSpan again
  • which tries to acquire the reporter's mutex
  • but we're holding the lock in sweepQueue, so
  • deadlock.

Relevant stack trace:

(gdb) bt
 #0  __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
 #1  0x00007fb09c722023 in __GI___pthread_mutex_lock (mutex=0x7fb088b1c1c0) at ../nptl/pthread_mutex_lock.c:78
 #2  0x00007fb08860987b in jaegertracing::reporters::RemoteReporter::report(jaegertracing::Span const&) () from /usr/local/lib/libjaegertracing_plugin.so
 #3  0x00007fb0885b6506 in jaegertracing::Span::FinishWithOptions(opentracing::v2::FinishSpanOptions const&) [clone .localalias.361] () from /usr/local/lib/libjaegertracing_plugin.so
 #4  0x00007fb08860849a in jaegertracing::Span::~Span() [clone .constprop.253] () from /usr/local/lib/libjaegertracing_plugin.so
 #5  0x00007fb08860a8fa in jaegertracing::reporters::RemoteReporter::sweepQueue() () from /usr/local/lib/libjaegertracing_plugin.so
 #6  0x00007fb0886f2f5f in execute_native_thread_routine () from /usr/local/lib/libjaegertracing_plugin.so
 #7  0x00007fb09c71f6db in start_thread (arg=0x7fb087574700) at pthread_create.c:463
 #8  0x00007fb0989ae88f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Fixed by enforcing a minimum span duration of 1 SteadyClock tick.

Signed-off-by: Tudor Bosman tudor@rockset.com

Which problem is this PR solving?

Short description of the changes

  • Fix deadlock when span duration is zero by enforcing a minimum span duration of 1 tick

`Span::isFinished()` returns true iff the span's duration is zero.
So, if `steady_clock::now()` returns the same value at span creation as
at `Finish()` (unlikely, but possible, especially in virtualized
environments), the following scenario can happen:

- `Span::FinishWithOptions` sets `_duration` to zero
- and then calls `reportSpan`, which calls `RemoteReporter::report`
- which acquires the reporter mutex and
- adds the span to the reporter's `_queue`

Later:
- `RemoteReporter::sweepQueue` acquires the reporter mutex
- `RemoteReporter::sweepQueue` makes a copy of the span
(https://github.com/jaegertracing/jaeger-client-cpp/blob/7e9a135b362c47f7e4d42e0693b05b91a70e6888/src/jaegertracing/reporters/RemoteReporter.cpp#L95)
- The span is serialized in `RemoteReporter::sendSpan`
- The span (the copy made above) needs to be destroyed because the block in `sweepQueue` ends.
- The span's destructor is called, which calls `FinishWithOptions` again
- ... which sees that `_duration` is zero, so `isFinished()` returns
false:
https://github.com/jaegertracing/jaeger-client-cpp/blob/7e9a135b362c47f7e4d42e0693b05b91a70e6888/src/jaegertracing/Span.cpp#L94
- so it tries to call `reportSpan` again
- which tries to acquire the reporter's mutex
- but we're holding the lock in `sweepQueue`, so
- deadlock.

Relevant stack trace:
```
(gdb) bt
(gdb) bt
 #0  __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
 jaegertracing#1  0x00007fb09c722023 in __GI___pthread_mutex_lock (mutex=0x7fb088b1c1c0) at ../nptl/pthread_mutex_lock.c:78
 jaegertracing#2  0x00007fb08860987b in jaegertracing::reporters::RemoteReporter::report(jaegertracing::Span const&) () from /usr/local/lib/libjaegertracing_plugin.so
 jaegertracing#3  0x00007fb0885b6506 in jaegertracing::Span::FinishWithOptions(opentracing::v2::FinishSpanOptions const&) [clone .localalias.361] () from /usr/local/lib/libjaegertracing_plugin.so
 jaegertracing#4  0x00007fb08860849a in jaegertracing::Span::~Span() [clone .constprop.253] () from /usr/local/lib/libjaegertracing_plugin.so
 jaegertracing#5  0x00007fb08860a8fa in jaegertracing::reporters::RemoteReporter::sweepQueue() () from /usr/local/lib/libjaegertracing_plugin.so
 jaegertracing#6  0x00007fb0886f2f5f in execute_native_thread_routine () from /usr/local/lib/libjaegertracing_plugin.so
 jaegertracing#7  0x00007fb09c71f6db in start_thread (arg=0x7fb087574700) at pthread_create.c:463
 jaegertracing#8  0x00007fb0989ae88f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
```

Fixed by enforcing a minimum span duration of 1 `SteadyClock` tick.

Signed-off-by: Tudor Bosman <tudor@rockset.com>
@tudor
Copy link
Contributor Author

tudor commented Sep 15, 2018

CI fails in the same way as #130. That test is flaky.

[ RUN      ]Reporter.testRemoteReporter
/home/travis/build/jaegertracing/jaeger-client-cpp/src/jaegertracing/reporters/ReporterTest.cpp:80: Failure
Expected equality of these values:
  i + 1
    Which is: 59
  spans.size()
    Which is: 58
[  FAILED  ]Reporter.testRemoteReporter (277 ms)

@isaachier
Copy link
Contributor

CI failure is a real issue I need to fix. Just added #134. The deadlock actually seems to be related to #127. See issue there and let me know if you still think this is the answer.

@tudor
Copy link
Contributor Author

tudor commented Sep 17, 2018

The deadlock actually seems to be related to #127. See issue there and let me know if you still think this is the answer.

#127 has to do with destroying the Tracer object (see Tracer::~Tracer on the stacks there), which is not happening here. In our case (in production), the tracer is global, created at process startup, and never deleted until the process exits.

These are two separate issues.

@@ -96,6 +96,12 @@ void Span::FinishWithOptions(
return;
}
_duration = finishTimeSteady - _startTimeSteady;
if (_duration <= SteadyClock::duration()) {
// Enfoce minimum duration of 1 tick (1ns on Linux),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for nit, but typo Enforce.

Signed-off-by: Tudor Bosman <tudor@rockset.com>
@tudor
Copy link
Contributor Author

tudor commented Sep 17, 2018

Fixed the typo. PTAL?

@tudor
Copy link
Contributor Author

tudor commented Sep 17, 2018

Thanks! Are you planning to cut a new release any time soon? (I'm okay using the master branch as well, just curious.)

@isaachier
Copy link
Contributor

I won't stop you if you'd like to release a minor version. See RELEASE.md for how to do it. I can build the plugin or do the entire release, but I'd like to see others take a stab at it.

@isaachier isaachier merged commit ffdd052 into jaegertracing:master Sep 18, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants