This repository was archived by the owner on Aug 30, 2022. It is now read-only.
-
Notifications
You must be signed in to change notification settings - Fork 127
Overflow bug in ProbabilisticSampler. #6
Labels
Comments
Thanks for reporting. cc @isaachier |
Interesting. Will submit fix. Thanks! |
@rnburn let me know what you think of @black-adder's logic here: jaegertracing/jaeger-client-go#229. |
Merged
Not sure why go only uses 63-bit trace ids, but #13 gives you a way to fix the bug while still keeping the same type of sampling logic. |
because that's what Go stdlib provided in the past, |
tudor
added a commit
to rockset/jaeger-client-cpp
that referenced
this issue
Sep 15, 2018
`Span::isFinished()` returns true iff the span's duration is zero. So, if `steady_clock::now()` returns the same value at span creation as at `Finish()` (unlikely, but possible, especially in virtualized environments), the following scenario can happen: - `Span::FinishWithOptions` sets `_duration` to zero - and then calls `reportSpan`, which calls `RemoteReporter::report` - which acquires the reporter mutex and - adds the span to the reporter's `_queue` Later: - `RemoteReporter::sweepQueue` acquires the reporter mutex - `RemoteReporter::sweepQueue` makes a copy of the span (https://github.com/jaegertracing/jaeger-client-cpp/blob/7e9a135b362c47f7e4d42e0693b05b91a70e6888/src/jaegertracing/reporters/RemoteReporter.cpp#L95) - The span is serialized in `RemoteReporter::sendSpan` - The span (the copy made above) needs to be destroyed because the block in `sweepQueue` ends. - The span's destructor is called, which calls `FinishWithOptions` again - ... which sees that `_duration` is zero, so `isFinished()` returns false: https://github.com/jaegertracing/jaeger-client-cpp/blob/7e9a135b362c47f7e4d42e0693b05b91a70e6888/src/jaegertracing/Span.cpp#L94 - so it tries to call `reportSpan` again - which tries to acquire the reporter's mutex - but we're holding the lock in `sweepQueue`, so - deadlock. Relevant stack trace: ``` (gdb) bt (gdb) bt #0 __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135 jaegertracing#1 0x00007fb09c722023 in __GI___pthread_mutex_lock (mutex=0x7fb088b1c1c0) at ../nptl/pthread_mutex_lock.c:78 jaegertracing#2 0x00007fb08860987b in jaegertracing::reporters::RemoteReporter::report(jaegertracing::Span const&) () from /usr/local/lib/libjaegertracing_plugin.so jaegertracing#3 0x00007fb0885b6506 in jaegertracing::Span::FinishWithOptions(opentracing::v2::FinishSpanOptions const&) [clone .localalias.361] () from /usr/local/lib/libjaegertracing_plugin.so jaegertracing#4 0x00007fb08860849a in jaegertracing::Span::~Span() [clone .constprop.253] () from /usr/local/lib/libjaegertracing_plugin.so jaegertracing#5 0x00007fb08860a8fa in jaegertracing::reporters::RemoteReporter::sweepQueue() () from /usr/local/lib/libjaegertracing_plugin.so jaegertracing#6 0x00007fb0886f2f5f in execute_native_thread_routine () from /usr/local/lib/libjaegertracing_plugin.so jaegertracing#7 0x00007fb09c71f6db in start_thread (arg=0x7fb087574700) at pthread_create.c:463 jaegertracing#8 0x00007fb0989ae88f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95 ``` Fixed by enforcing a minimum span duration of 1 `SteadyClock` tick. Signed-off-by: Tudor Bosman <tudor@rockset.com>
isaachier
pushed a commit
that referenced
this issue
Sep 18, 2018
* Fix deadlock if steady_clock::now() returns the same value twice `Span::isFinished()` returns true iff the span's duration is zero. So, if `steady_clock::now()` returns the same value at span creation as at `Finish()` (unlikely, but possible, especially in virtualized environments), the following scenario can happen: - `Span::FinishWithOptions` sets `_duration` to zero - and then calls `reportSpan`, which calls `RemoteReporter::report` - which acquires the reporter mutex and - adds the span to the reporter's `_queue` Later: - `RemoteReporter::sweepQueue` acquires the reporter mutex - `RemoteReporter::sweepQueue` makes a copy of the span (https://github.com/jaegertracing/jaeger-client-cpp/blob/7e9a135b362c47f7e4d42e0693b05b91a70e6888/src/jaegertracing/reporters/RemoteReporter.cpp#L95) - The span is serialized in `RemoteReporter::sendSpan` - The span (the copy made above) needs to be destroyed because the block in `sweepQueue` ends. - The span's destructor is called, which calls `FinishWithOptions` again - ... which sees that `_duration` is zero, so `isFinished()` returns false: https://github.com/jaegertracing/jaeger-client-cpp/blob/7e9a135b362c47f7e4d42e0693b05b91a70e6888/src/jaegertracing/Span.cpp#L94 - so it tries to call `reportSpan` again - which tries to acquire the reporter's mutex - but we're holding the lock in `sweepQueue`, so - deadlock. Relevant stack trace: ``` (gdb) bt (gdb) bt #0 __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135 #1 0x00007fb09c722023 in __GI___pthread_mutex_lock (mutex=0x7fb088b1c1c0) at ../nptl/pthread_mutex_lock.c:78 #2 0x00007fb08860987b in jaegertracing::reporters::RemoteReporter::report(jaegertracing::Span const&) () from /usr/local/lib/libjaegertracing_plugin.so #3 0x00007fb0885b6506 in jaegertracing::Span::FinishWithOptions(opentracing::v2::FinishSpanOptions const&) [clone .localalias.361] () from /usr/local/lib/libjaegertracing_plugin.so #4 0x00007fb08860849a in jaegertracing::Span::~Span() [clone .constprop.253] () from /usr/local/lib/libjaegertracing_plugin.so #5 0x00007fb08860a8fa in jaegertracing::reporters::RemoteReporter::sweepQueue() () from /usr/local/lib/libjaegertracing_plugin.so #6 0x00007fb0886f2f5f in execute_native_thread_routine () from /usr/local/lib/libjaegertracing_plugin.so #7 0x00007fb09c71f6db in start_thread (arg=0x7fb087574700) at pthread_create.c:463 #8 0x00007fb0989ae88f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95 ``` Fixed by enforcing a minimum span duration of 1 `SteadyClock` tick. Signed-off-by: Tudor Bosman <tudor@rockset.com> * fix typo Signed-off-by: Tudor Bosman <tudor@rockset.com>
Sign up for free
to subscribe to this conversation on GitHub.
Already have an account?
Sign in.
The ProbabilisticSampler uses the following logic to determine a sampling bound:
kMaxRandomNumber * _samplingRate
gets converted to a double which rounds to a higher value thankMaxRandomNumber
so that when it's casted back to a uint64_t it overflows to zero and nothing is sampled.This short program will demonstrate the problem
It outputs
The text was updated successfully, but these errors were encountered: