This repository was archived by the owner on Aug 30, 2022. It is now read-only.

Overflow bug in ProbabilisticSampler. #6

Closed
rnburn opened this issue Nov 14, 2017 · 5 comments

@rnburn
Contributor

rnburn commented Nov 14, 2017

The ProbabilisticSampler uses the following logic to determine a sampling bound:

```cpp
    explicit ProbabilisticSampler(double samplingRate)
        : _samplingRate(std::max(0.0, std::min(samplingRate, 1.0)))
        , _samplingBoundary(
              static_cast<uint64_t>(kMaxRandomNumber * _samplingRate))
        , _tags({ { kSamplerTypeTagKey, kSamplerTypeProbabilistic },
                  { kSamplerParamTagKey, _samplingRate } })
    {
    }
```

`kMaxRandomNumber * _samplingRate` is evaluated in double precision: `kMaxRandomNumber` is first converted to a double, which rounds up to a value greater than `kMaxRandomNumber` (2^64), so when the product is cast back to `uint64_t` it overflows to zero and nothing is sampled.

This short program demonstrates the problem:

```cpp
#include <cstdint>
#include <iostream>
#include <limits>

int main() {
  // The largest uint64_t value has no exact double representation.
  constexpr auto max = std::numeric_limits<uint64_t>::max();
  std::cout << max << "\n";
  // Multiplying by 1.0 converts it to double, which rounds up to 2^64.
  auto t = max * 1.0;
  std::cout << t << "\n";
  // 2^64 is out of range for uint64_t, so the cast back yields 0 here.
  auto x = static_cast<uint64_t>(t);
  std::cout << x << "\n";
  return 0;
}
```

It outputs:

```
18446744073709551615
1.84467e+19
0
```
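One way to sidestep the rounding, assuming `kMaxRandomNumber` is `std::numeric_limits<uint64_t>::max()` as the snippet suggests, is to special-case a sampling rate of 1.0 so the boundary never goes through a lossy uint64_t-to-double round trip. This is only a sketch, not necessarily the eventual fix:

```cpp
#include <algorithm>
#include <cstdint>
#include <limits>

// Illustrative only; assumes kMaxRandomNumber == UINT64_MAX as in the snippet above.
constexpr uint64_t kMaxRandomNumber = std::numeric_limits<uint64_t>::max();

uint64_t computeSamplingBoundary(double samplingRate)
{
    const double rate = std::max(0.0, std::min(samplingRate, 1.0));
    if (rate >= 1.0) {
        // Avoid the lossy uint64_t -> double -> uint64_t round trip entirely.
        return kMaxRandomNumber;
    }
    // For rate < 1.0 the product stays below 2^64, so the cast cannot wrap to 0.
    return static_cast<uint64_t>(static_cast<double>(kMaxRandomNumber) * rate);
}
```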
@yurishkuro
Member

yurishkuro commented Nov 14, 2017

Thanks for reporting. cc @isaachier

yurishkuro added the bug label Nov 14, 2017
@isaachier
Contributor

Interesting. Will submit fix. Thanks!

@isaachier
Contributor

@rnburn let me know what you think of @black-adder's logic here: jaegertracing/jaeger-client-go#229.

@rnburn
Contributor Author

rnburn commented Nov 17, 2017

Not sure why Go only uses 63-bit trace IDs, but #13 gives you a way to fix the bug while still keeping the same type of sampling logic.

@yurishkuro
Member

> Not sure why Go only uses 63-bit trace IDs

Because that's what the Go stdlib provided in the past: `Int63`. I can see that `Uint64` was added later; not sure from which version.
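Whatever the historical reason, a 63-bit random value has a convenient numeric side effect: 2^63 is exactly representable as a double, so the boundary computation can never round above the maximum and the cast back to an integer stays in range. A rough sketch of that idea (illustrative names, not necessarily the actual jaeger-client-go#229 change):

```cpp
#include <cstdint>

// Illustrative only: with a 63-bit random value the maximum boundary is 2^63,
// which is exactly representable as a double, so boundary = 2^63 * 1.0 casts
// back to uint64_t without wrapping (unlike 2^64 - 1, which rounds up to 2^64).
constexpr uint64_t kMax63 = uint64_t{1} << 63;

uint64_t samplingBoundary63(double rate)  // rate assumed already clamped to [0.0, 1.0]
{
    return static_cast<uint64_t>(static_cast<double>(kMax63) * rate);
}

bool shouldSample(uint64_t randomValue, double rate)
{
    // Compare only the low 63 bits against the boundary.
    return (randomValue & (kMax63 - 1)) < samplingBoundary63(rate);
}
```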

tudor added a commit to rockset/jaeger-client-cpp that referenced this issue Sep 15, 2018
`Span::isFinished()` returns true iff the span's duration is non-zero.
So, if `steady_clock::now()` returns the same value at span creation as
at `Finish()` (unlikely, but possible, especially in virtualized
environments), the following scenario can happen:

- `Span::FinishWithOptions` sets `_duration` to zero
- and then calls `reportSpan`, which calls `RemoteReporter::report`
- which acquires the reporter mutex and
- adds the span to the reporter's `_queue`

Later:
- `RemoteReporter::sweepQueue` acquires the reporter mutex
- `RemoteReporter::sweepQueue` makes a copy of the span
(https://github.com/jaegertracing/jaeger-client-cpp/blob/7e9a135b362c47f7e4d42e0693b05b91a70e6888/src/jaegertracing/reporters/RemoteReporter.cpp#L95)
- The span is serialized in `RemoteReporter::sendSpan`
- The span (the copy made above) needs to be destroyed because the block in `sweepQueue` ends.
- The span's destructor is called, which calls `FinishWithOptions` again
- ... which sees that `_duration` is zero, so `isFinished()` returns
false:
https://github.com/jaegertracing/jaeger-client-cpp/blob/7e9a135b362c47f7e4d42e0693b05b91a70e6888/src/jaegertracing/Span.cpp#L94
- so it tries to call `reportSpan` again
- which tries to acquire the reporter's mutex
- but we're holding the lock in `sweepQueue`, so
- deadlock.

Relevant stack trace:
```
(gdb) bt
 #0  __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
 #1  0x00007fb09c722023 in __GI___pthread_mutex_lock (mutex=0x7fb088b1c1c0) at ../nptl/pthread_mutex_lock.c:78
 #2  0x00007fb08860987b in jaegertracing::reporters::RemoteReporter::report(jaegertracing::Span const&) () from /usr/local/lib/libjaegertracing_plugin.so
 #3  0x00007fb0885b6506 in jaegertracing::Span::FinishWithOptions(opentracing::v2::FinishSpanOptions const&) [clone .localalias.361] () from /usr/local/lib/libjaegertracing_plugin.so
 #4  0x00007fb08860849a in jaegertracing::Span::~Span() [clone .constprop.253] () from /usr/local/lib/libjaegertracing_plugin.so
 #5  0x00007fb08860a8fa in jaegertracing::reporters::RemoteReporter::sweepQueue() () from /usr/local/lib/libjaegertracing_plugin.so
 #6  0x00007fb0886f2f5f in execute_native_thread_routine () from /usr/local/lib/libjaegertracing_plugin.so
 #7  0x00007fb09c71f6db in start_thread (arg=0x7fb087574700) at pthread_create.c:463
 #8  0x00007fb0989ae88f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
```

Fixed by enforcing a minimum span duration of 1 `SteadyClock` tick.

Signed-off-by: Tudor Bosman <tudor@rockset.com>
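A minimal sketch of that kind of guard, assuming the span records a `SteadyClock` start time and computes `_duration` when it is finished (the helper below is hypothetical, not the exact patch):

```cpp
#include <algorithm>
#include <chrono>

using SteadyClock = std::chrono::steady_clock;

// Illustrative guard: never let a finished span keep a zero duration, so a
// second FinishWithOptions() call (e.g. from the destructor of a queued copy)
// sees isFinished() == true and does not re-enter the reporter.
SteadyClock::duration clampedDuration(SteadyClock::time_point start,
                                      SteadyClock::time_point finish)
{
    return std::max(finish - start, SteadyClock::duration(1));
}
```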
isaachier pushed a commit that referenced this issue Sep 18, 2018
* Fix deadlock if steady_clock::now() returns the same value twice

* fix typo

Signed-off-by: Tudor Bosman <tudor@rockset.com>