Fix the accuracy problem of allclose op when using float64 data type #27891
Conversation
Thanks for your contribution!
T operator()(const framework::Tensor& tensor) const {
  const T* data = tensor.data<T>();
  T value;
  cudaMemcpy(&value, data, sizeof(T), cudaMemcpyDeviceToHost);
Prefer to use memory::Copy(platform::CPUPlace(), &value, gpu_place, data, sizeof(T), dev_ctx.stream());
Done.
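For reference, a minimal sketch of the suggested pattern (the functor name, the NaN-free context, and the `BOOST_GET_CONST` place retrieval are assumptions about the surrounding code; only the `memory::Copy` call itself comes from the reviewer's snippet):

```cpp
// Sketch only: stream-aware device-to-host copy via memory::Copy instead
// of a raw cudaMemcpy. dev_ctx and a GPU-resident tensor are assumed.
template <typename T>
struct GetTensorValue {
  T operator()(const platform::CUDADeviceContext& dev_ctx,
               const framework::Tensor& tensor) const {
    const T* data = tensor.data<T>();
    T value;
    const auto gpu_place =
        BOOST_GET_CONST(platform::CUDAPlace, tensor.place());
    memory::Copy(platform::CPUPlace(), &value, gpu_place, data, sizeof(T),
                 dev_ctx.stream());
    return value;
  }
};
```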
auto* in_a = in.data<T>();
auto* in_b = other.data<T>();
Use better variable names instead.
Done.
} else {
  T left = (a > b ? a - b : b - a);
  T right = atol + (b > 0 ? rtol * b : (-rtol) * b);
  T dif = (left > right ? left - right : right - left);
Use `diff` instead.
Done.
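Putting the thread together, the per-element check after the rename looks roughly like this (a sketch; the NaN branch is an assumption about the surrounding code, and the `1e-15` guard is the fix described in the PR summary at the bottom):

```cpp
if (isnan(a) || isnan(b)) {
  val = equal_nan && isnan(a) == isnan(b);
} else {
  T left = (a > b ? a - b : b - a);                   // |a - b|
  T right = atol + (b > 0 ? rtol * b : (-rtol) * b);  // atol + rtol * |b|
  T diff = (left > right ? left - right : right - left);
  // a == b short-circuits exact matches; diff <= 1e-15 absorbs fp64
  // representation error when left and right are essentially equal.
  val = a == b || left <= right || diff <= 1e-15;
}
```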
  }
};

template struct AllcloseFunctor<platform::CPUDeviceContext, double>;
Not needed here.
Done.
} else {
  T left = (a > b ? a - b : b - a);
  T right = atol + (b > 0 ? rtol * b : (-rtol) * b);
  T dif = (left > right ? left - right : right - left);
Use `diff` instead.
Done.
atomicAnd(reinterpret_cast<int*>(&val_), static_cast<int>(val));
__syncthreads();
if (tid == 0) {
  *out_data = static_cast<bool>(val_);
Here `static_cast` is not needed.
Already used parallel reduction here.
};

template <typename T>
__global__ void AllcloseCUDAKernel(const T* in_a, const T* in_b,
The performance here is too poor.
Already used parallel reduction here.
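For readers following along, here is an illustrative shape of a block-level parallel reduction for this kind of all-of check (not the exact kernel merged in this PR; the block size, kernel name, and the pre-initialized `out_data` are assumptions):

```cuda
template <typename T>
__global__ void AllcloseReduceKernel(const T* in_a, const T* in_b,
                                     double rtol, double atol, bool equal_nan,
                                     int num, bool* out_data) {
  __shared__ bool ok_shared[256];  // assumes blockDim.x == 256
  int tid = threadIdx.x;
  bool ok = true;
  // Grid-stride loop: each thread folds its slice into one local flag.
  for (int i = blockIdx.x * blockDim.x + tid; i < num;
       i += gridDim.x * blockDim.x) {
    double a = static_cast<double>(in_a[i]);
    double b = static_cast<double>(in_b[i]);
    bool v;
    if (isnan(a) || isnan(b)) {
      v = equal_nan && isnan(a) == isnan(b);
    } else {
      double left = fabs(a - b);
      double right = atol + rtol * fabs(b);
      v = a == b || left <= right || fabs(left - right) <= 1e-15;
    }
    ok = ok && v;
  }
  ok_shared[tid] = ok;
  __syncthreads();
  // Tree reduction of the per-thread flags within the block.
  for (int s = blockDim.x / 2; s > 0; s >>= 1) {
    if (tid < s) ok_shared[tid] = ok_shared[tid] && ok_shared[tid + s];
    __syncthreads();
  }
  // out_data is assumed pre-initialized to true by the caller.
  if (tid == 0 && !ok_shared[0]) *out_data = false;
}
```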
int grid = 1;
int block = in_dims;
Think it over and replace this unreasonable code.
Done.
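A conventional replacement is to fix the block size and derive the grid from the element count, e.g. (a sketch; the constants and the cap are illustrative, and the kernel name follows the reduction sketch above):

```cpp
constexpr int kBlockDim = 256;
int block = kBlockDim;
// Enough blocks to cover the input; the grid cap is fine because the
// kernel's grid-stride loop picks up any remainder.
int grid = std::min((num + block - 1) / block, 1024);
AllcloseReduceKernel<T><<<grid, block, 0, dev_ctx.stream()>>>(
    in_a, in_b, rtol, atol, equal_nan, num, out_data);
```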
@@ -22,19 +22,20 @@ class TestAllcloseOp(OpTest):
    def set_args(self):
Done.
Force-pushed from 3570913 to 53ee2cb
Considering that allclose is almost never used in inference, this upgrade does not take backward compatibility into account here for now.
Fixed a bug in cuda kernel which cannot deal with large dimension input, and added an unittest for it.
Force-pushed from afdd373 to fbc7d20
class TestAllcloseOpFloat64(TestAllcloseOp):
    def set_args(self):
        self.input = np.array([10.1]).astype("float64")
        self.other = np.array([10]).astype("float64")
        self.rtol = np.array([0.01]).astype("float64")
        self.atol = np.array([0]).astype("float64")
        self.equal_nan = False
Add the same unit test for `float32`.
Done.
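The added case presumably mirrors the float64 test above with float32 inputs, e.g. (a sketch; the class name is illustrative, and rtol/atol stay float64 per the fix described in the PR summary below):

```python
class TestAllcloseOpFloat32(TestAllcloseOp):
    def set_args(self):
        self.input = np.array([10.1]).astype("float32")
        self.other = np.array([10]).astype("float32")
        self.rtol = np.array([0.01]).astype("float64")
        self.atol = np.array([0]).astype("float64")
        self.equal_nan = False
```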
  }
};

template struct AllcloseFunctor<platform::CUDADeviceContext, double>;
Remove this line.
Done.
LGTM.
LGTM
* Still has bugs.
* Fixed allclose_op bug, which cannot deal with some cases of fp64 inputs.
* improved CUDA kernel performance.
* Changed CUDA code.
* Fixed a bug in cuda kernel which cannot deal with large dimension input, and added an unittest for it.
* Add a test case for float32 input.
PR types
Bug fixes
PR changes
OPs
Describe
This PR fixes a bug in allclose_op, which could not produce the expected output in some cases when fp64 is used as input.
Bug reproduction:
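The original reproducing snippet did not survive extraction; below is a minimal sketch consistent with the unit-test values above (the exact Paddle 2.0 API calls here are assumptions, not the author's original code):

```python
import numpy as np
import paddle

paddle.disable_static()
x = paddle.to_tensor(np.array([10.1]).astype("float64"))
y = paddle.to_tensor(np.array([10.0]).astype("float64"))
# |10.1 - 10| ~= 0.0999999999999996 <= 0 + 0.01 * 10, so this should print
# True; before the fix, rtol/atol were held as float32 and the comparison
# could fail for fp64 inputs.
print(paddle.allclose(x, y, rtol=0.01, atol=0.0))
```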
This result is expected to be True, but the op returns False.
Problem reason
Floating-point numbers cannot be compared for equality directly, since most decimal values have no exact binary representation. For example, 0.1 may actually be stored as something like 0.09999... or 0.100001. So when two values both written as "0.1" are compared, the computation may effectively test whether 0.09999... equals 0.100001, which is how the false result arises.
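A concrete instance of this in double precision, using the values from the test case above:

```python
>>> 10.1 - 10
0.09999999999999964
>>> (10.1 - 10) == 0.1
False
```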
Solving approach:
Add an extremely small tolerance (1e-15) when determining whether two double variables are equal.
Change the data type of rtol and atol from float32 to float64, since the precision of rtol and atol also affects the final result.