Use Halley method iterations in cuberoot function #544
Codecov Report: all modified and coverable lines are covered by tests ✅

```
@@           Coverage Diff            @@
##           dev/gfdl     #544    +/-  ##
============================================
- Coverage    37.20%   37.19%    -0.01%
============================================
  Files          271      271
  Lines        80355    80346        -9
  Branches     14985    14982        -3
============================================
- Hits         29894    29887        -7
+ Misses       44903    44902        -1
+ Partials      5558     5557        -1
```

View full report in Codecov by Sentry.
This appears to be an improvement in computational performance, with a more questionable impact on accuracy. I compared the various methods to the quadrature float (…)
The orange/blue difference in the "divisionless" plots is due to removing all of the if-block convergence tests and simply letting the solvers run for a fixed number of iterations. This did not cause much change in accuracy, although I did have to tweak the number of iterations for values closer to 1/8. There is also a lot of fine structure in those curves, lost in the low-resolution image, but the general trends and bounds should be clear. To me, these are the main observations:
The next question to consider is performance. If I sweep over a large array (16M) in a vector-favorable way with
(Note: all solvers lead with a conditional "if" test for zero.) If you break it down, the Halley solver with the convergence tests and Newton correction blocks ends up being the fastest, although the straightforward Halley solver is close and has higher accuracy. If I turn on the optimization (…)
Most are not very optimization-friendly with the exception of the straightforward Newton iterator, which has a major speedup and is clearly the fastest. As for any lessons here:
I think this answers many of the accuracy and performance questions from earlier today, but it doesn't yet address the issue of platform independence. As for my recommendation: I think it is always worth betting on optimization, and I would go with standard Newton iteration. It is easy to read and easy to maintain. But if the decision is to go with Halley, then let's at least drop all of the convergence tests and the additional Newton iteration step. Sample code can be checked here: https://github.com/marshallward/cuberoot
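The standard Newton iteration recommended here can be sketched as follows. This is a Python illustration rather than the Fortran used in MOM6, and the first guess of 0.7 and the iteration count are assumptions for the reduced range, not values taken from the actual code.

```python
def cuberoot_newton(x, first_guess=0.7, n_iter=7):
    """Plain Newton iteration for r**3 = x, with a fixed iteration
    count and no convergence tests.  Assumes x has already been
    rescaled into [0.125, 1.0); guess and count are illustrative."""
    if x == 0.0:                  # all solvers lead with a zero test
        return 0.0
    r = first_guess
    for _ in range(n_iter):       # quadratic convergence per step
        r -= (r * r * r - x) / (3.0 * r * r)
    return r
```

The loop body is a single short dependency chain with no branches, which is part of why this variant responds so well to compiler optimization.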
Force-pushed from 2c227c9 to d125dcf (compare)
Thank you for the detailed performance analysis. This PR has now been modified to avoid any convergence testing and to do a fixed number of iterations (5 total: 3 of Halley and 2 of Newton) regardless of the input value. The final Newton's method iteration uses the correction form that polishes the root, and should give accuracy comparable to the best methods described in the performance analysis.
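The fixed-iteration scheme described above can be sketched as below. This is a hedged Python illustration, not the MOM6 Fortran: the first guess of 0.7 is an assumption, and only the structure (3 Halley steps, then 2 Newton steps in correction form, no convergence tests) follows the description.

```python
def cuberoot_fixed(x):
    """Fixed-count cube root: 3 Halley iterations then 2 Newton
    iterations, no convergence testing.  Assumes x is rescaled
    into [0.125, 1.0); the 0.7 first guess is illustrative."""
    if x == 0.0:
        return 0.0
    r = 0.7
    for _ in range(3):            # Halley: cubic convergence
        r3 = r * r * r
        r = r * (r3 + 2.0 * x) / (2.0 * r3 + x)
    for _ in range(2):            # Newton in correction form,
        r -= (r * r * r - x) / (3.0 * r * r)   # polishing the root
    return r
```

The Halley update for f(r) = r^3 - x reduces to r*(r^3 + 2x)/(2r^3 + x), which is what the loop implements.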
I tested out your unsimplified version. It had the added effect of reducing the number of Newton iterations from 7 to 6. I was not able to reduce the number of Halley iterations, due to poor convergence near 0.125. Here are the updated timings. (Sorry, different computer; the numbers are slower.)
Based on these numbers, I think we should forget the no-division solvers and just use either the standard Newton or Halley solvers, with the finalizing Newton step.
I bit my tongue, and I now support the "no-division" Halley solver! After some modifications to the various methods, they are all now producing errors on the order of the ULP (1e-16). The candidates are:
Versions of each code are still here: https://github.com/marshallward/cuberoot/blob/main/cubes.f90
I made various tweaks to these: reducing iterations as much as possible, and sometimes removing steps that are no longer needed, such as the numerator-denominator renormalization. Here are the errors; all are nice and tiny (though unfortunately above ULP). And here are the (new) speeds:
(Note that the earlier numbers had an unrealistic vectorization which does not appear in these numbers; also, different machine, etc.) The no-division Halley iteration (3 Halley + 1 Newton) is the clear winner here, with or without optimization. It even beats the GNU math library, although Intel still wallops us in performance. Why is the new version slower? Solely due to this term:
Replacing it with the old version:
will restore the performance. Also, these rescaled initial values:
have identical performance and accuracy to the original unscaled values
which are easier to identify as the first iteration. So perhaps switch them back. VERDICT: 3x Halley (no division) + 1x Newton (with division) is the best; let's do that!
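The "no-division" trick can be sketched as follows: the root is carried as a ratio num/den, the Halley update is applied to numerator and denominator separately with no division inside the loop, and a single division plus one Newton polish step finishes the job. This is an illustrative Python sketch under the same assumptions as before (argument rescaled into [0.125, 1.0), 0.7 first guess assumed); the actual cubes.f90 may differ in detail.

```python
def cuberoot_nodiv(x):
    """No-division Halley (3 iterations) plus one dividing Newton
    polish step.  With r = num/den, the Halley update
    r <- r*(r**3 + 2x)/(2*r**3 + x) becomes two multiply-only
    updates of num and den."""
    if x == 0.0:
        return 0.0
    num, den = 0.7, 1.0           # initial guess as a ratio
    for _ in range(3):
        n3 = num * num * num
        d3 = den * den * den
        num, den = (num * (n3 + 2.0 * x * d3),
                    den * (2.0 * n3 + x * d3))
    r = num / den                 # the only division so far
    return r - (r * r * r - x) / (3.0 * r * r)   # Newton polish
```

Note that num and den grow with each iteration (roughly to the fourth power), which is why the original versions carried a renormalization step; for only 3 iterations on the reduced range the magnitudes stay small enough to skip it.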
(All a mistake, see below)
Following up after talking with @Hallberg-NOAA
One other unrelated point
Given this new information:
Modified the cuberoot function to do 3 iterations with Halley's method starting with a first guess that balances the errors at the two ends of the range of the iterations, before a final iteration with Newton's method that polishes the root and gives a solution that is accurate to machine precision. Following on performance testing of the previous version, all convergence testing has been removed and the same number of iterations are applied regardless of the input value. This changes answers at roundoff for code that uses the cuberoot function, so ideally this PR would be dealt with before the cuberoot becomes widely used.
Force-pushed from d125dcf to 4afaf21 (compare)
This PR has been updated to simply do 3 iterations with Halley's method, starting from an optimal first guess that balances the errors at the ends of the range after 2 iterations, followed by a single Newton's method step to polish the root. It also adds the parentheses that were identified as necessary to give identical solutions with the Intel and GNU compilers.
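The "first guess that balances the errors at the ends of the range" can be found numerically. The sketch below scans constant first guesses and keeps the one minimizing the worst-case relative error over [0.125, 1.0) after two Halley iterations; the grids and the resulting constant are my own illustration, not the value actually chosen in the PR.

```python
def worst_err_two_halley(guess, xs, roots):
    """Worst-case relative error after 2 Halley iterations from a
    constant first guess, over sample points xs with known roots."""
    worst = 0.0
    for x, true in zip(xs, roots):
        r = guess
        for _ in range(2):
            r3 = r * r * r
            r = r * (r3 + 2.0 * x) / (2.0 * r3 + x)
        worst = max(worst, abs(r - true) / true)
    return worst

# Sample the reduced range and scan candidate guesses (coarse grids,
# purely illustrative).
xs = [0.125 + 0.875 * i / 400 for i in range(401)]
roots = [x ** (1.0 / 3.0) for x in xs]
guesses = [0.5 + i / 1000.0 for i in range(501)]
best = min(guesses, key=lambda g: worst_err_two_halley(g, xs, roots))
```

Balancing the error at the two ends of the range minimizes the maximum error, so the subsequent fixed iteration count covers the whole interval.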
This looks like the best option to me as well, thank you @Hallberg-NOAA !
I sent you one last tweak which should speed up by 20% without breaking anything.
Applying the first iteration explicitly appears to speed up the cuberoot function by a bit over 20%:

Before: Halley Final: 0.14174999999999999
After:  Halley Final: 0.11080000000000001

There is an assumption that compilers will precompute the constants like `0.7 * (0.7)**3`, and that all will do so in the same manner.
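Unrolling the first iteration works because, with a constant first guess, every term of the first Halley update is built from compile-time constants. The sketch below uses 0.7 (taken from the quoted expression `0.7 * (0.7)**3`) as the assumed guess; the real code is Fortran and its constants may differ.

```python
A = 0.7                 # assumed constant first guess
A3 = A * A * A          # a compiler can fold this at compile time

def cuberoot_unrolled(x):
    """3 Halley iterations + 1 Newton polish, with the first Halley
    step written out explicitly so its guess-dependent terms are
    constants rather than loop variables."""
    if x == 0.0:
        return 0.0
    # first iteration, explicit: only x-dependent work remains
    r = A * (A3 + 2.0 * x) / (2.0 * A3 + x)
    for _ in range(2):  # remaining Halley iterations
        r3 = r * r * r
        r = r * (r3 + 2.0 * x) / (2.0 * r3 + x)
    return r - (r * r * r - x) / (3.0 * r * r)   # Newton polish
```

The caveat in the comment above is the important part: reproducibility across compilers depends on all of them folding those constants to the same bits.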
Gaea regression: https://gitlab.gfdl.noaa.gov/ogrp/MOM6/-/pipelines/22078 ✔️
Modified the cuberoot function to do 3 iterations with Halley's method before switching to Newton's method as before, which saves about 2 iterations. Also added a check for a perfect solution as a stopping point during the Newton's method iterations, because this prevents about half the instances where we would compare successive solutions, avoiding some divisions and extra iterations. This changes answers at roundoff for code that uses the cuberoot function, so ideally this PR would be dealt with before the cuberoot becomes widely used.