
Use Halley method iterations in cuberoot function #544

Merged: 2 commits into NOAA-GFDL:dev/gfdl on Jan 26, 2024

Conversation

Hallberg-NOAA
Member

Modified the cuberoot function to do 3 iterations with Halley's method before switching to Newton's method as before, which saves about 2 iterations. Also added a check for a perfect solution as a stopping point during the Newton's method iterations; this short-circuits about half of the instances where we would otherwise compare successive solutions, avoiding some divisions and extra iterations. This changes answers at roundoff for code that uses the cuberoot function, so ideally this PR would be dealt with before the cuberoot becomes widely used.
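For readers arriving later, here is a rough sketch of the iteration structure being described (illustrative only, not the MOM6 source; the variable names, the first guess, the iteration cap, and the exact convergence test are all assumptions):

    function cuberoot_sketch(asx) result(root)
      ! Illustrative only: asx is assumed to be the argument already rescaled
      ! into [0.125, 1.0), as discussed later in this thread.
      integer, parameter :: dp = kind(1.0d0)
      real(dp), intent(in) :: asx
      real(dp) :: root, root_prev
      integer :: itt

      root = 1.0_dp                    ! placeholder first guess
      do itt = 1, 3                    ! three Halley iterations (simplified form)
        root = root * (root**3 + 2.0_dp*asx) / (2.0_dp*root**3 + asx)
      enddo
      do itt = 1, 10                   ! then Newton's method, with a placeholder cap
        root_prev = root
        root = root - (root**3 - asx) / (3.0_dp*root**2)
        if (root**3 == asx) exit       ! perfect solution: stop without having to
                                       ! compare successive solutions
        if (root == root_prev) exit    ! otherwise stop once the iterates agree
      enddo
    end function cuberoot_sketch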

@Hallberg-NOAA Hallberg-NOAA added the enhancement New feature or request label Jan 19, 2024

codecov bot commented Jan 19, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison: base (d7d126a) at 37.20% vs. head (d670263) at 37.19%.

Additional details and impacted files
@@             Coverage Diff              @@
##           dev/gfdl     #544      +/-   ##
============================================
- Coverage     37.20%   37.19%   -0.01%     
============================================
  Files           271      271              
  Lines         80355    80346       -9     
  Branches      14985    14982       -3     
============================================
- Hits          29894    29887       -7     
+ Misses        44903    44902       -1     
+ Partials       5558     5557       -1     


@marshallward
Member

This appears to be an improvement in computational performance, with a more questionable impact on accuracy. I compared various methods against quadruple-precision (real128) values over a dense range of numbers that are exactly representable in both double and quadruple precision. The absolute errors of the five methods listed below are shown in the figure; for reference, the two "standard" updates are also written out in Fortran just after the list.

  • x**(1./3.)
  • Standard Newton iteration: x = x - f / f'
  • The "no division" Newton iteration, as in cuberoot()
  • Standard Halley iteration: x = x - 2 f f' / (2 f' f' - f f'')
  • The proposed "no division" Halley iteration.
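For reference, a minimal sketch of those two "standard" updates, written out for f(x) = x**3 - a (the function names here are mine, for illustration only):

    pure function newton_step(x, a) result(xn)
      ! One standard Newton iteration, x - f/f', for f(x) = x**3 - a.
      integer, parameter :: dp = kind(1.0d0)
      real(dp), intent(in) :: x, a
      real(dp) :: xn
      xn = x - (x**3 - a) / (3.0_dp*x**2)
    end function newton_step

    pure function halley_step(x, a) result(xn)
      ! One standard Halley iteration, x - 2 f f' / (2 f' f' - f f''), which for
      ! f(x) = x**3 - a simplifies to the expression below.
      integer, parameter :: dp = kind(1.0d0)
      real(dp), intent(in) :: x, a
      real(dp) :: xn
      xn = x * (x**3 + 2.0_dp*a) / (2.0_dp*x**3 + a)
    end function halley_step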

[figure: cube_error — absolute error of each method]

The orange/blue difference in the "divisionless" plots comes from removing all of the if-block convergence tests and simply letting the solvers run for a fixed number of iterations. This did not change the accuracy much, although I did have to tweak the number of iterations for values closer to 1/8.

There is also a lot of fine structure in those curves, lost in the low-resolution image, but the general trends and bounds should be clear.

To me, these are the main observations:

  • All are "accurate", if we are not concerned about a trailing bit or two.

  • The order of accuracy is x**(1./3.) intrinsic > "normal" iterators > "divisionless" iterators. The intrinsic is perhaps even "exact". (Not surprising, since much of the effort in libm goes into handling the last ULP.)


The next question to consider is performance. If I sweep over a large array (16M) in a vector-favorable way with -g -O2 flags, then I get the following numbers:

Solver                               Time (s)
x**(1./3.)                           0.225
Newton                               0.2212
Newton No_div (w/ converge tests)    0.8704
Newton No_div                        0.46865
Halley                               0.1352
Halley No_div (w/ converge tests)    0.8284
Halley No_div                        0.10955

(Note: All solvers lead with a conditional "if" test for zero.)

If you break it down, the no-division Halley solver with the convergence tests and Newton correction blocks stripped out ends up being the fastest, although the straightforward Halley solver is close and has higher accuracy.

If I turn on the optimization (-g -O3 -mavx -mfma), then I get the following:

Solver           Time (s)
x**(1./3.)       0.1986
Newton           0.0697
Newton No_div    0.47125
Halley           0.12195
Halley No_div    0.10395

Most are not very optimization-friendly, with the exception of the straightforward Newton iterator, which gets a major speedup and is now clearly the fastest.
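For anyone wanting to reproduce these numbers, the timing sweep looked roughly like the following (the array size, sampled range, timer, and solver call are my assumptions, not the exact benchmark; the full code is in the repository linked below):

    program bench_cuberoot
      ! Rough sketch of a timing sweep over a large array of arguments.
      implicit none
      integer, parameter :: dp = kind(1.0d0)
      integer, parameter :: n = 2**24          ! ~16M values
      real(dp), allocatable :: x(:), r(:)
      real(dp) :: t0, t1
      integer :: i

      allocate(x(n), r(n))
      call random_number(x)
      x = 0.125_dp + 0.875_dp*x                ! arguments in the rescaled range

      call cpu_time(t0)
      do i = 1, n                              ! vector-favorable sweep
        r(i) = x(i)**(1.0_dp/3.0_dp)           ! swap in the solver under test here
      enddo
      call cpu_time(t1)

      print '(a,f8.4)', 'Time (s): ', t1 - t0
      print *, 'checksum:', sum(r)             ! keep the result live
    end program bench_cuberoot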


As for any lessons here:

  • x**(1./3.) remains the "best", although not the fastest. But because its underlying solver is ambiguous (sometimes libm, but not necessarily), it will always be volatile across platforms.

  • All solutions remain (more or less) within 4e-16, so they should be accurate to within 1-2 bits.

  • Convergence tests do not appear to provide much speedup here; the cost of evaluating them exceeds the benefit of exiting early.

  • The Halley numerator/denominator split is the fastest unoptimized solver. Standard Newton iteration is the fastest optimized solver.

I think this answers many of the accuracy and performance questions from earlier today. But it doesn't yet address the issue of platform-independence.

For my recommendation: I think it's always worth betting on optimization and would go with standard Newton iteration. It's easy to read and easy to maintain. But if the decision is to go with Halley, then let's at least drop all of the convergence tests and the additional Newton iteration step.

Sample code can be checked here: https://github.com/marshallward/cuberoot

@Hallberg-NOAA
Member Author

Thank you for the detailed performance analysis. This PR has now been modified to avoid any convergence testing and instead do a fixed number of iterations (5 total: 3 of Halley's method and 2 of Newton's) regardless of the input value. The final Newton's method iteration uses the correction form that polishes the root, and it should give accuracy comparable to the best methods described in the performance analysis.
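For reference, a minimal sketch of that final correction-form step (placeholder names, not the code in this PR): because it evaluates only a small increment to an already accurate root, the rounding error of the update itself is negligible, which is what lets it polish the root.

    pure function newton_polish(root, asx) result(r)
      ! Final Newton step written in correction form, root - f/f'.  The
      ! algebraically equivalent simplified form, (2*root**3 + asx)/(3*root**2),
      ! recomputes the root from scratch rather than nudging it.
      integer, parameter :: dp = kind(1.0d0)
      real(dp), intent(in) :: root, asx
      real(dp) :: r
      r = root - (root**3 - asx) / (3.0_dp*root**2)
    end function newton_polish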

@marshallward
Member

marshallward commented Jan 23, 2024

I tested out your unsimplified x = x - f/f' trick in the other solvers and it had a similar effect across the board. Impressive!

[figure: err — error comparison with the correction-form Newton step]

It had the added effect of reducing the number of Newton iterations from 7 to 6. I was not able to reduce the number of Halley iterations, due to poor convergence near 0.125.

Here are the updated timings. (Sorry, different computer; the numbers are slower.)

Solver                                        Time (s)   Optimized Time (s)
x**(1./3.)                                    0.684      0.232
Newton                                        0.316      0.068
Halley                                        0.216      0.157
Halley No_div (v1, minus convergence tests)   0.129      0.125
Halley No_div (v2; current PR)                0.543      0.480

Based on these numbers, I think we should forget the no-division solvers and just use either the standard Newton or Halley solvers, with the finalizing Newton step.

@marshallward
Member

marshallward commented Jan 25, 2024

I bite my tongue - and I now support the "no-division" Halley solver! After some modifications to the various methods, they are all now producing errors on the order of the ULP (1e-16).

The candidates are:

  • x**(1./3.): Math library intrinsic (presumably)
  • Newton: Six Newton iterations, with divisions
  • Halley: Three Halley + 1 Newton iterations, with division
  • Newton No-div: Five no-div Newtons + 1 "div" Newton
  • Halley No-div: Three "No-div" Halleys + 1 "div" Newton
  • Halley Final: This PR (close to Halley No-div with some extra work).

Versions of each code are still here: https://github.com/marshallward/cuberoot/blob/main/cubes.f90

I made various tweaks to these: reducing iterations as much as possible, and sometimes removing steps that are no longer needed, such as the numerator-denominator renormalization.

Here are the errors; all are nice and tiny (though unfortunately above the ULP):

[figure: err — error of each candidate solver]

And here are the (new) speeds:

Solver               GNU     GNU -O3   Intel   Intel -O3
x**(1./3.)           0.233   0.201     0.068   0.067
Newton               0.213   0.205     0.246   0.221
Newton No-div        0.158   0.140     0.162   0.152
Halley               0.124   0.123     0.137   0.130
Halley No-div        0.105   0.089     0.114   0.097
Halley No-div (v2)   0.145   0.128     0.158   0.144

(All times in seconds.)

(Note that the earlier numbers benefited from an unrealistic vectorization that does not appear in these numbers. Also, this is a different machine, etc.)

The No-division Halley iteration (3 Halley + 1 Newton) is the clear winner here, optimization or otherwise. It even beats the GNU math library, although Intel still wallops us in performance.


Why is the new version slower?

Solely due to this term:

root_asx = (2.0 * num**3 + asx * den**3) / ( 3.0 * (den * num**2) )

Replacing it with the old version:

root_asx = num / den

will restore the performance.
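(My reading of why that term is so much more expensive: the Newton update at r = num/den for f(r) = r**3 - asx is r - (r**3 - asx)/(3*r**2) = (2*r**3 + asx)/(3*r**2), and substituting r = num/den and clearing the denominators gives exactly (2*num**3 + asx*den**3) / (3*(den*num**2)). In other words, the slower expression folds one extra Newton iteration into the final division, while num / den is just the bare division.)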

Also, these rescaled initial values:

    num = 0.5 + asx
    den = 1.0 + 0.5*asx

have identical performance and accuracy to the original unscaled values

    num = 1. + 2.*asx
    den = 2 + asx

which are easier to identify as the first iteration. So perhaps switch them back.
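(To spell out why those read as the first iteration: the simplified Halley update for f(r) = r**3 - asx is r_new = r * (r**3 + 2*asx) / (2*r**3 + asx), so starting from r = 1 gives num = 1 + 2*asx and den = 2 + asx directly. The rescaled pair above is just both of these halved, which leaves the ratio unchanged.)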

VERDICT: 3x Halley (no division) + 1x Newton (with division) is the best, let's do that!

@marshallward
Member

marshallward commented Jan 25, 2024

Oh no! I take it back! I just inspected the errors from Intel and they are uncomfortably higher than on GNU.

If the no-divs cannot be sorted out, then regular Halley would seem to be the best option.

(All a mistake, see below)

@marshallward
Member

marshallward commented Jan 25, 2024

Never mind, false alarm: somehow an old executable got dumped on Gaea. Here is the corrected error on Gaea using Intel, which looks fine:

[figure: err — corrected Intel errors on Gaea]

No-division Halley once again seems the best choice.

@marshallward
Member

Following up after talking with @Hallberg-NOAA:

  • There is a computational difference between the cube root a^(1/3) and the root of x^3 - a = 0: a small ULP-like error in one may be larger in the other, and vice-versa. I have been plotting errors in the root itself, but there are times when we may be more interested in how well the solution satisfies x**3 - a = 0.

    The relative errors (cuberoot(x)**3 - x) / x are plotted below.

    [figure: err_v1 — relative errors (cuberoot(x)**3 - x) / x for each solver]

    Blue is -O2 optimization, orange is with FMAs. FMA smooths things out and generally (but not always) reduces the error.

    The main observation is that all but the latest no-div Halley show problems around 0.125.

  • To this point, most schemes add a one-ULP error to cuberoot(0.125). The solver should aspire to exact results at 0.125 and/or 1.0; since the current rescalings convert 1.0 to 0.125, this is a problem. This is the current state:

    Solver               cuberoot(0.125) - 0.5   cuberoot(1.) - 1.
    x**(1./3.)           0.                      0.
    Newton               1e-16                   0.
    Newton No-div        1e-16                   0.
    Halley               1e-16                   0.
    Halley No-div        1e-16                   0.
    Halley No-div (v2)   0.                      0.

    Only the current no-div Halley solver satisfies this property. But the next point addresses this.

  • The initial value of (3./8.)**(1./3.) = 0.7211.. does not help speed up convergence, but it does appear to improve errors around 0.125.

    [figure: err_v2 — errors with the (3./8.)**(1./3.) initial value]

    It also ensures that cuberoot(0.125) = 0.5 without error in all methods, and does not affect performance.

One other, unrelated point:

  • In the new Halley No-div, applying a factor of 0.5 to the initial numerator and denominator values helps damp them; they can otherwise grow without constraint even as their ratio converges to the solution. So the current form is the better choice.

Given this new information:

  • All of the schemes satisfy the requirements as long as we start with (3/8)^(1/3) = 0.7211..

  • Errors look about the same to me, though Newton and the new Halley no-div look a bit better around 0.125.

  • No-div Halley using root = num/den before the Newton finalizer remains the fastest. But the No-div Halley using root = (2 * num**3 + x * den**3) / (3 * den * num**2) is still quite fast and shows some potentially reduced error.

  Modified the cuberoot function to do 3 iterations with Halley's method
starting with a first guess that balances the errors at the two ends of the
range of the iterations, before a final iteration with Newton's method that
polishes the root and gives a solution that is accurate to machine precision.
Following on performance testing of the previous version, all convergence
testing has been removed and the same number of iterations are applied
regardless of the input value.  This changes answers at roundoff for code that
uses the cuberoot function, so ideally this PR would be dealt with before the
cuberoot becomes widely used.
@Hallberg-NOAA
Member Author

This PR has been updated to simply do 3 iterations with Halley's method, starting from an optimal first guess that balances the errors at the two ends of the range after 2 iterations, followed by a single Newton's method step to polish the root. It also adds the parentheses that were identified as necessary to give identical solutions with the Intel and GNU compilers.
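For anyone skimming the thread later, the scheme now looks roughly like this (a minimal sketch, not the actual MOM6 source; the first-guess constant, the names, and the omitted rescaling and zero-input handling are assumptions):

    function cuberoot_final_sketch(asx) result(root)
      ! Sketch only: asx is assumed to be pre-scaled into [0.125, 1.0); the real
      ! cuberoot() also handles the exponent rescaling and the zero-input case.
      integer, parameter :: dp = kind(1.0d0)
      real(dp), intent(in) :: asx
      real(dp) :: root
      real(dp) :: num, den, num3, den3, num_new
      integer :: itt

      num = 0.72_dp   ! first guess; placeholder near the (3./8.)**(1./3.) = 0.7211...
      den = 1.0_dp    ! discussed above, not necessarily the constant in the PR
      do itt = 1, 3   ! three Halley iterations carried as a fraction (no division)
        num3 = num**3 ; den3 = den**3
        num_new = num * (num3 + 2.0_dp*asx*den3)
        den     = den * (2.0_dp*num3 + asx*den3)
        num     = num_new
      enddo
      root = num / den                                   ! the single division
      root = root - (root**3 - asx) / (3.0_dp*root**2)   ! Newton correction step
                                                         ! to polish the root
    end function cuberoot_final_sketch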

@marshallward
Member

This looks like the best option to me as well, thank you @Hallberg-NOAA !

@marshallward
Member

I sent you one last tweak, which should speed things up by about 20% without breaking anything.

Hallberg-NOAA#11

Applying the first iteration explicitly appears to speed up the cuberoot
function by a bit over 20%:

 Before:
 Halley Final:  0.14174999999999999

 After:
 Halley Final:  0.11080000000000001

There is an assumption that compilers will precompute the constants like
`0.7 * (0.7)**3`, and that all will do so in the same manner.
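(For anyone reading the linked tweak later, my understanding of the idea: in the no-division form, the first Halley iteration from a constant first guess r0, with den = 1, is num = r0*(r0**3 + 2*asx) = r0*r0**3 + (2*r0)*asx and den = 2*r0**3 + asx, so the r0-only products can be folded into compile-time constants, e.g. 0.7*(0.7)**3 for a first guess of 0.7, and the loop only has to perform the remaining iterations. The exact constants in the tweak may differ; 0.7 here is just the example from the commit message above.)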
@marshallward
Member

Gaea regression: https://gitlab.gfdl.noaa.gov/ogrp/MOM6/-/pipelines/22078 ✔️

@marshallward marshallward merged commit 17f1c40 into NOAA-GFDL:dev/gfdl Jan 26, 2024
12 checks passed
@Hallberg-NOAA Hallberg-NOAA mentioned this pull request Jan 28, 2024