
Use Halley method iterations in cuberoot function #544

Merged: 2 commits into NOAA-GFDL:dev/gfdl on Jan 26, 2024

Conversation

Hallberg-NOAA
Member

Modified the cuberoot function to do 3 iterations with Halley's method before switching to Newton's method as before, which saves about 2 iterations. Also added a check for a perfect solution as a stopping point during the Newton's method iterations; this short-circuits about half of the instances where we would otherwise compare successive solutions, avoiding some divisions and extra iterations. This changes answers at roundoff for code that uses the cuberoot function, so ideally this PR would be dealt with before the cuberoot becomes widely used.
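For readers arriving later, here is a rough sketch of the iteration structure being described (illustrative only, not the MOM6 source; the variable names, the first guess, the iteration cap, and the exact convergence test are all assumptions):

    function cuberoot_sketch(asx) result(root)
      ! Illustrative only: asx is assumed to be the argument already rescaled
      ! into [0.125, 1.0), as discussed later in this thread.
      integer, parameter :: dp = kind(1.0d0)
      real(dp), intent(in) :: asx
      real(dp) :: root, root_prev
      integer :: itt

      root = 1.0_dp                    ! placeholder first guess
      do itt = 1, 3                    ! three Halley iterations (simplified form)
        root = root * (root**3 + 2.0_dp*asx) / (2.0_dp*root**3 + asx)
      enddo
      do itt = 1, 10                   ! then Newton's method, with a placeholder cap
        root_prev = root
        root = root - (root**3 - asx) / (3.0_dp*root**2)
        if (root**3 == asx) exit       ! perfect solution: stop without having to
                                       ! compare successive solutions
        if (root == root_prev) exit    ! otherwise stop once the iterates agree
      enddo
    end function cuberoot_sketch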

@Hallberg-NOAA Hallberg-NOAA added the enhancement New feature or request label Jan 19, 2024

codecov bot commented Jan 19, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison: base (d7d126a) at 37.20% vs. head (d670263) at 37.19%.

Additional details and impacted files
@@             Coverage Diff              @@
##           dev/gfdl     #544      +/-   ##
============================================
- Coverage     37.20%   37.19%   -0.01%     
============================================
  Files           271      271              
  Lines         80355    80346       -9     
  Branches      14985    14982       -3     
============================================
- Hits          29894    29887       -7     
+ Misses        44903    44902       -1     
+ Partials       5558     5557       -1     


@marshallward
Member

This appears to be an improvement in computational performance, with a more questionable impact on accuracy. I compared various methods against quadruple-precision (real128) values over a dense range of numbers that are exactly representable in both double and quadruple precision. The absolute errors of the five methods listed below are shown in the figure; for reference, the two "standard" updates are also written out in Fortran just after the list.

  • x**(1./3.)
  • Standard Newton iteration: x = x - f / f'
  • The "no division" Newton iteration, as in cuberoot()
  • Standard Halley iteration: x = x - 2 f f' / (2 f' f' - f f'')
  • The proposed "no division" Halley iteration.
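For reference, a minimal sketch of those two "standard" updates, written out for f(x) = x**3 - a (the function names here are mine, for illustration only):

    pure function newton_step(x, a) result(xn)
      ! One standard Newton iteration, x - f/f', for f(x) = x**3 - a.
      integer, parameter :: dp = kind(1.0d0)
      real(dp), intent(in) :: x, a
      real(dp) :: xn
      xn = x - (x**3 - a) / (3.0_dp*x**2)
    end function newton_step

    pure function halley_step(x, a) result(xn)
      ! One standard Halley iteration, x - 2 f f' / (2 f' f' - f f''), which for
      ! f(x) = x**3 - a simplifies to the expression below.
      integer, parameter :: dp = kind(1.0d0)
      real(dp), intent(in) :: x, a
      real(dp) :: xn
      xn = x * (x**3 + 2.0_dp*a) / (2.0_dp*x**3 + a)
    end function halley_step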

[figure: cube_error — absolute error of each method]

The orange/blue difference in the "divisionless" plots comes from removing all of the if-block convergence tests and simply letting the solvers run for a fixed number of iterations. This did not change the accuracy much, although I did have to tweak the number of iterations for values closer to 1/8.

There is also a lot of fine structure in those curves, lost in the low-resolution image, but the general trends and bounds should be clear.

To me, these are the main observations:

  • All are "accurate", if we are not concerned about a trailing bit or two.

  • The order of accuracy is x**(1./3.) intrinsic > "normal" iterators > "divisionless" iterators. The intrinsic is perhaps even "exact". (Not surprising, since much of the effort in libm goes into handling the last ULP.)


The next question to consider is performance. If I sweep over a large array (16M) in a vector-favorable way with -g -O2 flags, then I get the following numbers:

Solver                               Time (s)
x**(1./3.)                           0.225
Newton                               0.2212
Newton No_div (w/ converge tests)    0.8704
Newton No_div                        0.46865
Halley                               0.1352
Halley No_div (w/ converge tests)    0.8284
Halley No_div                        0.10955

(Note: All solvers lead with a conditional "if" test for zero.)

If you break it down, the no-division Halley solver with the convergence tests and Newton correction blocks stripped out ends up being the fastest, although the straightforward Halley solver is close and has higher accuracy.

If I turn on the optimization (-g -O3 -mavx -mfma), then I get the following:

Solver           Time (s)
x**(1./3.)       0.1986
Newton           0.0697
Newton No_div    0.47125
Halley           0.12195
Halley No_div    0.10395

Most are not very optimization-friendly, with the exception of the straightforward Newton iterator, which gets a major speedup and is now clearly the fastest.
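For anyone wanting to reproduce these numbers, the timing sweep looked roughly like the following (the array size, sampled range, timer, and solver call are my assumptions, not the exact benchmark; the full code is in the repository linked below):

    program bench_cuberoot
      ! Rough sketch of a timing sweep over a large array of arguments.
      implicit none
      integer, parameter :: dp = kind(1.0d0)
      integer, parameter :: n = 2**24          ! ~16M values
      real(dp), allocatable :: x(:), r(:)
      real(dp) :: t0, t1
      integer :: i

      allocate(x(n), r(n))
      call random_number(x)
      x = 0.125_dp + 0.875_dp*x                ! arguments in the rescaled range

      call cpu_time(t0)
      do i = 1, n                              ! vector-favorable sweep
        r(i) = x(i)**(1.0_dp/3.0_dp)           ! swap in the solver under test here
      enddo
      call cpu_time(t1)

      print '(a,f8.4)', 'Time (s): ', t1 - t0
      print *, 'checksum:', sum(r)             ! keep the result live
    end program bench_cuberoot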


As for any lessons here:

  • x**(1./3.) remains the "best", although not the fastest. But because its underlying solver is ambiguous (sometimes libm, but not necessarily), it will always be volatile across platforms.

  • All solutions remain (more or less) within 4e-16, so they should be accurate to within 1-2 bits.

  • Convergence tests do not appear to provide much speedup here; the cost of evaluating them exceeds the benefit of exiting early.

  • The Halley numerator/denominator split is the fastest unoptimized solver. Standard Newton iteration is the fastest optimized solver.

I think this answers many of the accuracy and performance questions from earlier today. But it doesn't yet address the issue of platform-independence.

For my recommendation: I think it's always worth betting on optimization and would go with standard Newton iteration. It's easy to read and easy to maintain. But if the decision is to go with Halley, then let's at least drop all of the convergence tests and the additional Newton iteration step.

Sample code can be checked here: https://github.com/marshallward/cuberoot

@Hallberg-NOAA
Member Author

Thank you for the detailed performance analysis. This PR has now been modified to avoid any convergence testing and instead do a fixed number of iterations (5 total: 3 of Halley's method and 2 of Newton's) regardless of the input value. The final Newton's method iteration uses the correction form that polishes the root, and it should give accuracy comparable to the best methods described in the performance analysis.
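For reference, a minimal sketch of that final correction-form step (placeholder names, not the code in this PR): because it evaluates only a small increment to an already accurate root, the rounding error of the update itself is negligible, which is what lets it polish the root.

    pure function newton_polish(root, asx) result(r)
      ! Final Newton step written in correction form, root - f/f'.  The
      ! algebraically equivalent simplified form, (2*root**3 + asx)/(3*root**2),
      ! recomputes the root from scratch rather than nudging it.
      integer, parameter :: dp = kind(1.0d0)
      real(dp), intent(in) :: root, asx
      real(dp) :: r
      r = root - (root**3 - asx) / (3.0_dp*root**2)
    end function newton_polish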

@marshallward
Member

marshallward commented Jan 23, 2024

I tested out your unsimplified x = x - f/f' trick in the other solvers and it had a similar effect across the board. Impressive!

[figure: err — error comparison with the correction-form Newton step]

It had the added effect of reducing the number of Newton iterations from 7 to 6. I was not able to reduce the number of Halley iterations, due to poor convergence near 0.125.

Here are the updated timings. (Sorry, different computer; the numbers are slower.)

Solver                                        Time (s)   Optimized Time (s)
x**(1./3.)                                    0.684      0.232
Newton                                        0.316      0.068
Halley                                        0.216      0.157
Halley No_div (v1, minus convergence tests)   0.129      0.125
Halley No_div (v2; current PR)                0.543      0.480

Based on these numbers, I think we should forget the no-division solvers and just use either the standard Newton or Halley solvers, with the finalizing Newton step.

@marshallward
Member

marshallward commented Jan 25, 2024

I bite my tongue - and I now support the "no-division" Halley solver! After some modifications to the various methods, they are all now producing errors on the order of the ULP (1e-16).

The candidates are:

  • x**(1./3.): Math library intrinsic (presumably)
  • Newton: Six Newton iterations, with divisions
  • Halley: Three Halley + 1 Newton iterations, with division
  • Newton No-div: Five no-div Newtons + 1 "div" Newton
  • Halley No-div: Three "No-div" Halleys + 1 "div" Newton
  • Halley Final: This PR (close to Halley No-div with some extra work).

Versions of each code are still here: https://github.com/marshallward/cuberoot/blob/main/cubes.f90

I made various tweaks to these: reducing iterations as much as possible, and sometimes removing steps that are no longer needed, such as the numerator-denominator renormalization.

Here are the errors; all are nice and tiny (though unfortunately above the ULP):

[figure: err — error of each candidate solver]

And here are the (new) speeds:

Solver               GNU     GNU -O3   Intel   Intel -O3
x**(1./3.)           0.233   0.201     0.068   0.067
Newton               0.213   0.205     0.246   0.221
Newton No-div        0.158   0.140     0.162   0.152
Halley               0.124   0.123     0.137   0.130
Halley No-div        0.105   0.089     0.114   0.097
Halley No-div (v2)   0.145   0.128     0.158   0.144

(All times in seconds.)

(Note that the earlier numbers benefited from an unrealistic vectorization that does not appear in these numbers. Also, this is a different machine, etc.)

The No-division Halley iteration (3 Halley + 1 Newton) is the clear winner here, optimization or otherwise. It even beats the GNU math library, although Intel still wallops us in performance.


Why is the new version slower?

Solely due to this term:

root_asx = (2.0 * num**3 + asx * den**3) / ( 3.0 * (den * num**2) )

Replacing it with the old version:

root_asx = num / den

will restore the performance.
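(My reading of why that term is so much more expensive: the Newton update at r = num/den for f(r) = r**3 - asx is r - (r**3 - asx)/(3*r**2) = (2*r**3 + asx)/(3*r**2), and substituting r = num/den and clearing the denominators gives exactly (2*num**3 + asx*den**3) / (3*(den*num**2)). In other words, the slower expression folds one extra Newton iteration into the final division, while num / den is just the bare division.)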

Also, these rescaled initial values:

    num = 0.5 + asx
    den = 1.0 + 0.5*asx

have identical performance and accuracy to the original unscaled values

    num = 1. + 2.*asx
    den = 2 + asx

which are easier to identify as the first iteration. So perhaps switch them back.
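(To spell out why those read as the first iteration: the simplified Halley update for f(r) = r**3 - asx is r_new = r * (r**3 + 2*asx) / (2*r**3 + asx), so starting from r = 1 gives num = 1 + 2*asx and den = 2 + asx directly. The rescaled pair above is just both of these halved, which leaves the ratio unchanged.)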

VERDICT: 3x Halley (no division) + 1x Newton (with division) is the best, let's do that!

@marshallward
Member

marshallward commented Jan 25, 2024

Oh no! I take it back! I just inspected the errors from Intel and they are uncomfortably higher than on GNU.

If the no-divs cannot be sorted out, then regular Halley would seem to be the best option.

(All a mistake, see below)

@marshallward
Member

marshallward commented Jan 25, 2024

Never mind, false alarm: somehow an old executable got dumped on Gaea. Here is the corrected error on Gaea using Intel, which looks fine:

[figure: err — corrected Intel errors on Gaea]

No-division Halley once again seems the best choice.

@marshallward
Member

Following up after talking with @Hallberg-NOAA:

  • There is a computational difference between the cube root a^(1/3) and the root of x^3 - a = 0: a small ULP-like error in one may be larger in the other, and vice-versa. I have been plotting errors in the root itself, but there are times when we may be more interested in how well the solution satisfies x**3 - a = 0.

    The relative errors (cuberoot(x)**3 - x) / x are plotted below.

    [figure: err_v1 — relative errors (cuberoot(x)**3 - x) / x for each solver]

    Blue is -O2 optimization, orange is with FMAs. FMA smooths things out and generally (but not always) reduces the error.

    The main observation is that all but the latest no-div Halley show problems around 0.125.

  • To this point, most schemes add a one-ULP error to cuberoot(0.125). The solver should aspire to exact results at 0.125 and/or 1.0; since the current rescalings convert 1.0 to 0.125, this is a problem. This is the current state:

    Solver               cuberoot(0.125) - 0.5   cuberoot(1.) - 1.
    x**(1./3.)           0.                      0.
    Newton               1e-16                   0.
    Newton No-div        1e-16                   0.
    Halley               1e-16                   0.
    Halley No-div        1e-16                   0.
    Halley No-div (v2)   0.                      0.

    Only the current no-div Halley solver satisfies this property. But the next point addresses this.

  • The initial value of (3./8.)**(1./3.) = 0.7211.. does not help speed up convergence, but it does appear to improve errors around 0.125.

    [figure: err_v2 — errors with the (3./8.)**(1./3.) initial value]

    It also ensures that cuberoot(0.125) = 0.5 without error in all methods, and does not affect performance.

One other, unrelated point:

  • In the new Halley No-div, applying a factor of 0.5 to the initial numerator and denominator values helps damp them; they can otherwise grow without constraint even as their ratio converges to the solution. So the current form is the better choice.

Given this new information:

  • All of the schemes satisfy the requirements as long as we start with (3/8)^(1/3) = 0.7211..

  • Errors look about the same to me, though Newton and the new Halley no-div look a bit better around 0.125.

  • No-div Halley using root = num/den before the Newton finalizer remains the fastest. But the No-div Halley using root = (2 * num**3 + x * den**3) / (3 * den * num**2) is still quite fast and shows some potentially reduced error.

  Modified the cuberoot function to do 3 iterations with Halley's method
starting with a first guess that balances the errors at the two ends of the
range of the iterations, before a final iteration with Newton's method that
polishes the root and gives a solution that is accurate to machine precision.
Following on performance testing of the previous version, all convergence
testing has been removed and the same number of iterations are applied
regardless of the input value.  This changes answers at roundoff for code that
uses the cuberoot function, so ideally this PR would be dealt with before the
cuberoot becomes widely used.
@Hallberg-NOAA
Member Author

This PR has been updated to simply do 3 iterations with Halley's method, starting from an optimal first guess that balances the errors at the two ends of the range after 2 iterations, followed by a single Newton's method step to polish the root. It also adds the parentheses that were identified as necessary to give identical solutions with the Intel and GNU compilers.
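For anyone skimming the thread later, the scheme now looks roughly like this (a minimal sketch, not the actual MOM6 source; the first-guess constant, the names, and the omitted rescaling and zero-input handling are assumptions):

    function cuberoot_final_sketch(asx) result(root)
      ! Sketch only: asx is assumed to be pre-scaled into [0.125, 1.0); the real
      ! cuberoot() also handles the exponent rescaling and the zero-input case.
      integer, parameter :: dp = kind(1.0d0)
      real(dp), intent(in) :: asx
      real(dp) :: root
      real(dp) :: num, den, num3, den3, num_new
      integer :: itt

      num = 0.72_dp   ! first guess; placeholder near the (3./8.)**(1./3.) = 0.7211...
      den = 1.0_dp    ! discussed above, not necessarily the constant in the PR
      do itt = 1, 3   ! three Halley iterations carried as a fraction (no division)
        num3 = num**3 ; den3 = den**3
        num_new = num * (num3 + 2.0_dp*asx*den3)
        den     = den * (2.0_dp*num3 + asx*den3)
        num     = num_new
      enddo
      root = num / den                                   ! the single division
      root = root - (root**3 - asx) / (3.0_dp*root**2)   ! Newton correction step
                                                         ! to polish the root
    end function cuberoot_final_sketch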

@marshallward
Member

This looks like the best option to me as well, thank you @Hallberg-NOAA !

@marshallward
Member

I sent you one last tweak, which should speed things up by about 20% without breaking anything.

Hallberg-NOAA#11

Applying the first iteration explicitly appears to speed up the cuberoot
function by a bit over 20%:

 Before:
 Halley Final:  0.14174999999999999

 After:
 Halley Final:  0.11080000000000001

There is an assumption that compilers will precompute the constants like
`0.7 * (0.7)**3`, and that all will do so in the same manner.
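(For anyone reading the linked tweak later, my understanding of the idea: in the no-division form, the first Halley iteration from a constant first guess r0, with den = 1, is num = r0*(r0**3 + 2*asx) = r0*r0**3 + (2*r0)*asx and den = 2*r0**3 + asx, so the r0-only products can be folded into compile-time constants, e.g. 0.7*(0.7)**3 for a first guess of 0.7, and the loop only has to perform the remaining iterations. The exact constants in the tweak may differ; 0.7 here is just the example from the commit message above.)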
@marshallward
Member

Gaea regression: https://gitlab.gfdl.noaa.gov/ogrp/MOM6/-/pipelines/22078 ✔️

@marshallward marshallward merged commit 17f1c40 into NOAA-GFDL:dev/gfdl Jan 26, 2024
12 checks passed
@Hallberg-NOAA Hallberg-NOAA mentioned this pull request Jan 28, 2024