Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: numerical errors in GPU recombination #874

Merged
merged 4 commits into from
Nov 29, 2024

Conversation

rg936672
Copy link
Contributor

@rg936672 rg936672 commented Nov 20, 2024

PR Type

  • Bugfix

Description

Closes #852. Closes #853. Fixes an incorrect update applied in the elimination step of the recombination solvers.

How Has This Been Tested?

Existing tests pass as expected, even on a GPU machine.

Does this PR introduce a breaking change?

No.

Checklist before requesting a review

  • I have made sure that my PR is not a duplicate.
  • My code follows the style guidelines of this project.
  • I have ensured my code is easy to understand, including docstrings and comments where necessary.
  • I have performed a self-review of my code.
  • I have made corresponding changes to the documentation.
  • My changes generate no new warnings.
  • New and existing unit tests pass locally with my changes.
  • Any dependent changes have been merged and published in downstream modules.
  • I have updated CHANGELOG.md, if appropriate.

@rg936672 rg936672 force-pushed the fix/caratheodory-recombination-gpu branch from 47459ff to 690cea6 Compare November 20, 2024 11:59
Copy link
Contributor

Performance review

Commit e5e66c5 - Merge 690cea6 into 3556e02

No significant changes to performance.

Copy link
Contributor

Performance review

Commit 479d244 - Merge 0cf5351 into 3556e02

No significant changes to performance.

@tc85324
Copy link
Contributor

tc85324 commented Nov 25, 2024

Should this PR also resolve #852? Additionally, would it be possible to add a GPU runner to the CI tests so we can avoid these issues going silently unnoticed in the future (will need to open a new issue for this)?

@tp832944
Copy link
Contributor

Should this PR also resolve #852? Additionally, would it be possible to add a GPU runner to the CI tests so we can avoid these issues going silently unnoticed in the future (will need to open a new issue for this)?

We had thought about this previously, but decided against it primarily on cost grounds. We do run GPU tests manually just before each release. A few older GPU bugs may have gone unnoticed until now because of a configuration issue with the GPU.

Copy link
Contributor

@tc85324 tc85324 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think your idea is good, and a valid approach! However, I believe we can achieve the same thing (fixing both #853 and #852) by simply correcting the following line, which is currently incorrect:

# Old (incorrect)
updated_null_space_basis.at  = updated_null_space_basis.at[basis_index].set(0)
# New (correct)
updated_null_space_basis.at  = updated_null_space_basis.at[:, elimination_index].set(0)

In my own local tests, this appears to work on CPU and GPU devices; all tests pass on CPU and GPU devices.

The only other thing that might be worthwhile is to indicate to users (via the docstring or a warning) that enabling double precision ("jax_enable_x64") is sometimes required to obtain correct results.

This uses a much simpler fix proposed by @tc85324 - the null space basis
update was incorrect.

This commit effectively reverts the last two commits.

Refs: #853
@rg936672
Copy link
Contributor Author

Ah, that does make sense - I'd just assumed the code was correct as I wasn't super familiar with the algorithm!

Copy link
Contributor

Performance review

Commit 959820f - Merge 52e5330 into b6f7b83

No significant changes to performance.

@rg936672 rg936672 requested a review from tc85324 November 27, 2024 16:44
Copy link
Contributor

@tc85324 tc85324 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very small comment on updating the CHANGELOG, but otherwise looks good to me!

@rg936672 rg936672 requested a review from tc85324 November 28, 2024 14:27
Copy link
Contributor

Performance review

Commit 2530652 - Merge 7d3131d into 5039cff

No significant changes to performance.

@tc85324 tc85324 merged commit 87dd4a7 into main Nov 29, 2024
19 checks passed
@tc85324 tc85324 deleted the fix/caratheodory-recombination-gpu branch November 29, 2024 09:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

On GPU, TestCaratheodoryRecombination is failing On GPU, TestTreeRecombination is failing
3 participants