ClientKeepAlive update action ClientKeepAlive #1580
2102842 to af6c2a2
Thanks!
@allada Could this affect the redis scheduler as well?
Reviewed 4 of 4 files at r1, all commit messages.
Reviewable status: 1 of 1 LGTMs obtained, and all files reviewed, and pending CI: Bazel Dev / macos-13, Bazel Dev / macos-14, Bazel Dev / ubuntu-24.04, Cargo Dev / macos-13, Cargo Dev / ubuntu-22.04, Coverage, Installation / macos-13, Installation / macos-14, Local / lre-rs / macos-14, Remote / lre-cc / large-ubuntu-22.04, Remote / lre-rs / large-ubuntu-22.04, docker-compose-compiles-nativelink (22.04), windows-2022 / stable, and 1 discussions need to be resolved
nativelink-scheduler/src/simple_scheduler.rs
line 374 at r1 (raw file):
// tasks are going to be dropped all over the place, this isn't a good
// setting.
if client_action_timeout_s <= 10 {
nit: Seems like we could consolidate the `CLIENT_KEEPALIVE_DURATION`s in `memory_awaited_action_db` and `store_awaited_action_db` and reuse that here as well.
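A minimal sketch of what that consolidation might look like, assuming a single shared constant that the timeout check compares against instead of the literal `10`. The value of `CLIENT_KEEPALIVE_DURATION`, the function name `validate_client_action_timeout`, and the choice to return an error are all assumptions made for illustration; they are not taken from the NativeLink source.

```rust
use std::time::Duration;

/// Shared keep-alive interval. The 10 second value is an assumption that
/// mirrors the literal in the quoted check; the real constant may differ.
pub const CLIENT_KEEPALIVE_DURATION: Duration = Duration::from_secs(10);

/// Hypothetical helper: rejects a client action timeout that is not longer
/// than the keep-alive interval, since actions would be dropped before a
/// keep-alive could arrive.
pub fn validate_client_action_timeout(client_action_timeout_s: u64) -> Result<Duration, String> {
    if client_action_timeout_s <= CLIENT_KEEPALIVE_DURATION.as_secs() {
        return Err(format!(
            "client_action_timeout_s ({}) must be greater than the keep-alive interval ({}s)",
            client_action_timeout_s,
            CLIENT_KEEPALIVE_DURATION.as_secs()
        ));
    }
    Ok(Duration::from_secs(client_action_timeout_s))
}
```

With one constant, the in-memory DB, the store-backed DB, and the configuration check cannot drift apart if the interval is ever changed.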
The Redis scheduler already handles this case; it was just a missing edge case in the in-memory DB.
af6c2a2 to 83179f7
Yeah, this is why we didn't catch this. I really need to get around to removing the memory scheduler and just implementing SchedulerStore for MemoryStore; I hate lugging around two implementations :-(
Thanks a lot Chris!
Reviewed 1 of 4 files at r1, 4 of 4 files at r2, all commit messages.
Reviewable status: 2 of 1 LGTMs obtained, and all files reviewed, and pending CI: Bazel Dev / macos-13, Bazel Dev / macos-14, Bazel Dev / ubuntu-24.04, Cargo Dev / macos-13, Coverage, Installation / macos-13, Installation / macos-14, Local / lre-rs / macos-14, NativeLink.com Cloud / Remote Cache / ubuntu-24.04, Remote / lre-cc / large-ubuntu-22.04, Remote / lre-rs / large-ubuntu-22.04, docker-compose-compiles-nativelink (22.04), windows-2022 / stable
Reviewed 1 of 4 files at r1.
Description
When the scheduler was updated to add the keep-alive to the AwaitedAction, the MemoryAwaitedActionDb was not updated to set it when a ClientKeepAlive was received.
Fix the test client_reconnect_keeps_action_alive, which was not performing the eviction because of optimisations in the filter_operations function. With the eviction restored, the test detects the issue.
Then update the ActionEvent::ClientKeepAlive event handler to update the client keep-alive timestamp in the AwaitedAction.
Fixes #1579.
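The behaviour described above can be sketched roughly as follows. This is a simplified illustration, not the NativeLink implementation: `SketchAwaitedActionDb`, its fields, and its methods are hypothetical, and only the names AwaitedAction and ActionEvent::ClientKeepAlive come from the description. The point is that the keep-alive handler must refresh the timestamp that eviction later inspects.

```rust
use std::collections::HashMap;
use std::time::{Duration, SystemTime};

/// Simplified stand-in for an awaited action (fields are assumptions).
pub struct AwaitedAction {
    /// Last time a client signalled that it still wants this action.
    pub last_client_keepalive: SystemTime,
}

/// Simplified event type; only the variant name comes from the PR text.
pub enum ActionEvent {
    ClientKeepAlive { operation_id: String },
}

/// Hypothetical in-memory DB keyed by operation id.
pub struct SketchAwaitedActionDb {
    actions: HashMap<String, AwaitedAction>,
    client_action_timeout: Duration,
}

impl SketchAwaitedActionDb {
    pub fn new(client_action_timeout: Duration) -> Self {
        Self { actions: HashMap::new(), client_action_timeout }
    }

    pub fn insert(&mut self, operation_id: &str, now: SystemTime) {
        self.actions
            .insert(operation_id.to_string(), AwaitedAction { last_client_keepalive: now });
    }

    pub fn contains(&self, operation_id: &str) -> bool {
        self.actions.contains_key(operation_id)
    }

    /// The step the description says was missing: on a keep-alive event,
    /// refresh the timestamp stored on the action itself.
    pub fn handle_event(&mut self, event: ActionEvent, now: SystemTime) {
        match event {
            ActionEvent::ClientKeepAlive { operation_id } => {
                if let Some(action) = self.actions.get_mut(&operation_id) {
                    action.last_client_keepalive = now;
                }
            }
        }
    }

    /// Removes every action whose last keep-alive is at least the timeout
    /// old and returns the evicted operation ids.
    pub fn evict_timed_out(&mut self, now: SystemTime) -> Vec<String> {
        let timeout = self.client_action_timeout;
        let expired: Vec<String> = self
            .actions
            .iter()
            .filter(|(_, action)| {
                now.duration_since(action.last_client_keepalive)
                    .map_or(false, |age| age >= timeout)
            })
            .map(|(id, _)| id.clone())
            .collect();
        for id in &expired {
            self.actions.remove(id);
        }
        expired
    }
}
```

If `handle_event` did not write the timestamp, an action whose client keeps sending keep-alives would still be evicted once the original timestamp aged past the timeout, which is the kind of failure a reconnect test is meant to catch.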
How Has This Been Tested?
Fixed the existing test.
Checklist
bazel test //... passes locally
git amend (see some docs)