ClientKeepAlive update action ClientKeepAlive #1580

chrisstaite-menlo
Collaborator

@chrisstaite-menlo chrisstaite-menlo commented Feb 7, 2025

Description

When the scheduler was updated to add the keep alive to the AwaitedAction, the MemoryAwaitedActionDb was not updated to set it when a ClientKeepAlive was received.

Fix the test client_reconnect_keeps_action_alive, which was not performing the eviction because of optimisations in the filter_operations function; with the eviction restored, the test detects the issue.

Then update the ActionEvent::ClientKeepAlive event handler to update the client keep alive timestamp in the AwaitedAction.

Fixes #1579.
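
To make the described fix concrete, here is a minimal sketch of the idea, not the actual nativelink code. The struct layouts, the `last_client_keepalive` field, and the `handle_event` and `is_timed_out` helpers are hypothetical simplifications; only the names `AwaitedAction`, `MemoryAwaitedActionDb`, and `ActionEvent::ClientKeepAlive` come from the description above.

```rust
use std::collections::HashMap;
use std::time::{Duration, SystemTime};

// Hypothetical, simplified stand-in for the real AwaitedAction.
struct AwaitedAction {
    // Assumed field: when a client last signalled interest in this action.
    last_client_keepalive: SystemTime,
}

// Only the variant named in the description is modelled here.
enum ActionEvent {
    // Carries a (hypothetical) operation id.
    ClientKeepAlive(String),
}

// Hypothetical, simplified stand-in for the in-memory awaited action DB.
struct MemoryAwaitedActionDb {
    actions: HashMap<String, AwaitedAction>,
}

impl MemoryAwaitedActionDb {
    // On a keep alive event, refresh the timestamp stored in the action.
    // The description says this update was the missing piece.
    fn handle_event(&mut self, event: ActionEvent, now: SystemTime) {
        match event {
            ActionEvent::ClientKeepAlive(operation_id) => {
                if let Some(action) = self.actions.get_mut(&operation_id) {
                    action.last_client_keepalive = now;
                }
            }
        }
    }

    // An action counts as timed out once no keep alive has arrived within
    // `timeout`. Unknown operations are treated as timed out (an assumption).
    fn is_timed_out(&self, operation_id: &str, now: SystemTime, timeout: Duration) -> bool {
        match self.actions.get(operation_id) {
            Some(action) => now
                .duration_since(action.last_client_keepalive)
                .map(|elapsed| elapsed > timeout)
                .unwrap_or(false),
            None => true,
        }
    }
}
```

In this sketch, an action whose timestamp is never refreshed is reported as timed out even while a client keeps sending keep alives, which matches the symptom in the linked issue.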

Type of change

  • Bug fix (non-breaking change which fixes an issue)

How Has This Been Tested?

Fixed the existing test.

Checklist

  • Updated documentation if needed
  • Tests added/amended
  • bazel test //... passes locally
  • PR is contained in a single commit, using git amend (see some docs)

@CLAassistant
CLAassistant commented Feb 7, 2025

CLA assistant check
All committers have signed the CLA.

Member

@aaronmondal aaronmondal left a comment

Thanks!

@allada Could this affect the redis scheduler as well?

:lgtm:

Reviewed 4 of 4 files at r1, all commit messages.
Reviewable status: 1 of 1 LGTMs obtained, and all files reviewed, and pending CI: Bazel Dev / macos-13, Bazel Dev / macos-14, Bazel Dev / ubuntu-24.04, Cargo Dev / macos-13, Cargo Dev / ubuntu-22.04, Coverage, Installation / macos-13, Installation / macos-14, Local / lre-rs / macos-14, Remote / lre-cc / large-ubuntu-22.04, Remote / lre-rs / large-ubuntu-22.04, docker-compose-compiles-nativelink (22.04), windows-2022 / stable, and 1 discussions need to be resolved


nativelink-scheduler/src/simple_scheduler.rs line 374 at r1 (raw file):

        // tasks are going to be dropped all over the place, this isn't a good
        // setting.
        if client_action_timeout_s <= 10 {

nit: Seems like we could consolidate the CLIENT_KEEPALIVE_DURATIONs in `memory_awaited_action_db` and `store_awaited_action_db` and reuse that here as well.
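
One way the suggested consolidation might look is sketched below. This is an illustration under assumptions, not the project's code: the single shared `CLIENT_KEEPALIVE_DURATION` constant, its 10 second value (taken from the literal in the quoted check), and the `validate_client_action_timeout` helper with its error return are all hypothetical.

```rust
use std::time::Duration;

/// Hypothetical single definition that both awaited action DBs and the
/// scheduler would share. The 10 second value is an assumption based on the
/// literal in the quoted `client_action_timeout_s <= 10` check.
pub const CLIENT_KEEPALIVE_DURATION: Duration = Duration::from_secs(10);

/// Hypothetical helper: reject a client action timeout that is not longer
/// than the keep alive interval, since actions would otherwise expire
/// between keep alives.
pub fn validate_client_action_timeout(client_action_timeout_s: u64) -> Result<Duration, String> {
    if client_action_timeout_s <= CLIENT_KEEPALIVE_DURATION.as_secs() {
        return Err(format!(
            "client_action_timeout_s ({}) must be greater than the keep alive interval ({}s)",
            client_action_timeout_s,
            CLIENT_KEEPALIVE_DURATION.as_secs()
        ));
    }
    Ok(Duration::from_secs(client_action_timeout_s))
}
```

Comparing against the shared constant rather than a bare literal keeps the check in step with the keep alive interval if that interval ever changes.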

@chrisstaite-menlo
Collaborator Author

The Redis scheduler already handles this case; it was just a missing edge case in the in-memory DB.

Collaborator

@allada allada left a comment

Yeah, this is why we didn't catch this. I really need to get around to removing the memory scheduler and just implementing SchedulerStore for MemoryStore; I hate lugging around two implementations :-(

Thanks a lot Chris!

:lgtm:

Reviewed 1 of 4 files at r1, 4 of 4 files at r2, all commit messages.
Reviewable status: 2 of 1 LGTMs obtained, and all files reviewed, and pending CI: Bazel Dev / macos-13, Bazel Dev / macos-14, Bazel Dev / ubuntu-24.04, Cargo Dev / macos-13, Coverage, Installation / macos-13, Installation / macos-14, Local / lre-rs / macos-14, NativeLink.com Cloud / Remote Cache / ubuntu-24.04, Remote / lre-cc / large-ubuntu-22.04, Remote / lre-rs / large-ubuntu-22.04, docker-compose-compiles-nativelink (22.04), windows-2022 / stable

Collaborator

@allada allada left a comment

Reviewed 1 of 4 files at r1.
Reviewable status: 2 of 1 LGTMs obtained, and all files reviewed, and pending CI: Bazel Dev / macos-13, Bazel Dev / macos-14, Bazel Dev / ubuntu-24.04, Cargo Dev / macos-13, Coverage, Installation / macos-13, Installation / macos-14, Local / lre-rs / macos-14, NativeLink.com Cloud / Remote Cache / ubuntu-24.04, Remote / lre-cc / large-ubuntu-22.04, Remote / lre-rs / large-ubuntu-22.04, docker-compose-compiles-nativelink (22.04), windows-2022 / stable

@chrisstaite-menlo chrisstaite-menlo enabled auto-merge (squash) February 7, 2025 17:06
@chrisstaite-menlo chrisstaite-menlo merged commit 7afe286 into TraceMachina:main Feb 7, 2025
36 checks passed
Development

Successfully merging this pull request may close these issues.

Action times out if client is listening
4 participants