Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

remote: remove local fallback for remote execution #7202

Open
Tracked by #19904
buchgr opened this issue Jan 21, 2019 · 9 comments
Open
Tracked by #19904

remote: remove local fallback for remote execution #7202

buchgr opened this issue Jan 21, 2019 · 9 comments
Labels
P2 We'll consider working on this in future. (Assignee optional) team-Remote-Exec Issues and PRs for the Execution (Remote) team type: bug

Comments

@buchgr
Copy link
Contributor

buchgr commented Jan 21, 2019

The --remote_local_fallback flag falls back to local execution if there was some error (i.e. network) running the action remotely. This is inherently unsafe as the local and remote execution environment might not be identical.

@buchgr buchgr added team-Remote-Exec Issues and PRs for the Execution (Remote) team type: bug P2 We'll consider working on this in future. (Assignee optional) labels Jan 21, 2019
@ishikhman ishikhman self-assigned this Mar 21, 2019
@werkt
Copy link
Contributor

werkt commented Mar 27, 2019

remote_local_fallback has incidentally been broken since 1532df0, where the catches with fallback responses (and the reasonable error handling and reporting) are not catching the now-RuntimeError that were previously RetryException, and caught by IOException handlers.

This is incredibly broken even without the fallback, and could be addressed, but will not improve the error handling here, which now dies in the skyframe AbstractParallelEvaluator when, for instance, a DEADLINE_EXCEEDED occurs during download.

@ishikhman
Copy link
Contributor

ishikhman commented Apr 1, 2019

Link to the email announcement: proposal

One of the use cases discovered for a local fallback is when cache download/upload reached a Timeout.
Potential solution to this problem in #7590 with a progressive timeouts (throw exception when nothing was uploaded within X seconds and NOT when the file was not uploaded within Y seconds).

@nicolov
Copy link
Contributor

nicolov commented May 8, 2019

I think local fallback is valuable, if only in situations with poor network connectivity.

Instead of removing the flag altogether, we can make it so --remote_local_fallback will also toggle --remote_upload_local_results=false to prevent misconfigured builds from breaking (which was the original intent of this PR).

If set explicitly, we should also allow local_fallback=true and upload_local_results=true as it might be useful in CI to maximize reliability of the builds. With properly configured C++ toolchains, we've been sharing artifacts among Ubuntu 14 and 18 with no issues at all.

@ishikhman
Copy link
Contributor

ishikhman commented May 9, 2019

I think local fallback is valuable, if only in situations with poor network connectivity.

Instead of removing the flag altogether, we can make it so --remote_local_fallback will also toggle --remote_upload_local_results=false to prevent misconfigured builds from breaking (which was the original intent of this PR).

I like the idea! In general, discussion was paused for a while, because we realized that a local_fallback is actually a very useful feature in some cases.

One of the main concerns is a risk of a cache poisoning, but I don't want to completely disable caching for locally executed actions. Especially considering that there are only a few use cases of the feature known to us and for them caching local actions is a safe thing to do.

If set explicitly, we should also allow local_fallback=true and upload_local_results=true as it might be useful in CI to maximize reliability of the builds. With properly configured C++ toolchains, we've been sharing artifacts among Ubuntu 14 and 18 with no issues at all.

The only concern here is how could we identify whether --remote_upload_local_results=true was set explicitly or is it a default value (it is true by default)? A solution that comes to my mind is another flag, which I'd like to avoid - we have too many flags already.

@nicolov
Copy link
Contributor

nicolov commented May 10, 2019

The only concern here is how could we identify whether --remote_upload_local_results=true was set explicitly or is it a default value

Why would you need to do that?

@ishikhman
Copy link
Contributor

Why would you need to do that?

In order to know whether the flag was set explicitly or not. You've suggested to toggle it to false unless it was explicitly set to true. So how would we know whether it should be flipped to false or not?

@github-actions
Copy link

Thank you for contributing to the Bazel repository! This issue has been marked as stale since it has not had any activity in the last 2.5 years. It will be closed in the next 14 days unless any other activity occurs or one of the following labels is added: "not stale", "awaiting-bazeler". Please reach out to the triage team (@bazelbuild/triage) if you think this issue is still relevant or you are interested in getting the issue resolved.

@github-actions github-actions bot added the stale Issues or PRs that are stale (no activity for 30 days) label Mar 16, 2023
@brentleyjones
Copy link
Contributor

I think this is still valid @coeuvre?

@github-actions github-actions bot removed the stale Issues or PRs that are stale (no activity for 30 days) label Mar 17, 2023
@coeuvre
Copy link
Member

coeuvre commented Mar 29, 2023

That's interesting. I believe local and remote should be identical otherwise dynamic execution wouldn't work.

cc @tjgq for the work on multiple platform dynamic execution.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
P2 We'll consider working on this in future. (Assignee optional) team-Remote-Exec Issues and PRs for the Execution (Remote) team type: bug
Projects
None yet
Development

No branches or pull requests

6 participants