[WIP] [internal] introduce reusable_input_digests API for Process #13271

Conversation
```diff
@@ -65,6 +65,7 @@ class Process:
     use_nailgun: Digest
     execution_slot_variable: str | None
     cache_scope: ProcessCacheScope
+    reusable_input_digests: FrozenDict[str, Digest]
```
Bike shedding the name. cc @Eric-Arellano @stuhood @illicitonion
I'm not following what the key is in this dictionary. Reading the code, it sounds like the "relative path" prefix? Keep in mind that would mean only being able to have one digest associated with each path.

I wonder if this could simply be `tuple[Digest, ...]`?
--
To help with the question of naming this... my question in the design doc would change my suggestion, I think. When would you not want to use this feature? It seems like there would be very little downside as long as you're confident the content is immutable? If that's the case, the name should probably bias people towards using it.
> I'm not following what the key is in this dictionary. Reading the code, it sounds like the "relative path" prefix?

Yes, it is the relative path at which to make the `Digest` available in the input root. Note that the potential caching optimizations are only going to work if the `Digest` is mounted via a single symlink or bind mount to the persistent cache. This means that the `Digest` needs to be put "under" a relative path. A `Digest` with multiple top-level files / directories would not be usable with those caching optimizations.

> Keep in mind that would mean only being able to have one digest associated with each path.

Correct. And that will allow an executor using a caching optimization to verify the `Digest`'s contents against its persistent cache.
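To illustrate the shape being discussed, here is a hypothetical sketch — the digests and path names are invented for illustration, not taken from this PR:

```python
from pants.engine.fs import Digest
from pants.util.frozendict import FrozenDict

# Placeholder digests (fingerprint, serialized length); in practice these
# would come from engine intrinsics such as downloading or extracting an
# archive. Each is assumed to hold a single top-level directory so that an
# executor could later mount it with one symlink or bind mount.
go_sdk_digest = Digest("a" * 64, 1)
protoc_digest = Digest("b" * 64, 1)

reusable_input_digests = FrozenDict(
    {
        # relative path in the input root -> immutable Digest to expose there
        "go_sdk": go_sdk_digest,  # e.g. go_sdk/bin/go, go_sdk/pkg/...
        "protoc": protoc_digest,  # e.g. protoc/bin/protoc
    }
)
# Because the keys are paths, each path maps to exactly one digest.
```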
Semantically, I think that calling these `immutable_input_digests` would more clearly specify their constraints, and better align with "append_only" (which is attempting to convey that caches must only grow). Otherwise, the first thought after seeing "reusable" is "don't I want everything to be reusable? I'll just put everything in there".
Still WIP since I need to do some work on the remote cache and remote execution code paths.
The API and how it relates to cache keys is interesting here... It seems a bit weird that whether something was declared as reusable factors into the cache key, and it also seems a little strange that to work out the input `Digest` of an action you'd need to merge multiple digests... I quite like the "there's just one digest of an input root" thing we had before, and I'm slightly tempted to introduce a new node type, so that we'd have either:

```protobuf
message Directory {
  repeated FileNode files = 1;
  repeated DirectoryNode directories = 2;
  ...
  repeated ReusableDirectoryNode reusable_directories = 99;
}
```

or:

```protobuf
message DirectoryNode {
  string name = 1;
  Digest digest = 2;
  bool is_reusable = 3;
}
```

but that would raise a lot of questions around serialisation and cacheability... Presumably we'd need to forget that field (or treat it like a …
The input root digest potentially differing from the input root digest in the actual remote execution request is already the case because of the …
```rust
// Merge reusable input digests into `input_files`.
// TODO: Symlink or bind mount reusable inputs into the input root from a persistent
// cache (which would occur after materializing the input files).
let merged_input_digest =
  merge_reusable_input_digests(&store2, input_files, reusable_input_digests)
    .map_err(|err| format!("Unable to merge digest: {:?}", err))
    .boxed()
    .await?;
```
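For intuition, the semantics of this merge match what Pants' existing Python `AddPrefix` and `MergeDigests` requests express: nest each reusable digest under its relative-path key, then merge the results with the regular input files. A rough Python sketch of the equivalent — the helper name is hypothetical, and `Get` only works inside a `@rule`:

```python
from pants.engine.fs import AddPrefix, Digest, MergeDigests
from pants.engine.rules import Get


async def merge_reusable_input_digests(
    input_files: Digest, reusable_input_digests: dict[str, Digest]
) -> Digest:
    # Nest each reusable digest under the relative path that keys it.
    prefixed = [
        await Get(Digest, AddPrefix(digest, path))
        for path, digest in reusable_input_digests.items()
    ]
    # Merge the prefixed digests with the regular input files into one root.
    return await Get(Digest, MergeDigests([input_files, *prefixed]))
```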
Doing this merging in each `Runner` is error-prone: in particular, this change currently does not include the new digest in the remexec digest for the process/action, which is calculated here (src/rust/engine/process_execution/src/lib.rs, lines 574 to 596 at 83404c6):

```rust
// TODO(#8513) possibly move to the MEPR struct, or to the hashing crate?
pub fn digest(req: MultiPlatformProcess, metadata: &ProcessMetadata) -> Digest {
  let mut hashes: Vec<String> = req
    .0
    .values()
    .map(|process| crate::remote::make_execute_request(process, metadata.clone()).unwrap())
    .map(|(_a, _b, er)| {
      er.action_digest
        .map(|d| d.hash)
        .unwrap_or_else(|| EMPTY_FINGERPRINT.to_hex())
    })
    .collect();
  hashes.sort();
  Digest::of_bytes(
    hashes
      .iter()
      .fold(String::new(), |mut acc, hash| {
        acc.push_str(hash);
        acc
      })
      .as_bytes(),
  )
}
```
... and even if all Runners are updated to merge, they will re-merge for every attempt to execute.
A less error-prone approach might be to give `Process` a new calculated field for "the complete input digest", which would force constructors of `Process`es to execute the merge themselves. Alternatively, add a high-level constructor (or builder) that computes the field. There have been some steps toward adding the builder pattern to `Process`, but unfortunately they only focused on field optionality, and so they take `&mut Process` rather than being on a separate builder... sorry about that: lesson learned.
To unblock #13787, I'm going to take another swing at "digest merging" from this angle: the public API in this PR should merge cleanly with that though.
I'm also fine with abandoning this PR. It was really more of a concept PR than anything else. Thoughts?
Motivation

While developing the Go backend for Pants, we found that the time spent unpacking the Go SDK (100+ MB) into the input root dominated the execution time of each `Process` invocation, at 60–70% of total execution time. This made it impractical to use a downloaded Go SDK rather than a pre-installed, system-wide Go SDK. We would like to eventually be able to download the Go SDK again, but need a way to avoid the performance cost.

Solution

Introduce the notion of a "reusable input digest": a digest that is expected to be reused between `Process` invocations. This is a hint to the process executor that the digest is immutable and that the executor may optimize as it chooses. Reusable input digests are useful for tools (like the Go SDK) which will not change between `Process` invocations.

This initial implementation just merges the digests into the input root and does not implement any optimization (which will come later for the local executor).
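To make the intended use concrete, here is a rough sketch of how a Go rule might construct such a `Process`. It assumes a `go_sdk_digest` previously obtained by downloading and extracting the SDK under a single `go_sdk` directory; the argv, env, and function name are illustrative, not from this PR:

```python
from pants.engine.fs import Digest
from pants.engine.process import Process
from pants.util.frozendict import FrozenDict


def make_go_build_process(sources_digest: Digest, go_sdk_digest: Digest) -> Process:
    return Process(
        argv=("go_sdk/go/bin/go", "build", "./..."),
        input_digest=sources_digest,  # just the package sources
        # Hint: the SDK digest is immutable and reused across invocations, so
        # an executor may symlink or bind-mount it from a persistent cache
        # instead of re-materializing 100+ MB for every Process.
        reusable_input_digests=FrozenDict({"go_sdk": go_sdk_digest}),
        env={"GOROOT": "go_sdk/go"},  # illustrative only
        description="Build Go package with a downloaded SDK",
    )
```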