
[WIP] [internal] introduce reusable_input_digests API for Process #13271

Closed
wants to merge 3 commits

Conversation

@tdyas (Contributor) commented Oct 15, 2021

Motivation

While developing the Go backend for Pants, we found that the time spent unpacking the Go SDK (100+ MB) into the input root dominated each Process invocation, at 60-70% of total execution time. This made it impractical to use a downloaded Go SDK versus a pre-installed, system-wide one. We would eventually like to return to downloading the Go SDK, but need a way to avoid this performance cost.

Solution

Introduce the notion of "reusable input digests": digests that are expected to be reused across Process invocations. This is a hint to a process executor that the digests are immutable and that the executor may optimize as it chooses. They are useful for tools (like the Go SDK) which will not change between Process invocations.

This initial implementation just merges the digests into the input root and does not implement any optimization (which will come later for the local executor).
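
As a concrete illustration of the intended use, a rule could mount a downloaded Go SDK under a fixed relative path so the executor may reuse it across invocations. This is a minimal sketch against the API in this PR; the __go_sdk mount path and the surrounding helper are hypothetical:

from pants.engine.fs import Digest
from pants.engine.process import Process
from pants.util.frozendict import FrozenDict

def make_go_compile_process(go_sdk_digest: Digest, sources_digest: Digest) -> Process:
    # The SDK digest is immutable across invocations, so it is passed as a
    # reusable input digest rather than merged into input_digest by the caller.
    return Process(
        argv=["__go_sdk/bin/go", "build", "./..."],
        input_digest=sources_digest,
        reusable_input_digests=FrozenDict({"__go_sdk": go_sdk_digest}),
        description="Compile Go package",
    )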

Tom Dyas added 3 commits October 15, 2021 13:29
[ci skip-build-wheels]
[ci skip-build-wheels]
[ci skip-build-wheels]
@@ -65,6 +65,7 @@ class Process:
use_nailgun: Digest
execution_slot_variable: str | None
cache_scope: ProcessCacheScope
reusable_input_digests: FrozenDict[str, Digest]
tdyas (Contributor, Author):
Bikeshedding the name. cc @Eric-Arellano @stuhood @illicitonion

Contributor:

I'm not following what the key in this dictionary is. Reading the code, it sounds like the "relative path" prefix? Keep in mind that would mean only being able to associate one digest with each path.

I wonder whether this could simply be tuple[Digest, ...]?

--

To help with naming this: the answer to my question in the design doc would change my suggestion, I think. When would you not want to use this feature? There seems to be very little downside as long as you're confident the content is immutable. If that's the case, the name should probably bias people towards using it.

tdyas (Contributor, Author):

> I'm not following what the key in this dictionary is. Reading the code, it sounds like the "relative path" prefix?

Yes, it is the relative path at which to make the Digest available in the input root. Note that the potential caching optimizations will only work if the Digest is mounted via a single symlink or bind mount into the persistent cache. This means the Digest needs to be put "under" a relative path; a Digest with multiple top-level files or directories would not be usable with those caching optimizations.

> Keep in mind that would mean only being able to associate one digest with each path.

Correct. And that will allow an executor using a caching optimization to verify the Digest's contents against its persistent cache.
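
To make that concrete, a local executor could materialize each reusable digest once into a persistent, content-addressed cache and expose it in the sandbox via a single symlink. The following is a minimal sketch of that optimization, not implemented in this PR; the materialize_digest helper is hypothetical:

import os

from pants.engine.fs import Digest

def mount_reusable_inputs(
    sandbox: str, cache_dir: str, reusable_input_digests: dict[str, Digest]
) -> None:
    # For each (relative path -> Digest) entry, unpack the digest into the
    # persistent cache at most once, then symlink it into the sandbox.
    # Copying 100+ MB per Process invocation is replaced by one symlink.
    for relpath, digest in reusable_input_digests.items():
        cached = os.path.join(cache_dir, digest.fingerprint)
        if not os.path.isdir(cached):
            materialize_digest(digest, cached)  # hypothetical helper
        os.symlink(cached, os.path.join(sandbox, relpath))

This is also why each path maps to exactly one Digest: the executor can verify the cached directory against that single fingerprint before reusing it.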

@stuhood (Member) commented Dec 3, 2021:

Semantically, I think that calling these immutable_input_digests would more clearly specify their constraints, and better align with "append_only" (which attempts to convey that caches must only grow). Otherwise, the first thought after seeing "reusable" is "don't I want everything to be reusable? I'll just put everything in there".

@tdyas (Contributor, Author) commented Oct 15, 2021:

Still WIP since I need to do some work on the remote cache and remote execution code paths.

@illicitonion (Contributor) commented:

The API and how it relates to cache keys is interesting here... It seems a bit weird that whether something was declared as reusable factors into the cache key, and it also seems a little strange that to work out the input Digest of an action you'd need to merge multiple digests... I quite like the "There's just one digest of an input root" thing we had before, and I'm slightly tempted to introduce a new node-type, so that either we'd have:

message Directory {
  repeated FileNode files = 1;
  repeated DirectoryNode directories = 2;
  ...
  repeated ReusableDirectoryNode reusable_directories = 99;
}

or

message DirectoryNode {
  string name = 1;

  Digest digest = 2;

  bool is_reusable = 3;
}

but that would raise a lot of questions around serialisation and cacheability... Presumably we'd need to forget that field (or treat it like a DirectoryNode) when serialising for remote execution (well, unless the remote execution system was aware of the mechanism)... Unclear whether we'd want to be sensitive to it at other times... It's a bit of a weird one.

@tdyas (Contributor, Author) commented Oct 27, 2021:

> It seems a bit weird that whether something was declared as reusable factors into the cache key, and it also seems a little strange that to work out the input Digest of an action you'd need to merge multiple digests.

It is already the case that the effective input root digest can differ from the input root digest in the actual remote execution request, because the use_nailgun attribute is a Digest that gets merged in. Moreover, there is already precedent for Pants core rules adding to fields of a Process; for example, environment variables.
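
For reference, the existing merging pattern in core rule code looks roughly like this. MergeDigests is the real intrinsic; the helper itself is illustrative and would run inside a rule:

from pants.engine.fs import Digest, MergeDigests
from pants.engine.rules import Get

async def effective_input_root(input_digest: Digest, nailgun_digest: Digest) -> Digest:
    # The digest actually materialized for execution may already differ from
    # Process.input_digest, e.g. when a nailgun server digest is merged in.
    return await Get(Digest, MergeDigests([input_digest, nailgun_digest]))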

Comment on lines +656 to +663
// Merge reusable input digests into `input_files`.
// TODO: Symlink or bind mount reusable inputs into the input root from a persistent
// cache (which would occur after materializing the input files).
let merged_input_digest =
  merge_reusable_input_digests(&store2, input_files, reusable_input_digests)
    .map_err(|err| format!("Unable to merge digest: {:?}", err))
    .boxed()
    .await?;
Member:

Doing this merging in each Runner is error-prone: in particular, this change currently does not include the new digest in the remexec digest for the process/action, which is calculated here:

// TODO(#8513) possibly move to the MEPR struct, or to the hashing crate?
pub fn digest(req: MultiPlatformProcess, metadata: &ProcessMetadata) -> Digest {
  let mut hashes: Vec<String> = req
    .0
    .values()
    .map(|process| crate::remote::make_execute_request(process, metadata.clone()).unwrap())
    .map(|(_a, _b, er)| {
      er.action_digest
        .map(|d| d.hash)
        .unwrap_or_else(|| EMPTY_FINGERPRINT.to_hex())
    })
    .collect();
  hashes.sort();
  Digest::of_bytes(
    hashes
      .iter()
      .fold(String::new(), |mut acc, hash| {
        acc.push_str(hash);
        acc
      })
      .as_bytes(),
  )
}

... and even if all Runners are updated to merge, they will re-merge for every attempt to execute.


A less error-prone approach might be to give Process a new calculated field for "the complete input digest", which would force constructors of Processes to execute the merge themselves. Alternatively, add a high-level constructor (or builder) that computes the field. There have been some steps toward adding the builder pattern to Process, but unfortunately they focused only on field optionality, and so they take &mut Process rather than living on a separate builder... sorry about that: lesson learned.
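
A rough sketch of that suggestion on the Python side. All names here are hypothetical, and merge_digests stands in for whatever merge primitive the constructor would use:

from dataclasses import dataclass

from pants.engine.fs import Digest
from pants.engine.process import Process

@dataclass(frozen=True)
class CompleteProcess:
    process: Process
    # Computed once at construction time, so every runner and the action
    # digest calculation see the same, already-merged input root.
    complete_input_digest: Digest

    @classmethod
    def create(cls, process: Process, extra_digests: list[Digest]) -> "CompleteProcess":
        merged = merge_digests([process.input_digest, *extra_digests])  # hypothetical
        return cls(process=process, complete_input_digest=merged)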

Member:

To unblock #13787, I'm going to take another swing at "digest merging" from this angle: the public API in this PR should merge cleanly with that though.

tdyas (Contributor, Author):

I'm also fine with abandoning this PR. It was really more of a concept PR than anything else. Thoughts?

@stuhood (Member) commented Dec 8, 2021:

I've started from this change to implement #12716: thanks @tdyas!

@stuhood stuhood closed this Dec 8, 2021
@tdyas tdyas deleted the reusable_cache_api branch December 20, 2021 10:09