
[WIP] [internal] introduce reusable_input_digests API for Process #13271

Closed
wants to merge 3 commits

Conversation

@tdyas (Contributor) commented Oct 15, 2021

Motivation

While developing the Go backend for Pants, we found that the time spent unpacking the Go SDK (100+ MB) into the input root dominated each Process invocation, at 60-70% of total execution time. This made it impractical to use a downloaded Go SDK versus a pre-installed, system-wide one. We would eventually like to return to downloading the Go SDK, but need a way to avoid this performance cost.

Solution

Introduce the notion of "reusable input digests": digests that are expected to be reused across Process invocations. This is a hint to a process executor that the digests are immutable and that the executor may optimize as it chooses. They are useful for tools (like the Go SDK) which will not change between Process invocations.

This initial implementation just merges the digests into the input root and does not implement any optimization (which will come later for the local executor).
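
As a concrete illustration of the intended use, a rule could mount a downloaded Go SDK under a fixed relative path so the executor may reuse it across invocations. This is a minimal sketch against the API in this PR; the __go_sdk mount path and the surrounding helper are hypothetical:

from pants.engine.fs import Digest
from pants.engine.process import Process
from pants.util.frozendict import FrozenDict

def make_go_compile_process(go_sdk_digest: Digest, sources_digest: Digest) -> Process:
    # The SDK digest is immutable across invocations, so it is passed as a
    # reusable input digest rather than merged into input_digest by the caller.
    return Process(
        argv=["__go_sdk/bin/go", "build", "./..."],
        input_digest=sources_digest,
        reusable_input_digests=FrozenDict({"__go_sdk": go_sdk_digest}),
        description="Compile Go package",
    )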

Tom Dyas added 3 commits October 15, 2021 13:29
[ci skip-build-wheels]
[ci skip-build-wheels]
[ci skip-build-wheels]
@@ -65,6 +65,7 @@ class Process:
use_nailgun: Digest
execution_slot_variable: str | None
cache_scope: ProcessCacheScope
reusable_input_digests: FrozenDict[str, Digest]
tdyas (Contributor, Author):
Bikeshedding the name. cc @Eric-Arellano @stuhood @illicitonion

Contributor:

I'm not following what the key in this dictionary is. Reading the code, it sounds like the "relative path" prefix? Keep in mind that would mean only being able to associate one digest with each path.

I wonder whether this could simply be tuple[Digest, ...]?

--

To help with naming this: the answer to my question in the design doc would change my suggestion, I think. When would you not want to use this feature? There seems to be very little downside as long as you're confident the content is immutable. If that's the case, the name should probably bias people towards using it.

tdyas (Contributor, Author):

> I'm not following what the key in this dictionary is. Reading the code, it sounds like the "relative path" prefix?

Yes, it is the relative path at which to make the Digest available in the input root. Note that the potential caching optimizations will only work if the Digest is mounted via a single symlink or bind mount into the persistent cache. This means the Digest needs to be put "under" a relative path; a Digest with multiple top-level files or directories would not be usable with those caching optimizations.

> Keep in mind that would mean only being able to associate one digest with each path.

Correct. And that will allow an executor using a caching optimization to verify the Digest's contents against its persistent cache.
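
To make that concrete, a local executor could materialize each reusable digest once into a persistent, content-addressed cache and expose it in the sandbox via a single symlink. The following is a minimal sketch of that optimization, not implemented in this PR; the materialize_digest helper is hypothetical:

import os

from pants.engine.fs import Digest

def mount_reusable_inputs(
    sandbox: str, cache_dir: str, reusable_input_digests: dict[str, Digest]
) -> None:
    # For each (relative path -> Digest) entry, unpack the digest into the
    # persistent cache at most once, then symlink it into the sandbox.
    # Copying 100+ MB per Process invocation is replaced by one symlink.
    for relpath, digest in reusable_input_digests.items():
        cached = os.path.join(cache_dir, digest.fingerprint)
        if not os.path.isdir(cached):
            materialize_digest(digest, cached)  # hypothetical helper
        os.symlink(cached, os.path.join(sandbox, relpath))

This is also why each path maps to exactly one Digest: the executor can verify the cached directory against that single fingerprint before reusing it.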

@stuhood (Member) commented Dec 3, 2021:

Semantically, I think that calling these immutable_input_digests would more clearly specify their constraints, and better align with "append_only" (which attempts to convey that caches must only grow). Otherwise, the first thought after seeing "reusable" is "don't I want everything to be reusable? I'll just put everything in there".

@tdyas (Contributor, Author) commented Oct 15, 2021:

Still WIP since I need to do some work on the remote cache and remote execution code paths.

@illicitonion (Contributor) commented:

The API and how it relates to cache keys is interesting here... It seems a bit weird that whether something was declared as reusable factors into the cache key, and it also seems a little strange that to work out the input Digest of an action you'd need to merge multiple digests... I quite like the "There's just one digest of an input root" thing we had before, and I'm slightly tempted to introduce a new node-type, so that either we'd have:

message Directory {
  repeated FileNode files = 1;
  repeated DirectoryNode directories = 2;
  ...
  repeated ReusableDirectoryNode reusable_directories = 99;
}

or

message DirectoryNode {
  string name = 1;

  Digest digest = 2;

  bool is_reusable = 3;
}

but that would raise a lot of questions around serialisation and cacheability... Presumably we'd need to forget that field (or treat it like a DirectoryNode) when serialising for remote execution (well, unless the remote execution system was aware of the mechanism)... Unclear whether we'd want to be sensitive to it at other times... It's a bit of a weird one.

@tdyas (Contributor, Author) commented Oct 27, 2021:

> It seems a bit weird that whether something was declared as reusable factors into the cache key, and it also seems a little strange that to work out the input Digest of an action you'd need to merge multiple digests.

It is already the case that the effective input root digest can differ from the input root digest in the actual remote execution request, because the use_nailgun attribute is a Digest that gets merged in. Moreover, there is already precedent for Pants core rules adding to fields of a Process; for example, environment variables.
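
For reference, the existing merging pattern in core rule code looks roughly like this. MergeDigests is the real intrinsic; the helper itself is illustrative and would run inside a rule:

from pants.engine.fs import Digest, MergeDigests
from pants.engine.rules import Get

async def effective_input_root(input_digest: Digest, nailgun_digest: Digest) -> Digest:
    # The digest actually materialized for execution may already differ from
    # Process.input_digest, e.g. when a nailgun server digest is merged in.
    return await Get(Digest, MergeDigests([input_digest, nailgun_digest]))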

Comment on lines +656 to +663
// Merge reusable input digests into `input_files`.
// TODO: Symlink or bind mount reusable inputs into the input root from a persistent
// cache (which would occur after materializing the input files).
let merged_input_digest =
  merge_reusable_input_digests(&store2, input_files, reusable_input_digests)
    .map_err(|err| format!("Unable to merge digest: {:?}", err))
    .boxed()
    .await?;
Member:

Doing this merging in each Runner is error-prone: in particular, this change currently does not include the new digest in the remexec digest for the process/action, which is calculated here:

// TODO(#8513) possibly move to the MEPR struct, or to the hashing crate?
pub fn digest(req: MultiPlatformProcess, metadata: &ProcessMetadata) -> Digest {
  let mut hashes: Vec<String> = req
    .0
    .values()
    .map(|process| crate::remote::make_execute_request(process, metadata.clone()).unwrap())
    .map(|(_a, _b, er)| {
      er.action_digest
        .map(|d| d.hash)
        .unwrap_or_else(|| EMPTY_FINGERPRINT.to_hex())
    })
    .collect();
  hashes.sort();
  Digest::of_bytes(
    hashes
      .iter()
      .fold(String::new(), |mut acc, hash| {
        acc.push_str(hash);
        acc
      })
      .as_bytes(),
  )
}

... and even if all Runners are updated to merge, they will re-merge for every attempt to execute.


A less error-prone approach might be to give Process a new calculated field for "the complete input digest", which would force constructors of Processes to execute the merge themselves. Alternatively, add a high-level constructor (or builder) that computes the field. There have been some steps toward adding the builder pattern to Process, but unfortunately they focused only on field optionality, and so they take &mut Process rather than living on a separate builder... sorry about that: lesson learned.
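
A rough sketch of that suggestion on the Python side. All names here are hypothetical, and merge_digests stands in for whatever merge primitive the constructor would use:

from dataclasses import dataclass

from pants.engine.fs import Digest
from pants.engine.process import Process

@dataclass(frozen=True)
class CompleteProcess:
    process: Process
    # Computed once at construction time, so every runner and the action
    # digest calculation see the same, already-merged input root.
    complete_input_digest: Digest

    @classmethod
    def create(cls, process: Process, extra_digests: list[Digest]) -> "CompleteProcess":
        merged = merge_digests([process.input_digest, *extra_digests])  # hypothetical
        return cls(process=process, complete_input_digest=merged)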

Member:

To unblock #13787, I'm going to take another swing at "digest merging" from this angle: the public API in this PR should merge cleanly with that though.

tdyas (Contributor, Author):

I'm also fine with abandoning this PR. It was really more of a concept PR than anything else. Thoughts?

@stuhood (Member) commented Dec 8, 2021:

I've started from this change to implement #12716: thanks @tdyas!

@stuhood stuhood closed this Dec 8, 2021
@tdyas tdyas deleted the reusable_cache_api branch December 20, 2021 10:09