2nd attempt at Reproducible out folder contents #4642

Open
wants to merge 21 commits into
base: main

lihaoyi
Member

@lihaoyi lihaoyi commented Mar 4, 2025

The biggest challenge with making the out/ folder reproducible is removing absolute paths. This is hard for several reasons:

  1. Paths are everywhere, sometimes as os.Paths, sometimes in JSON blobs written to disk, sometimes in strings like "-Xplugin=...".

  2. These paths will then get interpreted by in-memory libraries (e.g. Zinc), Scala subprocesses we control (e.g. mill.testrunner.TestRunnerMain), or subprocesses we do not control (e.g. native-image).

  3. These subprocesses may run in a variety of different working directories: some in the workspace root, some in a .dest folder, or elsewhere.

  4. The same path may be passed to different subprocesses, written in different languages, running in different working directories, and should behave the same way.

All of the above works fine if we are using absolute paths: an absolute path means the same thing regardless of how it is serialized or who it is passed to. That is the reason OS-Lib uses absolute os.Paths by default. However, absolute paths are not reproducible: one person running code in /Users/lihaoyi/mill and another running it in /Users/someone-else/mill will have different absolute paths, and would thus be unable to share caches keyed on the hashes of those paths.

Serializing Absolute Paths to Relative Paths

In order to make this work, we need to serialize absolute paths as relative paths. As we do not in general know where a serialized path is going to end up being used, we need to serialize absolute paths as relative paths that reference the same final destination regardless of the cwd of the process it is passed to. We can do this as follows:

  1. Serialize all absolute paths relative to some stable root folder, e.g. anything in os.home / "foo/bar/baz" gets serialized as "out/mill-home/foo/bar/baz", anything in Task.workspace / "foo/bar/baz" gets serialized as "out/mill-workspace/foo/bar/baz"

  2. Whenever we spawn external processes, we synthesize out/mill-home and out/mill-workspace symlinks that point to their respective destinations os.home and Task.workspace

  3. Any subprocess that reads in these paths and then tries to dereference "out/mill-home/foo/bar/baz" or "out/mill-workspace/foo/bar/baz", regardless of language, will follow the filesystem symlink and end up reading from the right place on disk. (The notable exception is Scala subprocesses using OS-Lib, which is stricter than most platforms about deserializing relative paths as absolute paths.)

This is similar to the approach taken by Bazel's Symlink Sandbox, which generates local symlinks in the working directory of any subprocess that Bazel spawns.
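The mapping in step 1 can be sketched roughly as follows. This is only an illustration using java.nio.file rather than OS-Lib; the PathMapping object, its Root case class and both method names are hypothetical and are not taken from this PR.

```scala
import java.nio.file.{Path, Paths}

// Hypothetical helper, not Mill's actual code: maps absolute paths to
// prefixed relative strings and back again.
object PathMapping {
  // Roots are checked in order, so list the more specific root first
  // (a workspace usually lives somewhere inside the home directory).
  final case class Root(prefix: String, dir: Path)

  def serialize(p: Path, roots: Seq[Root]): String =
    roots.collectFirst {
      case Root(prefix, dir) if p.startsWith(dir) =>
        val rel = dir.relativize(p).toString.replace('\\', '/')
        if (rel.isEmpty) prefix else s"$prefix/$rel"
    }.getOrElse(p.toString) // outside every root: left absolute

  def deserialize(s: String, roots: Seq[Root]): Path =
    roots.collectFirst {
      case Root(prefix, dir) if s == prefix => dir
      case Root(prefix, dir) if s.startsWith(prefix + "/") =>
        dir.resolve(s.stripPrefix(prefix + "/"))
    }.getOrElse(Paths.get(s)) // no known prefix: parsed as-is
}
```

With roots for /Users/lihaoyi/mill on one machine and /Users/someone-else/mill on another, a file at foo/bar/baz inside either workspace serializes to the same string, "out/mill-workspace/foo/bar/baz", which is the property the cache needs.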

Implementation Details

  1. We hook into OS-Lib to control the serialization of os.Path, allowing it to be written out as prefixed relative paths, and to be read back in from those prefixed relative paths.

  2. We also hook into OS-Lib to create the out/mill-home and out/mill-workspace symlinks every time you spawn a process, in that process' working directory
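As a rough illustration of the second point, the symlink setup before spawning a process might look like the sketch below. It uses plain java.nio.file instead of the OS-Lib hook described above, and SymlinkSetup.ensureLinks is a made-up name, so treat it as an assumption about the shape of the logic rather than the code in this PR.

```scala
import java.nio.file.{Files, Path}

// Hypothetical sketch: create <cwd>/out/mill-home and
// <cwd>/out/mill-workspace as symlinks before a subprocess starts in cwd.
object SymlinkSetup {
  def ensureLinks(cwd: Path, home: Path, workspace: Path): Unit = {
    val out = cwd.resolve("out")
    Files.createDirectories(out)
    Seq("mill-home" -> home, "mill-workspace" -> workspace).foreach {
      case (name, target) =>
        val link = out.resolve(name)
        // An existing symlink is treated as already set up. A regular file
        // or directory with the same name would make createSymbolicLink throw.
        if (!Files.isSymbolicLink(link)) Files.createSymbolicLink(link, target)
    }
  }
}
```

After this runs, a subprocess started in cwd can open "out/mill-workspace/foo/bar/baz" with ordinary file APIs and the filesystem resolves it to the real workspace file.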

Limitations

This approach to making Mill's paths reproducible suffers from many of the same weaknesses that Bazel has. In general, the symlinks work 99% of the time, but once in a while something can go wrong:

  1. The symlinks can be set up for any subprocess Mill creates via OS-Lib, but we cannot instrument subprocesses created by java.lang.ProcessBuilder, which may include subprocesses spawned by third-party libraries we use

  2. Any transitive subprocesses spawned by the subprocesses that Mill creates are out of our control, and thus may not have the proper symlinks set up if the working directory differs from their parent

  3. The symlinks are not transparent: user code will be able to see that the mill-workspace folder is a symlink, and some code may not behave correctly when traversing symlinks

  4. Some code demands absolute paths and provides no alternative, e.g. the native-image binary for generating Graal executables.

  5. Any subprocesses using OS-Lib to deserialize the paths (e.g. our own) need to explicitly use the os.Path(_, os.pwd) constructor so that relative paths are accepted and resolved against the current working directory.

Notes

There are some non-path-related changes in this PR:

  1. The Zinc incremental compiler has its own ReadWriteMapper mechanism for customizing serialization of paths, so we hook into that in ZincWorkerImpl

  2. We tweak the valueHash computation in GroupExecution to take the hash of the serialized JSON, rather than of the original JVM object. Different os.Paths with different hashes may serialize to the same relative path after the working directory has been substituted, so we need to hash the JSON to make sure we get a stable hash
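The reasoning behind the second change can be illustrated with a small sketch. Everything here is an assumption for illustration: the StableHash object, its simplified serialize function (a stand-in for the real JSON writer) and the choice of SHA-256 are not taken from GroupExecution.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Path, Paths}
import java.security.MessageDigest

// Hypothetical sketch: hash the serialized form of a path rather than the
// in-memory object, so the hash does not depend on where the workspace lives.
object StableHash {
  // Stand-in for the JSON serializer: a quoted, workspace-prefixed string.
  def serialize(p: Path, workspace: Path): String = {
    val rel = workspace.relativize(p).toString.replace('\\', '/')
    "\"out/mill-workspace/" + rel + "\""
  }

  // SHA-256 is an arbitrary choice here, not necessarily what Mill uses.
  def sha256Hex(s: String): String =
    MessageDigest.getInstance("SHA-256")
      .digest(s.getBytes(StandardCharsets.UTF_8))
      .map(b => "%02x".format(b & 0xff))
      .mkString
}
```

Two distinct absolute paths such as /Users/lihaoyi/mill/foo/src and /Users/someone-else/mill/foo/src are different JVM objects, yet both serialize to "out/mill-workspace/foo/src" against their own workspace, so hashing the serialized string gives the same value on both machines.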

@lihaoyi
Member Author

lihaoyi commented Mar 4, 2025

This is getting pretty close to getting ./mill -w 'integration.feature[reproducibility].local.server' mill.integration.ReproducibilityTests.diff to pass. The current and seemingly last blocker is sbt/zinc#1540, where Zinc's analysis files have some inherent non-determinism, but everything else seems to be deterministic.

@lihaoyi lihaoyi changed the title from "Reproducible 2" to "2nd attempt at Reproducible out folder contents" Mar 4, 2025
@roman-mibex-2

> The symlinks can be set up for any subprocess Mill creates via OS-Lib, but we cannot instrument subprocesses created by java.lang.ProcessBuilder, which may include sub-processes spawned by third-party libraries we use

One random idea I had once, but never had enough time to explore: use a java.nio.file FileSystem to provide virtual paths. The idea being that:

  • Generic Java libraries would go through our 'Mill' path resolution
  • .toString would do something like you describe here, with symlinks to out/home, etc., for everything that might end up in an external process.
    However, that is a way larger API and I'm not sure if it gives more control in reality.

I guess, if the ProcessBuilder becomes a common issue deep inside other libraries: We still could think of adding a Java Agent to 'fix' paths.

@lihaoyi
Member Author

lihaoyi commented Mar 8, 2025

@roman-mibex-2 that could work, but one big issue is we need to support passing absolute paths to third-party subprocesses as well, and those are entirely out of the JVM's control. So we do need some kind of OS/FS-level handling to make those relative paths work

Using a Java agent to fix paths passed to third-party subprocesses won't work, because the subprocesses get a mix of files-on-disk, command-line strings, and environment variable strings, any of which could have the paths embedded within them. It's impossible to look at a string or file and identify if it is wholly or partially composed of absolute paths, e.g. how would we look at a gzipped messagepack blob on disk and fix up the paths there?
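A small sketch of why scanning for paths breaks down once data is encoded. It gzips a plain-text payload (not messagepack, to keep the example dependency-free) and checks whether the path bytes are still findable; the OpaqueBlob object and its method names are invented for this illustration.

```scala
import java.io.ByteArrayOutputStream
import java.nio.charset.StandardCharsets
import java.util.zip.GZIPOutputStream

// Hypothetical sketch: once a payload is compressed, a byte-level search
// for an absolute path no longer finds it.
object OpaqueBlob {
  def gzip(bytes: Array[Byte]): Array[Byte] = {
    val buf = new ByteArrayOutputStream()
    val gz = new GZIPOutputStream(buf)
    try gz.write(bytes) finally gz.close()
    buf.toByteArray
  }

  // The naive scan a path-rewriting tool might attempt.
  def containsBytes(haystack: Array[Byte], needle: Array[Byte]): Boolean =
    haystack.toSeq.indexOfSlice(needle.toSeq) >= 0

  val path: Array[Byte] =
    "/Users/lihaoyi/mill/foo/bar/baz".getBytes(StandardCharsets.UTF_8)

  // A payload that mentions the same path several times.
  val payload: Array[Byte] =
    Seq("classpath", "sources", "output")
      .map(k => k + "=/Users/lihaoyi/mill/foo/bar/baz")
      .mkString("\n")
      .getBytes(StandardCharsets.UTF_8)
}
```

Here OpaqueBlob.containsBytes(OpaqueBlob.payload, OpaqueBlob.path) finds the path in the raw payload, while the same search over OpaqueBlob.gzip(OpaqueBlob.payload) does not, even though the subprocess that decompresses the blob will still see the absolute path.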

One generalization that could make your idea work is to use a FUSE filesystem to redirect the paths at the filesystem level. Bazel offers this IIRC on Linux (https://bazel.build/versions/7.5.0/docs/sandboxing#sandboxfs), though I'm not sure how hard it would be to port over to Mill. Running such a FUSE filesystem across different Mac/Linux/Windows environments would also be a challenge (e.g. needing kernel drivers or sudo)
