-
-
Notifications
You must be signed in to change notification settings - Fork 390
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
2nd attempt at Reproducible out folder contents #4642
base: main
Are you sure you want to change the base?
Conversation
This is getting pretty close to getting |
One random idea I had once, but never enough time to explore: Use a java.nio.File system to have a virtual paths. With the idea that:
I guess, if the |
@roman-mibex-2 that could work, but one big issue is we need to support passing absolute paths to third-party subprocesses as well, and those are entirely out of the JVM's control. So we do need some kind of OS/FS-level handling to make those relative paths work Using a java agent to fix paths passed to third part subprocesses won't work, because the subprocesses get a mix of files-on-disk, command-line strings, environment variable strings, any of which could have the paths embedded within them. It's impossible to look at a string or file and identify if it is whole-ly or partially composed of absolute paths, e.g. how would we look at a gzipped messagepack blob on disk and fix up the paths there? One generalization that could make your idea work is to use a FUSE filesystem to redirect the paths at the filesystem level. Bazel offers this IIRC on linux (https://bazel.build/versions/7.5.0/docs/sandboxing#sandboxfs), not sure how hard it would be to port over to Mill. Running such a FUSE filesystem across different Mac/Linux/Windows environments would also be a challenge (e.g. needing kernel drivers or sudo) |
The biggest challenge with making
out/
folder reproducible is removing absolute paths. This can be a challenge:Paths are everywhere, sometimes as
os.Path
s, sometimes in JSON blobs written to disk, sometimes in strings like"-Xplugin=..."
.These paths will then get interpreted by in-memory libraries (e.g. Zinc), Scala subprocesses we control (e.g.
mill.testrunner.TestRunnerMain
, or subprocesses we do not control (e.g.native-image
).These subprocesses may run in a variety of different working directories: some in the workspace root, some in a
.dest
folder, or elsewhere.The same path may be passed to different subprocesses, written in different languages, running in different working directories, and should behave the same way.
All of the above works fine if we are using absolute paths: an absolute path means the same thing regardless of how it is serialized or who it is passed to. That is the reason OS-Lib uses absolute
os.Path
s by default. However, absolute paths are not reproducible: someone running code in the folder/Users/lihaoyi/mill
or/Users/someone-else/mill
will have different absolute paths, and would thus be unable to share caches keyed on the hashes of those paths.Serializing Absolute Paths to Relative Paths
In order to make this work, we need to serialize absolute paths as relative paths. As we do not in general know where a serialized path is going to end up being used, we need to serialize absolute paths as relative paths that reference the same final destination regardless of the
cwd
of the process it is passed to. We can do this as follows:Serialize all absolute paths relative to some stable root folder, e.g. anything in
os.home / "foo/bar/baz"
gets serialized as"out/mill-home/foo/bar/baz"
, anything inTask.workspace / "foo/bar/baz"
gets serialized as"out/mill-workspace/foo/bar/baz"
Whenever we spawn external processes, we synthesize
out/mill-home
andout/mill-workspace
symlinks that point to their respective destinationsos.home
andTask.workspace
Any subprocess that reads in these paths and then tries to dereference
"out/mill-home/foo/bar/baz"
or"out/mill-home/foo/bar/baz"
, regardless of language, will follow the filesystem symlink and end up reading from the right place on disk. (With the notable exception of Scala subprocesses using OS-Lib, which is stricter about deserializing relative paths as absolute paths than most platforms)This is a similar approach taken by Bazel's Symlink Sandbox, which generates local symlinks in the working directory of any subprocess that Bazel spawns.
Implementation Details
We hook into OS-Lib to control the serialization of
os.Path
, allowing it to be written out as prefixed relative paths, and read back in as prefixed relative paths.We also hook into OS-Lib to create the
out/mill-home
andout/mill-workspace
symlinks every time you spawn a process, in that process' working directoryLimitations
This approach to making Mill's paths reproducible suffers from many of the same weaknesses that Bazel has. In general, the symlinks work 99% of the time, but once in a while something can go wrong:
The symlinks can be set up for any subprocess Mill creates via OS-Lib, but we cannot instrument subprocesses created by
java.lang.ProcessBuilder
, which may include subprocesses spawned by third-party libraries we useAny transitive subprocesses spawned by the subprocesses that Mill creates are out of our control, and thus may not have the proper symlinks set up if the working directory differs from their parent
The symlinks are not transparent: user code will be able to see that the
mill-workspace
folder is a symlink, and some code may not behave correctly when traversing symlinksSome code demands absolute paths and provides no alternative, e.g. the
native-image
binary for generating Graal executables.Any subprocesses using OS-Lib to deserialize the paths (e.g. our own) need to explicitly use the
os.Path(_, os.pwd)
constructor to allow it to handle relative paths and resolve them from the current working directly.Notes
There are some non-path-related changes in this PR:
The Zinc incremental compiler has its own
ReadWriteMapper
mechanism for customizing serialization of paths, so we hook into that inZincWorkerImpl
We tweak the
valueHash
computation inGroupExecution
to take the hash of the serialized JSON, rather than of the original JVM object, since differentos.Path
s with different hashes may serialize to the same relative path after the working directly has been substituted, and so we need to hash the JSON to make sure we get a stable hash