Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement loading modules from piped files #5035

Closed
wants to merge 3 commits into from

Conversation

cr0sh
Copy link
Contributor

@cr0sh cr0sh commented Oct 9, 2022

This is not discussed in the issue tracker, but on this Zulip thread.

Previously load_module opened a file on path to check if the module is precompiled, then closes before constructing the Module from it. This is fine on normal files, but on piped files(like /dev/stdin), is problematic as the header read is not visible on the next open.

So, this set of commits inserts a new procedure, before pre-opening to read the header, to detect if the file pointed by the path is a pipe(is_fifo of FileTypeExt), then directly loads the module from the path without checking if the modules is precompiled.

This change is unix-only, and intentionally does not support loading precompiled modules from pipes, because it should not be used that way. An additional warning message is printed in such situations to prevent users from being confused.

The Zulip thread contains an additional concern by me, about implicitly allowing accsesing /dev/tty on - inputs, which is not considered here.

*Edited to describe the current procedure to detect piped files.

@cr0sh
Copy link
Contributor Author

cr0sh commented Oct 9, 2022

It seems like I cannot assign a reviewer for the PR(the template indicates me to do so), so not assigning.

@cr0sh cr0sh changed the title Run piped modules Implement loading modules from piped files Oct 9, 2022
f.read_to_end(&mut buf)?;
let mmap_vec = MmapVec::from_slice(&buf)?;
Ok::<_, anyhow::Error>(mmap_vec)
})?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This introduces new copies of the data, slowing down startup. Previously Module::from_file would mmap the file once and then use this mmap'ed version directly. Now it will mmap it once (or read to a vec and then copy it to a new memory region in MmapVec::from_slice) and then copy it to a new memory region again in Module::new (through MmapVec::from_slice).

Copy link
Contributor Author

@cr0sh cr0sh Oct 9, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely right. To be honest, I thought directly writing Mmap/MmapVec with looped f.read() is more efficient but feeled like a premature optimization, so chose this implementation for brevity. (The main reason for this is the File::read_to_end is not compatible with MmapVec.)

If startup performance matters for piped inputs too, I'll fix this into like

loop {
    let buf = [0u8; 4<<10];
    match f.read(&mut buf) {
        Ok(0) => break,
        Ok(n) => /* copy `n` bytes to Mmap and increment the offset */ todo!(),
        Err(e) => return Err(e),
    }
}

But again I'm not sure this is worth it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will still have an extra copy in case the mmap succeeded.

Copy link
Contributor Author

@cr0sh cr0sh Oct 9, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK I didn't understand what you were saying. 😅 Now I see.
I'm afraid that this is not able to be trivially fixed, as there's no public API on Module to pass MmapVec directly into a constructor?

Module::from_binary seems to be the 'generic' catch-all for many Module constructors but I don't have an idea what it does...

Copy link
Contributor Author

@cr0sh cr0sh Oct 9, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nevermind, you said about the case when the mmap suceeded, so trying to mmap and if it succeeds then discarding the mmap and using Module::from_path is OK. Is that you're thinking?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That may work.

@cr0sh
Copy link
Contributor Author

cr0sh commented Oct 9, 2022

Piping tests now run on unix platforms only(as the - substitiution applies only on unix too). Double buffering issue is resolved.
What do you think about implicitly allowing access to /dev/tty on wasmtime - as a special case?

Cargo.toml Outdated
@@ -30,6 +30,7 @@ wasmtime-wast = { workspace = true }
wasmtime-wasi = { workspace = true }
wasmtime-wasi-crypto = { workspace = true, optional = true }
wasmtime-wasi-nn = { workspace = true, optional = true }
wasmtime-runtime = { workspace = true }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Wasmtime CLI serves as an example of how to use Wasmtime's API, and in general we don't want users of the API depending on internal crates like wasmtime-runtime. Would it be possible to factor out this code for opening and loading a module either from deserialized form or a plain module into a utility function in the wasmtime crate, so that the CLI code can call that rather than doing this work itself?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was only needed for MmapVec so I'll remove on the next commit. (see below.)

Copy link
Contributor Author

@cr0sh cr0sh Oct 12, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After re-reading my previous commits and review comments here, I realized this reply is not possible at all. MmapVec cannot be avoided because load_module should do both "header check" and "copyless mmap" in one path. I think your suggestion is truly valid but please kindly understand that I don't know much about this project (yet) to implement it rapidly so I would come back with results after some time with some investigation and tryouts.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. And thanks for working on this! I agree that running modules from stdin is an interesting idea.

// to store the whole bytes in a memory.
let mmap_vec = MmapVec::from_file(path).or_else(|_| {
// Mmap-ing this much would not eat system memory thanks for paging
const MAX_READ_FROM_STREAM: usize = 1 << 30; // TODO: make this configurable with environment variables?
Copy link
Member

@sunfishcode sunfishcode Oct 10, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Arbitrary limitations and environment variables are something we'd like to avoid if possible. The users of mmap_vec don't care if they have an MMapVec or a plain Vec<u8>, so in the case where the mmap fails, would it work to just call std::fs::read to read the file? That would require reorganizing this code to avoid using MmapVec as the type to unify the two options. Something like this:

    let mmap_vec;
    let read;
    let bytes: &[u8] = if let Ok(mapped) = MmapVec::from_file(Path::new(path)) {
        mmap_vec = mapped;
        mmap_vec.as_ref()
    } else {
        read = std::fs::read(path)?;
        read.as_ref()
    };

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I misinterpreted @bjorn3 's first review so I'll roll back into previous implementation which just uses Read::read_to_end and Vec<u8>, and prevent redundant allocation when a module is from a regular file. This may be sub-optimal for irregular file case but I think it's OK because no one wants to do hyperfine 'cat foo.wasm | wasmtime -'?

@cr0sh
Copy link
Contributor Author

cr0sh commented Oct 23, 2022

I simplified the strategy to detect piped files(directly invokes is_fifo on Unix) and fixed tests incorrectly set up.

Comment on lines 430 to 431
if cfg!(unix) {
#[cfg(unix)]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if cfg!(unix) {
#[cfg(unix)]
#[cfg(unix)]
{

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What a genius! 🤣

@jameysharp
Copy link
Contributor

Here's the strategy I'd suggest:

Add a new method to wasmtime::Module that accepts both precompiled binaries and wasm/wat files, and uses mmap. I'd call it from_trusted_file but I'm open to suggestions. It should probably be unsafe, like Module::deserialize_file, for the same reasons.

In the CLI, if allow_precompiled is true, then call Module::from_trusted_file; otherwise, call Module::from_file just like it does today. Don't inspect the bytes of the file at all in the CLI, but put that logic in the wasmtime crate instead. I think that addresses @sunfishcode's comment on keeping the CLI as a simple demonstration of how to use the wasmtime public API, and improves on the current state where there's this magic ELF-header check in the CLI.

In your motivating use case (curl | wasmtime), users shouldn't be passing --allow-precompiled, so the CLI would call Module::from_file, resulting in a single fs::read that consumes the piped input exactly once, which solves the original bug you encountered.

This new method should read the file using MmapVec::from_file. Only the OS knows exactly which file types it's capable of mapping into memory, so this shouldn't check whether the file is a FIFO or anything along those lines; just let the syscall fail if someone uses --allow-precompiled with a pipe.

Once you have an MmapVec, you can check whether it begins with b"\x7fELF". If so, pass it to SerializedModule::from_mmap; otherwise, pass it to Module::new. Neither function copies the bytes it's given, addressing @bjorn3's concern.

I think the only downside to this plan is that the error message from the CLI isn't as friendly if someone tries to run a precompiled artifact without passing --allow-precompiled. We could make Module::build_artifacts return a more helpful error if the bytes it receives start with an ELF header, instead of just reporting that parsing failed.

One advantage to this approach is that wasmtime --allow-precompiled becomes potentially faster for .wat and .wasm files, because it'll use mmap for those cases too.

@cr0sh
Copy link
Contributor Author

cr0sh commented Oct 26, 2022

@jameysharp Thanks for your detailed guidance! I didn't think fixing this would not be hard this way and admittedly I couldn't do some sort of refactoring the project(and even adding a new public API for it) never saw before, tempting myself to bring somewhat hackish way. I'll try to reimplement this way.

I'd call it from_trusted_file but I'm open to suggestions.

This looks good.

I think the only downside to this plan is that the error message from the CLI isn't as friendly if someone tries to run a precompiled artifact without passing --allow-precompiled. We could make Module::build_artifacts return a more helpful error if the bytes it receives start with an ELF header, instead of just reporting that parsing failed.

I'm not sure that I can bring an excellent solution for this; just will do my best.

To recap I leave a tabular summary for your suggestion:

no --allow-precompiled
(Module::from_file)
--allow-precompiled
(Module::from_trusted_file)
precompiled regular: opens OK and fails on some inner procedure but should be rejected via an additional check? TODO: better error handling
fifo: same
regular: opens OK w/ mmap
fifo: mmap fails (intentionally left broken)
.wasm/.wat regular: opens OK
fifo: opens OK (major change on this PR)
regular: opens OK w/mmap
fifo: same

@jameysharp
Copy link
Contributor

should be rejected via an additional check

Right, the key here is that a file that begins with the 4-byte ELF magic is not valid for .wat or .wasm format and will be rejected by those parsers. None of the characters a .wat file can start with (which I think are whitespace, (, or ;) include \x7f. And the binary .wasm format has its own different 4-byte magic (\0asm).

So from_file already rejects precompiled modules, but it does it by reporting a generic parse failure, which is not super friendly. We can be nicer to users by specifically checking for this case. But if you have any trouble with that, I don't think that has to be in this PR.

The existing check that the CLI does for the ELF magic serves two purposes: one is to provide a helpful error message, and the other is to decide whether to use deserialize_file or from_file. So one way to look at the approach I've outlined is that it does the same things, but in a different order.

I couldn't do some sort of refactoring the project(and even adding a new public API for it) never saw before, tempting myself to bring somewhat hackish way.

Absolutely! I had to stare at the existing code for a long time, and think about a bunch of options that don't work, to come up with this suggestion. So of course I don't expect somebody new to the project to think of it.

I think the project will be better off with this additional API. In my opinion, it would be an improvement even if it didn't help with reading from pipes: the CLI shouldn't need to know about ELF magic. But I also think having support for reading from pipes is a great idea, so I like that this fixes multiple issues at the same time.

@cr0sh cr0sh force-pushed the run-piped-modules branch from 5742b5a to f730953 Compare November 29, 2022 16:09
@cr0sh
Copy link
Contributor Author

cr0sh commented Nov 29, 2022

Rebased and implemented the @jameysharp 's guidance. (I'll move to a new pull request as the topic has changed, so please leave this PR not reviewed)

@cr0sh cr0sh force-pushed the run-piped-modules branch from f730953 to 7e054da Compare November 29, 2022 16:18
This applies to unix targets only,
as Windows does not have an appropriate alternative.
Treat `-` as an alias to `/dev/stdin`

This applies to unix targets only,
as Windows does not have an appropriate alternative.
Previously, wasmtime-cli checked the module to be loaded is
precompiled or not, by pre-opening the given file path to
check if the "\x7FELF" header exists.
This commit moves this branch into the `Module::from_trusted_file`,
which is only invoked with `--allow-precompiled` flag on CLI.

The initial motivation of the commit is, feeding a module to wasmtime
from piped inputs, is blocked by the pre-opening of the module.
The `Module::from_trusted_file`, assumes the --allow-precompiled flag
so there is no piped inputs, happily mmap-ing the module to test
if the header exists.
If --allow-precompiled is not supplied, the existing `Module::from_file`
will be used, without the additional header check as the precompiled
modules are intentionally not allowed on piped inputs for security measures.

One caveat of this approach is that the user may be confused if
he or she tries to execute a precompiled module without
--allow-precompiled, as wasmtime shows an 'input bytes aren't valid
utf-8' error, not directly getting what's going wrong.
So this commit includes a hack-ish workaround for this: If the error
on `Module::new` ends with "input bytes aren't valid utf-8" strings,
show a simple note to the standard error stream.

This might be a small hiccup on preparing i18n or changing error
format on the `wat` crate, but I feel comfortable (yet) this strategy
because the `wat` crate includes tests for the error messages, so one
would notice the breakage if the error message have changed.

Thanks to @jameysharp for suggesting this idea with a detailed guidance.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants