
Investigate use of stack probes and removal of explicit stack-limit checks #8135

Open

cfallin opened this issue Mar 14, 2024 · 12 comments

@cfallin (Member) commented Mar 14, 2024

Currently, Wasmtime uses explicit stack-limit checks to avoid overflowing the stack, and these checks are generated in the prologue of every Wasm function. This is largely a consequence of the design decision to run on the host's stack (when not doing async). That leaves two subcases: either Wasm hits the guard page itself (and we don't currently have a mechanism, such as a separate signal stack, to recover from that), or Wasm uses almost all of the stack and then calls back into the host, and the host then overflows (which looks like a host crash). So the advantage of explicit stack-limit checks is that they retain control while giving us a well-defined path to return a trap, but the cost is that they impose overhead on every function call. (And potentially nontrivial overhead: the limit is several chained loads deep in data structures.)
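As a rough illustration of what such a prologue check costs, here is a minimal Rust sketch. The `VmContext`/`VmRuntimeLimits` layouts are hypothetical stand-ins, not Wasmtime's actual data structures; the point is only that the limit sits behind chained pointer loads, and every call pays for them.

```rust
// Hypothetical layouts standing in for the real data structures; the
// point is only that the limit is reached through chained loads.
struct VmRuntimeLimits {
    stack_limit: usize,
}

struct VmContext {
    limits: *const VmRuntimeLimits,
}

/// Prologue-style check: trap if allocating `frame_size` more bytes
/// would drop the stack pointer below the limit.
fn stack_check(vmctx: &VmContext, sp: usize, frame_size: usize) -> Result<(), &'static str> {
    // Chained load #1 (vmctx -> limits) and #2 (limits -> stack_limit).
    let limit = unsafe { (*vmctx.limits).stack_limit };
    if sp < limit.saturating_add(frame_size) {
        Err("stack overflow trap")
    } else {
        Ok(())
    }
}
```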

The alternative idea is to switch to reliance on a guard page, with stack probes in functions that have frames larger than a single guard page (just as with native code today), and handle the two cases above by:

  • Installing a sigaltstack and handling a trap in the function prologue as a stack-overflow Wasm trap;
  • Explicitly checking against the stack limit on exit from the Wasm code back to host code.

(Not coincidentally, I believe this is also the exact scheme that was used by Lucet!)
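The probe pattern the guard-page scheme relies on can be sketched as follows (illustrative only; real probes are stores emitted by the compiler backend, not a runtime helper, and the 4k page size is an assumption): one touch per page, highest address first, so no store can skip past the guard region.

```rust
const PAGE_SIZE: usize = 4096;

/// Compute the addresses a prologue probe loop would touch when growing
/// the stack downward from `sp` by `frame_size` bytes. Each probe is at
/// most one page below the previous one, so a frame larger than the
/// guard region still faults inside the guard page.
fn probe_addresses(sp: usize, frame_size: usize) -> Vec<usize> {
    let mut addrs = Vec::new();
    let mut cur = sp;
    let mut remaining = frame_size;
    while remaining >= PAGE_SIZE {
        cur -= PAGE_SIZE;
        addrs.push(cur);
        remaining -= PAGE_SIZE;
    }
    addrs
}
```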

@alexcrichton (Member)

Some other bits I've thought of after mulling this over some more:

  • There's one more way to exit wasm which isn't via trampolines, and that's via libcalls. There are two kinds of libcalls here:
    • One kind of libcall is cranelift-inserted and is for things like fma when the architecture doesn't have an fma instruction (also things like ceil without SSE4.2). These relocations are resolved here, and we'd have to install custom versions of all libcalls that do stack checks before calling the native function. AFAIK we don't have the option to rely on cranelift-generated trampolines here.
    • Another kind of libcall is Wasmtime's libcalls for memory.grow and the like. We'll need to modify these macros to handle the stack check too. Ideally we'd rely on cranelift-generated trampolines here somehow, but that would be a refactoring we don't currently have in place.
  • One nice property today is that when we get a segfault we are able to validate that there is cranelift metadata claiming "yes, this instruction can segfault". We go so far as to say a pc is only a wasm pc if it has a trap registered for it. This is a nice defense-in-depth measure for catching bugs where we would otherwise accidentally handle faults from instructions that shouldn't actually fault. Historically we handled this by assuming any faulting instruction without a registered trap pc was a stack overflow, but I'd prefer not to go back to that. Instead I think what we can do is attach stack-overflow metadata to the explicit probes that handle >1-page frames. For <1-page frames we can test that the faulting address is within one page of the stack pointer, and in that case assume that it's a stack overflow.
  • We'll need the sigaltstack to cover more than SIGSEGV; SIGILL too, because wasm could be close to a stack overflow and then hit a divide-by-zero, for example.
  • I'll note that we do not use sigsetjmp/siglongjmp today. We instead use architecture-specific intrinsics which are more efficient than sigsetjmp/siglongjmp. I think that's ok for this use case, but it's something to watch out for.
  • I'm remembering now Fix preservation of the sigaltstack on macOS #2676, which I recall was a complicated interaction when embedding Wasmtime. This reminds me that on macOS, when we catch a trap we actually resume execution outside the context of the signal handler, and it then resumes inside of wasmtime_longjmp. We'll probably need to unwind at least 4k or so and just pave over whatever's there, since the very end of the stack probably isn't usable.

@alexcrichton (Member)

Also, I'm completely wrong about sigaltstack: we are already installing a sigaltstack today.

alexcrichton added a commit to alexcrichton/wasmtime that referenced this issue Mar 15, 2024
This commit enhances the `host_always_has_some_stack` test a bit in light of
some thinking around bytecodealliance#8135. Notably this test ensures that the host
itself never segfaults even if the wasm exhausts all of its stack. There
are a number of ways that we can exit out to the host, though, and only
one was tested previously. This commit updates to ensure more cases are
covered.
@cfallin (Member, Author) commented Mar 15, 2024

Thanks for filling in a ton of details here! A few thoughts:

  • Libcalls: given that Cranelift has mechanisms today to generate the stack-limit check in prologues, it seems we could have a mode where it is aware that libcalls are "special" (host escapes) and it can generate the same limit-checking just before a given libcall. Not terribly efficient but we mostly use libcalls to polyfill float stuff on older (non-SSE 4.2 or whatever) machines so maybe that's OK.
  • Calls to Wasmtime helpers: we could perhaps encapsulate the above logic in a new CLIF instruction, check_stack_limit or somesuch, and emit that before a call to any helper in our environment implementation.
  • Trap-info precision: we could have a notion of a range in addition to an exact PC, and cover the whole prologue (which may have multiple stack pushes, etc.) with a stack-overflow trap region. We would need to be a little careful to ensure that our whole frame is "probed" implicitly in the prologue with respect to its layout and store order: currently user-defined stackslots and spillslots come below clobber-saves and the FP/return-address pair, and we only store the latter, so maybe we want to either flip that (though Windows fastcall forced this arrangement IIRC) or have one phantom store to the lowest stack/spillslot in the prologue. We'd also need to adjust the scheme for stack args at callsites: currently on x64, aarch64, and riscv64 we decrement SP further, whereas we'd want to preallocate the maximum call-arg size in the prologue as s390x does.

alexcrichton added a commit to alexcrichton/wasmtime that referenced this issue Mar 15, 2024
This commit changes how builtin functions in Wasmtime (think
`memory.grow`) are implemented. These functions are required to exit
through some manner of trampoline to handle runtime requirements for
backtracing right now. Currently this is done via inline assembly for
each architecture (or external assembly for s390x). This is a bit
unfortunate as it's a lot of hand-coding that must be kept correct, and
it's not easy to update since multiple platforms need updating.

The change in this commit is to use Cranelift-generated trampolines
for this purpose instead. The path for invoking a builtin function now
looks like:

* Wasm code calls a statically known symbol for each builtin.
* The statically known symbol will perform exit trampoline duties (e.g.
  pc/fp/etc) and then load a function pointer to the host
  implementation.
* The host implementation is invoked and then proceeds as usual.

The main new piece in this PR is that all wasm modules and functions
are compiled in parallel, and an output of this compilation phase is
the set of builtin functions required. All required builtins are then
unioned together into one set and generated just afterwards. That means
only one trampoline per builtin is generated per module.

This work is inspired by bytecodealliance#8135 and my own personal desire to have as
much about our ABI details flowing through Cranelift as we can. This in
theory makes it more flexible to deal with future improvements to our
ABI.
@alexcrichton (Member) commented Mar 15, 2024

I've been nerd sniped.


it can generate the same limit-checking just before a given libcall

I'm a bit worried about the complexity of this because libcalls are completely invisible to Wasmtime here. In that sense it's really late in the x64 backend that we realize that a libcall is required, and at that point we've lost a lot of the context necessary to instrument the libcall.

I think the best solution here will be to implement something like #8152 but for cranelift-generated libcalls as well. That way we can, after compilation, see what libcalls are required and then go generate a small shim for each libcall.

This also handles things like the fp/pc of a wasm module because we can't do that inline but we instead need an actual function frame to record the exit fp/pc (I forget the exact reasons for this but it's why we had to have separate trampolines for builtin functions in wasmtime)

Calls to Wasmtime helpers

Hopefully handled through #8152 now!

Trap-info precision: we could have a notion of range in addition to exact PC

For <1-page functions, I implemented this, which basically tests whether the faulting address is within one page of the stack pointer. My hope is that this is precise enough to say "yeah, it's a stack overflow" and also easy enough that we don't need metadata on every possible instruction that modifies the stack in Cranelift (e.g. I don't want to have to rearrange the stack args/clobbers/etc).

The main thing that I'm worried about is a situation like:

;; prologue stuff, does not fault
sub rsp, $many_pages
;; start probes
or [rsp + $stack_size_minus_one_page], 0
or [rsp + $stack_size_minus_two_pages], 0
;; ...

where if the first or faults then the faulting address is pretty far away from rsp, perhaps many pages. That, I think, can be solved with explicit trap metadata per probe.

I suppose, put another way, I'd ideally like to frame the problem as:

All stack overflows caught through the guard page can be classified as either:

  • The faulting address is within one page of the stack pointer itself
  • The instruction is annotated as "this may cause stack overflow"

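That two-rule classification could be sketched like this (a sketch under assumptions: a 4k page size, and `abs_diff` standing in for whatever distance test the signal handler would actually do):

```rust
const PAGE_SIZE: usize = 4096;

/// Classify a fault per the two rules above: either the faulting pc
/// carries explicit stack-overflow metadata (large-frame probes), or
/// the faulting address is within one page of the stack pointer.
fn is_stack_overflow(fault_addr: usize, sp: usize, pc_has_overflow_metadata: bool) -> bool {
    pc_has_overflow_metadata || fault_addr.abs_diff(sp) <= PAGE_SIZE
}
```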
@alexcrichton (Member)

(sorry hit comment a bit too soon so I added a little bit more context at the end)

@cfallin (Member, Author) commented Mar 15, 2024

The main thing that I'm worried about is a situation like: [sub rsp once, many stores with large offsets]

Fortunately we seem to decrement rsp and then store in stages. The comment there cites Valgrind compatibility, which I think specifically means that rsp is decremented first; I don't know if there's a reason we do individual decrements per store on x86, though on aarch64/riscv64 it would allow us to use store-with-immediate-offset instructions with small offset fields. In any case, we do seem to already have this "fault is close to the actual current SP" property, which is nice!

@cfallin (Member, Author) commented Mar 15, 2024

(And likewise the dynamic loop version seems to have the same property)

@alexcrichton (Member)

Oh nice, so then if we're comfortable saying all faults within 1 page of rsp are stack-overflow related faults, I think all we need to do is turn on stack probes and call it a day then?

@cfallin (Member, Author) commented Mar 15, 2024

You would know better than me but that sounds plausible, given your work on helper calls in the linked PR!

alexcrichton added a commit to alexcrichton/wasmtime that referenced this issue Mar 17, 2024
alexcrichton added a commit to alexcrichton/wasmtime that referenced this issue Mar 17, 2024
@alexcrichton (Member)

I poked a bit more at this, specifically with respect to cranelift-injected libcalls (e.g. ceilf). These are not as simple as wasmtime builtins because they don't take a *mut VMContext argument (which I forgot about). The best way I could think of to implement this was:

  • First, call a "consume stack" builtin whose sole purpose is to have a big stack frame, probing that there's ~32k of stack remaining.
  • Next, call a "stack limit" libcall where Wasmtime uses its thread-local state to read the stack limit field of VMRuntimeLimits.
  • Finally, perform an explicit stack check with this and, assuming that passes, call the actual libcall.
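The three steps above might compose roughly like this sketch (the helper names and the plain-argument stand-in for thread-local state are hypothetical; only `VMRuntimeLimits` is a real Wasmtime type):

```rust
// ~32k of headroom that the "consume stack" builtin's frame probes for.
const RESERVE: usize = 32 * 1024;

/// Step 1: a builtin whose only job is to have a big frame; merely
/// entering it proves ~32k of stack remains, since its prologue
/// probes would have faulted otherwise.
#[inline(never)]
fn consume_stack() {
    let big = [0u8; RESERVE];
    std::hint::black_box(&big);
}

/// Steps 2 and 3: read the stack limit (modeled here as a plain
/// argument rather than thread-local state) and do the explicit check
/// before dispatching to the real libcall.
fn call_libcall_checked(
    sp: usize,
    stack_limit: usize,
    libcall: fn(f32) -> f32,
    arg: f32,
) -> Result<f32, &'static str> {
    consume_stack();
    if sp <= stack_limit {
        return Err("stack overflow trap");
    }
    Ok(libcall(arg))
}
```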

I got all that wired up and working, and AFAIK it covers all the cases I could think of where the guard page is now guaranteed to be hit in Cranelift-generated code.


Stepping back a bit though, I thought a little more about this at a higher level. I ran sightglass's default suite and it showed a 3-4% improvement in execution on spidermonkey.wasm, but no changes elsewhere. I'm not sure if it would show a larger improvement in the situations that you're looking at @cfallin though if they're call-heavy benchmarks.

I'm a bit worried about the complexity of this change. Much of Cranelift/Wasmtime is about striking the right balance between simplicity and performance, and I don't think there's any silver bullet for handling stack overflow in wasm. The stack-limit check is quite simple but not exactly speedy. Using a guard page feels pretty complex and subtle because we need a guarantee that only Cranelift-generated code hits the guard page, never native code. I've done my best to enumerate all the cases I personally know we need to handle, but I have little confidence that my knowledge is exhaustive; for example, I haven't analyzed components yet to see if they need updates too.

More or less, I think we should have more compelling performance data motivating this change before pursuing it further. If 3-4% breaks the bank then it seems ok to pursue, but in the abstract 3-4% feels pretty small and perhaps not worth the complexity. If there's a 40% improvement in a different benchmark, though, that's of course more compelling.

@cfallin (Member, Author) commented Mar 18, 2024

@alexcrichton thanks a bunch for pushing this further! I agree it's best to do this only if it actually matters. In my own work I can at least test your branch if it's available, once I get ICs (lots of tiny function calls) wired up properly, and give more data; it'd have to be worth 1.5x-2x or so in my mind.

@alexcrichton (Member)

This is the (messy) branch with various bits of work, but for performance testing I'd just comment out this line since the test case probably doesn't rely on catching stack overflow anyway.

github-merge-queue bot pushed a commit that referenced this issue Mar 18, 2024
* Enhance test around stack overflow


* Fix test to test correct thing

* Update tests/all/stack_overflow.rs

Co-authored-by: Nick Fitzgerald <fitzgen@gmail.com>

---------

Co-authored-by: Nick Fitzgerald <fitzgen@gmail.com>
alexcrichton added a commit to alexcrichton/wasmtime that referenced this issue Mar 21, 2024
alexcrichton added a commit to alexcrichton/wasmtime that referenced this issue Mar 21, 2024
alexcrichton added a commit to alexcrichton/wasmtime that referenced this issue Mar 21, 2024
alexcrichton added a commit to alexcrichton/wasmtime that referenced this issue Mar 21, 2024
alexcrichton added a commit to alexcrichton/wasmtime that referenced this issue Mar 22, 2024
alexcrichton added a commit to alexcrichton/wasmtime that referenced this issue Mar 22, 2024
github-merge-queue bot pushed a commit that referenced this issue Mar 22, 2024
* Exit through Cranelift-generated trampolines for builtins


* Fix some build issues

* Update winch test expectations

* Update Winch to use new builtin shims.

This commit refactors the Winch compiler to use the new trampolines for
all Wasmtime builtins created in the previous commits. This required a
fair bit of refactoring to handle plumbing through a new kind of
relocation and function call.

Winch's `FuncEnv` now contains a `PrimaryMap` from `UserExternalNameRef`
to `UserExternalName`. This is because there are now more kinds of names
than just wasm function relocations, so the raw index space of
`UserExternalNameRef` is no longer applicable. This required threading
`FuncEnv` to more locations, along with some refactorings to ensure that
lifetimes work out ok.

The `CompiledFunction` no longer stores a trait object describing how to
map name refs to names and now directly has a `PrimaryMap`. This also
means that Winch's return value from its `TargetIsa` is a
`CompiledFunction`, as opposed to the previous just-a-`MachBuffer`, so it
can also package up all the relocation information. This ends up having
`winch-codegen` depend on `wasmtime-cranelift-shared` as a new dependency.

* Review feedback
dhil pushed a commit to dhil/wasmtime that referenced this issue Mar 26, 2024
…liance#8152)
