forked from torvalds/linux
Cherry-pick support for zstd's external sequence producer API #2
Open
embg wants to merge 4 commits into terrelln:zstd-next from embg:quickassist
Import upstream zstd v1.5.5 to expose upstream's QAT integration.

Import from upstream commit 58b3ef79 [0]. This is one commit before the tag v1.5.5-kernel [1], which is signed with upstream's signing key. The next patch in the series imports from v1.5.5-kernel, and is included in the series, rather than just importing directly from v1.5.5-kernel, because it is a non-trivial patch applied to improve the kernel's decompression speed.

This commit contains 3 backported patches on top of v1.5.5: Two from the Linux copy of zstd, and one from upstream's `dev` branch.

In addition to keeping the kernel's copy of zstd up to date, this update was requested by Intel to expose upstream zstd's external match provider API to the kernel, which allows QAT to accelerate the LZ match finding stage.

This commit was generated by:

  export ZSTD=/path/to/repo/zstd/
  export LINUX=/path/to/repo/linux/
  cd "$ZSTD/contrib/linux-kernel"
  git checkout v1.5.5-kernel~
  make import LINUX="$LINUX"

I tested and benchmarked this commit on x86-64 with gcc-13.2.1 on an Intel i9-9900K by running my benchmark scripts that benchmark zstd's performance in btrfs and squashfs compressed filesystems. This commit improves compression speed, especially for higher compression levels, and regresses decompression speed. But the decompression speed regression is addressed by the next patch in the series.

  Component, Level, C. time delta, size delta, D. time delta
  Btrfs    ,     1,         -1.9%,      +0.0%,         +9.5%
  Btrfs    ,     3,         -5.6%,      +0.0%,         +7.4%
  Btrfs    ,     5,         -4.9%,      +0.0%,         +5.0%
  Btrfs    ,     7,         -5.7%,      +0.0%,         +5.2%
  Btrfs    ,     9,         -5.7%,      +0.0%,         +4.0%
  Squashfs ,     1,           N/A,       0.0%,        +11.6%

I also boot tested with a zstd compressed kernel on i386 and aarch64.

Link: facebook/zstd@58b3ef7
Link: https://github.com/facebook/zstd/tree/v1.5.5-kernel
Signed-off-by: Nick Terrell <terrelln@fb.com>
Backport upstream commit c7269ad [0] to improve zstd decoding speed.

Updating the kernel to zstd v1.5.5 earlier in this patch series regressed zstd decoding speed. This turned out to be because gcc was not unrolling the inner loops of the Huffman decoder which are executed a constant number of times [1]. This really hurts performance, as we expect this loop to be completely branch-free. This commit fixes the issue by unrolling the loop manually [2].

The commit fixes one more minor issue, which is to mask a variable shift by 0x3F. The shift was guaranteed to be less than 64, but gcc couldn't prove that, and emitted suboptimal code.

Finally, the upstream commit added a build macro `HUF_DISABLE_FAST_DECODE` which is not used in the kernel, but is maintained to keep a clean import from upstream.

This commit was generated from upstream signed tag v1.5.5-kernel [3] by:

  export ZSTD=/path/to/repo/zstd/
  export LINUX=/path/to/repo/linux/
  cd "$ZSTD/contrib/linux-kernel"
  git checkout v1.5.5-kernel
  make import LINUX="$LINUX"

I ran my benchmark & test suite before and after this commit to measure the overall decompression speed benefit. It benchmarks zstd at several compression levels. These benchmarks measure the total time it takes to read data from the compressed filesystem.

  Component, Level, Read time delta
  Btrfs    ,     1,           -7.0%
  Btrfs    ,     3,           -3.9%
  Btrfs    ,     5,           -4.7%
  Btrfs    ,     7,           -5.5%
  Btrfs    ,     9,           -2.4%
  Squashfs ,     1,           -9.1%

Link: facebook/zstd@c7269ad
Link: https://gist.github.com/terrelln/2e14ff1fb197102a08d7823d8044978d
Link: https://gist.github.com/terrelln/a70bde22a2abc800691fb65c21eabc2a
Link: https://github.com/facebook/zstd/tree/v1.5.5-kernel
Signed-off-by: Nick Terrell <terrelln@fb.com>
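The two code-generation fixes described in this commit message can be illustrated with a small C sketch. This is not the zstd Huffman decoder: the function names (`shift_right_masked`, `sum4_rolled`, `sum4_unrolled`, `check_unroll_equivalence`) and the byte-summing loop body are hypothetical stand-ins, chosen only to show the shape of a shift count masked by 0x3F and of a constant-trip-count loop unrolled by hand.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Mask the shift count with 0x3F so the compiler can see that it is
 * always below 64, even when it cannot prove that from the callers.
 */
static uint64_t shift_right_masked(uint64_t bits, unsigned count)
{
    return bits >> (count & 0x3F);
}

/* Rolled form: four iterations that the compiler may or may not unroll. */
static uint64_t sum4_rolled(const uint8_t *src)
{
    uint64_t acc = 0;
    size_t i;

    for (i = 0; i < 4; i++)
        acc += src[i];
    return acc;
}

/* Manually unrolled form: same result, with no loop branch in the source. */
static uint64_t sum4_unrolled(const uint8_t *src)
{
    uint64_t acc = 0;

    acc += src[0];
    acc += src[1];
    acc += src[2];
    acc += src[3];
    return acc;
}

/* Hypothetical self-check: both forms must agree on the same input. */
static void check_unroll_equivalence(void)
{
    static const uint8_t sample[4] = { 1, 2, 3, 250 };

    assert(sum4_rolled(sample) == sum4_unrolled(sample));
    assert(shift_right_masked(0x100, 8) == 1);
}
```

Whether a given compiler unrolls the rolled form depends on its version and flags, which is why the manual form removes the dependency on that decision.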
8c3cfb1 to 005f476
terrelln requested changes Dec 28, 2023
Please revert the debug.c change, and make sure there are no other unintended changes.
Cherry-picks support for using zstd's external sequence producer API in the kernel. This unblocks the use of QuickAssist hardware acceleration for zstd in applications such as BTRFS.

Some context: the kernel uses ZSTD_initStaticCCtx() to create compression contexts. This function builds the compression context in a user-provided buffer which must be sized according to the compression parameters. zstd provides size estimation functions for this purpose, but until now they were not compatible with the external sequence producer API. This cherry-pick fixes that incompatibility.

More specifically, it pulls in zstd upstream PRs #3839 and #3854. PR #3854 includes a unit test (upstream only) which validates that the external sequence producer API works correctly in conjunction with ZSTD_initStaticCCtx().

To build this commit, I first cherry-picked the relevant upstream commits onto the upstream v1.5.5-kernel tag:

  cd ~/repos/zstd
  git checkout tags/v1.5.5-kernel
  git cherry-pick -m 1 126ec2669c927b24acd38ea903a211c1b5416588
  git cherry-pick c6cabf94417d84ebb5da62e05d8b8a9623763585

I then ran "make import" to copy the changes into my fork of Linux:

  cd ~/repos/zstd/contrib/linux-kernel/
  make import

Signed-off-by: Elliot Gorokhovsky <embg@meta.com>
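The incompatibility described above (a size estimate that must cover everything a statically initialised context will carve out of the caller's buffer) can be sketched with a toy model. Nothing here is zstd API: `toy_params`, `toy_ctx`, `toy_estimate_ctx_size()` and `toy_init_static()` are hypothetical names, and the rule that the external sequence buffer holds `window_size / 4` entries is an assumption made purely for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical parameters; not zstd's real parameter set. */
typedef struct {
    size_t window_size;          /* bytes of window to reserve */
    int use_external_sequences;  /* nonzero: reserve a sequence buffer too */
} toy_params;

typedef struct {
    toy_params params;
    uint8_t *window;
    uint32_t *ext_seqs;          /* NULL when external sequences are off */
} toy_ctx;

static size_t toy_round_up8(size_t n)
{
    return (n + 7) & ~(size_t)7;
}

/* Assumed sizing rule for the external sequence buffer (illustrative). */
static size_t toy_ext_seq_bytes(const toy_params *p)
{
    return toy_round_up8((p->window_size / 4) * sizeof(uint32_t));
}

/* The estimate must include every region that toy_init_static() carves out. */
static size_t toy_estimate_ctx_size(const toy_params *p)
{
    size_t total = toy_round_up8(sizeof(toy_ctx)) + toy_round_up8(p->window_size);

    if (p->use_external_sequences)
        total += toy_ext_seq_bytes(p);
    return total;
}

/* Build the context inside a caller-provided, 8-byte-aligned workspace. */
static toy_ctx *toy_init_static(void *workspace, size_t size, const toy_params *p)
{
    uint8_t *cursor = workspace;
    toy_ctx *ctx;

    if (((uintptr_t)workspace & 7) != 0)
        return NULL;             /* workspace is not suitably aligned */
    if (size < toy_estimate_ctx_size(p))
        return NULL;             /* workspace is too small for these params */

    ctx = (toy_ctx *)cursor;
    cursor += toy_round_up8(sizeof(toy_ctx));
    ctx->params = *p;
    ctx->window = cursor;
    cursor += toy_round_up8(p->window_size);
    ctx->ext_seqs = p->use_external_sequences ? (uint32_t *)cursor : NULL;
    return ctx;
}
```

In this model, a buffer sized from an estimate that ignores the external sequence region is rejected once that feature is enabled, which mirrors the kind of mismatch the cherry-pick is said to fix.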
7247a9d to 11a0569
terrelln pushed a commit that referenced this pull request Mar 4, 2025
…ea as VM_ALLOC

Erhard reported the following KASAN hit while booting his PowerMac G4 with a KASAN-enabled kernel 6.13-rc6:

  BUG: KASAN: vmalloc-out-of-bounds in copy_to_kernel_nofault+0xd8/0x1c8
  Write of size 8 at addr f1000000 by task chronyd/1293

  CPU: 0 UID: 123 PID: 1293 Comm: chronyd Tainted: G W 6.13.0-rc6-PMacG4 #2
  Tainted: [W]=WARN
  Hardware name: PowerMac3,6 7455 0x80010303 PowerMac
  Call Trace:
  [c2437590] [c1631a84] dump_stack_lvl+0x70/0x8c (unreliable)
  [c24375b0] [c0504998] print_report+0xdc/0x504
  [c2437610] [c050475c] kasan_report+0xf8/0x108
  [c2437690] [c0505a3c] kasan_check_range+0x24/0x18c
  [c24376a0] [c03fb5e4] copy_to_kernel_nofault+0xd8/0x1c8
  [c24376c0] [c004c014] patch_instructions+0x15c/0x16c
  [c2437710] [c00731a8] bpf_arch_text_copy+0x60/0x7c
  [c2437730] [c0281168] bpf_jit_binary_pack_finalize+0x50/0xac
  [c2437750] [c0073cf4] bpf_int_jit_compile+0xb30/0xdec
  [c2437880] [c0280394] bpf_prog_select_runtime+0x15c/0x478
  [c24378d0] [c1263428] bpf_prepare_filter+0xbf8/0xc14
  [c2437990] [c12677ec] bpf_prog_create_from_user+0x258/0x2b4
  [c24379d0] [c027111c] do_seccomp+0x3dc/0x1890
  [c2437ac0] [c001d8e0] system_call_exception+0x2dc/0x420
  [c2437f30] [c00281ac] ret_from_syscall+0x0/0x2c
  --- interrupt: c00 at 0x5a1274
  NIP: 005a1274 LR: 006a3b3c CTR: 005296c8
  REGS: c2437f40 TRAP: 0c00 Tainted: G W (6.13.0-rc6-PMacG4)
  MSR: 0200f932 <VEC,EE,PR,FP,ME,IR,DR,RI> CR: 24004422 XER: 00000000

  GPR00: 00000166 af8f3fa0 a7ee3540 00000001 00000000 013b6500 005a5858 0200f932
  GPR08: 00000000 00001fe9 013d5fc8 005296c8 2822244c 00b2fcd8 00000000 af8f4b57
  GPR16: 00000000 00000001 00000000 00000000 00000000 00000001 00000000 00000002
  GPR24: 00afdbb0 00000000 00000000 00000000 006e0004 013ce060 006e7c1c 00000001
  NIP [005a1274] 0x5a1274
  LR [006a3b3c] 0x6a3b3c
  --- interrupt: c00

  The buggy address belongs to the virtual mapping at
  [f1000000, f1002000) created by:
  text_area_cpu_up+0x20/0x190

  The buggy address belongs to the physical page:
  page: refcount:1 mapcount:0 mapping:00000000 index:0x0 pfn:0x76e30
  flags: 0x80000000(zone=2)
  raw: 80000000 00000000 00000122 00000000 00000000 00000000 ffffffff 00000001
  raw: 00000000
  page dumped because: kasan: bad access detected

  Memory state around the buggy address:
   f0ffff00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
   f0ffff80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  >f1000000: f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8
             ^
   f1000080: f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8
   f1000100: f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8
  ==================================================================

f8 corresponds to KASAN_VMALLOC_INVALID which means the area is not initialised hence not supposed to be used yet.

Powerpc text patching infrastructure allocates a virtual memory area using get_vm_area() and flags it as VM_ALLOC. But that flag is meant to be used for vmalloc() and vmalloc() allocated memory is not supposed to be used before a call to __vmalloc_node_range() which is never called for that area.

That went undetected until commit e4137f0 ("mm, kasan, kmsan: instrument copy_from/to_kernel_nofault")

The area allocated by text_area_cpu_up() is not vmalloc memory, it is mapped directly on demand when needed by map_kernel_page(). There is no VM flag corresponding to such usage, so just pass no flag. That way the area will be unpoisoned and usable immediately.

Reported-by: Erhard Furtner <erhard_f@mailbox.org>
Closes: https://lore.kernel.org/all/20250112135832.57c92322@yea/
Fixes: 37bc3e5 ("powerpc/lib/code-patching: Use alternate map for patch_instruction()")
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Link: https://patch.msgid.link/06621423da339b374f48c0886e3a5db18e896be8.1739342693.git.christophe.leroy@csgroup.eu
terrelln pushed a commit that referenced this pull request Mar 4, 2025
We have several places across the kernel where we want to access another task's syscall arguments, such as ptrace(2), seccomp(2), etc., by making a call to syscall_get_arguments().

This works for register arguments right away by accessing the task's `regs' member of `struct pt_regs', however for stack arguments seen with 32-bit/o32 kernels things are more complicated. Technically they ought to be obtained from the user stack with calls to an access_remote_vm(), but we have an easier way available already.

So as to be able to access syscall stack arguments as regular function arguments following the MIPS calling convention we copy them over from the user stack to the kernel stack in arch/mips/kernel/scall32-o32.S, in handle_sys(), to the current stack frame's outgoing argument space at the top of the stack, which is where the handler called expects to see its incoming arguments. This area is also pointed at by the `pt_regs' pointer obtained by task_pt_regs().

Make the o32 stack argument space a proper member of `struct pt_regs' then, by renaming the existing member from `pad0' to `args' and using generated offsets to access the space. No functional change though.

With the change in place the o32 kernel stack frame layout at the entry to a syscall handler invoked by handle_sys() is therefore as follows:

  $sp + 68 -> | ...                 | <- pt_regs.regs[9]
              +---------------------+
  $sp + 64 -> | $t0                 | <- pt_regs.regs[8]
              +---------------------+
  $sp + 60 -> | $a3/argument #4     | <- pt_regs.regs[7]
              +---------------------+
  $sp + 56 -> | $a2/argument #3     | <- pt_regs.regs[6]
              +---------------------+
  $sp + 52 -> | $a1/argument #2     | <- pt_regs.regs[5]
              +---------------------+
  $sp + 48 -> | $a0/argument #1     | <- pt_regs.regs[4]
              +---------------------+
  $sp + 44 -> | $v1                 | <- pt_regs.regs[3]
              +---------------------+
  $sp + 40 -> | $v0                 | <- pt_regs.regs[2]
              +---------------------+
  $sp + 36 -> | $at                 | <- pt_regs.regs[1]
              +---------------------+
  $sp + 32 -> | $zero               | <- pt_regs.regs[0]
              +---------------------+
  $sp + 28 -> | stack argument #8   | <- pt_regs.args[7]
              +---------------------+
  $sp + 24 -> | stack argument #7   | <- pt_regs.args[6]
              +---------------------+
  $sp + 20 -> | stack argument #6   | <- pt_regs.args[5]
              +---------------------+
  $sp + 16 -> | stack argument #5   | <- pt_regs.args[4]
              +---------------------+
  $sp + 12 -> | psABI space for $a3 | <- pt_regs.args[3]
              +---------------------+
  $sp +  8 -> | psABI space for $a2 | <- pt_regs.args[2]
              +---------------------+
  $sp +  4 -> | psABI space for $a1 | <- pt_regs.args[1]
              +---------------------+
  $sp +  0 -> | psABI space for $a0 | <- pt_regs.args[0]
              +---------------------+

holding user data received and with the first 4 frame slots reserved by the psABI for the compiler to spill the incoming arguments from $a0-$a3 registers (which it sometimes does according to its needs) and the next 4 frame slots designated by the psABI for any stack function arguments that follow. This data is also available for other tasks to peek/poke at as required and where permitted.

Signed-off-by: Maciej W. Rozycki <macro@orcam.me.uk>
Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
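The frame layout in this commit message can be checked with a small C model. This is a sketch under stated assumptions, not the kernel's `struct pt_regs`: `struct o32_frame_model` keeps only the `args` and `regs` members from the diagram, uses `uint32_t` to stand in for the 32-bit o32 word size, and the two helper functions are hypothetical names introduced for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified model of the start of the o32 frame shown in the diagram. */
struct o32_frame_model {
    uint32_t args[8];   /* $sp + 0 .. $sp + 28: psABI slots, then stack args */
    uint32_t regs[32];  /* $sp + 32 onwards: saved general-purpose registers */
};

/* Offset from $sp of syscall argument n (1..4), passed in $a0-$a3. */
static size_t reg_arg_offset(unsigned n)
{
    /* $a0 is register 4, so argument n lives in regs[4 + n - 1]. */
    return offsetof(struct o32_frame_model, regs) +
           (4 + n - 1) * sizeof(uint32_t);
}

/* Offset from $sp of syscall argument n (5..8), copied from the user stack. */
static size_t stack_arg_offset(unsigned n)
{
    /* Argument n lives in args[n - 1]; args[0..3] are the psABI spill slots. */
    return offsetof(struct o32_frame_model, args) +
           (n - 1) * sizeof(uint32_t);
}
```

With this model, argument #1 resolves to offset 48 and argument #5 to offset 16, matching the `$sp + 48` and `$sp + 16` rows of the diagram.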
terrelln pushed a commit that referenced this pull request Mar 4, 2025
…/kernel/git/kvmarm/kvmarm into HEAD

KVM/arm64 fixes for 6.14, take #2

- Large set of fixes for vector handling, especially in the interactions between host and guest state. This fixes a number of bugs affecting actual deployments, and greatly simplifies the FP/SIMD/SVE handling. Thanks to Mark Rutland for dealing with this thankless task.

- Fix an ugly race between vcpu and vgic creation/init, resulting in unexpected behaviours.

- Fix use of kernel VAs at EL2 when emulating timers with nVHE.

- Small set of pKVM improvements and cleanups.
Outdated. Please refer to the new PR.