Cherry-pick support for zstd's external sequence producer API #2

embg · 2023-12-28T20:30:38Z

Outdated. Please refer to the new PR.

Import upstream zstd v1.5.5 to expose upstream's QAT integration. Import from upstream commit 58b3ef79 [0]. This is one commit before the tag v1.5.5-kernel [1], which is signed with upstream's signing key. The next patch in the series imports from v1.5.5-kernel, and is included in the series, rather than just importing directly from v1.5.5-kernel, because it is a non-trivial patch applied to improve the kernel's decompression speed. This commit contains 3 backported patches on top of v1.5.5: Two from the Linux copy of zstd, and one from upstream's `dev` branch. In addition to keeping the kernel's copy of zstd up to date, this update was requested by Intel to expose upstream zstd's external match provider API to the kernel, which allows QAT to accelerate the LZ match finding stage. This commit was generated by: export ZSTD=/path/to/repo/zstd/ export LINUX=/path/to/repo/linux/ cd "$ZSTD/contrib/linux-kernel" git checkout v1.5.5-kernel~ make import LINUX="$LINUX" I tested and benchmarked this commit on x86-64 with gcc-13.2.1 on an Intel i9-9900K by running my benchmark scripts that benchmark zstd's performance in btrfs and squashfs compressed filesystems. This commit improves compression speed, especially for higher compression levels, and regresses decompression speed. But the decompression speed regression is addressed by the next patch in the series. Component, Level, C. time delta, size delta, D. time delta Btrfs , 1, -1.9%, +0.0%, +9.5% Btrfs , 3, -5.6%, +0.0%, +7.4% Btrfs , 5, -4.9%, +0.0%, +5.0% Btrfs , 7, -5.7%, +0.0%, +5.2% Btrfs , 9, -5.7%, +0.0%, +4.0% Squashfs , 1, N/A, 0.0%, +11.6% I also boot tested with a zstd compressed kernel on i386 and aarch64. Link: facebook/zstd@58b3ef7 Link: https://github.com/facebook/zstd/tree/v1.5.5-kernel Signed-off-by: Nick Terrell <terrelln@fb.com>

Backport upstream commit c7269ad [0] to improve zstd decoding speed. Updating the kernel to zstd v1.5.5 earlier in this patch series regressed zstd decoding speed. This turned out to be because gcc was not unrolling the inner loops of the Huffman decoder which are executed a constant number of times [1]. This really hurts performance, as we expect this loop to be completely branch-free. This commit fixes the issue by unrolling the loop manually [2]. The commit fixes one more minor issue, which is to mask a variable shift by 0x3F. The shift was guaranteed to be less than 64, but gcc couldn't prove that, and emitted suboptimal code. Finally, the upstream commit added a build macro `HUF_DISABLE_FAST_DECODE` which is not used in the kernel, but is maintained to keep a clean import from upstream. This commit was generated from upstream signed tag v1.5.5-kernel [3] by: export ZSTD=/path/to/repo/zstd/ export LINUX=/path/to/repo/linux/ cd "$ZSTD/contrib/linux-kernel" git checkout v1.5.5-kernel make import LINUX="$LINUX" I ran my benchmark & test suite before and after this commit to measure the overall decompression speed benefit. It benchmarks zstd at several compression levels. These benchmarks measure the total time it takes to read data from the compressed filesystem. Component, Level, Read time delta Btrfs , 1, -7.0% Btrfs , 3, -3.9% Btrfs , 5, -4.7% Btrfs , 7, -5.5% Btrfs , 9, -2.4% Squashfs , 1, -9.1% Link: facebook/zstd@c7269ad Link: https://gist.github.com/terrelln/2e14ff1fb197102a08d7823d8044978d Link: https://gist.github.com/terrelln/a70bde22a2abc800691fb65c21eabc2a Link: https://github.com/facebook/zstd/tree/v1.5.5-kernel Signed-off-by: Nick Terrell <terrelln@fb.com>

terrelln

Please revert the debug.c change, and make sure there are no other unintended changes.

lib/zstd/common/debug.c

Cherry-picks support for using zstd's external sequence producer API in the kernel. This unblocks the use of QuickAssist hardware acceleration for zstd in applications such as BTRFS. Some context: the kernel uses ZSTD_initStaticCCtx() to create compression contexts. This function builds the compression context in a user-provided buffer which must be sized according to the compression parameters. zstd provides size estimation functions for this purpose, but until now they were not compatible with the external sequence producer API. This cherry-pick fixes that incompatibility. More specifically, it pulls in zstd upstream PRs #3839 and #3854. PR #3854 includes a unit test (upstream only) which validates that the external sequence producer API works correctly in conjunction with ZSTD_initStaticCCtx(). To build this commit, I first cherry-picked the relevant upstream commits onto the upstream v1.5.5-kernel tag: cd ~/repos/zstd git checkout tags/v1.5.5-kernel git cherry-pick -m 1 126ec2669c927b24acd38ea903a211c1b5416588 git cherry-pick c6cabf94417d84ebb5da62e05d8b8a9623763585 I then ran "make import" to copy the changes into my fork of Linux: cd ~/repos/zstd/contrib/linux-kernel/ make import Signed-off-by: Elliot Gorokhovsky <embg@meta.com>

…ea as VM_ALLOC Erhard reported the following KASAN hit while booting his PowerMac G4 with a KASAN-enabled kernel 6.13-rc6: BUG: KASAN: vmalloc-out-of-bounds in copy_to_kernel_nofault+0xd8/0x1c8 Write of size 8 at addr f1000000 by task chronyd/1293 CPU: 0 UID: 123 PID: 1293 Comm: chronyd Tainted: G W 6.13.0-rc6-PMacG4 #2 Tainted: [W]=WARN Hardware name: PowerMac3,6 7455 0x80010303 PowerMac Call Trace: [c2437590] [c1631a84] dump_stack_lvl+0x70/0x8c (unreliable) [c24375b0] [c0504998] print_report+0xdc/0x504 [c2437610] [c050475c] kasan_report+0xf8/0x108 [c2437690] [c0505a3c] kasan_check_range+0x24/0x18c [c24376a0] [c03fb5e4] copy_to_kernel_nofault+0xd8/0x1c8 [c24376c0] [c004c014] patch_instructions+0x15c/0x16c [c2437710] [c00731a8] bpf_arch_text_copy+0x60/0x7c [c2437730] [c0281168] bpf_jit_binary_pack_finalize+0x50/0xac [c2437750] [c0073cf4] bpf_int_jit_compile+0xb30/0xdec [c2437880] [c0280394] bpf_prog_select_runtime+0x15c/0x478 [c24378d0] [c1263428] bpf_prepare_filter+0xbf8/0xc14 [c2437990] [c12677ec] bpf_prog_create_from_user+0x258/0x2b4 [c24379d0] [c027111c] do_seccomp+0x3dc/0x1890 [c2437ac0] [c001d8e0] system_call_exception+0x2dc/0x420 [c2437f30] [c00281ac] ret_from_syscall+0x0/0x2c --- interrupt: c00 at 0x5a1274 NIP: 005a1274 LR: 006a3b3c CTR: 005296c8 REGS: c2437f40 TRAP: 0c00 Tainted: G W (6.13.0-rc6-PMacG4) MSR: 0200f932 <VEC,EE,PR,FP,ME,IR,DR,RI> CR: 24004422 XER: 00000000 GPR00: 00000166 af8f3fa0 a7ee3540 00000001 00000000 013b6500 005a5858 0200f932 GPR08: 00000000 00001fe9 013d5fc8 005296c8 2822244c 00b2fcd8 00000000 af8f4b57 GPR16: 00000000 00000001 00000000 00000000 00000000 00000001 00000000 00000002 GPR24: 00afdbb0 00000000 00000000 00000000 006e0004 013ce060 006e7c1c 00000001 NIP [005a1274] 0x5a1274 LR [006a3b3c] 0x6a3b3c --- interrupt: c00 The buggy address belongs to the virtual mapping at [f1000000, f1002000) created by: text_area_cpu_up+0x20/0x190 The buggy address belongs to the physical page: page: refcount:1 mapcount:0 mapping:00000000 index:0x0 pfn:0x76e30 flags: 0x80000000(zone=2) raw: 80000000 00000000 00000122 00000000 00000000 00000000 ffffffff 00000001 raw: 00000000 page dumped because: kasan: bad access detected Memory state around the buggy address: f0ffff00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 f0ffff80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >f1000000: f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 ^ f1000080: f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f1000100: f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 ================================================================== f8 corresponds to KASAN_VMALLOC_INVALID which means the area is not initialised hence not supposed to be used yet. Powerpc text patching infrastructure allocates a virtual memory area using get_vm_area() and flags it as VM_ALLOC. But that flag is meant to be used for vmalloc() and vmalloc() allocated memory is not supposed to be used before a call to __vmalloc_node_range() which is never called for that area. That went undetected until commit e4137f0 ("mm, kasan, kmsan: instrument copy_from/to_kernel_nofault") The area allocated by text_area_cpu_up() is not vmalloc memory, it is mapped directly on demand when needed by map_kernel_page(). There is no VM flag corresponding to such usage, so just pass no flag. That way the area will be unpoisonned and usable immediately. Reported-by: Erhard Furtner <erhard_f@mailbox.org> Closes: https://lore.kernel.org/all/20250112135832.57c92322@yea/ Fixes: 37bc3e5 ("powerpc/lib/code-patching: Use alternate map for patch_instruction()") Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com> Link: https://patch.msgid.link/06621423da339b374f48c0886e3a5db18e896be8.1739342693.git.christophe.leroy@csgroup.eu

We have several places across the kernel where we want to access another task's syscall arguments, such as ptrace(2), seccomp(2), etc., by making a call to syscall_get_arguments(). This works for register arguments right away by accessing the task's `regs' member of `struct pt_regs', however for stack arguments seen with 32-bit/o32 kernels things are more complicated. Technically they ought to be obtained from the user stack with calls to an access_remote_vm(), but we have an easier way available already. So as to be able to access syscall stack arguments as regular function arguments following the MIPS calling convention we copy them over from the user stack to the kernel stack in arch/mips/kernel/scall32-o32.S, in handle_sys(), to the current stack frame's outgoing argument space at the top of the stack, which is where the handler called expects to see its incoming arguments. This area is also pointed at by the `pt_regs' pointer obtained by task_pt_regs(). Make the o32 stack argument space a proper member of `struct pt_regs' then, by renaming the existing member from `pad0' to `args' and using generated offsets to access the space. No functional change though. With the change in place the o32 kernel stack frame layout at the entry to a syscall handler invoked by handle_sys() is therefore as follows: $sp + 68 -> | ... | <- pt_regs.regs[9] +---------------------+ $sp + 64 -> | $t0 | <- pt_regs.regs[8] +---------------------+ $sp + 60 -> | $a3/argument #4 | <- pt_regs.regs[7] +---------------------+ $sp + 56 -> | $a2/argument #3 | <- pt_regs.regs[6] +---------------------+ $sp + 52 -> | $a1/argument #2 | <- pt_regs.regs[5] +---------------------+ $sp + 48 -> | $a0/argument #1 | <- pt_regs.regs[4] +---------------------+ $sp + 44 -> | $v1 | <- pt_regs.regs[3] +---------------------+ $sp + 40 -> | $v0 | <- pt_regs.regs[2] +---------------------+ $sp + 36 -> | $at | <- pt_regs.regs[1] +---------------------+ $sp + 32 -> | $zero | <- pt_regs.regs[0] +---------------------+ $sp + 28 -> | stack argument torvalds#8 | <- pt_regs.args[7] +---------------------+ $sp + 24 -> | stack argument torvalds#7 | <- pt_regs.args[6] +---------------------+ $sp + 20 -> | stack argument torvalds#6 | <- pt_regs.args[5] +---------------------+ $sp + 16 -> | stack argument #5 | <- pt_regs.args[4] +---------------------+ $sp + 12 -> | psABI space for $a3 | <- pt_regs.args[3] +---------------------+ $sp + 8 -> | psABI space for $a2 | <- pt_regs.args[2] +---------------------+ $sp + 4 -> | psABI space for $a1 | <- pt_regs.args[1] +---------------------+ $sp + 0 -> | psABI space for $a0 | <- pt_regs.args[0] +---------------------+ holding user data received and with the first 4 frame slots reserved by the psABI for the compiler to spill the incoming arguments from $a0-$a3 registers (which it sometimes does according to its needs) and the next 4 frame slots designated by the psABI for any stack function arguments that follow. This data is also available for other tasks to peek/poke at as reqired and where permitted. Signed-off-by: Maciej W. Rozycki <macro@orcam.me.uk> Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>

…/kernel/git/kvmarm/kvmarm into HEAD KVM/arm64 fixes for 6.14, take #2 - Large set of fixes for vector handling, specially in the interactions between host and guest state. This fixes a number of bugs affecting actual deployments, and greatly simplifies the FP/SIMD/SVE handling. Thanks to Mark Rutland for dealing with this thankless task. - Fix an ugly race between vcpu and vgic creation/init, resulting in unexpected behaviours. - Fix use of kernel VAs at EL2 when emulating timers with nVHE. - Small set of pKVM improvements and cleanups.

Nick Terrell added 2 commits November 20, 2023 14:48

embg force-pushed the quickassist branch 3 times, most recently from 8c3cfb1 to 005f476 Compare December 28, 2023 21:13

terrelln requested changes Dec 28, 2023

View reviewed changes

lib/zstd/common/debug.c Outdated Show resolved Hide resolved

embg force-pushed the quickassist branch from 005f476 to 7d87952 Compare December 28, 2023 21:17

embg force-pushed the quickassist branch from 7d87952 to 9b2d4b5 Compare December 28, 2023 21:29

embg requested a review from terrelln December 28, 2023 21:38

embg force-pushed the quickassist branch 8 times, most recently from 7247a9d to 11a0569 Compare May 20, 2024 15:47

TEST COMMIT FOR INTEL

5bf90f7

embg force-pushed the quickassist branch from 11a0569 to 5bf90f7 Compare June 3, 2024 15:13

terrelln force-pushed the zstd-next branch from 3f832df to 7eb1721 Compare March 4, 2025 22:22

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Cherry-pick support for zstd's external sequence producer API #2

Cherry-pick support for zstd's external sequence producer API #2

embg commented Dec 28, 2023 •

edited

Loading

terrelln left a comment

Cherry-pick support for zstd's external sequence producer API #2

Are you sure you want to change the base?

Cherry-pick support for zstd's external sequence producer API #2

Conversation

embg commented Dec 28, 2023 • edited Loading

terrelln left a comment

Choose a reason for hiding this comment

embg commented Dec 28, 2023 •

edited

Loading