-
Notifications
You must be signed in to change notification settings - Fork 55.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
small cleanup in CREDITS #12
Conversation
I believe pull requests like this are not used. https://github.com/torvalds/linux/blob/master/Documentation/HOWTO I recommend reading that. |
Thanks Gunni! Now that I have read my way through the documentation, three small questions before I close this:
|
This is considered a trivial patch. Please consult |
Well, according to https://github.com/torvalds/linux/blob/master/Documentation/SubmittingPatches#L200 you should mail Jiri Kosina trivial@kernel.org. https://github.com/torvalds/linux/blob/master/MAINTAINERS#L6515 However since kernel.org is down i can not say where this should be sent, maybe use vger as a replacement and note that you're mailing there because the correct mail is down. EDIT: yeah or use the mail sc68cal submitted: Jiri Kosina jkosina@suse.cz. |
Thanks a lot! I'll email my patch to Jiri Kosina. |
You also need to amend your commit message to add a sign off line. You need that in the commit message or your patch will not be accepted. Check the docs for more info |
When the cgroup base was allocated with kmalloc, it was necessary to annotate the variable with kmemleak_not_leak(). But because it has recently been changed to be allocated with alloc_page() (which skips kmemleak checks) causes a warning on boot up. I was triggering this output: allocated 8388608 bytes of page_cgroup please try 'cgroup_disable=memory' option if you don't want memory cgroups kmemleak: Trying to color unknown object at 0xf5840000 as Grey Pid: 0, comm: swapper Not tainted 3.0.0-test #12 Call Trace: [<c17e34e6>] ? printk+0x1d/0x1f^M [<c10e2941>] paint_ptr+0x4f/0x78 [<c178ab57>] kmemleak_not_leak+0x58/0x7d [<c108ae9f>] ? __rcu_read_unlock+0x9/0x7d [<c1cdb462>] kmemleak_init+0x19d/0x1e9 [<c1cbf771>] start_kernel+0x346/0x3ec [<c1cbf1b4>] ? loglevel+0x18/0x18 [<c1cbf0aa>] i386_start_kernel+0xaa/0xb0 After a bit of debugging I tracked the object 0xf840000 (and others) down to the cgroup code. The change from allocating base with kmalloc to alloc_page() has the base not calling kmemleak_alloc() which adds the pointer to the object_tree_root, but kmemleak_not_leak() adds it to the crt_early_log[] table. On kmemleak_init(), the entry is found in the early_log[] but not the object_tree_root, and this error message is displayed. If alloc_page() fails then it defaults back to vmalloc() which still uses the kmemleak_alloc() which makes us still need the kmemleak_not_leak() call. The solution is to call the kmemleak_alloc() directly if the alloc_page() succeeds. Reviewed-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Jonathan Nieder <jrnieder@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Ingo Molnar <mingo@elte.hu> wrote: > The patch below addresses these concerns, serializes the output, tidies up the > printout, resulting in this new output: There's one bug remaining that my patch does not address: the vCPUs are not printed in order: # vCPU #0's dump: # vCPU #2's dump: # vCPU torvalds#24's dump: # vCPU #5's dump: # vCPU torvalds#39's dump: # vCPU torvalds#38's dump: # vCPU torvalds#51's dump: # vCPU torvalds#11's dump: # vCPU torvalds#10's dump: # vCPU torvalds#12's dump: This is undesirable as the order of printout is highly random, so successive dumps are difficult to compare. The patch below serializes the signalling itself. (this is on top of the previous patch) The patch also tweaks the vCPU printout line a bit so that it does not start with '#', which is discarded if such messages are pasted into Git commit messages. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Pekka Enberg <penberg@kernel.org>
If the pte mapping in generic_perform_write() is unmapped between iov_iter_fault_in_readable() and iov_iter_copy_from_user_atomic(), the "copied" parameter to ->end_write can be zero. ext4 couldn't cope with it with delayed allocations enabled. This skips the i_disksize enlargement logic if copied is zero and no new data was appeneded to the inode. gdb> bt #0 0xffffffff811afe80 in ext4_da_should_update_i_disksize (file=0xffff88003f606a80, mapping=0xffff88001d3824e0, pos=0x1\ 08000, len=0x1000, copied=0x0, page=0xffffea0000d792e8, fsdata=0x0) at fs/ext4/inode.c:2467 #1 ext4_da_write_end (file=0xffff88003f606a80, mapping=0xffff88001d3824e0, pos=0x108000, len=0x1000, copied=0x0, page=0\ xffffea0000d792e8, fsdata=0x0) at fs/ext4/inode.c:2512 #2 0xffffffff810d97f1 in generic_perform_write (iocb=<value optimized out>, iov=<value optimized out>, nr_segs=<value o\ ptimized out>, pos=0x108000, ppos=0xffff88001e26be40, count=<value optimized out>, written=0x0) at mm/filemap.c:2440 #3 generic_file_buffered_write (iocb=<value optimized out>, iov=<value optimized out>, nr_segs=<value optimized out>, p\ os=0x108000, ppos=0xffff88001e26be40, count=<value optimized out>, written=0x0) at mm/filemap.c:2482 #4 0xffffffff810db5d1 in __generic_file_aio_write (iocb=0xffff88001e26bde8, iov=0xffff88001e26bec8, nr_segs=0x1, ppos=0\ xffff88001e26be40) at mm/filemap.c:2600 #5 0xffffffff810db853 in generic_file_aio_write (iocb=0xffff88001e26bde8, iov=0xffff88001e26bec8, nr_segs=<value optimi\ zed out>, pos=<value optimized out>) at mm/filemap.c:2632 #6 0xffffffff811a71aa in ext4_file_write (iocb=0xffff88001e26bde8, iov=0xffff88001e26bec8, nr_segs=0x1, pos=0x108000) a\ t fs/ext4/file.c:136 #7 0xffffffff811375aa in do_sync_write (filp=0xffff88003f606a80, buf=<value optimized out>, len=<value optimized out>, \ ppos=0xffff88001e26bf48) at fs/read_write.c:406 #8 0xffffffff81137e56 in vfs_write (file=0xffff88003f606a80, buf=0x1ec2960 <Address 0x1ec2960 out of bounds>, count=0x4\ 000, pos=0xffff88001e26bf48) at fs/read_write.c:435 #9 0xffffffff8113816c in sys_write (fd=<value optimized out>, buf=0x1ec2960 <Address 0x1ec2960 out of bounds>, count=0x\ 4000) at fs/read_write.c:487 #10 <signal handler called> #11 0x00007f120077a390 in __brk_reservation_fn_dmi_alloc__ () #12 0x0000000000000000 in ?? () gdb> print offset $22 = 0xffffffffffffffff gdb> print idx $23 = 0xffffffff gdb> print inode->i_blkbits $24 = 0xc gdb> up #1 ext4_da_write_end (file=0xffff88003f606a80, mapping=0xffff88001d3824e0, pos=0x108000, len=0x1000, copied=0x0, page=0\ xffffea0000d792e8, fsdata=0x0) at fs/ext4/inode.c:2512 2512 if (ext4_da_should_update_i_disksize(page, end)) { gdb> print start $25 = 0x0 gdb> print end $26 = 0xffffffffffffffff gdb> print pos $27 = 0x108000 gdb> print new_i_size $28 = 0x108000 gdb> print ((struct ext4_inode_info *)((char *)inode-((int)(&((struct ext4_inode_info *)0)->vfs_inode))))->i_disksize $29 = 0xd9000 gdb> down 2467 for (i = 0; i < idx; i++) gdb> print i $30 = 0xd44acbee This is 100% reproducible with some autonuma development code tuned in a very aggressive manner (not normal way even for knumad) which does "exotic" changes to the ptes. It wouldn't normally trigger but I don't see why it can't happen normally if the page is added to swap cache in between the two faults leading to "copied" being zero (which then hangs in ext4). So it should be fixed. Especially possible with lumpy reclaim (albeit disabled if compaction is enabled) as that would ignore the young bits in the ptes. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@kernel.org
Overly indented code should be refactored. Suggest refactoring excessive indentation of of if/else/for/do/while/switch statements. For example: $ cat t.c #include <stdio.h> #include <stdlib.h> int main(int argc, char **argv) { if (1) if (2) if (3) if (4) if (5) if (6) if (7) if (8) ; return 0; } $ ./scripts/checkpatch.pl -f t.c WARNING: Too many leading tabs - consider code refactoring #12: FILE: t.c:12: + if (6) WARNING: Too many leading tabs - consider code refactoring #13: FILE: t.c:13: + if (7) WARNING: Too many leading tabs - consider code refactoring #14: FILE: t.c:14: + if (8) total: 0 errors, 3 warnings, 17 lines checked t.c has style problems, please review. If any of these errors are false positives, please report them to the maintainer, see CHECKPATCH in MAINTAINERS. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If the netdev is already in NETREG_UNREGISTERING/_UNREGISTERED state, do not update the real num tx queues. netdev_queue_update_kobjects() is already called via remove_queue_kobjects() at NETREG_UNREGISTERING time. So, when upper layer driver, e.g., FCoE protocol stack is monitoring the netdev event of NETDEV_UNREGISTER and calls back to LLD ndo_fcoe_disable() to remove extra queues allocated for FCoE, the associated txq sysfs kobjects are already removed, and trying to update the real num queues would cause something like below: ... PID: 25138 TASK: ffff88021e64c440 CPU: 3 COMMAND: "kworker/3:3" #0 [ffff88021f007760] machine_kexec at ffffffff810226d9 #1 [ffff88021f0077d0] crash_kexec at ffffffff81089d2d #2 [ffff88021f0078a0] oops_end at ffffffff813bca78 #3 [ffff88021f0078d0] no_context at ffffffff81029e72 #4 [ffff88021f007920] __bad_area_nosemaphore at ffffffff8102a155 #5 [ffff88021f0079f0] bad_area_nosemaphore at ffffffff8102a23e torvalds#6 [ffff88021f007a00] do_page_fault at ffffffff813bf32e torvalds#7 [ffff88021f007b10] page_fault at ffffffff813bc045 [exception RIP: sysfs_find_dirent+17] RIP: ffffffff81178611 RSP: ffff88021f007bc0 RFLAGS: 00010246 RAX: ffff88021e64c440 RBX: ffffffff8156cc63 RCX: 0000000000000004 RDX: ffffffff8156cc63 RSI: 0000000000000000 RDI: 0000000000000000 RBP: ffff88021f007be0 R8: 0000000000000004 R9: 0000000000000008 R10: ffffffff816fed00 R11: 0000000000000004 R12: 0000000000000000 R13: ffffffff8156cc63 R14: 0000000000000000 R15: ffff8802222a0000 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 torvalds#8 [ffff88021f007be8] sysfs_get_dirent at ffffffff81178c07 torvalds#9 [ffff88021f007c18] sysfs_remove_group at ffffffff8117ac27 torvalds#10 [ffff88021f007c48] netdev_queue_update_kobjects at ffffffff813178f9 torvalds#11 [ffff88021f007c88] netif_set_real_num_tx_queues at ffffffff81303e38 torvalds#12 [ffff88021f007cc8] ixgbe_set_num_queues at ffffffffa0249763 [ixgbe] torvalds#13 [ffff88021f007cf8] ixgbe_init_interrupt_scheme at ffffffffa024ea89 [ixgbe] torvalds#14 [ffff88021f007d48] ixgbe_fcoe_disable at ffffffffa0267113 [ixgbe] torvalds#15 [ffff88021f007d68] vlan_dev_fcoe_disable at ffffffffa014fef5 [8021q] torvalds#16 [ffff88021f007d78] fcoe_interface_cleanup at ffffffffa02b7dfd [fcoe] torvalds#17 [ffff88021f007df8] fcoe_destroy_work at ffffffffa02b7f08 [fcoe] torvalds#18 [ffff88021f007e18] process_one_work at ffffffff8105d7ca torvalds#19 [ffff88021f007e68] worker_thread at ffffffff81060513 torvalds#20 [ffff88021f007ee8] kthread at ffffffff810648b6 torvalds#21 [ffff88021f007f48] kernel_thread_helper at ffffffff813c40f4 Signed-off-by: Yi Zou <yi.zou@intel.com> Tested-by: Ross Brattain <ross.b.brattain@intel.com> Tested-by: Stephen Ko <stephen.s.ko@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
If the netdev is already in NETREG_UNREGISTERING/_UNREGISTERED state, do not update the real num tx queues. netdev_queue_update_kobjects() is already called via remove_queue_kobjects() at NETREG_UNREGISTERING time. So, when upper layer driver, e.g., FCoE protocol stack is monitoring the netdev event of NETDEV_UNREGISTER and calls back to LLD ndo_fcoe_disable() to remove extra queues allocated for FCoE, the associated txq sysfs kobjects are already removed, and trying to update the real num queues would cause something like below: ... PID: 25138 TASK: ffff88021e64c440 CPU: 3 COMMAND: "kworker/3:3" #0 [ffff88021f007760] machine_kexec at ffffffff810226d9 #1 [ffff88021f0077d0] crash_kexec at ffffffff81089d2d #2 [ffff88021f0078a0] oops_end at ffffffff813bca78 #3 [ffff88021f0078d0] no_context at ffffffff81029e72 #4 [ffff88021f007920] __bad_area_nosemaphore at ffffffff8102a155 #5 [ffff88021f0079f0] bad_area_nosemaphore at ffffffff8102a23e torvalds#6 [ffff88021f007a00] do_page_fault at ffffffff813bf32e torvalds#7 [ffff88021f007b10] page_fault at ffffffff813bc045 [exception RIP: sysfs_find_dirent+17] RIP: ffffffff81178611 RSP: ffff88021f007bc0 RFLAGS: 00010246 RAX: ffff88021e64c440 RBX: ffffffff8156cc63 RCX: 0000000000000004 RDX: ffffffff8156cc63 RSI: 0000000000000000 RDI: 0000000000000000 RBP: ffff88021f007be0 R8: 0000000000000004 R9: 0000000000000008 R10: ffffffff816fed00 R11: 0000000000000004 R12: 0000000000000000 R13: ffffffff8156cc63 R14: 0000000000000000 R15: ffff8802222a0000 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 torvalds#8 [ffff88021f007be8] sysfs_get_dirent at ffffffff81178c07 torvalds#9 [ffff88021f007c18] sysfs_remove_group at ffffffff8117ac27 torvalds#10 [ffff88021f007c48] netdev_queue_update_kobjects at ffffffff813178f9 torvalds#11 [ffff88021f007c88] netif_set_real_num_tx_queues at ffffffff81303e38 torvalds#12 [ffff88021f007cc8] ixgbe_set_num_queues at ffffffffa0249763 [ixgbe] torvalds#13 [ffff88021f007cf8] ixgbe_init_interrupt_scheme at ffffffffa024ea89 [ixgbe] torvalds#14 [ffff88021f007d48] ixgbe_fcoe_disable at ffffffffa0267113 [ixgbe] torvalds#15 [ffff88021f007d68] vlan_dev_fcoe_disable at ffffffffa014fef5 [8021q] torvalds#16 [ffff88021f007d78] fcoe_interface_cleanup at ffffffffa02b7dfd [fcoe] torvalds#17 [ffff88021f007df8] fcoe_destroy_work at ffffffffa02b7f08 [fcoe] torvalds#18 [ffff88021f007e18] process_one_work at ffffffff8105d7ca torvalds#19 [ffff88021f007e68] worker_thread at ffffffff81060513 torvalds#20 [ffff88021f007ee8] kthread at ffffffff810648b6 torvalds#21 [ffff88021f007f48] kernel_thread_helper at ffffffff813c40f4 Signed-off-by: Yi Zou <yi.zou@intel.com> Tested-by: Ross Brattain <ross.b.brattain@intel.com> Tested-by: Stephen Ko <stephen.s.ko@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
…S block during isolation for migration commit 0bf380b upstream. When isolating for migration, migration starts at the start of a zone which is not necessarily pageblock aligned. Further, it stops isolating when COMPACT_CLUSTER_MAX pages are isolated so migrate_pfn is generally not aligned. This allows isolate_migratepages() to call pfn_to_page() on an invalid PFN which can result in a crash. This was originally reported against a 3.0-based kernel with the following trace in a crash dump. PID: 9902 TASK: d47aecd0 CPU: 0 COMMAND: "memcg_process_s" #0 [d72d3ad0] crash_kexec at c028cfdb #1 [d72d3b24] oops_end at c05c5322 #2 [d72d3b38] __bad_area_nosemaphore at c0227e60 #3 [d72d3bec] bad_area at c0227fb6 #4 [d72d3c00] do_page_fault at c05c72ec #5 [d72d3c80] error_code (via page_fault) at c05c47a4 EAX: 00000000 EBX: 000c0000 ECX: 00000001 EDX: 00000807 EBP: 000c0000 DS: 007b ESI: 00000001 ES: 007b EDI: f3000a80 GS: 6f50 CS: 0060 EIP: c030b15a ERR: ffffffff EFLAGS: 00010002 torvalds#6 [d72d3cb4] isolate_migratepages at c030b15a torvalds#7 [d72d3d1] zone_watermark_ok at c02d26cb torvalds#8 [d72d3d2c] compact_zone at c030b8de torvalds#9 [d72d3d68] compact_zone_order at c030bba1 torvalds#10 [d72d3db4] try_to_compact_pages at c030bc84 torvalds#11 [d72d3ddc] __alloc_pages_direct_compact at c02d61e7 torvalds#12 [d72d3e08] __alloc_pages_slowpath at c02d66c7 torvalds#13 [d72d3e78] __alloc_pages_nodemask at c02d6a97 torvalds#14 [d72d3eb8] alloc_pages_vma at c030a845 torvalds#15 [d72d3ed4] do_huge_pmd_anonymous_page at c03178eb torvalds#16 [d72d3f00] handle_mm_fault at c02f36c6 torvalds#17 [d72d3f30] do_page_fault at c05c70ed torvalds#18 [d72d3fb0] error_code (via page_fault) at c05c47a4 EAX: b71ff000 EBX: 00000001 ECX: 00001600 EDX: 00000431 DS: 007b ESI: 08048950 ES: 007b EDI: bfaa3788 SS: 007b ESP: bfaa36e0 EBP: bfaa3828 GS: 6f50 CS: 0073 EIP: 080487c8 ERR: ffffffff EFLAGS: 00010202 It was also reported by Herbert van den Bergh against 3.1-based kernel with the following snippet from the console log. BUG: unable to handle kernel paging request at 01c00008 IP: [<c0522399>] isolate_migratepages+0x119/0x390 *pdpt = 000000002f7ce001 *pde = 0000000000000000 It is expected that it also affects 3.2.x and current mainline. The problem is that pfn_valid is only called on the first PFN being checked and that PFN is not necessarily aligned. Lets say we have a case like this H = MAX_ORDER_NR_PAGES boundary | = pageblock boundary m = cc->migrate_pfn f = cc->free_pfn o = memory hole H------|------H------|----m-Hoooooo|ooooooH-f----|------H The migrate_pfn is just below a memory hole and the free scanner is beyond the hole. When isolate_migratepages started, it scans from migrate_pfn to migrate_pfn+pageblock_nr_pages which is now in a memory hole. It checks pfn_valid() on the first PFN but then scans into the hole where there are not necessarily valid struct pages. This patch ensures that isolate_migratepages calls pfn_valid when necessary. Reported-by: Herbert van den Bergh <herbert.van.den.bergh@oracle.com> Tested-by: Herbert van den Bergh <herbert.van.den.bergh@oracle.com> Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Michal Nazarewicz <mina86@mina86.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
…S block during isolation for migration commit 0bf380b upstream. When isolating for migration, migration starts at the start of a zone which is not necessarily pageblock aligned. Further, it stops isolating when COMPACT_CLUSTER_MAX pages are isolated so migrate_pfn is generally not aligned. This allows isolate_migratepages() to call pfn_to_page() on an invalid PFN which can result in a crash. This was originally reported against a 3.0-based kernel with the following trace in a crash dump. PID: 9902 TASK: d47aecd0 CPU: 0 COMMAND: "memcg_process_s" #0 [d72d3ad0] crash_kexec at c028cfdb #1 [d72d3b24] oops_end at c05c5322 #2 [d72d3b38] __bad_area_nosemaphore at c0227e60 #3 [d72d3bec] bad_area at c0227fb6 #4 [d72d3c00] do_page_fault at c05c72ec #5 [d72d3c80] error_code (via page_fault) at c05c47a4 EAX: 00000000 EBX: 000c0000 ECX: 00000001 EDX: 00000807 EBP: 000c0000 DS: 007b ESI: 00000001 ES: 007b EDI: f3000a80 GS: 6f50 CS: 0060 EIP: c030b15a ERR: ffffffff EFLAGS: 00010002 torvalds#6 [d72d3cb4] isolate_migratepages at c030b15a torvalds#7 [d72d3d1] zone_watermark_ok at c02d26cb torvalds#8 [d72d3d2c] compact_zone at c030b8de torvalds#9 [d72d3d68] compact_zone_order at c030bba1 torvalds#10 [d72d3db4] try_to_compact_pages at c030bc84 torvalds#11 [d72d3ddc] __alloc_pages_direct_compact at c02d61e7 torvalds#12 [d72d3e08] __alloc_pages_slowpath at c02d66c7 torvalds#13 [d72d3e78] __alloc_pages_nodemask at c02d6a97 torvalds#14 [d72d3eb8] alloc_pages_vma at c030a845 torvalds#15 [d72d3ed4] do_huge_pmd_anonymous_page at c03178eb torvalds#16 [d72d3f00] handle_mm_fault at c02f36c6 torvalds#17 [d72d3f30] do_page_fault at c05c70ed torvalds#18 [d72d3fb0] error_code (via page_fault) at c05c47a4 EAX: b71ff000 EBX: 00000001 ECX: 00001600 EDX: 00000431 DS: 007b ESI: 08048950 ES: 007b EDI: bfaa3788 SS: 007b ESP: bfaa36e0 EBP: bfaa3828 GS: 6f50 CS: 0073 EIP: 080487c8 ERR: ffffffff EFLAGS: 00010202 It was also reported by Herbert van den Bergh against 3.1-based kernel with the following snippet from the console log. BUG: unable to handle kernel paging request at 01c00008 IP: [<c0522399>] isolate_migratepages+0x119/0x390 *pdpt = 000000002f7ce001 *pde = 0000000000000000 It is expected that it also affects 3.2.x and current mainline. The problem is that pfn_valid is only called on the first PFN being checked and that PFN is not necessarily aligned. Lets say we have a case like this H = MAX_ORDER_NR_PAGES boundary | = pageblock boundary m = cc->migrate_pfn f = cc->free_pfn o = memory hole H------|------H------|----m-Hoooooo|ooooooH-f----|------H The migrate_pfn is just below a memory hole and the free scanner is beyond the hole. When isolate_migratepages started, it scans from migrate_pfn to migrate_pfn+pageblock_nr_pages which is now in a memory hole. It checks pfn_valid() on the first PFN but then scans into the hole where there are not necessarily valid struct pages. This patch ensures that isolate_migratepages calls pfn_valid when necessary. Reported-by: Herbert van den Bergh <herbert.van.den.bergh@oracle.com> Tested-by: Herbert van den Bergh <herbert.van.den.bergh@oracle.com> Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Michal Nazarewicz <mina86@mina86.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
…S block during isolation for migration commit 0bf380b upstream. When isolating for migration, migration starts at the start of a zone which is not necessarily pageblock aligned. Further, it stops isolating when COMPACT_CLUSTER_MAX pages are isolated so migrate_pfn is generally not aligned. This allows isolate_migratepages() to call pfn_to_page() on an invalid PFN which can result in a crash. This was originally reported against a 3.0-based kernel with the following trace in a crash dump. PID: 9902 TASK: d47aecd0 CPU: 0 COMMAND: "memcg_process_s" #0 [d72d3ad0] crash_kexec at c028cfdb #1 [d72d3b24] oops_end at c05c5322 #2 [d72d3b38] __bad_area_nosemaphore at c0227e60 #3 [d72d3bec] bad_area at c0227fb6 #4 [d72d3c00] do_page_fault at c05c72ec #5 [d72d3c80] error_code (via page_fault) at c05c47a4 EAX: 00000000 EBX: 000c0000 ECX: 00000001 EDX: 00000807 EBP: 000c0000 DS: 007b ESI: 00000001 ES: 007b EDI: f3000a80 GS: 6f50 CS: 0060 EIP: c030b15a ERR: ffffffff EFLAGS: 00010002 #6 [d72d3cb4] isolate_migratepages at c030b15a #7 [d72d3d1] zone_watermark_ok at c02d26cb #8 [d72d3d2c] compact_zone at c030b8de #9 [d72d3d68] compact_zone_order at c030bba1 torvalds#10 [d72d3db4] try_to_compact_pages at c030bc84 torvalds#11 [d72d3ddc] __alloc_pages_direct_compact at c02d61e7 torvalds#12 [d72d3e08] __alloc_pages_slowpath at c02d66c7 torvalds#13 [d72d3e78] __alloc_pages_nodemask at c02d6a97 torvalds#14 [d72d3eb8] alloc_pages_vma at c030a845 torvalds#15 [d72d3ed4] do_huge_pmd_anonymous_page at c03178eb torvalds#16 [d72d3f00] handle_mm_fault at c02f36c6 torvalds#17 [d72d3f30] do_page_fault at c05c70ed torvalds#18 [d72d3fb0] error_code (via page_fault) at c05c47a4 EAX: b71ff000 EBX: 00000001 ECX: 00001600 EDX: 00000431 DS: 007b ESI: 08048950 ES: 007b EDI: bfaa3788 SS: 007b ESP: bfaa36e0 EBP: bfaa3828 GS: 6f50 CS: 0073 EIP: 080487c8 ERR: ffffffff EFLAGS: 00010202 It was also reported by Herbert van den Bergh against 3.1-based kernel with the following snippet from the console log. BUG: unable to handle kernel paging request at 01c00008 IP: [<c0522399>] isolate_migratepages+0x119/0x390 *pdpt = 000000002f7ce001 *pde = 0000000000000000 It is expected that it also affects 3.2.x and current mainline. The problem is that pfn_valid is only called on the first PFN being checked and that PFN is not necessarily aligned. Lets say we have a case like this H = MAX_ORDER_NR_PAGES boundary | = pageblock boundary m = cc->migrate_pfn f = cc->free_pfn o = memory hole H------|------H------|----m-Hoooooo|ooooooH-f----|------H The migrate_pfn is just below a memory hole and the free scanner is beyond the hole. When isolate_migratepages started, it scans from migrate_pfn to migrate_pfn+pageblock_nr_pages which is now in a memory hole. It checks pfn_valid() on the first PFN but then scans into the hole where there are not necessarily valid struct pages. This patch ensures that isolate_migratepages calls pfn_valid when necessary. Reported-by: Herbert van den Bergh <herbert.van.den.bergh@oracle.com> Tested-by: Herbert van den Bergh <herbert.van.den.bergh@oracle.com> Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Michal Nazarewicz <mina86@mina86.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
…S block during isolation for migration commit 0bf380b upstream. When isolating for migration, migration starts at the start of a zone which is not necessarily pageblock aligned. Further, it stops isolating when COMPACT_CLUSTER_MAX pages are isolated so migrate_pfn is generally not aligned. This allows isolate_migratepages() to call pfn_to_page() on an invalid PFN which can result in a crash. This was originally reported against a 3.0-based kernel with the following trace in a crash dump. PID: 9902 TASK: d47aecd0 CPU: 0 COMMAND: "memcg_process_s" #0 [d72d3ad0] crash_kexec at c028cfdb #1 [d72d3b24] oops_end at c05c5322 #2 [d72d3b38] __bad_area_nosemaphore at c0227e60 #3 [d72d3bec] bad_area at c0227fb6 #4 [d72d3c00] do_page_fault at c05c72ec #5 [d72d3c80] error_code (via page_fault) at c05c47a4 EAX: 00000000 EBX: 000c0000 ECX: 00000001 EDX: 00000807 EBP: 000c0000 DS: 007b ESI: 00000001 ES: 007b EDI: f3000a80 GS: 6f50 CS: 0060 EIP: c030b15a ERR: ffffffff EFLAGS: 00010002 #6 [d72d3cb4] isolate_migratepages at c030b15a #7 [d72d3d1] zone_watermark_ok at c02d26cb #8 [d72d3d2c] compact_zone at c030b8de #9 [d72d3d68] compact_zone_order at c030bba1 torvalds#10 [d72d3db4] try_to_compact_pages at c030bc84 torvalds#11 [d72d3ddc] __alloc_pages_direct_compact at c02d61e7 torvalds#12 [d72d3e08] __alloc_pages_slowpath at c02d66c7 torvalds#13 [d72d3e78] __alloc_pages_nodemask at c02d6a97 torvalds#14 [d72d3eb8] alloc_pages_vma at c030a845 torvalds#15 [d72d3ed4] do_huge_pmd_anonymous_page at c03178eb torvalds#16 [d72d3f00] handle_mm_fault at c02f36c6 torvalds#17 [d72d3f30] do_page_fault at c05c70ed torvalds#18 [d72d3fb0] error_code (via page_fault) at c05c47a4 EAX: b71ff000 EBX: 00000001 ECX: 00001600 EDX: 00000431 DS: 007b ESI: 08048950 ES: 007b EDI: bfaa3788 SS: 007b ESP: bfaa36e0 EBP: bfaa3828 GS: 6f50 CS: 0073 EIP: 080487c8 ERR: ffffffff EFLAGS: 00010202 It was also reported by Herbert van den Bergh against 3.1-based kernel with the following snippet from the console log. BUG: unable to handle kernel paging request at 01c00008 IP: [<c0522399>] isolate_migratepages+0x119/0x390 *pdpt = 000000002f7ce001 *pde = 0000000000000000 It is expected that it also affects 3.2.x and current mainline. The problem is that pfn_valid is only called on the first PFN being checked and that PFN is not necessarily aligned. Lets say we have a case like this H = MAX_ORDER_NR_PAGES boundary | = pageblock boundary m = cc->migrate_pfn f = cc->free_pfn o = memory hole H------|------H------|----m-Hoooooo|ooooooH-f----|------H The migrate_pfn is just below a memory hole and the free scanner is beyond the hole. When isolate_migratepages started, it scans from migrate_pfn to migrate_pfn+pageblock_nr_pages which is now in a memory hole. It checks pfn_valid() on the first PFN but then scans into the hole where there are not necessarily valid struct pages. This patch ensures that isolate_migratepages calls pfn_valid when necessary. Reported-by: Herbert van den Bergh <herbert.van.den.bergh@oracle.com> Tested-by: Herbert van den Bergh <herbert.van.den.bergh@oracle.com> Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Michal Nazarewicz <mina86@mina86.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
…S block during isolation for migration commit 0bf380b upstream. When isolating for migration, migration starts at the start of a zone which is not necessarily pageblock aligned. Further, it stops isolating when COMPACT_CLUSTER_MAX pages are isolated so migrate_pfn is generally not aligned. This allows isolate_migratepages() to call pfn_to_page() on an invalid PFN which can result in a crash. This was originally reported against a 3.0-based kernel with the following trace in a crash dump. PID: 9902 TASK: d47aecd0 CPU: 0 COMMAND: "memcg_process_s" #0 [d72d3ad0] crash_kexec at c028cfdb #1 [d72d3b24] oops_end at c05c5322 #2 [d72d3b38] __bad_area_nosemaphore at c0227e60 #3 [d72d3bec] bad_area at c0227fb6 #4 [d72d3c00] do_page_fault at c05c72ec #5 [d72d3c80] error_code (via page_fault) at c05c47a4 EAX: 00000000 EBX: 000c0000 ECX: 00000001 EDX: 00000807 EBP: 000c0000 DS: 007b ESI: 00000001 ES: 007b EDI: f3000a80 GS: 6f50 CS: 0060 EIP: c030b15a ERR: ffffffff EFLAGS: 00010002 #6 [d72d3cb4] isolate_migratepages at c030b15a #7 [d72d3d1] zone_watermark_ok at c02d26cb #8 [d72d3d2c] compact_zone at c030b8de #9 [d72d3d68] compact_zone_order at c030bba1 torvalds#10 [d72d3db4] try_to_compact_pages at c030bc84 torvalds#11 [d72d3ddc] __alloc_pages_direct_compact at c02d61e7 torvalds#12 [d72d3e08] __alloc_pages_slowpath at c02d66c7 torvalds#13 [d72d3e78] __alloc_pages_nodemask at c02d6a97 torvalds#14 [d72d3eb8] alloc_pages_vma at c030a845 torvalds#15 [d72d3ed4] do_huge_pmd_anonymous_page at c03178eb torvalds#16 [d72d3f00] handle_mm_fault at c02f36c6 torvalds#17 [d72d3f30] do_page_fault at c05c70ed torvalds#18 [d72d3fb0] error_code (via page_fault) at c05c47a4 EAX: b71ff000 EBX: 00000001 ECX: 00001600 EDX: 00000431 DS: 007b ESI: 08048950 ES: 007b EDI: bfaa3788 SS: 007b ESP: bfaa36e0 EBP: bfaa3828 GS: 6f50 CS: 0073 EIP: 080487c8 ERR: ffffffff EFLAGS: 00010202 It was also reported by Herbert van den Bergh against 3.1-based kernel with the following snippet from the console log. BUG: unable to handle kernel paging request at 01c00008 IP: [<c0522399>] isolate_migratepages+0x119/0x390 *pdpt = 000000002f7ce001 *pde = 0000000000000000 It is expected that it also affects 3.2.x and current mainline. The problem is that pfn_valid is only called on the first PFN being checked and that PFN is not necessarily aligned. Lets say we have a case like this H = MAX_ORDER_NR_PAGES boundary | = pageblock boundary m = cc->migrate_pfn f = cc->free_pfn o = memory hole H------|------H------|----m-Hoooooo|ooooooH-f----|------H The migrate_pfn is just below a memory hole and the free scanner is beyond the hole. When isolate_migratepages started, it scans from migrate_pfn to migrate_pfn+pageblock_nr_pages which is now in a memory hole. It checks pfn_valid() on the first PFN but then scans into the hole where there are not necessarily valid struct pages. This patch ensures that isolate_migratepages calls pfn_valid when necessary. Reported-by: Herbert van den Bergh <herbert.van.den.bergh@oracle.com> Tested-by: Herbert van den Bergh <herbert.van.den.bergh@oracle.com> Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Michal Nazarewicz <mina86@mina86.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
…S block during isolation for migration commit 0bf380b upstream. When isolating for migration, migration starts at the start of a zone which is not necessarily pageblock aligned. Further, it stops isolating when COMPACT_CLUSTER_MAX pages are isolated so migrate_pfn is generally not aligned. This allows isolate_migratepages() to call pfn_to_page() on an invalid PFN which can result in a crash. This was originally reported against a 3.0-based kernel with the following trace in a crash dump. PID: 9902 TASK: d47aecd0 CPU: 0 COMMAND: "memcg_process_s" #0 [d72d3ad0] crash_kexec at c028cfdb #1 [d72d3b24] oops_end at c05c5322 #2 [d72d3b38] __bad_area_nosemaphore at c0227e60 #3 [d72d3bec] bad_area at c0227fb6 #4 [d72d3c00] do_page_fault at c05c72ec #5 [d72d3c80] error_code (via page_fault) at c05c47a4 EAX: 00000000 EBX: 000c0000 ECX: 00000001 EDX: 00000807 EBP: 000c0000 DS: 007b ESI: 00000001 ES: 007b EDI: f3000a80 GS: 6f50 CS: 0060 EIP: c030b15a ERR: ffffffff EFLAGS: 00010002 #6 [d72d3cb4] isolate_migratepages at c030b15a #7 [d72d3d1] zone_watermark_ok at c02d26cb #8 [d72d3d2c] compact_zone at c030b8de #9 [d72d3d68] compact_zone_order at c030bba1 torvalds#10 [d72d3db4] try_to_compact_pages at c030bc84 torvalds#11 [d72d3ddc] __alloc_pages_direct_compact at c02d61e7 torvalds#12 [d72d3e08] __alloc_pages_slowpath at c02d66c7 torvalds#13 [d72d3e78] __alloc_pages_nodemask at c02d6a97 torvalds#14 [d72d3eb8] alloc_pages_vma at c030a845 torvalds#15 [d72d3ed4] do_huge_pmd_anonymous_page at c03178eb torvalds#16 [d72d3f00] handle_mm_fault at c02f36c6 torvalds#17 [d72d3f30] do_page_fault at c05c70ed torvalds#18 [d72d3fb0] error_code (via page_fault) at c05c47a4 EAX: b71ff000 EBX: 00000001 ECX: 00001600 EDX: 00000431 DS: 007b ESI: 08048950 ES: 007b EDI: bfaa3788 SS: 007b ESP: bfaa36e0 EBP: bfaa3828 GS: 6f50 CS: 0073 EIP: 080487c8 ERR: ffffffff EFLAGS: 00010202 It was also reported by Herbert van den Bergh against 3.1-based kernel with the following snippet from the console log. BUG: unable to handle kernel paging request at 01c00008 IP: [<c0522399>] isolate_migratepages+0x119/0x390 *pdpt = 000000002f7ce001 *pde = 0000000000000000 It is expected that it also affects 3.2.x and current mainline. The problem is that pfn_valid is only called on the first PFN being checked and that PFN is not necessarily aligned. Lets say we have a case like this H = MAX_ORDER_NR_PAGES boundary | = pageblock boundary m = cc->migrate_pfn f = cc->free_pfn o = memory hole H------|------H------|----m-Hoooooo|ooooooH-f----|------H The migrate_pfn is just below a memory hole and the free scanner is beyond the hole. When isolate_migratepages started, it scans from migrate_pfn to migrate_pfn+pageblock_nr_pages which is now in a memory hole. It checks pfn_valid() on the first PFN but then scans into the hole where there are not necessarily valid struct pages. This patch ensures that isolate_migratepages calls pfn_valid when necessary. Reported-by: Herbert van den Bergh <herbert.van.den.bergh@oracle.com> Tested-by: Herbert van den Bergh <herbert.van.den.bergh@oracle.com> Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Michal Nazarewicz <mina86@mina86.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
…S block during isolation for migration commit 0bf380b upstream. When isolating for migration, migration starts at the start of a zone which is not necessarily pageblock aligned. Further, it stops isolating when COMPACT_CLUSTER_MAX pages are isolated so migrate_pfn is generally not aligned. This allows isolate_migratepages() to call pfn_to_page() on an invalid PFN which can result in a crash. This was originally reported against a 3.0-based kernel with the following trace in a crash dump. PID: 9902 TASK: d47aecd0 CPU: 0 COMMAND: "memcg_process_s" #0 [d72d3ad0] crash_kexec at c028cfdb #1 [d72d3b24] oops_end at c05c5322 #2 [d72d3b38] __bad_area_nosemaphore at c0227e60 #3 [d72d3bec] bad_area at c0227fb6 #4 [d72d3c00] do_page_fault at c05c72ec #5 [d72d3c80] error_code (via page_fault) at c05c47a4 EAX: 00000000 EBX: 000c0000 ECX: 00000001 EDX: 00000807 EBP: 000c0000 DS: 007b ESI: 00000001 ES: 007b EDI: f3000a80 GS: 6f50 CS: 0060 EIP: c030b15a ERR: ffffffff EFLAGS: 00010002 #6 [d72d3cb4] isolate_migratepages at c030b15a #7 [d72d3d1] zone_watermark_ok at c02d26cb #8 [d72d3d2c] compact_zone at c030b8de #9 [d72d3d68] compact_zone_order at c030bba1 torvalds#10 [d72d3db4] try_to_compact_pages at c030bc84 torvalds#11 [d72d3ddc] __alloc_pages_direct_compact at c02d61e7 torvalds#12 [d72d3e08] __alloc_pages_slowpath at c02d66c7 torvalds#13 [d72d3e78] __alloc_pages_nodemask at c02d6a97 torvalds#14 [d72d3eb8] alloc_pages_vma at c030a845 torvalds#15 [d72d3ed4] do_huge_pmd_anonymous_page at c03178eb torvalds#16 [d72d3f00] handle_mm_fault at c02f36c6 torvalds#17 [d72d3f30] do_page_fault at c05c70ed torvalds#18 [d72d3fb0] error_code (via page_fault) at c05c47a4 EAX: b71ff000 EBX: 00000001 ECX: 00001600 EDX: 00000431 DS: 007b ESI: 08048950 ES: 007b EDI: bfaa3788 SS: 007b ESP: bfaa36e0 EBP: bfaa3828 GS: 6f50 CS: 0073 EIP: 080487c8 ERR: ffffffff EFLAGS: 00010202 It was also reported by Herbert van den Bergh against 3.1-based kernel with the following snippet from the console log. BUG: unable to handle kernel paging request at 01c00008 IP: [<c0522399>] isolate_migratepages+0x119/0x390 *pdpt = 000000002f7ce001 *pde = 0000000000000000 It is expected that it also affects 3.2.x and current mainline. The problem is that pfn_valid is only called on the first PFN being checked and that PFN is not necessarily aligned. Lets say we have a case like this H = MAX_ORDER_NR_PAGES boundary | = pageblock boundary m = cc->migrate_pfn f = cc->free_pfn o = memory hole H------|------H------|----m-Hoooooo|ooooooH-f----|------H The migrate_pfn is just below a memory hole and the free scanner is beyond the hole. When isolate_migratepages started, it scans from migrate_pfn to migrate_pfn+pageblock_nr_pages which is now in a memory hole. It checks pfn_valid() on the first PFN but then scans into the hole where there are not necessarily valid struct pages. This patch ensures that isolate_migratepages calls pfn_valid when necessary. Reported-by: Herbert van den Bergh <herbert.van.den.bergh@oracle.com> Tested-by: Herbert van den Bergh <herbert.van.den.bergh@oracle.com> Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Michal Nazarewicz <mina86@mina86.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
…S block during isolation for migration commit 0bf380b upstream. When isolating for migration, migration starts at the start of a zone which is not necessarily pageblock aligned. Further, it stops isolating when COMPACT_CLUSTER_MAX pages are isolated so migrate_pfn is generally not aligned. This allows isolate_migratepages() to call pfn_to_page() on an invalid PFN which can result in a crash. This was originally reported against a 3.0-based kernel with the following trace in a crash dump. PID: 9902 TASK: d47aecd0 CPU: 0 COMMAND: "memcg_process_s" #0 [d72d3ad0] crash_kexec at c028cfdb #1 [d72d3b24] oops_end at c05c5322 #2 [d72d3b38] __bad_area_nosemaphore at c0227e60 #3 [d72d3bec] bad_area at c0227fb6 #4 [d72d3c00] do_page_fault at c05c72ec #5 [d72d3c80] error_code (via page_fault) at c05c47a4 EAX: 00000000 EBX: 000c0000 ECX: 00000001 EDX: 00000807 EBP: 000c0000 DS: 007b ESI: 00000001 ES: 007b EDI: f3000a80 GS: 6f50 CS: 0060 EIP: c030b15a ERR: ffffffff EFLAGS: 00010002 #6 [d72d3cb4] isolate_migratepages at c030b15a #7 [d72d3d1] zone_watermark_ok at c02d26cb #8 [d72d3d2c] compact_zone at c030b8de #9 [d72d3d68] compact_zone_order at c030bba1 torvalds#10 [d72d3db4] try_to_compact_pages at c030bc84 torvalds#11 [d72d3ddc] __alloc_pages_direct_compact at c02d61e7 torvalds#12 [d72d3e08] __alloc_pages_slowpath at c02d66c7 torvalds#13 [d72d3e78] __alloc_pages_nodemask at c02d6a97 torvalds#14 [d72d3eb8] alloc_pages_vma at c030a845 torvalds#15 [d72d3ed4] do_huge_pmd_anonymous_page at c03178eb torvalds#16 [d72d3f00] handle_mm_fault at c02f36c6 torvalds#17 [d72d3f30] do_page_fault at c05c70ed torvalds#18 [d72d3fb0] error_code (via page_fault) at c05c47a4 EAX: b71ff000 EBX: 00000001 ECX: 00001600 EDX: 00000431 DS: 007b ESI: 08048950 ES: 007b EDI: bfaa3788 SS: 007b ESP: bfaa36e0 EBP: bfaa3828 GS: 6f50 CS: 0073 EIP: 080487c8 ERR: ffffffff EFLAGS: 00010202 It was also reported by Herbert van den Bergh against 3.1-based kernel with the following snippet from the console log. BUG: unable to handle kernel paging request at 01c00008 IP: [<c0522399>] isolate_migratepages+0x119/0x390 *pdpt = 000000002f7ce001 *pde = 0000000000000000 It is expected that it also affects 3.2.x and current mainline. The problem is that pfn_valid is only called on the first PFN being checked and that PFN is not necessarily aligned. Lets say we have a case like this H = MAX_ORDER_NR_PAGES boundary | = pageblock boundary m = cc->migrate_pfn f = cc->free_pfn o = memory hole H------|------H------|----m-Hoooooo|ooooooH-f----|------H The migrate_pfn is just below a memory hole and the free scanner is beyond the hole. When isolate_migratepages started, it scans from migrate_pfn to migrate_pfn+pageblock_nr_pages which is now in a memory hole. It checks pfn_valid() on the first PFN but then scans into the hole where there are not necessarily valid struct pages. This patch ensures that isolate_migratepages calls pfn_valid when necessary. Reported-by: Herbert van den Bergh <herbert.van.den.bergh@oracle.com> Tested-by: Herbert van den Bergh <herbert.van.den.bergh@oracle.com> Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Michal Nazarewicz <mina86@mina86.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
…S block during isolation for migration commit 0bf380b upstream. When isolating for migration, migration starts at the start of a zone which is not necessarily pageblock aligned. Further, it stops isolating when COMPACT_CLUSTER_MAX pages are isolated so migrate_pfn is generally not aligned. This allows isolate_migratepages() to call pfn_to_page() on an invalid PFN which can result in a crash. This was originally reported against a 3.0-based kernel with the following trace in a crash dump. PID: 9902 TASK: d47aecd0 CPU: 0 COMMAND: "memcg_process_s" #0 [d72d3ad0] crash_kexec at c028cfdb #1 [d72d3b24] oops_end at c05c5322 #2 [d72d3b38] __bad_area_nosemaphore at c0227e60 #3 [d72d3bec] bad_area at c0227fb6 #4 [d72d3c00] do_page_fault at c05c72ec #5 [d72d3c80] error_code (via page_fault) at c05c47a4 EAX: 00000000 EBX: 000c0000 ECX: 00000001 EDX: 00000807 EBP: 000c0000 DS: 007b ESI: 00000001 ES: 007b EDI: f3000a80 GS: 6f50 CS: 0060 EIP: c030b15a ERR: ffffffff EFLAGS: 00010002 #6 [d72d3cb4] isolate_migratepages at c030b15a #7 [d72d3d1] zone_watermark_ok at c02d26cb #8 [d72d3d2c] compact_zone at c030b8de #9 [d72d3d68] compact_zone_order at c030bba1 torvalds#10 [d72d3db4] try_to_compact_pages at c030bc84 torvalds#11 [d72d3ddc] __alloc_pages_direct_compact at c02d61e7 torvalds#12 [d72d3e08] __alloc_pages_slowpath at c02d66c7 torvalds#13 [d72d3e78] __alloc_pages_nodemask at c02d6a97 torvalds#14 [d72d3eb8] alloc_pages_vma at c030a845 torvalds#15 [d72d3ed4] do_huge_pmd_anonymous_page at c03178eb torvalds#16 [d72d3f00] handle_mm_fault at c02f36c6 torvalds#17 [d72d3f30] do_page_fault at c05c70ed torvalds#18 [d72d3fb0] error_code (via page_fault) at c05c47a4 EAX: b71ff000 EBX: 00000001 ECX: 00001600 EDX: 00000431 DS: 007b ESI: 08048950 ES: 007b EDI: bfaa3788 SS: 007b ESP: bfaa36e0 EBP: bfaa3828 GS: 6f50 CS: 0073 EIP: 080487c8 ERR: ffffffff EFLAGS: 00010202 It was also reported by Herbert van den Bergh against 3.1-based kernel with the following snippet from the console log. BUG: unable to handle kernel paging request at 01c00008 IP: [<c0522399>] isolate_migratepages+0x119/0x390 *pdpt = 000000002f7ce001 *pde = 0000000000000000 It is expected that it also affects 3.2.x and current mainline. The problem is that pfn_valid is only called on the first PFN being checked and that PFN is not necessarily aligned. Lets say we have a case like this H = MAX_ORDER_NR_PAGES boundary | = pageblock boundary m = cc->migrate_pfn f = cc->free_pfn o = memory hole H------|------H------|----m-Hoooooo|ooooooH-f----|------H The migrate_pfn is just below a memory hole and the free scanner is beyond the hole. When isolate_migratepages started, it scans from migrate_pfn to migrate_pfn+pageblock_nr_pages which is now in a memory hole. It checks pfn_valid() on the first PFN but then scans into the hole where there are not necessarily valid struct pages. This patch ensures that isolate_migratepages calls pfn_valid when necessary. Reported-by: Herbert van den Bergh <herbert.van.den.bergh@oracle.com> Tested-by: Herbert van den Bergh <herbert.van.den.bergh@oracle.com> Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Michal Nazarewicz <mina86@mina86.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
…S block during isolation for migration commit 0bf380b upstream. When isolating for migration, migration starts at the start of a zone which is not necessarily pageblock aligned. Further, it stops isolating when COMPACT_CLUSTER_MAX pages are isolated so migrate_pfn is generally not aligned. This allows isolate_migratepages() to call pfn_to_page() on an invalid PFN which can result in a crash. This was originally reported against a 3.0-based kernel with the following trace in a crash dump. PID: 9902 TASK: d47aecd0 CPU: 0 COMMAND: "memcg_process_s" #0 [d72d3ad0] crash_kexec at c028cfdb #1 [d72d3b24] oops_end at c05c5322 #2 [d72d3b38] __bad_area_nosemaphore at c0227e60 #3 [d72d3bec] bad_area at c0227fb6 #4 [d72d3c00] do_page_fault at c05c72ec #5 [d72d3c80] error_code (via page_fault) at c05c47a4 EAX: 00000000 EBX: 000c0000 ECX: 00000001 EDX: 00000807 EBP: 000c0000 DS: 007b ESI: 00000001 ES: 007b EDI: f3000a80 GS: 6f50 CS: 0060 EIP: c030b15a ERR: ffffffff EFLAGS: 00010002 #6 [d72d3cb4] isolate_migratepages at c030b15a #7 [d72d3d1] zone_watermark_ok at c02d26cb #8 [d72d3d2c] compact_zone at c030b8de #9 [d72d3d68] compact_zone_order at c030bba1 torvalds#10 [d72d3db4] try_to_compact_pages at c030bc84 torvalds#11 [d72d3ddc] __alloc_pages_direct_compact at c02d61e7 torvalds#12 [d72d3e08] __alloc_pages_slowpath at c02d66c7 torvalds#13 [d72d3e78] __alloc_pages_nodemask at c02d6a97 torvalds#14 [d72d3eb8] alloc_pages_vma at c030a845 torvalds#15 [d72d3ed4] do_huge_pmd_anonymous_page at c03178eb torvalds#16 [d72d3f00] handle_mm_fault at c02f36c6 torvalds#17 [d72d3f30] do_page_fault at c05c70ed torvalds#18 [d72d3fb0] error_code (via page_fault) at c05c47a4 EAX: b71ff000 EBX: 00000001 ECX: 00001600 EDX: 00000431 DS: 007b ESI: 08048950 ES: 007b EDI: bfaa3788 SS: 007b ESP: bfaa36e0 EBP: bfaa3828 GS: 6f50 CS: 0073 EIP: 080487c8 ERR: ffffffff EFLAGS: 00010202 It was also reported by Herbert van den Bergh against 3.1-based kernel with the following snippet from the console log. BUG: unable to handle kernel paging request at 01c00008 IP: [<c0522399>] isolate_migratepages+0x119/0x390 *pdpt = 000000002f7ce001 *pde = 0000000000000000 It is expected that it also affects 3.2.x and current mainline. The problem is that pfn_valid is only called on the first PFN being checked and that PFN is not necessarily aligned. Lets say we have a case like this H = MAX_ORDER_NR_PAGES boundary | = pageblock boundary m = cc->migrate_pfn f = cc->free_pfn o = memory hole H------|------H------|----m-Hoooooo|ooooooH-f----|------H The migrate_pfn is just below a memory hole and the free scanner is beyond the hole. When isolate_migratepages started, it scans from migrate_pfn to migrate_pfn+pageblock_nr_pages which is now in a memory hole. It checks pfn_valid() on the first PFN but then scans into the hole where there are not necessarily valid struct pages. This patch ensures that isolate_migratepages calls pfn_valid when necessary. Reported-by: Herbert van den Bergh <herbert.van.den.bergh@oracle.com> Tested-by: Herbert van den Bergh <herbert.van.den.bergh@oracle.com> Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Michal Nazarewicz <mina86@mina86.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
BugLink: http://bugs.launchpad.net/bugs/907778 commit ea51d13 upstream. If the pte mapping in generic_perform_write() is unmapped between iov_iter_fault_in_readable() and iov_iter_copy_from_user_atomic(), the "copied" parameter to ->end_write can be zero. ext4 couldn't cope with it with delayed allocations enabled. This skips the i_disksize enlargement logic if copied is zero and no new data was appeneded to the inode. gdb> bt #0 0xffffffff811afe80 in ext4_da_should_update_i_disksize (file=0xffff88003f606a80, mapping=0xffff88001d3824e0, pos=0x1\ 08000, len=0x1000, copied=0x0, page=0xffffea0000d792e8, fsdata=0x0) at fs/ext4/inode.c:2467 #1 ext4_da_write_end (file=0xffff88003f606a80, mapping=0xffff88001d3824e0, pos=0x108000, len=0x1000, copied=0x0, page=0\ xffffea0000d792e8, fsdata=0x0) at fs/ext4/inode.c:2512 #2 0xffffffff810d97f1 in generic_perform_write (iocb=<value optimized out>, iov=<value optimized out>, nr_segs=<value o\ ptimized out>, pos=0x108000, ppos=0xffff88001e26be40, count=<value optimized out>, written=0x0) at mm/filemap.c:2440 #3 generic_file_buffered_write (iocb=<value optimized out>, iov=<value optimized out>, nr_segs=<value optimized out>, p\ os=0x108000, ppos=0xffff88001e26be40, count=<value optimized out>, written=0x0) at mm/filemap.c:2482 #4 0xffffffff810db5d1 in __generic_file_aio_write (iocb=0xffff88001e26bde8, iov=0xffff88001e26bec8, nr_segs=0x1, ppos=0\ xffff88001e26be40) at mm/filemap.c:2600 #5 0xffffffff810db853 in generic_file_aio_write (iocb=0xffff88001e26bde8, iov=0xffff88001e26bec8, nr_segs=<value optimi\ zed out>, pos=<value optimized out>) at mm/filemap.c:2632 torvalds#6 0xffffffff811a71aa in ext4_file_write (iocb=0xffff88001e26bde8, iov=0xffff88001e26bec8, nr_segs=0x1, pos=0x108000) a\ t fs/ext4/file.c:136 torvalds#7 0xffffffff811375aa in do_sync_write (filp=0xffff88003f606a80, buf=<value optimized out>, len=<value optimized out>, \ ppos=0xffff88001e26bf48) at fs/read_write.c:406 torvalds#8 0xffffffff81137e56 in vfs_write (file=0xffff88003f606a80, buf=0x1ec2960 <Address 0x1ec2960 out of bounds>, count=0x4\ 000, pos=0xffff88001e26bf48) at fs/read_write.c:435 torvalds#9 0xffffffff8113816c in sys_write (fd=<value optimized out>, buf=0x1ec2960 <Address 0x1ec2960 out of bounds>, count=0x\ 4000) at fs/read_write.c:487 torvalds#10 <signal handler called> torvalds#11 0x00007f120077a390 in __brk_reservation_fn_dmi_alloc__ () torvalds#12 0x0000000000000000 in ?? () gdb> print offset $22 = 0xffffffffffffffff gdb> print idx $23 = 0xffffffff gdb> print inode->i_blkbits $24 = 0xc gdb> up #1 ext4_da_write_end (file=0xffff88003f606a80, mapping=0xffff88001d3824e0, pos=0x108000, len=0x1000, copied=0x0, page=0\ xffffea0000d792e8, fsdata=0x0) at fs/ext4/inode.c:2512 2512 if (ext4_da_should_update_i_disksize(page, end)) { gdb> print start $25 = 0x0 gdb> print end $26 = 0xffffffffffffffff gdb> print pos $27 = 0x108000 gdb> print new_i_size $28 = 0x108000 gdb> print ((struct ext4_inode_info *)((char *)inode-((int)(&((struct ext4_inode_info *)0)->vfs_inode))))->i_disksize $29 = 0xd9000 gdb> down 2467 for (i = 0; i < idx; i++) gdb> print i $30 = 0xd44acbee This is 100% reproducible with some autonuma development code tuned in a very aggressive manner (not normal way even for knumad) which does "exotic" changes to the ptes. It wouldn't normally trigger but I don't see why it can't happen normally if the page is added to swap cache in between the two faults leading to "copied" being zero (which then hangs in ext4). So it should be fixed. Especially possible with lumpy reclaim (albeit disabled if compaction is enabled) as that would ignore the young bits in the ptes. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de> Signed-off-by: Tim Gardner <tim.gardner@canonical.com> Signed-off-by: Brad Figg <brad.figg@canonical.com>
…S block during isolation for migration BugLink: http://bugs.launchpad.net/bugs/931719 commit 0bf380b upstream. When isolating for migration, migration starts at the start of a zone which is not necessarily pageblock aligned. Further, it stops isolating when COMPACT_CLUSTER_MAX pages are isolated so migrate_pfn is generally not aligned. This allows isolate_migratepages() to call pfn_to_page() on an invalid PFN which can result in a crash. This was originally reported against a 3.0-based kernel with the following trace in a crash dump. PID: 9902 TASK: d47aecd0 CPU: 0 COMMAND: "memcg_process_s" #0 [d72d3ad0] crash_kexec at c028cfdb #1 [d72d3b24] oops_end at c05c5322 #2 [d72d3b38] __bad_area_nosemaphore at c0227e60 #3 [d72d3bec] bad_area at c0227fb6 #4 [d72d3c00] do_page_fault at c05c72ec #5 [d72d3c80] error_code (via page_fault) at c05c47a4 EAX: 00000000 EBX: 000c0000 ECX: 00000001 EDX: 00000807 EBP: 000c0000 DS: 007b ESI: 00000001 ES: 007b EDI: f3000a80 GS: 6f50 CS: 0060 EIP: c030b15a ERR: ffffffff EFLAGS: 00010002 torvalds#6 [d72d3cb4] isolate_migratepages at c030b15a torvalds#7 [d72d3d1] zone_watermark_ok at c02d26cb torvalds#8 [d72d3d2c] compact_zone at c030b8de torvalds#9 [d72d3d68] compact_zone_order at c030bba1 torvalds#10 [d72d3db4] try_to_compact_pages at c030bc84 torvalds#11 [d72d3ddc] __alloc_pages_direct_compact at c02d61e7 torvalds#12 [d72d3e08] __alloc_pages_slowpath at c02d66c7 torvalds#13 [d72d3e78] __alloc_pages_nodemask at c02d6a97 torvalds#14 [d72d3eb8] alloc_pages_vma at c030a845 torvalds#15 [d72d3ed4] do_huge_pmd_anonymous_page at c03178eb torvalds#16 [d72d3f00] handle_mm_fault at c02f36c6 torvalds#17 [d72d3f30] do_page_fault at c05c70ed torvalds#18 [d72d3fb0] error_code (via page_fault) at c05c47a4 EAX: b71ff000 EBX: 00000001 ECX: 00001600 EDX: 00000431 DS: 007b ESI: 08048950 ES: 007b EDI: bfaa3788 SS: 007b ESP: bfaa36e0 EBP: bfaa3828 GS: 6f50 CS: 0073 EIP: 080487c8 ERR: ffffffff EFLAGS: 00010202 It was also reported by Herbert van den Bergh against 3.1-based kernel with the following snippet from the console log. BUG: unable to handle kernel paging request at 01c00008 IP: [<c0522399>] isolate_migratepages+0x119/0x390 *pdpt = 000000002f7ce001 *pde = 0000000000000000 It is expected that it also affects 3.2.x and current mainline. The problem is that pfn_valid is only called on the first PFN being checked and that PFN is not necessarily aligned. Lets say we have a case like this H = MAX_ORDER_NR_PAGES boundary | = pageblock boundary m = cc->migrate_pfn f = cc->free_pfn o = memory hole H------|------H------|----m-Hoooooo|ooooooH-f----|------H The migrate_pfn is just below a memory hole and the free scanner is beyond the hole. When isolate_migratepages started, it scans from migrate_pfn to migrate_pfn+pageblock_nr_pages which is now in a memory hole. It checks pfn_valid() on the first PFN but then scans into the hole where there are not necessarily valid struct pages. This patch ensures that isolate_migratepages calls pfn_valid when necessary. Reported-by: Herbert van den Bergh <herbert.van.den.bergh@oracle.com> Tested-by: Herbert van den Bergh <herbert.van.den.bergh@oracle.com> Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Michal Nazarewicz <mina86@mina86.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
…S block during isolation for migration commit 0bf380b upstream. When isolating for migration, migration starts at the start of a zone which is not necessarily pageblock aligned. Further, it stops isolating when COMPACT_CLUSTER_MAX pages are isolated so migrate_pfn is generally not aligned. This allows isolate_migratepages() to call pfn_to_page() on an invalid PFN which can result in a crash. This was originally reported against a 3.0-based kernel with the following trace in a crash dump. PID: 9902 TASK: d47aecd0 CPU: 0 COMMAND: "memcg_process_s" #0 [d72d3ad0] crash_kexec at c028cfdb #1 [d72d3b24] oops_end at c05c5322 #2 [d72d3b38] __bad_area_nosemaphore at c0227e60 #3 [d72d3bec] bad_area at c0227fb6 #4 [d72d3c00] do_page_fault at c05c72ec #5 [d72d3c80] error_code (via page_fault) at c05c47a4 EAX: 00000000 EBX: 000c0000 ECX: 00000001 EDX: 00000807 EBP: 000c0000 DS: 007b ESI: 00000001 ES: 007b EDI: f3000a80 GS: 6f50 CS: 0060 EIP: c030b15a ERR: ffffffff EFLAGS: 00010002 #6 [d72d3cb4] isolate_migratepages at c030b15a #7 [d72d3d1] zone_watermark_ok at c02d26cb #8 [d72d3d2c] compact_zone at c030b8de #9 [d72d3d68] compact_zone_order at c030bba1 torvalds#10 [d72d3db4] try_to_compact_pages at c030bc84 torvalds#11 [d72d3ddc] __alloc_pages_direct_compact at c02d61e7 torvalds#12 [d72d3e08] __alloc_pages_slowpath at c02d66c7 torvalds#13 [d72d3e78] __alloc_pages_nodemask at c02d6a97 torvalds#14 [d72d3eb8] alloc_pages_vma at c030a845 torvalds#15 [d72d3ed4] do_huge_pmd_anonymous_page at c03178eb torvalds#16 [d72d3f00] handle_mm_fault at c02f36c6 torvalds#17 [d72d3f30] do_page_fault at c05c70ed torvalds#18 [d72d3fb0] error_code (via page_fault) at c05c47a4 EAX: b71ff000 EBX: 00000001 ECX: 00001600 EDX: 00000431 DS: 007b ESI: 08048950 ES: 007b EDI: bfaa3788 SS: 007b ESP: bfaa36e0 EBP: bfaa3828 GS: 6f50 CS: 0073 EIP: 080487c8 ERR: ffffffff EFLAGS: 00010202 It was also reported by Herbert van den Bergh against 3.1-based kernel with the following snippet from the console log. BUG: unable to handle kernel paging request at 01c00008 IP: [<c0522399>] isolate_migratepages+0x119/0x390 *pdpt = 000000002f7ce001 *pde = 0000000000000000 It is expected that it also affects 3.2.x and current mainline. The problem is that pfn_valid is only called on the first PFN being checked and that PFN is not necessarily aligned. Lets say we have a case like this H = MAX_ORDER_NR_PAGES boundary | = pageblock boundary m = cc->migrate_pfn f = cc->free_pfn o = memory hole H------|------H------|----m-Hoooooo|ooooooH-f----|------H The migrate_pfn is just below a memory hole and the free scanner is beyond the hole. When isolate_migratepages started, it scans from migrate_pfn to migrate_pfn+pageblock_nr_pages which is now in a memory hole. It checks pfn_valid() on the first PFN but then scans into the hole where there are not necessarily valid struct pages. This patch ensures that isolate_migratepages calls pfn_valid when necessary. Reported-by: Herbert van den Bergh <herbert.van.den.bergh@oracle.com> Tested-by: Herbert van den Bergh <herbert.van.den.bergh@oracle.com> Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Michal Nazarewicz <mina86@mina86.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
…S block during isolation for migration commit 0bf380b upstream. When isolating for migration, migration starts at the start of a zone which is not necessarily pageblock aligned. Further, it stops isolating when COMPACT_CLUSTER_MAX pages are isolated so migrate_pfn is generally not aligned. This allows isolate_migratepages() to call pfn_to_page() on an invalid PFN which can result in a crash. This was originally reported against a 3.0-based kernel with the following trace in a crash dump. PID: 9902 TASK: d47aecd0 CPU: 0 COMMAND: "memcg_process_s" #0 [d72d3ad0] crash_kexec at c028cfdb #1 [d72d3b24] oops_end at c05c5322 #2 [d72d3b38] __bad_area_nosemaphore at c0227e60 #3 [d72d3bec] bad_area at c0227fb6 #4 [d72d3c00] do_page_fault at c05c72ec #5 [d72d3c80] error_code (via page_fault) at c05c47a4 EAX: 00000000 EBX: 000c0000 ECX: 00000001 EDX: 00000807 EBP: 000c0000 DS: 007b ESI: 00000001 ES: 007b EDI: f3000a80 GS: 6f50 CS: 0060 EIP: c030b15a ERR: ffffffff EFLAGS: 00010002 #6 [d72d3cb4] isolate_migratepages at c030b15a #7 [d72d3d1] zone_watermark_ok at c02d26cb #8 [d72d3d2c] compact_zone at c030b8de #9 [d72d3d68] compact_zone_order at c030bba1 torvalds#10 [d72d3db4] try_to_compact_pages at c030bc84 torvalds#11 [d72d3ddc] __alloc_pages_direct_compact at c02d61e7 torvalds#12 [d72d3e08] __alloc_pages_slowpath at c02d66c7 torvalds#13 [d72d3e78] __alloc_pages_nodemask at c02d6a97 torvalds#14 [d72d3eb8] alloc_pages_vma at c030a845 torvalds#15 [d72d3ed4] do_huge_pmd_anonymous_page at c03178eb torvalds#16 [d72d3f00] handle_mm_fault at c02f36c6 torvalds#17 [d72d3f30] do_page_fault at c05c70ed torvalds#18 [d72d3fb0] error_code (via page_fault) at c05c47a4 EAX: b71ff000 EBX: 00000001 ECX: 00001600 EDX: 00000431 DS: 007b ESI: 08048950 ES: 007b EDI: bfaa3788 SS: 007b ESP: bfaa36e0 EBP: bfaa3828 GS: 6f50 CS: 0073 EIP: 080487c8 ERR: ffffffff EFLAGS: 00010202 It was also reported by Herbert van den Bergh against 3.1-based kernel with the following snippet from the console log. BUG: unable to handle kernel paging request at 01c00008 IP: [<c0522399>] isolate_migratepages+0x119/0x390 *pdpt = 000000002f7ce001 *pde = 0000000000000000 It is expected that it also affects 3.2.x and current mainline. The problem is that pfn_valid is only called on the first PFN being checked and that PFN is not necessarily aligned. Lets say we have a case like this H = MAX_ORDER_NR_PAGES boundary | = pageblock boundary m = cc->migrate_pfn f = cc->free_pfn o = memory hole H------|------H------|----m-Hoooooo|ooooooH-f----|------H The migrate_pfn is just below a memory hole and the free scanner is beyond the hole. When isolate_migratepages started, it scans from migrate_pfn to migrate_pfn+pageblock_nr_pages which is now in a memory hole. It checks pfn_valid() on the first PFN but then scans into the hole where there are not necessarily valid struct pages. This patch ensures that isolate_migratepages calls pfn_valid when necessary. Reported-by: Herbert van den Bergh <herbert.van.den.bergh@oracle.com> Tested-by: Herbert van den Bergh <herbert.van.den.bergh@oracle.com> Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Michal Nazarewicz <mina86@mina86.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
…S block during isolation for migration commit 0bf380b upstream. When isolating for migration, migration starts at the start of a zone which is not necessarily pageblock aligned. Further, it stops isolating when COMPACT_CLUSTER_MAX pages are isolated so migrate_pfn is generally not aligned. This allows isolate_migratepages() to call pfn_to_page() on an invalid PFN which can result in a crash. This was originally reported against a 3.0-based kernel with the following trace in a crash dump. PID: 9902 TASK: d47aecd0 CPU: 0 COMMAND: "memcg_process_s" #0 [d72d3ad0] crash_kexec at c028cfdb #1 [d72d3b24] oops_end at c05c5322 #2 [d72d3b38] __bad_area_nosemaphore at c0227e60 #3 [d72d3bec] bad_area at c0227fb6 #4 [d72d3c00] do_page_fault at c05c72ec #5 [d72d3c80] error_code (via page_fault) at c05c47a4 EAX: 00000000 EBX: 000c0000 ECX: 00000001 EDX: 00000807 EBP: 000c0000 DS: 007b ESI: 00000001 ES: 007b EDI: f3000a80 GS: 6f50 CS: 0060 EIP: c030b15a ERR: ffffffff EFLAGS: 00010002 #6 [d72d3cb4] isolate_migratepages at c030b15a #7 [d72d3d1] zone_watermark_ok at c02d26cb #8 [d72d3d2c] compact_zone at c030b8de #9 [d72d3d68] compact_zone_order at c030bba1 torvalds#10 [d72d3db4] try_to_compact_pages at c030bc84 torvalds#11 [d72d3ddc] __alloc_pages_direct_compact at c02d61e7 torvalds#12 [d72d3e08] __alloc_pages_slowpath at c02d66c7 torvalds#13 [d72d3e78] __alloc_pages_nodemask at c02d6a97 torvalds#14 [d72d3eb8] alloc_pages_vma at c030a845 torvalds#15 [d72d3ed4] do_huge_pmd_anonymous_page at c03178eb torvalds#16 [d72d3f00] handle_mm_fault at c02f36c6 torvalds#17 [d72d3f30] do_page_fault at c05c70ed torvalds#18 [d72d3fb0] error_code (via page_fault) at c05c47a4 EAX: b71ff000 EBX: 00000001 ECX: 00001600 EDX: 00000431 DS: 007b ESI: 08048950 ES: 007b EDI: bfaa3788 SS: 007b ESP: bfaa36e0 EBP: bfaa3828 GS: 6f50 CS: 0073 EIP: 080487c8 ERR: ffffffff EFLAGS: 00010202 It was also reported by Herbert van den Bergh against 3.1-based kernel with the following snippet from the console log. BUG: unable to handle kernel paging request at 01c00008 IP: [<c0522399>] isolate_migratepages+0x119/0x390 *pdpt = 000000002f7ce001 *pde = 0000000000000000 It is expected that it also affects 3.2.x and current mainline. The problem is that pfn_valid is only called on the first PFN being checked and that PFN is not necessarily aligned. Lets say we have a case like this H = MAX_ORDER_NR_PAGES boundary | = pageblock boundary m = cc->migrate_pfn f = cc->free_pfn o = memory hole H------|------H------|----m-Hoooooo|ooooooH-f----|------H The migrate_pfn is just below a memory hole and the free scanner is beyond the hole. When isolate_migratepages started, it scans from migrate_pfn to migrate_pfn+pageblock_nr_pages which is now in a memory hole. It checks pfn_valid() on the first PFN but then scans into the hole where there are not necessarily valid struct pages. This patch ensures that isolate_migratepages calls pfn_valid when necessary. Reported-by: Herbert van den Bergh <herbert.van.den.bergh@oracle.com> Tested-by: Herbert van den Bergh <herbert.van.den.bergh@oracle.com> Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Michal Nazarewicz <mina86@mina86.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
…S block during isolation for migration commit 0bf380b upstream. When isolating for migration, migration starts at the start of a zone which is not necessarily pageblock aligned. Further, it stops isolating when COMPACT_CLUSTER_MAX pages are isolated so migrate_pfn is generally not aligned. This allows isolate_migratepages() to call pfn_to_page() on an invalid PFN which can result in a crash. This was originally reported against a 3.0-based kernel with the following trace in a crash dump. PID: 9902 TASK: d47aecd0 CPU: 0 COMMAND: "memcg_process_s" #0 [d72d3ad0] crash_kexec at c028cfdb #1 [d72d3b24] oops_end at c05c5322 #2 [d72d3b38] __bad_area_nosemaphore at c0227e60 #3 [d72d3bec] bad_area at c0227fb6 #4 [d72d3c00] do_page_fault at c05c72ec #5 [d72d3c80] error_code (via page_fault) at c05c47a4 EAX: 00000000 EBX: 000c0000 ECX: 00000001 EDX: 00000807 EBP: 000c0000 DS: 007b ESI: 00000001 ES: 007b EDI: f3000a80 GS: 6f50 CS: 0060 EIP: c030b15a ERR: ffffffff EFLAGS: 00010002 #6 [d72d3cb4] isolate_migratepages at c030b15a #7 [d72d3d1] zone_watermark_ok at c02d26cb #8 [d72d3d2c] compact_zone at c030b8de #9 [d72d3d68] compact_zone_order at c030bba1 torvalds#10 [d72d3db4] try_to_compact_pages at c030bc84 torvalds#11 [d72d3ddc] __alloc_pages_direct_compact at c02d61e7 torvalds#12 [d72d3e08] __alloc_pages_slowpath at c02d66c7 torvalds#13 [d72d3e78] __alloc_pages_nodemask at c02d6a97 torvalds#14 [d72d3eb8] alloc_pages_vma at c030a845 torvalds#15 [d72d3ed4] do_huge_pmd_anonymous_page at c03178eb torvalds#16 [d72d3f00] handle_mm_fault at c02f36c6 torvalds#17 [d72d3f30] do_page_fault at c05c70ed torvalds#18 [d72d3fb0] error_code (via page_fault) at c05c47a4 EAX: b71ff000 EBX: 00000001 ECX: 00001600 EDX: 00000431 DS: 007b ESI: 08048950 ES: 007b EDI: bfaa3788 SS: 007b ESP: bfaa36e0 EBP: bfaa3828 GS: 6f50 CS: 0073 EIP: 080487c8 ERR: ffffffff EFLAGS: 00010202 It was also reported by Herbert van den Bergh against 3.1-based kernel with the following snippet from the console log. BUG: unable to handle kernel paging request at 01c00008 IP: [<c0522399>] isolate_migratepages+0x119/0x390 *pdpt = 000000002f7ce001 *pde = 0000000000000000 It is expected that it also affects 3.2.x and current mainline. The problem is that pfn_valid is only called on the first PFN being checked and that PFN is not necessarily aligned. Lets say we have a case like this H = MAX_ORDER_NR_PAGES boundary | = pageblock boundary m = cc->migrate_pfn f = cc->free_pfn o = memory hole H------|------H------|----m-Hoooooo|ooooooH-f----|------H The migrate_pfn is just below a memory hole and the free scanner is beyond the hole. When isolate_migratepages started, it scans from migrate_pfn to migrate_pfn+pageblock_nr_pages which is now in a memory hole. It checks pfn_valid() on the first PFN but then scans into the hole where there are not necessarily valid struct pages. This patch ensures that isolate_migratepages calls pfn_valid when necessary. Reported-by: Herbert van den Bergh <herbert.van.den.bergh@oracle.com> Tested-by: Herbert van den Bergh <herbert.van.den.bergh@oracle.com> Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Michal Nazarewicz <mina86@mina86.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Ian told me that there are many memory leaks in the hierarchy mode. I can easily reproduce it with the follwing command. $ make DEBUG=1 EXTRA_CFLAGS=-fsanitize=leak $ perf record --latency -g -- ./perf test -w thloop $ perf report -H --stdio ... Indirect leak of 168 byte(s) in 21 object(s) allocated from: #0 0x7f3414c16c65 in malloc ../../../../src/libsanitizer/lsan/lsan_interceptors.cpp:75 #1 0x55ed3602346e in map__get util/map.h:189 #2 0x55ed36024cc4 in hist_entry__init util/hist.c:476 #3 0x55ed36025208 in hist_entry__new util/hist.c:588 #4 0x55ed36027c05 in hierarchy_insert_entry util/hist.c:1587 #5 0x55ed36027e2e in hists__hierarchy_insert_entry util/hist.c:1638 torvalds#6 0x55ed36027fa4 in hists__collapse_insert_entry util/hist.c:1685 torvalds#7 0x55ed360283e8 in hists__collapse_resort util/hist.c:1776 torvalds#8 0x55ed35de0323 in report__collapse_hists /home/namhyung/project/linux/tools/perf/builtin-report.c:735 torvalds#9 0x55ed35de15b4 in __cmd_report /home/namhyung/project/linux/tools/perf/builtin-report.c:1119 torvalds#10 0x55ed35de43dc in cmd_report /home/namhyung/project/linux/tools/perf/builtin-report.c:1867 torvalds#11 0x55ed35e66767 in run_builtin /home/namhyung/project/linux/tools/perf/perf.c:351 torvalds#12 0x55ed35e66a0e in handle_internal_command /home/namhyung/project/linux/tools/perf/perf.c:404 torvalds#13 0x55ed35e66b67 in run_argv /home/namhyung/project/linux/tools/perf/perf.c:448 torvalds#14 0x55ed35e66eb0 in main /home/namhyung/project/linux/tools/perf/perf.c:556 torvalds#15 0x7f340ac33d67 in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58 ... $ perf report -H --stdio 2>&1 | grep -c '^Indirect leak' 93 I found that hist_entry__delete() missed to release child entries in the hierarchy tree (hroot_{in,out}). It needs to iterate the child entries and call hist_entry__delete() recursively. After this change: $ perf report -H --stdio 2>&1 | grep -c '^Indirect leak' 0 Reported-by: Ian Rogers <irogers@google.com> Signed-off-by: Namhyung Kim <namhyung@kernel.org>
…ge_order() Patch series "mm: MM owner tracking for large folios (!hugetlb) + CONFIG_NO_PAGE_MAPCOUNT", v3. Let's add an "easy" way to decide -- without false positives, without page-mapcounts and without page table/rmap scanning -- whether a large folio is "certainly mapped exclusively" into a single MM, or whether it "maybe mapped shared" into multiple MMs. Use that information to implement Copy-on-Write reuse, to convert folio_likely_mapped_shared() to folio_maybe_mapped_share(), and to introduce a kernel config option that lets us not use+maintain per-page mapcounts in large folios anymore. The bigger picture was presented at LSF/MM [1]. This series is effectively a follow-up on my early work [2], which implemented a more precise, but also more complicated, way to identify whether a large folio is "mapped shared" into multiple MMs or "mapped exclusively" into a single MM. 1 Patch Organization ==================== Patch #1 -> torvalds#6: make more room in order-1 folios, so we have two "unsigned long" available for our purposes Patch torvalds#7 -> torvalds#11: preparations Patch torvalds#12: MM owner tracking for large folios Patch torvalds#13: COW reuse for PTE-mapped anon THP Patch torvalds#14: folio_maybe_mapped_shared() Patch torvalds#15 -> torvalds#20: introduce and implement CONFIG_NO_PAGE_MAPCOUNT 2 MM owner tracking =================== We assign each MM a unique ID ("MM ID"), to be able to squeeze more information in our folios. On 32bit we use 15-bit IDs, on 64bit we use 31-bit IDs. For each large folios, we now store two MM-ID+mapcount ("slot") combinations: * mm0_id + mm0_mapcount * mm1_id + mm1_mapcount On 32bit, we use a 16-bit per-MM mapcount, on 64bit an ordinary 32bit mapcount. This way, we require 2x "unsigned long" on 32bit and 64bit for both slots. Paired with the large mapcount, we can reliably identify whether one of these MMs is the current owner (-> owns all mappings) or even holds all folio references (-> owns all mappings, and all references are from mappings). As long as only two MMs map folio pages at a time, we can reliably and precisely identify whether a large folio is "mapped shared" or "mapped exclusively". Any additional MM that starts mapping the folio while there are no free slots becomes an "untracked MM". If one such "untracked MM" is the last one mapping a folio exclusively, we will not detect the folio as "mapped exclusively" but instead as "maybe mapped shared". (exception: only a single mapping remains) So that's where the approach gets imprecise. For now, we use a bit-spinlock to sync the large mapcount + slots, and make sure we do keep the machinery fast, to not degrade (un)map performance drastically: for example, we make sure to only use a single atomic (when grabbing the bit-spinlock), like we would already perform when updating the large mapcount. 3 CONFIG_NO_PAGE_MAPCOUNT ========================= patch torvalds#15 -> torvalds#20 spell out and document what exactly is affected when not maintaining the per-page mapcounts in large folios anymore. Most importantly, as we cannot maintain folio->_nr_pages_mapped anymore when (un)mapping pages, we'll account a complete folio as mapped if a single page is mapped. In addition, we'll not detect partially mapped anonymous folios as such in all cases yet. Likely less relevant changes include that we might now under-estimate the USS (Unique Set Size) of a process, but never over-estimate it. The goal is to make CONFIG_NO_PAGE_MAPCOUNT the default at some point, to then slowly make it the only option, as we learn about real-life impacts and possible ways to mitigate them. 4 Performance ============= Detailed performance numbers were included in v1 [3], and not that much changed between v1 and v2. I did plenty of measurements on different systems in the meantime, that all revealed slightly different results. The pte-mapped-folio micro-benchmarks [4] are fairly sensitive to code layout changes on some systems. Especially the fork() benchmark started being more-shaky-than-before on recent kernels for some reason. In summary, with my micro-benchmarks: * Small folios are not impacted. * CoW performance seems to be mostly unchanged across all folios sizes. * CoW reuse performance of large folios now matches CoW reuse performance of small folios, because we now actually implement the CoW reuse optimization. On an Intel Xeon Silver 4210R I measured a ~65% reduction in runtime, on an arm64 system I measured ~54% reduction. * munmap() performance improves with CONFIG_NO_PAGE_MAPCOUNT. I saw double-digit % reduction (up to ~30% on an Intel Xeon Silver 4210R and up to ~70% on an AmpereOne A192-32X) with larger folios. The larger the folios, the larger the performance improvement. * munmao() performance very slightly (couple percent) degrades without CONFIG_NO_PAGE_MAPCOUNT for smaller folios. For larger folios, there seems to be no change at all. * fork() performance improves with CONFIG_NO_PAGE_MAPCOUNT. I saw double-digit % reduction (up to ~20% on an Intel Xeon Silver 4210R and up to ~10% on an AmpereOne A192-32X) with larger folios. The larger the folios, the larger the performance improvement. * While fork() performance without CONFIG_NO_PAGE_MAPCOUNT seems to be almost unchanged on some systems, I saw some degradation for smaller folios on the AmpereOne A192-32X. I did not investigate the details yet, but I suspect code layout changes or suboptimal code placement / inlining. I'm not to worried about the fork() micro-benchmarks for smaller folios given how shaky the results are lately and by how much we improved fork() performance recently. I also ran case-anon-cow-rand and case-anon-cow-seq part of vm-scalability, to assess the scalability and the impact of the bit-spinlock. My measurements on a two 2-socket 10-core Intel Xeon Silver 4210R CPU revealed no significant changes. Similarly, running these benchmarks with 2 MiB THPs enabled on the AmpereOne A192-32X with 192 cores, I got < 1% difference with < 1% stdev, which is nice. So far, I did not get my hands on a similarly large system with multiple sockets. I found no other fitting scalability benchmarks that seem to really hammer on concurrent mapping/unmapping of large folio pages like case-anon-cow-seq does. 5 Concerns ========== 5.1 Bit spinlock ---------------- I'm not quite happy about the bit-spinlock, but so far it does not seem to affect scalability in my measurements. If it ever becomes a problem we could either investigate improving the locking, or simply stopping the MM tracking once there are "too many mappings" and simply assume that the folio is "mapped shared" until it was freed. This would be similar (but slightly different) to the "0,1,2,stopped" counting idea Willy had at some point. Adding that logic to "stop tracking" adds more code to the hot path, so I avoided that for now. 5.2 folio_maybe_mapped_shared() ------------------------------- I documented the change from folio_likely_mapped_shared() to folio_maybe_mapped_shared() quite extensively. If we run into surprises, I have some ideas on how to resolve them. For now, I think we should be fine. 5.3 Added code to map/unmap hot path ------------------------------------ So far, it looks like the added code on the rmap hot path does not really seem to matter much in the bigger picture. I'd like to further reduce it (and possibly improve fork() performance further), but I don't easily see how right now. Well, and I am out of puff 🙂 Having that said, alternatives I considered (e.g., per-MM per-folio mapcount) would add a lot more overhead to these hot paths. 6 Future Work ============= 6.1 Large mapcount ------------------ It would be very handy if the large mapcount would count how often folio pages are actually mapped into page tables: a PMD on x86-64 would count 512 times. Calculating the average per-page mapcount will be easy, and remapping (PMD->PTE) folios would get even faster. That would also remove the need for the entire mapcount (except for PMD-sized folios for memory statistics reasons ...), and allow for mapping folios larger than PMDs (e.g., 4 MiB) easily. We likely would also have to take the same number of folio references to make our folio_mapcount() == folio_ref_count() work, and we'd want to be able to avoid mapcount+refcount overflows: this could already become an issue with pte-mapped PUD-sized folios (fsdax). One approach we discussed in the THP cabal meeting is (1) extending the mapcount for large folios to 64bit (at least on 64bit systems) and (2) keeping the refcount at 32bit, but (3) having exactly one reference if the the mapcount != 0. It should be doable, but there are some corner cases to consider on the unmap path; it is something that I will be looking into next. 6.2 hugetlb ----------- I'd love to make use of the same tracking also for hugetlb. The real problem is PMD table sharing: getting a page mapped by MM X and unmapped by MM Y will not work. With mshare, that problem should not exist (all mapping/unmapping will be routed through the mshare MM). [1] https://lwn.net/Articles/974223/ [2] https://lore.kernel.org/linux-mm/a9922f58-8129-4f15-b160-e0ace581bcbe@redhat.com/T/ [3] https://lkml.kernel.org/r/20240829165627.2256514-1-david@redhat.com [4] https://gitlab.com/davidhildenbrand/scratchspace/-/raw/main/pte-mapped-folio-benchmarks.c This patch (of 20): Let's factor it out into a simple helper function. This helper will also come in handy when working with code where we know that our folio is large. Maybe in the future we'll have the order readily available for small and large folios; in that case, folio_large_order() would simply translate to folio_order(). Link: https://lkml.kernel.org/r/20250303163014.1128035-1-david@redhat.com Link: https://lkml.kernel.org/r/20250303163014.1128035-2-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Lance Yang <ioworker0@gmail.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Andy Lutomirks^H^Hski <luto@kernel.org> Cc: Borislav Betkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcow (Oracle) <willy@infradead.org> Cc: Michal Koutn <mkoutny@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: tejun heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zefan Li <lizefan.x@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
…ge_order() Patch series "mm: MM owner tracking for large folios (!hugetlb) + CONFIG_NO_PAGE_MAPCOUNT", v3. Let's add an "easy" way to decide -- without false positives, without page-mapcounts and without page table/rmap scanning -- whether a large folio is "certainly mapped exclusively" into a single MM, or whether it "maybe mapped shared" into multiple MMs. Use that information to implement Copy-on-Write reuse, to convert folio_likely_mapped_shared() to folio_maybe_mapped_share(), and to introduce a kernel config option that lets us not use+maintain per-page mapcounts in large folios anymore. The bigger picture was presented at LSF/MM [1]. This series is effectively a follow-up on my early work [2], which implemented a more precise, but also more complicated, way to identify whether a large folio is "mapped shared" into multiple MMs or "mapped exclusively" into a single MM. 1 Patch Organization ==================== Patch #1 -> torvalds#6: make more room in order-1 folios, so we have two "unsigned long" available for our purposes Patch torvalds#7 -> torvalds#11: preparations Patch torvalds#12: MM owner tracking for large folios Patch torvalds#13: COW reuse for PTE-mapped anon THP Patch torvalds#14: folio_maybe_mapped_shared() Patch torvalds#15 -> torvalds#20: introduce and implement CONFIG_NO_PAGE_MAPCOUNT 2 MM owner tracking =================== We assign each MM a unique ID ("MM ID"), to be able to squeeze more information in our folios. On 32bit we use 15-bit IDs, on 64bit we use 31-bit IDs. For each large folios, we now store two MM-ID+mapcount ("slot") combinations: * mm0_id + mm0_mapcount * mm1_id + mm1_mapcount On 32bit, we use a 16-bit per-MM mapcount, on 64bit an ordinary 32bit mapcount. This way, we require 2x "unsigned long" on 32bit and 64bit for both slots. Paired with the large mapcount, we can reliably identify whether one of these MMs is the current owner (-> owns all mappings) or even holds all folio references (-> owns all mappings, and all references are from mappings). As long as only two MMs map folio pages at a time, we can reliably and precisely identify whether a large folio is "mapped shared" or "mapped exclusively". Any additional MM that starts mapping the folio while there are no free slots becomes an "untracked MM". If one such "untracked MM" is the last one mapping a folio exclusively, we will not detect the folio as "mapped exclusively" but instead as "maybe mapped shared". (exception: only a single mapping remains) So that's where the approach gets imprecise. For now, we use a bit-spinlock to sync the large mapcount + slots, and make sure we do keep the machinery fast, to not degrade (un)map performance drastically: for example, we make sure to only use a single atomic (when grabbing the bit-spinlock), like we would already perform when updating the large mapcount. 3 CONFIG_NO_PAGE_MAPCOUNT ========================= patch torvalds#15 -> torvalds#20 spell out and document what exactly is affected when not maintaining the per-page mapcounts in large folios anymore. Most importantly, as we cannot maintain folio->_nr_pages_mapped anymore when (un)mapping pages, we'll account a complete folio as mapped if a single page is mapped. In addition, we'll not detect partially mapped anonymous folios as such in all cases yet. Likely less relevant changes include that we might now under-estimate the USS (Unique Set Size) of a process, but never over-estimate it. The goal is to make CONFIG_NO_PAGE_MAPCOUNT the default at some point, to then slowly make it the only option, as we learn about real-life impacts and possible ways to mitigate them. 4 Performance ============= Detailed performance numbers were included in v1 [3], and not that much changed between v1 and v2. I did plenty of measurements on different systems in the meantime, that all revealed slightly different results. The pte-mapped-folio micro-benchmarks [4] are fairly sensitive to code layout changes on some systems. Especially the fork() benchmark started being more-shaky-than-before on recent kernels for some reason. In summary, with my micro-benchmarks: * Small folios are not impacted. * CoW performance seems to be mostly unchanged across all folios sizes. * CoW reuse performance of large folios now matches CoW reuse performance of small folios, because we now actually implement the CoW reuse optimization. On an Intel Xeon Silver 4210R I measured a ~65% reduction in runtime, on an arm64 system I measured ~54% reduction. * munmap() performance improves with CONFIG_NO_PAGE_MAPCOUNT. I saw double-digit % reduction (up to ~30% on an Intel Xeon Silver 4210R and up to ~70% on an AmpereOne A192-32X) with larger folios. The larger the folios, the larger the performance improvement. * munmao() performance very slightly (couple percent) degrades without CONFIG_NO_PAGE_MAPCOUNT for smaller folios. For larger folios, there seems to be no change at all. * fork() performance improves with CONFIG_NO_PAGE_MAPCOUNT. I saw double-digit % reduction (up to ~20% on an Intel Xeon Silver 4210R and up to ~10% on an AmpereOne A192-32X) with larger folios. The larger the folios, the larger the performance improvement. * While fork() performance without CONFIG_NO_PAGE_MAPCOUNT seems to be almost unchanged on some systems, I saw some degradation for smaller folios on the AmpereOne A192-32X. I did not investigate the details yet, but I suspect code layout changes or suboptimal code placement / inlining. I'm not to worried about the fork() micro-benchmarks for smaller folios given how shaky the results are lately and by how much we improved fork() performance recently. I also ran case-anon-cow-rand and case-anon-cow-seq part of vm-scalability, to assess the scalability and the impact of the bit-spinlock. My measurements on a two 2-socket 10-core Intel Xeon Silver 4210R CPU revealed no significant changes. Similarly, running these benchmarks with 2 MiB THPs enabled on the AmpereOne A192-32X with 192 cores, I got < 1% difference with < 1% stdev, which is nice. So far, I did not get my hands on a similarly large system with multiple sockets. I found no other fitting scalability benchmarks that seem to really hammer on concurrent mapping/unmapping of large folio pages like case-anon-cow-seq does. 5 Concerns ========== 5.1 Bit spinlock ---------------- I'm not quite happy about the bit-spinlock, but so far it does not seem to affect scalability in my measurements. If it ever becomes a problem we could either investigate improving the locking, or simply stopping the MM tracking once there are "too many mappings" and simply assume that the folio is "mapped shared" until it was freed. This would be similar (but slightly different) to the "0,1,2,stopped" counting idea Willy had at some point. Adding that logic to "stop tracking" adds more code to the hot path, so I avoided that for now. 5.2 folio_maybe_mapped_shared() ------------------------------- I documented the change from folio_likely_mapped_shared() to folio_maybe_mapped_shared() quite extensively. If we run into surprises, I have some ideas on how to resolve them. For now, I think we should be fine. 5.3 Added code to map/unmap hot path ------------------------------------ So far, it looks like the added code on the rmap hot path does not really seem to matter much in the bigger picture. I'd like to further reduce it (and possibly improve fork() performance further), but I don't easily see how right now. Well, and I am out of puff 🙂 Having that said, alternatives I considered (e.g., per-MM per-folio mapcount) would add a lot more overhead to these hot paths. 6 Future Work ============= 6.1 Large mapcount ------------------ It would be very handy if the large mapcount would count how often folio pages are actually mapped into page tables: a PMD on x86-64 would count 512 times. Calculating the average per-page mapcount will be easy, and remapping (PMD->PTE) folios would get even faster. That would also remove the need for the entire mapcount (except for PMD-sized folios for memory statistics reasons ...), and allow for mapping folios larger than PMDs (e.g., 4 MiB) easily. We likely would also have to take the same number of folio references to make our folio_mapcount() == folio_ref_count() work, and we'd want to be able to avoid mapcount+refcount overflows: this could already become an issue with pte-mapped PUD-sized folios (fsdax). One approach we discussed in the THP cabal meeting is (1) extending the mapcount for large folios to 64bit (at least on 64bit systems) and (2) keeping the refcount at 32bit, but (3) having exactly one reference if the the mapcount != 0. It should be doable, but there are some corner cases to consider on the unmap path; it is something that I will be looking into next. 6.2 hugetlb ----------- I'd love to make use of the same tracking also for hugetlb. The real problem is PMD table sharing: getting a page mapped by MM X and unmapped by MM Y will not work. With mshare, that problem should not exist (all mapping/unmapping will be routed through the mshare MM). [1] https://lwn.net/Articles/974223/ [2] https://lore.kernel.org/linux-mm/a9922f58-8129-4f15-b160-e0ace581bcbe@redhat.com/T/ [3] https://lkml.kernel.org/r/20240829165627.2256514-1-david@redhat.com [4] https://gitlab.com/davidhildenbrand/scratchspace/-/raw/main/pte-mapped-folio-benchmarks.c This patch (of 20): Let's factor it out into a simple helper function. This helper will also come in handy when working with code where we know that our folio is large. Maybe in the future we'll have the order readily available for small and large folios; in that case, folio_large_order() would simply translate to folio_order(). Link: https://lkml.kernel.org/r/20250303163014.1128035-1-david@redhat.com Link: https://lkml.kernel.org/r/20250303163014.1128035-2-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Lance Yang <ioworker0@gmail.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Andy Lutomirks^H^Hski <luto@kernel.org> Cc: Borislav Betkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcow (Oracle) <willy@infradead.org> Cc: Michal Koutn <mkoutny@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: tejun heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zefan Li <lizefan.x@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
…ge_order() Patch series "mm: MM owner tracking for large folios (!hugetlb) + CONFIG_NO_PAGE_MAPCOUNT", v3. Let's add an "easy" way to decide -- without false positives, without page-mapcounts and without page table/rmap scanning -- whether a large folio is "certainly mapped exclusively" into a single MM, or whether it "maybe mapped shared" into multiple MMs. Use that information to implement Copy-on-Write reuse, to convert folio_likely_mapped_shared() to folio_maybe_mapped_share(), and to introduce a kernel config option that lets us not use+maintain per-page mapcounts in large folios anymore. The bigger picture was presented at LSF/MM [1]. This series is effectively a follow-up on my early work [2], which implemented a more precise, but also more complicated, way to identify whether a large folio is "mapped shared" into multiple MMs or "mapped exclusively" into a single MM. 1 Patch Organization ==================== Patch #1 -> torvalds#6: make more room in order-1 folios, so we have two "unsigned long" available for our purposes Patch torvalds#7 -> torvalds#11: preparations Patch torvalds#12: MM owner tracking for large folios Patch torvalds#13: COW reuse for PTE-mapped anon THP Patch torvalds#14: folio_maybe_mapped_shared() Patch torvalds#15 -> torvalds#20: introduce and implement CONFIG_NO_PAGE_MAPCOUNT 2 MM owner tracking =================== We assign each MM a unique ID ("MM ID"), to be able to squeeze more information in our folios. On 32bit we use 15-bit IDs, on 64bit we use 31-bit IDs. For each large folios, we now store two MM-ID+mapcount ("slot") combinations: * mm0_id + mm0_mapcount * mm1_id + mm1_mapcount On 32bit, we use a 16-bit per-MM mapcount, on 64bit an ordinary 32bit mapcount. This way, we require 2x "unsigned long" on 32bit and 64bit for both slots. Paired with the large mapcount, we can reliably identify whether one of these MMs is the current owner (-> owns all mappings) or even holds all folio references (-> owns all mappings, and all references are from mappings). As long as only two MMs map folio pages at a time, we can reliably and precisely identify whether a large folio is "mapped shared" or "mapped exclusively". Any additional MM that starts mapping the folio while there are no free slots becomes an "untracked MM". If one such "untracked MM" is the last one mapping a folio exclusively, we will not detect the folio as "mapped exclusively" but instead as "maybe mapped shared". (exception: only a single mapping remains) So that's where the approach gets imprecise. For now, we use a bit-spinlock to sync the large mapcount + slots, and make sure we do keep the machinery fast, to not degrade (un)map performance drastically: for example, we make sure to only use a single atomic (when grabbing the bit-spinlock), like we would already perform when updating the large mapcount. 3 CONFIG_NO_PAGE_MAPCOUNT ========================= patch torvalds#15 -> torvalds#20 spell out and document what exactly is affected when not maintaining the per-page mapcounts in large folios anymore. Most importantly, as we cannot maintain folio->_nr_pages_mapped anymore when (un)mapping pages, we'll account a complete folio as mapped if a single page is mapped. In addition, we'll not detect partially mapped anonymous folios as such in all cases yet. Likely less relevant changes include that we might now under-estimate the USS (Unique Set Size) of a process, but never over-estimate it. The goal is to make CONFIG_NO_PAGE_MAPCOUNT the default at some point, to then slowly make it the only option, as we learn about real-life impacts and possible ways to mitigate them. 4 Performance ============= Detailed performance numbers were included in v1 [3], and not that much changed between v1 and v2. I did plenty of measurements on different systems in the meantime, that all revealed slightly different results. The pte-mapped-folio micro-benchmarks [4] are fairly sensitive to code layout changes on some systems. Especially the fork() benchmark started being more-shaky-than-before on recent kernels for some reason. In summary, with my micro-benchmarks: * Small folios are not impacted. * CoW performance seems to be mostly unchanged across all folios sizes. * CoW reuse performance of large folios now matches CoW reuse performance of small folios, because we now actually implement the CoW reuse optimization. On an Intel Xeon Silver 4210R I measured a ~65% reduction in runtime, on an arm64 system I measured ~54% reduction. * munmap() performance improves with CONFIG_NO_PAGE_MAPCOUNT. I saw double-digit % reduction (up to ~30% on an Intel Xeon Silver 4210R and up to ~70% on an AmpereOne A192-32X) with larger folios. The larger the folios, the larger the performance improvement. * munmao() performance very slightly (couple percent) degrades without CONFIG_NO_PAGE_MAPCOUNT for smaller folios. For larger folios, there seems to be no change at all. * fork() performance improves with CONFIG_NO_PAGE_MAPCOUNT. I saw double-digit % reduction (up to ~20% on an Intel Xeon Silver 4210R and up to ~10% on an AmpereOne A192-32X) with larger folios. The larger the folios, the larger the performance improvement. * While fork() performance without CONFIG_NO_PAGE_MAPCOUNT seems to be almost unchanged on some systems, I saw some degradation for smaller folios on the AmpereOne A192-32X. I did not investigate the details yet, but I suspect code layout changes or suboptimal code placement / inlining. I'm not to worried about the fork() micro-benchmarks for smaller folios given how shaky the results are lately and by how much we improved fork() performance recently. I also ran case-anon-cow-rand and case-anon-cow-seq part of vm-scalability, to assess the scalability and the impact of the bit-spinlock. My measurements on a two 2-socket 10-core Intel Xeon Silver 4210R CPU revealed no significant changes. Similarly, running these benchmarks with 2 MiB THPs enabled on the AmpereOne A192-32X with 192 cores, I got < 1% difference with < 1% stdev, which is nice. So far, I did not get my hands on a similarly large system with multiple sockets. I found no other fitting scalability benchmarks that seem to really hammer on concurrent mapping/unmapping of large folio pages like case-anon-cow-seq does. 5 Concerns ========== 5.1 Bit spinlock ---------------- I'm not quite happy about the bit-spinlock, but so far it does not seem to affect scalability in my measurements. If it ever becomes a problem we could either investigate improving the locking, or simply stopping the MM tracking once there are "too many mappings" and simply assume that the folio is "mapped shared" until it was freed. This would be similar (but slightly different) to the "0,1,2,stopped" counting idea Willy had at some point. Adding that logic to "stop tracking" adds more code to the hot path, so I avoided that for now. 5.2 folio_maybe_mapped_shared() ------------------------------- I documented the change from folio_likely_mapped_shared() to folio_maybe_mapped_shared() quite extensively. If we run into surprises, I have some ideas on how to resolve them. For now, I think we should be fine. 5.3 Added code to map/unmap hot path ------------------------------------ So far, it looks like the added code on the rmap hot path does not really seem to matter much in the bigger picture. I'd like to further reduce it (and possibly improve fork() performance further), but I don't easily see how right now. Well, and I am out of puff 🙂 Having that said, alternatives I considered (e.g., per-MM per-folio mapcount) would add a lot more overhead to these hot paths. 6 Future Work ============= 6.1 Large mapcount ------------------ It would be very handy if the large mapcount would count how often folio pages are actually mapped into page tables: a PMD on x86-64 would count 512 times. Calculating the average per-page mapcount will be easy, and remapping (PMD->PTE) folios would get even faster. That would also remove the need for the entire mapcount (except for PMD-sized folios for memory statistics reasons ...), and allow for mapping folios larger than PMDs (e.g., 4 MiB) easily. We likely would also have to take the same number of folio references to make our folio_mapcount() == folio_ref_count() work, and we'd want to be able to avoid mapcount+refcount overflows: this could already become an issue with pte-mapped PUD-sized folios (fsdax). One approach we discussed in the THP cabal meeting is (1) extending the mapcount for large folios to 64bit (at least on 64bit systems) and (2) keeping the refcount at 32bit, but (3) having exactly one reference if the the mapcount != 0. It should be doable, but there are some corner cases to consider on the unmap path; it is something that I will be looking into next. 6.2 hugetlb ----------- I'd love to make use of the same tracking also for hugetlb. The real problem is PMD table sharing: getting a page mapped by MM X and unmapped by MM Y will not work. With mshare, that problem should not exist (all mapping/unmapping will be routed through the mshare MM). [1] https://lwn.net/Articles/974223/ [2] https://lore.kernel.org/linux-mm/a9922f58-8129-4f15-b160-e0ace581bcbe@redhat.com/T/ [3] https://lkml.kernel.org/r/20240829165627.2256514-1-david@redhat.com [4] https://gitlab.com/davidhildenbrand/scratchspace/-/raw/main/pte-mapped-folio-benchmarks.c This patch (of 20): Let's factor it out into a simple helper function. This helper will also come in handy when working with code where we know that our folio is large. Maybe in the future we'll have the order readily available for small and large folios; in that case, folio_large_order() would simply translate to folio_order(). Link: https://lkml.kernel.org/r/20250303163014.1128035-1-david@redhat.com Link: https://lkml.kernel.org/r/20250303163014.1128035-2-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Lance Yang <ioworker0@gmail.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Andy Lutomirks^H^Hski <luto@kernel.org> Cc: Borislav Betkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcow (Oracle) <willy@infradead.org> Cc: Michal Koutn <mkoutny@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: tejun heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zefan Li <lizefan.x@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Ian told me that there are many memory leaks in the hierarchy mode. I can easily reproduce it with the follwing command. $ make DEBUG=1 EXTRA_CFLAGS=-fsanitize=leak $ perf record --latency -g -- ./perf test -w thloop $ perf report -H --stdio ... Indirect leak of 168 byte(s) in 21 object(s) allocated from: #0 0x7f3414c16c65 in malloc ../../../../src/libsanitizer/lsan/lsan_interceptors.cpp:75 #1 0x55ed3602346e in map__get util/map.h:189 #2 0x55ed36024cc4 in hist_entry__init util/hist.c:476 #3 0x55ed36025208 in hist_entry__new util/hist.c:588 #4 0x55ed36027c05 in hierarchy_insert_entry util/hist.c:1587 #5 0x55ed36027e2e in hists__hierarchy_insert_entry util/hist.c:1638 torvalds#6 0x55ed36027fa4 in hists__collapse_insert_entry util/hist.c:1685 torvalds#7 0x55ed360283e8 in hists__collapse_resort util/hist.c:1776 torvalds#8 0x55ed35de0323 in report__collapse_hists /home/namhyung/project/linux/tools/perf/builtin-report.c:735 torvalds#9 0x55ed35de15b4 in __cmd_report /home/namhyung/project/linux/tools/perf/builtin-report.c:1119 torvalds#10 0x55ed35de43dc in cmd_report /home/namhyung/project/linux/tools/perf/builtin-report.c:1867 torvalds#11 0x55ed35e66767 in run_builtin /home/namhyung/project/linux/tools/perf/perf.c:351 torvalds#12 0x55ed35e66a0e in handle_internal_command /home/namhyung/project/linux/tools/perf/perf.c:404 torvalds#13 0x55ed35e66b67 in run_argv /home/namhyung/project/linux/tools/perf/perf.c:448 torvalds#14 0x55ed35e66eb0 in main /home/namhyung/project/linux/tools/perf/perf.c:556 torvalds#15 0x7f340ac33d67 in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58 ... $ perf report -H --stdio 2>&1 | grep -c '^Indirect leak' 93 I found that hist_entry__delete() missed to release child entries in the hierarchy tree (hroot_{in,out}). It needs to iterate the child entries and call hist_entry__delete() recursively. After this change: $ perf report -H --stdio 2>&1 | grep -c '^Indirect leak' 0 Reported-by: Ian Rogers <irogers@google.com> Tested-by Thomas Falcon <thomas.falcon@intel.com> Reviewed-by: Ian Rogers <irogers@google.com> Link: https://lore.kernel.org/r/20250307061250.320849-2-namhyung@kernel.org Signed-off-by: Namhyung Kim <namhyung@kernel.org>
…ge_order() Patch series "mm: MM owner tracking for large folios (!hugetlb) + CONFIG_NO_PAGE_MAPCOUNT", v3. Let's add an "easy" way to decide -- without false positives, without page-mapcounts and without page table/rmap scanning -- whether a large folio is "certainly mapped exclusively" into a single MM, or whether it "maybe mapped shared" into multiple MMs. Use that information to implement Copy-on-Write reuse, to convert folio_likely_mapped_shared() to folio_maybe_mapped_share(), and to introduce a kernel config option that lets us not use+maintain per-page mapcounts in large folios anymore. The bigger picture was presented at LSF/MM [1]. This series is effectively a follow-up on my early work [2], which implemented a more precise, but also more complicated, way to identify whether a large folio is "mapped shared" into multiple MMs or "mapped exclusively" into a single MM. 1 Patch Organization ==================== Patch #1 -> torvalds#6: make more room in order-1 folios, so we have two "unsigned long" available for our purposes Patch torvalds#7 -> torvalds#11: preparations Patch torvalds#12: MM owner tracking for large folios Patch torvalds#13: COW reuse for PTE-mapped anon THP Patch torvalds#14: folio_maybe_mapped_shared() Patch torvalds#15 -> torvalds#20: introduce and implement CONFIG_NO_PAGE_MAPCOUNT 2 MM owner tracking =================== We assign each MM a unique ID ("MM ID"), to be able to squeeze more information in our folios. On 32bit we use 15-bit IDs, on 64bit we use 31-bit IDs. For each large folios, we now store two MM-ID+mapcount ("slot") combinations: * mm0_id + mm0_mapcount * mm1_id + mm1_mapcount On 32bit, we use a 16-bit per-MM mapcount, on 64bit an ordinary 32bit mapcount. This way, we require 2x "unsigned long" on 32bit and 64bit for both slots. Paired with the large mapcount, we can reliably identify whether one of these MMs is the current owner (-> owns all mappings) or even holds all folio references (-> owns all mappings, and all references are from mappings). As long as only two MMs map folio pages at a time, we can reliably and precisely identify whether a large folio is "mapped shared" or "mapped exclusively". Any additional MM that starts mapping the folio while there are no free slots becomes an "untracked MM". If one such "untracked MM" is the last one mapping a folio exclusively, we will not detect the folio as "mapped exclusively" but instead as "maybe mapped shared". (exception: only a single mapping remains) So that's where the approach gets imprecise. For now, we use a bit-spinlock to sync the large mapcount + slots, and make sure we do keep the machinery fast, to not degrade (un)map performance drastically: for example, we make sure to only use a single atomic (when grabbing the bit-spinlock), like we would already perform when updating the large mapcount. 3 CONFIG_NO_PAGE_MAPCOUNT ========================= patch torvalds#15 -> torvalds#20 spell out and document what exactly is affected when not maintaining the per-page mapcounts in large folios anymore. Most importantly, as we cannot maintain folio->_nr_pages_mapped anymore when (un)mapping pages, we'll account a complete folio as mapped if a single page is mapped. In addition, we'll not detect partially mapped anonymous folios as such in all cases yet. Likely less relevant changes include that we might now under-estimate the USS (Unique Set Size) of a process, but never over-estimate it. The goal is to make CONFIG_NO_PAGE_MAPCOUNT the default at some point, to then slowly make it the only option, as we learn about real-life impacts and possible ways to mitigate them. 4 Performance ============= Detailed performance numbers were included in v1 [3], and not that much changed between v1 and v2. I did plenty of measurements on different systems in the meantime, that all revealed slightly different results. The pte-mapped-folio micro-benchmarks [4] are fairly sensitive to code layout changes on some systems. Especially the fork() benchmark started being more-shaky-than-before on recent kernels for some reason. In summary, with my micro-benchmarks: * Small folios are not impacted. * CoW performance seems to be mostly unchanged across all folios sizes. * CoW reuse performance of large folios now matches CoW reuse performance of small folios, because we now actually implement the CoW reuse optimization. On an Intel Xeon Silver 4210R I measured a ~65% reduction in runtime, on an arm64 system I measured ~54% reduction. * munmap() performance improves with CONFIG_NO_PAGE_MAPCOUNT. I saw double-digit % reduction (up to ~30% on an Intel Xeon Silver 4210R and up to ~70% on an AmpereOne A192-32X) with larger folios. The larger the folios, the larger the performance improvement. * munmao() performance very slightly (couple percent) degrades without CONFIG_NO_PAGE_MAPCOUNT for smaller folios. For larger folios, there seems to be no change at all. * fork() performance improves with CONFIG_NO_PAGE_MAPCOUNT. I saw double-digit % reduction (up to ~20% on an Intel Xeon Silver 4210R and up to ~10% on an AmpereOne A192-32X) with larger folios. The larger the folios, the larger the performance improvement. * While fork() performance without CONFIG_NO_PAGE_MAPCOUNT seems to be almost unchanged on some systems, I saw some degradation for smaller folios on the AmpereOne A192-32X. I did not investigate the details yet, but I suspect code layout changes or suboptimal code placement / inlining. I'm not to worried about the fork() micro-benchmarks for smaller folios given how shaky the results are lately and by how much we improved fork() performance recently. I also ran case-anon-cow-rand and case-anon-cow-seq part of vm-scalability, to assess the scalability and the impact of the bit-spinlock. My measurements on a two 2-socket 10-core Intel Xeon Silver 4210R CPU revealed no significant changes. Similarly, running these benchmarks with 2 MiB THPs enabled on the AmpereOne A192-32X with 192 cores, I got < 1% difference with < 1% stdev, which is nice. So far, I did not get my hands on a similarly large system with multiple sockets. I found no other fitting scalability benchmarks that seem to really hammer on concurrent mapping/unmapping of large folio pages like case-anon-cow-seq does. 5 Concerns ========== 5.1 Bit spinlock ---------------- I'm not quite happy about the bit-spinlock, but so far it does not seem to affect scalability in my measurements. If it ever becomes a problem we could either investigate improving the locking, or simply stopping the MM tracking once there are "too many mappings" and simply assume that the folio is "mapped shared" until it was freed. This would be similar (but slightly different) to the "0,1,2,stopped" counting idea Willy had at some point. Adding that logic to "stop tracking" adds more code to the hot path, so I avoided that for now. 5.2 folio_maybe_mapped_shared() ------------------------------- I documented the change from folio_likely_mapped_shared() to folio_maybe_mapped_shared() quite extensively. If we run into surprises, I have some ideas on how to resolve them. For now, I think we should be fine. 5.3 Added code to map/unmap hot path ------------------------------------ So far, it looks like the added code on the rmap hot path does not really seem to matter much in the bigger picture. I'd like to further reduce it (and possibly improve fork() performance further), but I don't easily see how right now. Well, and I am out of puff 🙂 Having that said, alternatives I considered (e.g., per-MM per-folio mapcount) would add a lot more overhead to these hot paths. 6 Future Work ============= 6.1 Large mapcount ------------------ It would be very handy if the large mapcount would count how often folio pages are actually mapped into page tables: a PMD on x86-64 would count 512 times. Calculating the average per-page mapcount will be easy, and remapping (PMD->PTE) folios would get even faster. That would also remove the need for the entire mapcount (except for PMD-sized folios for memory statistics reasons ...), and allow for mapping folios larger than PMDs (e.g., 4 MiB) easily. We likely would also have to take the same number of folio references to make our folio_mapcount() == folio_ref_count() work, and we'd want to be able to avoid mapcount+refcount overflows: this could already become an issue with pte-mapped PUD-sized folios (fsdax). One approach we discussed in the THP cabal meeting is (1) extending the mapcount for large folios to 64bit (at least on 64bit systems) and (2) keeping the refcount at 32bit, but (3) having exactly one reference if the the mapcount != 0. It should be doable, but there are some corner cases to consider on the unmap path; it is something that I will be looking into next. 6.2 hugetlb ----------- I'd love to make use of the same tracking also for hugetlb. The real problem is PMD table sharing: getting a page mapped by MM X and unmapped by MM Y will not work. With mshare, that problem should not exist (all mapping/unmapping will be routed through the mshare MM). [1] https://lwn.net/Articles/974223/ [2] https://lore.kernel.org/linux-mm/a9922f58-8129-4f15-b160-e0ace581bcbe@redhat.com/T/ [3] https://lkml.kernel.org/r/20240829165627.2256514-1-david@redhat.com [4] https://gitlab.com/davidhildenbrand/scratchspace/-/raw/main/pte-mapped-folio-benchmarks.c This patch (of 20): Let's factor it out into a simple helper function. This helper will also come in handy when working with code where we know that our folio is large. Maybe in the future we'll have the order readily available for small and large folios; in that case, folio_large_order() would simply translate to folio_order(). Link: https://lkml.kernel.org/r/20250303163014.1128035-1-david@redhat.com Link: https://lkml.kernel.org/r/20250303163014.1128035-2-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Lance Yang <ioworker0@gmail.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Andy Lutomirks^H^Hski <luto@kernel.org> Cc: Borislav Betkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcow (Oracle) <willy@infradead.org> Cc: Michal Koutn <mkoutny@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: tejun heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zefan Li <lizefan.x@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The env.pmu_mapping can be leaked when it reads data from a pipe on AMD. For a pipe data, it reads the header data including pmu_mapping from PERF_RECORD_HEADER_FEATURE runtime. But it's already set in: perf_session__new() __perf_session__new() evlist__init_trace_event_sample_raw() evlist__has_amd_ibs() perf_env__nr_pmu_mappings() Then it'll overwrite that when it processes the HEADER_FEATURE record. Here's a report from address sanitizer. Direct leak of 2689 byte(s) in 1 object(s) allocated from: #0 0x7fed8f814596 in realloc ../../../../src/libsanitizer/lsan/lsan_interceptors.cpp:98 #1 0x5595a7d416b1 in strbuf_grow util/strbuf.c:64 #2 0x5595a7d414ef in strbuf_init util/strbuf.c:25 #3 0x5595a7d0f4b7 in perf_env__read_pmu_mappings util/env.c:362 #4 0x5595a7d12ab7 in perf_env__nr_pmu_mappings util/env.c:517 #5 0x5595a7d89d2f in evlist__has_amd_ibs util/amd-sample-raw.c:315 torvalds#6 0x5595a7d87fb2 in evlist__init_trace_event_sample_raw util/sample-raw.c:23 torvalds#7 0x5595a7d7f893 in __perf_session__new util/session.c:179 torvalds#8 0x5595a7b79572 in perf_session__new util/session.h:115 torvalds#9 0x5595a7b7e9dc in cmd_report builtin-report.c:1603 torvalds#10 0x5595a7c019eb in run_builtin perf.c:351 torvalds#11 0x5595a7c01c92 in handle_internal_command perf.c:404 torvalds#12 0x5595a7c01deb in run_argv perf.c:448 torvalds#13 0x5595a7c02134 in main perf.c:556 torvalds#14 0x7fed85833d67 in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58 Let's free the existing pmu_mapping data if any. Cc: Ravi Bangoria <ravi.bangoria@amd.com> Signed-off-by: Namhyung Kim <namhyung@kernel.org>
…ge_order() Patch series "mm: MM owner tracking for large folios (!hugetlb) + CONFIG_NO_PAGE_MAPCOUNT", v3. Let's add an "easy" way to decide -- without false positives, without page-mapcounts and without page table/rmap scanning -- whether a large folio is "certainly mapped exclusively" into a single MM, or whether it "maybe mapped shared" into multiple MMs. Use that information to implement Copy-on-Write reuse, to convert folio_likely_mapped_shared() to folio_maybe_mapped_share(), and to introduce a kernel config option that lets us not use+maintain per-page mapcounts in large folios anymore. The bigger picture was presented at LSF/MM [1]. This series is effectively a follow-up on my early work [2], which implemented a more precise, but also more complicated, way to identify whether a large folio is "mapped shared" into multiple MMs or "mapped exclusively" into a single MM. 1 Patch Organization ==================== Patch #1 -> torvalds#6: make more room in order-1 folios, so we have two "unsigned long" available for our purposes Patch torvalds#7 -> torvalds#11: preparations Patch torvalds#12: MM owner tracking for large folios Patch torvalds#13: COW reuse for PTE-mapped anon THP Patch torvalds#14: folio_maybe_mapped_shared() Patch torvalds#15 -> torvalds#20: introduce and implement CONFIG_NO_PAGE_MAPCOUNT 2 MM owner tracking =================== We assign each MM a unique ID ("MM ID"), to be able to squeeze more information in our folios. On 32bit we use 15-bit IDs, on 64bit we use 31-bit IDs. For each large folios, we now store two MM-ID+mapcount ("slot") combinations: * mm0_id + mm0_mapcount * mm1_id + mm1_mapcount On 32bit, we use a 16-bit per-MM mapcount, on 64bit an ordinary 32bit mapcount. This way, we require 2x "unsigned long" on 32bit and 64bit for both slots. Paired with the large mapcount, we can reliably identify whether one of these MMs is the current owner (-> owns all mappings) or even holds all folio references (-> owns all mappings, and all references are from mappings). As long as only two MMs map folio pages at a time, we can reliably and precisely identify whether a large folio is "mapped shared" or "mapped exclusively". Any additional MM that starts mapping the folio while there are no free slots becomes an "untracked MM". If one such "untracked MM" is the last one mapping a folio exclusively, we will not detect the folio as "mapped exclusively" but instead as "maybe mapped shared". (exception: only a single mapping remains) So that's where the approach gets imprecise. For now, we use a bit-spinlock to sync the large mapcount + slots, and make sure we do keep the machinery fast, to not degrade (un)map performance drastically: for example, we make sure to only use a single atomic (when grabbing the bit-spinlock), like we would already perform when updating the large mapcount. 3 CONFIG_NO_PAGE_MAPCOUNT ========================= patch torvalds#15 -> torvalds#20 spell out and document what exactly is affected when not maintaining the per-page mapcounts in large folios anymore. Most importantly, as we cannot maintain folio->_nr_pages_mapped anymore when (un)mapping pages, we'll account a complete folio as mapped if a single page is mapped. In addition, we'll not detect partially mapped anonymous folios as such in all cases yet. Likely less relevant changes include that we might now under-estimate the USS (Unique Set Size) of a process, but never over-estimate it. The goal is to make CONFIG_NO_PAGE_MAPCOUNT the default at some point, to then slowly make it the only option, as we learn about real-life impacts and possible ways to mitigate them. 4 Performance ============= Detailed performance numbers were included in v1 [3], and not that much changed between v1 and v2. I did plenty of measurements on different systems in the meantime, that all revealed slightly different results. The pte-mapped-folio micro-benchmarks [4] are fairly sensitive to code layout changes on some systems. Especially the fork() benchmark started being more-shaky-than-before on recent kernels for some reason. In summary, with my micro-benchmarks: * Small folios are not impacted. * CoW performance seems to be mostly unchanged across all folios sizes. * CoW reuse performance of large folios now matches CoW reuse performance of small folios, because we now actually implement the CoW reuse optimization. On an Intel Xeon Silver 4210R I measured a ~65% reduction in runtime, on an arm64 system I measured ~54% reduction. * munmap() performance improves with CONFIG_NO_PAGE_MAPCOUNT. I saw double-digit % reduction (up to ~30% on an Intel Xeon Silver 4210R and up to ~70% on an AmpereOne A192-32X) with larger folios. The larger the folios, the larger the performance improvement. * munmao() performance very slightly (couple percent) degrades without CONFIG_NO_PAGE_MAPCOUNT for smaller folios. For larger folios, there seems to be no change at all. * fork() performance improves with CONFIG_NO_PAGE_MAPCOUNT. I saw double-digit % reduction (up to ~20% on an Intel Xeon Silver 4210R and up to ~10% on an AmpereOne A192-32X) with larger folios. The larger the folios, the larger the performance improvement. * While fork() performance without CONFIG_NO_PAGE_MAPCOUNT seems to be almost unchanged on some systems, I saw some degradation for smaller folios on the AmpereOne A192-32X. I did not investigate the details yet, but I suspect code layout changes or suboptimal code placement / inlining. I'm not to worried about the fork() micro-benchmarks for smaller folios given how shaky the results are lately and by how much we improved fork() performance recently. I also ran case-anon-cow-rand and case-anon-cow-seq part of vm-scalability, to assess the scalability and the impact of the bit-spinlock. My measurements on a two 2-socket 10-core Intel Xeon Silver 4210R CPU revealed no significant changes. Similarly, running these benchmarks with 2 MiB THPs enabled on the AmpereOne A192-32X with 192 cores, I got < 1% difference with < 1% stdev, which is nice. So far, I did not get my hands on a similarly large system with multiple sockets. I found no other fitting scalability benchmarks that seem to really hammer on concurrent mapping/unmapping of large folio pages like case-anon-cow-seq does. 5 Concerns ========== 5.1 Bit spinlock ---------------- I'm not quite happy about the bit-spinlock, but so far it does not seem to affect scalability in my measurements. If it ever becomes a problem we could either investigate improving the locking, or simply stopping the MM tracking once there are "too many mappings" and simply assume that the folio is "mapped shared" until it was freed. This would be similar (but slightly different) to the "0,1,2,stopped" counting idea Willy had at some point. Adding that logic to "stop tracking" adds more code to the hot path, so I avoided that for now. 5.2 folio_maybe_mapped_shared() ------------------------------- I documented the change from folio_likely_mapped_shared() to folio_maybe_mapped_shared() quite extensively. If we run into surprises, I have some ideas on how to resolve them. For now, I think we should be fine. 5.3 Added code to map/unmap hot path ------------------------------------ So far, it looks like the added code on the rmap hot path does not really seem to matter much in the bigger picture. I'd like to further reduce it (and possibly improve fork() performance further), but I don't easily see how right now. Well, and I am out of puff 🙂 Having that said, alternatives I considered (e.g., per-MM per-folio mapcount) would add a lot more overhead to these hot paths. 6 Future Work ============= 6.1 Large mapcount ------------------ It would be very handy if the large mapcount would count how often folio pages are actually mapped into page tables: a PMD on x86-64 would count 512 times. Calculating the average per-page mapcount will be easy, and remapping (PMD->PTE) folios would get even faster. That would also remove the need for the entire mapcount (except for PMD-sized folios for memory statistics reasons ...), and allow for mapping folios larger than PMDs (e.g., 4 MiB) easily. We likely would also have to take the same number of folio references to make our folio_mapcount() == folio_ref_count() work, and we'd want to be able to avoid mapcount+refcount overflows: this could already become an issue with pte-mapped PUD-sized folios (fsdax). One approach we discussed in the THP cabal meeting is (1) extending the mapcount for large folios to 64bit (at least on 64bit systems) and (2) keeping the refcount at 32bit, but (3) having exactly one reference if the the mapcount != 0. It should be doable, but there are some corner cases to consider on the unmap path; it is something that I will be looking into next. 6.2 hugetlb ----------- I'd love to make use of the same tracking also for hugetlb. The real problem is PMD table sharing: getting a page mapped by MM X and unmapped by MM Y will not work. With mshare, that problem should not exist (all mapping/unmapping will be routed through the mshare MM). [1] https://lwn.net/Articles/974223/ [2] https://lore.kernel.org/linux-mm/a9922f58-8129-4f15-b160-e0ace581bcbe@redhat.com/T/ [3] https://lkml.kernel.org/r/20240829165627.2256514-1-david@redhat.com [4] https://gitlab.com/davidhildenbrand/scratchspace/-/raw/main/pte-mapped-folio-benchmarks.c This patch (of 20): Let's factor it out into a simple helper function. This helper will also come in handy when working with code where we know that our folio is large. Maybe in the future we'll have the order readily available for small and large folios; in that case, folio_large_order() would simply translate to folio_order(). Link: https://lkml.kernel.org/r/20250303163014.1128035-1-david@redhat.com Link: https://lkml.kernel.org/r/20250303163014.1128035-2-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Lance Yang <ioworker0@gmail.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Andy Lutomirks^H^Hski <luto@kernel.org> Cc: Borislav Betkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcow (Oracle) <willy@infradead.org> Cc: Michal Koutn <mkoutny@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: tejun heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zefan Li <lizefan.x@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
- EDITME: describe what is new in this series revision. - EDITME: use bulletpoints and terse descriptions. - Link to v4: https://lore.kernel.org/r/20250311-b4-sm8750-display-v4-0-da6b3e959c76@linaro.org drm/msm: Add support for SM8750 Hi, Dependency / Rabased on top of ============================== https://lore.kernel.org/all/20241214-dpu-drop-features-v1-0-988f0662cb7e@linaro.org/ Merging ======= DSI pieces here might not be ready - I got modetest writeback working, but DSI panel on MTP8750 still shows darkness. Therefore we discussed that DPU/catalog things could be applied first. Changes in v4 ============= - Add ack/rb tags - Implement Dmitry's feedback (lower-case hex, indentation, pass mdss_ver instead of ctl), patches: drm/msm/dpu: Implement 10-bit color alpha for v12.0 DPU drm/msm/dpu: Implement CTL_PIPE_ACTIVE for v12.0 DPU - Rebase on latest next - Drop applied two first patches - Link to v3: https://lore.kernel.org/r/20250221-b4-sm8750-display-v3-0-3ea95b1630ea@linaro.org Changes in v3 ============= - Add ack/rb tags - #5: dt-bindings: display/msm: dp-controller: Add SM8750: Extend commit msg - torvalds#7: dt-bindings: display/msm: qcom,sm8750-mdss: Add SM8750: - Properly described interconnects - Use only one compatible and contains for the sub-blocks (Rob) - torvalds#12: drm/msm/dsi: Add support for SM8750: Drop 'struct msm_dsi_config sm8750_dsi_cfg' and use sm8650 one. - drm/msm/dpu: Implement new v12.0 DPU differences Split into several patches - Link to v2: https://lore.kernel.org/r/20250217-b4-sm8750-display-v2-0-d201dcdda6a4@linaro.org Changes in v2 ============= - Implement LM crossbar, 10-bit alpha and active layer changes: New patch: drm/msm/dpu: Implement new v12.0 DPU differences - New patch: drm/msm/dpu: Add missing "fetch" name to set_active_pipes() - Add CDM - Split some DPU patch pieces into separate patches: drm/msm/dpu: Drop useless comments drm/msm/dpu: Add LM_7, DSC_[67], PP_[67] and MERGE_3D_5 drm/msm/dpu: Add handling of LM_6 and LM_7 bits in pending flush mask - Split DSI and DSI PHY patches - Mention CLK_OPS_PARENT_ENABLE in DSI commit - Mention DSI PHY PLL work: https://patchwork.freedesktop.org/patch/542000/?series=119177&rev=1 - DPU: Drop SSPP_VIG4 comments - DPU: Add CDM - Link to v1: https://lore.kernel.org/r/20250109-b4-sm8750-display-v1-0-b3f15faf4c97@linaro.org Best regards, Krzysztof To: Abhinav Kumar <quic_abhinavk@quicinc.com> To: Sean Paul <sean@poorly.run> To: Marijn Suijten <marijn.suijten@somainline.org> To: David Airlie <airlied@gmail.com> To: Simona Vetter <simona@ffwll.ch> To: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> To: Maxime Ripard <mripard@kernel.org> To: Thomas Zimmermann <tzimmermann@suse.de> To: Rob Herring <robh@kernel.org> To: Krzysztof Kozlowski <krzk+dt@kernel.org> To: Conor Dooley <conor+dt@kernel.org> To: Krishna Manikandan <quic_mkrishn@quicinc.com> To: Jonathan Marek <jonathan@marek.ca> To: Kuogee Hsieh <quic_khsieh@quicinc.com> To: Neil Armstrong <neil.armstrong@linaro.org> To: Dmitry Baryshkov <lumag@kernel.org> To: Rob Clark <robdclark@gmail.com> Cc: linux-arm-msm@vger.kernel.org Cc: dri-devel@lists.freedesktop.org Cc: freedreno@lists.freedesktop.org Cc: devicetree@vger.kernel.org Cc: linux-kernel@vger.kernel.org Cc: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Cc: Srini Kandagatla <srinivas.kandagatla@linaro.org> Cc: Rob Clark <robdclark@chromium.org> --- b4-submit-tracking --- # This section is used internally by b4 prep for tracking purposes. { "series": { "revision": 5, "change-id": "20250109-b4-sm8750-display-6ea537754af1", "prefixes": [], "history": { "v1": [ "20250109-b4-sm8750-display-v1-0-b3f15faf4c97@linaro.org" ], "v2": [ "20250217-b4-sm8750-display-v2-0-d201dcdda6a4@linaro.org" ], "v3": [ "20250221-b4-sm8750-display-v3-0-3ea95b1630ea@linaro.org" ], "v4": [ "20250311-b4-sm8750-display-v4-0-da6b3e959c76@linaro.org" ] } } }
…ge_order() Patch series "mm: MM owner tracking for large folios (!hugetlb) + CONFIG_NO_PAGE_MAPCOUNT", v3. Let's add an "easy" way to decide -- without false positives, without page-mapcounts and without page table/rmap scanning -- whether a large folio is "certainly mapped exclusively" into a single MM, or whether it "maybe mapped shared" into multiple MMs. Use that information to implement Copy-on-Write reuse, to convert folio_likely_mapped_shared() to folio_maybe_mapped_share(), and to introduce a kernel config option that lets us not use+maintain per-page mapcounts in large folios anymore. The bigger picture was presented at LSF/MM [1]. This series is effectively a follow-up on my early work [2], which implemented a more precise, but also more complicated, way to identify whether a large folio is "mapped shared" into multiple MMs or "mapped exclusively" into a single MM. 1 Patch Organization ==================== Patch #1 -> torvalds#6: make more room in order-1 folios, so we have two "unsigned long" available for our purposes Patch torvalds#7 -> torvalds#11: preparations Patch torvalds#12: MM owner tracking for large folios Patch torvalds#13: COW reuse for PTE-mapped anon THP Patch torvalds#14: folio_maybe_mapped_shared() Patch torvalds#15 -> torvalds#20: introduce and implement CONFIG_NO_PAGE_MAPCOUNT 2 MM owner tracking =================== We assign each MM a unique ID ("MM ID"), to be able to squeeze more information in our folios. On 32bit we use 15-bit IDs, on 64bit we use 31-bit IDs. For each large folios, we now store two MM-ID+mapcount ("slot") combinations: * mm0_id + mm0_mapcount * mm1_id + mm1_mapcount On 32bit, we use a 16-bit per-MM mapcount, on 64bit an ordinary 32bit mapcount. This way, we require 2x "unsigned long" on 32bit and 64bit for both slots. Paired with the large mapcount, we can reliably identify whether one of these MMs is the current owner (-> owns all mappings) or even holds all folio references (-> owns all mappings, and all references are from mappings). As long as only two MMs map folio pages at a time, we can reliably and precisely identify whether a large folio is "mapped shared" or "mapped exclusively". Any additional MM that starts mapping the folio while there are no free slots becomes an "untracked MM". If one such "untracked MM" is the last one mapping a folio exclusively, we will not detect the folio as "mapped exclusively" but instead as "maybe mapped shared". (exception: only a single mapping remains) So that's where the approach gets imprecise. For now, we use a bit-spinlock to sync the large mapcount + slots, and make sure we do keep the machinery fast, to not degrade (un)map performance drastically: for example, we make sure to only use a single atomic (when grabbing the bit-spinlock), like we would already perform when updating the large mapcount. 3 CONFIG_NO_PAGE_MAPCOUNT ========================= patch torvalds#15 -> torvalds#20 spell out and document what exactly is affected when not maintaining the per-page mapcounts in large folios anymore. Most importantly, as we cannot maintain folio->_nr_pages_mapped anymore when (un)mapping pages, we'll account a complete folio as mapped if a single page is mapped. In addition, we'll not detect partially mapped anonymous folios as such in all cases yet. Likely less relevant changes include that we might now under-estimate the USS (Unique Set Size) of a process, but never over-estimate it. The goal is to make CONFIG_NO_PAGE_MAPCOUNT the default at some point, to then slowly make it the only option, as we learn about real-life impacts and possible ways to mitigate them. 4 Performance ============= Detailed performance numbers were included in v1 [3], and not that much changed between v1 and v2. I did plenty of measurements on different systems in the meantime, that all revealed slightly different results. The pte-mapped-folio micro-benchmarks [4] are fairly sensitive to code layout changes on some systems. Especially the fork() benchmark started being more-shaky-than-before on recent kernels for some reason. In summary, with my micro-benchmarks: * Small folios are not impacted. * CoW performance seems to be mostly unchanged across all folios sizes. * CoW reuse performance of large folios now matches CoW reuse performance of small folios, because we now actually implement the CoW reuse optimization. On an Intel Xeon Silver 4210R I measured a ~65% reduction in runtime, on an arm64 system I measured ~54% reduction. * munmap() performance improves with CONFIG_NO_PAGE_MAPCOUNT. I saw double-digit % reduction (up to ~30% on an Intel Xeon Silver 4210R and up to ~70% on an AmpereOne A192-32X) with larger folios. The larger the folios, the larger the performance improvement. * munmao() performance very slightly (couple percent) degrades without CONFIG_NO_PAGE_MAPCOUNT for smaller folios. For larger folios, there seems to be no change at all. * fork() performance improves with CONFIG_NO_PAGE_MAPCOUNT. I saw double-digit % reduction (up to ~20% on an Intel Xeon Silver 4210R and up to ~10% on an AmpereOne A192-32X) with larger folios. The larger the folios, the larger the performance improvement. * While fork() performance without CONFIG_NO_PAGE_MAPCOUNT seems to be almost unchanged on some systems, I saw some degradation for smaller folios on the AmpereOne A192-32X. I did not investigate the details yet, but I suspect code layout changes or suboptimal code placement / inlining. I'm not to worried about the fork() micro-benchmarks for smaller folios given how shaky the results are lately and by how much we improved fork() performance recently. I also ran case-anon-cow-rand and case-anon-cow-seq part of vm-scalability, to assess the scalability and the impact of the bit-spinlock. My measurements on a two 2-socket 10-core Intel Xeon Silver 4210R CPU revealed no significant changes. Similarly, running these benchmarks with 2 MiB THPs enabled on the AmpereOne A192-32X with 192 cores, I got < 1% difference with < 1% stdev, which is nice. So far, I did not get my hands on a similarly large system with multiple sockets. I found no other fitting scalability benchmarks that seem to really hammer on concurrent mapping/unmapping of large folio pages like case-anon-cow-seq does. 5 Concerns ========== 5.1 Bit spinlock ---------------- I'm not quite happy about the bit-spinlock, but so far it does not seem to affect scalability in my measurements. If it ever becomes a problem we could either investigate improving the locking, or simply stopping the MM tracking once there are "too many mappings" and simply assume that the folio is "mapped shared" until it was freed. This would be similar (but slightly different) to the "0,1,2,stopped" counting idea Willy had at some point. Adding that logic to "stop tracking" adds more code to the hot path, so I avoided that for now. 5.2 folio_maybe_mapped_shared() ------------------------------- I documented the change from folio_likely_mapped_shared() to folio_maybe_mapped_shared() quite extensively. If we run into surprises, I have some ideas on how to resolve them. For now, I think we should be fine. 5.3 Added code to map/unmap hot path ------------------------------------ So far, it looks like the added code on the rmap hot path does not really seem to matter much in the bigger picture. I'd like to further reduce it (and possibly improve fork() performance further), but I don't easily see how right now. Well, and I am out of puff 🙂 Having that said, alternatives I considered (e.g., per-MM per-folio mapcount) would add a lot more overhead to these hot paths. 6 Future Work ============= 6.1 Large mapcount ------------------ It would be very handy if the large mapcount would count how often folio pages are actually mapped into page tables: a PMD on x86-64 would count 512 times. Calculating the average per-page mapcount will be easy, and remapping (PMD->PTE) folios would get even faster. That would also remove the need for the entire mapcount (except for PMD-sized folios for memory statistics reasons ...), and allow for mapping folios larger than PMDs (e.g., 4 MiB) easily. We likely would also have to take the same number of folio references to make our folio_mapcount() == folio_ref_count() work, and we'd want to be able to avoid mapcount+refcount overflows: this could already become an issue with pte-mapped PUD-sized folios (fsdax). One approach we discussed in the THP cabal meeting is (1) extending the mapcount for large folios to 64bit (at least on 64bit systems) and (2) keeping the refcount at 32bit, but (3) having exactly one reference if the the mapcount != 0. It should be doable, but there are some corner cases to consider on the unmap path; it is something that I will be looking into next. 6.2 hugetlb ----------- I'd love to make use of the same tracking also for hugetlb. The real problem is PMD table sharing: getting a page mapped by MM X and unmapped by MM Y will not work. With mshare, that problem should not exist (all mapping/unmapping will be routed through the mshare MM). [1] https://lwn.net/Articles/974223/ [2] https://lore.kernel.org/linux-mm/a9922f58-8129-4f15-b160-e0ace581bcbe@redhat.com/T/ [3] https://lkml.kernel.org/r/20240829165627.2256514-1-david@redhat.com [4] https://gitlab.com/davidhildenbrand/scratchspace/-/raw/main/pte-mapped-folio-benchmarks.c This patch (of 20): Let's factor it out into a simple helper function. This helper will also come in handy when working with code where we know that our folio is large. Maybe in the future we'll have the order readily available for small and large folios; in that case, folio_large_order() would simply translate to folio_order(). Link: https://lkml.kernel.org/r/20250303163014.1128035-1-david@redhat.com Link: https://lkml.kernel.org/r/20250303163014.1128035-2-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Lance Yang <ioworker0@gmail.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Andy Lutomirks^H^Hski <luto@kernel.org> Cc: Borislav Betkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcow (Oracle) <willy@infradead.org> Cc: Michal Koutn <mkoutny@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: tejun heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zefan Li <lizefan.x@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
…ge_order() Patch series "mm: MM owner tracking for large folios (!hugetlb) + CONFIG_NO_PAGE_MAPCOUNT", v3. Let's add an "easy" way to decide -- without false positives, without page-mapcounts and without page table/rmap scanning -- whether a large folio is "certainly mapped exclusively" into a single MM, or whether it "maybe mapped shared" into multiple MMs. Use that information to implement Copy-on-Write reuse, to convert folio_likely_mapped_shared() to folio_maybe_mapped_share(), and to introduce a kernel config option that lets us not use+maintain per-page mapcounts in large folios anymore. The bigger picture was presented at LSF/MM [1]. This series is effectively a follow-up on my early work [2], which implemented a more precise, but also more complicated, way to identify whether a large folio is "mapped shared" into multiple MMs or "mapped exclusively" into a single MM. 1 Patch Organization ==================== Patch #1 -> torvalds#6: make more room in order-1 folios, so we have two "unsigned long" available for our purposes Patch torvalds#7 -> torvalds#11: preparations Patch torvalds#12: MM owner tracking for large folios Patch torvalds#13: COW reuse for PTE-mapped anon THP Patch torvalds#14: folio_maybe_mapped_shared() Patch torvalds#15 -> torvalds#20: introduce and implement CONFIG_NO_PAGE_MAPCOUNT 2 MM owner tracking =================== We assign each MM a unique ID ("MM ID"), to be able to squeeze more information in our folios. On 32bit we use 15-bit IDs, on 64bit we use 31-bit IDs. For each large folios, we now store two MM-ID+mapcount ("slot") combinations: * mm0_id + mm0_mapcount * mm1_id + mm1_mapcount On 32bit, we use a 16-bit per-MM mapcount, on 64bit an ordinary 32bit mapcount. This way, we require 2x "unsigned long" on 32bit and 64bit for both slots. Paired with the large mapcount, we can reliably identify whether one of these MMs is the current owner (-> owns all mappings) or even holds all folio references (-> owns all mappings, and all references are from mappings). As long as only two MMs map folio pages at a time, we can reliably and precisely identify whether a large folio is "mapped shared" or "mapped exclusively". Any additional MM that starts mapping the folio while there are no free slots becomes an "untracked MM". If one such "untracked MM" is the last one mapping a folio exclusively, we will not detect the folio as "mapped exclusively" but instead as "maybe mapped shared". (exception: only a single mapping remains) So that's where the approach gets imprecise. For now, we use a bit-spinlock to sync the large mapcount + slots, and make sure we do keep the machinery fast, to not degrade (un)map performance drastically: for example, we make sure to only use a single atomic (when grabbing the bit-spinlock), like we would already perform when updating the large mapcount. 3 CONFIG_NO_PAGE_MAPCOUNT ========================= patch torvalds#15 -> torvalds#20 spell out and document what exactly is affected when not maintaining the per-page mapcounts in large folios anymore. Most importantly, as we cannot maintain folio->_nr_pages_mapped anymore when (un)mapping pages, we'll account a complete folio as mapped if a single page is mapped. In addition, we'll not detect partially mapped anonymous folios as such in all cases yet. Likely less relevant changes include that we might now under-estimate the USS (Unique Set Size) of a process, but never over-estimate it. The goal is to make CONFIG_NO_PAGE_MAPCOUNT the default at some point, to then slowly make it the only option, as we learn about real-life impacts and possible ways to mitigate them. 4 Performance ============= Detailed performance numbers were included in v1 [3], and not that much changed between v1 and v2. I did plenty of measurements on different systems in the meantime, that all revealed slightly different results. The pte-mapped-folio micro-benchmarks [4] are fairly sensitive to code layout changes on some systems. Especially the fork() benchmark started being more-shaky-than-before on recent kernels for some reason. In summary, with my micro-benchmarks: * Small folios are not impacted. * CoW performance seems to be mostly unchanged across all folios sizes. * CoW reuse performance of large folios now matches CoW reuse performance of small folios, because we now actually implement the CoW reuse optimization. On an Intel Xeon Silver 4210R I measured a ~65% reduction in runtime, on an arm64 system I measured ~54% reduction. * munmap() performance improves with CONFIG_NO_PAGE_MAPCOUNT. I saw double-digit % reduction (up to ~30% on an Intel Xeon Silver 4210R and up to ~70% on an AmpereOne A192-32X) with larger folios. The larger the folios, the larger the performance improvement. * munmao() performance very slightly (couple percent) degrades without CONFIG_NO_PAGE_MAPCOUNT for smaller folios. For larger folios, there seems to be no change at all. * fork() performance improves with CONFIG_NO_PAGE_MAPCOUNT. I saw double-digit % reduction (up to ~20% on an Intel Xeon Silver 4210R and up to ~10% on an AmpereOne A192-32X) with larger folios. The larger the folios, the larger the performance improvement. * While fork() performance without CONFIG_NO_PAGE_MAPCOUNT seems to be almost unchanged on some systems, I saw some degradation for smaller folios on the AmpereOne A192-32X. I did not investigate the details yet, but I suspect code layout changes or suboptimal code placement / inlining. I'm not to worried about the fork() micro-benchmarks for smaller folios given how shaky the results are lately and by how much we improved fork() performance recently. I also ran case-anon-cow-rand and case-anon-cow-seq part of vm-scalability, to assess the scalability and the impact of the bit-spinlock. My measurements on a two 2-socket 10-core Intel Xeon Silver 4210R CPU revealed no significant changes. Similarly, running these benchmarks with 2 MiB THPs enabled on the AmpereOne A192-32X with 192 cores, I got < 1% difference with < 1% stdev, which is nice. So far, I did not get my hands on a similarly large system with multiple sockets. I found no other fitting scalability benchmarks that seem to really hammer on concurrent mapping/unmapping of large folio pages like case-anon-cow-seq does. 5 Concerns ========== 5.1 Bit spinlock ---------------- I'm not quite happy about the bit-spinlock, but so far it does not seem to affect scalability in my measurements. If it ever becomes a problem we could either investigate improving the locking, or simply stopping the MM tracking once there are "too many mappings" and simply assume that the folio is "mapped shared" until it was freed. This would be similar (but slightly different) to the "0,1,2,stopped" counting idea Willy had at some point. Adding that logic to "stop tracking" adds more code to the hot path, so I avoided that for now. 5.2 folio_maybe_mapped_shared() ------------------------------- I documented the change from folio_likely_mapped_shared() to folio_maybe_mapped_shared() quite extensively. If we run into surprises, I have some ideas on how to resolve them. For now, I think we should be fine. 5.3 Added code to map/unmap hot path ------------------------------------ So far, it looks like the added code on the rmap hot path does not really seem to matter much in the bigger picture. I'd like to further reduce it (and possibly improve fork() performance further), but I don't easily see how right now. Well, and I am out of puff 🙂 Having that said, alternatives I considered (e.g., per-MM per-folio mapcount) would add a lot more overhead to these hot paths. 6 Future Work ============= 6.1 Large mapcount ------------------ It would be very handy if the large mapcount would count how often folio pages are actually mapped into page tables: a PMD on x86-64 would count 512 times. Calculating the average per-page mapcount will be easy, and remapping (PMD->PTE) folios would get even faster. That would also remove the need for the entire mapcount (except for PMD-sized folios for memory statistics reasons ...), and allow for mapping folios larger than PMDs (e.g., 4 MiB) easily. We likely would also have to take the same number of folio references to make our folio_mapcount() == folio_ref_count() work, and we'd want to be able to avoid mapcount+refcount overflows: this could already become an issue with pte-mapped PUD-sized folios (fsdax). One approach we discussed in the THP cabal meeting is (1) extending the mapcount for large folios to 64bit (at least on 64bit systems) and (2) keeping the refcount at 32bit, but (3) having exactly one reference if the the mapcount != 0. It should be doable, but there are some corner cases to consider on the unmap path; it is something that I will be looking into next. 6.2 hugetlb ----------- I'd love to make use of the same tracking also for hugetlb. The real problem is PMD table sharing: getting a page mapped by MM X and unmapped by MM Y will not work. With mshare, that problem should not exist (all mapping/unmapping will be routed through the mshare MM). [1] https://lwn.net/Articles/974223/ [2] https://lore.kernel.org/linux-mm/a9922f58-8129-4f15-b160-e0ace581bcbe@redhat.com/T/ [3] https://lkml.kernel.org/r/20240829165627.2256514-1-david@redhat.com [4] https://gitlab.com/davidhildenbrand/scratchspace/-/raw/main/pte-mapped-folio-benchmarks.c This patch (of 20): Let's factor it out into a simple helper function. This helper will also come in handy when working with code where we know that our folio is large. Maybe in the future we'll have the order readily available for small and large folios; in that case, folio_large_order() would simply translate to folio_order(). Link: https://lkml.kernel.org/r/20250303163014.1128035-1-david@redhat.com Link: https://lkml.kernel.org/r/20250303163014.1128035-2-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Lance Yang <ioworker0@gmail.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Andy Lutomirks^H^Hski <luto@kernel.org> Cc: Borislav Betkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcow (Oracle) <willy@infradead.org> Cc: Michal Koutn <mkoutny@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: tejun heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zefan Li <lizefan.x@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
…ge_order() Patch series "mm: MM owner tracking for large folios (!hugetlb) + CONFIG_NO_PAGE_MAPCOUNT", v3. Let's add an "easy" way to decide -- without false positives, without page-mapcounts and without page table/rmap scanning -- whether a large folio is "certainly mapped exclusively" into a single MM, or whether it "maybe mapped shared" into multiple MMs. Use that information to implement Copy-on-Write reuse, to convert folio_likely_mapped_shared() to folio_maybe_mapped_share(), and to introduce a kernel config option that lets us not use+maintain per-page mapcounts in large folios anymore. The bigger picture was presented at LSF/MM [1]. This series is effectively a follow-up on my early work [2], which implemented a more precise, but also more complicated, way to identify whether a large folio is "mapped shared" into multiple MMs or "mapped exclusively" into a single MM. 1 Patch Organization ==================== Patch #1 -> torvalds#6: make more room in order-1 folios, so we have two "unsigned long" available for our purposes Patch torvalds#7 -> torvalds#11: preparations Patch torvalds#12: MM owner tracking for large folios Patch torvalds#13: COW reuse for PTE-mapped anon THP Patch torvalds#14: folio_maybe_mapped_shared() Patch torvalds#15 -> torvalds#20: introduce and implement CONFIG_NO_PAGE_MAPCOUNT 2 MM owner tracking =================== We assign each MM a unique ID ("MM ID"), to be able to squeeze more information in our folios. On 32bit we use 15-bit IDs, on 64bit we use 31-bit IDs. For each large folios, we now store two MM-ID+mapcount ("slot") combinations: * mm0_id + mm0_mapcount * mm1_id + mm1_mapcount On 32bit, we use a 16-bit per-MM mapcount, on 64bit an ordinary 32bit mapcount. This way, we require 2x "unsigned long" on 32bit and 64bit for both slots. Paired with the large mapcount, we can reliably identify whether one of these MMs is the current owner (-> owns all mappings) or even holds all folio references (-> owns all mappings, and all references are from mappings). As long as only two MMs map folio pages at a time, we can reliably and precisely identify whether a large folio is "mapped shared" or "mapped exclusively". Any additional MM that starts mapping the folio while there are no free slots becomes an "untracked MM". If one such "untracked MM" is the last one mapping a folio exclusively, we will not detect the folio as "mapped exclusively" but instead as "maybe mapped shared". (exception: only a single mapping remains) So that's where the approach gets imprecise. For now, we use a bit-spinlock to sync the large mapcount + slots, and make sure we do keep the machinery fast, to not degrade (un)map performance drastically: for example, we make sure to only use a single atomic (when grabbing the bit-spinlock), like we would already perform when updating the large mapcount. 3 CONFIG_NO_PAGE_MAPCOUNT ========================= patch torvalds#15 -> torvalds#20 spell out and document what exactly is affected when not maintaining the per-page mapcounts in large folios anymore. Most importantly, as we cannot maintain folio->_nr_pages_mapped anymore when (un)mapping pages, we'll account a complete folio as mapped if a single page is mapped. In addition, we'll not detect partially mapped anonymous folios as such in all cases yet. Likely less relevant changes include that we might now under-estimate the USS (Unique Set Size) of a process, but never over-estimate it. The goal is to make CONFIG_NO_PAGE_MAPCOUNT the default at some point, to then slowly make it the only option, as we learn about real-life impacts and possible ways to mitigate them. 4 Performance ============= Detailed performance numbers were included in v1 [3], and not that much changed between v1 and v2. I did plenty of measurements on different systems in the meantime, that all revealed slightly different results. The pte-mapped-folio micro-benchmarks [4] are fairly sensitive to code layout changes on some systems. Especially the fork() benchmark started being more-shaky-than-before on recent kernels for some reason. In summary, with my micro-benchmarks: * Small folios are not impacted. * CoW performance seems to be mostly unchanged across all folios sizes. * CoW reuse performance of large folios now matches CoW reuse performance of small folios, because we now actually implement the CoW reuse optimization. On an Intel Xeon Silver 4210R I measured a ~65% reduction in runtime, on an arm64 system I measured ~54% reduction. * munmap() performance improves with CONFIG_NO_PAGE_MAPCOUNT. I saw double-digit % reduction (up to ~30% on an Intel Xeon Silver 4210R and up to ~70% on an AmpereOne A192-32X) with larger folios. The larger the folios, the larger the performance improvement. * munmao() performance very slightly (couple percent) degrades without CONFIG_NO_PAGE_MAPCOUNT for smaller folios. For larger folios, there seems to be no change at all. * fork() performance improves with CONFIG_NO_PAGE_MAPCOUNT. I saw double-digit % reduction (up to ~20% on an Intel Xeon Silver 4210R and up to ~10% on an AmpereOne A192-32X) with larger folios. The larger the folios, the larger the performance improvement. * While fork() performance without CONFIG_NO_PAGE_MAPCOUNT seems to be almost unchanged on some systems, I saw some degradation for smaller folios on the AmpereOne A192-32X. I did not investigate the details yet, but I suspect code layout changes or suboptimal code placement / inlining. I'm not to worried about the fork() micro-benchmarks for smaller folios given how shaky the results are lately and by how much we improved fork() performance recently. I also ran case-anon-cow-rand and case-anon-cow-seq part of vm-scalability, to assess the scalability and the impact of the bit-spinlock. My measurements on a two 2-socket 10-core Intel Xeon Silver 4210R CPU revealed no significant changes. Similarly, running these benchmarks with 2 MiB THPs enabled on the AmpereOne A192-32X with 192 cores, I got < 1% difference with < 1% stdev, which is nice. So far, I did not get my hands on a similarly large system with multiple sockets. I found no other fitting scalability benchmarks that seem to really hammer on concurrent mapping/unmapping of large folio pages like case-anon-cow-seq does. 5 Concerns ========== 5.1 Bit spinlock ---------------- I'm not quite happy about the bit-spinlock, but so far it does not seem to affect scalability in my measurements. If it ever becomes a problem we could either investigate improving the locking, or simply stopping the MM tracking once there are "too many mappings" and simply assume that the folio is "mapped shared" until it was freed. This would be similar (but slightly different) to the "0,1,2,stopped" counting idea Willy had at some point. Adding that logic to "stop tracking" adds more code to the hot path, so I avoided that for now. 5.2 folio_maybe_mapped_shared() ------------------------------- I documented the change from folio_likely_mapped_shared() to folio_maybe_mapped_shared() quite extensively. If we run into surprises, I have some ideas on how to resolve them. For now, I think we should be fine. 5.3 Added code to map/unmap hot path ------------------------------------ So far, it looks like the added code on the rmap hot path does not really seem to matter much in the bigger picture. I'd like to further reduce it (and possibly improve fork() performance further), but I don't easily see how right now. Well, and I am out of puff 🙂 Having that said, alternatives I considered (e.g., per-MM per-folio mapcount) would add a lot more overhead to these hot paths. 6 Future Work ============= 6.1 Large mapcount ------------------ It would be very handy if the large mapcount would count how often folio pages are actually mapped into page tables: a PMD on x86-64 would count 512 times. Calculating the average per-page mapcount will be easy, and remapping (PMD->PTE) folios would get even faster. That would also remove the need for the entire mapcount (except for PMD-sized folios for memory statistics reasons ...), and allow for mapping folios larger than PMDs (e.g., 4 MiB) easily. We likely would also have to take the same number of folio references to make our folio_mapcount() == folio_ref_count() work, and we'd want to be able to avoid mapcount+refcount overflows: this could already become an issue with pte-mapped PUD-sized folios (fsdax). One approach we discussed in the THP cabal meeting is (1) extending the mapcount for large folios to 64bit (at least on 64bit systems) and (2) keeping the refcount at 32bit, but (3) having exactly one reference if the the mapcount != 0. It should be doable, but there are some corner cases to consider on the unmap path; it is something that I will be looking into next. 6.2 hugetlb ----------- I'd love to make use of the same tracking also for hugetlb. The real problem is PMD table sharing: getting a page mapped by MM X and unmapped by MM Y will not work. With mshare, that problem should not exist (all mapping/unmapping will be routed through the mshare MM). [1] https://lwn.net/Articles/974223/ [2] https://lore.kernel.org/linux-mm/a9922f58-8129-4f15-b160-e0ace581bcbe@redhat.com/T/ [3] https://lkml.kernel.org/r/20240829165627.2256514-1-david@redhat.com [4] https://gitlab.com/davidhildenbrand/scratchspace/-/raw/main/pte-mapped-folio-benchmarks.c This patch (of 20): Let's factor it out into a simple helper function. This helper will also come in handy when working with code where we know that our folio is large. Maybe in the future we'll have the order readily available for small and large folios; in that case, folio_large_order() would simply translate to folio_order(). Link: https://lkml.kernel.org/r/20250303163014.1128035-1-david@redhat.com Link: https://lkml.kernel.org/r/20250303163014.1128035-2-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Lance Yang <ioworker0@gmail.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Andy Lutomirks^H^Hski <luto@kernel.org> Cc: Borislav Betkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcow (Oracle) <willy@infradead.org> Cc: Michal Koutn <mkoutny@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: tejun heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zefan Li <lizefan.x@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
…ge_order() Patch series "mm: MM owner tracking for large folios (!hugetlb) + CONFIG_NO_PAGE_MAPCOUNT", v3. Let's add an "easy" way to decide -- without false positives, without page-mapcounts and without page table/rmap scanning -- whether a large folio is "certainly mapped exclusively" into a single MM, or whether it "maybe mapped shared" into multiple MMs. Use that information to implement Copy-on-Write reuse, to convert folio_likely_mapped_shared() to folio_maybe_mapped_share(), and to introduce a kernel config option that lets us not use+maintain per-page mapcounts in large folios anymore. The bigger picture was presented at LSF/MM [1]. This series is effectively a follow-up on my early work [2], which implemented a more precise, but also more complicated, way to identify whether a large folio is "mapped shared" into multiple MMs or "mapped exclusively" into a single MM. 1 Patch Organization ==================== Patch #1 -> torvalds#6: make more room in order-1 folios, so we have two "unsigned long" available for our purposes Patch torvalds#7 -> torvalds#11: preparations Patch torvalds#12: MM owner tracking for large folios Patch torvalds#13: COW reuse for PTE-mapped anon THP Patch torvalds#14: folio_maybe_mapped_shared() Patch torvalds#15 -> torvalds#20: introduce and implement CONFIG_NO_PAGE_MAPCOUNT 2 MM owner tracking =================== We assign each MM a unique ID ("MM ID"), to be able to squeeze more information in our folios. On 32bit we use 15-bit IDs, on 64bit we use 31-bit IDs. For each large folios, we now store two MM-ID+mapcount ("slot") combinations: * mm0_id + mm0_mapcount * mm1_id + mm1_mapcount On 32bit, we use a 16-bit per-MM mapcount, on 64bit an ordinary 32bit mapcount. This way, we require 2x "unsigned long" on 32bit and 64bit for both slots. Paired with the large mapcount, we can reliably identify whether one of these MMs is the current owner (-> owns all mappings) or even holds all folio references (-> owns all mappings, and all references are from mappings). As long as only two MMs map folio pages at a time, we can reliably and precisely identify whether a large folio is "mapped shared" or "mapped exclusively". Any additional MM that starts mapping the folio while there are no free slots becomes an "untracked MM". If one such "untracked MM" is the last one mapping a folio exclusively, we will not detect the folio as "mapped exclusively" but instead as "maybe mapped shared". (exception: only a single mapping remains) So that's where the approach gets imprecise. For now, we use a bit-spinlock to sync the large mapcount + slots, and make sure we do keep the machinery fast, to not degrade (un)map performance drastically: for example, we make sure to only use a single atomic (when grabbing the bit-spinlock), like we would already perform when updating the large mapcount. 3 CONFIG_NO_PAGE_MAPCOUNT ========================= patch torvalds#15 -> torvalds#20 spell out and document what exactly is affected when not maintaining the per-page mapcounts in large folios anymore. Most importantly, as we cannot maintain folio->_nr_pages_mapped anymore when (un)mapping pages, we'll account a complete folio as mapped if a single page is mapped. In addition, we'll not detect partially mapped anonymous folios as such in all cases yet. Likely less relevant changes include that we might now under-estimate the USS (Unique Set Size) of a process, but never over-estimate it. The goal is to make CONFIG_NO_PAGE_MAPCOUNT the default at some point, to then slowly make it the only option, as we learn about real-life impacts and possible ways to mitigate them. 4 Performance ============= Detailed performance numbers were included in v1 [3], and not that much changed between v1 and v2. I did plenty of measurements on different systems in the meantime, that all revealed slightly different results. The pte-mapped-folio micro-benchmarks [4] are fairly sensitive to code layout changes on some systems. Especially the fork() benchmark started being more-shaky-than-before on recent kernels for some reason. In summary, with my micro-benchmarks: * Small folios are not impacted. * CoW performance seems to be mostly unchanged across all folios sizes. * CoW reuse performance of large folios now matches CoW reuse performance of small folios, because we now actually implement the CoW reuse optimization. On an Intel Xeon Silver 4210R I measured a ~65% reduction in runtime, on an arm64 system I measured ~54% reduction. * munmap() performance improves with CONFIG_NO_PAGE_MAPCOUNT. I saw double-digit % reduction (up to ~30% on an Intel Xeon Silver 4210R and up to ~70% on an AmpereOne A192-32X) with larger folios. The larger the folios, the larger the performance improvement. * munmao() performance very slightly (couple percent) degrades without CONFIG_NO_PAGE_MAPCOUNT for smaller folios. For larger folios, there seems to be no change at all. * fork() performance improves with CONFIG_NO_PAGE_MAPCOUNT. I saw double-digit % reduction (up to ~20% on an Intel Xeon Silver 4210R and up to ~10% on an AmpereOne A192-32X) with larger folios. The larger the folios, the larger the performance improvement. * While fork() performance without CONFIG_NO_PAGE_MAPCOUNT seems to be almost unchanged on some systems, I saw some degradation for smaller folios on the AmpereOne A192-32X. I did not investigate the details yet, but I suspect code layout changes or suboptimal code placement / inlining. I'm not to worried about the fork() micro-benchmarks for smaller folios given how shaky the results are lately and by how much we improved fork() performance recently. I also ran case-anon-cow-rand and case-anon-cow-seq part of vm-scalability, to assess the scalability and the impact of the bit-spinlock. My measurements on a two 2-socket 10-core Intel Xeon Silver 4210R CPU revealed no significant changes. Similarly, running these benchmarks with 2 MiB THPs enabled on the AmpereOne A192-32X with 192 cores, I got < 1% difference with < 1% stdev, which is nice. So far, I did not get my hands on a similarly large system with multiple sockets. I found no other fitting scalability benchmarks that seem to really hammer on concurrent mapping/unmapping of large folio pages like case-anon-cow-seq does. 5 Concerns ========== 5.1 Bit spinlock ---------------- I'm not quite happy about the bit-spinlock, but so far it does not seem to affect scalability in my measurements. If it ever becomes a problem we could either investigate improving the locking, or simply stopping the MM tracking once there are "too many mappings" and simply assume that the folio is "mapped shared" until it was freed. This would be similar (but slightly different) to the "0,1,2,stopped" counting idea Willy had at some point. Adding that logic to "stop tracking" adds more code to the hot path, so I avoided that for now. 5.2 folio_maybe_mapped_shared() ------------------------------- I documented the change from folio_likely_mapped_shared() to folio_maybe_mapped_shared() quite extensively. If we run into surprises, I have some ideas on how to resolve them. For now, I think we should be fine. 5.3 Added code to map/unmap hot path ------------------------------------ So far, it looks like the added code on the rmap hot path does not really seem to matter much in the bigger picture. I'd like to further reduce it (and possibly improve fork() performance further), but I don't easily see how right now. Well, and I am out of puff 🙂 Having that said, alternatives I considered (e.g., per-MM per-folio mapcount) would add a lot more overhead to these hot paths. 6 Future Work ============= 6.1 Large mapcount ------------------ It would be very handy if the large mapcount would count how often folio pages are actually mapped into page tables: a PMD on x86-64 would count 512 times. Calculating the average per-page mapcount will be easy, and remapping (PMD->PTE) folios would get even faster. That would also remove the need for the entire mapcount (except for PMD-sized folios for memory statistics reasons ...), and allow for mapping folios larger than PMDs (e.g., 4 MiB) easily. We likely would also have to take the same number of folio references to make our folio_mapcount() == folio_ref_count() work, and we'd want to be able to avoid mapcount+refcount overflows: this could already become an issue with pte-mapped PUD-sized folios (fsdax). One approach we discussed in the THP cabal meeting is (1) extending the mapcount for large folios to 64bit (at least on 64bit systems) and (2) keeping the refcount at 32bit, but (3) having exactly one reference if the the mapcount != 0. It should be doable, but there are some corner cases to consider on the unmap path; it is something that I will be looking into next. 6.2 hugetlb ----------- I'd love to make use of the same tracking also for hugetlb. The real problem is PMD table sharing: getting a page mapped by MM X and unmapped by MM Y will not work. With mshare, that problem should not exist (all mapping/unmapping will be routed through the mshare MM). [1] https://lwn.net/Articles/974223/ [2] https://lore.kernel.org/linux-mm/a9922f58-8129-4f15-b160-e0ace581bcbe@redhat.com/T/ [3] https://lkml.kernel.org/r/20240829165627.2256514-1-david@redhat.com [4] https://gitlab.com/davidhildenbrand/scratchspace/-/raw/main/pte-mapped-folio-benchmarks.c This patch (of 20): Let's factor it out into a simple helper function. This helper will also come in handy when working with code where we know that our folio is large. Maybe in the future we'll have the order readily available for small and large folios; in that case, folio_large_order() would simply translate to folio_order(). Link: https://lkml.kernel.org/r/20250303163014.1128035-1-david@redhat.com Link: https://lkml.kernel.org/r/20250303163014.1128035-2-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Lance Yang <ioworker0@gmail.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Andy Lutomirks^H^Hski <luto@kernel.org> Cc: Borislav Betkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcow (Oracle) <willy@infradead.org> Cc: Michal Koutn <mkoutny@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: tejun heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zefan Li <lizefan.x@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
…ge_order() Patch series "mm: MM owner tracking for large folios (!hugetlb) + CONFIG_NO_PAGE_MAPCOUNT", v3. Let's add an "easy" way to decide -- without false positives, without page-mapcounts and without page table/rmap scanning -- whether a large folio is "certainly mapped exclusively" into a single MM, or whether it "maybe mapped shared" into multiple MMs. Use that information to implement Copy-on-Write reuse, to convert folio_likely_mapped_shared() to folio_maybe_mapped_share(), and to introduce a kernel config option that lets us not use+maintain per-page mapcounts in large folios anymore. The bigger picture was presented at LSF/MM [1]. This series is effectively a follow-up on my early work [2], which implemented a more precise, but also more complicated, way to identify whether a large folio is "mapped shared" into multiple MMs or "mapped exclusively" into a single MM. 1 Patch Organization ==================== Patch #1 -> torvalds#6: make more room in order-1 folios, so we have two "unsigned long" available for our purposes Patch torvalds#7 -> torvalds#11: preparations Patch torvalds#12: MM owner tracking for large folios Patch torvalds#13: COW reuse for PTE-mapped anon THP Patch torvalds#14: folio_maybe_mapped_shared() Patch torvalds#15 -> torvalds#20: introduce and implement CONFIG_NO_PAGE_MAPCOUNT 2 MM owner tracking =================== We assign each MM a unique ID ("MM ID"), to be able to squeeze more information in our folios. On 32bit we use 15-bit IDs, on 64bit we use 31-bit IDs. For each large folios, we now store two MM-ID+mapcount ("slot") combinations: * mm0_id + mm0_mapcount * mm1_id + mm1_mapcount On 32bit, we use a 16-bit per-MM mapcount, on 64bit an ordinary 32bit mapcount. This way, we require 2x "unsigned long" on 32bit and 64bit for both slots. Paired with the large mapcount, we can reliably identify whether one of these MMs is the current owner (-> owns all mappings) or even holds all folio references (-> owns all mappings, and all references are from mappings). As long as only two MMs map folio pages at a time, we can reliably and precisely identify whether a large folio is "mapped shared" or "mapped exclusively". Any additional MM that starts mapping the folio while there are no free slots becomes an "untracked MM". If one such "untracked MM" is the last one mapping a folio exclusively, we will not detect the folio as "mapped exclusively" but instead as "maybe mapped shared". (exception: only a single mapping remains) So that's where the approach gets imprecise. For now, we use a bit-spinlock to sync the large mapcount + slots, and make sure we do keep the machinery fast, to not degrade (un)map performance drastically: for example, we make sure to only use a single atomic (when grabbing the bit-spinlock), like we would already perform when updating the large mapcount. 3 CONFIG_NO_PAGE_MAPCOUNT ========================= patch torvalds#15 -> torvalds#20 spell out and document what exactly is affected when not maintaining the per-page mapcounts in large folios anymore. Most importantly, as we cannot maintain folio->_nr_pages_mapped anymore when (un)mapping pages, we'll account a complete folio as mapped if a single page is mapped. In addition, we'll not detect partially mapped anonymous folios as such in all cases yet. Likely less relevant changes include that we might now under-estimate the USS (Unique Set Size) of a process, but never over-estimate it. The goal is to make CONFIG_NO_PAGE_MAPCOUNT the default at some point, to then slowly make it the only option, as we learn about real-life impacts and possible ways to mitigate them. 4 Performance ============= Detailed performance numbers were included in v1 [3], and not that much changed between v1 and v2. I did plenty of measurements on different systems in the meantime, that all revealed slightly different results. The pte-mapped-folio micro-benchmarks [4] are fairly sensitive to code layout changes on some systems. Especially the fork() benchmark started being more-shaky-than-before on recent kernels for some reason. In summary, with my micro-benchmarks: * Small folios are not impacted. * CoW performance seems to be mostly unchanged across all folios sizes. * CoW reuse performance of large folios now matches CoW reuse performance of small folios, because we now actually implement the CoW reuse optimization. On an Intel Xeon Silver 4210R I measured a ~65% reduction in runtime, on an arm64 system I measured ~54% reduction. * munmap() performance improves with CONFIG_NO_PAGE_MAPCOUNT. I saw double-digit % reduction (up to ~30% on an Intel Xeon Silver 4210R and up to ~70% on an AmpereOne A192-32X) with larger folios. The larger the folios, the larger the performance improvement. * munmao() performance very slightly (couple percent) degrades without CONFIG_NO_PAGE_MAPCOUNT for smaller folios. For larger folios, there seems to be no change at all. * fork() performance improves with CONFIG_NO_PAGE_MAPCOUNT. I saw double-digit % reduction (up to ~20% on an Intel Xeon Silver 4210R and up to ~10% on an AmpereOne A192-32X) with larger folios. The larger the folios, the larger the performance improvement. * While fork() performance without CONFIG_NO_PAGE_MAPCOUNT seems to be almost unchanged on some systems, I saw some degradation for smaller folios on the AmpereOne A192-32X. I did not investigate the details yet, but I suspect code layout changes or suboptimal code placement / inlining. I'm not to worried about the fork() micro-benchmarks for smaller folios given how shaky the results are lately and by how much we improved fork() performance recently. I also ran case-anon-cow-rand and case-anon-cow-seq part of vm-scalability, to assess the scalability and the impact of the bit-spinlock. My measurements on a two 2-socket 10-core Intel Xeon Silver 4210R CPU revealed no significant changes. Similarly, running these benchmarks with 2 MiB THPs enabled on the AmpereOne A192-32X with 192 cores, I got < 1% difference with < 1% stdev, which is nice. So far, I did not get my hands on a similarly large system with multiple sockets. I found no other fitting scalability benchmarks that seem to really hammer on concurrent mapping/unmapping of large folio pages like case-anon-cow-seq does. 5 Concerns ========== 5.1 Bit spinlock ---------------- I'm not quite happy about the bit-spinlock, but so far it does not seem to affect scalability in my measurements. If it ever becomes a problem we could either investigate improving the locking, or simply stopping the MM tracking once there are "too many mappings" and simply assume that the folio is "mapped shared" until it was freed. This would be similar (but slightly different) to the "0,1,2,stopped" counting idea Willy had at some point. Adding that logic to "stop tracking" adds more code to the hot path, so I avoided that for now. 5.2 folio_maybe_mapped_shared() ------------------------------- I documented the change from folio_likely_mapped_shared() to folio_maybe_mapped_shared() quite extensively. If we run into surprises, I have some ideas on how to resolve them. For now, I think we should be fine. 5.3 Added code to map/unmap hot path ------------------------------------ So far, it looks like the added code on the rmap hot path does not really seem to matter much in the bigger picture. I'd like to further reduce it (and possibly improve fork() performance further), but I don't easily see how right now. Well, and I am out of puff 🙂 Having that said, alternatives I considered (e.g., per-MM per-folio mapcount) would add a lot more overhead to these hot paths. 6 Future Work ============= 6.1 Large mapcount ------------------ It would be very handy if the large mapcount would count how often folio pages are actually mapped into page tables: a PMD on x86-64 would count 512 times. Calculating the average per-page mapcount will be easy, and remapping (PMD->PTE) folios would get even faster. That would also remove the need for the entire mapcount (except for PMD-sized folios for memory statistics reasons ...), and allow for mapping folios larger than PMDs (e.g., 4 MiB) easily. We likely would also have to take the same number of folio references to make our folio_mapcount() == folio_ref_count() work, and we'd want to be able to avoid mapcount+refcount overflows: this could already become an issue with pte-mapped PUD-sized folios (fsdax). One approach we discussed in the THP cabal meeting is (1) extending the mapcount for large folios to 64bit (at least on 64bit systems) and (2) keeping the refcount at 32bit, but (3) having exactly one reference if the the mapcount != 0. It should be doable, but there are some corner cases to consider on the unmap path; it is something that I will be looking into next. 6.2 hugetlb ----------- I'd love to make use of the same tracking also for hugetlb. The real problem is PMD table sharing: getting a page mapped by MM X and unmapped by MM Y will not work. With mshare, that problem should not exist (all mapping/unmapping will be routed through the mshare MM). [1] https://lwn.net/Articles/974223/ [2] https://lore.kernel.org/linux-mm/a9922f58-8129-4f15-b160-e0ace581bcbe@redhat.com/T/ [3] https://lkml.kernel.org/r/20240829165627.2256514-1-david@redhat.com [4] https://gitlab.com/davidhildenbrand/scratchspace/-/raw/main/pte-mapped-folio-benchmarks.c This patch (of 20): Let's factor it out into a simple helper function. This helper will also come in handy when working with code where we know that our folio is large. Maybe in the future we'll have the order readily available for small and large folios; in that case, folio_large_order() would simply translate to folio_order(). Link: https://lkml.kernel.org/r/20250303163014.1128035-1-david@redhat.com Link: https://lkml.kernel.org/r/20250303163014.1128035-2-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Lance Yang <ioworker0@gmail.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Andy Lutomirks^H^Hski <luto@kernel.org> Cc: Borislav Betkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcow (Oracle) <willy@infradead.org> Cc: Michal Koutn <mkoutny@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: tejun heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zefan Li <lizefan.x@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
…ge_order() Patch series "mm: MM owner tracking for large folios (!hugetlb) + CONFIG_NO_PAGE_MAPCOUNT", v3. Let's add an "easy" way to decide -- without false positives, without page-mapcounts and without page table/rmap scanning -- whether a large folio is "certainly mapped exclusively" into a single MM, or whether it "maybe mapped shared" into multiple MMs. Use that information to implement Copy-on-Write reuse, to convert folio_likely_mapped_shared() to folio_maybe_mapped_share(), and to introduce a kernel config option that lets us not use+maintain per-page mapcounts in large folios anymore. The bigger picture was presented at LSF/MM [1]. This series is effectively a follow-up on my early work [2], which implemented a more precise, but also more complicated, way to identify whether a large folio is "mapped shared" into multiple MMs or "mapped exclusively" into a single MM. 1 Patch Organization ==================== Patch #1 -> torvalds#6: make more room in order-1 folios, so we have two "unsigned long" available for our purposes Patch torvalds#7 -> torvalds#11: preparations Patch torvalds#12: MM owner tracking for large folios Patch torvalds#13: COW reuse for PTE-mapped anon THP Patch torvalds#14: folio_maybe_mapped_shared() Patch torvalds#15 -> torvalds#20: introduce and implement CONFIG_NO_PAGE_MAPCOUNT 2 MM owner tracking =================== We assign each MM a unique ID ("MM ID"), to be able to squeeze more information in our folios. On 32bit we use 15-bit IDs, on 64bit we use 31-bit IDs. For each large folios, we now store two MM-ID+mapcount ("slot") combinations: * mm0_id + mm0_mapcount * mm1_id + mm1_mapcount On 32bit, we use a 16-bit per-MM mapcount, on 64bit an ordinary 32bit mapcount. This way, we require 2x "unsigned long" on 32bit and 64bit for both slots. Paired with the large mapcount, we can reliably identify whether one of these MMs is the current owner (-> owns all mappings) or even holds all folio references (-> owns all mappings, and all references are from mappings). As long as only two MMs map folio pages at a time, we can reliably and precisely identify whether a large folio is "mapped shared" or "mapped exclusively". Any additional MM that starts mapping the folio while there are no free slots becomes an "untracked MM". If one such "untracked MM" is the last one mapping a folio exclusively, we will not detect the folio as "mapped exclusively" but instead as "maybe mapped shared". (exception: only a single mapping remains) So that's where the approach gets imprecise. For now, we use a bit-spinlock to sync the large mapcount + slots, and make sure we do keep the machinery fast, to not degrade (un)map performance drastically: for example, we make sure to only use a single atomic (when grabbing the bit-spinlock), like we would already perform when updating the large mapcount. 3 CONFIG_NO_PAGE_MAPCOUNT ========================= patch torvalds#15 -> torvalds#20 spell out and document what exactly is affected when not maintaining the per-page mapcounts in large folios anymore. Most importantly, as we cannot maintain folio->_nr_pages_mapped anymore when (un)mapping pages, we'll account a complete folio as mapped if a single page is mapped. In addition, we'll not detect partially mapped anonymous folios as such in all cases yet. Likely less relevant changes include that we might now under-estimate the USS (Unique Set Size) of a process, but never over-estimate it. The goal is to make CONFIG_NO_PAGE_MAPCOUNT the default at some point, to then slowly make it the only option, as we learn about real-life impacts and possible ways to mitigate them. 4 Performance ============= Detailed performance numbers were included in v1 [3], and not that much changed between v1 and v2. I did plenty of measurements on different systems in the meantime, that all revealed slightly different results. The pte-mapped-folio micro-benchmarks [4] are fairly sensitive to code layout changes on some systems. Especially the fork() benchmark started being more-shaky-than-before on recent kernels for some reason. In summary, with my micro-benchmarks: * Small folios are not impacted. * CoW performance seems to be mostly unchanged across all folios sizes. * CoW reuse performance of large folios now matches CoW reuse performance of small folios, because we now actually implement the CoW reuse optimization. On an Intel Xeon Silver 4210R I measured a ~65% reduction in runtime, on an arm64 system I measured ~54% reduction. * munmap() performance improves with CONFIG_NO_PAGE_MAPCOUNT. I saw double-digit % reduction (up to ~30% on an Intel Xeon Silver 4210R and up to ~70% on an AmpereOne A192-32X) with larger folios. The larger the folios, the larger the performance improvement. * munmao() performance very slightly (couple percent) degrades without CONFIG_NO_PAGE_MAPCOUNT for smaller folios. For larger folios, there seems to be no change at all. * fork() performance improves with CONFIG_NO_PAGE_MAPCOUNT. I saw double-digit % reduction (up to ~20% on an Intel Xeon Silver 4210R and up to ~10% on an AmpereOne A192-32X) with larger folios. The larger the folios, the larger the performance improvement. * While fork() performance without CONFIG_NO_PAGE_MAPCOUNT seems to be almost unchanged on some systems, I saw some degradation for smaller folios on the AmpereOne A192-32X. I did not investigate the details yet, but I suspect code layout changes or suboptimal code placement / inlining. I'm not to worried about the fork() micro-benchmarks for smaller folios given how shaky the results are lately and by how much we improved fork() performance recently. I also ran case-anon-cow-rand and case-anon-cow-seq part of vm-scalability, to assess the scalability and the impact of the bit-spinlock. My measurements on a two 2-socket 10-core Intel Xeon Silver 4210R CPU revealed no significant changes. Similarly, running these benchmarks with 2 MiB THPs enabled on the AmpereOne A192-32X with 192 cores, I got < 1% difference with < 1% stdev, which is nice. So far, I did not get my hands on a similarly large system with multiple sockets. I found no other fitting scalability benchmarks that seem to really hammer on concurrent mapping/unmapping of large folio pages like case-anon-cow-seq does. 5 Concerns ========== 5.1 Bit spinlock ---------------- I'm not quite happy about the bit-spinlock, but so far it does not seem to affect scalability in my measurements. If it ever becomes a problem we could either investigate improving the locking, or simply stopping the MM tracking once there are "too many mappings" and simply assume that the folio is "mapped shared" until it was freed. This would be similar (but slightly different) to the "0,1,2,stopped" counting idea Willy had at some point. Adding that logic to "stop tracking" adds more code to the hot path, so I avoided that for now. 5.2 folio_maybe_mapped_shared() ------------------------------- I documented the change from folio_likely_mapped_shared() to folio_maybe_mapped_shared() quite extensively. If we run into surprises, I have some ideas on how to resolve them. For now, I think we should be fine. 5.3 Added code to map/unmap hot path ------------------------------------ So far, it looks like the added code on the rmap hot path does not really seem to matter much in the bigger picture. I'd like to further reduce it (and possibly improve fork() performance further), but I don't easily see how right now. Well, and I am out of puff 🙂 Having that said, alternatives I considered (e.g., per-MM per-folio mapcount) would add a lot more overhead to these hot paths. 6 Future Work ============= 6.1 Large mapcount ------------------ It would be very handy if the large mapcount would count how often folio pages are actually mapped into page tables: a PMD on x86-64 would count 512 times. Calculating the average per-page mapcount will be easy, and remapping (PMD->PTE) folios would get even faster. That would also remove the need for the entire mapcount (except for PMD-sized folios for memory statistics reasons ...), and allow for mapping folios larger than PMDs (e.g., 4 MiB) easily. We likely would also have to take the same number of folio references to make our folio_mapcount() == folio_ref_count() work, and we'd want to be able to avoid mapcount+refcount overflows: this could already become an issue with pte-mapped PUD-sized folios (fsdax). One approach we discussed in the THP cabal meeting is (1) extending the mapcount for large folios to 64bit (at least on 64bit systems) and (2) keeping the refcount at 32bit, but (3) having exactly one reference if the the mapcount != 0. It should be doable, but there are some corner cases to consider on the unmap path; it is something that I will be looking into next. 6.2 hugetlb ----------- I'd love to make use of the same tracking also for hugetlb. The real problem is PMD table sharing: getting a page mapped by MM X and unmapped by MM Y will not work. With mshare, that problem should not exist (all mapping/unmapping will be routed through the mshare MM). [1] https://lwn.net/Articles/974223/ [2] https://lore.kernel.org/linux-mm/a9922f58-8129-4f15-b160-e0ace581bcbe@redhat.com/T/ [3] https://lkml.kernel.org/r/20240829165627.2256514-1-david@redhat.com [4] https://gitlab.com/davidhildenbrand/scratchspace/-/raw/main/pte-mapped-folio-benchmarks.c This patch (of 20): Let's factor it out into a simple helper function. This helper will also come in handy when working with code where we know that our folio is large. Maybe in the future we'll have the order readily available for small and large folios; in that case, folio_large_order() would simply translate to folio_order(). Link: https://lkml.kernel.org/r/20250303163014.1128035-1-david@redhat.com Link: https://lkml.kernel.org/r/20250303163014.1128035-2-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Lance Yang <ioworker0@gmail.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Andy Lutomirks^H^Hski <luto@kernel.org> Cc: Borislav Betkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcow (Oracle) <willy@infradead.org> Cc: Michal Koutn <mkoutny@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: tejun heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zefan Li <lizefan.x@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Chia-Yu Chang says: ==================== AccECN protocol preparation patch series Please find the v7 v7 (03-Mar-2025) - Move 2 new patches added in v6 to the next AccECN patch series v6 (27-Dec-2024) - Avoid removing removing the potential CA_ACK_WIN_UPDATE in ack_ev_flags of patch #1 (Eric Dumazet <edumazet@google.com>) - Add reviewed-by tag in patches #2, #3, #4, #5, torvalds#6, torvalds#7, torvalds#8, torvalds#12, torvalds#14 - Foloiwng 2 new pathces are added after patch torvalds#9 (Patch that adds SKB_GSO_TCP_ACCECN) * New patch torvalds#10 to replace exisiting SKB_GSO_TCP_ECN with SKB_GSO_TCP_ACCECN in the driver to avoid CWR flag corruption * New patch torvalds#11 adds AccECN for virtio by adding new negotiation flag (VIRTIO_NET_F_HOST/GUEST_ACCECN) in feature handshake and translating Accurate ECN GSO flag between virtio_net_hdr (VIRTIO_NET_HDR_GSO_ACCECN) and skb header (SKB_GSO_TCP_ACCECN) - Add detailed changelog and comments in torvalds#13 (Eric Dumazet <edumazet@google.com>) - Move patch torvalds#14 to the next AccECN patch series (Eric Dumazet <edumazet@google.com>) v5 (5-Nov-2024) - Add helper function "tcp_flags_ntohs" to preserve last 2 bytes of TCP flags of patch #4 (Paolo Abeni <pabeni@redhat.com>) - Fix reverse X-max tree order of patches #4, torvalds#11 (Paolo Abeni <pabeni@redhat.com>) - Rename variable "delta" as "timestamp_delta" of patch #2 fo clariety - Remove patch torvalds#14 in this series (Paolo Abeni <pabeni@redhat.com>, Joel Granados <joel.granados@kernel.org>) v4 (21-Oct-2024) - Fix line length warning of patches #2, #4, torvalds#8, torvalds#10, torvalds#11, torvalds#14 - Fix spaces preferred around '|' (ctx:VxV) warning of patch torvalds#7 - Add missing CC'ed of patches #4, torvalds#12, torvalds#14 v3 (19-Oct-2024) - Fix build error in v2 v2 (18-Oct-2024) - Fix warning caused by NETIF_F_GSO_ACCECN_BIT in patch torvalds#9 (Jakub Kicinski <kuba@kernel.org>) The full patch series can be found in https://github.com/L4STeam/linux-net-next/commits/upstream_l4steam/ The Accurate ECN draft can be found in https://datatracker.ietf.org/doc/html/draft-ietf-tcpm-accurate-ecn-28 ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
…ge_order() Patch series "mm: MM owner tracking for large folios (!hugetlb) + CONFIG_NO_PAGE_MAPCOUNT", v3. Let's add an "easy" way to decide -- without false positives, without page-mapcounts and without page table/rmap scanning -- whether a large folio is "certainly mapped exclusively" into a single MM, or whether it "maybe mapped shared" into multiple MMs. Use that information to implement Copy-on-Write reuse, to convert folio_likely_mapped_shared() to folio_maybe_mapped_share(), and to introduce a kernel config option that lets us not use+maintain per-page mapcounts in large folios anymore. The bigger picture was presented at LSF/MM [1]. This series is effectively a follow-up on my early work [2], which implemented a more precise, but also more complicated, way to identify whether a large folio is "mapped shared" into multiple MMs or "mapped exclusively" into a single MM. 1 Patch Organization ==================== Patch #1 -> torvalds#6: make more room in order-1 folios, so we have two "unsigned long" available for our purposes Patch torvalds#7 -> torvalds#11: preparations Patch torvalds#12: MM owner tracking for large folios Patch torvalds#13: COW reuse for PTE-mapped anon THP Patch torvalds#14: folio_maybe_mapped_shared() Patch torvalds#15 -> torvalds#20: introduce and implement CONFIG_NO_PAGE_MAPCOUNT 2 MM owner tracking =================== We assign each MM a unique ID ("MM ID"), to be able to squeeze more information in our folios. On 32bit we use 15-bit IDs, on 64bit we use 31-bit IDs. For each large folios, we now store two MM-ID+mapcount ("slot") combinations: * mm0_id + mm0_mapcount * mm1_id + mm1_mapcount On 32bit, we use a 16-bit per-MM mapcount, on 64bit an ordinary 32bit mapcount. This way, we require 2x "unsigned long" on 32bit and 64bit for both slots. Paired with the large mapcount, we can reliably identify whether one of these MMs is the current owner (-> owns all mappings) or even holds all folio references (-> owns all mappings, and all references are from mappings). As long as only two MMs map folio pages at a time, we can reliably and precisely identify whether a large folio is "mapped shared" or "mapped exclusively". Any additional MM that starts mapping the folio while there are no free slots becomes an "untracked MM". If one such "untracked MM" is the last one mapping a folio exclusively, we will not detect the folio as "mapped exclusively" but instead as "maybe mapped shared". (exception: only a single mapping remains) So that's where the approach gets imprecise. For now, we use a bit-spinlock to sync the large mapcount + slots, and make sure we do keep the machinery fast, to not degrade (un)map performance drastically: for example, we make sure to only use a single atomic (when grabbing the bit-spinlock), like we would already perform when updating the large mapcount. 3 CONFIG_NO_PAGE_MAPCOUNT ========================= patch torvalds#15 -> torvalds#20 spell out and document what exactly is affected when not maintaining the per-page mapcounts in large folios anymore. Most importantly, as we cannot maintain folio->_nr_pages_mapped anymore when (un)mapping pages, we'll account a complete folio as mapped if a single page is mapped. In addition, we'll not detect partially mapped anonymous folios as such in all cases yet. Likely less relevant changes include that we might now under-estimate the USS (Unique Set Size) of a process, but never over-estimate it. The goal is to make CONFIG_NO_PAGE_MAPCOUNT the default at some point, to then slowly make it the only option, as we learn about real-life impacts and possible ways to mitigate them. 4 Performance ============= Detailed performance numbers were included in v1 [3], and not that much changed between v1 and v2. I did plenty of measurements on different systems in the meantime, that all revealed slightly different results. The pte-mapped-folio micro-benchmarks [4] are fairly sensitive to code layout changes on some systems. Especially the fork() benchmark started being more-shaky-than-before on recent kernels for some reason. In summary, with my micro-benchmarks: * Small folios are not impacted. * CoW performance seems to be mostly unchanged across all folios sizes. * CoW reuse performance of large folios now matches CoW reuse performance of small folios, because we now actually implement the CoW reuse optimization. On an Intel Xeon Silver 4210R I measured a ~65% reduction in runtime, on an arm64 system I measured ~54% reduction. * munmap() performance improves with CONFIG_NO_PAGE_MAPCOUNT. I saw double-digit % reduction (up to ~30% on an Intel Xeon Silver 4210R and up to ~70% on an AmpereOne A192-32X) with larger folios. The larger the folios, the larger the performance improvement. * munmao() performance very slightly (couple percent) degrades without CONFIG_NO_PAGE_MAPCOUNT for smaller folios. For larger folios, there seems to be no change at all. * fork() performance improves with CONFIG_NO_PAGE_MAPCOUNT. I saw double-digit % reduction (up to ~20% on an Intel Xeon Silver 4210R and up to ~10% on an AmpereOne A192-32X) with larger folios. The larger the folios, the larger the performance improvement. * While fork() performance without CONFIG_NO_PAGE_MAPCOUNT seems to be almost unchanged on some systems, I saw some degradation for smaller folios on the AmpereOne A192-32X. I did not investigate the details yet, but I suspect code layout changes or suboptimal code placement / inlining. I'm not to worried about the fork() micro-benchmarks for smaller folios given how shaky the results are lately and by how much we improved fork() performance recently. I also ran case-anon-cow-rand and case-anon-cow-seq part of vm-scalability, to assess the scalability and the impact of the bit-spinlock. My measurements on a two 2-socket 10-core Intel Xeon Silver 4210R CPU revealed no significant changes. Similarly, running these benchmarks with 2 MiB THPs enabled on the AmpereOne A192-32X with 192 cores, I got < 1% difference with < 1% stdev, which is nice. So far, I did not get my hands on a similarly large system with multiple sockets. I found no other fitting scalability benchmarks that seem to really hammer on concurrent mapping/unmapping of large folio pages like case-anon-cow-seq does. 5 Concerns ========== 5.1 Bit spinlock ---------------- I'm not quite happy about the bit-spinlock, but so far it does not seem to affect scalability in my measurements. If it ever becomes a problem we could either investigate improving the locking, or simply stopping the MM tracking once there are "too many mappings" and simply assume that the folio is "mapped shared" until it was freed. This would be similar (but slightly different) to the "0,1,2,stopped" counting idea Willy had at some point. Adding that logic to "stop tracking" adds more code to the hot path, so I avoided that for now. 5.2 folio_maybe_mapped_shared() ------------------------------- I documented the change from folio_likely_mapped_shared() to folio_maybe_mapped_shared() quite extensively. If we run into surprises, I have some ideas on how to resolve them. For now, I think we should be fine. 5.3 Added code to map/unmap hot path ------------------------------------ So far, it looks like the added code on the rmap hot path does not really seem to matter much in the bigger picture. I'd like to further reduce it (and possibly improve fork() performance further), but I don't easily see how right now. Well, and I am out of puff 🙂 Having that said, alternatives I considered (e.g., per-MM per-folio mapcount) would add a lot more overhead to these hot paths. 6 Future Work ============= 6.1 Large mapcount ------------------ It would be very handy if the large mapcount would count how often folio pages are actually mapped into page tables: a PMD on x86-64 would count 512 times. Calculating the average per-page mapcount will be easy, and remapping (PMD->PTE) folios would get even faster. That would also remove the need for the entire mapcount (except for PMD-sized folios for memory statistics reasons ...), and allow for mapping folios larger than PMDs (e.g., 4 MiB) easily. We likely would also have to take the same number of folio references to make our folio_mapcount() == folio_ref_count() work, and we'd want to be able to avoid mapcount+refcount overflows: this could already become an issue with pte-mapped PUD-sized folios (fsdax). One approach we discussed in the THP cabal meeting is (1) extending the mapcount for large folios to 64bit (at least on 64bit systems) and (2) keeping the refcount at 32bit, but (3) having exactly one reference if the the mapcount != 0. It should be doable, but there are some corner cases to consider on the unmap path; it is something that I will be looking into next. 6.2 hugetlb ----------- I'd love to make use of the same tracking also for hugetlb. The real problem is PMD table sharing: getting a page mapped by MM X and unmapped by MM Y will not work. With mshare, that problem should not exist (all mapping/unmapping will be routed through the mshare MM). [1] https://lwn.net/Articles/974223/ [2] https://lore.kernel.org/linux-mm/a9922f58-8129-4f15-b160-e0ace581bcbe@redhat.com/T/ [3] https://lkml.kernel.org/r/20240829165627.2256514-1-david@redhat.com [4] https://gitlab.com/davidhildenbrand/scratchspace/-/raw/main/pte-mapped-folio-benchmarks.c This patch (of 20): Let's factor it out into a simple helper function. This helper will also come in handy when working with code where we know that our folio is large. Maybe in the future we'll have the order readily available for small and large folios; in that case, folio_large_order() would simply translate to folio_order(). Link: https://lkml.kernel.org/r/20250303163014.1128035-1-david@redhat.com Link: https://lkml.kernel.org/r/20250303163014.1128035-2-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Lance Yang <ioworker0@gmail.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Andy Lutomirks^H^Hski <luto@kernel.org> Cc: Borislav Betkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcow (Oracle) <willy@infradead.org> Cc: Michal Koutn <mkoutny@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: tejun heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zefan Li <lizefan.x@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
…ge_order() Patch series "mm: MM owner tracking for large folios (!hugetlb) + CONFIG_NO_PAGE_MAPCOUNT", v3. Let's add an "easy" way to decide -- without false positives, without page-mapcounts and without page table/rmap scanning -- whether a large folio is "certainly mapped exclusively" into a single MM, or whether it "maybe mapped shared" into multiple MMs. Use that information to implement Copy-on-Write reuse, to convert folio_likely_mapped_shared() to folio_maybe_mapped_share(), and to introduce a kernel config option that lets us not use+maintain per-page mapcounts in large folios anymore. The bigger picture was presented at LSF/MM [1]. This series is effectively a follow-up on my early work [2], which implemented a more precise, but also more complicated, way to identify whether a large folio is "mapped shared" into multiple MMs or "mapped exclusively" into a single MM. 1 Patch Organization ==================== Patch #1 -> torvalds#6: make more room in order-1 folios, so we have two "unsigned long" available for our purposes Patch torvalds#7 -> torvalds#11: preparations Patch torvalds#12: MM owner tracking for large folios Patch torvalds#13: COW reuse for PTE-mapped anon THP Patch torvalds#14: folio_maybe_mapped_shared() Patch torvalds#15 -> torvalds#20: introduce and implement CONFIG_NO_PAGE_MAPCOUNT 2 MM owner tracking =================== We assign each MM a unique ID ("MM ID"), to be able to squeeze more information in our folios. On 32bit we use 15-bit IDs, on 64bit we use 31-bit IDs. For each large folios, we now store two MM-ID+mapcount ("slot") combinations: * mm0_id + mm0_mapcount * mm1_id + mm1_mapcount On 32bit, we use a 16-bit per-MM mapcount, on 64bit an ordinary 32bit mapcount. This way, we require 2x "unsigned long" on 32bit and 64bit for both slots. Paired with the large mapcount, we can reliably identify whether one of these MMs is the current owner (-> owns all mappings) or even holds all folio references (-> owns all mappings, and all references are from mappings). As long as only two MMs map folio pages at a time, we can reliably and precisely identify whether a large folio is "mapped shared" or "mapped exclusively". Any additional MM that starts mapping the folio while there are no free slots becomes an "untracked MM". If one such "untracked MM" is the last one mapping a folio exclusively, we will not detect the folio as "mapped exclusively" but instead as "maybe mapped shared". (exception: only a single mapping remains) So that's where the approach gets imprecise. For now, we use a bit-spinlock to sync the large mapcount + slots, and make sure we do keep the machinery fast, to not degrade (un)map performance drastically: for example, we make sure to only use a single atomic (when grabbing the bit-spinlock), like we would already perform when updating the large mapcount. 3 CONFIG_NO_PAGE_MAPCOUNT ========================= patch torvalds#15 -> torvalds#20 spell out and document what exactly is affected when not maintaining the per-page mapcounts in large folios anymore. Most importantly, as we cannot maintain folio->_nr_pages_mapped anymore when (un)mapping pages, we'll account a complete folio as mapped if a single page is mapped. In addition, we'll not detect partially mapped anonymous folios as such in all cases yet. Likely less relevant changes include that we might now under-estimate the USS (Unique Set Size) of a process, but never over-estimate it. The goal is to make CONFIG_NO_PAGE_MAPCOUNT the default at some point, to then slowly make it the only option, as we learn about real-life impacts and possible ways to mitigate them. 4 Performance ============= Detailed performance numbers were included in v1 [3], and not that much changed between v1 and v2. I did plenty of measurements on different systems in the meantime, that all revealed slightly different results. The pte-mapped-folio micro-benchmarks [4] are fairly sensitive to code layout changes on some systems. Especially the fork() benchmark started being more-shaky-than-before on recent kernels for some reason. In summary, with my micro-benchmarks: * Small folios are not impacted. * CoW performance seems to be mostly unchanged across all folios sizes. * CoW reuse performance of large folios now matches CoW reuse performance of small folios, because we now actually implement the CoW reuse optimization. On an Intel Xeon Silver 4210R I measured a ~65% reduction in runtime, on an arm64 system I measured ~54% reduction. * munmap() performance improves with CONFIG_NO_PAGE_MAPCOUNT. I saw double-digit % reduction (up to ~30% on an Intel Xeon Silver 4210R and up to ~70% on an AmpereOne A192-32X) with larger folios. The larger the folios, the larger the performance improvement. * munmao() performance very slightly (couple percent) degrades without CONFIG_NO_PAGE_MAPCOUNT for smaller folios. For larger folios, there seems to be no change at all. * fork() performance improves with CONFIG_NO_PAGE_MAPCOUNT. I saw double-digit % reduction (up to ~20% on an Intel Xeon Silver 4210R and up to ~10% on an AmpereOne A192-32X) with larger folios. The larger the folios, the larger the performance improvement. * While fork() performance without CONFIG_NO_PAGE_MAPCOUNT seems to be almost unchanged on some systems, I saw some degradation for smaller folios on the AmpereOne A192-32X. I did not investigate the details yet, but I suspect code layout changes or suboptimal code placement / inlining. I'm not to worried about the fork() micro-benchmarks for smaller folios given how shaky the results are lately and by how much we improved fork() performance recently. I also ran case-anon-cow-rand and case-anon-cow-seq part of vm-scalability, to assess the scalability and the impact of the bit-spinlock. My measurements on a two 2-socket 10-core Intel Xeon Silver 4210R CPU revealed no significant changes. Similarly, running these benchmarks with 2 MiB THPs enabled on the AmpereOne A192-32X with 192 cores, I got < 1% difference with < 1% stdev, which is nice. So far, I did not get my hands on a similarly large system with multiple sockets. I found no other fitting scalability benchmarks that seem to really hammer on concurrent mapping/unmapping of large folio pages like case-anon-cow-seq does. 5 Concerns ========== 5.1 Bit spinlock ---------------- I'm not quite happy about the bit-spinlock, but so far it does not seem to affect scalability in my measurements. If it ever becomes a problem we could either investigate improving the locking, or simply stopping the MM tracking once there are "too many mappings" and simply assume that the folio is "mapped shared" until it was freed. This would be similar (but slightly different) to the "0,1,2,stopped" counting idea Willy had at some point. Adding that logic to "stop tracking" adds more code to the hot path, so I avoided that for now. 5.2 folio_maybe_mapped_shared() ------------------------------- I documented the change from folio_likely_mapped_shared() to folio_maybe_mapped_shared() quite extensively. If we run into surprises, I have some ideas on how to resolve them. For now, I think we should be fine. 5.3 Added code to map/unmap hot path ------------------------------------ So far, it looks like the added code on the rmap hot path does not really seem to matter much in the bigger picture. I'd like to further reduce it (and possibly improve fork() performance further), but I don't easily see how right now. Well, and I am out of puff 🙂 Having that said, alternatives I considered (e.g., per-MM per-folio mapcount) would add a lot more overhead to these hot paths. 6 Future Work ============= 6.1 Large mapcount ------------------ It would be very handy if the large mapcount would count how often folio pages are actually mapped into page tables: a PMD on x86-64 would count 512 times. Calculating the average per-page mapcount will be easy, and remapping (PMD->PTE) folios would get even faster. That would also remove the need for the entire mapcount (except for PMD-sized folios for memory statistics reasons ...), and allow for mapping folios larger than PMDs (e.g., 4 MiB) easily. We likely would also have to take the same number of folio references to make our folio_mapcount() == folio_ref_count() work, and we'd want to be able to avoid mapcount+refcount overflows: this could already become an issue with pte-mapped PUD-sized folios (fsdax). One approach we discussed in the THP cabal meeting is (1) extending the mapcount for large folios to 64bit (at least on 64bit systems) and (2) keeping the refcount at 32bit, but (3) having exactly one reference if the the mapcount != 0. It should be doable, but there are some corner cases to consider on the unmap path; it is something that I will be looking into next. 6.2 hugetlb ----------- I'd love to make use of the same tracking also for hugetlb. The real problem is PMD table sharing: getting a page mapped by MM X and unmapped by MM Y will not work. With mshare, that problem should not exist (all mapping/unmapping will be routed through the mshare MM). [1] https://lwn.net/Articles/974223/ [2] https://lore.kernel.org/linux-mm/a9922f58-8129-4f15-b160-e0ace581bcbe@redhat.com/T/ [3] https://lkml.kernel.org/r/20240829165627.2256514-1-david@redhat.com [4] https://gitlab.com/davidhildenbrand/scratchspace/-/raw/main/pte-mapped-folio-benchmarks.c This patch (of 20): Let's factor it out into a simple helper function. This helper will also come in handy when working with code where we know that our folio is large. Maybe in the future we'll have the order readily available for small and large folios; in that case, folio_large_order() would simply translate to folio_order(). Link: https://lkml.kernel.org/r/20250303163014.1128035-1-david@redhat.com Link: https://lkml.kernel.org/r/20250303163014.1128035-2-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Lance Yang <ioworker0@gmail.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Andy Lutomirks^H^Hski <luto@kernel.org> Cc: Borislav Betkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcow (Oracle) <willy@infradead.org> Cc: Michal Koutn <mkoutny@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: tejun heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zefan Li <lizefan.x@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
there is a global spinlock between reset and clk, if locked in reset, then print some debug information, maybe dead-lock when uart driver try to disable clk. Backtrace stopped: frame did not save the PC (gdb) thread 4 [Switching to thread 4 (Thread 4)] #0 cpu_relax () at ./arch/riscv/include/asm/vdso/processor.h:22 22 ./arch/riscv/include/asm/vdso/processor.h: No such file or directory. (gdb) bt #0 cpu_relax () at ./arch/riscv/include/asm/vdso/processor.h:22 #1 arch_spin_lock (lock=lock@entry=0xffffffff81a57cd0 <enable_lock>) at ./include/asm-generic/spinlock.h:49 #2 do_raw_spin_lock (lock=lock@entry=0xffffffff81a57cd0 <enable_lock>) at ./include/linux/spinlock.h:186 #3 0xffffffff80aa21ce in __raw_spin_lock_irqsave (lock=0xffffffff81a57cd0 <enable_lock>) at ./include/linux/spinlock_api_smp.h:111 #4 _raw_spin_lock_irqsave (lock=lock@entry=0xffffffff81a57cd0 <enable_lock>) at kernel/locking/spinlock.c:162 #5 0xffffffff80563416 in clk_enable_lock () at ./include/linux/spinlock.h:325 torvalds#6 0xffffffff805648de in clk_core_disable_lock (core=0xffffffd900512500) at drivers/clk/clk.c:1062 torvalds#7 0xffffffff8056527e in clk_disable (clk=<optimized out>) at drivers/clk/clk.c:1084 torvalds#8 clk_disable (clk=0xffffffd9048b5100) at drivers/clk/clk.c:1079 torvalds#9 0xffffffff8059e5d4 in serial_pxa_console_write (co=<optimized out>, s=0xffffffff81a68250 <text> "[ 14.708612] [RESET][spacemit_reset_set][373]:assert = 1, id = 59 \n", count=<optimized out>) at drivers/tty/serial/pxa_k1x.c:1724 torvalds#10 0xffffffff8004a34c in call_console_driver (dropped_text=0xffffffff81a68650 <dropped_text> "", len=69, text=0xffffffff81a68250 <text> "[ 14.708612] [RESET][spacemit_reset_set][373]:assert = 1, id = 59 \n", con=0xffffffff81964c10 <serial_pxa_console>) at kernel/printk/printk.c:1942 torvalds#11 console_emit_next_record (con=con@entry=0xffffffff81964c10 <serial_pxa_console>, ext_text=<optimized out>, dropped_text=0xffffffff81a68650 <dropped_text> "", handover=0xffffffc80578baa7, text=0xffffffff81a68250 <text> "[ 14.708612] [RESET][spacemit_reset_set][373]:assert = 1, id = 59 \n") at kernel/printk/printk.c:2731 torvalds#12 0xffffffff8004a49a in console_flush_all (handover=0xffffffc80578baa7, next_seq=<synthetic pointer>, do_cond_resched=false) at kernel/printk/printk.c:2793 torvalds#13 console_unlock () at kernel/printk/printk.c:2860 torvalds#14 0xffffffff8004b388 in vprintk_emit (facility=facility@entry=0, level=<optimized out>, level@entry=-1, dev_info=dev_info@entry=0x0, fmt=<optimized out>, args=<optimized out>) at kernel/printk/printk.c:2268 torvalds#15 0xffffffff8004b3ae in vprintk_default (fmt=<optimized out>, args=<optimized out>) at kernel/printk/printk.c:2279 torvalds#16 0xffffffff8004b646 in vprintk (fmt=fmt@entry=0xffffffff813be470 "\001\066[RESET][%s][%d]:assert = %d, id = %d \n", args=args@entry=0xffffffc80578bbd8) at kernel/printk/printk_safe.c:50 torvalds#17 0xffffffff80a880d6 in _printk (fmt=fmt@entry=0xffffffff813be470 "\001\066[RESET][%s][%d]:assert = %d, id = %d \n") at kernel/printk/printk.c:2289 torvalds#18 0xffffffff80a90bb6 in spacemit_reset_set (rcdev=rcdev@entry=0xffffffff81f563a8 <k1x_reset_controller+8>, id=id@entry=59, assert=assert@entry=true) at drivers/reset/reset-spacemit-k1x.c:373 torvalds#19 0xffffffff805823b6 in spacemit_reset_update (assert=true, id=59, rcdev=0xffffffff81f563a8 <k1x_reset_controller+8>) at drivers/reset/reset-spacemit-k1x.c:401 torvalds#20 spacemit_reset_update (assert=true, id=59, rcdev=0xffffffff81f563a8 <k1x_reset_controller+8>) at drivers/reset/reset-spacemit-k1x.c:387 torvalds#21 spacemit_reset_assert (rcdev=0xffffffff81f563a8 <k1x_reset_controller+8>, id=59) at drivers/reset/reset-spacemit-k1x.c:413 torvalds#22 0xffffffff8058158e in reset_control_assert (rstc=0xffffffd902b2f280) at drivers/reset/core.c:485 torvalds#23 0xffffffff807ccf96 in cpp_disable_clocks (cpp_dev=cpp_dev@entry=0xffffffd904cc9040) at drivers/media/platform/spacemit/camera/cam_cpp/k1x_cpp.c:960 torvalds#24 0xffffffff807cd0b2 in cpp_release_hardware (cpp_dev=cpp_dev@entry=0xffffffd904cc9040) at drivers/media/platform/spacemit/camera/cam_cpp/k1x_cpp.c:1038 torvalds#25 0xffffffff807cd990 in cpp_close_node (sd=<optimized out>, fh=<optimized out>) at drivers/media/platform/spacemit/camera/cam_cpp/k1x_cpp.c:1135 torvalds#26 0xffffffff8079525e in subdev_close (file=0xffffffd906645d00) at drivers/media/v4l2-core/v4l2-subdev.c:105 torvalds#27 0xffffffff8078e49e in v4l2_release (inode=<optimized out>, filp=0xffffffd906645d00) at drivers/media/v4l2-core/v4l2-dev.c:459 torvalds#28 0xffffffff80154974 in __fput (file=0xffffffd906645d00) at fs/file_table.c:320 torvalds#29 0xffffffff80154aa2 in ____fput (work=<optimized out>) at fs/file_table.c:348 torvalds#30 0xffffffff8002677e in task_work_run () at kernel/task_work.c:179 torvalds#31 0xffffffff800053b4 in resume_user_mode_work (regs=0xffffffc80578bee0) at ./include/linux/resume_user_mode.h:49 torvalds#32 do_work_pending (regs=0xffffffc80578bee0, thread_info_flags=<optimized out>) at arch/riscv/kernel/signal.c:478 torvalds#33 0xffffffff800039c6 in handle_exception () at arch/riscv/kernel/entry.S:374 Backtrace stopped: frame did not save the PC (gdb) thread 1 [Switching to thread 1 (Thread 1)] #0 0xffffffff80047e9c in arch_spin_lock (lock=lock@entry=0xffffffff81a57cd8 <g_cru_lock>) at ./include/asm-generic/spinlock.h:49 49 ./include/asm-generic/spinlock.h: No such file or directory. (gdb) bt #0 0xffffffff80047e9c in arch_spin_lock (lock=lock@entry=0xffffffff81a57cd8 <g_cru_lock>) at ./include/asm-generic/spinlock.h:49 #1 do_raw_spin_lock (lock=lock@entry=0xffffffff81a57cd8 <g_cru_lock>) at ./include/linux/spinlock.h:186 #2 0xffffffff80aa21ce in __raw_spin_lock_irqsave (lock=0xffffffff81a57cd8 <g_cru_lock>) at ./include/linux/spinlock_api_smp.h:111 #3 _raw_spin_lock_irqsave (lock=0xffffffff81a57cd8 <g_cru_lock>) at kernel/locking/spinlock.c:162 #4 0xffffffff8056c4cc in ccu_mix_disable (hw=0xffffffff81956858 <sdh2_clk+120>) at ./include/linux/spinlock.h:325 #5 0xffffffff80564832 in clk_core_disable (core=0xffffffd900529900) at drivers/clk/clk.c:1051 torvalds#6 clk_core_disable (core=0xffffffd900529900) at drivers/clk/clk.c:1031 torvalds#7 0xffffffff805648e6 in clk_core_disable_lock (core=0xffffffd900529900) at drivers/clk/clk.c:1063 torvalds#8 0xffffffff8056527e in clk_disable (clk=<optimized out>) at drivers/clk/clk.c:1084 torvalds#9 clk_disable (clk=clk@entry=0xffffffd904fafa80) at drivers/clk/clk.c:1079 torvalds#10 0xffffffff808bb898 in clk_disable_unprepare (clk=0xffffffd904fafa80) at ./include/linux/clk.h:1085 torvalds#11 0xffffffff808bb916 in spacemit_sdhci_runtime_suspend (dev=<optimized out>) at drivers/mmc/host/sdhci-of-k1x.c:1469 torvalds#12 0xffffffff8066e8e2 in pm_generic_runtime_suspend (dev=<optimized out>) at drivers/base/power/generic_ops.c:25 torvalds#13 0xffffffff80670398 in __rpm_callback (cb=cb@entry=0xffffffff8066e8ca <pm_generic_runtime_suspend>, dev=dev@entry=0xffffffd9018a2810) at drivers/base/power/runtime.c:395 torvalds#14 0xffffffff806704b8 in rpm_callback (cb=cb@entry=0xffffffff8066e8ca <pm_generic_runtime_suspend>, dev=dev@entry=0xffffffd9018a2810) at drivers/base/power/runtime.c:529 torvalds#15 0xffffffff80670bdc in rpm_suspend (dev=0xffffffd9018a2810, rpmflags=<optimized out>) at drivers/base/power/runtime.c:672 torvalds#16 0xffffffff806716de in pm_runtime_work (work=0xffffffd9018a2948) at drivers/base/power/runtime.c:974 torvalds#17 0xffffffff800236f4 in process_one_work (worker=worker@entry=0xffffffd9013ee9c0, work=0xffffffd9018a2948) at kernel/workqueue.c:2289 torvalds#18 0xffffffff80023ba6 in worker_thread (__worker=0xffffffd9013ee9c0) at kernel/workqueue.c:2436 torvalds#19 0xffffffff80028bb2 in kthread (_create=0xffffffd9017de840) at kernel/kthread.c:376 torvalds#20 0xffffffff80003934 in handle_exception () at arch/riscv/kernel/entry.S:249 Backtrace stopped: frame did not save the PC (gdb) Change-Id: Ia95b41ffd6c1893c9c5e9c1c9fc0c155ea902d2c
…able the irq fiex bug like: [ 0.327242] pinctrl-single d401e000.pinctrl: 256 pins, size 1024 [ 0.378624] irq 12: nobody cared (try booting with the "irqpoll" option) [ 0.382473] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.6.36 #0 [ 0.388460] Hardware name: spacemit k1-x MUSE-N1 board (DT) [ 0.394103] Call Trace: [ 0.396619] [<ffffffff80005cf8>] dump_backtrace+0x1c/0x24 [ 0.402088] [<ffffffff80005d28>] show_stack+0x28/0x34 [ 0.407209] [<ffffffff80c9cd7c>] dump_stack_lvl+0x40/0x58 [ 0.412678] [<ffffffff80c9cda8>] dump_stack+0x14/0x1c [ 0.417799] [<ffffffff8005e366>] __report_bad_irq+0x42/0xba [ 0.423442] [<ffffffff8005e6b2>] note_interrupt+0x22e/0x270 [ 0.429084] [<ffffffff8005c144>] handle_irq_event+0x84/0x86 [ 0.434726] [<ffffffff8005f2cc>] handle_fasteoi_irq+0xc0/0x20e [ 0.440629] [<ffffffff8005b60e>] generic_handle_domain_irq+0x1c/0x2a [ 0.447053] [<ffffffff80635e30>] plic_handle_irq+0x84/0xf8 [ 0.452608] [<ffffffff8005b60e>] generic_handle_domain_irq+0x1c/0x2a [ 0.459032] [<ffffffff80635aa2>] riscv_intc_irq+0x26/0x60 [ 0.464501] [<ffffffff80cbbb66>] handle_riscv_irq+0x6e/0xb0 [ 0.470144] [<ffffffff80cc469e>] call_on_irq_stack+0x32/0x40 [ 0.475873] handlers: [ 0.478216] [<(____ptrval____)>] pcs_irq_handler [ 0.482903] Disabling IRQ torvalds#12 Change-Id: I4627f1b1e95d87eff88dfd4c1a331e91bd7fa1e5
The env.pmu_mapping can be leaked when it reads data from a pipe on AMD. For a pipe data, it reads the header data including pmu_mapping from PERF_RECORD_HEADER_FEATURE runtime. But it's already set in: perf_session__new() __perf_session__new() evlist__init_trace_event_sample_raw() evlist__has_amd_ibs() perf_env__nr_pmu_mappings() Then it'll overwrite that when it processes the HEADER_FEATURE record. Here's a report from address sanitizer. Direct leak of 2689 byte(s) in 1 object(s) allocated from: #0 0x7fed8f814596 in realloc ../../../../src/libsanitizer/lsan/lsan_interceptors.cpp:98 #1 0x5595a7d416b1 in strbuf_grow util/strbuf.c:64 #2 0x5595a7d414ef in strbuf_init util/strbuf.c:25 #3 0x5595a7d0f4b7 in perf_env__read_pmu_mappings util/env.c:362 #4 0x5595a7d12ab7 in perf_env__nr_pmu_mappings util/env.c:517 #5 0x5595a7d89d2f in evlist__has_amd_ibs util/amd-sample-raw.c:315 torvalds#6 0x5595a7d87fb2 in evlist__init_trace_event_sample_raw util/sample-raw.c:23 torvalds#7 0x5595a7d7f893 in __perf_session__new util/session.c:179 torvalds#8 0x5595a7b79572 in perf_session__new util/session.h:115 torvalds#9 0x5595a7b7e9dc in cmd_report builtin-report.c:1603 torvalds#10 0x5595a7c019eb in run_builtin perf.c:351 torvalds#11 0x5595a7c01c92 in handle_internal_command perf.c:404 torvalds#12 0x5595a7c01deb in run_argv perf.c:448 torvalds#13 0x5595a7c02134 in main perf.c:556 torvalds#14 0x7fed85833d67 in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58 Let's free the existing pmu_mapping data if any. Cc: Ravi Bangoria <ravi.bangoria@amd.com> Link: https://lore.kernel.org/r/20250311000416.817631-1-namhyung@kernel.org Signed-off-by: Namhyung Kim <namhyung@kernel.org>
there is a global spinlock between reset and clk, if locked in reset, then print some debug information, maybe dead-lock when uart driver try to disable clk. Backtrace stopped: frame did not save the PC (gdb) thread 4 [Switching to thread 4 (Thread 4)] #0 cpu_relax () at ./arch/riscv/include/asm/vdso/processor.h:22 22 ./arch/riscv/include/asm/vdso/processor.h: No such file or directory. (gdb) bt #0 cpu_relax () at ./arch/riscv/include/asm/vdso/processor.h:22 #1 arch_spin_lock (lock=lock@entry=0xffffffff81a57cd0 <enable_lock>) at ./include/asm-generic/spinlock.h:49 #2 do_raw_spin_lock (lock=lock@entry=0xffffffff81a57cd0 <enable_lock>) at ./include/linux/spinlock.h:186 #3 0xffffffff80aa21ce in __raw_spin_lock_irqsave (lock=0xffffffff81a57cd0 <enable_lock>) at ./include/linux/spinlock_api_smp.h:111 #4 _raw_spin_lock_irqsave (lock=lock@entry=0xffffffff81a57cd0 <enable_lock>) at kernel/locking/spinlock.c:162 #5 0xffffffff80563416 in clk_enable_lock () at ./include/linux/spinlock.h:325 torvalds#6 0xffffffff805648de in clk_core_disable_lock (core=0xffffffd900512500) at drivers/clk/clk.c:1062 torvalds#7 0xffffffff8056527e in clk_disable (clk=<optimized out>) at drivers/clk/clk.c:1084 torvalds#8 clk_disable (clk=0xffffffd9048b5100) at drivers/clk/clk.c:1079 torvalds#9 0xffffffff8059e5d4 in serial_pxa_console_write (co=<optimized out>, s=0xffffffff81a68250 <text> "[ 14.708612] [RESET][spacemit_reset_set][373]:assert = 1, id = 59 \n", count=<optimized out>) at drivers/tty/serial/pxa_k1x.c:1724 torvalds#10 0xffffffff8004a34c in call_console_driver (dropped_text=0xffffffff81a68650 <dropped_text> "", len=69, text=0xffffffff81a68250 <text> "[ 14.708612] [RESET][spacemit_reset_set][373]:assert = 1, id = 59 \n", con=0xffffffff81964c10 <serial_pxa_console>) at kernel/printk/printk.c:1942 torvalds#11 console_emit_next_record (con=con@entry=0xffffffff81964c10 <serial_pxa_console>, ext_text=<optimized out>, dropped_text=0xffffffff81a68650 <dropped_text> "", handover=0xffffffc80578baa7, text=0xffffffff81a68250 <text> "[ 14.708612] [RESET][spacemit_reset_set][373]:assert = 1, id = 59 \n") at kernel/printk/printk.c:2731 torvalds#12 0xffffffff8004a49a in console_flush_all (handover=0xffffffc80578baa7, next_seq=<synthetic pointer>, do_cond_resched=false) at kernel/printk/printk.c:2793 torvalds#13 console_unlock () at kernel/printk/printk.c:2860 torvalds#14 0xffffffff8004b388 in vprintk_emit (facility=facility@entry=0, level=<optimized out>, level@entry=-1, dev_info=dev_info@entry=0x0, fmt=<optimized out>, args=<optimized out>) at kernel/printk/printk.c:2268 torvalds#15 0xffffffff8004b3ae in vprintk_default (fmt=<optimized out>, args=<optimized out>) at kernel/printk/printk.c:2279 torvalds#16 0xffffffff8004b646 in vprintk (fmt=fmt@entry=0xffffffff813be470 "\001\066[RESET][%s][%d]:assert = %d, id = %d \n", args=args@entry=0xffffffc80578bbd8) at kernel/printk/printk_safe.c:50 torvalds#17 0xffffffff80a880d6 in _printk (fmt=fmt@entry=0xffffffff813be470 "\001\066[RESET][%s][%d]:assert = %d, id = %d \n") at kernel/printk/printk.c:2289 torvalds#18 0xffffffff80a90bb6 in spacemit_reset_set (rcdev=rcdev@entry=0xffffffff81f563a8 <k1x_reset_controller+8>, id=id@entry=59, assert=assert@entry=true) at drivers/reset/reset-spacemit-k1x.c:373 torvalds#19 0xffffffff805823b6 in spacemit_reset_update (assert=true, id=59, rcdev=0xffffffff81f563a8 <k1x_reset_controller+8>) at drivers/reset/reset-spacemit-k1x.c:401 torvalds#20 spacemit_reset_update (assert=true, id=59, rcdev=0xffffffff81f563a8 <k1x_reset_controller+8>) at drivers/reset/reset-spacemit-k1x.c:387 torvalds#21 spacemit_reset_assert (rcdev=0xffffffff81f563a8 <k1x_reset_controller+8>, id=59) at drivers/reset/reset-spacemit-k1x.c:413 torvalds#22 0xffffffff8058158e in reset_control_assert (rstc=0xffffffd902b2f280) at drivers/reset/core.c:485 torvalds#23 0xffffffff807ccf96 in cpp_disable_clocks (cpp_dev=cpp_dev@entry=0xffffffd904cc9040) at drivers/media/platform/spacemit/camera/cam_cpp/k1x_cpp.c:960 torvalds#24 0xffffffff807cd0b2 in cpp_release_hardware (cpp_dev=cpp_dev@entry=0xffffffd904cc9040) at drivers/media/platform/spacemit/camera/cam_cpp/k1x_cpp.c:1038 torvalds#25 0xffffffff807cd990 in cpp_close_node (sd=<optimized out>, fh=<optimized out>) at drivers/media/platform/spacemit/camera/cam_cpp/k1x_cpp.c:1135 torvalds#26 0xffffffff8079525e in subdev_close (file=0xffffffd906645d00) at drivers/media/v4l2-core/v4l2-subdev.c:105 torvalds#27 0xffffffff8078e49e in v4l2_release (inode=<optimized out>, filp=0xffffffd906645d00) at drivers/media/v4l2-core/v4l2-dev.c:459 torvalds#28 0xffffffff80154974 in __fput (file=0xffffffd906645d00) at fs/file_table.c:320 torvalds#29 0xffffffff80154aa2 in ____fput (work=<optimized out>) at fs/file_table.c:348 torvalds#30 0xffffffff8002677e in task_work_run () at kernel/task_work.c:179 torvalds#31 0xffffffff800053b4 in resume_user_mode_work (regs=0xffffffc80578bee0) at ./include/linux/resume_user_mode.h:49 torvalds#32 do_work_pending (regs=0xffffffc80578bee0, thread_info_flags=<optimized out>) at arch/riscv/kernel/signal.c:478 torvalds#33 0xffffffff800039c6 in handle_exception () at arch/riscv/kernel/entry.S:374 Backtrace stopped: frame did not save the PC (gdb) thread 1 [Switching to thread 1 (Thread 1)] #0 0xffffffff80047e9c in arch_spin_lock (lock=lock@entry=0xffffffff81a57cd8 <g_cru_lock>) at ./include/asm-generic/spinlock.h:49 49 ./include/asm-generic/spinlock.h: No such file or directory. (gdb) bt #0 0xffffffff80047e9c in arch_spin_lock (lock=lock@entry=0xffffffff81a57cd8 <g_cru_lock>) at ./include/asm-generic/spinlock.h:49 #1 do_raw_spin_lock (lock=lock@entry=0xffffffff81a57cd8 <g_cru_lock>) at ./include/linux/spinlock.h:186 #2 0xffffffff80aa21ce in __raw_spin_lock_irqsave (lock=0xffffffff81a57cd8 <g_cru_lock>) at ./include/linux/spinlock_api_smp.h:111 #3 _raw_spin_lock_irqsave (lock=0xffffffff81a57cd8 <g_cru_lock>) at kernel/locking/spinlock.c:162 #4 0xffffffff8056c4cc in ccu_mix_disable (hw=0xffffffff81956858 <sdh2_clk+120>) at ./include/linux/spinlock.h:325 #5 0xffffffff80564832 in clk_core_disable (core=0xffffffd900529900) at drivers/clk/clk.c:1051 torvalds#6 clk_core_disable (core=0xffffffd900529900) at drivers/clk/clk.c:1031 torvalds#7 0xffffffff805648e6 in clk_core_disable_lock (core=0xffffffd900529900) at drivers/clk/clk.c:1063 torvalds#8 0xffffffff8056527e in clk_disable (clk=<optimized out>) at drivers/clk/clk.c:1084 torvalds#9 clk_disable (clk=clk@entry=0xffffffd904fafa80) at drivers/clk/clk.c:1079 torvalds#10 0xffffffff808bb898 in clk_disable_unprepare (clk=0xffffffd904fafa80) at ./include/linux/clk.h:1085 torvalds#11 0xffffffff808bb916 in spacemit_sdhci_runtime_suspend (dev=<optimized out>) at drivers/mmc/host/sdhci-of-k1x.c:1469 torvalds#12 0xffffffff8066e8e2 in pm_generic_runtime_suspend (dev=<optimized out>) at drivers/base/power/generic_ops.c:25 torvalds#13 0xffffffff80670398 in __rpm_callback (cb=cb@entry=0xffffffff8066e8ca <pm_generic_runtime_suspend>, dev=dev@entry=0xffffffd9018a2810) at drivers/base/power/runtime.c:395 torvalds#14 0xffffffff806704b8 in rpm_callback (cb=cb@entry=0xffffffff8066e8ca <pm_generic_runtime_suspend>, dev=dev@entry=0xffffffd9018a2810) at drivers/base/power/runtime.c:529 torvalds#15 0xffffffff80670bdc in rpm_suspend (dev=0xffffffd9018a2810, rpmflags=<optimized out>) at drivers/base/power/runtime.c:672 torvalds#16 0xffffffff806716de in pm_runtime_work (work=0xffffffd9018a2948) at drivers/base/power/runtime.c:974 torvalds#17 0xffffffff800236f4 in process_one_work (worker=worker@entry=0xffffffd9013ee9c0, work=0xffffffd9018a2948) at kernel/workqueue.c:2289 torvalds#18 0xffffffff80023ba6 in worker_thread (__worker=0xffffffd9013ee9c0) at kernel/workqueue.c:2436 torvalds#19 0xffffffff80028bb2 in kthread (_create=0xffffffd9017de840) at kernel/kthread.c:376 torvalds#20 0xffffffff80003934 in handle_exception () at arch/riscv/kernel/entry.S:249 Backtrace stopped: frame did not save the PC (gdb) Change-Id: Ia95b41ffd6c1893c9c5e9c1c9fc0c155ea902d2c
…able the irq fiex bug like: [ 0.327242] pinctrl-single d401e000.pinctrl: 256 pins, size 1024 [ 0.378624] irq 12: nobody cared (try booting with the "irqpoll" option) [ 0.382473] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.6.36 #0 [ 0.388460] Hardware name: spacemit k1-x MUSE-N1 board (DT) [ 0.394103] Call Trace: [ 0.396619] [<ffffffff80005cf8>] dump_backtrace+0x1c/0x24 [ 0.402088] [<ffffffff80005d28>] show_stack+0x28/0x34 [ 0.407209] [<ffffffff80c9cd7c>] dump_stack_lvl+0x40/0x58 [ 0.412678] [<ffffffff80c9cda8>] dump_stack+0x14/0x1c [ 0.417799] [<ffffffff8005e366>] __report_bad_irq+0x42/0xba [ 0.423442] [<ffffffff8005e6b2>] note_interrupt+0x22e/0x270 [ 0.429084] [<ffffffff8005c144>] handle_irq_event+0x84/0x86 [ 0.434726] [<ffffffff8005f2cc>] handle_fasteoi_irq+0xc0/0x20e [ 0.440629] [<ffffffff8005b60e>] generic_handle_domain_irq+0x1c/0x2a [ 0.447053] [<ffffffff80635e30>] plic_handle_irq+0x84/0xf8 [ 0.452608] [<ffffffff8005b60e>] generic_handle_domain_irq+0x1c/0x2a [ 0.459032] [<ffffffff80635aa2>] riscv_intc_irq+0x26/0x60 [ 0.464501] [<ffffffff80cbbb66>] handle_riscv_irq+0x6e/0xb0 [ 0.470144] [<ffffffff80cc469e>] call_on_irq_stack+0x32/0x40 [ 0.475873] handlers: [ 0.478216] [<(____ptrval____)>] pcs_irq_handler [ 0.482903] Disabling IRQ torvalds#12 Change-Id: I4627f1b1e95d87eff88dfd4c1a331e91bd7fa1e5
there is a global spinlock between reset and clk, if locked in reset, then print some debug information, maybe dead-lock when uart driver try to disable clk. Backtrace stopped: frame did not save the PC (gdb) thread 4 [Switching to thread 4 (Thread 4)] #0 cpu_relax () at ./arch/riscv/include/asm/vdso/processor.h:22 22 ./arch/riscv/include/asm/vdso/processor.h: No such file or directory. (gdb) bt #0 cpu_relax () at ./arch/riscv/include/asm/vdso/processor.h:22 #1 arch_spin_lock (lock=lock@entry=0xffffffff81a57cd0 <enable_lock>) at ./include/asm-generic/spinlock.h:49 #2 do_raw_spin_lock (lock=lock@entry=0xffffffff81a57cd0 <enable_lock>) at ./include/linux/spinlock.h:186 #3 0xffffffff80aa21ce in __raw_spin_lock_irqsave (lock=0xffffffff81a57cd0 <enable_lock>) at ./include/linux/spinlock_api_smp.h:111 #4 _raw_spin_lock_irqsave (lock=lock@entry=0xffffffff81a57cd0 <enable_lock>) at kernel/locking/spinlock.c:162 #5 0xffffffff80563416 in clk_enable_lock () at ./include/linux/spinlock.h:325 torvalds#6 0xffffffff805648de in clk_core_disable_lock (core=0xffffffd900512500) at drivers/clk/clk.c:1062 torvalds#7 0xffffffff8056527e in clk_disable (clk=<optimized out>) at drivers/clk/clk.c:1084 torvalds#8 clk_disable (clk=0xffffffd9048b5100) at drivers/clk/clk.c:1079 torvalds#9 0xffffffff8059e5d4 in serial_pxa_console_write (co=<optimized out>, s=0xffffffff81a68250 <text> "[ 14.708612] [RESET][spacemit_reset_set][373]:assert = 1, id = 59 \n", count=<optimized out>) at drivers/tty/serial/pxa_k1x.c:1724 torvalds#10 0xffffffff8004a34c in call_console_driver (dropped_text=0xffffffff81a68650 <dropped_text> "", len=69, text=0xffffffff81a68250 <text> "[ 14.708612] [RESET][spacemit_reset_set][373]:assert = 1, id = 59 \n", con=0xffffffff81964c10 <serial_pxa_console>) at kernel/printk/printk.c:1942 torvalds#11 console_emit_next_record (con=con@entry=0xffffffff81964c10 <serial_pxa_console>, ext_text=<optimized out>, dropped_text=0xffffffff81a68650 <dropped_text> "", handover=0xffffffc80578baa7, text=0xffffffff81a68250 <text> "[ 14.708612] [RESET][spacemit_reset_set][373]:assert = 1, id = 59 \n") at kernel/printk/printk.c:2731 torvalds#12 0xffffffff8004a49a in console_flush_all (handover=0xffffffc80578baa7, next_seq=<synthetic pointer>, do_cond_resched=false) at kernel/printk/printk.c:2793 torvalds#13 console_unlock () at kernel/printk/printk.c:2860 torvalds#14 0xffffffff8004b388 in vprintk_emit (facility=facility@entry=0, level=<optimized out>, level@entry=-1, dev_info=dev_info@entry=0x0, fmt=<optimized out>, args=<optimized out>) at kernel/printk/printk.c:2268 torvalds#15 0xffffffff8004b3ae in vprintk_default (fmt=<optimized out>, args=<optimized out>) at kernel/printk/printk.c:2279 torvalds#16 0xffffffff8004b646 in vprintk (fmt=fmt@entry=0xffffffff813be470 "\001\066[RESET][%s][%d]:assert = %d, id = %d \n", args=args@entry=0xffffffc80578bbd8) at kernel/printk/printk_safe.c:50 torvalds#17 0xffffffff80a880d6 in _printk (fmt=fmt@entry=0xffffffff813be470 "\001\066[RESET][%s][%d]:assert = %d, id = %d \n") at kernel/printk/printk.c:2289 torvalds#18 0xffffffff80a90bb6 in spacemit_reset_set (rcdev=rcdev@entry=0xffffffff81f563a8 <k1x_reset_controller+8>, id=id@entry=59, assert=assert@entry=true) at drivers/reset/reset-spacemit-k1x.c:373 torvalds#19 0xffffffff805823b6 in spacemit_reset_update (assert=true, id=59, rcdev=0xffffffff81f563a8 <k1x_reset_controller+8>) at drivers/reset/reset-spacemit-k1x.c:401 torvalds#20 spacemit_reset_update (assert=true, id=59, rcdev=0xffffffff81f563a8 <k1x_reset_controller+8>) at drivers/reset/reset-spacemit-k1x.c:387 torvalds#21 spacemit_reset_assert (rcdev=0xffffffff81f563a8 <k1x_reset_controller+8>, id=59) at drivers/reset/reset-spacemit-k1x.c:413 torvalds#22 0xffffffff8058158e in reset_control_assert (rstc=0xffffffd902b2f280) at drivers/reset/core.c:485 torvalds#23 0xffffffff807ccf96 in cpp_disable_clocks (cpp_dev=cpp_dev@entry=0xffffffd904cc9040) at drivers/media/platform/spacemit/camera/cam_cpp/k1x_cpp.c:960 torvalds#24 0xffffffff807cd0b2 in cpp_release_hardware (cpp_dev=cpp_dev@entry=0xffffffd904cc9040) at drivers/media/platform/spacemit/camera/cam_cpp/k1x_cpp.c:1038 torvalds#25 0xffffffff807cd990 in cpp_close_node (sd=<optimized out>, fh=<optimized out>) at drivers/media/platform/spacemit/camera/cam_cpp/k1x_cpp.c:1135 torvalds#26 0xffffffff8079525e in subdev_close (file=0xffffffd906645d00) at drivers/media/v4l2-core/v4l2-subdev.c:105 torvalds#27 0xffffffff8078e49e in v4l2_release (inode=<optimized out>, filp=0xffffffd906645d00) at drivers/media/v4l2-core/v4l2-dev.c:459 torvalds#28 0xffffffff80154974 in __fput (file=0xffffffd906645d00) at fs/file_table.c:320 torvalds#29 0xffffffff80154aa2 in ____fput (work=<optimized out>) at fs/file_table.c:348 torvalds#30 0xffffffff8002677e in task_work_run () at kernel/task_work.c:179 torvalds#31 0xffffffff800053b4 in resume_user_mode_work (regs=0xffffffc80578bee0) at ./include/linux/resume_user_mode.h:49 torvalds#32 do_work_pending (regs=0xffffffc80578bee0, thread_info_flags=<optimized out>) at arch/riscv/kernel/signal.c:478 torvalds#33 0xffffffff800039c6 in handle_exception () at arch/riscv/kernel/entry.S:374 Backtrace stopped: frame did not save the PC (gdb) thread 1 [Switching to thread 1 (Thread 1)] #0 0xffffffff80047e9c in arch_spin_lock (lock=lock@entry=0xffffffff81a57cd8 <g_cru_lock>) at ./include/asm-generic/spinlock.h:49 49 ./include/asm-generic/spinlock.h: No such file or directory. (gdb) bt #0 0xffffffff80047e9c in arch_spin_lock (lock=lock@entry=0xffffffff81a57cd8 <g_cru_lock>) at ./include/asm-generic/spinlock.h:49 #1 do_raw_spin_lock (lock=lock@entry=0xffffffff81a57cd8 <g_cru_lock>) at ./include/linux/spinlock.h:186 #2 0xffffffff80aa21ce in __raw_spin_lock_irqsave (lock=0xffffffff81a57cd8 <g_cru_lock>) at ./include/linux/spinlock_api_smp.h:111 #3 _raw_spin_lock_irqsave (lock=0xffffffff81a57cd8 <g_cru_lock>) at kernel/locking/spinlock.c:162 #4 0xffffffff8056c4cc in ccu_mix_disable (hw=0xffffffff81956858 <sdh2_clk+120>) at ./include/linux/spinlock.h:325 #5 0xffffffff80564832 in clk_core_disable (core=0xffffffd900529900) at drivers/clk/clk.c:1051 torvalds#6 clk_core_disable (core=0xffffffd900529900) at drivers/clk/clk.c:1031 torvalds#7 0xffffffff805648e6 in clk_core_disable_lock (core=0xffffffd900529900) at drivers/clk/clk.c:1063 torvalds#8 0xffffffff8056527e in clk_disable (clk=<optimized out>) at drivers/clk/clk.c:1084 torvalds#9 clk_disable (clk=clk@entry=0xffffffd904fafa80) at drivers/clk/clk.c:1079 torvalds#10 0xffffffff808bb898 in clk_disable_unprepare (clk=0xffffffd904fafa80) at ./include/linux/clk.h:1085 torvalds#11 0xffffffff808bb916 in spacemit_sdhci_runtime_suspend (dev=<optimized out>) at drivers/mmc/host/sdhci-of-k1x.c:1469 torvalds#12 0xffffffff8066e8e2 in pm_generic_runtime_suspend (dev=<optimized out>) at drivers/base/power/generic_ops.c:25 torvalds#13 0xffffffff80670398 in __rpm_callback (cb=cb@entry=0xffffffff8066e8ca <pm_generic_runtime_suspend>, dev=dev@entry=0xffffffd9018a2810) at drivers/base/power/runtime.c:395 torvalds#14 0xffffffff806704b8 in rpm_callback (cb=cb@entry=0xffffffff8066e8ca <pm_generic_runtime_suspend>, dev=dev@entry=0xffffffd9018a2810) at drivers/base/power/runtime.c:529 torvalds#15 0xffffffff80670bdc in rpm_suspend (dev=0xffffffd9018a2810, rpmflags=<optimized out>) at drivers/base/power/runtime.c:672 torvalds#16 0xffffffff806716de in pm_runtime_work (work=0xffffffd9018a2948) at drivers/base/power/runtime.c:974 torvalds#17 0xffffffff800236f4 in process_one_work (worker=worker@entry=0xffffffd9013ee9c0, work=0xffffffd9018a2948) at kernel/workqueue.c:2289 torvalds#18 0xffffffff80023ba6 in worker_thread (__worker=0xffffffd9013ee9c0) at kernel/workqueue.c:2436 torvalds#19 0xffffffff80028bb2 in kthread (_create=0xffffffd9017de840) at kernel/kthread.c:376 torvalds#20 0xffffffff80003934 in handle_exception () at arch/riscv/kernel/entry.S:249 Backtrace stopped: frame did not save the PC (gdb) Change-Id: Ia95b41ffd6c1893c9c5e9c1c9fc0c155ea902d2c
…able the irq fiex bug like: [ 0.327242] pinctrl-single d401e000.pinctrl: 256 pins, size 1024 [ 0.378624] irq 12: nobody cared (try booting with the "irqpoll" option) [ 0.382473] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.6.36 #0 [ 0.388460] Hardware name: spacemit k1-x MUSE-N1 board (DT) [ 0.394103] Call Trace: [ 0.396619] [<ffffffff80005cf8>] dump_backtrace+0x1c/0x24 [ 0.402088] [<ffffffff80005d28>] show_stack+0x28/0x34 [ 0.407209] [<ffffffff80c9cd7c>] dump_stack_lvl+0x40/0x58 [ 0.412678] [<ffffffff80c9cda8>] dump_stack+0x14/0x1c [ 0.417799] [<ffffffff8005e366>] __report_bad_irq+0x42/0xba [ 0.423442] [<ffffffff8005e6b2>] note_interrupt+0x22e/0x270 [ 0.429084] [<ffffffff8005c144>] handle_irq_event+0x84/0x86 [ 0.434726] [<ffffffff8005f2cc>] handle_fasteoi_irq+0xc0/0x20e [ 0.440629] [<ffffffff8005b60e>] generic_handle_domain_irq+0x1c/0x2a [ 0.447053] [<ffffffff80635e30>] plic_handle_irq+0x84/0xf8 [ 0.452608] [<ffffffff8005b60e>] generic_handle_domain_irq+0x1c/0x2a [ 0.459032] [<ffffffff80635aa2>] riscv_intc_irq+0x26/0x60 [ 0.464501] [<ffffffff80cbbb66>] handle_riscv_irq+0x6e/0xb0 [ 0.470144] [<ffffffff80cc469e>] call_on_irq_stack+0x32/0x40 [ 0.475873] handlers: [ 0.478216] [<(____ptrval____)>] pcs_irq_handler [ 0.482903] Disabling IRQ torvalds#12 Change-Id: I4627f1b1e95d87eff88dfd4c1a331e91bd7fa1e5
there is a global spinlock between reset and clk, if locked in reset, then print some debug information, maybe dead-lock when uart driver try to disable clk. Backtrace stopped: frame did not save the PC (gdb) thread 4 [Switching to thread 4 (Thread 4)] #0 cpu_relax () at ./arch/riscv/include/asm/vdso/processor.h:22 22 ./arch/riscv/include/asm/vdso/processor.h: No such file or directory. (gdb) bt #0 cpu_relax () at ./arch/riscv/include/asm/vdso/processor.h:22 #1 arch_spin_lock (lock=lock@entry=0xffffffff81a57cd0 <enable_lock>) at ./include/asm-generic/spinlock.h:49 #2 do_raw_spin_lock (lock=lock@entry=0xffffffff81a57cd0 <enable_lock>) at ./include/linux/spinlock.h:186 #3 0xffffffff80aa21ce in __raw_spin_lock_irqsave (lock=0xffffffff81a57cd0 <enable_lock>) at ./include/linux/spinlock_api_smp.h:111 #4 _raw_spin_lock_irqsave (lock=lock@entry=0xffffffff81a57cd0 <enable_lock>) at kernel/locking/spinlock.c:162 #5 0xffffffff80563416 in clk_enable_lock () at ./include/linux/spinlock.h:325 torvalds#6 0xffffffff805648de in clk_core_disable_lock (core=0xffffffd900512500) at drivers/clk/clk.c:1062 torvalds#7 0xffffffff8056527e in clk_disable (clk=<optimized out>) at drivers/clk/clk.c:1084 torvalds#8 clk_disable (clk=0xffffffd9048b5100) at drivers/clk/clk.c:1079 torvalds#9 0xffffffff8059e5d4 in serial_pxa_console_write (co=<optimized out>, s=0xffffffff81a68250 <text> "[ 14.708612] [RESET][spacemit_reset_set][373]:assert = 1, id = 59 \n", count=<optimized out>) at drivers/tty/serial/pxa_k1x.c:1724 torvalds#10 0xffffffff8004a34c in call_console_driver (dropped_text=0xffffffff81a68650 <dropped_text> "", len=69, text=0xffffffff81a68250 <text> "[ 14.708612] [RESET][spacemit_reset_set][373]:assert = 1, id = 59 \n", con=0xffffffff81964c10 <serial_pxa_console>) at kernel/printk/printk.c:1942 torvalds#11 console_emit_next_record (con=con@entry=0xffffffff81964c10 <serial_pxa_console>, ext_text=<optimized out>, dropped_text=0xffffffff81a68650 <dropped_text> "", handover=0xffffffc80578baa7, text=0xffffffff81a68250 <text> "[ 14.708612] [RESET][spacemit_reset_set][373]:assert = 1, id = 59 \n") at kernel/printk/printk.c:2731 torvalds#12 0xffffffff8004a49a in console_flush_all (handover=0xffffffc80578baa7, next_seq=<synthetic pointer>, do_cond_resched=false) at kernel/printk/printk.c:2793 torvalds#13 console_unlock () at kernel/printk/printk.c:2860 torvalds#14 0xffffffff8004b388 in vprintk_emit (facility=facility@entry=0, level=<optimized out>, level@entry=-1, dev_info=dev_info@entry=0x0, fmt=<optimized out>, args=<optimized out>) at kernel/printk/printk.c:2268 torvalds#15 0xffffffff8004b3ae in vprintk_default (fmt=<optimized out>, args=<optimized out>) at kernel/printk/printk.c:2279 torvalds#16 0xffffffff8004b646 in vprintk (fmt=fmt@entry=0xffffffff813be470 "\001\066[RESET][%s][%d]:assert = %d, id = %d \n", args=args@entry=0xffffffc80578bbd8) at kernel/printk/printk_safe.c:50 torvalds#17 0xffffffff80a880d6 in _printk (fmt=fmt@entry=0xffffffff813be470 "\001\066[RESET][%s][%d]:assert = %d, id = %d \n") at kernel/printk/printk.c:2289 torvalds#18 0xffffffff80a90bb6 in spacemit_reset_set (rcdev=rcdev@entry=0xffffffff81f563a8 <k1x_reset_controller+8>, id=id@entry=59, assert=assert@entry=true) at drivers/reset/reset-spacemit-k1x.c:373 torvalds#19 0xffffffff805823b6 in spacemit_reset_update (assert=true, id=59, rcdev=0xffffffff81f563a8 <k1x_reset_controller+8>) at drivers/reset/reset-spacemit-k1x.c:401 torvalds#20 spacemit_reset_update (assert=true, id=59, rcdev=0xffffffff81f563a8 <k1x_reset_controller+8>) at drivers/reset/reset-spacemit-k1x.c:387 torvalds#21 spacemit_reset_assert (rcdev=0xffffffff81f563a8 <k1x_reset_controller+8>, id=59) at drivers/reset/reset-spacemit-k1x.c:413 torvalds#22 0xffffffff8058158e in reset_control_assert (rstc=0xffffffd902b2f280) at drivers/reset/core.c:485 torvalds#23 0xffffffff807ccf96 in cpp_disable_clocks (cpp_dev=cpp_dev@entry=0xffffffd904cc9040) at drivers/media/platform/spacemit/camera/cam_cpp/k1x_cpp.c:960 torvalds#24 0xffffffff807cd0b2 in cpp_release_hardware (cpp_dev=cpp_dev@entry=0xffffffd904cc9040) at drivers/media/platform/spacemit/camera/cam_cpp/k1x_cpp.c:1038 torvalds#25 0xffffffff807cd990 in cpp_close_node (sd=<optimized out>, fh=<optimized out>) at drivers/media/platform/spacemit/camera/cam_cpp/k1x_cpp.c:1135 torvalds#26 0xffffffff8079525e in subdev_close (file=0xffffffd906645d00) at drivers/media/v4l2-core/v4l2-subdev.c:105 torvalds#27 0xffffffff8078e49e in v4l2_release (inode=<optimized out>, filp=0xffffffd906645d00) at drivers/media/v4l2-core/v4l2-dev.c:459 torvalds#28 0xffffffff80154974 in __fput (file=0xffffffd906645d00) at fs/file_table.c:320 torvalds#29 0xffffffff80154aa2 in ____fput (work=<optimized out>) at fs/file_table.c:348 torvalds#30 0xffffffff8002677e in task_work_run () at kernel/task_work.c:179 torvalds#31 0xffffffff800053b4 in resume_user_mode_work (regs=0xffffffc80578bee0) at ./include/linux/resume_user_mode.h:49 torvalds#32 do_work_pending (regs=0xffffffc80578bee0, thread_info_flags=<optimized out>) at arch/riscv/kernel/signal.c:478 torvalds#33 0xffffffff800039c6 in handle_exception () at arch/riscv/kernel/entry.S:374 Backtrace stopped: frame did not save the PC (gdb) thread 1 [Switching to thread 1 (Thread 1)] #0 0xffffffff80047e9c in arch_spin_lock (lock=lock@entry=0xffffffff81a57cd8 <g_cru_lock>) at ./include/asm-generic/spinlock.h:49 49 ./include/asm-generic/spinlock.h: No such file or directory. (gdb) bt #0 0xffffffff80047e9c in arch_spin_lock (lock=lock@entry=0xffffffff81a57cd8 <g_cru_lock>) at ./include/asm-generic/spinlock.h:49 #1 do_raw_spin_lock (lock=lock@entry=0xffffffff81a57cd8 <g_cru_lock>) at ./include/linux/spinlock.h:186 #2 0xffffffff80aa21ce in __raw_spin_lock_irqsave (lock=0xffffffff81a57cd8 <g_cru_lock>) at ./include/linux/spinlock_api_smp.h:111 #3 _raw_spin_lock_irqsave (lock=0xffffffff81a57cd8 <g_cru_lock>) at kernel/locking/spinlock.c:162 #4 0xffffffff8056c4cc in ccu_mix_disable (hw=0xffffffff81956858 <sdh2_clk+120>) at ./include/linux/spinlock.h:325 #5 0xffffffff80564832 in clk_core_disable (core=0xffffffd900529900) at drivers/clk/clk.c:1051 torvalds#6 clk_core_disable (core=0xffffffd900529900) at drivers/clk/clk.c:1031 torvalds#7 0xffffffff805648e6 in clk_core_disable_lock (core=0xffffffd900529900) at drivers/clk/clk.c:1063 torvalds#8 0xffffffff8056527e in clk_disable (clk=<optimized out>) at drivers/clk/clk.c:1084 torvalds#9 clk_disable (clk=clk@entry=0xffffffd904fafa80) at drivers/clk/clk.c:1079 torvalds#10 0xffffffff808bb898 in clk_disable_unprepare (clk=0xffffffd904fafa80) at ./include/linux/clk.h:1085 torvalds#11 0xffffffff808bb916 in spacemit_sdhci_runtime_suspend (dev=<optimized out>) at drivers/mmc/host/sdhci-of-k1x.c:1469 torvalds#12 0xffffffff8066e8e2 in pm_generic_runtime_suspend (dev=<optimized out>) at drivers/base/power/generic_ops.c:25 torvalds#13 0xffffffff80670398 in __rpm_callback (cb=cb@entry=0xffffffff8066e8ca <pm_generic_runtime_suspend>, dev=dev@entry=0xffffffd9018a2810) at drivers/base/power/runtime.c:395 torvalds#14 0xffffffff806704b8 in rpm_callback (cb=cb@entry=0xffffffff8066e8ca <pm_generic_runtime_suspend>, dev=dev@entry=0xffffffd9018a2810) at drivers/base/power/runtime.c:529 torvalds#15 0xffffffff80670bdc in rpm_suspend (dev=0xffffffd9018a2810, rpmflags=<optimized out>) at drivers/base/power/runtime.c:672 torvalds#16 0xffffffff806716de in pm_runtime_work (work=0xffffffd9018a2948) at drivers/base/power/runtime.c:974 torvalds#17 0xffffffff800236f4 in process_one_work (worker=worker@entry=0xffffffd9013ee9c0, work=0xffffffd9018a2948) at kernel/workqueue.c:2289 torvalds#18 0xffffffff80023ba6 in worker_thread (__worker=0xffffffd9013ee9c0) at kernel/workqueue.c:2436 torvalds#19 0xffffffff80028bb2 in kthread (_create=0xffffffd9017de840) at kernel/kthread.c:376 torvalds#20 0xffffffff80003934 in handle_exception () at arch/riscv/kernel/entry.S:249 Backtrace stopped: frame did not save the PC (gdb) Change-Id: Ia95b41ffd6c1893c9c5e9c1c9fc0c155ea902d2c
…able the irq fiex bug like: [ 0.327242] pinctrl-single d401e000.pinctrl: 256 pins, size 1024 [ 0.378624] irq 12: nobody cared (try booting with the "irqpoll" option) [ 0.382473] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.6.36 #0 [ 0.388460] Hardware name: spacemit k1-x MUSE-N1 board (DT) [ 0.394103] Call Trace: [ 0.396619] [<ffffffff80005cf8>] dump_backtrace+0x1c/0x24 [ 0.402088] [<ffffffff80005d28>] show_stack+0x28/0x34 [ 0.407209] [<ffffffff80c9cd7c>] dump_stack_lvl+0x40/0x58 [ 0.412678] [<ffffffff80c9cda8>] dump_stack+0x14/0x1c [ 0.417799] [<ffffffff8005e366>] __report_bad_irq+0x42/0xba [ 0.423442] [<ffffffff8005e6b2>] note_interrupt+0x22e/0x270 [ 0.429084] [<ffffffff8005c144>] handle_irq_event+0x84/0x86 [ 0.434726] [<ffffffff8005f2cc>] handle_fasteoi_irq+0xc0/0x20e [ 0.440629] [<ffffffff8005b60e>] generic_handle_domain_irq+0x1c/0x2a [ 0.447053] [<ffffffff80635e30>] plic_handle_irq+0x84/0xf8 [ 0.452608] [<ffffffff8005b60e>] generic_handle_domain_irq+0x1c/0x2a [ 0.459032] [<ffffffff80635aa2>] riscv_intc_irq+0x26/0x60 [ 0.464501] [<ffffffff80cbbb66>] handle_riscv_irq+0x6e/0xb0 [ 0.470144] [<ffffffff80cc469e>] call_on_irq_stack+0x32/0x40 [ 0.475873] handlers: [ 0.478216] [<(____ptrval____)>] pcs_irq_handler [ 0.482903] Disabling IRQ torvalds#12 Change-Id: I4627f1b1e95d87eff88dfd4c1a331e91bd7fa1e5
When a bio with REQ_PREFLUSH is submitted to dm, __send_empty_flush() generates a flush_bio with REQ_OP_WRITE | REQ_PREFLUSH | REQ_SYNC, which causes the flush_bio to be throttled by wbt_wait(). An example from v5.4, similar problem also exists in upstream: crash> bt 2091206 PID: 2091206 TASK: ffff2050df92a300 CPU: 109 COMMAND: "kworker/u260:0" #0 [ffff800084a2f7f0] __switch_to at ffff80004008aeb8 #1 [ffff800084a2f820] __schedule at ffff800040bfa0c4 #2 [ffff800084a2f880] schedule at ffff800040bfa4b4 #3 [ffff800084a2f8a0] io_schedule at ffff800040bfa9c4 #4 [ffff800084a2f8c0] rq_qos_wait at ffff8000405925bc #5 [ffff800084a2f940] wbt_wait at ffff8000405bb3a0 torvalds#6 [ffff800084a2f9a0] __rq_qos_throttle at ffff800040592254 torvalds#7 [ffff800084a2f9c0] blk_mq_make_request at ffff80004057cf38 torvalds#8 [ffff800084a2fa60] generic_make_request at ffff800040570138 torvalds#9 [ffff800084a2fae0] submit_bio at ffff8000405703b4 torvalds#10 [ffff800084a2fb50] xlog_write_iclog at ffff800001280834 [xfs] torvalds#11 [ffff800084a2fbb0] xlog_sync at ffff800001280c3c [xfs] torvalds#12 [ffff800084a2fbf0] xlog_state_release_iclog at ffff800001280df4 [xfs] torvalds#13 [ffff800084a2fc10] xlog_write at ffff80000128203c [xfs] torvalds#14 [ffff800084a2fcd0] xlog_cil_push at ffff8000012846dc [xfs] torvalds#15 [ffff800084a2fda0] xlog_cil_push_work at ffff800001284a2c [xfs] torvalds#16 [ffff800084a2fdb0] process_one_work at ffff800040111d08 torvalds#17 [ffff800084a2fe00] worker_thread at ffff8000401121cc torvalds#18 [ffff800084a2fe70] kthread at ffff800040118de4 After commit 2def284 ("xfs: don't allow log IO to be throttled"), the metadata submitted by xlog_write_iclog() should not be throttled. But due to the existence of the dm layer, throttling flush_bio indirectly causes the metadata bio to be throttled. Fix this by conditionally adding REQ_IDLE to flush_bio.bi_opf, which makes wbt_should_throttle() return false to avoid wbt_wait(). Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com> Reviewed-by: Tianxiang Peng <txpeng@tencent.com> Reviewed-by: Hao Peng <flyingpeng@tencent.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
there is a global spinlock between reset and clk, if locked in reset, then print some debug information, maybe dead-lock when uart driver try to disable clk. Backtrace stopped: frame did not save the PC (gdb) thread 4 [Switching to thread 4 (Thread 4)] #0 cpu_relax () at ./arch/riscv/include/asm/vdso/processor.h:22 22 ./arch/riscv/include/asm/vdso/processor.h: No such file or directory. (gdb) bt #0 cpu_relax () at ./arch/riscv/include/asm/vdso/processor.h:22 #1 arch_spin_lock (lock=lock@entry=0xffffffff81a57cd0 <enable_lock>) at ./include/asm-generic/spinlock.h:49 #2 do_raw_spin_lock (lock=lock@entry=0xffffffff81a57cd0 <enable_lock>) at ./include/linux/spinlock.h:186 #3 0xffffffff80aa21ce in __raw_spin_lock_irqsave (lock=0xffffffff81a57cd0 <enable_lock>) at ./include/linux/spinlock_api_smp.h:111 #4 _raw_spin_lock_irqsave (lock=lock@entry=0xffffffff81a57cd0 <enable_lock>) at kernel/locking/spinlock.c:162 #5 0xffffffff80563416 in clk_enable_lock () at ./include/linux/spinlock.h:325 torvalds#6 0xffffffff805648de in clk_core_disable_lock (core=0xffffffd900512500) at drivers/clk/clk.c:1062 torvalds#7 0xffffffff8056527e in clk_disable (clk=<optimized out>) at drivers/clk/clk.c:1084 torvalds#8 clk_disable (clk=0xffffffd9048b5100) at drivers/clk/clk.c:1079 torvalds#9 0xffffffff8059e5d4 in serial_pxa_console_write (co=<optimized out>, s=0xffffffff81a68250 <text> "[ 14.708612] [RESET][spacemit_reset_set][373]:assert = 1, id = 59 \n", count=<optimized out>) at drivers/tty/serial/pxa_k1x.c:1724 torvalds#10 0xffffffff8004a34c in call_console_driver (dropped_text=0xffffffff81a68650 <dropped_text> "", len=69, text=0xffffffff81a68250 <text> "[ 14.708612] [RESET][spacemit_reset_set][373]:assert = 1, id = 59 \n", con=0xffffffff81964c10 <serial_pxa_console>) at kernel/printk/printk.c:1942 torvalds#11 console_emit_next_record (con=con@entry=0xffffffff81964c10 <serial_pxa_console>, ext_text=<optimized out>, dropped_text=0xffffffff81a68650 <dropped_text> "", handover=0xffffffc80578baa7, text=0xffffffff81a68250 <text> "[ 14.708612] [RESET][spacemit_reset_set][373]:assert = 1, id = 59 \n") at kernel/printk/printk.c:2731 torvalds#12 0xffffffff8004a49a in console_flush_all (handover=0xffffffc80578baa7, next_seq=<synthetic pointer>, do_cond_resched=false) at kernel/printk/printk.c:2793 torvalds#13 console_unlock () at kernel/printk/printk.c:2860 torvalds#14 0xffffffff8004b388 in vprintk_emit (facility=facility@entry=0, level=<optimized out>, level@entry=-1, dev_info=dev_info@entry=0x0, fmt=<optimized out>, args=<optimized out>) at kernel/printk/printk.c:2268 torvalds#15 0xffffffff8004b3ae in vprintk_default (fmt=<optimized out>, args=<optimized out>) at kernel/printk/printk.c:2279 torvalds#16 0xffffffff8004b646 in vprintk (fmt=fmt@entry=0xffffffff813be470 "\001\066[RESET][%s][%d]:assert = %d, id = %d \n", args=args@entry=0xffffffc80578bbd8) at kernel/printk/printk_safe.c:50 torvalds#17 0xffffffff80a880d6 in _printk (fmt=fmt@entry=0xffffffff813be470 "\001\066[RESET][%s][%d]:assert = %d, id = %d \n") at kernel/printk/printk.c:2289 torvalds#18 0xffffffff80a90bb6 in spacemit_reset_set (rcdev=rcdev@entry=0xffffffff81f563a8 <k1x_reset_controller+8>, id=id@entry=59, assert=assert@entry=true) at drivers/reset/reset-spacemit-k1x.c:373 torvalds#19 0xffffffff805823b6 in spacemit_reset_update (assert=true, id=59, rcdev=0xffffffff81f563a8 <k1x_reset_controller+8>) at drivers/reset/reset-spacemit-k1x.c:401 torvalds#20 spacemit_reset_update (assert=true, id=59, rcdev=0xffffffff81f563a8 <k1x_reset_controller+8>) at drivers/reset/reset-spacemit-k1x.c:387 torvalds#21 spacemit_reset_assert (rcdev=0xffffffff81f563a8 <k1x_reset_controller+8>, id=59) at drivers/reset/reset-spacemit-k1x.c:413 torvalds#22 0xffffffff8058158e in reset_control_assert (rstc=0xffffffd902b2f280) at drivers/reset/core.c:485 torvalds#23 0xffffffff807ccf96 in cpp_disable_clocks (cpp_dev=cpp_dev@entry=0xffffffd904cc9040) at drivers/media/platform/spacemit/camera/cam_cpp/k1x_cpp.c:960 torvalds#24 0xffffffff807cd0b2 in cpp_release_hardware (cpp_dev=cpp_dev@entry=0xffffffd904cc9040) at drivers/media/platform/spacemit/camera/cam_cpp/k1x_cpp.c:1038 torvalds#25 0xffffffff807cd990 in cpp_close_node (sd=<optimized out>, fh=<optimized out>) at drivers/media/platform/spacemit/camera/cam_cpp/k1x_cpp.c:1135 torvalds#26 0xffffffff8079525e in subdev_close (file=0xffffffd906645d00) at drivers/media/v4l2-core/v4l2-subdev.c:105 torvalds#27 0xffffffff8078e49e in v4l2_release (inode=<optimized out>, filp=0xffffffd906645d00) at drivers/media/v4l2-core/v4l2-dev.c:459 torvalds#28 0xffffffff80154974 in __fput (file=0xffffffd906645d00) at fs/file_table.c:320 torvalds#29 0xffffffff80154aa2 in ____fput (work=<optimized out>) at fs/file_table.c:348 torvalds#30 0xffffffff8002677e in task_work_run () at kernel/task_work.c:179 torvalds#31 0xffffffff800053b4 in resume_user_mode_work (regs=0xffffffc80578bee0) at ./include/linux/resume_user_mode.h:49 torvalds#32 do_work_pending (regs=0xffffffc80578bee0, thread_info_flags=<optimized out>) at arch/riscv/kernel/signal.c:478 torvalds#33 0xffffffff800039c6 in handle_exception () at arch/riscv/kernel/entry.S:374 Backtrace stopped: frame did not save the PC (gdb) thread 1 [Switching to thread 1 (Thread 1)] #0 0xffffffff80047e9c in arch_spin_lock (lock=lock@entry=0xffffffff81a57cd8 <g_cru_lock>) at ./include/asm-generic/spinlock.h:49 49 ./include/asm-generic/spinlock.h: No such file or directory. (gdb) bt #0 0xffffffff80047e9c in arch_spin_lock (lock=lock@entry=0xffffffff81a57cd8 <g_cru_lock>) at ./include/asm-generic/spinlock.h:49 #1 do_raw_spin_lock (lock=lock@entry=0xffffffff81a57cd8 <g_cru_lock>) at ./include/linux/spinlock.h:186 #2 0xffffffff80aa21ce in __raw_spin_lock_irqsave (lock=0xffffffff81a57cd8 <g_cru_lock>) at ./include/linux/spinlock_api_smp.h:111 #3 _raw_spin_lock_irqsave (lock=0xffffffff81a57cd8 <g_cru_lock>) at kernel/locking/spinlock.c:162 #4 0xffffffff8056c4cc in ccu_mix_disable (hw=0xffffffff81956858 <sdh2_clk+120>) at ./include/linux/spinlock.h:325 #5 0xffffffff80564832 in clk_core_disable (core=0xffffffd900529900) at drivers/clk/clk.c:1051 torvalds#6 clk_core_disable (core=0xffffffd900529900) at drivers/clk/clk.c:1031 torvalds#7 0xffffffff805648e6 in clk_core_disable_lock (core=0xffffffd900529900) at drivers/clk/clk.c:1063 torvalds#8 0xffffffff8056527e in clk_disable (clk=<optimized out>) at drivers/clk/clk.c:1084 torvalds#9 clk_disable (clk=clk@entry=0xffffffd904fafa80) at drivers/clk/clk.c:1079 torvalds#10 0xffffffff808bb898 in clk_disable_unprepare (clk=0xffffffd904fafa80) at ./include/linux/clk.h:1085 torvalds#11 0xffffffff808bb916 in spacemit_sdhci_runtime_suspend (dev=<optimized out>) at drivers/mmc/host/sdhci-of-k1x.c:1469 torvalds#12 0xffffffff8066e8e2 in pm_generic_runtime_suspend (dev=<optimized out>) at drivers/base/power/generic_ops.c:25 torvalds#13 0xffffffff80670398 in __rpm_callback (cb=cb@entry=0xffffffff8066e8ca <pm_generic_runtime_suspend>, dev=dev@entry=0xffffffd9018a2810) at drivers/base/power/runtime.c:395 torvalds#14 0xffffffff806704b8 in rpm_callback (cb=cb@entry=0xffffffff8066e8ca <pm_generic_runtime_suspend>, dev=dev@entry=0xffffffd9018a2810) at drivers/base/power/runtime.c:529 torvalds#15 0xffffffff80670bdc in rpm_suspend (dev=0xffffffd9018a2810, rpmflags=<optimized out>) at drivers/base/power/runtime.c:672 torvalds#16 0xffffffff806716de in pm_runtime_work (work=0xffffffd9018a2948) at drivers/base/power/runtime.c:974 torvalds#17 0xffffffff800236f4 in process_one_work (worker=worker@entry=0xffffffd9013ee9c0, work=0xffffffd9018a2948) at kernel/workqueue.c:2289 torvalds#18 0xffffffff80023ba6 in worker_thread (__worker=0xffffffd9013ee9c0) at kernel/workqueue.c:2436 torvalds#19 0xffffffff80028bb2 in kthread (_create=0xffffffd9017de840) at kernel/kthread.c:376 torvalds#20 0xffffffff80003934 in handle_exception () at arch/riscv/kernel/entry.S:249 Backtrace stopped: frame did not save the PC (gdb) Change-Id: Ia95b41ffd6c1893c9c5e9c1c9fc0c155ea902d2c
…able the irq fiex bug like: [ 0.327242] pinctrl-single d401e000.pinctrl: 256 pins, size 1024 [ 0.378624] irq 12: nobody cared (try booting with the "irqpoll" option) [ 0.382473] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.6.36 #0 [ 0.388460] Hardware name: spacemit k1-x MUSE-N1 board (DT) [ 0.394103] Call Trace: [ 0.396619] [<ffffffff80005cf8>] dump_backtrace+0x1c/0x24 [ 0.402088] [<ffffffff80005d28>] show_stack+0x28/0x34 [ 0.407209] [<ffffffff80c9cd7c>] dump_stack_lvl+0x40/0x58 [ 0.412678] [<ffffffff80c9cda8>] dump_stack+0x14/0x1c [ 0.417799] [<ffffffff8005e366>] __report_bad_irq+0x42/0xba [ 0.423442] [<ffffffff8005e6b2>] note_interrupt+0x22e/0x270 [ 0.429084] [<ffffffff8005c144>] handle_irq_event+0x84/0x86 [ 0.434726] [<ffffffff8005f2cc>] handle_fasteoi_irq+0xc0/0x20e [ 0.440629] [<ffffffff8005b60e>] generic_handle_domain_irq+0x1c/0x2a [ 0.447053] [<ffffffff80635e30>] plic_handle_irq+0x84/0xf8 [ 0.452608] [<ffffffff8005b60e>] generic_handle_domain_irq+0x1c/0x2a [ 0.459032] [<ffffffff80635aa2>] riscv_intc_irq+0x26/0x60 [ 0.464501] [<ffffffff80cbbb66>] handle_riscv_irq+0x6e/0xb0 [ 0.470144] [<ffffffff80cc469e>] call_on_irq_stack+0x32/0x40 [ 0.475873] handlers: [ 0.478216] [<(____ptrval____)>] pcs_irq_handler [ 0.482903] Disabling IRQ torvalds#12 Change-Id: I4627f1b1e95d87eff88dfd4c1a331e91bd7fa1e5
Hi,
Here is what I did to the
CREDITS
file:Sorry for the microscopic commit.
Sincerely,
Jan.