Commit 2133314

adam900710 authored and gregkh committed
btrfs: fix double accounting race when btrfs_run_delalloc_range() failed
[ Upstream commit 72dad8e ]

[BUG]
When running btrfs with block size (4K) smaller than page size (64K,
aarch64), there is a very high chance to crash the kernel at generic/750,
with the following messages:
(before the call traces, there are 3 extra debug messages added)

  BTRFS warning (device dm-3): read-write for sector size 4096 with page size 65536 is experimental
  BTRFS info (device dm-3): checking UUID tree
  hrtimer: interrupt took 5451385 ns
  BTRFS error (device dm-3): cow_file_range failed, root=4957 inode=257 start=1605632 len=69632: -28
  BTRFS error (device dm-3): run_delalloc_nocow failed, root=4957 inode=257 start=1605632 len=69632: -28
  BTRFS error (device dm-3): failed to run delalloc range, root=4957 ino=257 folio=1572864 submit_bitmap=8-15 start=1605632 len=69632: -28
  ------------[ cut here ]------------
  WARNING: CPU: 2 PID: 3020984 at ordered-data.c:360 can_finish_ordered_extent+0x370/0x3b8 [btrfs]
  CPU: 2 UID: 0 PID: 3020984 Comm: kworker/u24:1 Tainted: G OE 6.13.0-rc1-custom+ #89
  Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
  Hardware name: QEMU KVM Virtual Machine, BIOS unknown 2/2/2022
  Workqueue: events_unbound btrfs_async_reclaim_data_space [btrfs]
  pc : can_finish_ordered_extent+0x370/0x3b8 [btrfs]
  lr : can_finish_ordered_extent+0x1ec/0x3b8 [btrfs]
  Call trace:
   can_finish_ordered_extent+0x370/0x3b8 [btrfs] (P)
   can_finish_ordered_extent+0x1ec/0x3b8 [btrfs] (L)
   btrfs_mark_ordered_io_finished+0x130/0x2b8 [btrfs]
   extent_writepage+0x10c/0x3b8 [btrfs]
   extent_write_cache_pages+0x21c/0x4e8 [btrfs]
   btrfs_writepages+0x94/0x160 [btrfs]
   do_writepages+0x74/0x190
   filemap_fdatawrite_wbc+0x74/0xa0
   start_delalloc_inodes+0x17c/0x3b0 [btrfs]
   btrfs_start_delalloc_roots+0x17c/0x288 [btrfs]
   shrink_delalloc+0x11c/0x280 [btrfs]
   flush_space+0x288/0x328 [btrfs]
   btrfs_async_reclaim_data_space+0x180/0x228 [btrfs]
   process_one_work+0x228/0x680
   worker_thread+0x1bc/0x360
   kthread+0x100/0x118
   ret_from_fork+0x10/0x20
  ---[ end trace 0000000000000000 ]---
  BTRFS critical (device dm-3): bad ordered extent accounting, root=4957 ino=257 OE offset=1605632 OE len=16384 to_dec=16384 left=0
  BTRFS critical (device dm-3): bad ordered extent accounting, root=4957 ino=257 OE offset=1622016 OE len=12288 to_dec=12288 left=0
  Unable to handle kernel NULL pointer dereference at virtual address 0000000000000008
  BTRFS critical (device dm-3): bad ordered extent accounting, root=4957 ino=257 OE offset=1634304 OE len=8192 to_dec=4096 left=0
  CPU: 1 UID: 0 PID: 3286940 Comm: kworker/u24:3 Tainted: G W OE 6.13.0-rc1-custom+ #89
  Hardware name: QEMU KVM Virtual Machine, BIOS unknown 2/2/2022
  Workqueue: btrfs_work_helper [btrfs] (btrfs-endio-write)
  pstate: 404000c5 (nZcv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
  pc : process_one_work+0x110/0x680
  lr : worker_thread+0x1bc/0x360
  Call trace:
   process_one_work+0x110/0x680 (P)
   worker_thread+0x1bc/0x360 (L)
   worker_thread+0x1bc/0x360
   kthread+0x100/0x118
   ret_from_fork+0x10/0x20
  Code: f84086a1 f9000fe1 53041c21 b9003361 (f9400661)
  ---[ end trace 0000000000000000 ]---
  Kernel panic - not syncing: Oops: Fatal exception
  SMP: stopping secondary CPUs
  SMP: failed to stop secondary CPUs 2-3
  Dumping ftrace buffer:
     (ftrace buffer empty)
  Kernel Offset: 0x275bb9540000 from 0xffff800080000000
  PHYS_OFFSET: 0xffff8fbba0000000
  CPU features: 0x100,00000070,00801250,8201720b

[CAUSE]
The above warning is triggered immediately after the delalloc range
failure. This happens in the following sequence:

- Range [1568K, 1636K) is dirty

     1536K  1568K     1600K    1636K  1664K
     |      |/////////|////////|      |

  Where 1536K, 1600K and 1664K are page boundaries (64K page size)

- Enter extent_writepage() for page 1536K

- Enter run_delalloc_nocow() with locked page 1536K and range
  [1568K, 1636K)
  This is due to the inode having preallocated extents.

- Enter cow_file_range() with locked page 1536K and range
  [1568K, 1636K)

- btrfs_reserve_extent() only reserved two extents
  The main loop of cow_file_range() only reserved two data extents.

  Now we have:

     1536K  1568K        1600K    1636K  1664K
     |      |<-->|<--->|/|///////|      |
                 1584K  1596K

  Range [1568K, 1596K) has an ordered extent reserved.

- btrfs_reserve_extent() failed inside cow_file_range() for file offset
  1596K
  This is already a bug in our space reservation code, but for now let's
  focus on the error handling path.

  Now cow_file_range() returned -ENOSPC.

- btrfs_run_delalloc_range() does error cleanup <<< ROOT CAUSE
  Call btrfs_cleanup_ordered_extents() with locked folio 1536K and range
  [1568K, 1636K)

  Function btrfs_cleanup_ordered_extents() normally needs to skip the
  ranges inside the folio, as it will normally be cleaned up by
  extent_writepage().

  Such split error handling is already problematic in the first place.

  What's worse is the folio range skipping itself, which does not take
  subpage cases into consideration at all: it only skips the range if
  the page start >= the range start.
  In our case, the page start < the range start, since for subpage cases
  we can have delalloc ranges inside the folio but not covering the
  folio.

  So it doesn't skip the page range at all.
  This means all the ordered extents, both [1568K, 1584K) and
  [1584K, 1596K), will be marked as IOERR.

  And since these two ordered extents have no more pending ios, they are
  marked finished and *QUEUED* to be deleted from the io tree.

- extent_writepage() does error cleanup
  Call btrfs_mark_ordered_io_finished() for the range [1536K, 1600K).

  Although ranges [1568K, 1584K) and [1584K, 1596K) are finished, the
  deletion from the io tree is async, so it may or may not have happened
  at this time.

  If the ranges have not yet been removed, we will do double cleaning on
  those ranges, triggering the above ordered extent warnings.

In theory there are other bugs, for example the cleanup in
extent_writepage() can cause double accounting on ranges that are
submitted asynchronously (compression for example).

But that's much harder to trigger because normally we do not mix regular
and compression delalloc ranges.

[FIX]
The folio range split is already buggy and not subpage compatible; it was
introduced a long time ago, when subpage support was not even considered.

So instead of splitting the ordered extents cleanup into the folio range
and out of folio range, do all the cleanup inside writepage_delalloc().

- Pass @NULL as locked_folio for btrfs_cleanup_ordered_extents() in
  btrfs_run_delalloc_range()

- Skip the btrfs_cleanup_ordered_extents() if writepage_delalloc() failed

  So all ordered extents are only cleaned up by
  btrfs_run_delalloc_range().

- Handle the ranges that already have ordered extents allocated
  If part of the folio already has an ordered extent allocated, and
  btrfs_run_delalloc_range() failed, we also need to clean up that range.

Now we have a concentrated error handling for ordered extents during
btrfs_run_delalloc_range().

Fixes: d1051d6 ("btrfs: Fix error handling in btrfs_cleanup_ordered_extents")
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Stable-dep-of: 8bf334b ("btrfs: fix double accounting race when extent_writepage_io() failed")
Signed-off-by: Sasha Levin <sashal@kernel.org>
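
The double cleanup described under [CAUSE] can be illustrated with a small userspace model. This is only a sketch under stated assumptions: `toy_ordered_extent` and `toy_finish_range()` are hypothetical names, not kernel APIs, and the real check lives in can_finish_ordered_extent(). The model assumes an ordered extent tracks how many bytes are still unfinished, and that finishing more bytes than remain is an accounting error, which is what the `to_dec=16384 left=0` message reports.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for an ordered extent; not the kernel structure. */
struct toy_ordered_extent {
	uint64_t file_offset;	/* start of the extent in the file */
	uint64_t num_bytes;	/* total length of the extent */
	uint64_t bytes_left;	/* bytes not yet marked finished */
	bool ioerr;		/* set when finished through an error path */
};

/*
 * Mark [start, start + len) as finished on @oe, clamped to the extent.
 * Returns 0 on success, or -1 if the call would finish more bytes than
 * are left (the "bad ordered extent accounting" case).
 */
static int toy_finish_range(struct toy_ordered_extent *oe,
			    uint64_t start, uint64_t len, bool uptodate)
{
	uint64_t oe_end = oe->file_offset + oe->num_bytes;
	uint64_t end = start + len;
	uint64_t to_dec;

	/* Clamp the request to the part that overlaps the extent. */
	if (start < oe->file_offset)
		start = oe->file_offset;
	if (end > oe_end)
		end = oe_end;
	if (start >= end)
		return 0;

	to_dec = end - start;
	if (to_dec > oe->bytes_left)
		return -1;	/* the range was already accounted once */

	oe->bytes_left -= to_dec;
	if (!uptodate)
		oe->ioerr = true;
	return 0;
}
```

With the extent [1568K, 1584K) from the report, a first call over [1568K, 1636K) (the btrfs_run_delalloc_range() error path) drops `bytes_left` to 0. A second call over the folio range [1536K, 1600K) (the extent_writepage() error path) then returns -1, because the extent has not yet been removed and is accounted a second time.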
1 parent 80f32ac commit 2133314

2 files changed, +49 -13 lines changed

fs/btrfs/extent_io.c  +48 -11

@@ -1145,14 +1145,19 @@ static bool find_next_delalloc_bitmap(struct folio *folio,
 }

 /*
- * helper for extent_writepage(), doing all of the delayed allocation setup.
+ * Do all of the delayed allocation setup.
  *
- * This returns 1 if btrfs_run_delalloc_range function did all the work required
- * to write the page (copy into inline extent). In this case the IO has
- * been started and the page is already unlocked.
+ * Return >0 if all the dirty blocks are submitted async (compression) or inlined.
+ * The @folio should no longer be touched (treat it as already unlocked).
  *
- * This returns 0 if all went well (page still locked)
- * This returns < 0 if there were errors (page still locked)
+ * Return 0 if there is still dirty block that needs to be submitted through
+ * extent_writepage_io().
+ * bio_ctrl->submit_bitmap will indicate which blocks of the folio should be
+ * submitted, and @folio is still kept locked.
+ *
+ * Return <0 if there is any error hit.
+ * Any allocated ordered extent range covering this folio will be marked
+ * finished (IOERR), and @folio is still kept locked.
  */
 static noinline_for_stack int writepage_delalloc(struct btrfs_inode *inode,
 						 struct folio *folio,
@@ -1170,6 +1175,16 @@ static noinline_for_stack int writepage_delalloc(struct btrfs_inode *inode,
 	 * last delalloc end.
 	 */
 	u64 last_delalloc_end = 0;
+	/*
+	 * The range end (exclusive) of the last successfully finished delalloc
+	 * range.
+	 * Any range covered by ordered extent must either be manually marked
+	 * finished (error handling), or has IO submitted (and finish the
+	 * ordered extent normally).
+	 *
+	 * This records the end of ordered extent cleanup if we hit an error.
+	 */
+	u64 last_finished_delalloc_end = page_start;
 	u64 delalloc_start = page_start;
 	u64 delalloc_end = page_end;
 	u64 delalloc_to_write = 0;
@@ -1238,11 +1253,19 @@ static noinline_for_stack int writepage_delalloc(struct btrfs_inode *inode,
 		found_len = last_delalloc_end + 1 - found_start;

 		if (ret >= 0) {
+			/*
+			 * Some delalloc range may be created by previous folios.
+			 * Thus we still need to clean up this range during error
+			 * handling.
+			 */
+			last_finished_delalloc_end = found_start;
 			/* No errors hit so far, run the current delalloc range. */
 			ret = btrfs_run_delalloc_range(inode, folio,
 						       found_start,
 						       found_start + found_len - 1,
 						       wbc);
+			if (ret >= 0)
+				last_finished_delalloc_end = found_start + found_len;
 		} else {
 			/*
 			 * We've hit an error during previous delalloc range,
@@ -1277,8 +1300,22 @@ static noinline_for_stack int writepage_delalloc(struct btrfs_inode *inode,

 		delalloc_start = found_start + found_len;
 	}
-	if (ret < 0)
+	/*
+	 * It's possible we had some ordered extents created before we hit
+	 * an error, cleanup non-async successfully created delalloc ranges.
+	 */
+	if (unlikely(ret < 0)) {
+		unsigned int bitmap_size = min(
+				(last_finished_delalloc_end - page_start) >>
+				fs_info->sectorsize_bits,
+				fs_info->sectors_per_page);
+
+		for_each_set_bit(bit, &bio_ctrl->submit_bitmap, bitmap_size)
+			btrfs_mark_ordered_io_finished(inode, folio,
+				page_start + (bit << fs_info->sectorsize_bits),
+				fs_info->sectorsize, false);
 		return ret;
+	}
 out:
 	if (last_delalloc_end)
 		delalloc_end = last_delalloc_end;
@@ -1512,13 +1549,13 @@ static int extent_writepage(struct folio *folio, struct btrfs_bio_ctrl *bio_ctrl

 	bio_ctrl->wbc->nr_to_write--;

-done:
-	if (ret) {
+	if (ret)
 		btrfs_mark_ordered_io_finished(inode, folio,
 					       page_start, PAGE_SIZE, !ret);
-		mapping_set_error(folio->mapping, ret);
-	}

+done:
+	if (ret < 0)
+		mapping_set_error(folio->mapping, ret);
 	/*
 	 * Only unlock ranges that are submitted. As there can be some async
 	 * submitted ranges inside the folio.
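
The `bitmap_size` arithmetic in the new writepage_delalloc() error path can be checked with the numbers from the report. The sketch below is a userspace model, not kernel code: the `toy_` names and the two macros are hypothetical, and the geometry (4K blocks, 64K page, so 16 blocks per page) is taken from the commit message.

```c
#include <stdint.h>

/* Assumed geometry from the report: 4K blocks inside a 64K page. */
#define TOY_SECTORSIZE_BITS	12u
#define TOY_SECTORS_PER_PAGE	16u

/*
 * Number of leading blocks of the folio that the error path walks,
 * mirroring the min(...) expression in writepage_delalloc().
 */
static unsigned int toy_cleanup_bitmap_size(uint64_t page_start,
					    uint64_t last_finished_delalloc_end)
{
	uint64_t blocks = (last_finished_delalloc_end - page_start) >>
			  TOY_SECTORSIZE_BITS;

	/* A delalloc range may end beyond the folio, so clamp to one page. */
	if (blocks > TOY_SECTORS_PER_PAGE)
		return TOY_SECTORS_PER_PAGE;
	return (unsigned int)blocks;
}

/* File offset of block @bit inside the folio starting at @page_start. */
static uint64_t toy_block_offset(uint64_t page_start, unsigned int bit)
{
	return page_start + ((uint64_t)bit << TOY_SECTORSIZE_BITS);
}
```

For the failing folio at 1536K, `last_finished_delalloc_end` is set to `found_start` (1568K) before btrfs_run_delalloc_range() runs and is not advanced on failure, so the model yields 8: only blocks 0 to 7 are walked. The reported `submit_bitmap=8-15` has no bits set there, which suggests the loop marks nothing in this case and the failed range itself is left to btrfs_run_delalloc_range().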

fs/btrfs/inode.c  +1 -2

@@ -2419,8 +2419,7 @@ int btrfs_run_delalloc_range(struct btrfs_inode *inode, struct folio *locked_fol

 out:
 	if (ret < 0)
-		btrfs_cleanup_ordered_extents(inode, locked_folio, start,
-					      end - start + 1);
+		btrfs_cleanup_ordered_extents(inode, NULL, start, end - start + 1);
 	return ret;
 }

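
Why passing NULL matters can be sketched with a model of the old folio-skipping rule. This is a simplification based only on the condition the commit message names ("page start >= the range start"); the real btrfs_cleanup_ordered_extents() may check more than this, and `toy_folio` and `toy_old_skip_folio()` are hypothetical names.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a locked folio: just its file range. */
struct toy_folio {
	uint64_t start;	/* file offset of the first byte */
	uint64_t size;	/* folio size in bytes */
};

/*
 * Model of the old rule as described in the commit message: the locked
 * folio's part of the range is left for extent_writepage() to clean up
 * only when the folio starts at or after the range start.
 * With a NULL folio nothing is skipped and the whole range is cleaned.
 */
static bool toy_old_skip_folio(const struct toy_folio *locked,
			       uint64_t range_start)
{
	if (!locked)
		return false;
	return locked->start >= range_start;
}
```

For the folio at 1536K and the range starting at 1568K the model returns false, so the old code cleaned the in-folio ordered extents even though extent_writepage() would clean them again. Passing NULL makes "no skipping" the explicit contract, and the extent_io.c change removes the second cleanup for this failure.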
