Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Race condition in RPC filesystem cache. #7531

Merged
merged 4 commits into from
Mar 7, 2025

Conversation

dts
Copy link
Contributor

@dts dts commented Feb 20, 2025

Description

This fixes #6976. writeFile does not atomically write the file, and when the file is read by the other thread, it can (if the timing is just wrong), read an empty file. In the most common case, the /@vite/env module empty, and defines don't show up. In principle, however, I believe it can happen with any file.

Please don't delete this checklist! Before submitting the PR, please make sure you do the following:

  • It's really useful if your PR references an issue where it is discussed ahead of time. If the feature is substantial or introduces breaking changes without a discussion, PR might be closed.
  • Ideally, include a test that fails without this PR but passes with it.
    • There is a test case repository (https://github.com/timothympace/vitest-define-bug), which fails on master, and runs for hours and hours with this fix. I don't think it'll be easy to create a test case that causes the error reliably in this repository.
  • Please, don't make changes to pnpm-lock.yaml unless you introduce a new test example.

Tests

  • Run the tests with pnpm test:ci.
    • there are some errors locally that appear to be environment related, not related to my change.

Documentation

  • If you introduce new functionality, document it. You can run documentation with pnpm run docs command.

Changesets

  • Changes in changelog are generated from PR name. Please, make sure that it explains your changes in an understandable manner. Please, prefix changeset messages with feat:, fix:, perf:, docs:, or chore:.

Copy link

netlify bot commented Feb 20, 2025

Deploy Preview for vitest-dev ready!

Built without sensitive environment variables

Name Link
🔨 Latest commit 6b9b821
🔍 Latest deploy log https://app.netlify.com/sites/vitest-dev/deploys/67c1203f1ba381000883cb04
😎 Deploy Preview https://deploy-preview-7531--vitest-dev.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

@dts dts force-pushed the bugfix/rpc-race-condition branch 2 times, most recently from cfc61ef to 2e5acdf Compare February 22, 2025 00:38
@hi-ogawa
Copy link
Contributor

Thanks for tackling the issue! The idea sounds right, but maybe do you have any reference of this pattern? I feel readFile/writeFile atomic issue sounds common and people must have solved a similar issue somewhere in the past.

@dts
Copy link
Contributor Author

dts commented Feb 22, 2025

Here's a similar existing implementation: https://github.com/npm/write-file-atomic/blob/main/lib/index.js. It's well used (53M last week or whatever), and solves the problems in the same basic way (i.e. write to random file, move across, unlink if error), but done in a way that is externally identical to writeFile (i.e. it supports FDs in addition to paths, etc). I'd assume you don't want to take on another dependency, but that would be a totally fine solution. I'm happy to make that change if that's desired.

@hi-ogawa
Copy link
Contributor

hi-ogawa commented Feb 23, 2025

Thanks for the pointer and explanation 👍 Like you said, we don't likely want an extra dependency for this, but I was just wondering if there's any extended commentary of this approach since I'm not familiar with low level aspect of file system operation.

@dts
Copy link
Contributor Author

dts commented Feb 23, 2025

I wasn't able to find a nice, packaged article on this (maybe I should write one?) - but the summary is basically that writeFile has a bunch of intermediate states that it leaves on disk, the most common one that we're hitting in this bug is when the file is freshly emptied, but nothing's been written yet. If the file were larger, we'd probably hit more cases of half-written ones. In this PR, by writing it to a temporary file whose location is never broadcast, we get through all of these FS operations in a safely out-of-the-way place. Then, rename is a single/indivisible (atomic) change on the filesystem level. Processes will either hit the old version, or the new version - there is never a time when other processes could hit anything else. I am a little lost on why this is failing consistently on Windows, the way it writes that file is totally separate (when I run this test locally the atomicWriteFile isn't involved in writing the list.json piece.)

Copy link

pkg-pr-new bot commented Feb 24, 2025

@vitest/browser

npm i https://pkg.pr.new/@vitest/browser@7531

@vitest/coverage-istanbul

npm i https://pkg.pr.new/@vitest/coverage-istanbul@7531

@vitest/coverage-v8

npm i https://pkg.pr.new/@vitest/coverage-v8@7531

@vitest/expect

npm i https://pkg.pr.new/@vitest/expect@7531

@vitest/mocker

npm i https://pkg.pr.new/@vitest/mocker@7531

@vitest/pretty-format

npm i https://pkg.pr.new/@vitest/pretty-format@7531

@vitest/runner

npm i https://pkg.pr.new/@vitest/runner@7531

@vitest/snapshot

npm i https://pkg.pr.new/@vitest/snapshot@7531

@vitest/spy

npm i https://pkg.pr.new/@vitest/spy@7531

@vitest/ui

npm i https://pkg.pr.new/@vitest/ui@7531

@vitest/utils

npm i https://pkg.pr.new/@vitest/utils@7531

vite-node

npm i https://pkg.pr.new/vite-node@7531

vitest

npm i https://pkg.pr.new/vitest@7531

@vitest/web-worker

npm i https://pkg.pr.new/@vitest/web-worker@7531

@vitest/ws-client

npm i https://pkg.pr.new/@vitest/ws-client@7531

commit: 050ef1a

@hi-ogawa
Copy link
Contributor

hi-ogawa commented Feb 24, 2025

Now I'm digging into this and there might be a different issue. Here is a minimal simulation of the issue https://github.com/hi-ogawa/reproductions/blob/main/vitest-7531-atomic-write-file/repro.js

The issue is that promises.has(tmp) and promises.set(tmp, ...) don't run synchronously when !created.has(dir) causes await mkdir() is called, so this causes multiple writeFile to happen in parallel and properly awaited reader-side can end up reading the file which is being overwritten.

  if (promises.has(tmp)) { // 👈
    await promises.get(tmp);
    return { id: tmp };
  }
  if (!created.has(dir)) {
    await mkdir(dir, { recursive: true });
    created.add(dir);
  }
  promises.set( // 👈
    tmp,
    writeFile(tmp, code).finally(() => promises.delete(tmp)),
  );

By reversing if (promises.has(tmp)) and promises.has(tmp), my minimal demo seems fixed, so maybe is this enough?
(I pushed this version in #7546)

  if (!created.has(dir)) {
    await mkdir(dir, { recursive: true });
    created.add(dir);
  }
  if (promises.has(tmp)) { // 👈
    await promises.get(tmp);
    return { id: tmp };
  }
  promises.set( // 👈
    tmp,
    writeFile(tmp, code).finally(() => promises.delete(tmp)),
  );

@dts
Copy link
Contributor Author

dts commented Feb 24, 2025

Ooh! That's a promising/problematic race condition as well. Unfortunately, it doesn't appear to fix the specific case I'm working on (it failed after 37m of reruns, which is right in line with the main branch). I'll double check with the fix I'm proposing again, but last time it ran for hours without failing.

@hi-ogawa
Copy link
Contributor

hi-ogawa commented Feb 24, 2025

I think promise.has/set is a clear bug, so I'll go with merging #7546 first. Your atomicWriteFile does make sense to cover maybe other unknown scenarios, so we can continue looking into this. One thing we want to check is whether this change affects performance since this fetch cache is added to improve the perf #5592.

@dts
Copy link
Contributor Author

dts commented Feb 25, 2025

My confusion is still the persistent windows failure. I don't have a windows setup handy, how do you recommend debugging this?

@hi-ogawa
Copy link
Contributor

Oh, I thought Windows is just flaky on CI, but actually this change affects it somehow? I don't have Windows either and don't have a clue yet. Let me try a few CI rerun.

@hi-ogawa
Copy link
Contributor

Ah okay, this log says something https://github.com/vitest-dev/vitest/actions/runs/13467699629/job/37748891146?pr=7531#step:8:250 I'm not familiar with Windows fs, but I'd assume there's something special about it.

+ There were unhandled errors during test collection
  + {
  +   name: 'Error',
  +   stack: "Error: EPERM: operation not permitted, rename 'C:\\Users\\RUNNER~1\\AppData\\Local\\Temp\\uCLG3-rDwVAXN_u-29Rvw\\ssr\\.tmp-1740441023776-jla0syj4p9m' -> 'C:\\Users\\RUNNER~1\\AppData\\Local\\Temp\\uCLG3-rDwVAXN_u-29Rvw\\ssr\\40b9f1f37d5bd759c353e90078138580f67eb091'\n" +
  +     '    at async rename (node:internal/fs/promises:782:10)\n' +
  +     '    at async atomicWriteFile (file:///D:/a/vitest/vitest/packages/vitest/dist/chunks/resolveConfig.BuPGvfZo.js:6729:5)\n' +

@dts dts force-pushed the bugfix/rpc-race-condition branch from 2e5acdf to 0e3467b Compare February 25, 2025 19:30
@dts
Copy link
Contributor Author

dts commented Feb 25, 2025

I've made a windows-specific fallback for this case (where rename doesn't work because the target file already exists). I couldn't figure out a way to make the write atomic on windows. In principle, it would require sharing a lock between client & server for the cache.

@dts dts force-pushed the bugfix/rpc-race-condition branch from 0e3467b to c15c956 Compare February 25, 2025 19:32
@dts
Copy link
Contributor Author

dts commented Feb 25, 2025

An alternative solution would be to lock the files across the RPC bridge with something like proper-lockfile.

@hi-ogawa hi-ogawa added the p2-to-be-discussed Enhancement under consideration (priority) label Feb 28, 2025
@dts
Copy link
Contributor Author

dts commented Feb 28, 2025

With my diff (M2 MacBook Pro, 3 runs of pnpm test each)

Duration  5.55s (transform 1.50s, setup 547ms, collect 2.40s, tests 10.65s, environment 4.07s, prepare 5.76s)
Duration  5.50s (transform 1.31s, setup 509ms, collect 2.21s, tests 10.60s, environment 4.08s, prepare 5.73s)
Duration  5.66s (transform 1.53s, setup 647ms, collect 2.50s, tests 10.66s, environment 4.36s, prepare 5.79s)

Main:

Duration  5.70s (transform 1.32s, setup 542ms, collect 2.29s, tests 10.69s, environment 4.36s, prepare 6.02s)
Duration  5.54s (transform 1.77s, setup 521ms, collect 2.61s, tests 10.65s, environment 4.08s, prepare 5.52s)
Duration  5.74s (transform 1.68s, setup 506ms, collect 2.64s, tests 10.78s, environment 4.54s, prepare 5.90s)

No real difference, it would appear.

hi-ogawa
hi-ogawa previously approved these changes Feb 28, 2025
@hi-ogawa hi-ogawa requested a review from sheremet-va February 28, 2025 02:27
@dts dts force-pushed the bugfix/rpc-race-condition branch from c15c956 to 6b9b821 Compare February 28, 2025 02:32
Copy link
Contributor

@hi-ogawa hi-ogawa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The race condition might still exist somewhere else like #7546, but I think enforcing atomic write file here looks good to me, considering there's no perf impact and it doesn't require much code.

I wasn't involved in the original perf improvement PR #5592, so I'll wait for another reviews from the team 🙏

@@ -146,3 +150,21 @@ function handleRollupError(e: unknown): never {
}
throw e
}

async function atomicWriteFile(realFilePath: string, data: string): Promise<void> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs a few comments why we do this, and a link to this PR

@dts dts force-pushed the bugfix/rpc-race-condition branch from 6b9b821 to 050ef1a Compare March 6, 2025 22:36
@sheremet-va
Copy link
Member

Performance wise it should be fine because Vitest won't hit this code many times since it's cached. Let's merge it to see how it affects the ecosystem.

@sheremet-va sheremet-va merged commit b7f5526 into vitest-dev:main Mar 7, 2025
10 of 14 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cr-tracked p2-to-be-discussed Enhancement under consideration (priority)
Projects
Status: P2 - 2
Development

Successfully merging this pull request may close these issues.

Race condition in test runner when using "define" in the vite configuration
3 participants