Handle corrupt FS by offering to reformat #117

Closed
tannewt opened this issue Apr 11, 2017 · 18 comments

@tannewt
Member

tannewt commented Apr 11, 2017

Unplugging without ejecting can corrupt the FS. On start we should validate the FS and offer to fix/reformat it on error. It shouldn't be automatically done because then data may be lost.

@dhalbert
Collaborator

Workaround for now to zero out SPI flash on Express boards: Install https://github.com/adafruit/Adafruit_SPIFlash as a library in the Arduino IDE. From File->Examples, compile, load, and run the example Adafruit_SPIFlash->flash_erase. When CircuitPython restarts, it will recreate the filesystem on SPI flash. Don't forget to set the board type for your Express board or you will get compilation errors.

(Thanks @tdicola and @tannewt)

@dhalbert
Collaborator

Too many times I have unplugged or hard-reset a board and trashed the filesystem. Verifying the integrity of the filesystem is good, but there may be errors that are not so easily detected. I was thinking about other convenient ways to force a reset of the filesystem:

  1. A filesystem reset callable from the REPL.
  2. Prebuilt copies of spi_erase in .uf2 and .bin format included in each CircuitPython release, in case the REPL is not reachable due to nasty code in boot.py or main.py. spi_erase currently asks for confirmation via serial input, but maybe that should be removed in case a terminal program is not handy.
  3. On boards with extra buttons, provide a way to reset the filesystem when the board is reset. For instance, on CircuitPlayground Express, if both input pushbuttons were held down when the board was reset, that could be a signal to reset the filesystem. But maybe this is too easy: the mean kid in the class might clear out someone else's board.

@willingc
Collaborator

Just curious... What OS are you using, @dhalbert? What is running on the board before the unplug or hard reset: the REPL or a script? Are there any particular things that you are using, such as I2C, SPI, etc.?

The filesystem seems to become corrupted more often than I would expect by just unplugging.

@dhalbert
Collaborator

@willingc: This is Windows. The corruption I see is due to delayed writes. There's a lot of gory detail and some red herrings in bug #111. See https://superuser.com/questions/1197897/windows-delays-writing-fat-table-on-usb-drive-despite-quick-removal for a summary. Feel free to write to me at the email in my github profile.

@willingc
Collaborator

Hmm... I can see how delayed writes would be a pain to work with. I suspect that for the corruption to occur there is an active, in-progress write when the reset/unplug happens, rather than simply a write that is delayed but not yet started. I'll do some thinking on this while I'm traveling. I'm a bit curious about what state the MP/CP code leaves things in when facing an unexpected loss of power during a write.

@dhalbert
Collaborator

The problem I am seeing in Windows is specifically that the File Allocation Table (FAT) entries for a file are updated 20-90 seconds after the directory info for the file and the file data itself are written. These entries mark the first block used for a file and then contain a chain of pointers to subsequent blocks. So if you pull the plug or press the reset button at any point during those 20-90 seconds, the file system will be inconsistent. It does not have to be literally during a write. I do an "Eject" every time after I update the filesystem and before I run anything. I also turn off auto-reset.

I am not sure exactly what might cause Windows to detect that the filesystem is corrupt.
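
To make the ordering problem above concrete, here is a toy Python model of a FAT-style volume. It is only a sketch under stated assumptions: `ToyVolume` and its methods are hypothetical names, and this is not how FatFS or Windows is implemented. The data and directory entry are written in one step, and the chain of cluster pointers is written in a second step that stands in for the delayed write.

```python
# Toy model only: all names are hypothetical and exist to show the ordering issue.

END_OF_CHAIN = -1   # marker stored in the table for the last cluster of a file
FREE = 0            # marker for an unallocated cluster


class ToyVolume:
    def __init__(self, cluster_count):
        self.fat = [FREE] * cluster_count     # allocation table
        self.data = [b""] * cluster_count     # cluster contents
        self.directory = {}                   # name -> (first cluster, cluster count)

    def write_data_and_directory(self, name, first_cluster, chunks):
        """Step 1: file data and the directory entry reach the device."""
        for offset, chunk in enumerate(chunks):
            self.data[first_cluster + offset] = chunk
        self.directory[name] = (first_cluster, len(chunks))

    def write_fat_chain(self, name):
        """Step 2: the chain of cluster pointers reaches the device (the delayed part)."""
        first, count = self.directory[name]
        for cluster in range(first, first + count - 1):
            self.fat[cluster] = cluster + 1
        self.fat[first + count - 1] = END_OF_CHAIN

    def is_consistent(self, name):
        """Follow the chain from the directory entry; it must end exactly where expected."""
        first, count = self.directory[name]
        cluster = first
        for _ in range(count - 1):
            nxt = self.fat[cluster]
            if nxt in (FREE, END_OF_CHAIN):
                return False                  # chain is missing or too short
            cluster = nxt
        return self.fat[cluster] == END_OF_CHAIN


vol = ToyVolume(cluster_count=16)
# Start at cluster 2 so that no chain pointer ever collides with the FREE marker.
vol.write_data_and_directory("code.py", first_cluster=2, chunks=[b"a", b"b", b"c"])
unplugged_early = vol.is_consistent("code.py")   # reset before the table update
vol.write_fat_chain("code.py")
after_flush = vol.is_consistent("code.py")       # table update has landed
print(unplugged_early, after_flush)
```

In this model the volume is inconsistent for the whole interval between the two steps, even though no write is in progress during that time.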

@willingc
Collaborator

So to rephrase: "If the encapsulated write process (from initial write of the directory info to update of the final pointer in the chain of pointers to storage blocks) is active, any interruption (loss of power/hard reset) or standard "Eject" can leave the filesystem in an incomplete/corrupt state."

Is there a particular error message that you get when reconnecting the CP board to the Windows machine?

An interesting test would be to see if the CP board could be read by Linux or macOS after Windows reports corruption.

@tannewt
Member Author

tannewt commented Apr 19, 2017

Thank you for all of the good thinking on this! I'd love to have better ways of recovering from this.

There are two potential failure modes I know of:

  1. Hard reset or power loss during a write (when the status LED is red). This would cause a partial sector write to the flash which would be bad. I don't know any way of preventing this.
  2. Hard reset or power loss while the host OS has cached writes. This would cause some blocks on the SPI flash to be updated and others not. This can be avoided by safely ejecting prior to hard reset or power loss. The safe ejection flushes the cache and should leave the SPI flash FS consistent. CircuitPython can't do anything to prevent this because its only responsibility is reading and writing blocks at the direction of the host.

Auto-reset does make failure mode 2 a bit confusing because CircuitPython will attempt to read the FS in this intermediate state. This usually results in a spurious syntax error. Reading it shouldn't cause any corruption, though.
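
A rough way to picture the second failure mode is a host-side write cache in front of a block device. The sketch below is a simplified model, not real USB or OS code; `CachedBlockDevice` and its methods are hypothetical names chosen for illustration.

```python
class CachedBlockDevice:
    """Hypothetical host-side view of a block device with a write cache."""

    def __init__(self, block_count):
        self.flash = {n: b"old" for n in range(block_count)}  # contents of the SPI flash
        self.pending = {}                                       # writes the host has not sent yet

    def write(self, block, payload, write_through=False):
        if write_through:
            self.flash[block] = payload      # sent to the device immediately
        else:
            self.pending[block] = payload    # held in the host cache for now

    def eject(self):
        """Safe ejection: flush every cached write, then empty the cache."""
        self.flash.update(self.pending)
        self.pending.clear()

    def hard_reset(self):
        """Reset or unplug: cached writes are lost, flash keeps whatever it had."""
        self.pending.clear()
        return dict(self.flash)


dev = CachedBlockDevice(block_count=4)
dev.write(2, b"new data", write_through=True)   # file contents go out promptly
dev.write(0, b"new FAT")                        # allocation table update is cached
after_reset = dev.hard_reset()                  # block 2 is new, block 0 is still old

dev.write(0, b"new FAT")
dev.eject()                                     # flush before unplugging
after_eject = dict(dev.flash)
```

The device in this model only stores the blocks it is given, which matches the point above that the mixed state comes from the host and not from CircuitPython.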

@dhalbert have you tried using Windows to reformat the drive? I think it should work because CircuitPython is just being a dumb block device.

Thanks!

@willingc
Collaborator

1 is an age-old problem ;-)

As for 2, just to clarify, is Windows reporting corruption after the "Safe Ejection"? That would be a Windows bug for not flushing the cache. I won't have a Windows machine until Tuesday but I will try out some options when I have access.

@dhalbert
Collaborator

dhalbert commented Apr 19, 2017

@willingc: As for 2, just to clarify, is Windows reporting corruption after the "Safe Ejection"? That would be a Windows bug for not flushing the cache.

No, Safe Ejection prevents the corruption. The problem is a hard reset or disconnect before the cached writes happen, by not waiting long enough or by not doing an Eject. A good way to get corruption is to write several files and then press the reset button a few seconds later, without doing an Eject. I always do an Eject after I copy files, to force the writes and avoid this.

Also, as I mentioned in the superuser.com posting, USB flash drives are by default set to "Quick Removal", so the writes should not be delayed. But a few of them (the FAT table ones) still are. I do think this is some kind of bug, but it's very long-standing. And even if it's fixed, that fix may not get propagated to many older systems. I am trying to get the attention of some knowledgeable person at Microsoft, but that's difficult.

@tannewt: @dhalbert have you tried using Windows to reformat the drive? I think it should work because CircuitPython is just being a dumb block device.

I think I tried this, but it maybe didn't work. FatFS is very minimal. For instance, I think typically FAT16 has two copies of the FAT table for safety, and FatFS creates just one when asked to create a filesystem. I'll try again soon to verify or not.

@tannewt
Member Author

tannewt commented Apr 19, 2017

Yeah, that's true about FatFS being minimal. I think it actually creates FAT12. I'm definitely open to ideas on how to recover from problems like this.
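
For context on the FAT12 guess: in Microsoft's FAT specification the variant is decided by the number of data clusters (fewer than 4085 is FAT12, fewer than 65525 is FAT16, otherwise FAT32). The Python sketch below applies that rule. The geometry in the example (a 2 MiB volume, 512-byte sectors, one FAT copy, 512 root entries) is an illustrative assumption, not something read from a CircuitPython board or from FatFS.

```python
def count_clusters(total_sectors, reserved_sectors, num_fats, sectors_per_fat,
                   root_entries, bytes_per_sector, sectors_per_cluster):
    """Number of data clusters, following the calculation in the FAT specification."""
    # Each root directory entry is 32 bytes; round up to whole sectors.
    root_dir_sectors = (root_entries * 32 + bytes_per_sector - 1) // bytes_per_sector
    data_sectors = total_sectors - (
        reserved_sectors + num_fats * sectors_per_fat + root_dir_sectors
    )
    return data_sectors // sectors_per_cluster


def fat_type(cluster_count):
    """Classify a volume by its cluster count."""
    if cluster_count < 4085:
        return "FAT12"
    if cluster_count < 65525:
        return "FAT16"
    return "FAT32"


# Hypothetical geometry for a 2 MiB volume (4096 sectors of 512 bytes).
clusters = count_clusters(
    total_sectors=4096, reserved_sectors=1, num_fats=1, sectors_per_fat=12,
    root_entries=512, bytes_per_sector=512, sectors_per_cluster=1,
)
print(clusters, fat_type(clusters))
```

With these assumed numbers the volume has 4051 clusters, which lands just under the FAT12 limit.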

@dhalbert do you use the reset button frequently? The goal is to have auto-reset and soft reset work the majority of the time.

@dhalbert
Collaborator

I don't usually use the reset button (I know better now), but Windows just added serial support in "Windows Subsystem for Linux" (the bash shell environment they have now), and I was (unsuccessfully) trying it out. So there was a lot of plugging/unplugging/resetting and I was writing infinite loops to send characters.

The reset button is very tempting for the average user.

If there is a programmatic way to force the delayed writes to complete or do an Eject, maybe we could add code to do that to some recommended editor for Windows, like Mu (though I have had other troubles with Mu). That would be a less manual fix for Windows. I am not wedded to Windows by any means but many customers will be using Windows and I want to give CircuitPython a workout on it.
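
One thing an editor could try after each save is an explicit flush followed by `os.fsync()`, as in the Python sketch below. This is an untested assumption as far as this issue goes: nothing in this thread confirms that it forces Windows to write the delayed FAT entries on a removable drive. `write_and_flush` is a hypothetical helper name, and the temporary directory only stands in for the CIRCUITPY drive.

```python
import os
import tempfile


def write_and_flush(path, text):
    """Write text to path and ask the OS to push it to the storage device."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(text)
        handle.flush()               # move Python's own buffer to the OS
        os.fsync(handle.fileno())    # ask the OS to flush its cache for this file


with tempfile.TemporaryDirectory() as folder:    # stand-in for the CIRCUITPY drive
    target = os.path.join(folder, "code.py")
    write_and_flush(target, "print('hello')\n")
    with open(target, encoding="utf-8") as handle:
        written = handle.read()
```

Even if this turns out to help, an Eject is still the only approach described in this thread that is known to avoid the corruption.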

@tannewt
Member Author

tannewt commented Apr 19, 2017

Ok, yeah. I do hard resets often to reload CircuitPython itself too.

I'm not sure what options we have on the host side for forcing writes. I know that CircuitPython can't do anything. We could switch to MTP, which relies on the device maintaining the file system, but it has its own problems (like no macOS support).

@dhalbert
Collaborator

Followup: I tried reformatting the filesystem to FAT via Windows this morning. It did not complain when formatting CIRCUITPY, and I could copy files in, but the FatFS could not read them. uos.listdir() showed nothing and uos.mkdir() raised an OSError, so it appears FatFS cannot deal with the filesystem Windows created.
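
The two checks mentioned above (listing the root and creating a directory) can be wrapped into a small probe. This is a minimal sketch, assuming the standard `os` calls `listdir`, `mkdir` and `rmdir`; `probe_filesystem` and the scratch directory name are hypothetical. An empty listing alone does not prove corruption, so the probe only reports hard failures.

```python
import os
import tempfile


def probe_filesystem(root, scratch_name="fs_probe_tmp"):
    """Return (ok, detail) after trying to list root and create a scratch directory."""
    try:
        entries = os.listdir(root)
    except OSError as exc:
        return False, "listdir failed: {}".format(exc)
    scratch = root.rstrip("/") + "/" + scratch_name
    try:
        # Note: this also fails if a directory with the scratch name already exists.
        os.mkdir(scratch)
        os.rmdir(scratch)
    except OSError as exc:
        return False, "mkdir failed: {}".format(exc)
    return True, "{} entries listed".format(len(entries))


with tempfile.TemporaryDirectory() as folder:    # stand-in for a healthy drive
    ok, detail = probe_filesystem(folder)

# The temporary directory is gone now, so this path cannot be listed.
missing_ok, missing_detail = probe_filesystem(os.path.join(folder, "gone"))
```

A failed probe could be the trigger for offering a reformat, with the decision left to the user.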

@tannewt
Member Author

tannewt commented Apr 20, 2017 via email

@dhalbert
Collaborator

To help out a user, I built a flash_erase .uf2 for Metro M0 Express that doesn't prompt the user. It just erases the flash and then blinks to indicate success or failure. Similar versions could be built for Feather M0 Express and CP Express. (Not sure if this will work for all.) See https://forums.adafruit.com/viewtopic.php?f=60&t=118427&p=594953#p594953.

Code is just the original flash_erase, pruned to almost nothing. This is just hacked up for now. I should submit a proper pull request to the flash library.

// Adafruit SPI Flash Total Erase
// Author: Tony DiCola
//
// This example will perform a complete erase of ALL data on the SPI
// flash.  This is handy to reset the flash into a known empty state
// and fix potential filesystem or other corruption issues.
//
// !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
// !!  NOTE: YOU WILL ERASE ALL DATA BY RUNNING THIS SKETCH!  !!
// !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
//
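// Usage:
// - Modify the pins and flash type in the config section below
//   if necessary (usually not necessary).
// - Upload this sketch to your M0 express board.
// - Watch the built-in LED: 1 blink means the erase succeeded,
//   2 means the flash could not be initialized, 3 means the erase failed.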

#include <SPI.h>
#include <Adafruit_SPIFlash.h>

// Configuration of the flash chip pins and flash fatfs object.
// You don't normally need to change these if using a Feather/Metro
// M0 express board.
#define FLASH_TYPE     SPIFLASHTYPE_W25Q16BV  // Flash chip type.
                                              // (The original example's fatfs
                                              // object is not used in this
                                              // pruned version.)

#define FLASH_SS       SS1                    // Flash chip SS pin.
#define FLASH_SPI_PORT SPI1                   // What SPI port is Flash on?

Adafruit_SPIFlash flash(FLASH_SS, &FLASH_SPI_PORT);     // Use hardware SPI

// Alternatively you can define and use non-SPI pins!
//Adafruit_SPIFlash flash(SCK1, MISO1, MOSI1, FLASH_SS);


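void setup() {
  // Initialize flash library and check its chip ID.
  // blink() never returns, so each failure path stops here.
  if (!flash.begin(FLASH_TYPE)) {
    blink(2);  // Two blinks: could not initialize the flash chip.
  }
  if (!flash.EraseChip()) {
    blink(3);  // Three blinks: chip erase failed.
  }
  blink(1);    // One blink: erase succeeded.
}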

void loop() {
  // Nothing to do in the loop.
  delay(100);
}

// Blink the built-in LED `times` times, pause, and repeat forever.
// This function never returns.
void blink(int times) {
  pinMode(LED_BUILTIN, OUTPUT);
  while (1) {
    for (int i = 0; i < times; i++) {
      digitalWrite(LED_BUILTIN, HIGH);
      delay(100);
      digitalWrite(LED_BUILTIN, LOW);
      delay(100);
    }
    delay(1000);
  }
}

@tannewt tannewt modified the milestone: Long term Aug 1, 2017
@tannewt
Member Author

tannewt commented Sep 1, 2017

I don't think we should offer to reformat from within CircuitPython. There is now a troubleshooting post about recovering from this issue: https://circuitpython.readthedocs.io/en/latest/docs/troubleshooting.html#file-system-issues

@tannewt tannewt closed this as completed Sep 1, 2017
@jepler
Member

jepler commented Nov 27, 2019

Erasing the storage is now supported within CircuitPython via the storage module: https://circuitpython.readthedocs.io/en/4.x/shared-bindings/storage/__init__.html#storage.erase_filesystem

(Note added in case someone else besides me arrives via search)
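
Since the original concern was that a reformat should be offered and not done automatically, a call to `storage.erase_filesystem()` could be put behind an explicit confirmation. The Python sketch below is one hedged way to do that. `erase_if_confirmed` and `FakeStorage` are hypothetical names; the fake module exists only so the sketch runs on a desktop, where the real `storage` module is not available.

```python
def erase_if_confirmed(answer, storage_module):
    """Call storage_module.erase_filesystem() only when the user typed 'yes'."""
    if answer.strip().lower() != "yes":
        return False
    storage_module.erase_filesystem()   # on a real board this wipes CIRCUITPY
    return True


class FakeStorage:
    """Stand-in for the real storage module, for desktop testing only."""

    def __init__(self):
        self.erased = False

    def erase_filesystem(self):
        self.erased = True


fake = FakeStorage()
declined = erase_if_confirmed("no", fake)
erased_after_decline = fake.erased
accepted = erase_if_confirmed(" YES ", fake)

# On a board, from the REPL (this destroys all files on CIRCUITPY):
#   import storage
#   erase_if_confirmed(input("Erase CIRCUITPY? Type yes: "), storage)
```

On real hardware, as I understand the linked documentation, the board restarts after the erase, so the final `return True` may never be reached there.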
