Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sled-agent fails to clean up old zone-bundler snapshots on startup #7746

Open
jgallagher opened this issue Mar 6, 2025 · 0 comments
Open

Comments

@jgallagher
Copy link
Contributor

Following up on #7743, @smklein and I were trying to figure out why dogfood had a handful of very old (late 2023) zone-archives snapshots. Early in its life sled-agent creates a ZoneBundler, and one of the first things it does is try to find and destroy old snapshots in initialize_zfs_resources:

fn initialize_zfs_resources(log: &Logger) -> Result<(), BundleError> {
let zb_snapshots =
Zfs::list_snapshots().unwrap().into_iter().filter(|snap| {
// Check for snapshots named how we expect to create them.
if snap.snap_name != ZONE_ROOT_SNAPSHOT_NAME
|| !snap.snap_name.starts_with(ARCHIVE_SNAPSHOT_PREFIX)
{
return false;
}
// Additionally check for the zone-bundle-specific property.
//
// If we find a dataset that matches our names, but which _does not_
// have such a property, we'll panic rather than possibly deleting
// user data.
let name = snap.to_string();
let value =
Zfs::get_oxide_value(&name, ZONE_BUNDLE_ZFS_PROPERTY_NAME)
.unwrap_or_else(|_| {
panic!(
"Found a ZFS snapshot with a name reserved for \
zone bundling, but which does not have the \
zone-bundle-specific property. Bailing out, \
rather than risking deletion of user data. \
snap_name = {}, property = {}",
&name, ZONE_BUNDLE_ZFS_PROPERTY_NAME
)
});
assert_eq!(value, ZONE_BUNDLE_ZFS_PROPERTY_VALUE);
true
});
for snapshot in zb_snapshots {
Zfs::destroy_snapshot(&snapshot.filesystem, &snapshot.snap_name)?;
debug!(
log,
"destroyed pre-existing zone bundle snapshot";
"snapshot" => %snapshot,
);
}
Ok(())
}
.

There are a handful of bugs here:

  1. The boolean conditional on line 75 should be &&, not ||. As written, it is never true, so we never proceed to checking for the zone-bundle-specific property. (In hindsight, this is saving us!)
  2. On line 87, we call get_oxide_value(_, ZONE_BUNDLE_ZFS_PROPERTY_NAME). ZONE_BUNDLE_ZFS_PROPERTY_NAME is "oxide:for-zone-bundle"; however, get_oxide_value then prepends "oxide:" itself, which means we're erroneously querying for oxide:oxide:for-zone-bundle.
  3. Inside get_oxide_value, we call get_values(..., Some(PropertySource::Local)). When get_values is given a non-None property source, it inserts -s $source in the command line args, but in the wrong place: it needs to come before all_names. As written, the zfs get invocation will fail.
  4. Back in initialize_zfs_resources: on line 88 we panic on any failure from the get_oxide_value() call, which at the moment is guaranteed to fail due to items 2 and 3. This could also fail for any spurious reason. Probably we should log a warning and return false here instead (which would skip destroying the dataset, but that seems okay)?
  5. On line 98 we assert that the value we read back is true, which could also induce a panic. Can we log a warning and return false here too?

I believe fixing this set all together would let dogfood clean up these old snapshots as intended. But we've also discussed removing zone bundles altogether now that support bundles are coming along. If we do that we may want to manually clean up these old snapshots (and possible check customer systems for them?).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant