Following up on #7743, @smklein and I were trying to figure out why dogfood had a handful of very old (late 2023) zone-archives snapshots. Early in its life, sled-agent creates a ZoneBundler, and one of the first things it does is try to find and destroy old snapshots in initialize_zfs_resources (omicron/sled-agent/src/zone_bundle.rs, lines 70 to 110 in 078678f).

There are a handful of bugs here:
1. The boolean conditional on line 75 should be &&, not ||. As written, it is never true, so we never proceed to checking for the zone-bundle-specific property. (In hindsight, this is saving us!)
2. On line 87, we call get_oxide_value(_, ZONE_BUNDLE_ZFS_PROPERTY_NAME). ZONE_BUNDLE_ZFS_PROPERTY_NAME is "oxide:for-zone-bundle"; however, get_oxide_value then prepends "oxide:" itself, which means we're erroneously querying for oxide:oxide:for-zone-bundle.
3. Inside get_oxide_value, we call get_values(..., Some(PropertySource::Local)). When get_values is given a non-None property source, it inserts -s $source in the command line args, but in the wrong place: it needs to come before all_names. As written, the zfs get invocation will fail.
4. Back in initialize_zfs_resources: on line 88 we panic on any failure from the get_oxide_value() call, which at the moment is guaranteed to fail due to items 2 and 3. The call could also fail for unrelated, spurious reasons. Probably we should log a warning and return false here instead (which would skip destroying the dataset, but that seems okay)?
5. On line 98 we assert that the value we read back is true, which could also induce a panic. Can we log a warning and return false here too?
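To make items 2 through 5 concrete, here is a rough sketch of what the fixes might look like. This is not the omicron code: full_property_name, zfs_get_args, and is_zone_bundle_snapshot are hypothetical helpers invented for illustration, the error type is simplified to String, and eprintln! stands in for whatever logger sled-agent actually uses. Tolerating an already-prefixed name is only one option for item 2; passing the bare name to get_oxide_value would work too.

```rust
// Hypothetical stand-in for the namespace prefix that get_oxide_value prepends.
const OXIDE_PREFIX: &str = "oxide:";

// Item 2: build the full property name without doubling the prefix when the
// caller already passed a prefixed name such as "oxide:for-zone-bundle".
fn full_property_name(name: &str) -> String {
    if name.starts_with(OXIDE_PREFIX) {
        name.to_string()
    } else {
        format!("{OXIDE_PREFIX}{name}")
    }
}

// Item 3: build `zfs get` arguments with `-s <source>` placed before the
// property names, which is where zfs expects options to appear.
fn zfs_get_args(dataset: &str, names: &[&str], source: Option<&str>) -> Vec<String> {
    let mut args: Vec<String> = ["get", "-H", "-o", "value"]
        .iter()
        .map(|s| s.to_string())
        .collect();
    if let Some(src) = source {
        args.push("-s".to_string());
        args.push(src.to_string());
    }
    // Property names (comma-separated) come after all options...
    args.push(names.join(","));
    // ...and the dataset or snapshot name comes last.
    args.push(dataset.to_string());
    args
}

// Items 4 and 5: decide whether a snapshot is ours without panicking. Any
// lookup failure or unexpected value logs a warning and skips the snapshot.
fn is_zone_bundle_snapshot(lookup: Result<String, String>) -> bool {
    match lookup {
        Ok(value) if value == "true" => true,
        Ok(value) => {
            eprintln!("warning: unexpected property value {value:?}; skipping snapshot");
            false
        }
        Err(err) => {
            eprintln!("warning: failed to read property: {err}; skipping snapshot");
            false
        }
    }
}
```

With these helpers, a lookup for the local-source property on a snapshot would produce arguments along the lines of get -H -o value -s local oxide:for-zone-bundle <snapshot>, and a failed or surprising lookup leaves the snapshot in place rather than taking down sled-agent.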
I believe fixing all of these together would let dogfood clean up these old snapshots as intended. But we've also discussed removing zone bundles altogether now that support bundles are coming along. If we do that, we may want to manually clean up these old snapshots (and possibly check customer systems for them?).