Allow ReadInternalStore to be pickled #34
Conversation
def _callback_info(cb):
    return {
        "offset": cb(_attr="offset"),
        "uri": cb(_attr="_fd")().uri,
The private _attr argument is used here (as is needed for ndarray, as noted in the code). Perhaps this can inform what a "low level" block API would look like. Because this relies on the offset and the uri, it will not work (and will likely produce junk data) after an update of the backing ASDF file. Minimally we should document this danger, and it may be better to check the file mode and not allow pickling when the file is opened 'rw'.
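The mode check suggested here could look something like the following sketch. The class and attribute names are hypothetical stand-ins, not the actual ReadInternalStore implementation:

```python
import pickle


class StoreSketch:
    # Hypothetical stand-in for a store backed by an open ASDF file;
    # not the real asdf-zarr API.
    def __init__(self, uri, mode="r"):
        self._uri = uri
        self._mode = mode

    def __getstate__(self):
        # Refuse to pickle when the backing file was opened read-write:
        # block offsets may change after an update, so a pickled copy
        # could read junk data.
        if self._mode != "r":
            raise pickle.PicklingError(
                f"cannot pickle store for file opened {self._mode!r}"
            )
        return {"_uri": self._uri, "_mode": self._mode}

    def __setstate__(self, state):
        self.__dict__.update(state)
```

Raising from __getstate__ makes the failure loud at pickling time rather than producing bad data later on a worker.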
def _to_callback(info):
    def cb():
        with asdf.generic_io.get_file(info["uri"], mode="r") as gf:
            return asdf._block.io.read_block(gf, info["offset"])[-1]
This is another instance of using non-public asdf API. I would be happy to see 3.x include at least a provisional introduction of a lower-level API that would include a function that could replace this (one that loads a block from an ASDF file without parsing the tree, etc.).
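The shape of such a replacement can be sketched with plain file I/O. The dict keys and the treatment of the block as raw bytes of known length are illustrative assumptions; a real implementation would parse the ASDF block header at the offset:

```python
def to_callback(info):
    # Rebuild a zero-argument chunk reader from a plain, picklable dict.
    # `path`, `offset`, and `length` are assumed keys for this sketch;
    # the block is assumed to be raw bytes rather than a framed ASDF block.
    def cb():
        with open(info["path"], "rb") as f:
            f.seek(info["offset"])
            return f.read(info["length"])
    return cb
```

Because the callback is rebuilt from a plain dict, only picklable data ever crosses process boundaries.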
self._chunk_info = state["_chunk_info"]
self._chunk_callbacks = {k: _to_callback(self._chunk_info[k]) for k in self._chunk_info}
# as __init__ will not be called on self, set up attributes expected
This could be improved by changing InternalStore to not rely on code in __init__. I will open a follow-up issue.
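The overall pattern being reviewed here can be sketched as follows (illustrative names, not the exact asdf-zarr code): drop the unpicklable callbacks in __getstate__ and rebuild them from the saved info in __setstate__, which also handles the re-pickling that dask may perform when moving chunks between workers:

```python
import pickle


def _rebuild_callback(info):
    # Stand-in for _to_callback: recreate a reader from picklable info.
    # A real store would read bytes at info["offset"]; this just echoes
    # the info so the sketch is self-contained.
    def cb():
        return info
    return cb


class PicklableStoreSketch:
    def __init__(self, chunk_info):
        self._chunk_info = chunk_info
        self._chunk_callbacks = {
            k: _rebuild_callback(v) for k, v in chunk_info.items()
        }

    def __getstate__(self):
        # The callbacks close over file state and cannot be pickled;
        # keep only the info dicts they were built from.
        return {"_chunk_info": self._chunk_info}

    def __setstate__(self, state):
        # __init__ is not called during unpickling, so recreate every
        # attribute the object expects, including the callbacks.
        self._chunk_info = state["_chunk_info"]
        self._chunk_callbacks = {
            k: _rebuild_callback(v) for k, v in self._chunk_info.items()
        }


store = PicklableStoreSketch({"0.0": {"offset": 1024}})
# round-trip twice, since dask may pickle an already-unpickled store
copy = pickle.loads(pickle.dumps(pickle.loads(pickle.dumps(store))))
```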
To use a zarr array read from an ASDF file (with chunks stored as internal ASDF blocks) with dask distributed processing, the store that provides the chunk data must be pickleable.
This PR makes ReadInternalStore pickleable (and re-pickleable, as dask may transfer chunks between workers). The following example was used for prototyping:
The above example generates a large (17k x 37k) image, saves it to an ASDF file as a chunked array, loads the saved file, converts the chunked array to a dask array and computes the sum (using dask distributed). The screenshot below is of a single run showing the computation distributed across 10 workers.
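The reason pickling matters can be mimicked without dask: process-based schedulers like dask.distributed ship each per-chunk task to a worker by pickling it. This is a dependency-free sketch of that pattern, not the actual prototype script:

```python
import pickle


def chunk_sum(chunk):
    # per-chunk partial computation, analogous to one dask task
    return sum(chunk)


# split a "large" dataset into chunks, one task per chunk
data = list(range(100))
chunks = [data[i:i + 10] for i in range(0, 100, 10)]

# simulate worker transfer: each task must survive a pickle round trip
partials = []
for chunk in chunks:
    func, arg = pickle.loads(pickle.dumps((chunk_sum, chunk)))
    partials.append(func(arg))

# combine the partial results, as dask does for a distributed sum
total = sum(partials)
```

If anything a task references (such as the chunk store) fails to pickle, the scheduler cannot dispatch it, which is exactly the failure this PR removes.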
As it's likely that users will want to use dask, we should consider options for testing dask compatibility (and specifically dask distributed compatibility) in the CI.