
Allow ReadInternalStore to be pickled #34

Merged
braingram merged 2 commits into asdf-format:main from chunky_pickle on Oct 18, 2023

Conversation

braingram (Contributor)

To use a zarr array read from an ASDF file (with chunks stored as internal ASDF blocks) with dask distributed processing, the store that provides the chunk data must be pickleable.

This PR makes ReadInternalStore pickleable (and re-pickleable, as dask may transfer chunks between workers).

The following example was used for prototyping:

import os

import asdf
import dask.distributed
import dask.array
import zarr
import numpy as np

import asdf_zarr.storage


if __name__ == '__main__':
    fn = 'big_test.asdf'

    if not os.path.exists(fn):
    shape = (17000, 37000)  # ~630 MB as uint8
        a = np.arange(np.prod(shape), dtype='uint8').reshape(shape)
        z = zarr.array(a, chunks=(1000, 1000), compressor=None)
        af = asdf.AsdfFile()
        af['z'] = asdf_zarr.storage.to_internal(z)
        af.write_to(fn)


    client = dask.distributed.Client()
    print(f"client at {client.dashboard_link}")


    with asdf.open(fn) as af:
        z = af['z']
        d = dask.array.from_zarr(z)
        input("press enter to compute sum")
        print("Computing sum...")
        s = d.sum().compute()
        print(f"\tsum={s}")
        input("press enter to exit")

The above example generates a large (17k x 37k) image, saves it to an ASDF file as a chunked array, loads the saved file, converts the chunked array to a dask array, and computes the sum using dask distributed. The screenshot below is from a single run and shows the computation distributed across 10 workers.

[Screenshot: dask dashboard during the run, Oct 17, 2023]
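For reference, the result of the sum can be predicted analytically. A stdlib-only sketch (the helper is hypothetical; it assumes np.arange with a uint8 dtype wraps modulo 256, so the generated array cycles through the values 0..255):

```python
# Analytic check of the sum computed by the script above (hypothetical
# helper; assumes np.arange(n, dtype='uint8') wraps modulo 256, so the
# array cycles through the values 0..255).
def expected_uint8_arange_sum(n):
    full_cycles, remainder = divmod(n, 256)
    # Each full cycle sums to 0 + 1 + ... + 255 = 32640; the partial
    # cycle at the end contributes 0 + 1 + ... + (remainder - 1).
    return full_cycles * (255 * 256 // 2) + remainder * (remainder - 1) // 2

print(expected_uint8_arange_sum(17000 * 37000))  # → 80197493856
```

This gives a cheap fixed value to assert against in a test, without materializing the 630 MB array.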

As it's likely that users will want to use dask, we should consider options for testing dask compatibility (and specifically dask distributed compatibility) in CI.
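One option is a small test helper that pushes a reduction through a local distributed cluster. A sketch under assumptions (the function name and the in-process cluster are made up here; dask[distributed] would be an optional test dependency, and a real test would read a zarr array from an ASDF file instead of using a plain numpy array):

```python
import numpy as np


def distributed_sum(values, n_workers=2):
    """Sum an array through a local dask.distributed cluster.

    Hypothetical helper for a CI test; the same pattern would be used
    with an asdf-backed zarr array in a real test.
    """
    import dask.array as da
    import dask.distributed

    # processes=False keeps the cluster in-process, which is simpler and
    # less flaky in CI than spawning worker processes.
    with dask.distributed.Client(
        n_workers=n_workers, threads_per_worker=1, processes=False
    ) as client:
        d = da.from_array(values, chunks=max(1, len(values) // n_workers))
        return int(d.sum().compute())
```

Importing dask inside the function keeps the module importable (and the test skippable) when dask is not installed.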

@braingram braingram requested a review from a team as a code owner October 17, 2023 21:20
Cadair commented Oct 18, 2023

🥳

def _callback_info(cb):
    return {
        "offset": cb(_attr="offset"),
        "uri": cb(_attr="_fd")().uri,
braingram (Contributor, Author)
The private _attr argument is used here (as is needed for ndarray, as noted in the code). Perhaps this can inform what a "low level" block API would look like. Because this relies on the offset and the uri, it will not work (and will likely produce junk data) after an update of the backing ASDF file. Minimally we should document this danger; it may be better to check the file mode and not allow pickling when the file is opened 'rw'.
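The suggested safeguard could look something like this (a minimal sketch with a hypothetical class, not the actual ReadInternalStore):

```python
import pickle


class StoreSketch:
    """Hypothetical store illustrating the suggested safeguard: refuse
    to pickle when the backing file is writable, since an update could
    move blocks and invalidate the saved offsets."""

    def __init__(self, mode, chunk_info):
        self._mode = mode              # e.g. "r" or "rw"
        self._chunk_info = chunk_info  # e.g. {key: {"offset": ..., "uri": ...}}

    def __getstate__(self):
        if "w" in self._mode:
            raise pickle.PicklingError(
                "refusing to pickle a store backed by a writable file; "
                "block offsets may change after an update"
            )
        return self.__dict__

    def __setstate__(self, state):
        self.__dict__.update(state)


# Pickling a read-only store succeeds; a writable one raises.
readable = pickle.loads(pickle.dumps(StoreSketch("r", {})))
```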

def _to_callback(info):
    def cb():
        with asdf.generic_io.get_file(info["uri"], mode="r") as gf:
            return asdf._block.io.read_block(gf, info["offset"])[-1]
braingram (Contributor, Author)

This is another instance of using non-public asdf API. I would be happy to see 3.x include at least a provisional introduction of a lower-level API, including a function that could replace this (one that loads a block from an ASDF file without parsing the tree, etc.).
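In spirit, such a low-level call would seek to a known offset and return the raw block bytes without touching the tree. A stdlib-only illustration (the function name and the file layout here are made up, not a real asdf API):

```python
import io


def read_block_at(fileobj, offset, length):
    # Hypothetical stand-in for a public "read one block" API: jump
    # straight to a stored offset and return the raw bytes, without
    # parsing the ASDF tree.
    fileobj.seek(offset)
    return fileobj.read(length)


fake_file = io.BytesIO(b"header...DATA0123456789")
print(read_block_at(fake_file, 9, 4))  # → b'DATA'
```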


self._chunk_info = state["_chunk_info"]
self._chunk_callbacks = {k: _to_callback(self._chunk_info[k]) for k in self._chunk_info}
# as __init__ will not be called on self, set up attributes expected
braingram (Contributor, Author)

This could be improved by changing InternalStore to not rely on code in __init__. I will open a follow-up issue.
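One possible shape for that follow-up (a hypothetical sketch, not the current implementation: a shared setup helper so __init__ and __setstate__ build attributes the same way):

```python
import pickle


class InternalStoreSketch:
    """Hypothetical refactor: both __init__ and __setstate__ delegate to
    a single _setup(), so unpickling cannot drift out of sync with
    normal construction."""

    def __init__(self, chunk_info):
        self._setup(chunk_info)

    def _setup(self, chunk_info):
        self._chunk_info = chunk_info
        # Callbacks are derived state; rebuild them rather than pickling them.
        self._chunk_callbacks = {
            k: (lambda info=v: info) for k, v in chunk_info.items()
        }

    def __getstate__(self):
        return {"_chunk_info": self._chunk_info}

    def __setstate__(self, state):
        self._setup(state["_chunk_info"])


# Pickle twice to mimic dask moving the store between workers.
store = InternalStoreSketch({"0.0": {"offset": 10}})
clone = pickle.loads(pickle.dumps(pickle.loads(pickle.dumps(store))))
```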

@braingram braingram merged commit 4b33502 into asdf-format:main Oct 18, 2023
@braingram braingram deleted the chunky_pickle branch October 18, 2023 14:58