Optimize for better IO performance during BMI init config dataset generation #671
Conversation
Looks solid, just a few comments to work through! Thanks, @robertbartel!
    archive **all** the data of a dataset, when the dataset itself requires archiving. Datasets may also contain data
    archive files as individual data items, and such archive files are not necessarily restricted to these types.
    """
    TAR = (1, ".tar")
Why not just drop the integer and slightly simplify this?
-    TAR = (1, ".tar")
+    TAR = ".tar"
Partially, at least, because we'd have to change the file extensions for the zip-related values; right now they are all `.zip` for simplicity. Whether it makes sense to do that here is debatable, but my initial thought was that this was consistent with real-world usage.
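For illustration, here is a minimal sketch of why the integer stays (member names other than `TAR` are assumptions, not the actual enum): in a plain `Enum`, two members that shared the bare value `".zip"` would collapse into aliases of one another, so a second element is needed to keep them distinct.

```python
from enum import Enum


class DataArchiving(Enum):
    # Hypothetical members besides TAR, shown only to illustrate the point.
    TAR = (1, ".tar")
    ZIP = (2, ".zip")
    ZIP_STORED = (3, ".zip")  # same extension as ZIP, but still a distinct member

    def __init__(self, unique_id: int, extension: str):
        # Enum unpacks the tuple value into these __init__ arguments
        self._unique_id = unique_id
        self._extension = extension

    @property
    def extension(self) -> str:
        return self._extension
```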
python/lib/core/dmod/core/dataset.py
"to store this dataset's data.") | ||
|
||
@validator("data_archiving") | ||
def validate_data_archiving(cls, v, values): |
A little bit of the pot calling the kettle black, but we should probably make this "private".

Ugh, pydantic things... Relying on `values` is dependent on field ordering and can just be a little wonky. Multiple `root_validator`s on a given model are supported and are less error prone (e.g., a `root_validator` is always called and field default values will be present, which is not the case for a `validator` unless parametrized).
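As a rough sketch of the difference (assuming pydantic v1 style validators; the field set and the cross-field rule here are simplified stand-ins, not the actual Dataset model):

```python
from typing import Optional
from pydantic import BaseModel, root_validator


class ExampleDataset(BaseModel):
    is_read_only: bool = False
    data_archiving: Optional[bool] = None

    @root_validator
    def _check_data_archiving(cls, values: dict) -> dict:
        # A root_validator always runs and sees every field, including ones still
        # holding their defaults; a field-level @validator only sees previously
        # validated fields unless parametrized (e.g., always=True).
        if values.get("data_archiving") and not values.get("is_read_only"):
            # purely illustrative cross-field rule
            raise ValueError("data_archiving is only expected for read-only datasets")
        return values
```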
I don't think we need to make `data_archiving` private, but it would be nicer if we could (easily) encapsulate the validation with a setter instead of it happening during init.

I have switched the validator itself to a root validator to make sure we avoid issues with defaults, etc.
@@ -245,6 +247,14 @@ def add_data(self, dataset_name: str, dest: str, domain: DataDomain, data: Optio
        ::method:`_push_file`
        ::method:`_push_files`
        """
        # Prevent adding to read-only dataset except when first setting it up
Will you please pull this out into its own small method? Something like `_can_add_data` (I'm sure there is a better name). It's just an important invariant that I think deserves to be named.
A question for future proofing: we don't account for it yet, but eventually we will likely need to lock datasets from changes while they are in use. Should this invariant be encapsulated in an isolated (perhaps layered) manner, such that we have a function that is just the read-only-not-new test, or lumped into something that will eventually grow broader?
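One possible shape for that layering, purely as a sketch (`_can_add_data` comes from the suggestion above; the class shell and the helper are hypothetical, not the real manager):

```python
class _DatasetManagerSketch:
    """Illustrative shell only; not the real DatasetManager."""

    def __init__(self, datasets: dict):
        self.datasets = datasets

    def _dataset_is_empty(self, dataset_name: str) -> bool:
        # Hypothetical helper: whether the named dataset currently holds no items.
        raise NotImplementedError

    def _can_add_data(self, dataset_name: str) -> bool:
        """Writes to a read-only dataset are allowed only while it is still empty
        (i.e., during initial setup). A future in-use lock check could be layered
        in here without touching callers."""
        dataset = self.datasets[dataset_name]
        if not dataset.is_read_only:
            return True
        return self._dataset_is_empty(dataset_name)
```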
            # Combine all the files in that directory into an uncompressed zip archive
            with tempfile.TemporaryDirectory() as zip_dest_dir_name:
                archive_path = Path(f"{zip_dest_dir_name}/{self.datasets[dataset_name].archive_name}")
                with ZipFile(archive_path, "w") as archive:
Why not just use `shutil.make_archive`?
I had some requirements that didn't exactly seem like default behavior (path control and no compression), so I looked first to something specifically for ZIP files.

I can't easily tell whether `make_archive` would compress a ZIP file or not, though given it uses zlib I would guess it does (FWIW, the default on my Linux machine for the `zip` CLI command is level 6 compression). I don't see any way to control the compression level with `make_archive` either, so I assume it'd be some kind of default.
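For reference, a small sketch of the explicit-control option (the directory and file names are placeholders): `shutil.make_archive` picks deflate compression for zip output whenever zlib is available and exposes no level control, while `ZipFile` lets the caller request `ZIP_STORED` (no compression) and control archive-internal paths via `arcname`.

```python
import tempfile
from pathlib import Path
from zipfile import ZipFile, ZIP_STORED

src_dir = Path("generated_bmi_configs")  # placeholder input directory

with tempfile.TemporaryDirectory() as tmp:
    archive_path = Path(tmp) / "dataset_archive.zip"
    # ZIP_STORED writes entries uncompressed (also ZipFile's default), and
    # arcname gives explicit control over each entry's path inside the archive.
    with ZipFile(archive_path, "w", compression=ZIP_STORED) as archive:
        for f in sorted(src_dir.rglob("*")):
            if f.is_file():
                archive.write(f, arcname=f.relative_to(src_dir))
```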
        # (see https://blog.min.io/small-file-archives/)
        # Also, we already know from above that, if read-only, dataset must also be empty
        # TODO: (later) consider whether there is some minimum file count (perhaps configurable) to also consider
        if self.datasets[dataset_name].is_read_only:
This feels like we are injecting a use case specific feature into a general api. For instance, in creating a forcing dataset I would likely mark it as read-only, as it is more or less static. I don't think, in that case, it is desirable to zip up a directory of, let's say, netCDF forcing files. Zipping up a directory seems like it should be handled by the specific application and not this api. We will have to rethink how the `data_archiving` attribute is set on a `Dataset` if that is the case, though.

One thought is to create an algebraic datatype like a `DataFormat.ARCHIVE[ArchiveFormat, DataFormat]` that is, itself, a `DataFormat` but also wraps an `ArchiveFormat` and a `DataFormat`. It might be a little cumbersome to capture this using a python `Enum`, but we can certainly figure out how to best express the idea in the type system. This could be used to set whether a `Dataset` is `data_archiving` or not, if it still makes sense to keep that as a top-level attribute.

This will move complexity to other places, but I think it retains some desirable traits of the system more generally. I'm sure there is a simpler way to capture this idea than what I've suggested. I think we should strive for simplicity until we can't.
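A very rough way to express that wrapper idea in Python, using a dataclass rather than forcing it into an `Enum` (the enums below are simplified stand-ins for the real `ArchiveFormat`/`DataFormat`, and the member names are only illustrative):

```python
from dataclasses import dataclass
from enum import Enum


class ArchiveFormat(Enum):
    TAR = ".tar"
    ZIP = ".zip"


class DataFormat(Enum):
    BMI_CONFIG = "BMI_CONFIG"
    OTHER_FORMAT = "OTHER_FORMAT"  # illustrative second member


@dataclass(frozen=True)
class ArchivedDataFormat:
    """A data format whose items are stored together inside a single archive file."""
    archive_format: ArchiveFormat
    inner_format: DataFormat


# e.g., ArchivedDataFormat(ArchiveFormat.ZIP, DataFormat.BMI_CONFIG) could stand in
# for a separate data_archiving flag on Dataset, if that idea were pursued.
```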
> This feels like we are injecting a use case specific feature into a general api ...

I disagree with the headline here but agree with parts of your argument in isolation.

It is a use case specific feature, but it's not part of the API. It's a specific implementation of the API that provides this behavior, but behind the scenes and particular to a subset of use cases. I don't see a better way to introduce storage-backend-specific optimizations other than doing it within the thing interacting with the backend. A specific application is not going to have any idea (nor should it) that there is a performance penalty for certain write scenarios (e.g., lots of files to the object store).

And, just to be clear, this is intended as a backend-specific optimization. It's only being introduced here because MinIO can take advantage of ZIP files in a particular way. I don't have a problem with other things also archiving data in a dataset in the future, if they have other reasons to do so. I expect eventually we'll want to; hence the attribute within Dataset instead of just tracking inside the manager. But those are apples and oranges.

But I agree that this doesn't consider all scenarios sufficiently: a single netCDF forcing file, for example, probably doesn't need to be archived. Probably 50 forcing CSV files don't really either. I even put in a TODO comment to this effect, but didn't want to just throw out a magic number of files for the too-many threshold. I'm open to discussing either what that number should be or how to better determine when the object store manager really can/should do this.

> One thought is to create an algebraic datatype like a DataFormat.ARCHIVE[ArchiveFormat, DataFormat] that is, itself, a DataFormat but also wraps an ArchiveFormat and a DataFormat

That seems much more complicated and far reaching than what's in place now. And I still don't think we could effectively apply it without making the things creating the data start to worry too much about the details of the data storage backend*. Not to mention it would convolute logic for reconciling data requirements.

* Caveat that this is already the case for workers writing output, kind of, though they don't interact with the DMOD data orchestration in the same way, so their data writing is handled more in isolation (at least for now).
@@ -64,7 +69,8 @@ def _add(path: Path, dest: Optional[str] = None, add_root: Optional[Path] = None
        elif not dest:
            dest = path.relative_to(add_root)

-        if not self._dataset_manager.add_data(dataset_name=self._dataset_name, dest=dest, data=path.read_bytes()):
+        if not self._dataset_manager.add_data(dataset_name=self._dataset_name, dest=dest, data=path.read_bytes(),
This seems like a great place to use the reader interface.
-        if not self._dataset_manager.add_data(dataset_name=self._dataset_name, dest=dest, data=path.read_bytes(),
+        with path.open("rb") as fp:
+            if not self._dataset_manager.add_data(dataset_name=self._dataset_name, dest=dest, data=fp,
Perhaps, but this is probably too far out of scope here. Admittedly, I did modify calls to `self._dataset_manager.add_data` in the `InitialDataAdder` implementations. One could argue (questionably) that this bled into scope due to it being done as part of the other modifications to `BmiAutoGenerationAdder`, but, more importantly, those changes were to fix something that was broken.
I think you could argue this is out of scope, but IMO the change is minor and the potential performance consequences are high. My concern is if a large file is passed as `path`. This will blow up the resident memory usage of the process because the file read is greedy (likely a demand-paged `mmap` that is `munmap`ed, but still not great). If we just pass something that has a `read()` method, `add_data` can perform buffered reads that have a lesser potential to degrade performance.
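To illustrate the point (this is not the actual `add_data` implementation), the receiving side can copy from anything exposing `read()` in fixed-size chunks, so peak memory stays bounded regardless of file size:

```python
from typing import BinaryIO

CHUNK_SIZE = 64 * 1024  # assumed buffer size, purely for illustration


def copy_buffered(reader: BinaryIO, writer: BinaryIO) -> int:
    """Stream from an open binary reader to a destination in bounded chunks."""
    total = 0
    while True:
        chunk = reader.read(CHUNK_SIZE)
        if not chunk:
            break
        writer.write(chunk)
        total += len(chunk)
    return total
```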
        if self.partial_realization_config is not None:
            raise DmodRuntimeError(f"{self.__class__.__name__} can't have 'None' for partial realization property")

        try:
            real_config: NgenRealization = self.build_realization_config_from_partial()
            if not self._dataset_manager.add_data(dataset_name=self._dataset_name, dest=self._REAL_CONFIG_FILE_NAME,
-                                                 data=json.dumps(real_config.json()).encode()):
+                                                 data=json.dumps(real_config.json()).encode(), domain=original_domain):
I think this should be the following, no?

-                                                 data=json.dumps(real_config.json()).encode(), domain=original_domain):
+                                                 data=real_config.json().encode(), domain=original_domain):

a = real_config.json().encode()
b = json.dumps(real_config.json()).encode()
assert type(json.loads(a)) == dict
assert type(json.loads(b)) == str
Good catch. I've pushed a change for it.
Force-pushed from 8dc12bb to b5d3414
Really just one major comment that I think is worth addressing and then we should be good to go! Thanks, @robertbartel!
- Add new StandardDatasetIndex enum value HYDROFABRIC_DATA_ID for referencing associated hydrofabric dataset.
- Update BMI_CONFIG DataFormat with indices to reference realization config and hydrofabric datasets, as when such datasets are used within DMOD to generate a new BMI_CONFIG dataset.
- Add DataArchiving enum type with values to define archiving methods.
- Add optional data_archiving attribute to Dataset to track when data contained within a dataset is entirely contained within a single archive file.
- Modify add_initial_data to write all BMI configs at once to a temporary directory so that files can be added to the dataset all at once, allowing any optimizations available to the manager implementation to be used.
- Take advantage of the minio archiving feature when adding data to an empty, read-only dataset (i.e., adding initial data that will not be changed), since having lots of files in a minio bucket has significant overhead.
- Separate parts of the functionality for deriving BMI init config datasets into more focused/reusable/testable functions.
- Add a mostly complete integration test for BMI init config generation logic (except that it doesn't automatically create a hydrofabric dataset yet), though it must be manually turned on via test env config.
- Update InitialDataAdder implementations to fix usage of calls to DatasetManager.add_data(), which now requires a 'domain' argument, so that the adder just passes the original/initial domain of the dataset (result is no change in eventual domain merge op).
- Rearrange existing value indices slightly also.
- Fix incorrect JSON handling for data passed to add_manager call.
- Fix another instance of incorrect JSON handling for data passed to add_manager call.
- Fix validator to account properly for scenarios with dataset_type not set (i.e., set to the default of None).
- Update core dep to 0.19.0, communication dep to 0.21.0, and modeldata dep to 0.13.0.
- Update core dep to 0.19.0 and communication dep to 0.21.0.
- Update versions of communication, dataservice, modeldata, and requestservice packages.
- Fix stack name used for start/stop of the object store stack.
Co-authored-by: Austin Raney <austin.raney@noaa.gov>
Force-pushed from e6c6ecf to 6efb92b
Update so that call to manager.add_data passes a buffered/binary reader object instead of just reading all the bytes up front and passing those.
@aaraney, I've fixed conflicts and I think addressed your last concern by passing a reader instead of the raw bytes. Let me know if there is anything else.
Looks like after this was merged the integration tests are now failing on master. I had a hunch this might be a caching issue, so I cleared the runner caches, but that did not fix the issue. Trying to reproduce locally now.
The service packages were not being installed in the IT tests. Same fix as in #575, but for IT tests.
Opened #696 to fix this.
Making some adjustments to the way the integration with ngen-cal's BMI init config generation capabilities is used to create BMI_CONFIG DMOD datasets. The primary goal was to reduce the amount of time needed to create these, especially during job workflow execution (e.g., during an ngen job).

The previous implementation of on-the-fly BMI dataset generation created and wrote files one at a time, which for object-store-backed datasets took too long (about 1 hour for VPU01 catchments, creating Noah-OWP-Modular and CFE configs for each catchment). The updated implementation writes all configs first, adds them to the dataset all at once, and utilizes a new optimization within the object store dataset manager. The same operation with the new implementation takes a little less than 1 minute.
Relates to #654.
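The overall flow of the new approach, as a simplified sketch (the function parameters are placeholders, not the actual DMOD API):

```python
import tempfile
from pathlib import Path
from typing import Callable, Iterable, Tuple


def add_initial_configs(generate_configs: Callable[[], Iterable[Tuple[str, str]]],
                        add_directory_to_dataset: Callable[[Path], bool]) -> bool:
    """Write every generated BMI init config to a temporary directory first, then
    hand the whole directory to the dataset manager in one call, so backend-specific
    optimizations (e.g., archiving for an object store) can be applied."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        tmp_path = Path(tmp_dir)
        for filename, content in generate_configs():
            (tmp_path / filename).write_text(content)
        # one batched add instead of one round trip per file
        return add_directory_to_dataset(tmp_path)
```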
Additions
- New StandardDatasetIndex value HYDROFABRIC_DATA_ID for referencing an associated hydrofabric dataset from derived datasets, in particular BMI_CONFIG datasets.
- Integration test for BMI init config dataset generation, which must be manually enabled via the test_env file.

Changes
- Updated BMI_CONFIG DataFormat with indices to reference realization config and hydrofabric datasets, in particular when such datasets are used for BMI init config derivation/generation.
- Updated the object store dataset manager add_data function implementation so that in certain conditions it optimizes by writing all files to an archive and storing this archive file inside the dataset.
- Updated InitialDataAdder implementations to fix usage of calls to DatasetManager.add_data().