
Optimize for better IO performance during BMI init config dataset generation #671

Merged
merged 22 commits into NOAA-OWP:master from f/bmi_cfg_gen_io/main on Aug 5, 2024

Conversation

robertbartel
Contributor

Making some adjustments to the way the integration with ngen-cal's BMI init config generation capabilities is used to create BMI_CONFIG DMOD datasets. The primary goal was to reduce the time required to create these, especially during job workflow execution (e.g., during an ngen job).

The previous implementation of on-the-fly BMI dataset generation created and wrote files one at a time, which took too long for object-store-backed datasets (about 1 hour for VPU01 catchments when creating Noah-OWP-Modular and CFE configs for each catchment). The updated implementation writes all configs first, adds them to the dataset all at once, and utilizes a new optimization within the object store dataset manager. The same operation with the new implementation takes a little less than 1 minute.
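As a rough illustration of the batched pattern described above (the function and parameter names here are assumptions for the sketch, not the actual DMOD API), all generated configs are written to a local temporary directory first and then handed to the dataset manager in a single add_data() call:

```python
import tempfile
from pathlib import Path


def add_generated_configs(manager, dataset_name: str, domain, generate_configs) -> bool:
    """Write all generated BMI init configs locally, then add them in one bulk call."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        tmp_path = Path(tmp_dir)
        # generate_configs is assumed to yield (relative_filename, text) pairs
        for rel_name, text in generate_configs():
            out_file = tmp_path / rel_name
            out_file.parent.mkdir(parents=True, exist_ok=True)
            out_file.write_text(text)
        # A single add for the whole directory (instead of one call per file)
        # gives the manager a chance to apply backend-specific optimizations,
        # e.g., MinIO small-file archives for object-store-backed datasets.
        return manager.add_data(dataset_name=dataset_name, dest="", domain=domain,
                                data=tmp_path)
```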

Relates to #654.

Additions

  • New StandardDatasetIndex enum value HYDROFABRIC_DATA_ID for referencing associated hydrofabric dataset from derived datasets, in particular BMI_CONFIG datasets
  • New DataArchiving enum type to define supported file archiving methods, in particular within datasets
  • New, mostly complete integration test IntegrationTestDataDeriveUtil for the BMI init config generation capabilities of DataDeriveUtil when used with an object store dataset backing
    • Skipped unless explicitly activated via .test_env file
    • Requires a hydrofabric with attribute data; a reasonably sized subset complete with attribute data is still needed (future)
  • Adding another realization config for use with the aforementioned integration test
    • Used to create dataset within object store manager backend

Changes

  • Update to BMI_CONFIG DataFormat with indices to reference realization config and hydrofabric datasets, in particular when such datasets are used for BMI init config derivation/generation
  • Add optional DataArchiving attribute (and some associated functions) to Dataset to track when all the contents of a dataset are wrapped within an archive
  • Optimize BmiAutoGenerationAdder to write files first and then add them to the dataset all at once, via the dataset's DatasetManager, to give the manager a chance to perform any implementation-specific optimizations
    • Also makes generated dataset read-only (required for manager optimizations discussed below)
  • Optimize the ObjectStoreDatasetManager add_data implementation so that, under certain conditions, it writes all files to an archive and stores that single archive file inside the dataset (see the sketch after this list)
    • Takes advantage of this capability of MinIO: https://blog.min.io/small-file-archives/
    • Dataset must be set as read-only
    • Dataset must be empty
    • Data supplied must be provided as a directory containing one or more files
  • Improve/modularize BMI config generation functionality within DataDeriveUtil
  • Update InitialDataAdder implementations to simply apply the original domain as the "update domain" portion when calling DatasetManager.add_data()
    • A merge of equal domains keeps the domain the same, so the eventual dataset domain is unchanged
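A minimal sketch of the conditional archiving step from the ObjectStoreDatasetManager bullet above. The is_read_only, archive_name, and _push_file names come from code quoted later in this conversation; _is_empty and the exact call signatures are assumptions for illustration only, not the actual implementation:

```python
from pathlib import Path
from tempfile import TemporaryDirectory
from zipfile import ZipFile


def _add_directory(self, dataset_name: str, src_dir: Path) -> bool:
    dataset = self.datasets[dataset_name]
    files = [p for p in src_dir.rglob("*") if p.is_file()]
    # Only bundle into a single archive when the dataset is read-only, still
    # empty, and the data arrives as a directory of files, so that MinIO's
    # small-file-archive feature (https://blog.min.io/small-file-archives/)
    # can be used when reading it back.
    if dataset.is_read_only and self._is_empty(dataset_name) and files:
        with TemporaryDirectory() as tmp_dir:
            archive_path = Path(tmp_dir) / dataset.archive_name
            with ZipFile(archive_path, "w") as archive:  # uncompressed by default
                for f in files:
                    archive.write(f, arcname=str(f.relative_to(src_dir)))
            return self._push_file(dataset_name, archive_path)
    # Otherwise, push the files individually as before.
    return all(self._push_file(dataset_name, f) for f in files)
```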

Testing

  1. Manual testing of functionality via IntegrationTestDataDeriveUtil and a manually set up VPU01 hydrofabric dataset, with acceptable generation times (under 60 seconds)

Screenshots

Notes

Todos

  • Get updated hydrofabric and finish test class
  • Update workers to account for possibility of archived or non-archived dataset

Checklist

  • PR has an informative and human-readable title
  • Changes are limited to a single goal (no scope creep)
  • Code can be automatically merged (no conflicts)
  • Code follows project standards (link if applicable)
  • Passes all existing automated tests
  • Any change in functionality is tested
  • New functions are documented (with a description, list of inputs, and expected output)
  • Placeholder code is flagged / future todos are captured in comments
  • Visually tested in supported browsers and devices (see checklist below 👇)
  • Project documentation has been updated (including the "Unreleased" section of the CHANGELOG)
  • Reviewers requested with the Reviewers tool ➡️

@robertbartel robertbartel added enhancement New feature or request maas MaaS Workstream labels Jul 3, 2024
Member

@aaraney aaraney left a comment


Looks solid, just a few comments to work through! Thanks, @robertbartel!

archive **all** the data of a dataset, when the dataset itself requires archiving. Datasets may also contain data
archive files as individual data items, and such archive files are not necessarily restricted to these types.
"""
TAR = (1, ".tar")
Member


Why not just drop the integer and slightly simplify this?

Suggested change
TAR = (1, ".tar")
TAR = ".tar"

Contributor Author


Partially at least because we'd have to change the file extensions for the zip-related values; right now they are all .zip for simplicity. Whether it makes sense to do that here is debatable, but my initial thought was that this was consistent with real-world usage.
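For context, a minimal sketch of how such tuple values can work in a Python Enum (this is an assumed shape, not necessarily the actual DMOD definition): the integer keeps zip-related members distinct even though they share the .zip extension, whereas bare string values would collapse them into aliases.

```python
from enum import Enum


class DataArchiving(Enum):
    TAR = (1, ".tar")
    ZIP = (2, ".zip")
    ZIP_STORED = (3, ".zip")  # same extension as ZIP; the integer keeps it distinct

    def __init__(self, identifier: int, extension: str):
        self._identifier = identifier
        self._extension = extension

    @property
    def extension(self) -> str:
        return self._extension


# If the values were plain strings, ZIP_STORED = ".zip" would become an alias
# of ZIP = ".zip" rather than a separate member.
```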

"to store this dataset's data.")

@validator("data_archiving")
def validate_data_archiving(cls, v, values):
Member


A little bit of the pot calling the kettle black, but we should probably make this "private".

Ugh, pydantic things... Relying on values is dependent on field ordering and can be a little wonky. Multiple root_validators on a given model are supported and are less error prone (e.g., a root_validator is always called and field default values will be present, which is not the case for a validator unless it is parametrized).

Contributor Author


I don't think we need to make data_archiving private, but it would be nicer if we could (easily) encapsulate the validation with a setter instead of it happening during init.

I have switched the validator itself to a root validator to make sure we avoid issues with defaults, etc.
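For reference, a minimal pydantic v1 sketch of that switch (field names follow the quoted diff; the specific rule enforced here is only illustrative, not the actual DMOD validation logic):

```python
from typing import Optional

from pydantic import BaseModel, root_validator


class Dataset(BaseModel):
    is_read_only: bool = False
    data_archiving: Optional[str] = None  # simplified; the real field uses the DataArchiving enum

    @root_validator
    def _validate_data_archiving(cls, values):
        # Unlike @validator("data_archiving"), a root_validator always runs and
        # sees every field (including defaults), so it does not depend on field
        # declaration order.
        if values.get("data_archiving") is not None and not values.get("is_read_only"):
            raise ValueError("example rule: archived datasets are expected to be read-only")
        return values
```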

@@ -245,6 +247,14 @@ def add_data(self, dataset_name: str, dest: str, domain: DataDomain, data: Optio
::method:`_push_file`
::method:`_push_files`
"""
# Prevent adding to read-only dataset except when first setting it up
Member


Will you please pull this out into its own small method? Something like _can_add_data (I'm sure there is a better name). It's just an important invariant that I think deserves to be named.

Contributor Author


A question for future proofing: we don't account for it yet, but eventually we will likely need to lock datasets from changes while they are in use. Should this invariant be encapsulated in an isolated (perhaps layered) manner, such that we have a function that is just the read-only-not-new test, or lumped into something that will eventually grow broader?
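A hypothetical sketch of the layered option, for discussion only (neither helper exists in the codebase as far as this thread shows, and the list_files call is likewise assumed):

```python
def _is_read_only_and_populated(self, dataset_name: str) -> bool:
    """Narrow test: the dataset is read-only and already has data in it."""
    dataset = self.datasets[dataset_name]
    return dataset.is_read_only and bool(self.list_files(dataset_name))


def _can_add_data(self, dataset_name: str) -> bool:
    """Broader invariant that can grow later, e.g., a 'locked while in use' check."""
    return not self._is_read_only_and_populated(dataset_name)
```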

# Combine all the files in that directory into an uncompressed zip archive
with tempfile.TemporaryDirectory() as zip_dest_dir_name:
archive_path = Path(f"{zip_dest_dir_name}/{self.datasets[dataset_name].archive_name}")
with ZipFile(archive_path, "w") as archive:
Member


Why not just use shutil.make_archive?

Contributor Author


I had some requirements that didn't exactly seem like default behavior (path control and no compression), so I looked first to something specifically for ZIP files.

I can't easily tell whether make_archive would compress a ZIP file or not, though given that it uses zlib I would guess it does (FWIW, the default for the zip CLI command on my Linux machine is level 6 compression). I don't see any way to control the compression level with make_archive either, so I assume it would use some kind of default.
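For what it's worth, a quick comparison sketch: zipfile defaults to no compression (ZIP_STORED) and gives control over the paths stored inside the archive, while shutil.make_archive's zip format uses deflate compression when zlib is available and exposes no compression-level option:

```python
from pathlib import Path
from zipfile import ZIP_STORED, ZipFile


def make_uncompressed_zip(src_dir: Path, archive_path: Path) -> None:
    # ZIP_STORED (no compression) is ZipFile's default; it is spelled out here
    # to document the "no compression" requirement.
    with ZipFile(archive_path, "w", compression=ZIP_STORED) as archive:
        for f in sorted(p for p in src_dir.rglob("*") if p.is_file()):
            # arcname controls the path recorded inside the archive
            archive.write(f, arcname=str(f.relative_to(src_dir)))


# By contrast, this produces a deflate-compressed .zip when zlib is present:
# shutil.make_archive("configs", "zip", root_dir=src_dir)
```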

# (see https://blog.min.io/small-file-archives/)
# Also, we already know from above that, if read-only, dataset must also be empty
# TODO: (later) consider whether there is some minimum file count (perhaps configurable) to also consider
if self.datasets[dataset_name].is_read_only:
Member


This feels like we are injecting a use case specific feature into a general api. For instance, in creating a forcing dataset I would likely mark it as read-only as it is more or less static. I don't think, in that case, it is desirable to zip up a directory of, let's say, netcdf forcing files. Zipping up a directory seems like it should be handled by the specific application and not this api. We will have to rethink how the data_archiving attribute is set on a Dataset if that is the case, though.

One thought is to create an algebraic datatype like a DataFormat.ARCHIVE[ArchiveFormat, DataFormat] that is, itself, a DataFormat but also wraps an ArchiveFormat and a DataFormat. It might be a little cumbersome to capture this using a python Enum, but we can certainly figure out how to best express the idea in the type system. This could be used to set whether a Dataset is data_archiving or not, if it still makes sense to keep that as a top level attribute.

This will move complexity to other places, but I think it retains some desirable traits of the system more generally. I'm sure there is a simpler way to capture this idea than what I've suggested. I think we should strive for simplicity until we can't.
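Purely as an illustration of the composite-type idea (not anything that exists in DMOD today; the member names below are made up for the example):

```python
from dataclasses import dataclass
from enum import Enum, auto


class ArchiveFormat(Enum):
    TAR = auto()
    ZIP = auto()


class InnerDataFormat(Enum):
    BMI_CONFIG = auto()
    NETCDF_FORCING = auto()


@dataclass(frozen=True)
class ArchivedDataFormat:
    """Acts as a data format while recording which archive wraps which inner format."""
    archive_format: ArchiveFormat
    inner_format: InnerDataFormat


# e.g., a zipped directory of BMI init configs:
zipped_bmi_configs = ArchivedDataFormat(ArchiveFormat.ZIP, InnerDataFormat.BMI_CONFIG)
```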

Contributor Author


This feels like we are injecting a use case specific feature into a general api ...

I disagree with the headline here but agree with parts of your argument in isolation.

It is a use case specific feature, but it's not part of the API. It's a specific implementation of the API that provides this behavior, but behind the scenes and particular to a subset of use cases. I don't see a better way to introduce storage-backend-specific optimizations other than doing it within the thing interacting with the backend. A specific application is not going to have any idea (nor should it) that there is a performance penalty for certain write scenarios (e.g., lots of files to the object store).

And, just to be clear, this is intended as a backend-specific optimization. It's only being introduced here because MinIO can take advantage of ZIP files in a particular way. I don't have a problem with other things also archiving data in a dataset in the future, if they have other reasons to do so. I expect eventually we'll want to; hence, the attribute within Dataset instead of just tracking inside the manager. But those are apples and oranges.

But I agree that this doesn't consider all scenarios sufficiently: a single netcdf forcing file, for example, probably doesn't need to be archived. Probably, 50 forcing CSV files don't really either. I even put in a TODO comment to this effect, but didn't want to just throw out a magic number of files for the too-many threshold. I'm open to discussing either what that number should be or how to better determine when the object store manager really can/should do this.

One thought is to create an algebraic datatype like a DataFormat.ARCHIVE[ArchiveFormat, DataFormat] that is, itself, a DataFormat but also wraps an ArchiveFormat and a DataFormat

That seems much more complicated and far reaching than what's in place now. And I still don't think we could effectively apply it without making the things creating the data start to worry too much about the details of the data storage backend*. Not to mention it would convolute the logic for reconciling data requirements.

* Caveat: this is already the case for workers writing output, kind of, though they don't interact with the DMOD data orchestration in the same way, so their data writing is handled more in isolation (at least for now).

@@ -64,7 +69,8 @@ def _add(path: Path, dest: Optional[str] = None, add_root: Optional[Path] = None
elif not dest:
dest = path.relative_to(add_root)

if not self._dataset_manager.add_data(dataset_name=self._dataset_name, dest=dest, data=path.read_bytes()):
if not self._dataset_manager.add_data(dataset_name=self._dataset_name, dest=dest, data=path.read_bytes(),
Member


This seems like a great place to use the reader interface.

Suggested change
if not self._dataset_manager.add_data(dataset_name=self._dataset_name, dest=dest, data=path.read_bytes(),
with path.open("rb") as fp:
if not self._dataset_manager.add_data(dataset_name=self._dataset_name, dest=dest, data=fp,

Contributor Author


Perhaps, but this is probably too far out of scope here. Admittedly, I did modify calls to self._dataset_manager.add_data in the InitialDataAdder implementations. One could argue (questionably) that this bled into scope due to it being done as part of the other modifications to BmiAutoGenerationAdder, but, more importantly, those changes were to fix something that was broken.

Member


I think you could argue this is out of scope, but IMO the change is minor and the potential performance consequences are high. My concern is if a large file is passed as path. This will blow up the resident memory usage of the process because the file read is greedy (likely a demand-paged mmap that is munmapped, but still not great). If we just pass something that has a read() method, add_data can perform buffered reads that have less potential to degrade performance.
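A sketch of the buffered-read point (the helper name, chunk size, and put_chunk callback are illustrative, not DMOD's actual add_data internals): accepting any object with a read() method lets the manager stream fixed-size chunks instead of holding the whole file in memory.

```python
from typing import BinaryIO, Callable


def _stream_to_backend(reader: BinaryIO, put_chunk: Callable[[bytes], None],
                       chunk_size: int = 1024 * 1024) -> None:
    """Read from any file-like object in chunks and forward each chunk to storage."""
    while True:
        chunk = reader.read(chunk_size)
        if not chunk:
            break
        put_chunk(chunk)


# Caller side, mirroring the suggested change above:
# with path.open("rb") as fp:
#     manager.add_data(dataset_name=name, dest=dest, data=fp, domain=domain)
```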

if self.partial_realization_config is not None:
raise DmodRuntimeError(f"{self.__class__.__name__} can't have 'None' for partial realization property")

try:
real_config: NgenRealization = self.build_realization_config_from_partial()
if not self._dataset_manager.add_data(dataset_name=self._dataset_name, dest=self._REAL_CONFIG_FILE_NAME,
data=json.dumps(real_config.json()).encode()):
data=json.dumps(real_config.json()).encode(), domain=original_domain):
Member


I think this should be the following, no?

Suggested change
data=json.dumps(real_config.json()).encode(), domain=original_domain):
data=real_config.json().encode(), domain=original_domain):
a = real_config.json().encode()
b = json.dumps(real_config.json()).encode()

assert type(json.loads(a)) == dict
assert type(json.loads(b)) == str

Contributor Author


Good catch. I've pushed a change for it.

Member

@aaraney aaraney left a comment


Really just one major comment that I think is worth addressing and then we should be good to go! Thanks, @robertbartel!

@@ -64,7 +69,8 @@ def _add(path: Path, dest: Optional[str] = None, add_root: Optional[Path] = None
elif not dest:
dest = path.relative_to(add_root)

if not self._dataset_manager.add_data(dataset_name=self._dataset_name, dest=dest, data=path.read_bytes()):
if not self._dataset_manager.add_data(dataset_name=self._dataset_name, dest=dest, data=path.read_bytes(),
Member


I think you could argue this is out of scope, but IMO the change is minor and the potential performance consequences are high. My concern is if a large file is passed as path. This will blow up the resident memory usage of the process because the file read is greedy (likely a demand-paged mmap that is munmapped, but still not great). If we just pass something that has a read() method, add_data can perform buffered reads that have less potential to degrade performance.

robertbartel and others added 21 commits July 31, 2024 13:06
- Add new StandardDatasetIndex enum value HYDROFABRIC_DATA_ID for
  referencing associated hydrofabric dataset.
- Update BMI_CONFIG DataFormat with indices to reference realization
  config and hydrofabric datasets, as when such datasets are used within
  DMOD to generate a new BMI_CONFIG dataset.
- Add DataArchiving enum type with values to define archiving methods.
- Add optional data_archiving attribute to Dataset to track when data
  contained within dataset is entirely contained within a single archive
  file.
Modifying add_initial_data to write all BMI configs at once to a
temporary directory so that files can be added to the dataset all at
once, allowing any optimizations available to the manager implementation
to be used.
Take advantage of the minio archiving feature when adding data to an
empty, read-only dataset (i.e., adding initial data that will not be
changed), since lots of files in a minio bucket have significant overhead.
Separating parts of the functionality for deriving BMI init config
datasets into more focused/reusable/testable functions.
Adding a mostly complete integration test for BMI init config generation
logic (except that it doesn't automatically create a hydrofabric dataset
yet), though it must be manually turned on via test env config.
Updating InitialDataAdder implementations to fix usage of calls to
DatasetManager.add_data(), which now requires a 'domain' argument, so
that the adder just passes the original/initial domain of the dataset
(result is no change in eventual domain merge op).
Rearranging existing value indices slightly also.
Fixing incorrect JSON handling for data passed to add_manager call.
Fix another incorrect JSON handling for data passed to add_manager call.
Fixing validator to account properly for scenarios with dataset_type not
set (i.e., set to the default of None).
Updating core dep to 0.19.0, communication dep to 0.21.0, and modeldata
dep to 0.13.0.
Updating core dep to 0.19.0 and communication dep to 0.21.0.
Updating versions of communication, dataservice, modeldata, and
requestservice packages.
Fixing stack name used for start/stop object store stack.
Co-authored-by: Austin Raney <austin.raney@noaa.gov>
@robertbartel robertbartel force-pushed the f/bmi_cfg_gen_io/main branch from e6c6ecf to 6efb92b on August 1, 2024 14:38
Update so that call to manager.add_data passes a buffered/binary reader
object instead of just reading all the bytes up front and passing those.
@robertbartel
Contributor Author

@aaraney, I've fixed conflicts and I think addressed your last concern by passing a reader instead of the raw bytes. Let me know if there is anything else.

@aaraney aaraney merged commit 452bf58 into NOAA-OWP:master Aug 5, 2024
8 checks passed
@aaraney
Member

aaraney commented Aug 6, 2024

Looks like the integration tests are now failing on master after this was merged. I had a hunch this might be a caching issue, so I cleared the runner caches, but that did not fix it. Trying to reproduce locally now.

@aaraney
Member

aaraney commented Aug 6, 2024

The service packages were not being installed for the integration tests. It's the same fix as in #575, but for the IT runs.

@aaraney
Member

aaraney commented Aug 6, 2024

Opened #696 to fix this.
