
Optimize for better IO performance during BMI init config dataset generation #671

Merged
merged 22 commits into NOAA-OWP:master from f/bmi_cfg_gen_io/main on Aug 5, 2024

Conversation

robertbartel
Contributor

Making some adjustments to the way the integration with ngen-cal's BMI init config generation capabilities is used to create BMI_CONFIG DMOD datasets. The primary goal was to reduce the time required to create these, especially during job workflow execution (e.g., during an ngen job).

The previous implementation of on-the-fly BMI dataset generation created and wrote files one at a time, which took too long for object-store-backed datasets (about 1 hour for VPU01 catchments when creating Noah-OWP-Modular and CFE configs for each catchment). The updated implementation writes all configs first, adds them to the dataset all at once, and utilizes a new optimization within the object store dataset manager. The same operation with the new implementation takes a little less than 1 minute.
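As a rough illustration of the batched pattern described above (the function and parameter names here are assumptions for the sketch, not the actual DMOD API), all generated configs are written to a local temporary directory first and then handed to the dataset manager in a single add_data() call:

```python
import tempfile
from pathlib import Path


def add_generated_configs(manager, dataset_name: str, domain, generate_configs) -> bool:
    """Write all generated BMI init configs locally, then add them in one bulk call."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        tmp_path = Path(tmp_dir)
        # generate_configs is assumed to yield (relative_filename, text) pairs
        for rel_name, text in generate_configs():
            out_file = tmp_path / rel_name
            out_file.parent.mkdir(parents=True, exist_ok=True)
            out_file.write_text(text)
        # A single add for the whole directory (instead of one call per file)
        # gives the manager a chance to apply backend-specific optimizations,
        # e.g., MinIO small-file archives for object-store-backed datasets.
        return manager.add_data(dataset_name=dataset_name, dest="", domain=domain,
                                data=tmp_path)
```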

Relates to #654.

Additions

  • New StandardDatasetIndex enum value HYDROFABRIC_DATA_ID for referencing associated hydrofabric dataset from derived datasets, in particular BMI_CONFIG datasets
  • New DataArchiving enum type to define supported file archiving methods, in particular within datasets
  • New, mostly complete integration test IntegrationTestDataDeriveUtil for the BMI init config generation capabilities of DataDeriveUtil when used with an object store dataset backing
    • Skipped unless explicitly activated via .test_env file
    • Requires a hydrofabric with attribute data; a reasonably sized subset complete with attribute data is still needed (future)
  • Adding another realization config for use with the aforementioned integration test
    • Used to create dataset within object store manager backend

Changes

  • Update to BMI_CONFIG DataFormat with indices to reference realization config and hydrofabric datasets, in particular when such datasets are used for BMI init config derivation/generation
  • Add optional DataArchiving attribute (and some associated functions) to Dataset to track when all the contents of a dataset are wrapped within an archive
  • Optimize BmiAutoGenerationAdder to write files first and then add them to the dataset all at once, via the dataset's DatasetManager, to give the manager a chance to perform any implementation-specific optimizations
    • Also makes generated dataset read-only (required for manager optimizations discussed below)
  • Optimize the ObjectStoreDatasetManager add_data implementation so that, under certain conditions, it writes all files to an archive and stores that single archive file inside the dataset (see the sketch after this list)
    • Takes advantage of this capability of MinIO: https://blog.min.io/small-file-archives/
    • Dataset must be set as read-only
    • Dataset must be empty
    • Data supplied must be provided as a directory containing one or more files
  • Improve/modularize BMI config generation functionality within DataDeriveUtil
  • Update InitialDataAdder implementations to simply apply the original domain as the "update domain" portion when calling DatasetManager.add_data()
    • A merge of equal domains keeps the domain the same, so the eventual dataset domain is unchanged
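A minimal sketch of the conditional archiving step from the ObjectStoreDatasetManager bullet above. The is_read_only, archive_name, and _push_file names come from code quoted later in this conversation; _is_empty and the exact call signatures are assumptions for illustration only, not the actual implementation:

```python
from pathlib import Path
from tempfile import TemporaryDirectory
from zipfile import ZipFile


def _add_directory(self, dataset_name: str, src_dir: Path) -> bool:
    dataset = self.datasets[dataset_name]
    files = [p for p in src_dir.rglob("*") if p.is_file()]
    # Only bundle into a single archive when the dataset is read-only, still
    # empty, and the data arrives as a directory of files, so that MinIO's
    # small-file-archive feature (https://blog.min.io/small-file-archives/)
    # can be used when reading it back.
    if dataset.is_read_only and self._is_empty(dataset_name) and files:
        with TemporaryDirectory() as tmp_dir:
            archive_path = Path(tmp_dir) / dataset.archive_name
            with ZipFile(archive_path, "w") as archive:  # uncompressed by default
                for f in files:
                    archive.write(f, arcname=str(f.relative_to(src_dir)))
            return self._push_file(dataset_name, archive_path)
    # Otherwise, push the files individually as before.
    return all(self._push_file(dataset_name, f) for f in files)
```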

Testing

  1. Manual testing of functionality via IntegrationTestDataDeriveUtil and a manually set up VPU01 hydrofabric dataset, with acceptable generation times (under 60 seconds)

Screenshots

Notes

Todos

  • Get updated hydrofabric and finish test class
  • Update workers to account for possibility of archived or non-archived dataset

Checklist

  • PR has an informative and human-readable title
  • Changes are limited to a single goal (no scope creep)
  • Code can be automatically merged (no conflicts)
  • Code follows project standards (link if applicable)
  • Passes all existing automated tests
  • Any change in functionality is tested
  • New functions are documented (with a description, list of inputs, and expected output)
  • Placeholder code is flagged / future todos are captured in comments
  • Visually tested in supported browsers and devices (see checklist below 👇)
  • Project documentation has been updated (including the "Unreleased" section of the CHANGELOG)
  • Reviewers requested with the Reviewers tool ➡️

@robertbartel robertbartel added enhancement New feature or request maas MaaS Workstream labels Jul 3, 2024
Member

@aaraney aaraney left a comment


Looks solid, just a few comments to work through! Thanks, @robertbartel!

archive **all** the data of a dataset, when the dataset itself requires archiving. Datasets may also contain data
archive files as individual data items, and such archive files are not necessarily restricted to these types.
"""
TAR = (1, ".tar")
Member


Why not just drop the integer and slightly simplify this?

Suggested change
TAR = (1, ".tar")
TAR = ".tar"

Contributor Author


Partially at least because we'd have to change the file extensions for the zip-related values; right now they are all .zip for simplicity. Whether it makes sense to do that here is debatable, but my initial thought was that this was consistent with real-world usage.
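For context, a minimal sketch of how such tuple values can work in a Python Enum (this is an assumed shape, not necessarily the actual DMOD definition): the integer keeps zip-related members distinct even though they share the .zip extension, whereas bare string values would collapse them into aliases.

```python
from enum import Enum


class DataArchiving(Enum):
    TAR = (1, ".tar")
    ZIP = (2, ".zip")
    ZIP_STORED = (3, ".zip")  # same extension as ZIP; the integer keeps it distinct

    def __init__(self, identifier: int, extension: str):
        self._identifier = identifier
        self._extension = extension

    @property
    def extension(self) -> str:
        return self._extension


# If the values were plain strings, ZIP_STORED = ".zip" would become an alias
# of ZIP = ".zip" rather than a separate member.
```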

"to store this dataset's data.")

@validator("data_archiving")
def validate_data_archiving(cls, v, values):
Member


A little bit of the pot calling the kettle black, but we should probably make this "private".

Ugh, pydantic things... Relying on values is dependent on field ordering and can be a little wonky. Multiple root_validators on a given model are supported and are less error prone (e.g., a root_validator is always called and field default values will be present, which is not the case for a validator unless it is parametrized).

Contributor Author


I don't think we need to make data_archiving private, but it would be nicer if we could (easily) encapsulate the validation with a setter instead of it happening during init.

I have switched the validator itself to a root validator to make sure we avoid issues with defaults, etc.
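For reference, a minimal pydantic v1 sketch of that switch (field names follow the quoted diff; the specific rule enforced here is only illustrative, not the actual DMOD validation logic):

```python
from typing import Optional

from pydantic import BaseModel, root_validator


class Dataset(BaseModel):
    is_read_only: bool = False
    data_archiving: Optional[str] = None  # simplified; the real field uses the DataArchiving enum

    @root_validator
    def _validate_data_archiving(cls, values):
        # Unlike @validator("data_archiving"), a root_validator always runs and
        # sees every field (including defaults), so it does not depend on field
        # declaration order.
        if values.get("data_archiving") is not None and not values.get("is_read_only"):
            raise ValueError("example rule: archived datasets are expected to be read-only")
        return values
```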

@@ -245,6 +247,14 @@ def add_data(self, dataset_name: str, dest: str, domain: DataDomain, data: Optio
::method:`_push_file`
::method:`_push_files`
"""
# Prevent adding to read-only dataset except when first setting it up
Member


Will you please pull this out into its own small method? Something like _can_add_data (I'm sure there is a better name). It's just an important invariant that I think deserves to be named.

Contributor Author


A question for future proofing: we don't account for it yet, but eventually we will likely need to lock datasets from changes while they are in use. Should this invariant be encapsulated in an isolated (perhaps layered) manner, such that we have a function that is just the read-only-not-new test, or lumped into something that will eventually grow broader?
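A hypothetical sketch of the layered option, for discussion only (neither helper exists in the codebase as far as this thread shows, and the list_files call is likewise assumed):

```python
def _is_read_only_and_populated(self, dataset_name: str) -> bool:
    """Narrow test: the dataset is read-only and already has data in it."""
    dataset = self.datasets[dataset_name]
    return dataset.is_read_only and bool(self.list_files(dataset_name))


def _can_add_data(self, dataset_name: str) -> bool:
    """Broader invariant that can grow later, e.g., a 'locked while in use' check."""
    return not self._is_read_only_and_populated(dataset_name)
```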

# Combine all the files in that directory into an uncompressed zip archive
with tempfile.TemporaryDirectory() as zip_dest_dir_name:
archive_path = Path(f"{zip_dest_dir_name}/{self.datasets[dataset_name].archive_name}")
with ZipFile(archive_path, "w") as archive:
Member


Why not just use shutil.make_archive?

Contributor Author


I had some requirements that didn't exactly seem like default behavior (path control and no compression), so I looked first to something specifically for ZIP files.

I can't easily tell whether make_archive would compress a ZIP file or not, though given that it uses zlib I would guess it does (FWIW, the default for the zip CLI command on my Linux machine is level 6 compression). I don't see any way to control the compression level with make_archive either, so I assume it would use some kind of default.
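For what it's worth, a quick comparison sketch: zipfile defaults to no compression (ZIP_STORED) and gives control over the paths stored inside the archive, while shutil.make_archive's zip format uses deflate compression when zlib is available and exposes no compression-level option:

```python
from pathlib import Path
from zipfile import ZIP_STORED, ZipFile


def make_uncompressed_zip(src_dir: Path, archive_path: Path) -> None:
    # ZIP_STORED (no compression) is ZipFile's default; it is spelled out here
    # to document the "no compression" requirement.
    with ZipFile(archive_path, "w", compression=ZIP_STORED) as archive:
        for f in sorted(p for p in src_dir.rglob("*") if p.is_file()):
            # arcname controls the path recorded inside the archive
            archive.write(f, arcname=str(f.relative_to(src_dir)))


# By contrast, this produces a deflate-compressed .zip when zlib is present:
# shutil.make_archive("configs", "zip", root_dir=src_dir)
```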

# (see https://blog.min.io/small-file-archives/)
# Also, we already know from above that, if read-only, dataset must also be empty
# TODO: (later) consider whether there is some minimum file count (perhaps configurable) to also consider
if self.datasets[dataset_name].is_read_only:
Member


This feels like we are injecting a use case specific feature into a general api. For instance, in creating a forcing dataset I would likely mark it as read-only as it is more or less static. I don't think, in that case, it is desirable to zip up a directory of, let's say, netcdf forcing files. Zipping up a directory seems like it should be handled by the specific application and not this api. We will have to rethink how the data_archiving attribute is set on a Dataset if that is the case, though.

One thought is to create an algebraic datatype like a DataFormat.ARCHIVE[ArchiveFormat, DataFormat] that is, itself, a DataFormat but also wraps an ArchiveFormat and a DataFormat. It might be a little cumbersome to capture this using a python Enum, but we can certainly figure out how to best express the idea in the type system. This could be used to set whether a Dataset is data_archiving or not, if it still makes sense to keep that as a top level attribute.

This will move complexity to other places, but I think it retains some desirable traits of the system more generally. I'm sure there is a simpler way to capture this idea than what I've suggested. I think we should strive for simplicity until we can't.
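Purely as an illustration of the composite-type idea (not anything that exists in DMOD today; the member names below are made up for the example):

```python
from dataclasses import dataclass
from enum import Enum, auto


class ArchiveFormat(Enum):
    TAR = auto()
    ZIP = auto()


class InnerDataFormat(Enum):
    BMI_CONFIG = auto()
    NETCDF_FORCING = auto()


@dataclass(frozen=True)
class ArchivedDataFormat:
    """Acts as a data format while recording which archive wraps which inner format."""
    archive_format: ArchiveFormat
    inner_format: InnerDataFormat


# e.g., a zipped directory of BMI init configs:
zipped_bmi_configs = ArchivedDataFormat(ArchiveFormat.ZIP, InnerDataFormat.BMI_CONFIG)
```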

Contributor Author


This feels like we are injecting a use case specific feature into a general api ...

I disagree with the headline here but agree with parts of your argument in isolation.

It is a use case specific feature, but it's not part of the API. It's a specific implementation of the API that provides this behavior, but behind the scenes and particular to a subset of use cases. I don't see a better way to introduce storage-backend-specific optimizations other than doing it within the thing interacting with the backend. A specific application is not going to have any idea (nor should it) that there is a performance penalty for certain write scenarios (e.g., lots of files to the object store).

And, just to be clear, this is intended as a backend-specific optimization. It's only being introduced here because MinIO can take advantage of ZIP files in a particular way. I don't have a problem with other things also archiving data in a dataset in the future, if they have other reasons to do so. I expect eventually we'll want to; hence, the attribute within Dataset instead of just tracking inside the manager. But those are apples and oranges.

But I agree that this doesn't consider all scenarios sufficiently: a single netcdf forcing file, for example, probably doesn't need to be archived. Probably, 50 forcing CSV files don't really either. I even put in a TODO comment to this effect, but didn't want to just throw out a magic number of files for the too-many threshold. I'm open to discussing either what that number should be or how to better determine when the object store manager really can/should do this.

One thought is to create an algebraic datatype like a DataFormat.ARCHIVE[ArchiveFormat, DataFormat] that is, itself, a DataFormat but also wraps an ArchiveFormat and a DataFormat

That seems much more complicated and far reaching than what's in place now. And I still don't think we could effectively apply it without making the things creating the data start to worry too much about the details of the data storage backend*. Not to mention it would convolute the logic for reconciling data requirements.

* Caveat: this is already the case for workers writing output, kind of, though they don't interact with the DMOD data orchestration in the same way, so their data writing is handled more in isolation (at least for now).

@@ -64,7 +69,8 @@ def _add(path: Path, dest: Optional[str] = None, add_root: Optional[Path] = None
elif not dest:
dest = path.relative_to(add_root)

if not self._dataset_manager.add_data(dataset_name=self._dataset_name, dest=dest, data=path.read_bytes()):
if not self._dataset_manager.add_data(dataset_name=self._dataset_name, dest=dest, data=path.read_bytes(),
Member


This seems like a great place to use the reader interface.

Suggested change
if not self._dataset_manager.add_data(dataset_name=self._dataset_name, dest=dest, data=path.read_bytes(),
with path.open("rb") as fp:
if not self._dataset_manager.add_data(dataset_name=self._dataset_name, dest=dest, data=fp,

Contributor Author


Perhaps, but this is probably too far out of scope here. Admittedly, I did modify calls to self._dataset_manager.add_data in the InitialDataAdder implementations. One could argue (questionably) that this bled into scope due to it being done as part of the other modifications to BmiAutoGenerationAdder, but, more importantly, those changes were to fix something that was broken.

Member


I think you could argue this is out of scope, but IMO the change is minor and the potential performance consequences are high. My concern is if a large file is passed as path. This will blow up the resident memory usage of the process because the file read is greedy (likely a demand-paged mmap that is munmapped, but still not great). If we just pass something that has a read() method, add_data can perform buffered reads that have less potential to degrade performance.
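A sketch of the buffered-read point (the helper name, chunk size, and put_chunk callback are illustrative, not DMOD's actual add_data internals): accepting any object with a read() method lets the manager stream fixed-size chunks instead of holding the whole file in memory.

```python
from typing import BinaryIO, Callable


def _stream_to_backend(reader: BinaryIO, put_chunk: Callable[[bytes], None],
                       chunk_size: int = 1024 * 1024) -> None:
    """Read from any file-like object in chunks and forward each chunk to storage."""
    while True:
        chunk = reader.read(chunk_size)
        if not chunk:
            break
        put_chunk(chunk)


# Caller side, mirroring the suggested change above:
# with path.open("rb") as fp:
#     manager.add_data(dataset_name=name, dest=dest, data=fp, domain=domain)
```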

if self.partial_realization_config is not None:
raise DmodRuntimeError(f"{self.__class__.__name__} can't have 'None' for partial realization property")

try:
real_config: NgenRealization = self.build_realization_config_from_partial()
if not self._dataset_manager.add_data(dataset_name=self._dataset_name, dest=self._REAL_CONFIG_FILE_NAME,
data=json.dumps(real_config.json()).encode()):
data=json.dumps(real_config.json()).encode(), domain=original_domain):
Member


I think this should be the following, no?

Suggested change
data=json.dumps(real_config.json()).encode(), domain=original_domain):
data=real_config.json().encode(), domain=original_domain):
a = real_config.json().encode()
b = json.dumps(real_config.json()).encode()

assert type(json.loads(a)) == dict
assert type(json.loads(b)) == str

Contributor Author


Good catch. I've pushed a change for it.

Member

@aaraney aaraney left a comment


Really just one major comment that I think is worth addressing and then we should be good to go! Thanks, @robertbartel!

@@ -64,7 +69,8 @@ def _add(path: Path, dest: Optional[str] = None, add_root: Optional[Path] = None
elif not dest:
dest = path.relative_to(add_root)

if not self._dataset_manager.add_data(dataset_name=self._dataset_name, dest=dest, data=path.read_bytes()):
if not self._dataset_manager.add_data(dataset_name=self._dataset_name, dest=dest, data=path.read_bytes(),
Member


I think you could argue this is out of scope, but IMO the change is minor and the potential performance consequences are high. My concern is if a large file is passed as path. This will blow up the resident memory usage of the process because the file read is greedy (likely a demand-paged mmap that is munmapped, but still not great). If we just pass something that has a read() method, add_data can perform buffered reads that have less potential to degrade performance.

robertbartel and others added 21 commits July 31, 2024 13:06
- Add new StandardDatasetIndex enum value HYDROFABRIC_DATA_ID for
  referencing associated hydrofabric dataset.
- Update BMI_CONFIG DataFormat with indices to reference realization
  config and hydrofabric datasets, as when such datasets are used within
  DMOD to generate a new BMI_CONFIG dataset.
- Add DataArchiving enum type with values to define archiving methods.
- Add optional data_archiving attribute to Dataset to track when data
  contained within dataset is entirely contained within a single archive
  file.
Modifying add_initial_data to write all BMI configs at once to a
temporary directory so that files can be added to the dataset all at
once, allowing any optimizations available to the manager implementation
to be used.
Take advantage of the minio archiving feature when adding data to an
empty, read-only dataset (i.e., adding initial data that will not be
changed), since lots of files in a minio bucket have significant overhead.
Separating parts of the functionality for deriving BMI init config
datasets into more focused/reusable/testable functions.
Adding a mostly complete integration test for BMI init config generation
logic (except that it doesn't automatically create a hydrofabric dataset
yet), though it must be manually turned on via test env config.
Updating InitialDataAdder implementations to fix usage of calls to
DatasetManager.add_data(), which now requires a 'domain' argument, so
that the adder just passes the original/initial domain of the dataset
(result is no change in eventual domain merge op).
Rearranging existing value indices slightly also.
Fixing incorrect JSON handling for data passed to add_manager call.
Fix another incorrect JSON handling for data passed to add_manager call.
Fixing validator to account properly for scenarios with dataset_type not
set (i.e., set to the default of None).
Updating core dep to 0.19.0, communication dep to 0.21.0, and modeldata
dep to 0.13.0.
Updating core dep to 0.19.0 and communication dep to 0.21.0.
Updating versions of communication, dataservice, modeldata, and
requestservice packages.
Fixing stack name used for start/stop object store stack.
Co-authored-by: Austin Raney <austin.raney@noaa.gov>
@robertbartel robertbartel force-pushed the f/bmi_cfg_gen_io/main branch from e6c6ecf to 6efb92b on August 1, 2024 14:38
Update so that call to manager.add_data passes a buffered/binary reader
object instead of just reading all the bytes up front and passing those.
@robertbartel
Contributor Author

@aaraney, I've fixed conflicts and I think addressed your last concern by passing a reader instead of the raw bytes. Let me know if there is anything else.

@aaraney aaraney merged commit 452bf58 into NOAA-OWP:master Aug 5, 2024
8 checks passed
@aaraney
Member

aaraney commented Aug 6, 2024

Looks like the integration tests are now failing on master after this was merged. I had a hunch this might be a caching issue, so I cleared the runner caches, but that did not fix it. Trying to reproduce locally now.

@aaraney
Member

aaraney commented Aug 6, 2024

The service packages were not being installed for the integration tests. It's the same fix as in #575, but for the IT runs.

@aaraney
Member

aaraney commented Aug 6, 2024

Opened #696 to fix this.
