Configurable versioning #2355

merelcht · 2023-02-22T16:10:10Z

Converted PR #1871 into this issue, to continue the discussion after the PR is closed.

Description

This PR aims to add more customization for VersionedDataSets. It makes three main additions: custom format versioning, a customizable version class, and partial timestamp parsing.

Motivation

Because Kedro can only version datasets using a predefined path, a data history structure generated by earlier code that did not use Kedro would have to be refactored unnecessarily. I first tried another approach using PartitionedDataSets, but its logic is not only hard to maintain but also syntactically different from Kedro's declarative YAML idea. For this reason, I wrote this PR to help turn this need into a feature.

Custom format

The first addition enables the use of format codes in the filepath in order to change the default target path of the versioned file.

cars:
  type: pandas.CSVDataSet
  filepath: data/01_raw/company/car_data/%Y/%m/%d/car_data.csv
  versioned: true

With the dataset above, the filepath would be translated to data/01_raw/company/car_data/2022/09/25/car_data.csv if today's date were 2022-09-25.
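
As a rough illustration of the translation described above, the format codes can be expanded with the standard library's strftime. This is only a sketch of the idea, not the code from the PR; render_versioned_path is a hypothetical helper name.

```python
from datetime import datetime


def render_versioned_path(template: str, when: datetime) -> str:
    """Expand strftime format codes embedded in a filepath template."""
    # strftime leaves ordinary path characters untouched and only
    # substitutes the %-codes (%Y, %m, %d, ...).
    return when.strftime(template)


template = "data/01_raw/company/car_data/%Y/%m/%d/car_data.csv"
path = render_versioned_path(template, datetime(2022, 9, 25))
print(path)  # data/01_raw/company/car_data/2022/09/25/car_data.csv
```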

Partial timestamp

To simplify loading custom versioned datasets, partial (not fully specified) timestamps are also accepted.

kedro run --load-version "cars:2022-09-25"

or

catalog.load("cars", "2022-09-25")

This is now a possible way of selecting the load version.
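
One way to picture partial timestamp selection is as a prefix match against the versions that exist on disk. The sketch below assumes that the newest matching version wins, which may not be how the PR resolves ties; resolve_partial_version and the sample version strings are hypothetical.

```python
from typing import Iterable, Optional


def resolve_partial_version(partial: str, available: Iterable[str]) -> Optional[str]:
    """Return the newest available version that starts with `partial`."""
    matches = [v for v in available if v.startswith(partial)]
    # Timestamps in this zero-padded format sort chronologically as strings.
    return max(matches) if matches else None


versions = [
    "2022-09-24T08.00.00.000Z",
    "2022-09-25T09.15.00.000Z",
    "2022-09-25T17.45.30.123Z",
]
selected = resolve_partial_version("2022-09-25", versions)
print(selected)  # 2022-09-25T17.45.30.123Z
```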

Custom version class

If the custom date format is not enough to implement the versioning logic, then the user can subclass the Version class in order to override the default parse and unparse behaviour of the timestamps. For example, let's say you want to represent the day as the Sunday of the week every time you run the code. For that, you could do something like this:

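# sunday_version.py
from datetime import timedelta

from kedro.io import ProxyDateTime, Version


class SundayVersion(Version):
    def tosunday(self, version: ProxyDateTime) -> ProxyDateTime:
        dt = version.datetime
        # weekday() is 0 for Monday and 6 for Sunday, so this steps back
        # to the most recent Sunday (and stays put on a Sunday).
        dt = dt - timedelta((dt.weekday() + 1) % 7)
        return ProxyDateTime.from_datetime(dt)

    def parse(self, version_str: str) -> ProxyDateTime:
        # Parse as usual, then snap the result to the Sunday on or before it.
        date_time = super().parse(version_str)
        return self.tosunday(date_time)


# settings.py (after importing SundayVersion from sunday_version.py)
VERSION_CLASS = SundayVersion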
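
The Sunday arithmetic used in tosunday can be checked in isolation with plain dates. This is a standalone sketch that does not depend on the Version or ProxyDateTime classes from the PR; to_sunday is a hypothetical function name.

```python
from datetime import date, timedelta


def to_sunday(day: date) -> date:
    """Return the Sunday on or before `day`."""
    # date.weekday(): Monday=0 ... Sunday=6, so (weekday + 1) % 7 is the
    # number of days elapsed since the most recent Sunday.
    return day - timedelta(days=(day.weekday() + 1) % 7)


print(to_sunday(date(2022, 9, 28)))  # Wednesday -> 2022-09-25
print(to_sunday(date(2022, 9, 25)))  # Sunday stays 2022-09-25
```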

Development notes

Version class

Instead of being a namedtuple, Version is now a complete class that parses and unparses filepaths, taking over the part of AbstractDataSet that formerly processed timestamps into paths. This was done to enable custom version manipulation logic.

Kept the original behaviour

The default versioning behaviour was kept by using the new auxiliary methods is_custom_format and is_unique_date_format of AbstractDataSet.

`UnknownDateTime`

This class was implemented because of the mocks ['first', 'second'] in the unit tests. I'm not sure if these non-timestamp formats were only designed for testing or if they are actual features. If they are only used for testing, this class and its handling logic in the Version._safe_parse method can be removed, but the unit tests may need to be changed.

Custom `Version` class demands paying attention

Even though a custom Version class can be specified, its parsing, unparsing, and glob methods must be implemented safely so that they do not break the internal versioning logic. For instance, the example described above would be considered unique by is_unique_date_format if it implements all ISO format codes. However, because it changes the %d behaviour, it shouldn't be considered unique. There is a workaround for this problem in the docs, but it is something the user has to pay attention to. Also, because unparsing is called multiple times inside the code, the pattern can't be easily manipulated. For example, if the user wants unparse to always add the date at the end of the filepath, they have to be careful not to add it multiple times (because of the internal logic). These are some examples of the limitations of this setting.

Note: This customization of the datetime logic is very important for the use case I intend to use. I need the exact behaviour of the example, haha.

Unit tests

Wrote unit tests for all kedro.io.date_time classes and their methods, aiming to reproduce their callers' expectations in other parts of the code.

Wrote unit tests in test_data_catalog to check the new warnings and whether files created by datasets using custom versioning load and save correctly.

None of the already present tests were changed in order to make sure the default behaviour was preserved.


fazilhero · 2023-07-28T09:42:19Z

I would be in favor of a custom version class, configured in settings like many other options are at the moment. That is the only area where I saw Kedro wasn't compatible with our own internal tooling. Would be great to see this in action!

astrojuanlu · 2023-10-30T08:53:31Z

A user asks whether there's a way to timestamp datasets according to when the kedro run is launched and not when the dataset is written, or in other words, hardcode the timestamp https://linen-slack.kedro.org/t/16016262/dear-kedro-team-is-there-a-canonical-kedro-way-to-timestamp-#24683f38-88ce-45c3-93b7-ca56ea4e0508

xref #2694 and potentially #1731

jasonmhite · 2024-01-23T20:36:44Z

> A user asks whether there's a way to timestamp datasets according to when the kedro run is launched and not when the dataset is written, or in other words, hardcode the timestamp https://linen-slack.kedro.org/t/16016262/dear-kedro-team-is-there-a-canonical-kedro-way-to-timestamp-#24683f38-88ce-45c3-93b7-ca56ea4e0508

I'd agree that this would be a very nice improvement, we would generally prefer all the timestamps for any outputs be the timestamp of the initial run command.

Perhaps this isn't the right thread but I'd like to inject that it'd be nice to include the possibility to not strictly track versions based on the date. Right now my team has been discussing wanting to be able to organize versions by some sort of unique short identifier e.g. akin to git short hashes or unique word phrases instead of the dates in the filename. I've been reading through the Kedro source pondering how to go about this but not having much luck so far.

astrojuanlu · 2024-01-25T09:21:05Z

> I've been reading through the Kedro source pondering how to go about this but not having much luck so far.

It's mostly here:

kedro/kedro/io/core.py

Lines 580 to 586 in 7384abd

    def resolve_save_version(self) -> str | None:
        """Compute the version the dataset should be saved with."""
        if not self._version:
            return None
        if self._version.save:
            return self._version.save  # type: ignore[no-any-return]
        return self._fetch_latest_save_version()

and you'll see it's hardcoded to generate a timestamp:

kedro/kedro/io/core.py

Lines 557 to 562 in 7384abd

    # 'key' is set to prevent cache key overlapping for load and save:
    # https://cachetools.readthedocs.io/en/stable/#cachetools.cachedmethod
    @cachedmethod(cache=attrgetter("_version_cache"), key=partial(hashkey, "save"))
    def _fetch_latest_save_version(self) -> str:
        """Generate and cache the current save version"""
        return generate_timestamp()
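
To make the "hardcoded timestamp" point concrete, here is a sketch of what a pluggable version generator could look like, covering the short-identifier request above. None of this is Kedro API: timestamp_version, short_id_version and the generate parameter are hypothetical names, and the timestamp layout simply follows the YYYY-MM-DDThh.mm.ss.sssZ pattern used for Kedro versions.

```python
import uuid
from datetime import datetime, timezone
from typing import Callable, Optional

VERSION_FORMAT = "%Y-%m-%dT%H.%M.%S.%fZ"


def timestamp_version() -> str:
    """Timestamp-style version with millisecond precision (UTC)."""
    ts = datetime.now(tz=timezone.utc).strftime(VERSION_FORMAT)
    # %f yields six digits; keep the first three plus the trailing "Z".
    return ts[:-4] + ts[-1:]


def short_id_version() -> str:
    """Hypothetical alternative: an 8-character random identifier."""
    return uuid.uuid4().hex[:8]


def resolve_save_version(
    explicit: Optional[str],
    generate: Callable[[], str] = timestamp_version,
) -> str:
    """Use the explicit version if one is given, else ask the generator."""
    return explicit if explicit else generate()


print(resolve_save_version(None))                    # a timestamp
print(resolve_save_version(None, short_id_version))  # a short identifier
```

Note that random identifiers do not sort chronologically, so any "latest version" lookup would need another ordering signal, such as file modification time.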

astrojuanlu · 2024-01-25T09:22:49Z

Allowing for custom versions would open the gates for a Kedro + DVC integration (#2691). Lots of people have asked about this.

astrojuanlu · 2024-07-31T13:21:08Z

Had an idea today and got quite close to being able to configure the versioning using custom resolvers.

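# settings.py
import datetime as dt

from kedro.config import OmegaConfigLoader

CONFIG_LOADER_CLASS = OmegaConfigLoader
CONFIG_LOADER_ARGS = {
    "base_env": "base",
    "default_run_env": "local",
    "custom_resolvers": {
        # Current local time as a filename-safe timestamp
        "now": lambda: dt.datetime.now().strftime("%Y-%m-%dT%H.%M.%S.%fZ")
    }
}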

# datasets.py
from typing import Any

import polars as pl
from kedro.io import AbstractDataset


class SimpleCSVPolarsDataset(AbstractDataset):
    def __init__(self, filepath: str):
        self._filepath = filepath

    def _load(self) -> pl.DataFrame:
        return pl.read_csv(self._filepath)

    def _save(self, data: pl.DataFrame) -> None:
        data.write_csv(self._filepath)

    def _describe(self) -> dict[str, Any]:
        return {"filepath": self._filepath}

# catalog.yml
test_csv_dataset:
  type: kedro_pypi_monitor.datasets.SimpleCSVPolarsDataset
  filepath: data/02_intermediate/pypi_kedro_demo_${now:}.csv

Usage:

In [1]: from kedro.io import DataCatalog
   ...: from kedro.config import OmegaConfigLoader
   ...: 
   ...: import polars as pl
   ...: 
   ...: config_loader = OmegaConfigLoader(
   ...:     conf_source="conf",
   ...:     base_env="base",
   ...:     default_run_env="local",
   ...: )
   ...: catalog = DataCatalog.from_config(config_loader.get("catalog"))

In [2]: df = pl.DataFrame(...)

In [3]: catalog.save("test_csv_dataset", df)
[07/31/24 14:56:38] INFO     Saving data to test_csv_dataset (SimpleCSVPolarsDataset)...

In [4]: !tree data/
data/
└── 02_intermediate
    └── pypi_kedro_demo_2024-07-31T14.56.33.314040Z.csv

Caveats:

Trivial saving, but no magic "last version" discovery for loading
- For the case of final artifacts (plots, metrics), they usually are not the inputs of further nodes, so it's okay to not _load them
  - In fact, MetricsDataset famously doesn't even have _load
- And for the other cases, is this actually that bad? The desired version can be provided using runtime_params

test_csv_dataset:
  type: kedro_pypi_monitor.datasets.SimpleCSVPolarsDataset
  filepath: data/02_intermediate/pypi_kedro_demo_${now:${runtime_params:test_csv_dataset,''}}.csv
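
For the nested interpolation above to work, the now resolver would need to accept an optional argument, whereas the lambda in settings.py takes none. A minimal sketch of such a resolver follows; now_or_override is a hypothetical name, and I have not verified this end to end inside a Kedro project.

```python
from datetime import datetime

TIMESTAMP_FORMAT = "%Y-%m-%dT%H.%M.%S.%fZ"


def now_or_override(override: str = "") -> str:
    """Return `override` when one is supplied, else the current local time."""
    if override:
        # e.g. a version passed on the command line via runtime params
        return override
    return datetime.now().strftime(TIMESTAMP_FORMAT)


# Assumed wiring in settings.py: "custom_resolvers": {"now": now_or_override}
print(now_or_override())                               # fresh timestamp
print(now_or_override("2024-07-31T14.56.33.314040Z"))  # pinned version
```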

No magic directory creation
- Is that even bad? I always found the fact that Kedro creates directories with extensions extremely confusing
- And, if anything, could be handled in the dataset _save method itself
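
As a sketch of handling directory creation in the save step itself, the parent directories can be created just before writing. This assumes a local filesystem path (remote stores such as S3 would need a different approach), and save_text is a hypothetical stand-in for a dataset's _save method.

```python
import tempfile
from pathlib import Path


def save_text(filepath: str, text: str) -> None:
    """Write `text` to `filepath`, creating missing parent directories."""
    target = Path(filepath)
    # exist_ok=True makes repeated saves into the same directory safe.
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(text, encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "data" / "02_intermediate" / "demo.csv"
    save_text(str(out), "a,b\n1,2\n")
    created = out.exists()

print(created)  # True
```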

datajoely · 2024-08-01T08:47:14Z

At this point - shouldn't we just push people towards formats like Iceberg?

datajoely · 2024-08-01T08:57:19Z

I should say your solution is neat and elegant... but do we need to expose this to the user?

astrojuanlu · 2024-08-01T09:18:30Z

> At this point - shouldn't we just push people towards formats like Iceberg?

I've found that Data Scientists tend to prefer this no-frills versioning, many folks don't even set up something like MLflow for their local experiment tracking.

OTOH, Delta and Iceberg are perfectly supported through Polars, probably Pandas too. So the option already exists, I've been documenting it in my talks & workshops, and it's a matter of adding that to the docs.

The point is (and this is something @iamelijahko is researching at the moment): given that Delta & Iceberg exist (with versioning, time travel, automatic garbage collection) and also that low-complexity, filename-based versioning is possible with OmegaConf resolvers, do we want to additionally keep maintaining our AbstractVersionedDatasets?

datajoely · 2024-08-01T12:35:49Z

I'm for anything that removes the AbstractVersionedDataset, I guess the flip side - if we delegate the versioning to some other technology how do we standardise the kedro run --load-versions=<dataset_name>:YYYY-MM-DDThh.mm.ss.sssZ functionality?

astrojuanlu · 2024-08-01T12:53:37Z

I actually learned that kedro run --load-versions was a thing just yesterday while I was writing this comment. Wondering how many people use it. Will have a look at our telemetry.

datajoely · 2024-08-01T15:49:01Z

Yeah I suspect if you use versioning in the first place you either need this or Datacatalog.load({name}, version=...) to actually interrogate your work. A very quick scan shows it's baked into some of the deepest bits of Kedro:

astrojuanlu · 2024-09-18T10:53:52Z

A user asked about this exact approach https://kedro.hall.community/support-lY6wDVhxGXNY/pushing-files-to-s3-with-dynamic-names-FfCYxXyxTZF4

astrojuanlu · 2025-02-10T09:25:36Z

Conversation continues in #1979.

merelcht added the "Issue: Feature Request" label (new feature or improvement to existing feature) on Feb 22, 2023

merelcht added this to the Redesign Catalog and Datasets milestone on Feb 22, 2023

merelcht mentioned this issue on Feb 22, 2023: Configurable versioning #1871 (Closed, 5 tasks)

astrojuanlu mentioned this issue on Aug 22, 2023: How can we improve dataset versioning? #1979 (Open)

astrojuanlu mentioned this issue on Sep 13, 2023: Easier CustomDataset Creation #1936 (Open)

astrojuanlu mentioned this issue on Jan 25, 2024: Document usage of Kedro + DVC #2691 (Closed)

merelcht modified the milestones: Redesign the API for io.datacatalog and io.core, Dataset Versioning on Feb 2, 2024

astrojuanlu mentioned this issue on Sep 3, 2024: Design DataCatalog2.0 #3995 (Closed, 3 tasks)
