Configurable versioning #2355
Comments
I would be in favor of a custom version class, configured in settings like many other options at the moment. Versioning is the only area where I found Kedro wasn't compatible with our own internal tooling. Would be great to see this in action! |
A user asks whether there's a way to timestamp datasets according to when the |
I'd agree that this would be a very nice improvement, we would generally prefer all the timestamps for any outputs be the timestamp of the initial run command. Perhaps this isn't the right thread but I'd like to inject that it'd be nice to include the possibility to not strictly track versions based on the date. Right now my team has been discussing wanting to be able to organize versions by some sort of unique short identifier e.g. akin to git short hashes or unique word phrases instead of the dates in the filename. I've been reading through the Kedro source pondering how to go about this but not having much luck so far. |
It's mostly here: Lines 580 to 586 in 7384abd
and you'll see it's hardcoded to generate a timestamp: Lines 557 to 562 in 7384abd
|
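To make the contrast with a hardcoded timestamp concrete, here is a minimal sketch of a pluggable version generator. The names `timestamp_version`, `short_hash_version`, and `make_version` are hypothetical and not part of Kedro; the timestamp format is the one quoted elsewhere in this thread, and trimming it to milliseconds is an assumption.

```python
import hashlib
from datetime import datetime, timezone
from typing import Callable, Optional

# Format string quoted in this thread; assumed to match the default version shape.
VERSION_FORMAT = "%Y-%m-%dT%H.%M.%S.%fZ"


def timestamp_version(now: Optional[datetime] = None) -> str:
    """Return a timestamp-based version string, trimmed to milliseconds (assumption)."""
    moment = now or datetime.now(tz=timezone.utc)
    stamp = moment.strftime(VERSION_FORMAT)
    # Drop the last three microsecond digits but keep the trailing "Z".
    return stamp[:-4] + stamp[-1:]


def short_hash_version(seed: str, length: int = 7) -> str:
    """Return a short hex identifier derived from `seed`, akin to a git short hash."""
    return hashlib.sha1(seed.encode("utf-8")).hexdigest()[:length]


def make_version(generator: Callable[[], str] = timestamp_version) -> str:
    """Produce a version string from whichever generator the project configures."""
    return generator()
```

With something like this, a team that prefers short identifiers could pass `lambda: short_hash_version(run_id)` instead of the timestamp generator, where `run_id` is whatever uniquely names the run.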
Allowing for custom versions would open the gates for a Kedro + DVC integration (#2691), which lots of people have asked about. |
Had an idea today and got quite close to being able to configure the versioning using custom resolvers.

```python
# settings.py
import datetime as dt

from kedro.config import OmegaConfigLoader

CONFIG_LOADER_CLASS = OmegaConfigLoader
CONFIG_LOADER_ARGS = {
    "base_env": "base",
    "default_run_env": "local",
    "custom_resolvers": {
        # Renders the current time as a version-like timestamp string
        "now": lambda: dt.datetime.now().strftime("%Y-%m-%dT%H.%M.%S.%fZ")
    },
}
```

```python
# datasets.py
from typing import Any

import polars as pl
from kedro.io import AbstractDataset


class SimpleCSVPolarsDataset(AbstractDataset):
    def __init__(self, filepath: str):
        self._filepath = filepath

    def _load(self) -> pl.DataFrame:
        return pl.read_csv(self._filepath)

    def _save(self, data: pl.DataFrame) -> None:
        data.write_csv(self._filepath)

    def _describe(self) -> dict[str, Any]:
        return {"filepath": self._filepath}
```

```yaml
# catalog.yml
test_csv_dataset:
  type: kedro_pypi_monitor.datasets.SimpleCSVPolarsDataset
  filepath: data/02_intermediate/pypi_kedro_demo_${now:}.csv
```

Usage:
Caveats:

```yaml
test_csv_dataset:
  type: kedro_pypi_monitor.datasets.SimpleCSVPolarsDataset
  filepath: data/02_intermediate/pypi_kedro_demo_${now:${runtime_params:test_csv_dataset,''}}.csv
```
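The last catalog entry passes an argument to `now`, while the lambda in `settings.py` takes none. One possible way to reconcile the two is a resolver that accepts an optional override and otherwise returns a timestamp computed once per process, so every dataset in a run shares the same value. This is a sketch under assumptions: `now_resolver` and `_run_timestamp` are hypothetical names, and how the config loader passes an empty argument to the resolver should be verified against the installed Kedro and OmegaConf versions.

```python
from datetime import datetime
from functools import lru_cache

VERSION_FORMAT = "%Y-%m-%dT%H.%M.%S.%fZ"


@lru_cache(maxsize=1)
def _run_timestamp() -> str:
    """Compute the timestamp once; later calls in the same process reuse it."""
    return datetime.now().strftime(VERSION_FORMAT)


def now_resolver(override: str = "") -> str:
    """Return `override` when one is given, else the shared run timestamp.

    Accepting both zero arguments and an empty string covers either way the
    resolver might be invoked when no runtime parameter is supplied.
    """
    return override or _run_timestamp()
```

Such a function could be registered under `"now"` in `custom_resolvers` in place of the lambda. An explicit value would then select a specific file for loading, and the default keeps the timestamps of all outputs aligned with the start of the run.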
|
At this point - shouldn't we just push people towards formats like Iceberg? |
I should say your solution is neat and elegant... but do we need to expose this to the user? |
I've found that Data Scientists tend to prefer this no-frills versioning, many folks don't even set up something like MLflow for their local experiment tracking. OTOH, Delta and Iceberg are perfectly supported through Polars, probably Pandas too. So the option already exists, I've been documenting it in my talks & workshops, and it's a matter of adding that to the docs. The point is (and this is something @iamelijahko is researching at the moment): given that Delta & Iceberg exist (with versioning, time travel, automatic garbage collection) and also that low-complexity, filename-based versioning is possible with OmegaConf resolvers, do we want to additionally keep maintaining our |
I'm for anything that removes the |
I actually learned that |
Yeah I suspect if you use versioning in the first place you either need this or |
A user asked about this exact approach https://kedro.hall.community/support-lY6wDVhxGXNY/pushing-files-to-s3-with-dynamic-names-FfCYxXyxTZF4 |
Conversation continues in #1979. |
Converted PR #1871 into this issue, to continue the discussion after the PR is closed.
Description
This PR aims to add more customization for `VersionedDataSet`s. It makes three main additions: custom format versioning, a customizable version class, and partial timestamp parsing.
Motivation
Because Kedro can only version datasets using a predefined path layout, a data history produced by earlier code that did not use Kedro would have to be refactored unnecessarily. Because of that, I tried another approach using `PartitionedDataSets`, but its logic is hard to maintain and syntactically different from Kedro's declarative YAML approach. For this reason, I wrote this PR to help turn this need into a feature.
Custom format
The first addition enables the use of format codes in the filepath in order to change the default target path of the versioned file.
The example dataset would be translated to `data/01_raw/company/car_data/2022/09/25/car_data.csv` if today's date were 2022-09-25.
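The expansion described above can be sketched with the standard library's `strftime`. The filepath template below is an assumption reconstructed from the resulting path, and `expand_versioned_path` is a hypothetical helper, not the PR's implementation.

```python
from datetime import date


def expand_versioned_path(template: str, day: date) -> str:
    """Substitute strftime format codes in `template` with values from `day`.

    Literal percent signs in a real path would need to be written as "%%".
    """
    return day.strftime(template)


# Assumed template: the format codes sit where the date folders should appear.
TEMPLATE = "data/01_raw/company/car_data/%Y/%m/%d/car_data.csv"

path = expand_versioned_path(TEMPLATE, date(2022, 9, 25))
print(path)
```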
Partial timestamp
To simplify loading custom-versioned datasets, partially specified timestamps are now also accepted.
kedro run --load-version "cars:2022-09-25"
or
This is now a possible way of selecting the load version.
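One way partial timestamp handling could work is to try progressively coarser formats until one matches, then select the stored versions that fall under the partial value. This is a minimal sketch under assumptions: the list of accepted formats and the helper names `parse_partial_timestamp` and `select_versions` are hypothetical, and partial values are assumed to be zero-padded.

```python
from datetime import datetime
from typing import List

# Candidate formats, ordered from most to least specific (assumed list).
PARTIAL_FORMATS = [
    "%Y-%m-%dT%H.%M.%S.%fZ",
    "%Y-%m-%dT%H.%M.%S",
    "%Y-%m-%dT%H.%M",
    "%Y-%m-%dT%H",
    "%Y-%m-%d",
    "%Y-%m",
    "%Y",
]


def parse_partial_timestamp(text: str) -> datetime:
    """Parse `text` with the first format that matches it completely."""
    for fmt in PARTIAL_FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    raise ValueError(f"Unrecognised partial timestamp: {text!r}")


def select_versions(available: List[str], partial: str) -> List[str]:
    """Return the stored versions that start with the (validated) partial timestamp."""
    parse_partial_timestamp(partial)  # raises if `partial` is malformed
    return sorted(v for v in available if v.startswith(partial))
```

A loader could then pick the last element of `select_versions(...)` to get the most recent version saved on the requested day.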
Custom version class
If the custom date format is not enough to implement the versioning logic, then the user can subclass the `Version` class in order to override the default parse and unparse behaviour of the timestamps. For example, let's say you want to represent the day as the Sunday of the week every time you run the code. For that, you could do something like this:
Development notes
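As an illustration of the Sunday-of-the-week idea from the custom version class section, here is a minimal sketch. `SundayVersion`, its `parse` and `unparse` methods, and the choice of "most recent Sunday on or before the date" are assumptions made for illustration; they are not the PR's actual `Version` API.

```python
from datetime import date, datetime, timedelta


def sunday_of_week(day: date) -> date:
    """Return the most recent Sunday on or before `day` (assumed convention)."""
    # date.weekday() numbers Monday as 0 and Sunday as 6.
    return day - timedelta(days=(day.weekday() + 1) % 7)


class SundayVersion:
    """Hypothetical stand-in for a custom version class."""

    def __init__(self, pattern: str = "%Y/%m/%d"):
        self.pattern = pattern

    def unparse(self, moment: datetime) -> str:
        """Render the version for `moment`, snapping the day to its week's Sunday."""
        return sunday_of_week(moment.date()).strftime(self.pattern)

    def parse(self, text: str) -> date:
        """Read a rendered version back into a date."""
        return datetime.strptime(text, self.pattern).date()
```

Every day from a Sunday through the following Saturday maps to the same version string here, which is why a format overridden this way no longer identifies a unique day even though it still contains `%d`.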
Version class
`Version` is no longer a namedtuple; it is now a full class that parses and unparses filepaths, taking over the part of `AbstractDataSet` that previously turned timestamps into paths. This change was made to enable custom version manipulation logic.
Kept the original behaviour
The default versioning behaviour is preserved through the new auxiliary methods `is_custom_format` and `is_unique_date_format` of `AbstractDataSet`.
UnknownDateTime
This class was implemented because of the mocks ['first', 'second'] in the unit tests. I'm not sure whether these non-timestamp formats were designed only for testing or are actual features. If they are only used for testing, this class and its handling logic in the `Version._safe_parse` method can be removed, but the unit tests may need to be changed.
Custom `Version` class demands paying attention
Even though a custom `Version` class can be specified, its `parsing`, `unparsing`, and `glob` methods must be implemented safely so as not to break the internal versioning logic. For instance, the example described before would be considered unique by `is_unique_date_format` if it implements all ISO format codes. However, because it changes the `%d` behaviour, it shouldn't be considered unique. There is a workaround for this problem in the docs, but it is something the user has to pay attention to. Also, because unparsing is called multiple times inside the code, the pattern can't be easily manipulated. For example, if the user wants unparse to always add the date at the end of the filepath, they have to be careful not to add it multiple times (because of the internal logic). These are some examples of this setting's limitations.
Unit tests
Wrote unit tests for all `kedro.io.date_time` classes and their methods, aiming to reproduce their callers' expectations present in other parts of the code.
Wrote unit tests in `test_data_catalog` to check the new warnings and that files created by datasets using custom versioning load and save correctly.
None of the existing tests were changed, to make sure the default behaviour was preserved.