Commit c25bab3

Update the instructions for creating a new dataset
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
1 parent efdc0f9 commit c25bab3

3 files changed, +22 -21 lines changed

RELEASE.md

+1
@@ -1,6 +1,7 @@
 # Upcoming Release 0.19.7
 
 ## Major features and improvements
+* Exposed `load` and `save` publicly for each dataset in the core `kedro` library, and enabled other datasets to do the same. If a dataset doesn't expose `load` or `save` publicly, Kedro will fall back to using `_load` or `_save`, respectively.
 
 ## Bug fixes and other changes
 * Updated error message for invalid catalog entries.
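
The fallback described in the release note above can be pictured with a small sketch. This is only an illustration under stated assumptions: `SketchDataset`, `LegacyMemoryDataset`, and `PublicMemoryDataset` are hypothetical names, and the delegation shown here is one simple way to get the described behaviour, not necessarily how Kedro implements it.

```python
from typing import Any


class SketchDataset:
    """Hypothetical stand-in for a dataset base class; not Kedro's code."""

    def load(self) -> Any:
        # Default public entry point: fall back to the private method.
        return self._load()

    def save(self, data: Any) -> None:
        # Default public entry point: fall back to the private method.
        self._save(data)

    def _load(self) -> Any:
        raise NotImplementedError("define `load` or `_load`")

    def _save(self, data: Any) -> None:
        raise NotImplementedError("define `save` or `_save`")


class LegacyMemoryDataset(SketchDataset):
    """Older style: only the private methods are defined."""

    def __init__(self) -> None:
        self._data: Any = None

    def _load(self) -> Any:
        return self._data

    def _save(self, data: Any) -> None:
        self._data = data


class PublicMemoryDataset(SketchDataset):
    """Newer style: `load` and `save` are exposed publicly."""

    def __init__(self) -> None:
        self._data: Any = None

    def load(self) -> Any:
        return self._data

    def save(self, data: Any) -> None:
        self._data = data
```

Callers use `dataset.load()` and `dataset.save(...)` in both cases; only the place where the behaviour is defined differs.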

docs/source/data/how_to_create_a_custom_dataset.md

+17 -17
@@ -4,7 +4,7 @@
 
 ## AbstractDataset
 
-If you are a contributor and would like to submit a new dataset, you must extend the {py:class}`~kedro.io.AbstractDataset` interface or {py:class}`~kedro.io.AbstractVersionedDataset` interface if you plan to support versioning. It requires subclasses to override the `_load` and `_save` and provides `load` and `save` methods that enrich the corresponding private methods with uniform error handling. It also requires subclasses to override `_describe`, which is used in logging the internal information about the instances of your custom `AbstractDataset` implementation.
+If you are a contributor and would like to submit a new dataset, you must extend the {py:class}`~kedro.io.AbstractDataset` interface or {py:class}`~kedro.io.AbstractVersionedDataset` interface if you plan to support versioning. It requires subclasses to implement the `load` and `save` methods while providing wrappers that enrich the corresponding methods with uniform error handling. It also requires subclasses to override `_describe`, which is used in logging the internal information about the instances of your custom `AbstractDataset` implementation.
 
 
 ## Scenario
@@ -31,8 +31,8 @@ Consult the [Pillow documentation](https://pillow.readthedocs.io/en/stable/insta
 
 At the minimum, a valid Kedro dataset needs to subclass the base {py:class}`~kedro.io.AbstractDataset` and provide an implementation for the following abstract methods:
 
-* `_load`
-* `_save`
+* `load`
+* `save`
 * `_describe`
 
 `AbstractDataset` is generically typed with an input data type for saving data, and an output data type for loading data.
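
The documentation change above describes a generically typed base class with three required methods, plus wrappers that add uniform error handling around `load` and `save`. The sketch below shows one way such a design could work. Every name in it (`SketchAbstractDataset`, `SketchDatasetError`, `_with_uniform_errors`, `TextFileDataset`) is hypothetical, and the wrapping via `__init_subclass__` is an assumption made for illustration rather than a copy of Kedro's code.

```python
import abc
import functools
from typing import Any, Callable, Generic, TypeVar

_DI = TypeVar("_DI")  # input type accepted by `save`
_DO = TypeVar("_DO")  # output type returned by `load`


class SketchDatasetError(Exception):
    """Hypothetical uniform error raised by the wrappers below."""


def _with_uniform_errors(method: Callable[..., Any]) -> Callable[..., Any]:
    """Wrap a method so any failure surfaces as SketchDatasetError."""

    @functools.wraps(method)
    def wrapper(self: Any, *args: Any, **kwargs: Any) -> Any:
        try:
            return method(self, *args, **kwargs)
        except SketchDatasetError:
            raise
        except Exception as exc:
            message = (
                f"Failed to {method.__name__} "
                f"{type(self).__name__}({self._describe()}): {exc}"
            )
            raise SketchDatasetError(message) from exc

    return wrapper


class SketchAbstractDataset(abc.ABC, Generic[_DI, _DO]):
    """Hypothetical stand-in for a dataset base class; not Kedro's code."""

    def __init_subclass__(cls, **kwargs: Any) -> None:
        super().__init_subclass__(**kwargs)
        # Wrap only the methods defined directly on this subclass.
        for name in ("load", "save"):
            if name in cls.__dict__:
                setattr(cls, name, _with_uniform_errors(cls.__dict__[name]))

    @abc.abstractmethod
    def load(self) -> _DO: ...

    @abc.abstractmethod
    def save(self, data: _DI) -> None: ...

    @abc.abstractmethod
    def _describe(self) -> dict[str, Any]: ...


class TextFileDataset(SketchAbstractDataset[str, str]):
    """Saves and loads a string; both type parameters are `str`."""

    def __init__(self, filepath: str) -> None:
        self._filepath = filepath

    def load(self) -> str:
        with open(self._filepath, encoding="utf-8") as f:
            return f.read()

    def save(self, data: str) -> None:
        with open(self._filepath, "w", encoding="utf-8") as f:
            f.write(data)

    def _describe(self) -> dict[str, Any]:
        return {"filepath": self._filepath}
```

With this arrangement the subclass author writes plain `load` and `save` methods, and a missing file surfaces as `SketchDatasetError` with the original `FileNotFoundError` attached as its cause.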
@@ -70,15 +70,15 @@ class ImageDataset(AbstractDataset[np.ndarray, np.ndarray]):
         """
         self._filepath = filepath
 
-    def _load(self) -> np.ndarray:
+    def load(self) -> np.ndarray:
         """Loads data from the image file.
 
         Returns:
             Data from the image file as a numpy array.
         """
         ...
 
-    def _save(self, data: np.ndarray) -> None:
+    def save(self, data: np.ndarray) -> None:
         """Saves image data to the specified filepath"""
         ...
 
@@ -96,11 +96,11 @@ src/kedro_pokemon/datasets
 └── image_dataset.py
 ```
 
-## Implement the `_load` method with `fsspec`
+## Implement the `load` method with `fsspec`
 
 Many of the built-in Kedro datasets rely on [fsspec](https://filesystem-spec.readthedocs.io/en/latest/) as a consistent interface to different data sources, as described earlier in the section about the [Data Catalog](../data/data_catalog.md#dataset-filepath). In this example, it's particularly convenient to use `fsspec` in conjunction with `Pillow` to read image data, since it allows the dataset to work flexibly with different image locations and formats.
 
-Here is the implementation of the `_load` method using `fsspec` and `Pillow` to read the data of a single image into a `numpy` array:
+Here is the implementation of the `load` method using `fsspec` and `Pillow` to read the data of a single image into a `numpy` array:
 
 <details>
 <summary><b>Click to expand</b></summary>
@@ -130,7 +130,7 @@ class ImageDataset(AbstractDataset[np.ndarray, np.ndarray]):
         self._filepath = PurePosixPath(path)
         self._fs = fsspec.filesystem(self._protocol)
 
-    def _load(self) -> np.ndarray:
+    def load(self) -> np.ndarray:
         """Loads data from the image file.
 
         Returns:
@@ -168,14 +168,14 @@ In [2]: from PIL import Image
 In [3]: Image.fromarray(image).show()
 ```
 
-## Implement the `_save` method with `fsspec`
+## Implement the `save` method with `fsspec`
 
 Similarly, we can implement the `_save` method as follows:
 
 
 ```python
 class ImageDataset(AbstractDataset[np.ndarray, np.ndarray]):
-    def _save(self, data: np.ndarray) -> None:
+    def save(self, data: np.ndarray) -> None:
         """Saves image data to the specified filepath."""
         # using get_filepath_str ensures that the protocol and path are appended correctly for different filesystems
         save_path = get_filepath_str(self._filepath, self._protocol)
@@ -243,7 +243,7 @@ class ImageDataset(AbstractDataset[np.ndarray, np.ndarray]):
         self._filepath = PurePosixPath(path)
         self._fs = fsspec.filesystem(self._protocol)
 
-    def _load(self) -> np.ndarray:
+    def load(self) -> np.ndarray:
         """Loads data from the image file.
 
         Returns:
@@ -254,7 +254,7 @@ class ImageDataset(AbstractDataset[np.ndarray, np.ndarray]):
             image = Image.open(f).convert("RGBA")
             return np.asarray(image)
 
-    def _save(self, data: np.ndarray) -> None:
+    def save(self, data: np.ndarray) -> None:
         """Saves image data to the specified filepath."""
         save_path = get_filepath_str(self._filepath, self._protocol)
         with self._fs.open(save_path, mode="wb") as f:
@@ -312,7 +312,7 @@ To add versioning support to the new dataset we need to extend the
 {py:class}`~kedro.io.AbstractVersionedDataset` to:
 
 * Accept a `version` keyword argument as part of the constructor
-* Adapt the `_save` and `_load` method to use the versioned data path obtained from `_get_save_path` and `_get_load_path` respectively
+* Adapt the `load` and `save` method to use the versioned data path obtained from `_get_load_path` and `_get_save_path` respectively
 
 The following amends the full implementation of our basic `ImageDataset`. It now loads and saves data to and from a versioned subfolder (`data/01_raw/pokemon-images-and-types/images/images/pikachu.png/<version>/pikachu.png` with `version` being a datetime-formatted string `YYYY-MM-DDThh.mm.ss.sssZ` by default):
 
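
The versioned layout described above (`<filepath>/<version>/<filename>`, with a `YYYY-MM-DDThh.mm.ss.sssZ` version string) can be reproduced with the standard library alone. This is a minimal sketch: `sketch_version_string` and `sketch_versioned_path` are hypothetical helpers written for illustration, not the functions Kedro uses behind `_get_load_path` and `_get_save_path`.

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath


def sketch_version_string(now: datetime) -> str:
    """Format an aware datetime as YYYY-MM-DDThh.mm.ss.sssZ in UTC."""
    utc = now.astimezone(timezone.utc)
    millis = utc.microsecond // 1000  # keep millisecond precision only
    return utc.strftime("%Y-%m-%dT%H.%M.%S.") + f"{millis:03d}Z"


def sketch_versioned_path(filepath: PurePosixPath, version: str) -> PurePosixPath:
    """Build <filepath>/<version>/<filename>."""
    return filepath / version / filepath.name


path = PurePosixPath("data/01_raw/pokemon-images-and-types/images/images/pikachu.png")
version = sketch_version_string(
    datetime(2024, 5, 1, 12, 30, 45, 123000, tzinfo=timezone.utc)
)
print(sketch_versioned_path(path, version))
# data/01_raw/pokemon-images-and-types/images/images/pikachu.png/2024-05-01T12.30.45.123Z/pikachu.png
```

The timestamp in the usage lines is an arbitrary example value chosen to make the output deterministic.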
@@ -359,7 +359,7 @@ class ImageDataset(AbstractVersionedDataset[np.ndarray, np.ndarray]):
             glob_function=self._fs.glob,
         )
 
-    def _load(self) -> np.ndarray:
+    def load(self) -> np.ndarray:
         """Loads data from the image file.
 
         Returns:
@@ -370,7 +370,7 @@ class ImageDataset(AbstractVersionedDataset[np.ndarray, np.ndarray]):
             image = Image.open(f).convert("RGBA")
             return np.asarray(image)
 
-    def _save(self, data: np.ndarray) -> None:
+    def save(self, data: np.ndarray) -> None:
         """Saves image data to the specified filepath."""
         save_path = get_filepath_str(self._get_save_path(), self._protocol)
         with self._fs.open(save_path, mode="wb") as f:
@@ -435,7 +435,7 @@ The difference between the original `ImageDataset` and the versioned `ImageDatas
 +            glob_function=self._fs.glob,
 +        )
 +
-     def _load(self) -> np.ndarray:
+     def load(self) -> np.ndarray:
          """Loads data from the image file.
  
          Returns:
@@ -447,7 +447,7 @@ The difference between the original `ImageDataset` and the versioned `ImageDatas
              image = Image.open(f).convert("RGBA")
              return np.asarray(image)
  
-     def _save(self, data: np.ndarray) -> None:
+     def save(self, data: np.ndarray) -> None:
          """Saves image data to the specified filepath."""
 -        save_path = get_filepath_str(self._filepath, self._protocol)
 +        save_path = get_filepath_str(self._get_save_path(), self._protocol)

kedro/io/core.py

+4 -4
@@ -93,10 +93,10 @@ class AbstractDataset(abc.ABC, Generic[_DI, _DO]):
         >>>         self._param1 = param1
         >>>         self._param2 = param2
         >>>
-        >>>     def _load(self) -> pd.DataFrame:
+        >>>     def load(self) -> pd.DataFrame:
         >>>         return pd.read_csv(self._filepath)
         >>>
-        >>>     def _save(self, df: pd.DataFrame) -> None:
+        >>>     def save(self, df: pd.DataFrame) -> None:
         >>>         df.to_csv(str(self._filepath))
         >>>
         >>>     def _exists(self) -> bool:
@@ -555,11 +555,11 @@ class AbstractVersionedDataset(AbstractDataset[_DI, _DO], abc.ABC):
         >>>         self._param1 = param1
         >>>         self._param2 = param2
         >>>
-        >>>     def _load(self) -> pd.DataFrame:
+        >>>     def load(self) -> pd.DataFrame:
         >>>         load_path = self._get_load_path()
         >>>         return pd.read_csv(load_path)
         >>>
-        >>>     def _save(self, df: pd.DataFrame) -> None:
+        >>>     def save(self, df: pd.DataFrame) -> None:
         >>>         save_path = self._get_save_path()
         >>>         df.to_csv(str(save_path))
         >>>
