docs/source/03_tutorial/02_tutorial_template.md (+12 -23)
@@ -14,19 +14,11 @@ Navigate to your chosen working directory and run the following to [create a new
kedro new
```

- When prompted for a project name, enter `Kedro Tutorial`. Subsequently, accept the default suggestions for `repo_name` and `python_package` by pressing enter.
+ When prompted for a project name, enter `Kedro Tutorial`. Subsequently, accept the default suggestions for `repo_name` and `python_package` by pressing enter. Then navigate to the root directory of the project, `kedro-tutorial`.

- ## Install project dependencies with `kedro install`
+ ## Install dependencies

- To install the project-specific dependencies, navigate to the root directory of the project and run:
-
- ```bash
- kedro install
- ```
-
- ### More about project dependencies
-
- Up to this point, we haven't discussed project dependencies, so now is a good time to examine them. We use Kedro to specify a project's dependencies and make it easier for others to run your project. It avoids version conflicts because Kedro ensures that you use same Python packages and versions.
+ Up to this point, we haven't discussed project dependencies, so now is a good time to examine them. We use a `requirements.txt` file to specify a project's dependencies and make it easier for others to run your project. This avoids version conflicts by ensuring that you use the same Python packages and versions.

The generic project template bundles some typical dependencies, in `src/requirements.txt`. Here's a typical example, although you may find that the version numbers are slightly different depending on the version of Kedro that you are using:
@@ -50,28 +42,25 @@ wheel>=0.35, <0.37 # The reference implementation of the Python wheel packaging
.. note:: If your project has ``conda`` dependencies, you can create a ``src/environment.yml`` file and list them there.
```

- ### Add and remove project-specific dependencies
-
- The dependencies above may be sufficient for some projects, but for the spaceflights project, you need to add some extra requirements.
+ The dependencies above may be sufficient for some projects, but for this tutorial you need to add some extra requirements. These will enable us to work with different data formats (including CSV, Excel and Parquet) and to visualise the pipeline.

- * In this tutorial, we work with different data formats including CSV, Excel and Parquet and want to visualise our pipeline so we will need to provide extra dependencies.
- * By running `kedro install` on a blank template we generate a new file at `src/requirements.in`. You can read more about the benefits of compiling dependencies [here](../04_kedro_project_setup/01_dependencies.md)
- * The most important point to learn here is that you should edit the `requirements.in` file going forward.
-
- Add the following requirements to your `src/requirements.in` lock file:
+ Edit your `src/requirements.txt` file to include the following lines:
- openpyxl==3.0.9 # Use modern Excel engine (will not be required in 0.18.0)
+ kedro-viz~=4.0 # Visualise your pipelines
+ openpyxl>=3.0.6, <4.0 # Use modern Excel engine (will not be required in 0.18.0)
+ scikit-learn~=1.0 # For modelling in the data science pipeline
```

- Then run the following command to re-compile your updated dependencies and install them into your environment:
+ To install all the project-specific dependencies, navigate to the root directory of the project and run:

```bash
- kedro install --build-reqs
+ pip install -r src/requirements.txt
```

+ You can find out more about [how to work with project dependencies](../04_kedro_project_setup/01_dependencies.md) in the Kedro project documentation.

## Configure the project

You may optionally add to `conf/local/credentials.yml` any credentials that you need to load specific data sources, such as usernames and passwords. Some examples are given within the file to illustrate how you store credentials. Additional information can be found in the [advanced documentation on configuration](../04_kedro_project_setup/02_configuration.md).
+ .. note:: If this is the first ``kedro`` command you have executed in the project, you will be asked whether you wish to opt into `usage analytics <https://github.com/quantumblacklabs/kedro-telemetry>`_. Your decision is recorded in the ``.telemetry`` file so that subsequent calls to ``kedro`` in this project do not ask you again.
+ ```

The command loads the dataset named `companies` (as per the top-level key in `catalog.yml`) from the underlying filepath `data/01_raw/companies.csv` into the variable `companies`, which is of type `pandas.DataFrame`. The `head` method from `pandas` then displays the first five rows of the DataFrame.
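The command itself is not captured in this hunk; inside a `kedro ipython` session it amounts to something like the sketch below, where the `catalog` object is injected into the session by Kedro and the dataset name matches the top-level key in `catalog.yml`:

```python
# Inside a `kedro ipython` session Kedro provides a `catalog` object for you.
# Sketch only: the exact snippet lives in a part of the file not shown in this diff.
companies = catalog.load("companies")  # reads data/01_raw/companies.csv into a pandas.DataFrame
companies.head()  # display the first five rows
```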
When you have finished, close the `ipython` session as follows:
@@ -129,7 +133,7 @@ shuttles:
```eval_rst
.. note::
-   The ``load_args`` are passed to the ``pd.read_excel`` method as `keyword arguments <https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html>`_, conversely providing ``save_args`` would be passed to the ``pd.DataFrame.to_excel`` `method <https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html>`_.
+   The ``load_args`` are passed to the ``pd.read_excel`` method as `keyword arguments <https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html>`_; although not specified here, ``save_args`` would be passed to the ``pd.DataFrame.to_excel`` `method <https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_excel.html>`_.
```
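To make the note concrete, `load_args` and `save_args` are simply forwarded to the underlying pandas calls. A simplified Python equivalent follows; this is an illustration rather than Kedro's actual `ExcelDataSet` implementation, and the `engine`, `index` and file path values are examples only:

```python
import pandas as pd

# A catalog entry's load_args (for example `engine: openpyxl`) become keyword
# arguments on the underlying pandas read call:
shuttles = pd.read_excel("data/01_raw/shuttles.xlsx", engine="openpyxl")

# Likewise, any save_args would be forwarded to DataFrame.to_excel, for example:
shuttles.to_excel("data/02_intermediate/shuttles.xlsx", index=False)
```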
To test that everything works as expected, load the dataset within a _new_ `kedro ipython` session and display its first five rows:
docs/source/03_tutorial/04_create_pipelines.md (+24 -37)
@@ -17,8 +17,11 @@ In the terminal run the following command:
kedro pipeline create data_processing
```

- * This will generate all the files you need to start writing a `data_processing` pipeline. This command generates a new `nodes.py` and `pipeline.py` under the `src/kedro_tutorial/pipelines/data_processing` folder.
- * The `kedro pipeline create <pipeline_name>` command is a convenience method so you don't have to worry about getting your ``__init__.py`` files in the right place, but of course you are welcome to create all the files manually.
+ This generates all the files you need to start writing a `data_processing` pipeline:
+ * `nodes.py` and `pipeline.py` in the `src/kedro_tutorial/pipelines/data_processing` folder for the main node functions that form your pipeline
+ * `conf/base/parameters/data_processing.yml` to define the parameters used when running the pipeline
+ * `src/tests/pipelines/data_processing` for tests for your pipeline
+ * `__init__.py` files in the required places to ensure that the pipeline can be imported by Python

- Open `src/kedro_tutorial/pipelines/data_processing/nodes.py` and add the code below, which provides two functions (`preprocess_companies` and `preprocess_shuttles`) that each input a raw dataframe and output a dataframe containing pre-processed data:
+ Open `src/kedro_tutorial/pipelines/data_processing/nodes.py` and add the code below, which provides two functions (`preprocess_companies` and `preprocess_shuttles`) that each take a raw DataFrame and output a DataFrame containing pre-processed data:

<details>
<summary><b>Click to expand</b></summary>
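The function bodies themselves sit inside the collapsed snippet and are not shown in this diff. As a rough sketch of the shape such a node takes, a plain function from one `pandas.DataFrame` to another with all I/O handled by Kedro, consider the following; the column name and cleaning step are illustrative assumptions, not the tutorial's exact code:

```python
import pandas as pd


def preprocess_companies(companies: pd.DataFrame) -> pd.DataFrame:
    """Sketch of a preprocessing node: take one raw DataFrame, return a cleaned one.
    The column name and cleaning step here are assumptions for illustration."""
    companies = companies.copy()
    # Example cleanup: strip a trailing "%" and convert the column to a float.
    companies["company_rating"] = (
        companies["company_rating"].str.replace("%", "", regex=False).astype(float)
    )
    return companies
```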
@@ -115,7 +118,7 @@ Add the following to `src/kedro_tutorial/pipelines/data_processing/pipeline.py`,
@@ -251,11 +256,11 @@ The code above declares explicitly that [pandas.ParquetDataSet](/kedro.extras.da
The [Data Catalog](../13_resources/02_glossary.md#data-catalog) will take care of saving the datasets automatically (in this case as Parquet) to the path specified next time the pipeline is run. There is no need to change any code in your preprocessing functions to accommodate this change.

- [Apache Parquet](https://github.com/apache/parquet-format) is our recommended format for working with processed and typed data. We recommend getting your data out of CSV as soon as possible. Parquet supports things like compression, partitioning and types out of the box. Whilst you do lose the ability to view the file as text, the benefits greatly outweigh the drawbacks.
+ We choose the [Apache Parquet](https://github.com/apache/parquet-format) format for working with processed and typed data. We recommend getting your data out of CSV as soon as possible. Parquet supports things like compression, partitioning and types out of the box. Whilst you lose the ability to view the file as text, the benefits greatly outweigh the drawbacks.

### Extend the data processing pipeline

- The next step in the tutorial is to add another node for a function to join together the three dataframes into a single model input table. First, add the `create_model_input_table()` function from the snippet below to `src/kedro_tutorial/pipelines/data_processing/nodes.py`.
+ The next step in the tutorial is to add another node for a function to join together the three DataFrames into a single model input table. First, add the `create_model_input_table()` function from the snippet below to `src/kedro_tutorial/pipelines/data_processing/nodes.py`.

<details>
<summary><b>Click to expand</b></summary>
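Again, the actual `create_model_input_table()` implementation is inside the collapsed snippet. In outline it is a node that joins the three DataFrames on their shared keys, roughly as in the hedged sketch below; the join keys are assumptions, not necessarily those used in the tutorial:

```python
import pandas as pd


def create_model_input_table(
    shuttles: pd.DataFrame, companies: pd.DataFrame, reviews: pd.DataFrame
) -> pd.DataFrame:
    """Illustrative sketch only: join the three pre-processed DataFrames into a
    single model input table. The join keys are assumed, not definitive."""
    rated_shuttles = shuttles.merge(reviews, left_on="id", right_on="shuttle_id")
    model_input_table = rated_shuttles.merge(
        companies, left_on="company_id", right_on="id"
    )
    return model_input_table.dropna()
```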
@@ -304,7 +309,7 @@ from .nodes import create_model_input_table, preprocess_companies, preprocess_sh
```

- ### Persisting the model input table
+ ### Persist the model input table

If you want the model input table data to be saved to file rather than used in-memory, add an entry to `conf/base/catalog.yml`:
@@ -343,47 +348,29 @@ You should see output similar to the following:
- ### Using `kedro viz --autoreload` to see how Kedro brings the pipeline together
+ ### Use `kedro viz --autoreload`

Run the following command:

```bash
kedro viz --autoreload
```

- The gif below shows how commenting out the `create_model_input_table_node` in `pipeline.py` will trigger a re-render of the pipeline:
+ The `autoreload` flag will ensure that changes to your pipeline are automatically reflected in Kedro-Viz. For example, commenting out `create_model_input_table_node` in `pipeline.py` will trigger a re-render of the pipeline:



```eval_rst
- .. note:: This is also a great time to highlight how Kedro's `topological sorting <https://en.wikipedia.org/wiki/Topological_sorting>`_ works. The actual order of the ``node()`` calls in the ``Pipeline`` object is irrelevant, Kedro works out the execution graph via the inputs/outputs declared, not the order provided by the user. This means you as a developer simply ask Kedro what data you want and it will derive the execution graph automatically.
+ .. note:: This is also a great time to highlight how Kedro's `topological sorting <https://en.wikipedia.org/wiki/Topological_sorting>`_ works. The actual order of the ``node()`` calls in the ``pipeline`` is irrelevant; Kedro works out the execution graph via the inputs/outputs declared, not the order provided by the user. This means that you, as a developer, simply ask Kedro what data you want and it will derive the execution graph automatically.
```
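A minimal illustration of that note, using hypothetical node functions and dataset names rather than the tutorial's own: the pipeline below lists the downstream node first, yet Kedro still runs `make_a_node` before `make_b_node`, because the execution order is derived from the declared inputs and outputs:

```python
from kedro.pipeline import node, pipeline


def make_a(raw):
    # Hypothetical node function, for illustration only.
    return raw


def make_b(a):
    # Hypothetical downstream node function.
    return a


# Listing the downstream node first makes no difference: Kedro sorts the graph
# topologically from the declared inputs/outputs, not from the order given here.
data_processing = pipeline(
    [
        node(make_b, inputs="a", outputs="b", name="make_b_node"),
        node(make_a, inputs="raw", outputs="a", name="make_a_node"),
    ]
)
```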
## Data science pipeline

- We have created a modular pipeline for data processing, which merges three input datasets to create a model input table. Now we will create the data science pipeline for price prediction, which uses the [`LinearRegression`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html)
- implementation from the [scikit-learn](https://scikit-learn.org/stable/) library.
-
- ### Update dependencies
-
- We now need to add `scikit-learn` to the project's dependencies. This is a slightly different process from the initial change we made early in the tutorial.
-
- To **update** the project's dependencies, you should modify `src/requirements.in` to add the following. Note that you do not need to update ``src/requirements.txt`` as you did previously in the tutorial before you built the project's requirements with ``kedro build-reqs``:
-
- ```text
- scikit-learn==0.23.1
- ```
-
- Then, re-run `kedro install` with a flag telling Kedro to recompile the requirements:
-
- ```bash
- kedro install --build-reqs
- ```
-
- You can find out more about [how to work with project dependencies](../04_kedro_project_setup/01_dependencies) in the Kedro project documentation.
+ We have created a modular pipeline for data processing, which merges three input datasets to create a model input table. Now we will create the data science pipeline for price prediction, which uses the [`LinearRegression`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) implementation from the [scikit-learn](https://scikit-learn.org/stable/) library.
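The model-training node itself appears later in the tutorial and is outside this diff; as a rough indication of how scikit-learn's `LinearRegression` is typically used in such a node (the argument names are assumptions, not the tutorial's exact signature):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression


def train_model(X_train: pd.DataFrame, y_train: pd.Series) -> LinearRegression:
    """Sketch of a training node: fit a linear regression on the training split.
    Argument names here are assumptions rather than the tutorial's exact code."""
    regressor = LinearRegression()
    regressor.fit(X_train, y_train)
    return regressor
```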
### Create the data science pipeline

+ Run the following command to create the `data_science` pipeline:
```bash
kedro pipeline create data_science
```
@@ -492,13 +479,13 @@ Versioning is enabled for `regressor`, which means that the pickled output of th
To create a modular pipeline for the price prediction model, add the following to the top of `src/kedro_tutorial/pipelines/data_science/pipeline.py`:

```python
- from kedro.pipeline import Pipeline, node
+ from kedro.pipeline import Pipeline, node, pipeline

from .nodes import evaluate_model, split_data, train_model
docs/source/03_tutorial/05_visualise_pipeline.md (+2 -2)
@@ -4,7 +4,7 @@
## Install Kedro-Viz

- You can install Kedro-Viz by running:
+ If you did not already install Kedro-Viz when you [installed the tutorial project dependencies](02_tutorial_template.md#install-dependencies), you can do so now by running:
```bash
pip install kedro-viz
```
@@ -132,7 +132,7 @@ def compare_shuttle_speed():
def create_pipeline(**kwargs) -> Pipeline:
    """This is a simple pipeline which generates a plot"""