Commit 4470623

Merge branch 'main' into deepyaman-patch-3
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
2 parents 4aeaa1f + 035f463

9 files changed: +56 -84 lines

docs/source/03_tutorial/01_spaceflights_tutorial.md (+1 -1)

@@ -21,7 +21,7 @@ When building a Kedro project, you will typically follow a standard development
 ### 1. Set up the project template

 * Create a new project with `kedro new`
-* Install project dependencies with `kedro install`
+* Install project dependencies with `pip install`
 * Configure the following in the `conf` folder:
   * Logging
   * Credentials

docs/source/03_tutorial/02_tutorial_template.md (+12 -23)

@@ -14,19 +14,11 @@ Navigate to your chosen working directory and run the following to [create a new
 kedro new
 ```

-When prompted for a project name, enter `Kedro Tutorial`. Subsequently, accept the default suggestions for `repo_name` and `python_package` by pressing enter.
+When prompted for a project name, enter `Kedro Tutorial`. Subsequently, accept the default suggestions for `repo_name` and `python_package` by pressing enter. Then navigate to the root directory of the project, `kedro-tutorial`.

-## Install project dependencies with `kedro install`
+## Install dependencies

-To install the project-specific dependencies, navigate to the root directory of the project and run:
-
-```bash
-kedro install
-```
-
-### More about project dependencies
-
-Up to this point, we haven't discussed project dependencies, so now is a good time to examine them. We use Kedro to specify a project's dependencies and make it easier for others to run your project. It avoids version conflicts because Kedro ensures that you use same Python packages and versions.
+Up to this point, we haven't discussed project dependencies, so now is a good time to examine them. We use a `requirements.txt` file to specify a project's dependencies and make it easier for others to run your project. This avoids version conflicts by ensuring that you use the same Python packages and versions.

 The generic project template bundles some typical dependencies in `src/requirements.txt`. Here's a typical example, although you may find that the version numbers are slightly different depending on the version of Kedro that you are using:

@@ -50,28 +42,25 @@ wheel>=0.35, <0.37 # The reference implementation of the Python wheel packaging
 .. note:: If your project has ``conda`` dependencies, you can create a ``src/environment.yml`` file and list them there.
 ```

-### Add and remove project-specific dependencies
-
-The dependencies above may be sufficient for some projects, but for the spaceflights project, you need to add some extra requirements.
+The dependencies above may be sufficient for some projects, but for this tutorial you need to add some extra requirements. These will enable us to work with different data formats (including CSV, Excel and Parquet) and to visualise the pipeline.

-* In this tutorial, we work with different data formats including CSV, Excel and Parquet and want to visualise our pipeline so we will need to provide extra dependencies.
-* By running `kedro install` on a blank template we generate a new file at `src/requirements.in`. You can read more about the benefits of compiling dependencies [here](../04_kedro_project_setup/01_dependencies.md)
-* The most important point to learn here is that you should edit the `requirements.in` file going forward.
-
-Add the following requirements to your `src/requirements.in` lock file:
+Edit your `src/requirements.txt` file to include the following lines:

 ```text
 kedro[pandas.CSVDataSet, pandas.ExcelDataSet, pandas.ParquetDataSet]==0.17.6 # Specify optional Kedro dependencies
-kedro-viz==4.1.1 # Visualise your pipelines
-openpyxl==3.0.9 # Use modern Excel engine (will not be required in 0.18.0)
+kedro-viz~=4.0 # Visualise your pipelines
+openpyxl>=3.0.6, <4.0 # Use modern Excel engine (will not be required in 0.18.0)
+scikit-learn~=1.0 # For modelling in the data science pipeline
 ```

-Then run the following command to re-compile your updated dependencies and install them into your environment:
+To install all the project-specific dependencies, navigate to the root directory of the project and run:

 ```bash
-kedro install --build-reqs
+pip install -r src/requirements.txt
 ```

+You can find out more about [how to work with project dependencies](../04_kedro_project_setup/01_dependencies.md) in the Kedro project documentation.
+
 ## Configure the project

 You may optionally add in any credentials to `conf/local/credentials.yml` that you would need to load specific data sources like usernames and passwords. Some examples are given within the file to illustrate how you store credentials. Additional information can be found in the [advanced documentation on configuration](../04_kedro_project_setup/02_configuration.md).
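
A minimal sanity check, assuming the requirements above have already been installed with `pip install -r src/requirements.txt`; the distribution names below simply mirror the pins added in this change:

```python
# Minimal sketch: confirm the pinned distributions from src/requirements.txt are
# installed in the active environment (importlib.metadata requires Python 3.8+).
from importlib.metadata import version

# The kedro[pandas.*] extras make the pandas-backed dataset classes importable.
from kedro.extras.datasets.pandas import CSVDataSet, ExcelDataSet, ParquetDataSet  # noqa: F401

for dist in ("kedro", "kedro-viz", "openpyxl", "scikit-learn"):
    print(dist, version(dist))
```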

docs/source/03_tutorial/03_set_up_data.md (+5 -1)

@@ -107,6 +107,10 @@ companies = catalog.load("companies")
 companies.head()
 ```

+```eval_rst
+.. note:: If this is the first ``kedro`` command you have executed in the project, you will be asked whether you wish to opt into `usage analytics <https://github.com/quantumblacklabs/kedro-telemetry>`_. Your decision is recorded in the ``.telemetry`` file so that subsequent calls to ``kedro`` in this project do not ask you again.
+```
+
 The command loads the dataset named `companies` (as per top-level key in `catalog.yml`) from the underlying filepath `data/01_raw/companies.csv` into the variable `companies`, which is of type `pandas.DataFrame`. The `head` method from `pandas` then displays the first five rows of the DataFrame.

 When you have finished, close the `ipython` session as follows:
@@ -129,7 +133,7 @@ shuttles:

 ```eval_rst
 .. note::
-    The ``load_args`` are passed to the ``pd.read_excel`` method as `keyword arguments <https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html>`_, conversely providing ``save_args`` would be passed to the ``pd.DataFrame.to_excel`` `method <https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html>`_.
+    The ``load_args`` are passed to the ``pd.read_excel`` method as `keyword arguments <https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html>`_; although not specified here, ``save_args`` would be passed to the ``pd.DataFrame.to_excel`` `method <https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_excel.html>`_.
 ```

 To test that everything works as expected, load the dataset within a _new_ `kedro ipython` session and display its first five rows:
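
As an illustration of the note above, here is a rough sketch of what a catalog entry with `load_args` does under the hood; the file path and the `openpyxl` engine are assumptions for the example rather than values taken from this diff:

```python
# Illustrative sketch only: load_args declared in catalog.yml are forwarded to
# pd.read_excel as keyword arguments when the dataset is loaded.
from kedro.extras.datasets.pandas import ExcelDataSet

shuttles_dataset = ExcelDataSet(
    filepath="data/01_raw/shuttles.xlsx",  # assumed path
    load_args={"engine": "openpyxl"},      # becomes pd.read_excel(..., engine="openpyxl")
)
shuttles = shuttles_dataset.load()
print(shuttles.head())
```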

docs/source/03_tutorial/04_create_pipelines.md (+24 -37)

@@ -17,8 +17,11 @@ In the terminal run the following command:
 kedro pipeline create data_processing
 ```

-* This will generate all the files you need to start writing a `data_processing` pipeline. This command generates a new `nodes.py` and `pipeline.py` under the `src/kedro_tutorial/pipelines/data_processing` folder.
-* The `kedro pipeline create <pipeline_name>` command is a convenience method so you don't have to worry about getting your ``__init__.py`` files in the right place, but of course you are welcome to create all the files manually.
+This generates all the files you need to start writing a `data_processing` pipeline:
+* `nodes.py` and `pipeline.py` in the `src/kedro_tutorial/pipelines/data_processing` folder for the main node functions that form your pipeline
+* `conf/base/parameters/data_processing.yml` to define the parameters used when running the pipeline
+* `src/tests/pipelines/data_processing` for tests for your pipeline
+* `__init__.py` files in the required places to ensure that the pipeline can be imported by Python

 ```bash

@@ -46,9 +49,9 @@ kedro pipeline create data_processing
    └── test_pipeline.py
 ```

-### Adding the functions to `nodes.py`
+### Add node functions

-Open `src/kedro_tutorial/pipelines/data_processing/nodes.py` and add the code below, which provides two functions (`preprocess_companies` and `preprocess_shuttles`) that each input a raw dataframe and output a dataframe containing pre-processed data:
+Open `src/kedro_tutorial/pipelines/data_processing/nodes.py` and add the code below, which provides two functions (`preprocess_companies` and `preprocess_shuttles`) that each take a raw DataFrame and output a DataFrame containing pre-processed data:

 <details>
 <summary><b>Click to expand</b></summary>
@@ -115,7 +118,7 @@ Add the following to `src/kedro_tutorial/pipelines/data_processing/pipeline.py`,

 ```python
 def create_pipeline(**kwargs) -> Pipeline:
-    return Pipeline(
+    return pipeline(
         [
             node(
                 func=preprocess_companies,
@@ -142,7 +145,7 @@ def create_pipeline(**kwargs) -> Pipeline:
 Be sure to import `node`, and your functions by adding them to the beginning of `pipeline.py`:

 ```python
-from kedro.pipeline import Pipeline, node
+from kedro.pipeline import Pipeline, node, pipeline

 from .nodes import preprocess_companies, preprocess_shuttles
 ```
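
Taken together, the fragments above correspond to a `pipeline.py` along the lines of the sketch below; the second node and the `preprocessed_*` dataset names follow the surrounding tutorial and are not shown in full in this diff:

```python
from kedro.pipeline import Pipeline, node, pipeline

from .nodes import preprocess_companies, preprocess_shuttles


def create_pipeline(**kwargs) -> Pipeline:
    # Node order is irrelevant: Kedro derives the execution graph
    # from the declared inputs and outputs.
    return pipeline(
        [
            node(
                func=preprocess_companies,
                inputs="companies",
                outputs="preprocessed_companies",
                name="preprocess_companies_node",
            ),
            node(
                func=preprocess_shuttles,
                inputs="shuttles",
                outputs="preprocessed_shuttles",
                name="preprocess_shuttles_node",
            ),
        ]
    )
```
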
@@ -208,8 +211,6 @@ kedro run
 You should see output similar to the following:

 ```bash
-kedro run
-
 2019-08-19 10:50:39,950 - root - INFO - ** Kedro project kedro-tutorial
 2019-08-19 10:50:39,957 - kedro.io.data_catalog - INFO - Loading data from `shuttles` (ExcelDataSet)...
 2019-08-19 10:50:48,521 - kedro.pipeline.node - INFO - Running node: preprocess_shuttles_node: preprocess_shuttles([shuttles]) -> [preprocessed_shuttles]
223224

224225
```
225226

226-
Running Kedro-Viz at this point renders a very simple, but valid pipeline:
227+
### Visualise the pipeline
228+
229+
Kedro-Viz at this point will render a visualisation of a very simple, but valid, pipeline. To show the visualisation, run:
227230

228231
```bash
229232
kedro viz
230233
```
231234

235+
This command should open up a visualisation in your browser that looks like the following:
236+
232237
![simple_pipeline](../meta/images/simple_pipeline.png)
233238

234239
### Persist pre-processed data
@@ -251,11 +256,11 @@ The code above declares explicitly that [pandas.ParquetDataSet](/kedro.extras.da

 The [Data Catalog](../13_resources/02_glossary.md#data-catalog) will take care of saving the datasets automatically (in this case as Parquet) to the path specified next time the pipeline is run. There is no need to change any code in your preprocessing functions to accommodate this change.

-[Apache Parquet](https://github.com/apache/parquet-format) is our recommended format for working with processed and typed data. We recommend getting your data out of CSV as soon as possible. Parquet supports things like compression, partitioning and types out of the box. Whilst you do lose the ability to view the file as text, the benefits greatly outweigh the drawbacks.
+We choose the [Apache Parquet](https://github.com/apache/parquet-format) format for working with processed and typed data. We recommend getting your data out of CSV as soon as possible. Parquet supports things like compression, partitioning and types out of the box. Whilst you lose the ability to view the file as text, the benefits greatly outweigh the drawbacks.

 ### Extend the data processing pipeline

-The next step in the tutorial is to add another node for a function to join together the three dataframes into a single model input table. First, add the `create_model_input_table()` function from the snippet below to `src/kedro_tutorial/pipelines/data_processing/nodes.py`.
+The next step in the tutorial is to add another node for a function to join together the three DataFrames into a single model input table. First, add the `create_model_input_table()` function from the snippet below to `src/kedro_tutorial/pipelines/data_processing/nodes.py`.

 <details>
 <summary><b>Click to expand</b></summary>
@@ -304,7 +309,7 @@ from .nodes import create_model_input_table, preprocess_companies, preprocess_sh
 ```


-### Persisting the model input table
+### Persist the model input table

 If you want the model input table data to be saved to file rather than used in-memory, add an entry to `conf/base/catalog.yml`:

@@ -343,47 +348,29 @@ You should see output similar to the following:
 2019-08-19 10:56:09,991 - kedro.runner.sequential_runner - INFO - Pipeline execution completed successfully.
 ```

-### Using `kedro viz --autoreload` to see how Kedro brings the pipeline together
+### Use `kedro viz --autoreload`

 Run the following command:

 ```bash
 kedro viz --autoreload
 ```

-The gif below shows how commenting out the `create_model_input_table_node` in `pipeline.py` will trigger a re-render of the pipeline:
+The `autoreload` flag will ensure that changes to your pipeline are automatically reflected in Kedro-Viz. For example, commenting out `create_model_input_table_node` in `pipeline.py` will trigger a re-render of the pipeline:

 ![autoreload](../meta/images/autoreload.gif)

 ```eval_rst
-.. note:: This is also a great time to highlight how Kedro's `topological sorting <https://en.wikipedia.org/wiki/Topological_sorting>`_ works. The actual order of the ``node()`` calls in the ``Pipeline`` object is irrelevant, Kedro works out the execution graph via the inputs/outputs declared, not the order provided by the user. This means you as a developer simply ask Kedro what data you want and it will derive the execution graph automatically.
+.. note:: This is also a great time to highlight how Kedro's `topological sorting <https://en.wikipedia.org/wiki/Topological_sorting>`_ works. The actual order of the ``node()`` calls in the ``pipeline`` is irrelevant; Kedro works out the execution graph via the inputs/outputs declared, not the order provided by the user. This means that you, as a developer, simply ask Kedro what data you want and it will derive the execution graph automatically.
 ```

 ## Data science pipeline

-We have created a modular pipeline for data processing, which merges three input datasets to create a model input table. Now we will create the data science pipeline for price prediction, which uses the [`LinearRegression`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html)
-implementation from the [scikit-learn](https://scikit-learn.org/stable/) library.
-
-### Update dependencies
-
-We now need to add `scikit-learn` to the project's dependencies. This is a slightly different process from the initial change we made early in the tutorial.
-
-To **update** the project's dependencies, you should modify `src/requirements.in` to add the following. Note that you do not need to update ``src/requirements.txt`` as you did previously in the tutorial before you built the project's requirements with ``kedro build-reqs``:
-
-```text
-scikit-learn==0.23.1
-```
-
-Then, re-run `kedro install` with a flag telling Kedro to recompile the requirements:
-
-```bash
-kedro install --build-reqs
-```
-
-You can find out more about [how to work with project dependencies](../04_kedro_project_setup/01_dependencies) in the Kedro project documentation.
+We have created a modular pipeline for data processing, which merges three input datasets to create a model input table. Now we will create the data science pipeline for price prediction, which uses the [`LinearRegression`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) implementation from the [scikit-learn](https://scikit-learn.org/stable/) library.

 ### Create the data science pipeline

+Run the following command to create the `data_science` pipeline:
 ```bash
 kedro pipeline create data_science
 ```
@@ -492,13 +479,13 @@ Versioning is enabled for `regressor`, which means that the pickled output of th
 To create a modular pipeline for the price prediction model, add the following to the top of `src/kedro_tutorial/pipelines/data_science/pipeline.py`:

 ```python
-from kedro.pipeline import Pipeline, node
+from kedro.pipeline import Pipeline, node, pipeline

 from .nodes import evaluate_model, split_data, train_model


 def create_pipeline(**kwargs) -> Pipeline:
-    return Pipeline(
+    return pipeline(
         [
             node(
                 func=split_data,
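
The topological-sorting note above is easy to verify interactively. The sketch below uses made-up node and dataset names: the downstream node is declared first, yet Kedro still orders execution from the declared inputs and outputs:

```python
from kedro.pipeline import node, pipeline


def make_table(a, b):
    return {"a": a, "b": b}


def clean(raw):
    return raw


# Deliberately list the downstream node first: order in the list does not matter.
demo = pipeline(
    [
        node(make_table, inputs=["clean_a", "clean_b"], outputs="table", name="join"),
        node(clean, inputs="raw_a", outputs="clean_a", name="clean_a_node"),
        node(clean, inputs="raw_b", outputs="clean_b", name="clean_b_node"),
    ]
)

# Pipeline.nodes returns the nodes in topological (execution) order:
# "join" comes last, after both clean_* nodes.
print([n.name for n in demo.nodes])
```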

docs/source/03_tutorial/05_visualise_pipeline.md (+2 -2)

@@ -4,7 +4,7 @@

 ## Install Kedro-Viz

-You can install Kedro-Viz by running:
+If you did not already install Kedro-Viz when you [installed the tutorial project dependencies](02_tutorial_template.md#install-dependencies) then you can do so now by running:
 ```bash
 pip install kedro-viz
 ```
@@ -132,7 +132,7 @@ def compare_shuttle_speed():

 def create_pipeline(**kwargs) -> Pipeline:
     """This is a simple pipeline which generates a plot"""
-    return Pipeline(
+    return pipeline(
         [
             node(
                 func=compare_shuttle_speed,

docs/source/03_tutorial/06_namespacing_pipelines.md → docs/source/03_tutorial/06_namespace_pipelines.md (+8 -16)

@@ -1,4 +1,4 @@
-# Namespacing pipelines
+# Namespace pipelines

 This section covers the following:

@@ -25,15 +25,15 @@ Adding namespaces to [modular pipelines](https://kedro.readthedocs.io/en/stable/
 from kedro.pipeline import Pipeline, node
 from kedro.pipeline.modular_pipeline import pipeline

-from spaceflights_tutorial.pipelines.data_processing.nodes import (
+from kedro_tutorial.pipelines.data_processing.nodes import (
     preprocess_companies,
     preprocess_shuttles,
     create_model_input_table,
 )


 def create_pipeline(**kwargs) -> Pipeline:
-    pipeline_instance = Pipeline(
+    return pipeline(
         [
             node(
                 func=preprocess_companies,
@@ -49,23 +49,15 @@ Adding namespaces to [modular pipelines](https://kedro.readthedocs.io/en/stable/
             ),
             node(
                 func=create_model_input_table,
-                inputs={
-                    "companies": "preprocessed_companies",
-                    "shuttles": "preprocessed_shuttles",
-                    "reviews": "reviews",
-                },
+                inputs=["preprocessed_shuttles", "preprocessed_companies", "reviews"],
                 outputs="model_input_table",
                 name="create_model_input_table_node",
             ),
-        ]
-    )
-    namespaced_pipeline = pipeline(
-        pipe=pipeline_instance,
+        ],
         namespace="data_processing",
         inputs=["companies", "shuttles", "reviews"],
         outputs="model_input_table",
     )
-    return namespaced_pipeline
 ```

 </details>
@@ -89,7 +81,7 @@ In this section we want to add some namespaces in the modelling component of the
 ```yaml

 model_options_experimental:
-  test_size: 0.3
+  test_size: 0.2
   random_state: 8
   features:
     - engines
@@ -130,7 +122,7 @@ In this section we want to add some namespaces in the modelling component of the


 def create_pipeline(**kwargs) -> Pipeline:
-    pipeline_instance = Pipeline(
+    pipeline_instance = pipeline(
         [
             node(
                 func=split_data,
@@ -174,7 +166,7 @@ In this section we want to add some namespaces in the modelling component of the
 Modular pipelines allow you to instantiate multiple instances of pipelines with static structure, but dynamic inputs/outputs/parameters.

 ```python
-pipeline_instance = Pipeline(...)
+pipeline_instance = pipeline(...)

 ds_pipeline_1 = pipeline(
     pipe=pipeline_instance,
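
To illustrate the pattern this page builds towards, the sketch below (with assumed names, not the tutorial's real datasets) shows how wrapping a template with `pipeline(..., namespace=...)` prefixes its free inputs and outputs, so the same structure can be instantiated more than once:

```python
from kedro.pipeline import node, pipeline


def fit(data):
    return {"model": data}


# A template pipeline with a static structure...
pipeline_instance = pipeline(
    [node(fit, inputs="model_input_table", outputs="regressor", name="fit_node")]
)

# ...instantiated twice under different namespaces, sharing the same input.
ds_pipeline_1 = pipeline(
    pipe=pipeline_instance,
    inputs="model_input_table",  # declared, so it keeps its name across instances
    namespace="active_modelling_pipeline",
)
ds_pipeline_2 = pipeline(
    pipe=pipeline_instance,
    inputs="model_input_table",
    namespace="candidate_modelling_pipeline",
)

# Undeclared outputs pick up the namespace prefix,
# e.g. {'active_modelling_pipeline.regressor'} and {'candidate_modelling_pipeline.regressor'}.
print(ds_pipeline_1.all_outputs())
print(ds_pipeline_2.all_outputs())
```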

docs/source/index.rst (+1 -1)

@@ -74,7 +74,7 @@ Welcome to Kedro's documentation!
    03_tutorial/03_set_up_data
    03_tutorial/04_create_pipelines
    03_tutorial/05_visualise_pipeline
-   03_tutorial/06_namespacing_pipelines
+   03_tutorial/06_namespace_pipelines
    03_tutorial/07_set_up_experiment_tracking
    03_tutorial/08_package_a_project

requirements.txt (+1 -1)

@@ -3,7 +3,7 @@ cachetools~=4.1
 click<8.0
 cookiecutter~=1.7.0
 dynaconf>=3.1.2,<4.0.0
-fsspec>=2021.04, <2022.01 # Upper bound set arbitrarily, to be reassessed in early 2022
+fsspec>=2021.4, <=2022.1
 gitpython~=3.0
 jmespath>=0.9.5, <1.0
 jupyter_client>=5.1, <7.0

test_requirements.txt (+2 -2)

@@ -1,5 +1,5 @@
 -r requirements.txt
-adlfs~=0.7
+adlfs>=2021.7.1, <=2022.2
 bandit>=1.6.2, <2.0
 behave==1.2.6
 biopython~=1.73
@@ -11,7 +11,7 @@ dask[complete]~=2.6; python_version == '3.6'
 delta-spark~=1.0
 dill~=0.3.1
 filelock>=3.4.0, <4.0
-gcsfs>=2021.04, <2022.01 # Upper bound set arbitrarily, to be reassessed in early 2022
+gcsfs>=2021.4, <=2022.1
 geopandas>=0.6.0, <1.0
 hdfs>=2.5.8, <3.0
 holoviews~=1.13.0
