Commit 4470623

Merge branch 'main' into deepyaman-patch-3
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
2 parents 4aeaa1f + 035f463

9 files changed: +56 -84 lines

docs/source/03_tutorial/01_spaceflights_tutorial.md (+1 -1)

@@ -21,7 +21,7 @@ When building a Kedro project, you will typically follow a standard development
 ### 1. Set up the project template

 * Create a new project with `kedro new`
-* Install project dependencies with `kedro install`
+* Install project dependencies with `pip install`
 * Configure the following in the `conf` folder:
   * Logging
   * Credentials

docs/source/03_tutorial/02_tutorial_template.md (+12 -23)

@@ -14,19 +14,11 @@ Navigate to your chosen working directory and run the following to [create a new
 kedro new
 ```

-When prompted for a project name, enter `Kedro Tutorial`. Subsequently, accept the default suggestions for `repo_name` and `python_package` by pressing enter.
+When prompted for a project name, enter `Kedro Tutorial`. Subsequently, accept the default suggestions for `repo_name` and `python_package` by pressing enter. Then navigate to the root directory of the project, `kedro-tutorial`.

-## Install project dependencies with `kedro install`
+## Install dependencies

-To install the project-specific dependencies, navigate to the root directory of the project and run:
-
-```bash
-kedro install
-```
-
-### More about project dependencies
-
-Up to this point, we haven't discussed project dependencies, so now is a good time to examine them. We use Kedro to specify a project's dependencies and make it easier for others to run your project. It avoids version conflicts because Kedro ensures that you use same Python packages and versions.
+Up to this point, we haven't discussed project dependencies, so now is a good time to examine them. We use a `requirements.txt` file to specify a project's dependencies and make it easier for others to run your project. This avoids version conflicts by ensuring that you use the same Python packages and versions.

 The generic project template bundles some typical dependencies in `src/requirements.txt`. Here's a typical example, although you may find that the version numbers are slightly different depending on the version of Kedro that you are using:

@@ -50,28 +42,25 @@ wheel>=0.35, <0.37 # The reference implementation of the Python wheel packaging
 .. note:: If your project has ``conda`` dependencies, you can create a ``src/environment.yml`` file and list them there.
 ```

-### Add and remove project-specific dependencies
-
-The dependencies above may be sufficient for some projects, but for the spaceflights project, you need to add some extra requirements.
+The dependencies above may be sufficient for some projects, but for this tutorial you need to add some extra requirements. These will enable us to work with different data formats (including CSV, Excel and Parquet) and to visualise the pipeline.

-* In this tutorial, we work with different data formats including CSV, Excel and Parquet and want to visualise our pipeline so we will need to provide extra dependencies.
-* By running `kedro install` on a blank template we generate a new file at `src/requirements.in`. You can read more about the benefits of compiling dependencies [here](../04_kedro_project_setup/01_dependencies.md)
-* The most important point to learn here is that you should edit the `requirements.in` file going forward.
-
-Add the following requirements to your `src/requirements.in` lock file:
+Edit your `src/requirements.txt` file to include the following lines:

 ```text
 kedro[pandas.CSVDataSet, pandas.ExcelDataSet, pandas.ParquetDataSet]==0.17.6 # Specify optional Kedro dependencies
-kedro-viz==4.1.1 # Visualise your pipelines
-openpyxl==3.0.9 # Use modern Excel engine (will not be required in 0.18.0)
+kedro-viz~=4.0 # Visualise your pipelines
+openpyxl>=3.0.6, <4.0 # Use modern Excel engine (will not be required in 0.18.0)
+scikit-learn~=1.0 # For modelling in the data science pipeline
 ```

-Then run the following command to re-compile your updated dependencies and install them into your environment:
+To install all the project-specific dependencies, navigate to the root directory of the project and run:

 ```bash
-kedro install --build-reqs
+pip install -r src/requirements.txt
 ```

+You can find out more about [how to work with project dependencies](../04_kedro_project_setup/01_dependencies.md) in the Kedro project documentation.
+
 ## Configure the project

 You may optionally add in any credentials to `conf/local/credentials.yml` that you would need to load specific data sources like usernames and passwords. Some examples are given within the file to illustrate how you store credentials. Additional information can be found in the [advanced documentation on configuration](../04_kedro_project_setup/02_configuration.md).
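
A minimal sanity check, assuming the requirements above have already been installed with `pip install -r src/requirements.txt`; the distribution names below simply mirror the pins added in this change:

```python
# Minimal sketch: confirm the pinned distributions from src/requirements.txt are
# installed in the active environment (importlib.metadata requires Python 3.8+).
from importlib.metadata import version

# The kedro[pandas.*] extras make the pandas-backed dataset classes importable.
from kedro.extras.datasets.pandas import CSVDataSet, ExcelDataSet, ParquetDataSet  # noqa: F401

for dist in ("kedro", "kedro-viz", "openpyxl", "scikit-learn"):
    print(dist, version(dist))
```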

docs/source/03_tutorial/03_set_up_data.md (+5 -1)

@@ -107,6 +107,10 @@ companies = catalog.load("companies")
 companies.head()
 ```

+```eval_rst
+.. note:: If this is the first ``kedro`` command you have executed in the project, you will be asked whether you wish to opt into `usage analytics <https://github.com/quantumblacklabs/kedro-telemetry>`_. Your decision is recorded in the ``.telemetry`` file so that subsequent calls to ``kedro`` in this project do not ask you again.
+```
+
 The command loads the dataset named `companies` (as per top-level key in `catalog.yml`) from the underlying filepath `data/01_raw/companies.csv` into the variable `companies`, which is of type `pandas.DataFrame`. The `head` method from `pandas` then displays the first five rows of the DataFrame.

 When you have finished, close the `ipython` session as follows:
@@ -129,7 +133,7 @@ shuttles:

 ```eval_rst
 .. note::
-    The ``load_args`` are passed to the ``pd.read_excel`` method as `keyword arguments <https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html>`_, conversely providing ``save_args`` would be passed to the ``pd.DataFrame.to_excel`` `method <https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html>`_.
+    The ``load_args`` are passed to the ``pd.read_excel`` method as `keyword arguments <https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html>`_; although not specified here, ``save_args`` would be passed to the ``pd.DataFrame.to_excel`` `method <https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_excel.html>`_.
 ```

 To test that everything works as expected, load the dataset within a _new_ `kedro ipython` session and display its first five rows:
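
As an illustration of the note above, here is a rough sketch of what a catalog entry with `load_args` does under the hood; the file path and the `openpyxl` engine are assumptions for the example rather than values taken from this diff:

```python
# Illustrative sketch only: load_args declared in catalog.yml are forwarded to
# pd.read_excel as keyword arguments when the dataset is loaded.
from kedro.extras.datasets.pandas import ExcelDataSet

shuttles_dataset = ExcelDataSet(
    filepath="data/01_raw/shuttles.xlsx",  # assumed path
    load_args={"engine": "openpyxl"},      # becomes pd.read_excel(..., engine="openpyxl")
)
shuttles = shuttles_dataset.load()
print(shuttles.head())
```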

docs/source/03_tutorial/04_create_pipelines.md (+24 -37)

@@ -17,8 +17,11 @@ In the terminal run the following command:
 kedro pipeline create data_processing
 ```

-* This will generate all the files you need to start writing a `data_processing` pipeline. This command generates a new `nodes.py` and `pipeline.py` under the `src/kedro_tutorial/pipelines/data_processing` folder.
-* The `kedro pipeline create <pipeline_name>` command is a convenience method so you don't have to worry about getting your ``__init__.py`` files in the right place, but of course you are welcome to create all the files manually.
+This generates all the files you need to start writing a `data_processing` pipeline:
+* `nodes.py` and `pipeline.py` in the `src/kedro_tutorial/pipelines/data_processing` folder for the main node functions that form your pipeline
+* `conf/base/parameters/data_processing.yml` to define the parameters used when running the pipeline
+* `src/tests/pipelines/data_processing` for tests for your pipeline
+* `__init__.py` files in the required places to ensure that the pipeline can be imported by Python

 ```bash

@@ -46,9 +49,9 @@ kedro pipeline create data_processing
    └── test_pipeline.py
 ```

-### Adding the functions to `nodes.py`
+### Add node functions

-Open `src/kedro_tutorial/pipelines/data_processing/nodes.py` and add the code below, which provides two functions (`preprocess_companies` and `preprocess_shuttles`) that each input a raw dataframe and output a dataframe containing pre-processed data:
+Open `src/kedro_tutorial/pipelines/data_processing/nodes.py` and add the code below, which provides two functions (`preprocess_companies` and `preprocess_shuttles`) that each take a raw DataFrame and output a DataFrame containing pre-processed data:

 <details>
 <summary><b>Click to expand</b></summary>
@@ -115,7 +118,7 @@ Add the following to `src/kedro_tutorial/pipelines/data_processing/pipeline.py`,

 ```python
 def create_pipeline(**kwargs) -> Pipeline:
-    return Pipeline(
+    return pipeline(
         [
             node(
                 func=preprocess_companies,
@@ -142,7 +145,7 @@ def create_pipeline(**kwargs) -> Pipeline:
 Be sure to import `node`, and your functions by adding them to the beginning of `pipeline.py`:

 ```python
-from kedro.pipeline import Pipeline, node
+from kedro.pipeline import Pipeline, node, pipeline

 from .nodes import preprocess_companies, preprocess_shuttles
 ```
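
Taken together, the fragments above correspond to a `pipeline.py` along the lines of the sketch below; the second node and the `preprocessed_*` dataset names follow the surrounding tutorial and are not shown in full in this diff:

```python
from kedro.pipeline import Pipeline, node, pipeline

from .nodes import preprocess_companies, preprocess_shuttles


def create_pipeline(**kwargs) -> Pipeline:
    # Node order is irrelevant: Kedro derives the execution graph
    # from the declared inputs and outputs.
    return pipeline(
        [
            node(
                func=preprocess_companies,
                inputs="companies",
                outputs="preprocessed_companies",
                name="preprocess_companies_node",
            ),
            node(
                func=preprocess_shuttles,
                inputs="shuttles",
                outputs="preprocessed_shuttles",
                name="preprocess_shuttles_node",
            ),
        ]
    )
```
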
@@ -208,8 +211,6 @@ kedro run
 You should see output similar to the following:

 ```bash
-kedro run
-
 2019-08-19 10:50:39,950 - root - INFO - ** Kedro project kedro-tutorial
 2019-08-19 10:50:39,957 - kedro.io.data_catalog - INFO - Loading data from `shuttles` (ExcelDataSet)...
 2019-08-19 10:50:48,521 - kedro.pipeline.node - INFO - Running node: preprocess_shuttles_node: preprocess_shuttles([shuttles]) -> [preprocessed_shuttles]
223224

224225
```
225226

226-
Running Kedro-Viz at this point renders a very simple, but valid pipeline:
227+
### Visualise the pipeline
228+
229+
Kedro-Viz at this point will render a visualisation of a very simple, but valid, pipeline. To show the visualisation, run:
227230

228231
```bash
229232
kedro viz
230233
```
231234

235+
This command should open up a visualisation in your browser that looks like the following:
236+
232237
![simple_pipeline](../meta/images/simple_pipeline.png)
233238

234239
### Persist pre-processed data
@@ -251,11 +256,11 @@ The code above declares explicitly that [pandas.ParquetDataSet](/kedro.extras.da

 The [Data Catalog](../13_resources/02_glossary.md#data-catalog) will take care of saving the datasets automatically (in this case as Parquet) to the path specified next time the pipeline is run. There is no need to change any code in your preprocessing functions to accommodate this change.

-[Apache Parquet](https://github.com/apache/parquet-format) is our recommended format for working with processed and typed data. We recommend getting your data out of CSV as soon as possible. Parquet supports things like compression, partitioning and types out of the box. Whilst you do lose the ability to view the file as text, the benefits greatly outweigh the drawbacks.
+We choose the [Apache Parquet](https://github.com/apache/parquet-format) format for working with processed and typed data. We recommend getting your data out of CSV as soon as possible. Parquet supports things like compression, partitioning and types out of the box. Whilst you lose the ability to view the file as text, the benefits greatly outweigh the drawbacks.

 ### Extend the data processing pipeline

-The next step in the tutorial is to add another node for a function to join together the three dataframes into a single model input table. First, add the `create_model_input_table()` function from the snippet below to `src/kedro_tutorial/pipelines/data_processing/nodes.py`.
+The next step in the tutorial is to add another node for a function to join together the three DataFrames into a single model input table. First, add the `create_model_input_table()` function from the snippet below to `src/kedro_tutorial/pipelines/data_processing/nodes.py`.

 <details>
 <summary><b>Click to expand</b></summary>
@@ -304,7 +309,7 @@ from .nodes import create_model_input_table, preprocess_companies, preprocess_sh
 ```


-### Persisting the model input table
+### Persist the model input table

 If you want the model input table data to be saved to file rather than used in-memory, add an entry to `conf/base/catalog.yml`:

@@ -343,47 +348,29 @@ You should see output similar to the following:
 2019-08-19 10:56:09,991 - kedro.runner.sequential_runner - INFO - Pipeline execution completed successfully.
 ```

-### Using `kedro viz --autoreload` to see how Kedro brings the pipeline together
+### Use `kedro viz --autoreload`

 Run the following command:

 ```bash
 kedro viz --autoreload
 ```

-The gif below shows how commenting out the `create_model_input_table_node` in `pipeline.py` will trigger a re-render of the pipeline:
+The `autoreload` flag will ensure that changes to your pipeline are automatically reflected in Kedro-Viz. For example, commenting out `create_model_input_table_node` in `pipeline.py` will trigger a re-render of the pipeline:

 ![autoreload](../meta/images/autoreload.gif)

 ```eval_rst
-.. note:: This is also a great time to highlight how Kedro's `topological sorting <https://en.wikipedia.org/wiki/Topological_sorting>`_ works. The actual order of the ``node()`` calls in the ``Pipeline`` object is irrelevant, Kedro works out the execution graph via the inputs/outputs declared, not the order provided by the user. This means you as a developer simply ask Kedro what data you want and it will derive the execution graph automatically.
+.. note:: This is also a great time to highlight how Kedro's `topological sorting <https://en.wikipedia.org/wiki/Topological_sorting>`_ works. The actual order of the ``node()`` calls in the ``pipeline`` is irrelevant; Kedro works out the execution graph via the inputs/outputs declared, not the order provided by the user. This means that you, as a developer, simply ask Kedro what data you want and it will derive the execution graph automatically.
 ```

 ## Data science pipeline

-We have created a modular pipeline for data processing, which merges three input datasets to create a model input table. Now we will create the data science pipeline for price prediction, which uses the [`LinearRegression`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html)
-implementation from the [scikit-learn](https://scikit-learn.org/stable/) library.
-
-### Update dependencies
-
-We now need to add `scikit-learn` to the project's dependencies. This is a slightly different process from the initial change we made early in the tutorial.
-
-To **update** the project's dependencies, you should modify `src/requirements.in` to add the following. Note that you do not need to update ``src/requirements.txt`` as you did previously in the tutorial before you built the project's requirements with ``kedro build-reqs``:
-
-```text
-scikit-learn==0.23.1
-```
-
-Then, re-run `kedro install` with a flag telling Kedro to recompile the requirements:
-
-```bash
-kedro install --build-reqs
-```
-
-You can find out more about [how to work with project dependencies](../04_kedro_project_setup/01_dependencies) in the Kedro project documentation.
+We have created a modular pipeline for data processing, which merges three input datasets to create a model input table. Now we will create the data science pipeline for price prediction, which uses the [`LinearRegression`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) implementation from the [scikit-learn](https://scikit-learn.org/stable/) library.

 ### Create the data science pipeline

+Run the following command to create the `data_science` pipeline:
 ```bash
 kedro pipeline create data_science
 ```
@@ -492,13 +479,13 @@ Versioning is enabled for `regressor`, which means that the pickled output of th
 To create a modular pipeline for the price prediction model, add the following to the top of `src/kedro_tutorial/pipelines/data_science/pipeline.py`:

 ```python
-from kedro.pipeline import Pipeline, node
+from kedro.pipeline import Pipeline, node, pipeline

 from .nodes import evaluate_model, split_data, train_model


 def create_pipeline(**kwargs) -> Pipeline:
-    return Pipeline(
+    return pipeline(
         [
             node(
                 func=split_data,
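
The topological-sorting note above is easy to verify interactively. The sketch below uses made-up node and dataset names: the downstream node is declared first, yet Kedro still orders execution from the declared inputs and outputs:

```python
from kedro.pipeline import node, pipeline


def make_table(a, b):
    return {"a": a, "b": b}


def clean(raw):
    return raw


# Deliberately list the downstream node first: order in the list does not matter.
demo = pipeline(
    [
        node(make_table, inputs=["clean_a", "clean_b"], outputs="table", name="join"),
        node(clean, inputs="raw_a", outputs="clean_a", name="clean_a_node"),
        node(clean, inputs="raw_b", outputs="clean_b", name="clean_b_node"),
    ]
)

# Pipeline.nodes returns the nodes in topological (execution) order:
# "join" comes last, after both clean_* nodes.
print([n.name for n in demo.nodes])
```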

docs/source/03_tutorial/05_visualise_pipeline.md (+2 -2)

@@ -4,7 +4,7 @@

 ## Install Kedro-Viz

-You can install Kedro-Viz by running:
+If you did not already install Kedro-Viz when you [installed the tutorial project dependencies](02_tutorial_template.md#install-dependencies) then you can do so now by running:
 ```bash
 pip install kedro-viz
 ```
@@ -132,7 +132,7 @@ def compare_shuttle_speed():

 def create_pipeline(**kwargs) -> Pipeline:
     """This is a simple pipeline which generates a plot"""
-    return Pipeline(
+    return pipeline(
         [
             node(
                 func=compare_shuttle_speed,

docs/source/03_tutorial/06_namespacing_pipelines.md → docs/source/03_tutorial/06_namespace_pipelines.md (+8 -16)

@@ -1,4 +1,4 @@
-# Namespacing pipelines
+# Namespace pipelines

 This section covers the following:

@@ -25,15 +25,15 @@ Adding namespaces to [modular pipelines](https://kedro.readthedocs.io/en/stable/
 from kedro.pipeline import Pipeline, node
 from kedro.pipeline.modular_pipeline import pipeline

-from spaceflights_tutorial.pipelines.data_processing.nodes import (
+from kedro_tutorial.pipelines.data_processing.nodes import (
     preprocess_companies,
     preprocess_shuttles,
     create_model_input_table,
 )


 def create_pipeline(**kwargs) -> Pipeline:
-    pipeline_instance = Pipeline(
+    return pipeline(
         [
             node(
                 func=preprocess_companies,
@@ -49,23 +49,15 @@ Adding namespaces to [modular pipelines](https://kedro.readthedocs.io/en/stable/
             ),
             node(
                 func=create_model_input_table,
-                inputs={
-                    "companies": "preprocessed_companies",
-                    "shuttles": "preprocessed_shuttles",
-                    "reviews": "reviews",
-                },
+                inputs=["preprocessed_shuttles", "preprocessed_companies", "reviews"],
                 outputs="model_input_table",
                 name="create_model_input_table_node",
             ),
-        ]
-    )
-    namespaced_pipeline = pipeline(
-        pipe=pipeline_instance,
+        ],
         namespace="data_processing",
         inputs=["companies", "shuttles", "reviews"],
         outputs="model_input_table",
     )
-    return namespaced_pipeline
 ```

 </details>
@@ -89,7 +81,7 @@ In this section we want to add some namespaces in the modelling component of the
 ```yaml

 model_options_experimental:
-  test_size: 0.3
+  test_size: 0.2
   random_state: 8
   features:
     - engines
@@ -130,7 +122,7 @@ In this section we want to add some namespaces in the modelling component of the


 def create_pipeline(**kwargs) -> Pipeline:
-    pipeline_instance = Pipeline(
+    pipeline_instance = pipeline(
         [
             node(
                 func=split_data,
@@ -174,7 +166,7 @@ In this section we want to add some namespaces in the modelling component of the
 Modular pipelines allow you to instantiate multiple instances of pipelines with static structure, but dynamic inputs/outputs/parameters.

 ```python
-pipeline_instance = Pipeline(...)
+pipeline_instance = pipeline(...)

 ds_pipeline_1 = pipeline(
     pipe=pipeline_instance,
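
To illustrate the pattern this page builds towards, the sketch below (with assumed names, not the tutorial's real datasets) shows how wrapping a template with `pipeline(..., namespace=...)` prefixes its free inputs and outputs, so the same structure can be instantiated more than once:

```python
from kedro.pipeline import node, pipeline


def fit(data):
    return {"model": data}


# A template pipeline with a static structure...
pipeline_instance = pipeline(
    [node(fit, inputs="model_input_table", outputs="regressor", name="fit_node")]
)

# ...instantiated twice under different namespaces, sharing the same input.
ds_pipeline_1 = pipeline(
    pipe=pipeline_instance,
    inputs="model_input_table",  # declared, so it keeps its name across instances
    namespace="active_modelling_pipeline",
)
ds_pipeline_2 = pipeline(
    pipe=pipeline_instance,
    inputs="model_input_table",
    namespace="candidate_modelling_pipeline",
)

# Undeclared outputs pick up the namespace prefix,
# e.g. {'active_modelling_pipeline.regressor'} and {'candidate_modelling_pipeline.regressor'}.
print(ds_pipeline_1.all_outputs())
print(ds_pipeline_2.all_outputs())
```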

docs/source/index.rst (+1 -1)

@@ -74,7 +74,7 @@ Welcome to Kedro's documentation!
    03_tutorial/03_set_up_data
    03_tutorial/04_create_pipelines
    03_tutorial/05_visualise_pipeline
-   03_tutorial/06_namespacing_pipelines
+   03_tutorial/06_namespace_pipelines
    03_tutorial/07_set_up_experiment_tracking
    03_tutorial/08_package_a_project

requirements.txt (+1 -1)

@@ -3,7 +3,7 @@ cachetools~=4.1
 click<8.0
 cookiecutter~=1.7.0
 dynaconf>=3.1.2,<4.0.0
-fsspec>=2021.04, <2022.01 # Upper bound set arbitrarily, to be reassessed in early 2022
+fsspec>=2021.4, <=2022.1
 gitpython~=3.0
 jmespath>=0.9.5, <1.0
 jupyter_client>=5.1, <7.0

test_requirements.txt (+2 -2)

@@ -1,5 +1,5 @@
 -r requirements.txt
-adlfs~=0.7
+adlfs>=2021.7.1, <=2022.2
 bandit>=1.6.2, <2.0
 behave==1.2.6
 biopython~=1.73
@@ -11,7 +11,7 @@ dask[complete]~=2.6; python_version == '3.6'
 delta-spark~=1.0
 dill~=0.3.1
 filelock>=3.4.0, <4.0
-gcsfs>=2021.04, <2022.01 # Upper bound set arbitrarily, to be reassessed in early 2022
+gcsfs>=2021.4, <=2022.1
 geopandas>=0.6.0, <1.0
 hdfs>=2.5.8, <3.0
 holoviews~=1.13.0
