
Commit 3de5710

deepyaman authored and lvijnck committed
Document distribution of Kedro pipelines with Dask (kedro-org#1248)
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
1 parent 8fc667e commit 3de5710

10 files changed: +342 -19 lines changed

RELEASE.md

+4 -5

@@ -1,11 +1,11 @@
 # Release 0.17.8

 ## Major features and improvements
+* Documented distribution of Kedro pipelines with Dask.

-* Added option to `SparkDataSet` to specify a `schema` load argument that allows for supplying a user-defined schema as opposed to relying on the schema inference of Spark.
+## Bug fixes and other changes

-## Thanks for supporting contributions
-[Laurens Vijnck](https://github.com/lvijnck)
+## Upcoming deprecations for Kedro 0.18.0

 # Release 0.17.7

@@ -24,7 +24,6 @@
 * Added `astro-iris` as alias for `astro-airflow-iris`, so that old tutorials can still be followed.
 * Added details about [Kedro's Technical Steering Committee and governance model](https://kedro.readthedocs.io/en/0.17.7/14_contribution/technical_steering_committee.html).

-
 ## Upcoming deprecations for Kedro 0.18.0
 * `kedro pipeline pull` and `kedro pipeline package` will be deprecated. Please use `kedro micropkg` instead.

@@ -415,7 +414,7 @@ Check your source directory. If you defined a different source directory (`sourc

 ## Major features and improvements

-* Added documentation with a focus on single machine and distributed environment deployment; the series includes Docker, Argo, Prefect, Kubeflow, AWS Batch, AWS Sagemaker and extends our section on Databricks
+* Added documentation with a focus on single machine and distributed environment deployment; the series includes Docker, Argo, Prefect, Kubeflow, AWS Batch, AWS Sagemaker and extends our section on Databricks.
 * Added [kedro-starter-spaceflights](https://github.com/kedro-org/kedro-starter-spaceflights/) alias for generating a project: `kedro new --starter spaceflights`.

 ## Bug fixes and other changes

docs/conf.py

+1

@@ -192,6 +192,7 @@
 # some of these complain that the sections don't exist (which is not true),
 # too many requests, or forbidden URL
 linkcheck_ignore = [
+    "http://127.0.0.1:8787/status", # Dask's diagnostics dashboard
     "https://datacamp.com/community/tutorials/docstrings-python", # "forbidden" url
     "https://github.com/argoproj/argo/blob/master/README.md#quickstart",
     "https://console.aws.amazon.com/batch/home#/jobs",

docs/source/03_tutorial/05_visualise_pipeline.md

+2 -3

@@ -16,8 +16,7 @@ You should be in your project root directory, and once Kedro-Viz is installed yo
 kedro viz
 ```

-This command will run a server on http://127.0.0.1:4141 that will open up your visualisation on a browser. You should
-be able to see the following:
+This command will run a server on http://127.0.0.1:4141 that will open up your visualisation on a browser. You should be able to see the following:

 ![](../meta/images/pipeline_visualisation.png)

@@ -113,7 +112,7 @@ We have also used the Plotly integration to allow users to [visualise metrics fr

 You need to update requirements.txt in your Kedro project and add the following datasets to enable plotly for your project.

-`kedro[plotly.PlotlyDataSet, plotly.JSONDataSet]==0.17.7`
+`kedro[plotly.PlotlyDataSet, plotly.JSONDataSet]==0.17.7`


 You can view Plotly charts in Kedro-Viz when you use Kedro's plotly datasets.

docs/source/10_deployment/01_deployment_guide.md

+2 -1

@@ -15,9 +15,10 @@ We also provide information to help you deploy to the following:
 * to [Kubeflow Workflows](06_kubeflow.md)
 * to [AWS Batch](07_aws_batch.md)
 * to [Databricks](08_databricks.md)
+* to [Dask](dask.md)

 <!--- There has to be some non-link text in the bullets above, if it's just links, there's a Sphinx bug that fails the build process-->

 In addition, we also provide instructions on [how to integrate a Kedro project with Amazon SageMaker](09_aws_sagemaker.md).

-![](../meta/images/deployments.png)
+![](../meta/images/deployments.png) <!-- TODO(deepyaman): Add Dask to deployment flowchart. -->
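
The `dask.md` page linked above is added elsewhere in this commit and is not shown in this view. As a rough, hedged illustration of the general pattern such a guide covers, distributing work across a cluster with `dask.distributed`, the sketch below submits a node's underlying Python function to a Dask scheduler. The scheduler address and the `clean_names` function are placeholders, not taken from this commit.

```python
# Hypothetical sketch only: submit the plain Python function behind a Kedro
# node to a running Dask cluster. Assumes a scheduler at the given address
# and that `dask.distributed` is installed.
from dask.distributed import Client


def clean_names(names):
    # Stand-in for the kind of function a Kedro node would wrap.
    return [name.strip().lower() for name in names]


client = Client("tcp://127.0.0.1:8786")  # address of your Dask scheduler
future = client.submit(clean_names, ["  Alpha ", "Beta"])
print(future.result())  # ['alpha', 'beta'] once a worker completes the task
client.close()
```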

docs/source/10_deployment/04_argo.md

+1 -1

@@ -1,6 +1,6 @@
 # Deployment with Argo Workflows

-This page explains how to convert your Kedro pipeline to use [Argo Workflows](https://github.com/argoproj/argo-workflows), an open source container-native workflow engine for orchestrating parallel jobs on [Kubernetes](https://kubernetes.io/).
+This page explains how to convert your Kedro pipeline to use [Argo Workflows](https://github.com/argoproj/argo-workflows), an open-source container-native workflow engine for orchestrating parallel jobs on [Kubernetes](https://kubernetes.io/).

 ## Why would you use Argo Workflows?


docs/source/10_deployment/05_prefect.md

+2 -2

@@ -1,8 +1,8 @@
 # Deployment with Prefect

-This page explains how to run your Kedro pipeline using [Prefect Core](https://www.prefect.io/products/core/), an open source workflow management system.
+This page explains how to run your Kedro pipeline using [Prefect Core](https://www.prefect.io/products/core/), an open-source workflow management system.

-In scope of this deployment we are interested in [Prefect Server](https://docs.prefect.io/orchestration/server/overview.html#what-is-prefect-server) which is an open-source backend that makes it easy to monitor and execute your Prefect flows and automatically extends the Prefect Core.
+In scope of this deployment, we are interested in [Prefect Server](https://docs.prefect.io/orchestration/server/overview.html#what-is-prefect-server), an open-source backend that makes it easy to monitor and execute your Prefect flows and automatically extends the Prefect Core.

 ```eval_rst
 .. note:: Prefect Server ships out-of-the-box with a fully featured user interface.

docs/source/10_deployment/07_aws_batch.md

+11 -7

@@ -118,12 +118,14 @@ Now that all the resources are in place, it's time to submit jobs to Batch progr

 #### Create a custom runner

-Create a new Python package `runner` in your `src` folder, i.e. `kedro_tutorial/src/kedro_tutorial/runner/`. Make sure there is an `__init__.py` file at this location and add another file named `batch_runner.py`, which will contain the implementation of your custom runner, `AWSBatchRunner`. The `AWSBatchRunner` will submit and monitor jobs asynchronously, surfacing any errors that occur on Batch.
+Create a new Python package `runner` in your `src` folder, i.e. `kedro_tutorial/src/kedro_tutorial/runner/`. Make sure there is an `__init__.py` file at this location, and add another file named `batch_runner.py`, which will contain the implementation of your custom runner, `AWSBatchRunner`. The `AWSBatchRunner` will submit and monitor jobs asynchronously, surfacing any errors that occur on Batch.

-Make sure the `__init__.py` file in the `runner` folder includes the following import:
+Make sure the `__init__.py` file in the `runner` folder includes the following import and declaration:

 ```python
-from .batch_runner import AWSBatchRunner # NOQA
+from .batch_runner import AWSBatchRunner
+
+__all__ = ["AWSBatchRunner"]
 ```

 Copy the contents of the script below into `batch_runner.py`:

@@ -286,13 +288,13 @@ def _track_batch_job(job_id: str, client: Any) -> None:

 #### Set up Batch-related configuration

-You'll need to set the Batch-related configuration that the runner will use. Add a `parameters.yml` file inside the `conf/aws_batch/` directory created as part of the prerequisites steps, which will include the following keys:
+You'll need to set the Batch-related configuration that the runner will use. Add a `parameters.yml` file inside the `conf/aws_batch/` directory created as part of the prerequisites with the following keys:

 ```yaml
 aws_batch:
-  job_queue: "spaceflights_queue"
-  job_definition: "kedro_run"
-  max_workers: 2
+  job_queue: "spaceflights_queue"
+  job_definition: "kedro_run"
+  max_workers: 2
 ```

 #### Update CLI implementation

@@ -315,6 +317,7 @@ def run(tag, env, parallel, ...):
     node_names = _get_values_as_tuple(node_names) if node_names else node_names

     with KedroSession.create(env=env, extra_params=params) as session:
+        context = session.load_context()
         runner_instance = _instantiate_runner(runner, is_async, context)
         session.run(
             tags=tag,

@@ -323,6 +326,7 @@ def run(tag, env, parallel, ...):
             from_nodes=from_nodes,
             to_nodes=to_nodes,
             from_inputs=from_inputs,
+            to_outputs=to_outputs,
             load_versions=load_version,
             pipeline_name=pipeline,
         )
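
`_instantiate_runner` is referenced by the hunk above but defined elsewhere in the AWS Batch guide, so it does not appear in this diff. As a hedged sketch of what such a factory might do with the `aws_batch` block from `conf/aws_batch/parameters.yml` (the names and structure below are assumptions, not taken from this commit):

```python
# Hypothetical sketch of the runner factory; the version in the guide may differ.
# `project_context.params` holds the merged Kedro parameters, so `aws_batch`
# here is the block defined in conf/aws_batch/parameters.yml.
from kedro.utils import load_obj


def _instantiate_runner(runner, is_async, project_context):
    runner = runner or "SequentialRunner"
    runner_class = load_obj(runner, "kedro.runner")  # e.g. "kedro_tutorial.runner.AWSBatchRunner"
    runner_kwargs = dict(is_async=is_async)

    if runner.endswith("AWSBatchRunner"):
        batch_kwargs = project_context.params.get("aws_batch") or {}
        runner_kwargs.update(batch_kwargs)

    return runner_class(**runner_kwargs)
```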
