Document distribution of Kedro pipelines with Dask #1248

deepyaman · 2022-02-12T19:41:46Z

Description

This PR updates #1131, to which I no longer have write access. It can either be merged directly or be merged into that PR, which can then be merged.

Development notes

Checklist

Read the contributing guidelines
Opened this PR as a 'Draft Pull Request' if it is work-in-progress
Updated the documentation to reflect the code changes
Added a description of this change in the RELEASE.md file
Added tests to cover my changes

RELEASE.md

docs/source/03_tutorial/05_visualise_pipeline.md

docs/source/index.rst

docs/source/10_deployment/12_dask.md

lorenabalan · 2022-02-24T11:35:06Z

LGTM, thanks for tidying up some other parts of the docs while here! 🙌

docs/source/10_deployment/dask.md

antonymilne

This is really awesome stuff, thank you very much @deepyaman.

Very happy for this to be merged - the questions I ask here and in the inline comments are just to satisfy my curiosity or contribute to more general ponderings.

I know this follows the same pattern as the AWS batch runner, which is fine, but there are a couple of things that definitely seem hacky about it (as they do for the AWS batch runner). I don't immediately see any better way of doing it now, but I wonder what it would take on the kedro side to make this sort of extension more elegant in future.

1. Injecting custom arguments into DaskRunner instantiation via conf/dask/parameters.yml. This is very cunning, but doesn't seem ideal because it's incompatible with running any other run configuration environment (unless you also modify the ConfigLoader to allow for multiple environments to load on top of each other).
2. def run is essentially the same as the default one; the only difference (unless I'm missing something?) is that it enables you to pass custom arguments into the runner instantiation.

I wonder if there's some other way of configuring our run command in order to make it easier to do this sort of thing. This issue seems very relevant, and I'd be interested if you had any thoughts to add to it based on what you've seen here: #1041

Naively it seems like we could expose RUNNER_CLASS and RUNNER_CLASS_ARGS in settings.py to enable this sort of thing more directly. But given that the runner is arguably runtime configuration (that belongs in conf) rather than application settings (that belongs in settings.py) that probably doesn't make sense. Sooo I don't know how we can support custom runner + arguments a bit more naturally.

Edit: just realised that point 1 actually doesn't prevent us from running whichever run environment we want to use, because I could just put the dask runner config in some already existing conf/env/parameters_dask.yml file, right? Rather than needing to create a whole new environment for it.
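The pattern discussed above (reading runner keyword arguments from configuration and passing them to the runner's constructor) can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `load_obj` and `make_runner` are hypothetical helpers written here with only the standard library, the `dask_client` key and the `client_args` keyword are assumptions about how the parameters might be laid out, and `types.SimpleNamespace` is only a stand-in for a real runner class such as `kedro_tutorial.runner.DaskRunner`.

```python
from importlib import import_module
from typing import Any, Dict


def load_obj(dotted_path: str) -> Any:
    """Import and return the object named by a dotted path such as 'pkg.mod.Class'."""
    module_path, _, name = dotted_path.rpartition(".")
    return getattr(import_module(module_path), name)


def make_runner(runner_path: str, runner_kwargs: Dict[str, Any]) -> Any:
    """Instantiate the runner class at `runner_path` with keyword arguments."""
    runner_class = load_obj(runner_path)
    return runner_class(**runner_kwargs)


# Stand-in for parameters loaded from a file such as conf/dask/parameters.yml
# (the key name `dask_client` is an assumption made for this sketch).
params = {"dask_client": {"address": "127.0.0.1:8786"}}

# `types.SimpleNamespace` stands in for a runner class; a real project would
# pass the dotted path given to `kedro run --runner=...` instead.
runner = make_runner(
    "types.SimpleNamespace",
    {"client_args": params.get("dask_client") or {}},
)
print(runner.client_args)
```

Because the keyword arguments come from ordinary parameters, the same lookup works whether they live in a dedicated environment or in an extra parameters file inside an existing one, which is the point made in the edit above.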

docs/source/10_deployment/01_deployment_guide.md

antonymilne · 2022-02-25T16:49:17Z

docs/source/10_deployment/dask.md

+                if load_counts[data_set] < 1 and data_set not in pipeline.outputs():
+                    catalog.release(data_set)
+
+    def run_only_missing(


Two questions out of curiosity here:

1. Is this actually used somewhere or do you see it being particularly relevant for Dask?

2. Is there actually a nice way for me to make kedro use this when I do kedro run --runner=kedro_tutorial.runner.DaskRunner just through some simple modification to the CLI run command that you define? I don't immediately see how you could, given that AbstractRunner.run is fixed to call self._run.

Either way, this is 💯 level dedication, given that run_only_missing isn't actually used anywhere in kedro AFAIK.

deepyaman · 2022-02-25

I implemented this because I felt it was especially relevant for Dask. The DaskDataSet actually publishes data to the cluster, so less progress is lost than with MemoryDataSet if an error occurs.

I did test it months ago when I wrote the behavior, but I probably hacked it in somewhere for testing. I'm not aware of an easy way to call it off the top of my head (but there probably should be one!). It looks like this was raised a few years ago ("Add only-missing option to kedro run command" #30, "Add only_missing option in KedroContext class" #60), but it was decided against back then.

antonymilne · 2022-02-25

Wow, great work searching the archives to find that! I wasn't even aware that run_only_missing existed until a couple of months ago when Nikos mentioned it.
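For readers unfamiliar with the idea behind run_only_missing, here is a simplified, self-contained sketch of the selection logic. It is not Kedro's implementation: the `Node` tuple and `nodes_to_run` function are hypothetical, nodes are assumed to be given in topological order, and every dataset is assumed to be persisted (so in-memory intermediate datasets, which the real runner has to account for, are ignored).

```python
from typing import List, NamedTuple, Set


class Node(NamedTuple):
    """Hypothetical stand-in for a pipeline node."""

    name: str
    inputs: Set[str]
    outputs: Set[str]


def nodes_to_run(nodes: List[Node], existing: Set[str]) -> List[str]:
    """Select nodes whose outputs are missing, plus everything downstream of them.

    `nodes` must be in topological order; `existing` holds the names of
    datasets that already exist.
    """
    stale: Set[str] = set()  # datasets that will be (re)generated in this run
    selected: List[str] = []
    for node in nodes:
        missing_output = any(out not in existing for out in node.outputs)
        stale_input = any(inp in stale for inp in node.inputs)
        if missing_output or stale_input:
            selected.append(node.name)
            stale |= node.outputs
    return selected


pipeline = [
    Node("preprocess", {"raw"}, {"clean"}),
    Node("train", {"clean"}, {"model"}),
    Node("report", {"model"}, {"summary"}),
]

# "clean" was already saved by an earlier (failed) run, so only the
# remaining nodes are selected.
print(nodes_to_run(pipeline, existing={"raw", "clean"}))
```

This is why the behaviour pairs well with a dataset that publishes results to the cluster: outputs that survived a failed run count as existing, so the nodes that produced them do not need to run again.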

antonymilne · 2022-02-25T16:59:58Z

docs/source/10_deployment/dask.md

+$ PYTHONPATH=$PWD/src dask-worker 127.0.0.1:8786
+$ PYTHONPATH=$PWD/src dask-worker 127.0.0.1:8786
+$ PYTHONPATH=$PWD/src dask-worker 127.0.0.1:8786


Why do we need to add PYTHONPATH here?

deepyaman · 2022-02-25

You need to make the project code available to the workers somehow; when the scheduler and workers all run on a single machine, setting PYTHONPATH does the job.

Client.upload_file is cleaner, and something like that would be necessary for a distributed deployment.

See https://stackoverflow.com/a/39994128/1093967 for some more details.
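As a rough sketch of the Client.upload_file alternative mentioned above: the project package can be zipped and sent to the workers instead of relying on PYTHONPATH. The helper names `zip_package` and `upload_package`, the archive name, and the `src/kedro_tutorial` path are assumptions made for illustration; `Client.upload_file` is the real dask.distributed call, which accepts .py, .egg, or .zip files. The upload step needs `dask[distributed]` installed and a reachable scheduler, so it is left as a commented-out usage line.

```python
import zipfile
from pathlib import Path


def zip_package(package_dir: Path, archive_path: Path) -> Path:
    """Zip a Python package so the package folder sits at the archive root."""
    package_dir = Path(package_dir)
    with zipfile.ZipFile(archive_path, "w") as archive:
        for path in sorted(package_dir.rglob("*.py")):
            # Store paths relative to the parent, e.g. "kedro_tutorial/runner.py".
            arcname = path.relative_to(package_dir.parent).as_posix()
            archive.write(str(path), arcname)
    return Path(archive_path)


def upload_package(scheduler_address: str, archive_path: Path):
    """Send the zipped package to the workers connected to the scheduler."""
    # Imported lazily so that zip_package works without dask installed.
    from dask.distributed import Client

    client = Client(scheduler_address)
    client.upload_file(str(archive_path))
    return client


# Example usage (assumes a scheduler is listening on 127.0.0.1:8786):
# archive = zip_package(Path("src/kedro_tutorial"), Path("kedro_tutorial.zip"))
# client = upload_package("127.0.0.1:8786", archive)
```

With the code shipped this way, the workers could be started with a plain `dask-worker 127.0.0.1:8786`, without the PYTHONPATH prefix.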

antonymilne · 2022-02-25T17:23:47Z

P.S. Fun fact, in case you aren't aware of the recent change: if you hadn't edited RELEASE.md, this PR would only be running the docs workflows on CircleCI. Possibly we should make that regex match any .md file (rather than only files in docs/) so that PRs like this don't trigger all the code builds.
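To make the suggestion concrete, here is a small sketch of the difference between the two rules. The actual regex in the CircleCI configuration is not shown in this thread, so both patterns and the `is_docs_only` helper below are hypothetical.

```python
import re
from typing import Iterable

# Hypothetical patterns: the first matches only files under docs/,
# the second also matches any Markdown file anywhere in the repository.
DOCS_DIR_ONLY = re.compile(r"^docs/")
DOCS_OR_MARKDOWN = re.compile(r"^docs/|\.md$")


def is_docs_only(changed_files: Iterable[str], pattern: "re.Pattern[str]") -> bool:
    """Return True if every changed file matches the docs pattern."""
    files = list(changed_files)
    return bool(files) and all(pattern.search(name) for name in files)


changed = ["RELEASE.md", "docs/source/10_deployment/dask.md"]
print(is_docs_only(changed, DOCS_DIR_ONLY))     # RELEASE.md is outside docs/
print(is_docs_only(changed, DOCS_OR_MARKDOWN))  # every file is docs or Markdown
```

Under the first rule the RELEASE.md edit makes a PR like this one count as a code change; under the second it would count as docs-only.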

deepyaman · 2022-02-25T23:03:40Z

> P.S. Fun fact, in case you aren't aware of the recent change: if you hadn't edited RELEASE.md, this PR would only be running the docs workflows on CircleCI. Possibly we should make that regex match any .md file (rather than only files in docs/) so that PRs like this don't trigger all the code builds.

I noticed today when looking at the Windows/Python 3.6 errors that you all adopted dynamic config, nice!

datajoely · 2022-02-28T11:54:03Z

Well done @deepyaman 🙏

deepyaman added 17 commits February 12, 2022 12:20

Document distribution of Kedro pipelines with Dask

35d789a

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

Update 05_prefect.md

a64c740

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

Update 04_argo.md

d6e8f51

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

Update 01_deployment_guide.md

c326686

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

Update 07_aws_batch.md

0a1fa80

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

Update 07_aws_batch.md

a91908e

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

Update 07_aws_batch.md

25b6f1e

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

Update 12_dask.md

79c211c

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

Add files via upload

903b724

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

Delete dask_diagnostics_dashboard.png

5559558

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

Add files via upload

0225c26

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

Update 12_dask.md

7c56d42

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

Apply suggestions from code review

ee18806

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

Update 07_aws_batch.md

3f41697

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

Update docs/source/10_deployment/12_dask.md

df4bfba

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

Add 10_deployment/12_dask to the central toctree

5e053ab

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

Integrate Merel's feedback on adding prerequisites

10e7223

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

deepyaman requested a review from yetudada as a code owner February 12, 2022 19:41

deepyaman force-pushed the deepyaman-patch-3 branch from 938e2aa to 10e7223 Compare February 12, 2022 19:42

Merge branch 'main' into deepyaman-patch-3

e9aab5f

antonymilne mentioned this pull request Feb 15, 2022

Universal Kedro deployment (Part 3) - Add the ability to extend and distribute the project running logic #1041

Closed

lorenabalan pushed a commit that referenced this pull request Feb 16, 2022

[KED-2784] Enable pulling micro-packages into any part of the project…

e722c18

…'s source code, not just `pipelines/` (#1248)

deepyaman force-pushed the deepyaman-patch-3 branch from 68ba168 to e9aab5f Compare February 16, 2022 22:18

ignore linkcheck for Dask's diagnostic dashboard

497ec53

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

deepyaman force-pushed the deepyaman-patch-3 branch from 4734689 to 497ec53 Compare February 17, 2022 03:30

deepyaman requested a review from idanov as a code owner February 17, 2022 03:41

Update RELEASE.md

4aeaa1f

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

deepyaman force-pushed the deepyaman-patch-3 branch 2 times, most recently from 0ca2563 to 4470623 Compare February 17, 2022 04:48

Merge branch 'main' into deepyaman-patch-3

4f07989

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

deepyaman force-pushed the deepyaman-patch-3 branch from 4470623 to 4f07989 Compare February 17, 2022 12:28

deepyaman added 4 commits February 18, 2022 08:39

Merge branch 'main' into deepyaman-patch-3

c1bda60

Merge branch 'main' into deepyaman-patch-3

3c67017

Merge branch 'main' into deepyaman-patch-3

88da94d

Merge branch 'main' into deepyaman-patch-3

9ba6975

lorenabalan approved these changes Feb 24, 2022

deepyaman and others added 4 commits February 24, 2022 21:52

Update 05_visualise_pipeline.md

9dc7297

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

Apply suggestions from code review

d4c19bd

Co-authored-by: Lorena Bălan <lorena.balan@quantumblack.com> Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

Rename 12_dask.md to dask.md

9ff15cf

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

Create and move to new release header for 0.17.8

5a72a93

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

deepyaman force-pushed the deepyaman-patch-3 branch from ed3381c to 5a72a93 Compare February 25, 2022 02:56

Merge branch 'main' into deepyaman-patch-3

09c5eab

deepyaman marked this pull request as draft February 25, 2022 04:32

deepyaman marked this pull request as ready for review February 25, 2022 04:32

deepyaman commented Feb 25, 2022

docs/source/10_deployment/dask.md (outdated)

deepyaman commented Feb 25, 2022

docs/source/10_deployment/dask.md (outdated)

Make DaskRunner._run_node into a @staticmethod

353d0e4

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

antonymilne approved these changes Feb 25, 2022

lorenabalan merged commit 4e64877 into kedro-org:main Feb 28, 2022

lorenabalan mentioned this pull request Feb 28, 2022

Document distribution of Kedro pipelines with Dask #1131

Closed

5 tasks

deepyaman deleted the deepyaman-patch-3 branch February 28, 2022 13:59

merelcht mentioned this pull request Mar 7, 2022

Update deployment diagram to include Dask #1321

Closed

AhdraMeraliQB pushed a commit that referenced this pull request Mar 30, 2022

Document distribution of Kedro pipelines with Dask (#1248)

60ef0f1

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu> Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com>

lvijnck pushed a commit to lvijnck/kedro that referenced this pull request Apr 7, 2022

Document distribution of Kedro pipelines with Dask (kedro-org#1248)

3de5710

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
