Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Document distribution of Kedro pipelines with Dask #1248

Merged
merged 31 commits into from
Feb 28, 2022

Conversation

deepyaman
Copy link
Member

@deepyaman deepyaman commented Feb 12, 2022

Description

Update #1131, to which I no longer have write access. This can be either merged directly or merged into that PR, which can then be merged.

Development notes

Checklist

  • Read the contributing guidelines
  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change in the RELEASE.md file
  • Added tests to cover my changes

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
@deepyaman deepyaman force-pushed the deepyaman-patch-3 branch 2 times, most recently from 0ca2563 to 4470623 Compare February 17, 2022 04:48
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
@lorenabalan
Copy link
Contributor

LGTM, thanks for tidying up some other parts of the docs while here! 🙌

deepyaman and others added 4 commits February 24, 2022 21:52
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Co-authored-by: Lorena Bălan <lorena.balan@quantumblack.com>
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
@deepyaman deepyaman marked this pull request as draft February 25, 2022 04:32
@deepyaman deepyaman marked this pull request as ready for review February 25, 2022 04:32
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Copy link
Contributor

@antonymilne antonymilne left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is really awesome stuff, thank you very much @deepyaman.

Very happy for this to be merged - the questions I ask here and in the inline comments are just to satisfy my curiosity or contribute to more general ponderings.

I know this follows the same pattern as the AWS batch runner, which is fine, but there's a couple of things that definitely seem hacky about it (as they do for the AWS batch runner). I don't immediately see any better way of doing it now, but I wonder what it would take on the kedro side to make this sort of extension more elegant in future.

  1. Injecting custom arguments into DaskRunner instantiation via conf/dask/parameters.yml. This is very cunning, but doesn't seem ideal because it's incompatible with running any other run configuration environment (unless you also modify the ConfigLoader to allow for multiple environments to load on top of each other).
  2. def run is essentially the same as the default one; the only difference (unless I'm missing something?) is that it enables you to pass custom arguments into the runner instantiation.

I wonder if there's some other way of configuring our run command in order to make it easier to do this sort of thing. This issue seems very relevant, and I'd be interested if you had any thoughts to add to it based on what you've seen here: #1041

Naively it seems like we could expose RUNNER_CLASS and RUNNER_CLASS_ARGS in settings.py to enable this sort of thing more directly. But given that the runner is arguably runtime configuration (that belongs in conf) rather than application settings (that belongs in settings.py) that probably doesn't make sense. Sooo I don't know how we can support custom runner + arguments a bit more naturally.

Edit: just realised that point 1 actually doesn't prevent us from running whichever run environment we want to use, because I could just put the dask runner config in some already existing conf/env/parameters_dask.yml file, right? Rather than needing to create a whole new environment for it.

if load_counts[data_set] < 1 and data_set not in pipeline.outputs():
catalog.release(data_set)

def run_only_missing(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two questions out of curiosity here:

  1. Is this actually used somewhere or do you see it being particularly relevant for Dask?
  2. Is there actually a nice way for me to make kedro use this when I do kedro run --runner=kedro_tutorial.runner.DaskRunner just through some simple modification to the CLI run command that you define? I don't immediately see how you could, given that AbstractRunner.run is fixed to call self._run.

Either way, this is 💯 level dedication, given that run_only_missing isn't actually used anywhere in kedro AFAIK.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. I implemented this because I felt it was especially relevant for Dask. The DaskDataSet actually publishes data to the cluster, so there's less progress lost than in the case of MemoryDataSet in case of an error.
  2. I did test it months ago when I wrote the behavior, but I probably hacked it in somewhere for testing. Not aware of an easy call off the top of my head (but it probably should be!). Looks like it was raised a few years ago (Add only-missing option to kedro run command #30, Add only_missing option in KedroContext class  #60) but decided against back then.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow, great work searching the archives to find that! I wasn't even aware that run_only_missing existed until a couple of months ago when Nikos mentioned it.

Comment on lines +296 to +298
$ PYTHONPATH=$PWD/src dask-worker 127.0.0.1:8786
$ PYTHONPATH=$PWD/src dask-worker 127.0.0.1:8786
$ PYTHONPATH=$PWD/src dask-worker 127.0.0.1:8786
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need to add PYTHONPATH here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You need to somehow make the code available to the worker, and in case of a single-machine scheduler, this works.

Client.upload_file is cleaner, and something like that would be necessary for distributed deployment.

https://stackoverflow.com/a/39994128/1093967 for some more details.

@antonymilne
Copy link
Contributor

antonymilne commented Feb 25, 2022

P.S. fun fact, don't know if you're aware of the recent change: if you hadn't edited in RELEASE.md then this PR would only be running the docs workflows on CircleCI. Possibly we should make that regex just look for .md files (rather than files just in docs/) so that PRs like this don't trigger all the code builds.

@deepyaman
Copy link
Member Author

P.S. fun fact, don't know if you're aware of the recent change: if you hadn't edited in RELEASE.md then this PR would only be running the docs workflows on CircleCI. Possibly we should make that regex just look for .md files (rather than files just in docs/) so that PRs like this don't trigger all the code builds.

I noticed today when looking at the Windows/Python 3.6 errors that you all adopted dynamic config, nice!

@lorenabalan lorenabalan merged commit 4e64877 into kedro-org:main Feb 28, 2022
@datajoely
Copy link
Contributor

Well done @deepayman 🙏

@deepyaman deepyaman deleted the deepyaman-patch-3 branch February 28, 2022 13:59
AhdraMeraliQB pushed a commit that referenced this pull request Mar 30, 2022
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com>
lvijnck pushed a commit to lvijnck/kedro that referenced this pull request Apr 7, 2022
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants