-
Notifications
You must be signed in to change notification settings - Fork 934
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Document distribution of Kedro pipelines with Dask #1248
Conversation
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
938e2aa
to
10e7223
Compare
…'s source code, not just `pipelines/` (#1248)
68ba168
to
e9aab5f
Compare
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
4734689
to
497ec53
Compare
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
0ca2563
to
4470623
Compare
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
4470623
to
4f07989
Compare
LGTM, thanks for tidying up some other parts of the docs while here! 🙌 |
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Co-authored-by: Lorena Bălan <lorena.balan@quantumblack.com> Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
ed3381c
to
5a72a93
Compare
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is really awesome stuff, thank you very much @deepyaman.
Very happy for this to be merged - the questions I ask here and in the inline comments are just to satisfy my curiosity or contribute to more general ponderings.
I know this follows the same pattern as the AWS batch runner, which is fine, but there's a couple of things that definitely seem hacky about it (as they do for the AWS batch runner). I don't immediately see any better way of doing it now, but I wonder what it would take on the kedro side to make this sort of extension more elegant in future.
- Injecting custom arguments into
DaskRunner
instantiation via conf/dask/parameters.yml. This is very cunning, but doesn't seem ideal because it's incompatible with running any other run configuration environment (unless you also modify the ConfigLoader to allow for multiple environments to load on top of each other). def run
is essentially the same as the default one; the only difference (unless I'm missing something?) is that it enables you to pass custom arguments into the runner instantiation.
I wonder if there's some other way of configuring our run
command in order to make it easier to do this sort of thing. This issue seems very relevant, and I'd be interested if you had any thoughts to add to it based on what you've seen here: #1041
Naively it seems like we could expose RUNNER_CLASS
and RUNNER_CLASS_ARGS
in settings.py to enable this sort of thing more directly. But given that the runner is arguably runtime configuration (that belongs in conf) rather than application settings (that belongs in settings.py) that probably doesn't make sense. Sooo I don't know how we can support custom runner + arguments a bit more naturally.
Edit: just realised that point 1 actually doesn't prevent us from running whichever run environment we want to use, because I could just put the dask runner config in some already existing conf/env/parameters_dask.yml
file, right? Rather than needing to create a whole new environment for it.
if load_counts[data_set] < 1 and data_set not in pipeline.outputs(): | ||
catalog.release(data_set) | ||
|
||
def run_only_missing( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two questions out of curiosity here:
- Is this actually used somewhere or do you see it being particularly relevant for Dask?
- Is there actually a nice way for me to make kedro use this when I do
kedro run --runner=kedro_tutorial.runner.DaskRunner
just through some simple modification to the CLI run command that you define? I don't immediately see how you could, given thatAbstractRunner.run
is fixed to callself._run
.
Either way, this is 💯 level dedication, given that run_only_missing
isn't actually used anywhere in kedro AFAIK.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- I implemented this because I felt it was especially relevant for Dask. The
DaskDataSet
actually publishes data to the cluster, so there's less progress lost than in the case ofMemoryDataSet
in case of an error. - I did test it months ago when I wrote the behavior, but I probably hacked it in somewhere for testing. Not aware of an easy call off the top of my head (but it probably should be!). Looks like it was raised a few years ago (Add only-missing option to kedro run command #30, Add only_missing option in KedroContext class #60) but decided against back then.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wow, great work searching the archives to find that! I wasn't even aware that run_only_missing
existed until a couple of months ago when Nikos mentioned it.
$ PYTHONPATH=$PWD/src dask-worker 127.0.0.1:8786 | ||
$ PYTHONPATH=$PWD/src dask-worker 127.0.0.1:8786 | ||
$ PYTHONPATH=$PWD/src dask-worker 127.0.0.1:8786 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need to add PYTHONPATH
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You need to somehow make the code available to the worker, and in case of a single-machine scheduler, this works.
Client.upload_file
is cleaner, and something like that would be necessary for distributed deployment.
https://stackoverflow.com/a/39994128/1093967 for some more details.
P.S. fun fact, don't know if you're aware of the recent change: if you hadn't edited in |
I noticed today when looking at the Windows/Python 3.6 errors that you all adopted dynamic config, nice! |
Well done @deepayman 🙏 |
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu> Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com>
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Description
Update #1131, to which I no longer have write access. This can be either merged directly or merged into that PR, which can then be merged.
Development notes
Checklist
RELEASE.md
file