Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add new function to specify namespace for KedroSession and kedro run #2306

Merged
merged 13 commits into from
Feb 20, 2023

Conversation

noklam
Copy link
Contributor

@noklam noklam commented Feb 10, 2023

NOTE: Kedro datasets are moving from kedro.extras.datasets to a separate kedro-datasets package in
kedro-plugins repository. Any changes to the dataset implementations
should be done by opening a pull request in that repository.

Description

Close #2145

Description

Currently there's a way to select different sections from a pipeline to run, but selecting by namespace is missing. We should add a way for users to run a specific subpipeline (namespace) by executing this in the terminal:

kedro run --namespace data_engineering.preprocessing
kedro run -ns data_engineering

Development notes

  • Update session.py to add namespace as record data and modify the test accordingly
  • Update project.py to add new option for the -ns or --namespace

How this is tested?

It is tested manually with a modified version of spaceflight, since the current starters do not use namespace, I followed this modular pipeline documentation for testing.

Steps

Kedro CLI

kedro run --namespace active_modelling - 3 nodes should be run
kedro run --namespace active_modelling --pipeline data_science - 3 3 nodes should be run, since the namespace only available in one pipeline
kedro run --namespace 1234 -- Error ValueError: Pipeline does not contain nodes with namespace '1234'

KedroSession

Similar tests are run with the kedro ipython. run %reload_kedro to refresh a session

  • session.run(namespace="active_modelling")
  • session.run(namespace="active_modelling", pipeline_name="data_science")
  • session.run(namespace="1234") -- error ValueError: Pipeline does not contain nodes with namespace '1234'

Questions for Reviewers

  1. Do we need a specific test for the namespace option? I don't think it's needed as it is more or less equal to other arguments and the filtering function is tested in the test_pipeline.py already.
  2. The ordering for these cli options and KedroSession.run is kind of arbitrary, for KedroSession it is the code API so it makes sense to add the new option as the last argument to keep it backward compatible. Noted some arguments have inconsistent names. For example kedro run --pipeline vs KedroSession.run(pipline_name=xxx)

Checklist

  • Read the contributing guidelines
  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change in the RELEASE.md file
  • Added tests to cover my changes

Signed-off-by: Nok Chan <nok.lam.chan@quantumblack.com>
Signed-off-by: Nok Chan <nok.lam.chan@quantumblack.com>
@noklam noklam requested a review from idanov as a code owner February 10, 2023 07:13
Signed-off-by: Nok Chan <nok.lam.chan@quantumblack.com>
@noklam noklam marked this pull request as draft February 10, 2023 07:24
Signed-off-by: Nok Chan <nok.lam.chan@quantumblack.com>
Signed-off-by: Nok Chan <nok.lam.chan@quantumblack.com>
@noklam noklam marked this pull request as ready for review February 13, 2023 09:49
@noklam noklam requested review from merelcht, jmholzer and AhdraMeraliQB and removed request for merelcht February 13, 2023 09:49
@datajoely
Copy link
Contributor

Super cool!

Copy link
Contributor

@jmholzer jmholzer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work ✅

Do we need a specific test for the namespace option? I don't think it's needed as it is more or less equal to other arguments and the filtering function is tested in the test_pipeline.py already.

I think the updated test_run_with_pipeline_filters makes an extra test unnecessary.

The ordering for these cli options and KedroSession.run is kind of arbitrary, for KedroSession it is the code API so it makes sense to add the new option as the last argument to keep it backward compatible.

I think the ordering of click.option decorators is consistent with the order in which their passed value appears in the function signature of run, so personally I think it should be added to the end if possible.

Noted some arguments have inconsistent names. For example kedro run --pipeline vs KedroSession.run(pipline_name=xxx)

I'll take a look if this is on our radar, if not I'll open a ticket.

@@ -336,6 +337,7 @@ def run( # pylint: disable=too-many-arguments,too-many-locals
used as an end point of the new ``Pipeline``.
load_versions: An optional flag to specify a particular dataset
version timestamp to load.
namespace: The namespace of the nodes that is being run.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
namespace: The namespace of the nodes that is being run.
namespace: An optional string specifying a namespace to run. If specified, only the nodes in this namespace will be run.

Copy link
Contributor

@AhdraMeraliQB AhdraMeraliQB Feb 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! @jmholzer

@@ -353,6 +355,7 @@ def activate_nbstripout(
callback=_reformat_load_versions,
)
@click.option("--pipeline", "-p", type=str, default=None, help=PIPELINE_ARG_HELP)
@click.option("--namespace", "-ns", type=str, default=None, help=NAMESPACE_ARG_HELP)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To keep the ordering of click.option decorators consistent with the arguments in the function signature, I think this should be moved to the bottom.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we actually follow the order? The first click option is from_input but the first argument would be tag. I think it only affects when you do kedro run -h, it will follow the order that how you decorate your function. In general, I think it should follow the order of importance. I would expect pipeline to be the first argument in that case.

I don't have a strong opinion about moving it to the back, the only reason I keep it next to pipeline is that I feel these two arguments are in the same category. Similarly, the nodes arguments are together, and the config-related arguments are grouped as well. @merelcht or @AntonyMilneQB Is this something we've discussed in the past?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fully agree with you here @noklam 👍 We should go for what makes sense to the user when you do kedro run --help; no need to be consistent with the order of KedroSession.run (which as you've said is somewhat arbitrary).

Copy link
Contributor

@antonymilne antonymilne left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM ⭐

On the questions you ask:

  1. Agreed with @jmholzer, tests look sufficient 👍
  2. Agreed with @noklam. We should aim for a good CLI UX first and then consistency with Python API second. Adding the argument to the end of the KedroSession.run is correct to maintain backwards compatibility but doesn't need to be reflected in the click run command - as you notice, we already don't have consistency there. (Backwards compatibility is kind of an edge case here because I doubt anyone calls KedroSession.run with positional arguments, but technically what you've done is right.)

Going beyond this PR, it would be nice to have eventual consistency between the Python API and click run command. More importantly than the order is the naming of the arguments. This is something I've noticed before and it's on our radar here: #2247. Note the pipeline_name vs. pipeline question follows the same pattern as the node_name vs. node one noted there.

In fact, I now have another thought on the ordering of arguments - will add it to that issue to keep it in one place.

Comment on lines 56 to 57
NAMESPACE_ARG_HELP = """Name of the namespace to run.
If not set, no namespace is used."""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
NAMESPACE_ARG_HELP = """Name of the namespace to run.
If not set, no namespace is used."""
NAMESPACE_ARG_HELP = """Name of the node namespace to run."""

IMO the "no namespace is used" just makes this less clear since it can't be really explained fully in just a line.

@@ -353,6 +355,7 @@ def activate_nbstripout(
callback=_reformat_load_versions,
)
@click.option("--pipeline", "-p", type=str, default=None, help=PIPELINE_ARG_HELP)
@click.option("--namespace", "-ns", type=str, default=None, help=NAMESPACE_ARG_HELP)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fully agree with you here @noklam 👍 We should go for what makes sense to the user when you do kedro run --help; no need to be consistent with the order of KedroSession.run (which as you've said is somewhat arbitrary).

@antonymilne
Copy link
Contributor

@noklam One thing I just thought of. We should add this to the documentation here. Maybe other places too, have a look. But that's the one that springs to mind.

@noklam
Copy link
Contributor Author

noklam commented Feb 14, 2023

Thank you @AntonyMilneQB, I will update the docs🙂

Signed-off-by: Nok <nok_lam_chan@mckinsey.com>
@noklam noklam requested a review from yetudada as a code owner February 16, 2023 07:54
@noklam noklam removed request for idanov and yetudada February 16, 2023 07:55
@@ -562,9 +565,13 @@ def test_run_with_pipeline_filters(
):
from_nodes = ["--from-nodes", "splitting_data"]
to_nodes = ["--to-nodes", "training_model"]
tags = ["--tag", "de"]
namespace = ["--namespace", "fake_namespace"]
tags = ["--tags", "de"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like tags is being overwritten here 🤔

@@ -336,6 +337,7 @@ def run( # pylint: disable=too-many-arguments,too-many-locals
used as an end point of the new ``Pipeline``.
load_versions: An optional flag to specify a particular dataset
version timestamp to load.
namespace: The namespace of the nodes that is being run.
Copy link
Contributor

@AhdraMeraliQB AhdraMeraliQB Feb 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! @jmholzer

@noklam noklam merged commit 116ddd0 into main Feb 20, 2023
@noklam noklam deleted the feat/add-namespace-option-kedro-run branch February 20, 2023 08:56
stichbury pushed a commit that referenced this pull request Feb 28, 2023
…un` (#2306)

* Add new namespace options for KedroSession and `kedro run`

Signed-off-by: Nok Chan <nok.lam.chan@quantumblack.com>

* Change the shortcut from `--ns` to `-ns` to be consistent

Signed-off-by: Nok Chan <nok.lam.chan@quantumblack.com>

* Fix lint

Signed-off-by: Nok Chan <nok.lam.chan@quantumblack.com>

* Fix broken tests and update new test for namespace argument

Signed-off-by: Nok Chan <nok.lam.chan@quantumblack.com>

* update docs and release notes

Signed-off-by: Nok <nok_lam_chan@mckinsey.com>

* Fix tests

Signed-off-by: Nok <nok_lam_chan@mckinsey.com>

* Fix test for incorrect merge

Signed-off-by: Nok <nok_lam_chan@mckinsey.com>

---------

Signed-off-by: Nok Chan <nok.lam.chan@quantumblack.com>
Signed-off-by: Nok <nok_lam_chan@mckinsey.com>
Signed-off-by: Jo Stichbury <jo_stichbury@mckinsey.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add an option to run a specific namespace in KedroSession run and kedro run
5 participants