Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Kedro-viz needs too much to run properly #1159

Closed
foxale opened this issue Nov 2, 2022 · 12 comments · Fixed by #1544
Closed

Kedro-viz needs too much to run properly #1159

foxale opened this issue Nov 2, 2022 · 12 comments · Fixed by #1544

Comments

@foxale
Copy link

foxale commented Nov 2, 2022

Description

So according to this #1125 kedro-viz requires

  • all plugins to be working (incl. mlflow connection to remote tracking server if leveraged)
  • all python packages to be installed

For me it just doesn't make sense. Why would I need to have the python env ready to visualize project flow? The original issue seems dead, but kedro-viz is still in some way a core kedro feature, so..

@foxale foxale changed the title <Title> Kedro-viz needs too much to run properly Nov 2, 2022
@deepyaman
Copy link
Member

+1 that this is annoying. I remember in the past, when had nodes with hard-to-install dependencies (e.g. a node that called R code using rpy2, because epidemiologists don't write their libraries in Python), would comment out imports in order to get kedro-viz working locally.

What I recall:

  • kedro-viz is going to trawl through your pipeline/node definitions, so if you have imports at the top of files, they need to be resolvable. You could move the imports into nodes, but that's ugly.
  • Re kedro-mlflow, I wonder when (in the project lifecycle) is it making a call to set up the connection? Maybe it can be delayed as much as possible. I don't think kedro-viz can entirely ignore hooks, because some could modify the DAG, but at the same time just loading the kedro-mlflow plugin maybe shouldn't create a connection. Would need to look further into how this works.

@merelcht merelcht transferred this issue from kedro-org/kedro Nov 8, 2022
@merelcht
Copy link
Member

merelcht commented Nov 8, 2022

Transferred this issue to the Kedro-Viz repo, because it's not framework related.

@Galileo-Galilei
Copy link
Member

Hi @foxale, Unfortunately I don't have a solution to your problem better than "turn off the mlflow hook".

I feel a bit responsible about this issue, both because I am the maintainer of the kedro-mlflow plugin, and because the root cause of this is due to a bunch of issues and suggestions I made to the core framework. I will try to explain what's going on here and the rationale behind this behaviour.

In these issues: kedro-org/kedro#506 and kedro-org/kedro#1431, there was a huge discussion about enabling hooks to access the config_loader. The pros and cons are discussed in the issues and I won't sum everything up here, but the key idea is that many plugins and hooks need to access proprietary config files or credentials.

After these discussion we decided to introduce an after_context_created hook in kedro-org/kedro#1434, and this is described in kedro-org/kedro#1458. This hook exposes the entire context to solve a lot of hook/plugins authors complaints (see the different discussions to understnad the inner details). One of the main argument was the need to be able to instantiate an external connection "once for all" at the beginning of the pipeline (e.g. for spark or mlflow).

The main advantage of creating connection after context creation is that when you are using the session interactively (in a script of a jupyter notebook), this is done automatically: you don't have to setup manually all connections.

On the other hand, the biggest drawback is that this connection is always instantiated, including when calling kedro viz (which loads a session). This was identified and discussed in this comment : kedro-org/kedro#506 (comment). @AntonyMilneQB and I concluded that this is a very uncommon scenario (it means that you run kedro-mlflow on a machine which cannot instantiate the mlflow connection: by definition, this can't be the machine where you launch kedro run because if you installed kedro-mlflow it was exactly because you wanted to have an mlflow connection. Since kedro-viz is a package that is more likely to be used in a development rather a production environment and considering the advantages, we did not investigate further.

Maybe @noklam or @AntonyMilneQB of the core team can add furhter help, and eventually suggest a solution ?

@noklam
Copy link
Contributor

noklam commented Nov 9, 2022

all plugins to be working (incl. mlflow connection to remote tracking server if leveraged)
all python packages to be installed

@foxale Thank you for your question. The answer to that is pretty much covered. tl;dr

  • Viz needs to load up the pipeline/nodes object, so the files would be loaded and the dependencies need to be installed. I don't see any way to get around it since you need to execute the code to get the Python representation of the pipeline.
  • In most case - plugins aren't relevant, but as @deepyaman mentioned, the hook can actually change the DAG, so viz need to execute that. Otherwise, we can skip all the hooks and it would avoid the mlflow connection problem.

@foxale
Copy link
Author

foxale commented Nov 9, 2022

Well, it still looks like a fixable problem on the kedro-viz side. One solution would be to add a parameter to kedro viz that would skip all import errors and after_catalog_created hooks and render the "raw" DAG.

@tynandebold
Copy link
Member

@foxale would you feel comfortable opening a PR that tries to solve this problem? We'd certainly appreciate it, as we welcome any and all contributions from the community!

@noklam
Copy link
Contributor

noklam commented Nov 10, 2022

This is certainly doable if the team thinks we should do this. It's not the most elegant solution you can find, but I think simply

do something like

if SOME_FLAG:
    session._hook_manager = _NullPluginManager()
# Before this line
context = session.load_context() 

This will remove all the hooks and should avoid this problem.

with KedroSession.create(
project_path=project_path,
env=env,
save_on_close=False,
extra_params=extra_params,
) as session:
context = session.load_context()
session_store = session._store
session_store_location = None
if isinstance(session_store, SQLiteStore):
session_store_location = session_store.location
catalog = context.catalog
# Pipelines is a lazy dict-like object, so we force it to populate here
# in case user doesn't have an active session down the line when it's first accessed.
# Useful for users who have `get_current_session` in their `register_pipelines()`.
pipelines_dict = dict(pipelines)
return catalog, pipelines_dict, session_store_location

@astrojuanlu
Copy link
Member

Potential solution for this: #1459 (comment)

@rashidakanchwala
Copy link
Contributor

Reopening this issue, as there's some issues around working with Spark that --ignore--plugins does no resolve

@ravi-kumar-pilla
Copy link
Contributor

Hi @foxale , We implemented a possible solution here which mocks missing dependencies. This will be merged in the upcoming Kedro-Viz release.

I know it is a long overdue ticket, but it is finally ready :D . I hope this will resolve the dependency problems. Try running kedro viz --lite next time. Cheers !

@ravi-kumar-pilla
Copy link
Contributor

Hi @foxale , Could you please let us know if you have used kedro viz --lite and was it helpful. Thank you

@rashidakanchwala
Copy link
Contributor

Closing this issue as we have some solutions to run kedro-viz easily such as kedro viz --lite

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

9 participants