Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Kedro-mlflow with kedro 0.19.11 produces multiple runs #624

Closed
gozderam opened this issue Feb 6, 2025 · 13 comments · Fixed by #638
Closed

Kedro-mlflow with kedro 0.19.11 produces multiple runs #624

gozderam opened this issue Feb 6, 2025 · 13 comments · Fixed by #638
Labels
bug Something isn't working

Comments

@gozderam
Copy link

gozderam commented Feb 6, 2025

Description

Once upgraded kedro to 0.19.11, kedro-mlflow started to produce multiple runs in MLFlow.

Context

There are additional runs in MLFlow once running kedro run while only one is expected. When downgrading kedro back to 0.19.10 this problem does not occur.

Steps to Reproduce

  1. Create a new kedro project, can be exemplary spaceflights project with kedro==0.19.11.
  2. Add kedro-mlflow package and mlflow.yml file.
  3. In catalog, make one of the datasets a MLFlow artifact, e.g. for spaceflights:
preprocessed_companies:
  type: kedro_mlflow.io.artifacts.MlflowArtifactDataset
  dataset:
    type: pandas.ParquetDataset
    filepath: data/02_intermediate/preprocessed_companies.parquet
  1. run kedro run

Expected Result

One run is produced in MLFlow.

Actual Result

MLFlow produces two runs, one with "default" name and the other one with some random name.

Your Environment

  • kedro and kedro-mlflow version used (pip show kedro and pip show kedro-mlflow):
    kedro: 0.19.11 (this problem does not occur in 0.19.10!)
    kedro-mlflow: 0.14.0
  • Python version used (python -V):
    3.11
  • Operating system and version:
    Mac os Sonoma 14.4

Does the bug also happen with the last version on master?

yes

@Galileo-Galilei
Copy link
Owner

Galileo-Galilei commented Feb 6, 2025

Thanks for the bug report.

The problem is indeed coming from kedro==0.19.11, I'll have a look this weekend. Several other people complain about this on slack. In the meantime, you should likely downgrade kedro. I tag @ankatiyar and @astrojuanlu for reference, because this is very likely an unexpected behavour of kedro itself.

@Galileo-Galilei Galileo-Galilei moved this from 🆕 New to 🔖 Ready in kedro-mlflow roadmap Feb 6, 2025
@Galileo-Galilei Galileo-Galilei added the bug Something isn't working label Feb 6, 2025
@astrojuanlu
Copy link

Thanks both, we'll look into this. Unless kedro-mlflow was using a private API, this regression shouldn't have happened.

@ankatiyar
Copy link

I've been able to narrow down which change caused the regression and it's kedro-org/kedro#4353 (cc @merelcht)

@Galileo-Galilei
Copy link
Owner

I can reproduce the bug, even without the mlflow.yml and artifacts. Investigation are going on here: https://github.com/Galileo-Galilei/kedro_mlflow_624

pip install uv
uv tool install kedro
uvx kedro new -s spaceflights-pandas -n spaceflights-pandas --telemetry no
cd spaceflights-pandas
kedro run 
kedro mlflow ui

@Galileo-Galilei
Copy link
Owner

Galileo-Galilei commented Feb 8, 2025

@ankatiyar @astrojuanlu : I can confirm the bug appears even with a very simple hook that just starts mlflow (see above repo for code and results). My best guess so far is that the before_pipeline_run hook is triggered in another thread than the one the nodes will be running (even with the sequential runner). Since mlflow is thread-safe, it cannot access this different thread (and I cannot even close it within the hook, so the run remains infinietly looping!)

@merelcht
Copy link

I will work on fixing this.

@merelcht
Copy link

@Galileo-Galilei just to double check that what I've observed is correct: the behaviour described here with the extra run being created was already always the case when using ThreadRunner and ParallelRunner?

@Galileo-Galilei
Copy link
Owner

Not with ThreadRunner

@merelcht
Copy link

Not with ThreadRunner

I've tried all versions back to 0.19.7 and it always creates two runs for ThreadRunner and ParallelRunner.

The issue here is that after my refactoring in kedro-org/kedro#4353, SequentialRunner is more similar to the other runners and we open an executor pool for managing the execution of the tasks. I dug into how mlflow and kedro-mflow work and what I also remember from working with mlflow earlier, is that it creates its own runs in turn. The run creation seems to be linked to whatever thread is running. Where before the run created in the before_pipeline_run hook was the same one triggered in before_node_run for logging the parameters, it now creates a new one because at that point the executor pool has been created.

I fiddled around a bit and managed to get this working by saving the active run ID and passing it on at the point where the parameters are logged, by calling MlflowClient().log_param(self.active_run, name, value) directly (self.active_run is something I added). Before opening a PR I wanted to check with you @Galileo-Galilei if this sounds like a reasonable approach to you? Tests will need to be adjusted.

Another solution would be to revert part of my earlier refactoring and not creating an executor pool for SequentialRunner, but then this issue with the two runs would still remain for the ThreadRunner.

@Galileo-Galilei
Copy link
Owner

Hum, weird. Let me look at it this weekend and I'll get back to you so we decide what's best when we fully understand the issue.

@astrojuanlu
Copy link

xref kedro-org/kedro#4486

Galileo-Galilei added a commit that referenced this issue Feb 17, 2025
… to ensure all the tracking is done within the same run (#623, #624)
Galileo-Galilei added a commit that referenced this issue Feb 17, 2025
…ode (#623, #624, #638)

* Make mlflow not thread safe by reopening the same run before each run to ensure all the tracking is done within the same run (#623, #624)

* fix test

* fix test with thread runner

* no cover for exception catching

* 📝 Fix broken link
@github-project-automation github-project-automation bot moved this from 🔖 Ready to ✅ Done in kedro-mlflow roadmap Feb 17, 2025
@Galileo-Galilei
Copy link
Owner

@merelcht @astrojuanlu I finally fixed it on kedro-mlflow's side because it's mostly due to the recent development to make mlflow thread-safe. Before mlflow 2.18, this wuold not have been a problem.

That said, I think it still need some work from kedro side:

  • It's really confusing that the "SequentialRunner" uses 2 threads and not one. It launches a new thread to run the nodes, which I found very dangerous and hard to debug. The MatplotlibWriter issue is a good warning of things that will happen in the future. I think this should be fixed on kedro's side.
  • that's another problem, but I've found that you can't use ParallelRunner with MlflowArtifactDataset, you get the following error:
AttributeError: The following datasets cannot be used with multiprocessing: ['model_input_table']
In order to utilize multiprocessing you need to make sure all datasets are serialisable, i.e. datasets should not make use of lambda functions, nested functions, closures etc.
If you are using custom decorators ensure they are correctly decorated using functools.wraps().

@merelcht
Copy link

Thanks @Galileo-Galilei, that makes sense. I agree it's confusing SequentialRunner is using 2 threads now, we briefly discussed in backlog grooming yesterday how that could be changed. We'll make sure to prioritise this asap!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
Status: ✅ Done
Development

Successfully merging a pull request may close this issue.

5 participants