
Doing hyperparameter optimization #246

Closed
a-berg opened this issue Sep 28, 2021 · 3 comments

Comments

@a-berg

a-berg commented Sep 28, 2021

Description

I am trying to get hyperparameter optimization up and running, but I have run into many issues implementing it in kedro and kedro-mlflow (I thought it would be easier, tbh). I come here because I think the kedro-mlflow community is closer to ML than the Kedro community and will understand me better. Also, the author of kedro-mlflow is sure to know the inner workings of Kedro pretty well.

For any hypertuning to occur, we need the following:

  1. Some way to specify the hyperparameter search space, ideally with a custom Kedro DataSet to accommodate different libraries (such as optuna, hyperopt or GPyOpt); a sketch of what I mean is shown after this list.
  2. Get the "suggesting" algorithm to give parameter tuples between runs.
  3. A metric to minimize.
  4. Hooks that effectively change the parameters of the run.
  5. A runner that interacts with the hypertuning algorithm and executes the whole pipeline.
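
For point 1, I was imagining something like the rough sketch below, based on kedro's AbstractDataSet. The class name, file path and YAML schema are just illustrative assumptions on my side, not a finished design:

# extras/datasets/search_space_dataset.py
from pathlib import Path

import yaml
from kedro.io import AbstractDataSet


class SearchSpaceDataSet(AbstractDataSet):
    """Loads a YAML search-space definition so that different tuning
    libraries (optuna, hyperopt, GPyOpt...) can build their own space from it."""

    def __init__(self, filepath: str):
        self._filepath = Path(filepath)

    def _load(self) -> dict:
        with self._filepath.open() as f:
            return yaml.safe_load(f)

    def _save(self, data: dict) -> None:
        with self._filepath.open("w") as f:
            yaml.safe_dump(data, f)

    def _describe(self) -> dict:
        return dict(filepath=str(self._filepath))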

Essentially I need to modify hyperparameters, run the whole pipeline, get new hyperparameters, and re-execute until some criterion is met (e.g. grid search has no more combinations to try, random search has done MAX_ITER iterations, etc.). But what seemed as easy as "get the pipeline, then run it like a function passed to HyperOpt's fmin" has become a major problem where I do not know exactly what to do; the naive version of that idea is sketched below.
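
To illustrate, this is an untested sketch of what I had in mind. The pipeline name "training", the free output "validation_loss" and the search-space keys are assumptions, and it relies on KedroSession.create accepting extra_params (depending on your kedro version it may also need the package name):

# run_tuning.py (standalone script, not a kedro node)
from pathlib import Path

from hyperopt import fmin, hp, tpe
from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project

project_path = Path.cwd()
bootstrap_project(project_path)

search_space = {
    "learning_rate": hp.loguniform("learning_rate", -5, 0),
    "max_depth": hp.choice("max_depth", [3, 5, 7]),
}


def objective(hyperparams):
    # each trial re-runs the whole pipeline in a fresh session,
    # overriding the parameters for this trial
    with KedroSession.create(project_path=project_path, extra_params=hyperparams) as session:
        outputs = session.run(pipeline_name="training")
    # assumes the pipeline exposes its metric as a free (uncatalogued) output
    return outputs["validation_loss"]


best = fmin(fn=objective, space=search_space, algo=tpe.suggest, max_evals=25)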

Context

Hyperparameter tuning (and AutoML in general) is a tool of increasing importance in the ML world. We need to incorporate this feature into the already useful kedro workflow.
This kind of feature also applies to other high-level ML workflows such as CV or feature selection.

Possible Implementation

So far, I have tried a hook that generates a new dict of hyperparameters before running the pipeline and changes the inputs of the nodes that consume hyperparameters.
However, the parameters do not register correctly in MLflow, and I am not sure that an after_pipeline_run hook can trigger a re-execution of the pipeline. For that, I have tried to implement a runner, but I am unsure how to make it interact with hooks...
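
The hook was roughly along these lines (heavily simplified; suggest_next_hyperparams and the "params:model_options" entry are placeholders, not my actual code):

# hooks.py
from kedro.framework.hooks import hook_impl
from kedro.io import MemoryDataSet


class HyperparamInjectionHook:
    @hook_impl
    def before_pipeline_run(self, run_params, pipeline, catalog):
        # ask the tuning library for the next candidate (placeholder)
        new_params = suggest_next_hyperparams()
        # overwrite the parameters entry consumed by the training node
        catalog.add("params:model_options", MemoryDataSet(new_params), replace=True)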

Possible Alternatives

The only alternative I have thought of is to ditch kedro and do hyperparameter tuning exclusively with MLflow and optuna/hyperopt.

Suggestions welcome! To be clear, I am trying to implement this, not asking somebody to implement it for me; if I succeed I will be happy to contribute it as code or as a new plugin to Kedro!

@Galileo-Galilei
Owner

Galileo-Galilei commented Sep 28, 2021

First, a side note: I think you could maybe get an answer to such a question faster on the Kedro Discord. The kedro-mlflow community is much smaller, and Kedro's team is paid to provide support, which makes them much more available/reactive than I am :). They are also very knowledgeable about machine learning workflows.

Thoughts about your workflow

Hi,

This is an interesting but very unusual workflow, so I have some questions before I can provide an accurate answer. A very "high level" common ML training pipeline is the following: preprocess data > train model > compute metrics & post-process predictions.

In such a workflow, you do not optimize the "preprocess data" or "post-process predictions" parts: the only part with hyperparameters you want to tune is the "train model" part. (This is not exactly true: you often fine-tune the "preprocess data" part, e.g. to remove outliers, impute missing values, change the stopwords list... depending on the metric after training, but this is a very "manual" process which is not automated by hyperparameter search libraries.)

With this setup in mind, it should be clear that the usual approach is to do hyperparameter tuning at the node level in Kedro, not at the Pipeline level. If this does not suit your needs, could you explain why?

Example of implementation

In my personal experience, people tend to have the following setup (this is pseudo-code, but it should be quite explicit):

# parameters.yml
hyperparameter_grid:
  param1: [1, 2, 3]
  param2: ["a", "b"]

# pipeline.py
from kedro.pipeline import Pipeline, node

# preprocess_data and tune_hyperparams are imported from the nodes package


def create_pipeline(**kwargs):
    return Pipeline(
        [
            node(
                func=preprocess_data,
                inputs=dict(data="raw_data"),
                outputs="cleaned_data",
            ),
            node(
                func=tune_hyperparams,
                inputs=dict(grid="params:hyperparameter_grid", data="cleaned_data"),
                outputs=["tuning_metrics", "best_model"],
            ),
        ]
    )
# nodes\tuning.py

def tune_hyperparams(grid, data):
    # n_trial, suggest_param_sample_grid, compute_metrics and SklearnModel are
    # abstractions standing in for your favourite tuning framework
    metrics_result = {}
    models_result = {}
    best_hyperparams = None
    best_metric = float("inf")
    best_model = None
    for i in range(n_trial):
        hyperparams = suggest_param_sample_grid(grid)
        model = SklearnModel(**hyperparams)
        model.train(data)
        key = tuple(sorted(hyperparams.items()))  # dicts are not hashable keys
        metrics_result[key] = compute_metrics(model, data)
        models_result[key] = model

        # update the best candidate if this trial is better
        if metrics_result[key] < best_metric:
            best_metric = metrics_result[key]
            best_hyperparams = hyperparams
            best_model = model

    return metrics_result, best_model

The above example is very naive, but it is completely straightforward to replace the above abstractions (n_trial, suggest_param_sample_grid, compute_metrics) with their counterparts from whatever hyperparameter tuning framework you like.
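
For example, with optuna the same node could look roughly like this (a quick sketch, not tested; it returns the best parameters and metric instead of the model for brevity, and the parameter names just mirror the grid above):

# nodes\tuning.py -- optuna flavour of the same idea
import optuna


def tune_hyperparams(grid, data):
    def objective(trial):
        hyperparams = {
            "param1": trial.suggest_categorical("param1", grid["param1"]),
            "param2": trial.suggest_categorical("param2", grid["param2"]),
        }
        model = SklearnModel(**hyperparams)  # same abstraction as above
        model.train(data)
        return compute_metrics(model, data)

    study = optuna.create_study(direction="minimize")
    study.optimize(objective, n_trials=20)
    return study.best_params, study.best_value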

As a side note, it makes sense to leverage mlflow to manage these different sub-experiments by logging inside mlflow instead of storing results inside dictionaries:

# nodes\tuning.py
import mlflow


def tune_hyperparams(grid, data):
    best_metric = float("inf")
    best_model = None
    for i in range(n_trial):
        with mlflow.start_run(nested=True):  # NEW LINE
            hyperparams = suggest_param_sample_grid(grid)
            mlflow.log_params(hyperparams)  # NEW LINE
            model = SklearnModel(**hyperparams)
            model.train(data)
            metric = compute_metrics(model, data)
            mlflow.log_metric("validation_metric", metric)  # NEW LINE

            # update the best candidate if this trial is better (metric is minimized)
            if metric < best_metric:
                best_metric = metric
                best_model = model

    return best_model

Note that you will automatically benefit from kedro-mlflow's configuration management if you run this through the CLI or a KedroSession, so you don't need to add any extra mlflow configuration inside the node.

Possible future integration with kedro-mlflow

It seems possible to create an abstraction (say a "hypernode") which roughly behaves like a node, but takes a function that suggests hyperparameters (see issue #120), computes metrics/models and automatically creates nested mlflow runs. Honestly, it seems difficult for developers to use, and I guess the "manual" process described above is much easier to understand/implement (readability matters!), which is why I gave up on this idea.

@a-berg
Author

a-berg commented Sep 29, 2021

You are right that hyperparameter tuning is mostly a "node level" thing. However, certain use cases benefit from automatically wrangling the data (as you said, it's usually manual, but automating it would allow a more thorough exploration of, e.g., the number of basis functions over which to project functional data), and implementing the hyperparameter search at the pipeline level gives more flexibility in that regard (maybe some people will find use cases we can't think of yet!).

Anyway, thanks for your suggestions, you gave me food for thought. I think I will keep reading/exploring the code until I have a better grasp of kedro and mlflow (which are new to me). I will see if I can use the hypernode idea to explore the data preprocessing and come up with a solution that works for my use case.

@macksin

macksin commented Jan 27, 2022

I consider hyperparameter tuning (in general) a "step" (a node-level activity) whose output is either a model or the parameters themselves; having this logged in mlflow is already great, since the experiment is then reproducible.
