Doing hyperparameter optimization #246
Comments
First, a side note: I think you could get an answer to such a question faster on Kedro's Discord. The kedro-mlflow community is much smaller, and Kedro's team is paid to provide support, which makes them much more available/reactive than I am :). They are also very knowledgeable about machine learning workflows.

Thoughts about your workflow

Hi, this is an interesting but very unusual workflow, so I have some questions before I can provide an accurate answer. A very "high level" common ML training pipeline is the following:

preprocess data -> train model -> post process predictions

In such a workflow, you do not optimize the "preprocess data" or the "post process predictions" parts: the only part which has hyperparameters you want to tune is the "train model" part (this is not exactly true: you often fine-tune the "preprocess data" part, e.g. to remove outliers, impute missing values, change the stopwords list... depending on the metric after training, but this is a very "manual" process which is not automated by hyperparameter search libraries). Once we have this setup in mind, it should be clear that the usual workflow is to do hyperparameter tuning at the node level.

Example of implementation

In my personal experience, people tend to have the following setup (this is pseudo-code, but it should be quite explicit):

```yaml
# parameters.yml
hyperparameter_grid:
  param1: [1, 2, 3]
  param2: ["a", "b"]
```
```python
# pipeline.py
from kedro.pipeline import Pipeline, node

from .nodes.preprocessing import preprocess_data  # wherever preprocess_data lives
from .nodes.tuning import tune_hyperparams


def create_pipeline(**kwargs):
    return Pipeline(
        [
            node(
                func=preprocess_data,
                inputs=dict(data="raw_data"),
                outputs="cleaned_data",
            ),
            node(
                func=tune_hyperparams,
                # the "params:" prefix exposes the entry declared in parameters.yml
                inputs=dict(grid="params:hyperparameter_grid", data="cleaned_data"),
                outputs=["tuning_metrics", "best_model"],
            ),
        ]
    )
```
```python
# nodes/tuning.py
def tune_hyperparams(grid, data, n_trials=20):
    metrics_result = {}
    models_result = {}
    best_hyperparams = None
    best_metric = float("inf")
    best_model = None
    for _ in range(n_trials):
        # pseudo-code: sample one combination of hyperparameters from the grid
        hyperparams = suggest_param_sample_grid(grid)
        model = SklearnModel(**hyperparams)
        model.train(data)
        key = tuple(sorted(hyperparams.items()))  # dicts are not hashable, so use a tuple key
        metrics_result[key] = compute_metrics(model, data)
        models_result[key] = model
        # keep this candidate if it is better (lower is better here)
        if metrics_result[key] < best_metric:
            best_metric = metrics_result[key]
            best_hyperparams = hyperparams
            best_model = model
    return metrics_result, best_model
```

The above example is very naive, but it is completely straightforward to replace the above abstraction (the placeholder `suggest_param_sample_grid`) with a real hyperparameter search library.

As a side note, it makes sense to leverage mlflow to manage these different sub-experiments by logging inside mlflow instead of storing results inside dictionaries:
```python
# nodes/tuning.py
import mlflow


def tune_hyperparams(grid, data, n_trials=20):
    best_metric = float("inf")
    best_model = None
    for _ in range(n_trials):
        with mlflow.start_run(nested=True):  # NEW LINE: one nested run per trial
            hyperparams = suggest_param_sample_grid(grid)
            mlflow.log_params(hyperparams)  # NEW LINE
            model = SklearnModel(**hyperparams)
            model.train(data)
            metric = compute_metrics(model, data)
            mlflow.log_metric("metric", metric)  # NEW LINE (log_metric takes a key and a value)
            # keep this candidate if it is better (lower is better here)
            if metric < best_metric:
                best_metric = metric
                best_model = model
    return best_model
```

Note that you will automatically benefit from kedro-mlflow's configuration management if you run this through the CLI or a KedroSession, so you don't need to add any extra mlflow configuration inside the node.

Possible future integration with kedro-mlflow

It seems possible to create an abstraction (say a "hypernode") which roughly behaves like a node, but takes a function that suggests hyperparameters (see issue #120), computes metrics/models and automatically creates nested mlflow runs. It honestly seems difficult for developers to use, and I guess the "manual" process described above is much easier to understand/implement (readability matters!), which is why I gave up on this idea.
You are right that hyperparameter tuning is mostly a "node level" thing; however, certain use cases benefit from automatically wrangling the data (as you said, it's usually manual, but automating it would allow a more thorough exploration of, e.g., the number of basis functions onto which to project functional data), and implementing the hyperparameter search at the pipeline level gives more flexibility in that regard (maybe some people will find use cases we can't think of yet!). Anyway, thanks for your suggestions, you gave me food for thought. I think I will keep reading/exploring the code until I have a better grasp of kedro and mlflow (which are new to me). I will see if I can use the hypernode idea to explore the data preprocessing and come up with a solution that works for my use case.
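One way to get part of that flexibility while staying at the node level is to put the preprocessing knobs in the same search space as the model hyperparameters, so each trial re-runs the data wrangling. The sketch below is only an illustration under that assumption; `project_on_basis`, `n_basis` and the other helpers are hypothetical names:

```python
# hypothetical sketch: tune a preprocessing parameter together with the model ones
import mlflow


def tune_with_preprocessing(grid, raw_data, n_trials=20):
    best_metric, best_model = float("inf"), None
    for _ in range(n_trials):
        with mlflow.start_run(nested=True):
            hyperparams = suggest_param_sample_grid(grid)  # grid includes e.g. "n_basis"
            mlflow.log_params(hyperparams)
            # the preprocessing step is re-executed for every trial
            data = project_on_basis(raw_data, n_basis=hyperparams.pop("n_basis"))
            model = SklearnModel(**hyperparams)
            model.train(data)
            metric = compute_metrics(model, data)
            mlflow.log_metric("metric", metric)
            if metric < best_metric:
                best_metric, best_model = metric, model
    return best_model
```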
I consider the hyperparameter search (in general) to be a "step" (a node-level activity) whose output is either a model or the parameters themselves; having this logged in mlflow is already great, given that the experiment is reproducible.
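For completeness, a minimal sketch of that "output is the parameters or the model" idea, assuming a scikit-learn estimator and the standard mlflow API (the metric and artifact names are placeholders):

```python
# minimal sketch: log the winning hyperparameters and model of a search
import mlflow
import mlflow.sklearn


def log_best_candidate(best_hyperparams, best_model, best_metric):
    mlflow.log_params(best_hyperparams)            # the chosen hyperparameters
    mlflow.log_metric("best_metric", best_metric)  # the score that selected them
    mlflow.sklearn.log_model(best_model, artifact_path="best_model")
```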
Description
I am trying to get hyperparameter optimization up and running, however I have run into many issues implementing it in kedro and kedro-mlflow (I thought it would be easier, to be honest). I come here because I think the kedro-mlflow community is closer to ML than the Kedro community and will understand me better. Also, the author of kedro-mlflow is sure to know the inner workings of Kedro pretty well.

For any hypertuning to occur, I essentially need to modify hyperparameters, run the whole pipeline, get new hyperparameters, and re-execute until some stopping criterion is met (e.g. grid search has no more combinations to try, random search has done MAX_ITER iterations, etc.). But what seemed as easy as "get the pipeline, then run it like a function by passing it to HyperOpt's fmin" has become a major problem where I do not know exactly what to do.
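To make it concrete, the kind of loop I have in mind looks roughly like the sketch below. It is only a sketch under several assumptions (kedro >= 0.18 style sessions, a pipeline named "training" whose free output "tuning_metrics" is the value to minimize), not something I have working:

```python
# hypothetical sketch: drive a full Kedro pipeline run from hyperopt's fmin
from pathlib import Path

from hyperopt import Trials, fmin, hp, tpe
from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project

PROJECT_PATH = Path.cwd()
bootstrap_project(PROJECT_PATH)

# search space mirroring the entries in parameters.yml
space = {
    "param1": hp.choice("param1", [1, 2, 3]),
    "param2": hp.choice("param2", ["a", "b"]),
}


def objective(sampled_params):
    # one fresh session per trial, overriding parameters for this run only
    with KedroSession.create(project_path=PROJECT_PATH, extra_params=sampled_params) as session:
        outputs = session.run(pipeline_name="training")
    return outputs["tuning_metrics"]  # value hyperopt will minimize


best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=20, trials=Trials())
print(best)
```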
Context
Hyperparameter tuning (and AutoML in general) are tools of increasing importance in the ML world. We need to incorporate this feature into the already useful kedro workflow.
This kind of feature also applies to other high-level ML workflows such as CV or feature selection.
Possible Implementation
As of now, I have tried a hook that generates a new dict of hyperparameters before running the pipeline and injects them as inputs to the nodes that take hyperparameters.
However, the parameters do not correctly register to MLflow, and I am not sure that an `after_pipeline_run` hook can trigger a re-execution of the pipeline. For that, I have tried to implement a runner, but I am unsure how to make it interact with hooks...
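For reference, the hook attempt looks roughly like the sketch below; it is only an illustration, the parameter names are placeholders, and `sample_new_hyperparameters` stands in for whatever search strategy is used:

```python
# hypothetical sketch of the hook-based attempt -- not a working solution
from kedro.framework.hooks import hook_impl


def sample_new_hyperparameters(grid):
    """Placeholder: draw one combination from the grid (random search, grid search, ...)."""
    return {key: values[0] for key, values in grid.items()}


class HyperparamHooks:
    @hook_impl
    def before_pipeline_run(self, run_params, pipeline, catalog):
        # overwrite the in-memory "params:..." entry before the nodes consume it
        grid = catalog.load("params:hyperparameter_grid")
        new_params = sample_new_hyperparameters(grid)
        # DataCatalog.add_feed_dict is available in kedro 0.18/0.19
        catalog.add_feed_dict({"params:model_params": new_params}, replace=True)
```

The hook class would then be registered in `settings.py` via `HOOKS = (HyperparamHooks(),)`.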
Possible Alternatives

The only alternative I have thought of is to ditch kedro and do hyperparameter tuning exclusively with MLflow and optuna/hyperopt.
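For comparison, that alternative would look something like this sketch built on the standard optuna and mlflow APIs; `train_and_evaluate` is a placeholder for the actual training code:

```python
# hypothetical sketch of the "no kedro" alternative: optuna drives trials, mlflow logs them
import mlflow
import optuna


def objective(trial):
    with mlflow.start_run(nested=True):
        params = {
            "param1": trial.suggest_int("param1", 1, 3),
            "param2": trial.suggest_categorical("param2", ["a", "b"]),
        }
        mlflow.log_params(params)
        metric = train_and_evaluate(params)  # placeholder for training + evaluation
        mlflow.log_metric("metric", metric)
    return metric


with mlflow.start_run(run_name="optuna_search"):
    study = optuna.create_study(direction="minimize")
    study.optimize(objective, n_trials=20)
    mlflow.log_params(study.best_params)
```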
Suggestions welcome! To be clear, I am trying to implement this, not asking somebody to implement it for me; if I succeed I will be happy to contribute it as code or as a new plugin to Kedro!