Add ml model #2
Conversation
This is fantastic, well done (particularly the README documentation - this will make writing a paper a lot easier).
I made a couple of in-line comments, one of which (about coefficients) is very important. I also have two general comments that you should address:
- Please rerun this entire pipeline after randomly permuting the training data. We want to compare our per-class performance to a randomly shuffled baseline.
- Please output all important data into external files (e.g. predictions, coefficients, average precision, etc.). You can put these in a `results/` folder.
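A minimal sketch of the second request, with pandas; the file names and the placeholder values are assumptions, not the pipeline's actual outputs:

```python
import os
import pandas as pd

os.makedirs("results", exist_ok=True)

# Placeholder outputs -- in the real pipeline these would come from the
# fitted model (predicted labels, per-feature coefficients, etc.)
predictions = pd.DataFrame(
    {"cell_id": [1, 2, 3], "predicted_class": ["normal", "polylobed", "normal"]}
)
coefficients = pd.DataFrame(
    {"feature": ["area", "eccentricity"], "coefficient": [0.42, -1.30]}
)

predictions.to_csv("results/predictions.csv", index=False)
coefficients.to_csv("results/coefficients.csv", index=False)
```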
2.ML_model/2.DP_trained_model.sh (Outdated)

```sh
#!/bin/bash
jupyter nbconvert --to python DP_trained_model.ipynb
python DP_trained_model.py
```
Quick question - what is the verb here? Are you training or applying?
It might be helpful to specify the verb in the file name.
On second thought, I think it is better not to specify any verb, because the model is being trained, evaluated, and interpreted. I removed the verb in 5300c46. Let me know what you think.
Co-authored-by: Gregory Way <gregory.way@gmail.com>
@gwaybio The most recent commits address the important general comments.
A couple of in-line discussion items.
Also, one pretty important comment about the shuffled baseline result:
All of the predictions come back as polylobed!! That is so weird.
You're currently using the approach `y = shuffle(y, random_state=0)`.
I wonder if this random shuffling just so happens to pair accurate polylobed values, and your cross-validation procedure is getting confused, so it outputs polylobed for everything.
Instead of shuffling y, can you shuffle X by column, independently?
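A minimal sketch of what column-wise shuffling could look like, assuming `X` is a NumPy feature matrix (the function name is mine, not from the pipeline). Permuting each feature column with its own draw breaks both feature-label and feature-feature relationships, which is the point of the baseline:

```python
import numpy as np

def shuffle_columns(X, random_state=0):
    """Permute each feature column independently to build a null baseline."""
    rng = np.random.RandomState(random_state)
    X_shuffled = X.copy()  # leave the original training matrix untouched
    for j in range(X_shuffled.shape[1]):
        # each column gets its own independent permutation
        X_shuffled[:, j] = rng.permutation(X_shuffled[:, j])
    return X_shuffled
```

Each column keeps exactly the same values (so per-feature distributions are preserved), only their row alignment is destroyed.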
Applying to test data
So far, this notebook looks like it is training the models and then evaluating models in the training set. Is that accurate?
You should also be applying this to the test set data. Will that be a separate notebook? We really care more about the test data.
Exporting Sklearn model with joblib
Also, you might consider saving the full sklearn model (`log_reg_model`) with joblib (see the sklearn model persistence documentation for details). Saving the full model will help you apply it to a test set externally (and, eventually, to the Cell Health dataset).
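A sketch of the joblib round trip; the data and estimator here are stand-ins (only the name `log_reg_model` comes from the notebook), and the output path is an assumption:

```python
import os
from joblib import dump, load
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Stand-in data and model; log_reg_model mirrors the notebook's variable name
X, y = make_classification(n_samples=100, n_features=5, random_state=0)
log_reg_model = LogisticRegression(random_state=0).fit(X, y)

# Persist the fitted estimator alongside the other pipeline outputs
os.makedirs("results", exist_ok=True)
dump(log_reg_model, "results/log_reg_model.joblib")

# Later (e.g. in a separate test-set notebook), reload and apply it
reloaded = load("results/log_reg_model.joblib")
test_predictions = reloaded.predict(X)
```

The reloaded estimator carries its learned coefficients with it, so the test-set notebook never has to retrain.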
@gwaybio This is ready for review! Would you recommend formatting this module in a format different than a pipeline? Running `2.ML_model.sh` produces all intermediate files in `results/` but doesn't update any diagrams in the jupyter notebooks.
This looks great. Two comments (no need for me to re-review)
- Make sure to set random seeds in all three of the notebooks. The data-splitting and training notebooks are the most important. (This is a required change, for reproducibility.)
- Comment below in regards to your observation that running the bash script does not execute the models.
> Would you recommend formatting this module in a format different than a pipeline? Running `2.ML_model.sh` produces all intermediate files in `results/` but doesn't update any diagrams in the jupyter notebooks.
This is ok - folks should be able to understand this. You can add a comment in the README for completeness though.
BTW, it is convention to run the `.ipynb` files directly, instead of the nbconverted `.py` files. What you have is perfectly fine, but convention is:
```sh
# Step 0: Convert all notebooks to scripts
# (note the different output directory)
jupyter nbconvert --to=script \
  --FilesWriter.build_directory=scripts/nbconverted \
  *.ipynb

# Step 1: Stratify data into training and testing sets
# (this executes the .ipynb file directly)
jupyter nbconvert --to=html \
  --FilesWriter.build_directory=scripts/html \
  --ExecutePreprocessor.kernel_name=python3 \
  --ExecutePreprocessor.timeout=10000000 \
  --execute 0.stratify-data.ipynb
```
I do not know if executing the `.ipynb` directly updates the notebook rendering. I would guess it doesn't.
@gwaybio This is ready for review!