
Add ml model #2

Merged
merged 48 commits into WayScience:main from add-ML-model on Jul 28, 2022
Conversation

roshankern
Member

@roshankern roshankern commented Jul 12, 2022

@gwaybio This is ready for review!

Member

@gwaybio gwaybio left a comment


This is fantastic, well done. (particularly the README documentation - this will make writing a paper a lot easier)

I made a couple of in-line comments, one of which (about the coefficients) is very important. I have two other general comments that you should address:

  1. Please rerun this entire pipeline after randomly permuting the training data. We want to compare our per-class performance to a randomly shuffled baseline.
  2. Please output all important data into external files (e.g. predictions, coefficients, average precision, etc.). You can put these in a results/ folder.
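The two requested changes above can be sketched as follows; `y_train` and the class labels here are hypothetical stand-ins for the project's real phenotypic labels:

```python
import pandas as pd

# Hypothetical training labels standing in for the real phenotypic classes
y_train = pd.Series(["polylobed", "round", "elongated", "round"])

# Shuffled baseline: permute the labels so any real signal is destroyed,
# while the class balance stays identical
y_shuffled = y_train.sample(frac=1, random_state=0).reset_index(drop=True)

# Key outputs (predictions, coefficients, average precision) would then be
# written to a results/ folder, e.g. with DataFrame.to_csv(...)
```

This keeps the label distribution intact, so any remaining per-class performance reflects chance rather than signal.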

@@ -0,0 +1,3 @@
#!/bin/bash
jupyter nbconvert --to python DP_trained_model.ipynb
python DP_trained_model.py
Member


Quick question - what is the verb here? Are you training or applying?

It might be helpful to specify the verb in the file name.

Member Author


On second thought, I think it is better to not specify any verb because the model is being trained, evaluated, and interpreted. I removed the verb in 5300c46. Let me know what you think.

roshankern and others added 11 commits July 12, 2022 15:20
Co-authored-by: Gregory Way <gregory.way@gmail.com>
@roshankern
Member Author

roshankern commented Jul 14, 2022

@gwaybio The most recent commits address the important general comments.

Member

@gwaybio gwaybio left a comment


A couple of in-line discussion items.

Also, one pretty important comment about the shuffled baseline result:

All of the predictions come back as polylobed!! That is so weird.

You're currently using the approach: y = shuffle(y, random_state=0)

I wonder if this random shuffling just so happens to pair accurate polylobed values, and your cross-validation procedure is getting confused, so it's outputting polylobed for everything.

Instead of shuffling y, can you shuffle x by column, independently?
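Shuffling each feature column independently, as suggested above, might look like this (a minimal sketch; `X` is a hypothetical feature matrix, not the project's actual data):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature matrix: rows are cells, columns are features
X = np.arange(12, dtype=float).reshape(4, 3)

# Shuffle each column independently: per-feature distributions are kept,
# but feature-feature and feature-label associations are broken
X_shuffled = X.copy()
for col in range(X_shuffled.shape[1]):
    rng.shuffle(X_shuffled[:, col])
```

Unlike shuffling y, this leaves the (unshuffled) labels aligned with a feature matrix whose structure has been destroyed, which avoids the degenerate all-polylobed outcome described above.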

Applying to test data

So far, this notebook looks like it is training the models and then evaluating models in the training set. Is that accurate?

You should also be applying this to the test set data. Will that be a separate notebook? We really care more about the test data.

Exporting Sklearn model with joblib

Also, you might consider saving the full sklearn model (log_reg_model). (Details on how to do this here). Saving the full model will help you to apply to a test set externally (and, eventually, to the Cell Health dataset)
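Saving and reloading the fitted model with joblib might look like this (a sketch on toy data; the variable and file names are illustrative, not the project's actual code):

```python
from joblib import dump, load
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Toy data standing in for the real single-cell features
X, y = make_classification(n_samples=50, n_features=5, random_state=0)
log_reg_model = LogisticRegression(random_state=0).fit(X, y)

# Persist the full fitted model so it can be applied externally later
# (e.g. to a held-out test set or another dataset)
dump(log_reg_model, "log_reg_model.joblib")
loaded_model = load("log_reg_model.joblib")
```

The reloaded model carries all fitted attributes (coefficients, classes), so it can make predictions in a separate notebook without retraining.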

@roshankern
Member Author

roshankern commented Jul 27, 2022

@gwaybio This is ready for review!

Would you recommend structuring this module differently than as a pipeline? Running 2.ML_model.sh produces all intermediate files in results/ but doesn't update any diagrams in the Jupyter notebooks.

Member

@gwaybio gwaybio left a comment


This looks great. Two comments (no need for me to re-review)

  1. Make sure to set random seeds in all three of the notebooks. The data splitting and training notebooks are the most important. (This is a required change for reproducibility.)
  2. Comment below in regards to your observation that running the bash script does not execute the models.

Would you recommend structuring this module differently than as a pipeline? Running 2.ML_model.sh produces all intermediate files in results/ but doesn't update any diagrams in the Jupyter notebooks.

This is ok - folks should be able to understand this. You can add a comment in the README for completeness though.

BTW, it is convention to run the .ipynb files directly, instead of the nbconverted .py files. What you have is perfectly fine, but convention is:

# Step 0: Convert all notebooks to scripts
# (note the different output directory)
jupyter nbconvert --to=script \
        --FilesWriter.build_directory=scripts/nbconverted \
        *.ipynb

# Step 1: Stratify data into training and testing sets (execute the .ipynb file directly)
jupyter nbconvert --to=html \
        --FilesWriter.build_directory=scripts/html \
        --ExecutePreprocessor.kernel_name=python3 \
        --ExecutePreprocessor.timeout=10000000 \
        --execute 0.stratify-data.ipynb

I do not know if executing the ipynb directly updates the notebook rendering. I would guess it doesn't.

@roshankern roshankern merged commit b3d9db9 into WayScience:main Jul 28, 2022
@roshankern roshankern deleted the add-ML-model branch July 28, 2022 19:39