Add ml model #2
Conversation
This is fantastic, well done (particularly the README documentation - this will make writing a paper a lot easier).
I made a couple of in-line comments, one of which (about coefficients) is very important. I also have two general comments that you should address:
- Please rerun this entire pipeline after randomly permuting the training data. We want to compare our per-class performance to a randomly shuffled baseline.
- Please output all important data into external files (e.g. predictions, coefficients, average precision, etc.). You can put these in a `results/` folder.
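A minimal sketch of the second request, with pandas; the file names and the placeholder values are assumptions, not the pipeline's actual outputs:

```python
import os
import pandas as pd

os.makedirs("results", exist_ok=True)

# Placeholder outputs -- in the real pipeline these would come from the
# fitted model (predicted labels, per-feature coefficients, etc.)
predictions = pd.DataFrame(
    {"cell_id": [1, 2, 3], "predicted_class": ["normal", "polylobed", "normal"]}
)
coefficients = pd.DataFrame(
    {"feature": ["area", "eccentricity"], "coefficient": [0.42, -1.30]}
)

predictions.to_csv("results/predictions.csv", index=False)
coefficients.to_csv("results/coefficients.csv", index=False)
```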
2.ML_model/2.DP_trained_model.sh (Outdated)

```sh
#!/bin/bash
jupyter nbconvert --to python DP_trained_model.ipynb
python DP_trained_model.py
```
Quick question - what is the verb here? Are you training or applying?
It might be helpful to specify the verb in the file name.
On second thought, I think it is better not to specify any verb, because the model is being trained, evaluated, and interpreted. I removed the verb in 5300c46. Let me know what you think.
Co-authored-by: Gregory Way <gregory.way@gmail.com>
@gwaybio The most recent commits address the important general comments.
A couple of in-line discussion items.
Also, one pretty important comment about the shuffled baseline result:
All of the predictions come back as polylobed!! That is so weird.
You're currently using the approach `y = shuffle(y, random_state=0)`.
I wonder if this random shuffling just so happens to pair accurate polylobed values, and your cross-validation procedure is getting confused, so it outputs polylobed for everything.
Instead of shuffling y, can you shuffle X by column, independently?
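A minimal sketch of what column-wise shuffling could look like, assuming `X` is a NumPy feature matrix (the function name is mine, not from the pipeline). Permuting each feature column with its own draw breaks both feature-label and feature-feature relationships, which is the point of the baseline:

```python
import numpy as np

def shuffle_columns(X, random_state=0):
    """Permute each feature column independently to build a null baseline."""
    rng = np.random.RandomState(random_state)
    X_shuffled = X.copy()  # leave the original training matrix untouched
    for j in range(X_shuffled.shape[1]):
        # each column gets its own independent permutation
        X_shuffled[:, j] = rng.permutation(X_shuffled[:, j])
    return X_shuffled
```

Each column keeps exactly the same values (so per-feature distributions are preserved), only their row alignment is destroyed.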
Applying to test data
So far, this notebook looks like it is training the models and then evaluating models in the training set. Is that accurate?
You should also be applying this to the test set data. Will that be a separate notebook? We really care more about the test data.
Exporting Sklearn model with joblib
Also, you might consider saving the full sklearn model (`log_reg_model`) with joblib (see the sklearn model persistence documentation for details). Saving the full model will help you apply it to a test set externally (and, eventually, to the Cell Health dataset).
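A sketch of the joblib round trip; the data and estimator here are stand-ins (only the name `log_reg_model` comes from the notebook), and the output path is an assumption:

```python
import os
from joblib import dump, load
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Stand-in data and model; log_reg_model mirrors the notebook's variable name
X, y = make_classification(n_samples=100, n_features=5, random_state=0)
log_reg_model = LogisticRegression(random_state=0).fit(X, y)

# Persist the fitted estimator alongside the other pipeline outputs
os.makedirs("results", exist_ok=True)
dump(log_reg_model, "results/log_reg_model.joblib")

# Later (e.g. in a separate test-set notebook), reload and apply it
reloaded = load("results/log_reg_model.joblib")
test_predictions = reloaded.predict(X)
```

The reloaded estimator carries its learned coefficients with it, so the test-set notebook never has to retrain.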
@gwaybio This is ready for review! Would you recommend formatting this module in a format different than a pipeline? Running `2.ML_model.sh` produces all intermediate files in `results/` but doesn't update any diagrams in the jupyter notebooks.
This looks great. Two comments (no need for me to re-review)
- Make sure to set random seeds in all three of the notebooks. The data-splitting and training notebooks are the most important. (This is a required change, for reproducibility.)
- Comment below in regards to your observation that running the bash script does not execute the models.
> Would you recommend formatting this module in a format different than a pipeline? Running `2.ML_model.sh` produces all intermediate files in `results/` but doesn't update any diagrams in the jupyter notebooks.
This is ok - folks should be able to understand this. You can add a comment in the README for completeness though.
BTW, it is convention to run the `.ipynb` files directly, instead of the nbconverted `.py` files. What you have is perfectly fine, but convention is:
```sh
# Step 0: Convert all notebooks to scripts
# (note the different output directory)
jupyter nbconvert --to=script \
  --FilesWriter.build_directory=scripts/nbconverted \
  *.ipynb

# Step 1: Stratify data into training and testing sets
# (this executes the .ipynb file directly)
jupyter nbconvert --to=html \
  --FilesWriter.build_directory=scripts/html \
  --ExecutePreprocessor.kernel_name=python3 \
  --ExecutePreprocessor.timeout=10000000 \
  --execute 0.stratify-data.ipynb
```
I do not know if executing the `.ipynb` directly updates the notebook rendering. I would guess it doesn't.
@gwaybio This is ready for review!