Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parameter Tuning Code Integration #193

Open
wants to merge 23 commits into
base: master
Choose a base branch
from

Conversation

ntalluri
Copy link
Collaborator

@ntalluri ntalluri commented Nov 4, 2024

No description provided.

config/egfr.yaml Outdated
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should I add the gold standard to this?

edge_freq.to_csv(OUT_DIR + 'node-ensemble.csv', sep="\t", index=False)
assert filecmp.cmp(OUT_DIR + 'node-ensemble.csv', EXPECT_DIR + 'expected-node-ensemble.csv', shallow=False)

def test_precision_recal_curve_ensemble_nodes(self):
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know how else to test the ensemble node outputs other than looking at the image

@ntalluri ntalluri requested a review from agitter November 18, 2024 18:33
# adds evaluation per algorithm per dataset-goldstandard pair
# evalution per algortihm will not run unless ml include and ml aggregate_per_algorithm is set to true
aggregate_per_algorithm: true
# TODO: should we decouple parts of eval that involve ml
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lots of coupling happening now. I put in a solution for now in config.py, but is it worth separating the functions into their own true/ false?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe deal with some of the coupling by giving warnings and stopping the flow rather than silently shutting things off

@@ -142,8 +142,14 @@ def pca(dataframe: pd.DataFrame, output_png: str, output_var: str, output_coord:
if not isinstance(labels, bool):
raise ValueError(f"labels={labels} must be True or False")

scaler = StandardScaler()
#TODO: MinMaxScaler changes nothing about the data
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know if it is better to use StandardScalar or MinMaxScalar for the binary data

for file in file_paths:
df = pd.read_table(file, sep="\t", header=0, usecols=["Node1", "Node2"])
# TODO: do we want to include the pathways that are empty for evaluation / in the pr_df?
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently the code will add a precision and recall for empty pathways. Is that something we shouldn't include?

Snakefile Outdated
final_input.extend(expand('{out_dir}{sep}{dataset_gold_standard_pair}-eval{sep}precision-recall-per-pathway.png',out_dir=out_dir,sep=SEP,dataset_gold_standard_pair=dataset_gold_standard_pairs))
final_input.extend(expand('{out_dir}{sep}{dataset_gold_standard_pair}-eval{sep}precision-recall-pca-chosen-pathway.txt',out_dir=out_dir,sep=SEP,dataset_gold_standard_pair=dataset_gold_standard_pairs))
final_input.extend(expand('{out_dir}{sep}{dataset_gold_standard_pair}-eval{sep}precision-recall-curve-ensemble-nodes.png',out_dir=out_dir,sep=SEP,dataset_gold_standard_pair=dataset_gold_standard_pairs,algorithm_params=algorithms_with_params))
# TODO: should we provide the node ensemble frequencies
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we are already calculating the node ensembles, should we give it to the user?

@ntalluri
Copy link
Collaborator Author

Note for self: Try seeing if it makes more sense to separate the parameter tuning and evaluation code into their classes for organization

@ntalluri
Copy link
Collaborator Author

I think I need to redo the idea for using ensembling for parameter tuning.

Currently the code will take in the ensemble file and build a node ensemble file by processing an ensemble of edge frequencies to identify the highest frequency associated with each node (y_scores). This is then compared to the node gold standard (y_trues). Then these are plotted (with no point labels) to a graph showing the PRC between the nodes in the output pathways and the gold standard, as well as the average PR between all the nodes.

I don’t think this can be used to do ensemble parameter tuning the way I think I would do it. It could be used to help parameter tune though to get a better grid search or for evaluation in general.

  • also the visualization for this could be way better (it lacks label information making it hard to fully understand)

Copy link
Collaborator

@agitter agitter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I need to redo the idea for using ensembling for parameter tuning.

Is there a way to break this huge pull request into smaller parts? Currently it combines evaluation, parameter tuning, ensembling, etc. That is making it hard to coordinate all of those decisions and also review the big pull request. I'm wondering if we can start small and merge in a subset of the file changes to lock in some progress, like the evaluation code.

If it is all interdependent, we'll have to deal with that and proceed as is.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please add some notes about how you prepared this file and link it back to the related issue? We'll want to be able to track that later.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants