Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

v0.5.1 #86

Merged
merged 13 commits into from
Mar 10, 2025
Merged

v0.5.1 #86

merged 13 commits into from
Mar 10, 2025

Conversation

alxndrkalinin
Copy link
Collaborator

Running both phenotypic activity and consistency examples fails in the new version.

In the activity example is was due to the bug introduced in the earlier version, fixed in a8a35c5.

@afermg phenotypic consistency returns incorrect result for multilabel AP calculation, which in turn fails mAP calculation—can you take a look?

@alxndrkalinin
Copy link
Collaborator Author

We should include running demo notebooks in github actions to automate checks.

@afermg
Copy link
Contributor

afermg commented Feb 26, 2025

It is this one, right?
image

@afermg
Copy link
Contributor

afermg commented Feb 26, 2025

I don't have the rights to edit your branch, but a (partial) fix is here. I forgot to replace find_pairs with find_pairs_multilabel. We still have fewer results, but I think that has to do with the fact that I am getting pairs on whether or not two lists intersect, not on the specific items of this intersection.

@alxndrkalinin
Copy link
Collaborator Author

Looks like the issue is with defining positive pairs, because find_pairs_multilabel returns indices of rows from original dataframe, which is not what we need.

In this multilabel case, a positive pair has to be between a profile with a specific value in Metadata_target and another profile with the same value in Metadata_target. Then, AP score is calculated per each unique combination of profile and target:

Screenshot 2025-02-25 at 11 17 57 PM

That's why the original code exploded the dataframe.

@afermg
Copy link
Contributor

afermg commented Feb 26, 2025

Yes, I noticed that I misunderstood the problem in my original implementation. It's not far from the actual solution but I need to ask you a couple of questions before I implement that one.

@afermg
Copy link
Contributor

afermg commented Feb 26, 2025

I am almost there: With this commit I get very similar results, I need to check where the different number of pos and total pairs are coming from. There is some filter that I am applying that I should not do (I am getting 74 final rows instead of 64).

target_aps.loc[target_aps["Metadata_target_list"] == "HTR3A"]

    average_precision  n_pos_pairs  n_total_pairs Metadata_target_list  
63           0.500000            1             42                 EGFR  
25           0.142857            1             39                HTR2A  
31           1.000000            1             39                CHRM5  
38           0.045455            1             39                HTR1E  
40           0.083333            1             39                HTR2B  
..                ...          ...            ...                  ...  
13           0.105128            2             42                 HRH1  
15           0.076923            1             41                 XIAP  
19           0.071429            1             41                CHRM2  
24           0.500000            1             42                CHRM4  
71           0.100000            1             42                CHRM3  

@afermg
Copy link
Contributor

afermg commented Feb 26, 2025

Just for a sanity check, the data is exactly the same, right? @alxndrkalinin

@afermg
Copy link
Contributor

afermg commented Feb 26, 2025

Got it working thanks to John's help. I was not ordering properly the keys and counts. This commit already includes all the changes. I also fixed a comment in build_rank_list that was mistaken as to what the np.lexsort does. This is the same as the picture above, the main difference is that I'm using Metadata_target_list as sameby name (due to the issue I've mentioned with pandas complaining when trying to group with a column that is a list.

     Metadata_broad_sample                                    Metadata_target  \
52  BRD-A69636825-003-04-7        CACNA1C|CACNA1S|CACNA2D1|CACNG1|HTR3A|KCNA5   
32  BRD-A72309220-001-04-1  HTR1A|HTR1B|HTR1D|HTR1E|HTR1F|HTR2A|HTR2B|HTR2...   
37  BRD-A72309220-001-04-1  HTR1A|HTR1B|HTR1D|HTR1E|HTR1F|HTR2A|HTR2B|HTR2...   
39  BRD-A72309220-001-04-1  HTR1A|HTR1B|HTR1D|HTR1E|HTR1F|HTR2A|HTR2B|HTR2...   
41  BRD-A72309220-001-04-1  HTR1A|HTR1B|HTR1D|HTR1E|HTR1F|HTR2A|HTR2B|HTR2...   
..                     ...                                                ...   
16  BRD-K74363950-004-01-0                      CHRM1|CHRM2|CHRM3|CHRM4|CHRM5   
19  BRD-K74363950-004-01-0                      CHRM1|CHRM2|CHRM3|CHRM4|CHRM5   
22  BRD-K74363950-004-01-0                      CHRM1|CHRM2|CHRM3|CHRM4|CHRM5   
28  BRD-K76908866-001-07-6                                              ERBB2   
61  BRD-K81258678-001-01-0                                               RELA   

    average_precision  n_pos_pairs  n_total_pairs Metadata_target_list  
52           0.500000            1             42                HTR3A  
32           0.406071            4             42                HTR1A  
37           0.142857            1             39                HTR1B  
39           0.142857            1             39                HTR1D  
41           0.142857            1             39                HTR1E  
..                ...          ...            ...                  ...  
16           0.105128            2             42                CHRM3  
19           0.105128            2             42                CHRM4  
22           0.105128            2             42                CHRM5  
28           0.500000            1             42                ERBB2  
61           0.100000            1             42                 RELA  

[64 rows x 6 columns]


Do take notice that I slightly changed the interface, and now find_pairs_multilabel provides keys and counts in addition to pairs. When the multilabel_col is not in sameby it instead returns only pairs.

@afermg
Copy link
Contributor

afermg commented Feb 26, 2025

Opened a pull request on your branch to make it easier. I already rebased alxndrkalinin#3.

@alxndrkalinin
Copy link
Collaborator Author

@afermg thanks, can you also check tests? I get this:

FAILED tests/test_matching_multilabel.py::test_sameby - duckdb.duckdb.BinderException: Binder Error: Referenced column "shared_items" not found in FROM clause!
FAILED tests/test_matching_multilabel.py::test_sameby_other_cols - duckdb.duckdb.BinderException: Binder Error: Referenced column "shared_items" not found in FROM clause!
FAILED tests/test_matching_multilabel.py::test_only_sameby_many_cols - duckdb.duckdb.BinderException: Binder Error: Referenced column "shared_items" not found in FROM clause!

@afermg
Copy link
Contributor

afermg commented Feb 27, 2025

Got it working. It required some minor fixes and adjusting the multilabel test to filter the pairs only when multilabel_col is in sameby.

@alxndrkalinin alxndrkalinin changed the title Fix examples in 0.5.0 v0.5.1 Feb 27, 2025
@alxndrkalinin alxndrkalinin requested a review from Copilot March 2, 2025 18:23

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR Overview

This PR bumps the version to 0.5.1 and fixes issues related to phenotypic activity and consistency examples by addressing a bug in multilabel AP calculation and updating various supporting codes and notebooks.

  • Fix and refactor the multilabel pairing function in src/copairs/matching.py.
  • Update test behavior in tests/test_matching_multilabel.py to account for the modified output structure.
  • Adjust grouping, type casts, and citation details in map modules, pyproject.toml, and README.md, as well as minor updates in example notebooks.

Reviewed Changes

File Description
src/copairs/matching.py Refactored find_pairs_multilabel with SQL condition and variable updates.
tests/test_matching_multilabel.py Modified test to extract the expected tuple element for sameby multilabel cases.
pyproject.toml Updated version from 0.5.0 to 0.5.1.
src/copairs/map/map.py Adjusted type annotations and grouping in mean_average_precision.
README.md Updated citation information.
src/copairs/compute.py Wrapped AP score computation with a np.errstate context for better numerical handling.
src/copairs/map/multilabel.py Updated parameter annotations and ensured proper dtype conversions.
Examples notebooks Updated execution counts and model IDs.

Copilot reviewed 12 out of 12 changed files in this pull request and generated no comments.

Comments suppressed due to low confidence (2)

src/copairs/matching.py:571

  • When shared_item is 'NOT', the SQL condition becomes 'WHERE NOT len(shared_items) > 0'. To ensure correct SQL syntax and readability, consider wrapping the condition in parentheses as 'WHERE NOT (len(shared_items) > 0)'.
f" WHERE {shared_item} len(shared_items) > 0"

src/copairs/matching.py:558

  • [nitpick] The variable 'shared_item' is used as a flag to toggle the NOT operator in the SQL condition. Renaming it to something like 'negation_operator' or 'not_operator' could make its purpose clearer.
shared_item = ""  // or shared_item = "NOT"
@johnarevalo johnarevalo merged commit e494da1 into cytomining:main Mar 10, 2025
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants