feat/primekg loader #49

Merged
merged 25 commits into from
Jan 9, 2025

Conversation

awmulyadi
Contributor

For authors

Description

I have developed a set of loaders for a series of biomedical knowledge-graph datasets: PrimeKG, StarkQA-PrimeKG, and BioBridge-PrimeKG. The key updates are:

  1. Source Code: In aiagents4pharma/talk2knowledgegraphs/datasets, I added an abstract Dataset class and three implementation classes, PrimeKG, StarkQAPrimeKG, and BioBridgePrimeKG, that load the corresponding datasets.
  2. Pytest: I added test cases for these classes under aiagents4pharma/talk2knowledgegraphs/tests/*.
  3. Tutorial Notebooks: I added an interactive notebook showcasing use cases of the classes under docs/notebooks/talk2knowledgegraphs/*.
  4. Documentation: I updated the related MkDocs documentation, which is available in docs/talk2knowledgegraphs/*.
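
For readers unfamiliar with the pattern in item 1, the following is a minimal sketch of an abstract dataset class with one concrete loader. It is not the code from this PR: the method names `setup` and `load_data`, the class `ToyPrimeKG`, and the in-memory toy nodes and edges are all assumptions made for illustration.

```python
from abc import ABC, abstractmethod


class Dataset(ABC):
    """Illustrative abstract base class (method names are assumptions)."""

    @abstractmethod
    def setup(self) -> None:
        """Prepare whatever the loader needs before loading."""

    @abstractmethod
    def load_data(self) -> None:
        """Populate the dataset attributes."""


class ToyPrimeKG(Dataset):
    """Hypothetical loader that fills in a tiny in-memory graph."""

    def __init__(self) -> None:
        self.ready = False
        self.nodes = None
        self.edges = None

    def setup(self) -> None:
        # A real loader might create a local directory or check a cache here.
        self.ready = True

    def load_data(self) -> None:
        if not self.ready:
            raise RuntimeError("call setup() before load_data()")
        # Toy data standing in for downloaded node and edge tables.
        self.nodes = [("n0", "drug"), ("n1", "disease")]
        self.edges = [("n0", "indication", "n1")]


kg = ToyPrimeKG()
kg.setup()
kg.load_data()
```

The abstract base keeps the calling code identical across datasets: any concrete loader can be swapped in as long as it implements the same two methods.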

Fixes #32

Type of change

Please delete options that are not relevant.

  • New feature (non-breaking change which adds functionality)
  • This change requires a documentation update

How Has This Been Tested?

Please describe the tests you conducted to verify your changes. These may involve creating new test scripts or updating existing ones.

  • Added new test(s) in the tests folder
  • Added new function(s) to an existing test(s) (i.e., aiagents4pharma/talk2knowledgegraphs/tests/*)

Checklist

  • My code follows the style guidelines mentioned in the Code/DevOps guides
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation (e.g. MkDocs)
  • My changes generate no new warnings
  • I have added or updated tests (in the tests folder) that prove my fix is effective or that my feature works
  • New and existing tests pass locally with my changes
  • Any dependent changes have been merged and published in downstream modules

For reviewers

Checklist pre-approval

  • Is there enough documentation?
  • If a new feature has been added, or a bug fixed, has a test been added to confirm good behavior?
  • Does the test(s) successfully test edge/corner cases?
  • Does the PR pass the tests? (if the repository has continuous integration)

Checklist post-approval

  • Does this PR merge develop into main? If so, please make sure to add a prefix (feat/fix/chore) and/or a suffix BREAKING CHANGE (if it's a major release) to your commit message.
  • Does this PR close an issue? If so, please make sure to descriptively close this issue when the PR is merged.

Checklist post-merge

  • When you approve of the PR, merge and close it (Read this article to know about different merge methods on GitHub)
  • Did this PR merge develop into main and is it supposed to run an automated release workflow (if applicable)? If so, please make sure to check under the "Actions" tab to see if the workflow has been initiated, and return later to verify that it has completed successfully.

@awmulyadi awmulyadi self-assigned this Jan 8, 2025
@awmulyadi awmulyadi requested a review from dmccloskey January 8, 2025 13:27
Member

@dmccloskey dmccloskey left a comment

Nice work 💪. The notebooks were very easy to follow and demonstrate the functionality well, which will hopefully come in handy for our hackathon participants.

Please take a look at my questions/comments and let me know if anything needs to be clarified. In principle, I think everything looks good, provided the assumptions I note are correct 😉.

Member

Are there any pre-trained embeddings for stark?

Contributor Author

I rechecked the Stark repository, and they provide query and node embeddings generated with 'text-embedding-ada-002'. However, there is no information on edge embeddings. I have updated the StarkQAPrimeKG class to include pre-loaded embeddings of queries and nodes. We would still need to embed the edges (~8M) using 'text-embedding-ada-002' in the next step (KG construction). Alternatively, I am preparing code to embed both nodes and edges using Ollama's nomic-embed-text, to be included in the next PR.

Ref:
https://github.com/snap-stanford/stark/blob/main/emb_download.py
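
As a rough illustration of how pre-loaded query and node embeddings could be used together, here is a minimal sketch that ranks nodes by cosine similarity to a query. It assumes both dictionaries map an integer ID to a plain list of floats; the actual embeddings may be stored in a different type, and the helper names `cosine` and `top_k_nodes` and the toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_k_nodes(query_id, query_emb_dict, node_emb_dict, k=2):
    """Return the k (node_id, score) pairs most similar to the given query."""
    query_vec = query_emb_dict[query_id]
    scored = [(node_id, cosine(query_vec, emb)) for node_id, emb in node_emb_dict.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 3-dimensional vectors standing in for the real embeddings.
query_emb_dict = {0: [1.0, 0.0, 0.0]}
node_emb_dict = {
    10: [0.9, 0.1, 0.0],
    11: [0.0, 1.0, 0.0],
    12: [0.5, 0.5, 0.0],
}

ranked = top_k_nodes(0, query_emb_dict, node_emb_dict, k=2)
```

Because only query and node vectors are involved, a lookup like this does not depend on edge embeddings.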

Member

I see. I do not believe they used edge embeddings in their manuscript (?).

I wouldn't worry about embedding the edges using text-embedding-ada-002 if time is of the essence. It does not seem that it would help us reproduce their work. Please correct me if I am wrong.

Contributor Author

Correct, neither their manuscript nor their code repository mentions edge/relation embeddings, so the methods used in their benchmark most likely did not incorporate these features.

@awmulyadi awmulyadi requested a review from dmccloskey January 9, 2025 12:16
Member

@dmccloskey dmccloskey left a comment

Ready to be merged whenever you are ready.

self.starkqa_node_info: dict = None
self.query_emb_dict: dict = None
self.node_emb_dict: dict = None

Member

Fine to include the relation embeddings in a subsequent PR as mentioned in another comment.


return starkqa, starkqa_split_idx, starkqa_node_info

def _load_stark_embeddings(self) -> tuple:
Member

👍

@awmulyadi awmulyadi merged commit 5320b55 into main Jan 9, 2025
6 checks passed
@awmulyadi awmulyadi deleted the feat/primekg-loader branch January 9, 2025 16:12
Contributor

github-actions bot commented Jan 9, 2025

🎉 This PR is included in version 1.5.0 🎉

The release is available on GitHub release

Your semantic-release bot 📦🚀

Development

Successfully merging this pull request may close these issues.

Example biomedical knowledge graph for training and benchmarking
2 participants