feat: hackathon/Sheng + Inigo PR (Team Sanofi Germany) #134
Conversation
Hi @inigoo18 and Sheng, thanks for the PR 👍 These are thorough, working Jupyter notebooks that perform model annotation by running RAG over SBML and PDF files, with alignment to KG nodes. Just a quick question about the approach implemented here, especially how the neighboring KG nodes are taken into consideration (without training any network).
Good afternoon @awmulyadi! Thanks for the follow-up :) The idea of neighborhood aggregation came from traditional methods like message passing, as well as embedding techniques like Node2Vec and GraphSAGE. Since I'm used to working with GNNs, and we had species and nodes in vector format, this approach seemed like a good fit. For instance, in this paper the authors process text documents through a relation graph and apply convolutional operations on top to obtain embeddings that reflect the surrounding structure. This other paper might also interest you: they aggregate features from neighboring text nodes later in the pipeline.

It is true that the proportions we use are fixed; we didn't have enough time to push for a more sophisticated working solution. While creating the notebook, we also identified a major weak point of our approach: assigning the same weight to every N-hop edge is unlikely to give an optimal embedding. Instead, each edge could carry an individual weight, higher the more related the neighbor node is to the current node. It would be interesting to let a deep learning model assign these weights automatically, based on attention.

One last thing: the BERT model we use isn't fine-tuned on domain-specific biological data, so fine-tuning it might improve the quality of the embeddings. Hope that was useful; feel free to reach out with any other questions :)
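To make the attention idea concrete, here is a minimal sketch (not code from the PR): each neighbor is scored by its scaled dot-product similarity to the center node, the scores are softmax-normalized, and the weighted neighbor mean is blended with the center embedding. The 0.5/0.5 blend and the function name `attention_aggregate` are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def attention_aggregate(center: torch.Tensor, neighbors: list[torch.Tensor]) -> torch.Tensor:
    """Blend a node embedding with an attention-weighted mean of its neighbors.

    Illustrative only: scores are scaled dot products against the center node,
    normalized with a softmax, so closely related neighbors weigh more.
    """
    if not neighbors:
        return center
    stacked = torch.stack(neighbors)                      # (k, d)
    scores = stacked @ center / center.shape[0] ** 0.5    # (k,) similarity scores
    weights = F.softmax(scores, dim=0)                    # (k,) attention weights
    neighbor_mix = (weights.unsqueeze(1) * stacked).sum(dim=0)
    return 0.5 * center + 0.5 * neighbor_mix              # fixed blend, for illustration
```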
This PR first obtains additional background on each species using GenAI. Then we use a BERT model to embed these descriptions into numerical representations.
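A minimal sketch of that embedding step, assuming the Hugging Face `transformers` API and the generic `bert-base-uncased` checkpoint (the exact model and pooling used in the notebooks may differ):

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def embed_text(text: str) -> torch.Tensor:
    """Embed a text into a single vector by mean-pooling BERT's last hidden states."""
    inputs = tokenizer(text, truncation=True, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state.mean(dim=1).squeeze(0)  # (hidden_size,)

# `species_background` stands in for the GenAI-generated descriptions.
species_background = {"glucose": "Glucose is a simple sugar and a key energy source ..."}
species_embeddings = {name: embed_text(text) for name, text in species_background.items()}
```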
As for the KG, we first enrich it using STARK (that notebook was already present in the project), then embed each node's enriched textual representation with BERT. Note that we also take the neighboring nodes into consideration, so each node's embedding accounts for related terms.
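The fixed-proportion neighborhood aggregation could look roughly like this; the blending factor `alpha` and the use of `networkx` are assumptions for illustration, not the notebook's exact values:

```python
import networkx as nx
import torch

def aggregate_neighbors(graph: nx.Graph, node_emb: dict, alpha: float = 0.7) -> dict:
    """Blend each node's embedding with the mean of its 1-hop neighbors' embeddings.

    Every neighbor gets the same fixed proportion (1 - alpha), which is the
    limitation discussed in the conversation above.
    """
    aggregated = {}
    for node, emb in node_emb.items():
        neighbor_embs = [node_emb[n] for n in graph.neighbors(node) if n in node_emb]
        if neighbor_embs:
            aggregated[node] = alpha * emb + (1 - alpha) * torch.stack(neighbor_embs).mean(dim=0)
        else:
            aggregated[node] = emb
    return aggregated
```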
Finally, we compare each species we want to annotate against all nodes in the KG by similarity, fetch the most similar node, and return its associated code.
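That matching step could be sketched as a cosine-similarity search; the dictionary layout and the CHEBI identifiers below are placeholders:

```python
import torch
import torch.nn.functional as F

def best_match(species_vec: torch.Tensor, kg_embeddings: dict) -> tuple:
    """Return the (code, score) of the KG node most similar to the species vector."""
    codes = list(kg_embeddings.keys())
    matrix = torch.stack([kg_embeddings[c] for c in codes])          # (n_nodes, d)
    scores = F.cosine_similarity(matrix, species_vec.unsqueeze(0))   # (n_nodes,)
    best = int(scores.argmax())
    return codes[best], float(scores[best])

# Toy usage: random vectors stand in for the real BERT-based embeddings.
kg_embeddings = {"CHEBI:17234": torch.randn(768), "CHEBI:15377": torch.randn(768)}
code, score = best_match(torch.randn(768), kg_embeddings)
print(code, score)
```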