feat/primekg loader #49
Merged: +9,773 −26
Changes from all commits (25 commits)
2e23586  feat: add primekg loader tool (awmulyadi)
114d1f5  chpre: fix comments to load dataframes (awmulyadi)
e2f9298  fix: change file_id of dataverse (awmulyadi)
989d2c4  feat: add starkqa-primekg loader tool (awmulyadi)
1c01cea  chore: merge changes from main (awmulyadi)
95c2af0  feat: add tutorials of talk2knowledgegraphs (awmulyadi)
2bae5b3  fix: execute a notebook (awmulyadi)
2c82016  feat: include processed node information of StarkQA-PrimeKG (awmulyadi)
d931aab  fix: update the approach to load PrimeKG (awmulyadi)
319beca  fix: reorganize the dataset loaders (awmulyadi)
27deb4e  feat: add biobridge-primekg loader (awmulyadi)
3e23007  chore: merge changes from main (awmulyadi)
29df106  feat: update mkdocs for talk2knowledgegraphs (awmulyadi)
783bc33  feat: update gha workflows (awmulyadi)
542b38d  fix: modify workflows (awmulyadi)
4b2dd9b  fix: remove argument for negative sampling (awmulyadi)
68a434b  fix: remove check-path workflows (awmulyadi)
ac06b05  fix: modify coverage (awmulyadi)
9e15375  fix: modify coverage over multi-os (awmulyadi)
40cce60  feat: add embeddings for starkqa, node info for biobridge (awmulyadi)
f474202  chore: merge changes from main (awmulyadi)
9bf8730  fix: change torch version (awmulyadi)
ad48b53  fix: update workflow for installing reqs (awmulyadi)
d7e6eb9  fix: update biobridge (awmulyadi)
632b625  fix: remove bash on workflows (awmulyadi)
New file (124 lines): the GitHub Actions workflow "TESTS Talk2BioModels":

# This is a basic workflow to help you get started with GitHub Actions
name: TESTS Talk2BioModels

# Controls when the workflow will run
on:
  # Triggers the workflow on push or pull request events
  pull_request:
    branches: [ main ]
    paths:
      - 'aiagents4pharma/talk2biomodels/**'

  # Allows you to run this workflow manually from the Actions tab
  workflow_dispatch:

env:
  OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}

# This workflow contains jobs covering linting and code coverage (along with testing).
jobs:
  pylint-windows-ubuntu-macos:
    # The type of runner that the job will run on
    name: pylint-windows-ubuntu-macos
    runs-on: ${{ matrix.os }}

    strategy:
      matrix:
        os: [windows-latest, ubuntu-latest, macos-latest]

    # Steps represent a sequence of tasks that will be executed as part of the job
    steps:
      # Checks-out your repository under $GITHUB_WORKSPACE, so your job can access it
      - uses: actions/checkout@v3

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: 3.12

      # install requirements
      - name: Install the requirements
        run: |
          pip3 install --break-system-packages -r requirements.txt

      # pylint
      - name: Run pylint
        run: |
          pylint --disable=R0801,R0902,W0221,W0122 aiagents4pharma/talk2biomodels

  # code coverage job for ubuntu and macos
  code-cov-ubuntu-macos:
    name: code-coverage-ubuntu-macos
    runs-on: ${{ matrix.os }}

    strategy:
      matrix:
        os: [ubuntu-latest, macos-latest]

    steps:
      - name: Checkout code
        uses: actions/checkout@v3

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: 3.12

      - name: Install dependencies
        run: pip3 install -r requirements.txt # Adjust this according to your project

      - name: Run tests with coverage
        run: coverage run --include=aiagents4pharma/talk2biomodels/* -m pytest --cache-clear aiagents4pharma/talk2biomodels/tests/

      - name: Check coverage
        run: |
          coverage report -m
          TOTAL_COVERAGE=$(coverage report -m | awk 'END {print int($NF)}')
          if [[ $TOTAL_COVERAGE -ne 100 ]]; then
            echo "Code coverage is not 100%. Please check the coverage report."
            exit 1
          fi
        env:
          COVERAGE_FILE: './.coverage'

  # code coverage job for windows
  code-cov-windows:
    name: code-coverage-windows
    runs-on: ${{ matrix.os }}

    strategy:
      matrix:
        os: [windows-latest]

    steps:
      - name: Checkout code
        uses: actions/checkout@v3

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: 3.x

      - name: Install dependencies
        run: pip3 install -r requirements.txt # Adjust this according to your project

      - name: Run tests with coverage
        run: coverage run --include=aiagents4pharma/talk2biomodels/* -m pytest --cache-clear aiagents4pharma/talk2biomodels/tests/

      - name: Check coverage
        run: |
          coverage report -m
          # Extract the last line of the coverage report
          $TOTAL_COVERAGE=(& coverage report -m | Select-Object -Last 1)
          # split and extract the last element
          $TOTAL_COVERAGE=($TOTAL_COVERAGE -split " ")[-1]
          # remove non-numeric characters
          $TOTAL_COVERAGE=($TOTAL_COVERAGE -replace "[^\d]")
          # convert to int
          $TOTAL_COVERAGE=[int]$TOTAL_COVERAGE
          echo "Total coverage: $TOTAL_COVERAGE"
          if ($TOTAL_COVERAGE -ne 100) {
            Write-Host "Code coverage is not 100%. Please check the coverage report."
            exit 1
          }
        env:
          COVERAGE_FILE: './.coverage'
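For contributors who want to reproduce the CI coverage gate locally, here is a hedged sketch using coverage.py's Python API instead of the awk/PowerShell parsing above; it assumes the tests have already been run under coverage run as in the workflow:

# Sketch: reproduce the workflow's 100% coverage gate locally.
# Assumes `coverage run ... -m pytest ...` has already produced ./.coverage.
import sys
from coverage import Coverage

cov = Coverage(data_file=".coverage")
cov.load()
total = cov.report()  # prints the report and returns the total percentage as a float
if int(total) != 100:
    print("Code coverage is not 100%. Please check the coverage report.")
    sys.exit(1)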
New file (124 lines): the GitHub Actions workflow "TESTS Talk2KnowledgeGraphs". It is identical to the TESTS Talk2BioModels workflow, except that every occurrence of talk2biomodels is replaced by talk2knowledgegraphs: the name field, the pull_request path filter ('aiagents4pharma/talk2knowledgegraphs/**'), the pylint target, and the coverage --include and pytest test paths.
aiagents4pharma/__init__.py (updated): the docstring now covers all aiagents4pharma modules (it previously mentioned only talk2biomodels), and talk2knowledgegraphs joins the imports:

'''
This file is used to import aiagents4pharma modules.
'''

from . import talk2biomodels
from . import talk2cells
from . import talk2knowledgegraphs
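For orientation, a minimal sketch of how these modules are imported once the package is installed; the import paths are taken from the __init__ files in this diff:

# Sketch: the agent modules re-exported by the top-level __init__.
from aiagents4pharma import talk2biomodels, talk2cells, talk2knowledgegraphs

# The new dataset loaders added in this PR live one level down.
from aiagents4pharma.talk2knowledgegraphs import datasets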
aiagents4pharma/talk2knowledgegraphs/__init__.py (new file, 4 lines):

'''
This file is used to import the datasets, utils, and tools.
'''
from . import datasets
aiagents4pharma/talk2knowledgegraphs/datasets/__init__.py (new file, 7 lines):

'''
This file is used to import all the datasets in the package.
'''
from . import dataset
from . import primekg
from . import starkqa_primekg
from . import biobridge_primekg
aiagents4pharma/talk2knowledgegraphs/datasets/biobridge_primekg.py (new file, 553 additions; large diff not rendered).
aiagents4pharma/talk2knowledgegraphs/datasets/dataset.py (new file, 23 lines):

#!/usr/bin/env python3

"""
Abstract class for dataset.
"""

from abc import ABC, abstractmethod

class Dataset(ABC):
    """
    Abstract class for dataset.
    """
    @abstractmethod
    def setup(self):
        """
        A method to set up the dataset.
        """

    @abstractmethod
    def load_data(self):
        """
        A method to load the dataset and potentially preprocess it.
        """
aiagents4pharma/talk2knowledgegraphs/datasets/primekg.py (new file, 201 lines):

"""
Class for loading PrimeKG dataset.
"""

import os
import requests
from tqdm import tqdm
import pandas as pd
from .dataset import Dataset

class PrimeKG(Dataset):
    """
    Class for loading PrimeKG dataset.
    It downloads the data from the Harvard Dataverse and stores it in the local directory.
    The data is then loaded into pandas DataFrame of nodes and edges.
    """

    def __init__(self, local_dir: str = "../../../data/primekg/"):
        """
        Constructor for PrimeKG class.

        Args:
            local_dir (str): The local directory where the data will be stored.
        """
        self.name: str = "primekg"
        self.server_path: str = "https://dataverse.harvard.edu/api/access/datafile/"
        self.file_ids: dict = {"nodes": 6180617, "edges": 6180616}
        self.local_dir: str = local_dir

        # Attributes to store the data
        self.nodes: pd.DataFrame = None
        self.edges: pd.DataFrame = None

        # Set up the dataset
        self.setup()

    def setup(self):
        """
        A method to set up the dataset.
        """
        # Make the directory if it doesn't exist
        os.makedirs(os.path.dirname(self.local_dir), exist_ok=True)

    def _download_file(self, remote_url: str, local_path: str):
        """
        A helper function to download a file from remote URL to the local directory.

        Args:
            remote_url (str): The remote URL of the file to be downloaded.
            local_path (str): The local path where the file will be saved.
        """
        response = requests.get(remote_url, stream=True, timeout=300)
        response.raise_for_status()
        progress_bar = tqdm(
            total=int(response.headers.get("content-length", 0)),
            unit="iB",
            unit_scale=True,
        )
        with open(local_path, "wb") as file:
            for data in response.iter_content(1024):
                progress_bar.update(len(data))
                file.write(data)
        progress_bar.close()

    def _load_nodes(self) -> pd.DataFrame:
        """
        Private method to load the nodes dataframe of PrimeKG dataset.
        This method downloads the nodes file from the Harvard Dataverse if it does not exist
        in the local directory. Otherwise, it loads the data from the local directory.
        It further processes the dataframe of nodes and returns it.

        Returns:
            pd.DataFrame: The nodes dataframe of PrimeKG dataset.
        """
        local_file = os.path.join(self.local_dir, f"{self.name}_nodes.tsv.gz")
        if os.path.exists(local_file):
            print(f"{local_file} already exists. Loading the data from the local directory.")

            # Load the dataframe from the local directory and assign it to the nodes attribute
            nodes = pd.read_csv(local_file, sep="\t", compression="gzip", low_memory=False)
        else:
            print(f"Downloading node file from {self.server_path}{self.file_ids['nodes']}")

            # Download the file from the Harvard Dataverse with designated file_id for node
            self._download_file(f"{self.server_path}{self.file_ids['nodes']}",
                                os.path.join(self.local_dir, "nodes.tab"))

            # Load the downloaded file into a pandas DataFrame
            nodes = pd.read_csv(os.path.join(self.local_dir, "nodes.tab"),
                                sep="\t", low_memory=False)

            # Further processing of the dataframe
            nodes = nodes[
                ["node_index", "node_name", "node_source", "node_id", "node_type"]
            ]

            # Store compressed dataframe in the local directory
            nodes.to_csv(local_file, index=False, sep="\t", compression="gzip")

        return nodes

    def _load_edges(self, nodes: pd.DataFrame) -> pd.DataFrame:
        """
        Private method to load the edges dataframe of PrimeKG dataset.
        This method downloads the edges file from the Harvard Dataverse if it does not exist
        in the local directory. Otherwise, it loads the data from the local directory.
        It further processes the dataframe of edges and returns it.

        Args:
            nodes (pd.DataFrame): The nodes dataframe of PrimeKG dataset.

        Returns:
            pd.DataFrame: The edges dataframe of PrimeKG dataset.
        """
        local_file = os.path.join(self.local_dir, f"{self.name}_edges.tsv.gz")
        if os.path.exists(local_file):
            print(f"{local_file} already exists. Loading the data from the local directory.")

            # Load the dataframe from the local directory and assign it to the edges attribute
            edges = pd.read_csv(local_file, sep="\t", compression="gzip", low_memory=False)
        else:
            print(f"Downloading edge file from {self.server_path}{self.file_ids['edges']}")

            # Download the file from the Harvard Dataverse with designated file_id for edge
            self._download_file(f"{self.server_path}{self.file_ids['edges']}",
                                os.path.join(self.local_dir, "edges.csv"))

            # Load the downloaded file into a pandas DataFrame
            edges = pd.read_csv(os.path.join(self.local_dir, "edges.csv"),
                                sep=",", low_memory=False)

            # Further processing of the dataframe
            edges = edges.merge(
                nodes, left_on="x_index", right_on="node_index"
            )
            edges.drop(["x_index"], axis=1, inplace=True)
            edges.rename(
                columns={
                    "node_index": "head_index",
                    "node_name": "head_name",
                    "node_source": "head_source",
                    "node_id": "head_id",
                    "node_type": "head_type",
                },
                inplace=True,
            )
            edges = edges.merge(
                nodes, left_on="y_index", right_on="node_index"
            )
            edges.drop(["y_index"], axis=1, inplace=True)
            edges.rename(
                columns={
                    "node_index": "tail_index",
                    "node_name": "tail_name",
                    "node_source": "tail_source",
                    "node_id": "tail_id",
                    "node_type": "tail_type"
                },
                inplace=True,
            )
            edges = edges[
                [
                    "head_index", "head_name", "head_source", "head_id", "head_type",
                    "tail_index", "tail_name", "tail_source", "tail_id", "tail_type",
                    "display_relation", "relation",
                ]
            ]

            # Store compressed dataframe in the local directory
            edges.to_csv(local_file, index=False, sep="\t", compression="gzip")

        return edges

    def load_data(self):
        """
        Load the PrimeKG dataset into pandas DataFrame of nodes and edges.
        """
        print("Loading nodes of PrimeKG dataset ...")
        self.nodes = self._load_nodes()

        print("Loading edges of PrimeKG dataset ...")
        self.edges = self._load_edges(self.nodes)

    def get_nodes(self) -> pd.DataFrame:
        """
        Get the nodes dataframe of PrimeKG dataset.

        Returns:
            pd.DataFrame: The nodes dataframe of PrimeKG dataset.
        """
        return self.nodes

    def get_edges(self) -> pd.DataFrame:
        """
        Get the edges dataframe of PrimeKG dataset.

        Returns:
            pd.DataFrame: The edges dataframe of PrimeKG dataset.
        """
        return self.edges
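Using the class above is a two-step affair: construct it (which creates the local directory) and call load_data() (which downloads on first use and reads the cached TSVs afterwards). A minimal sketch, using only the public API shown above:

from aiagents4pharma.talk2knowledgegraphs.datasets.primekg import PrimeKG

# First call downloads nodes/edges from the Harvard Dataverse; subsequent calls
# load the cached primekg_nodes.tsv.gz / primekg_edges.tsv.gz instead.
primekg = PrimeKG(local_dir="../data/primekg/")
primekg.load_data()

nodes = primekg.get_nodes()  # node_index, node_name, node_source, node_id, node_type
edges = primekg.get_edges()  # head_*/tail_* columns plus display_relation and relation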
aiagents4pharma/talk2knowledgegraphs/datasets/starkqa_primekg.py (new file, 198 lines):

"""
Class for loading StarkQAPrimeKG dataset.
"""

import os
import shutil
import pickle
import numpy as np
import pandas as pd
from tqdm import tqdm
import torch
from huggingface_hub import hf_hub_download, list_repo_files
import gdown
from .dataset import Dataset

class StarkQAPrimeKG(Dataset):
    """
    Class for loading StarkQAPrimeKG dataset.
    It downloads the data from the HuggingFace repo and stores it in the local directory.
    The data is then loaded into pandas DataFrame of QA pairs, dictionary of split indices,
    and node information.
    """

    def __init__(self, local_dir: str = "../../../data/starkqa_primekg/"):
        """
        Constructor for StarkQAPrimeKG class.

        Args:
            local_dir (str): The local directory to store the dataset files.
        """
        self.name: str = "starkqa_primekg"
        self.hf_repo_id: str = "snap-stanford/stark"
        self.local_dir: str = local_dir
        # Attributes to store the data
        self.starkqa: pd.DataFrame = None
        self.starkqa_split_idx: dict = None
        self.starkqa_node_info: dict = None
        self.query_emb_dict: dict = None
        self.node_emb_dict: dict = None

        # Set up the dataset
        self.setup()

    def setup(self):
        """
        A method to set up the dataset.
        """
        # Make the directory if it doesn't exist
        os.makedirs(os.path.dirname(self.local_dir), exist_ok=True)

    def _load_stark_repo(self) -> tuple:
        """
        Private method to load related files of StarkQAPrimeKG dataset.

        Returns:
            tuple: The QA dataframe, the split indices, and the node information.
        """
        # Download the file if it does not exist in the local directory
        # Otherwise, load the data from the local directory
        local_file = os.path.join(self.local_dir, "qa/prime/stark_qa/stark_qa.csv")
        if os.path.exists(local_file):
            print(f"{local_file} already exists. Loading the data from the local directory.")
        else:
            print(f"Downloading files from {self.hf_repo_id}")

            # List all related files in the HuggingFace Hub repository
            files = list_repo_files(self.hf_repo_id, repo_type="dataset")
            files = [f for f in files if ((f.startswith("qa/prime/") or
                                           f.startswith("skb/prime/")) and f.find("raw") == -1)]

            # Download and save each file in the specified folder
            for file in tqdm(files):
                _ = hf_hub_download(self.hf_repo_id,
                                    file,
                                    repo_type="dataset",
                                    local_dir=self.local_dir)

            # Unzip the processed files
            shutil.unpack_archive(
                os.path.join(self.local_dir, "skb/prime/processed.zip"),
                os.path.join(self.local_dir, "skb/prime/")
            )

        # Load StarkQA dataframe
        starkqa = pd.read_csv(
            os.path.join(self.local_dir, "qa/prime/stark_qa/stark_qa.csv"),
            low_memory=False)

        # Read split indices
        qa_indices = sorted(starkqa['id'].tolist())
        starkqa_split_idx = {}
        for split in ['train', 'val', 'test', 'test-0.1']:
            indices_file = os.path.join(self.local_dir, "qa/prime/split", f'{split}.index')
            with open(indices_file, 'r', encoding='utf-8') as f:
                indices = f.read().strip().split('\n')
            query_ids = [int(idx) for idx in indices]
            starkqa_split_idx[split] = np.array(
                [qa_indices.index(query_id) for query_id in query_ids]
            )

        # Load the node info of PrimeKG preprocessed for StarkQA
        with open(os.path.join(self.local_dir, 'skb/prime/processed/node_info.pkl'), 'rb') as f:
            starkqa_node_info = pickle.load(f)

        return starkqa, starkqa_split_idx, starkqa_node_info

    def _load_stark_embeddings(self) -> tuple:
        """
        Private method to load the embeddings of StarkQAPrimeKG dataset.

        Returns:
            tuple: A tuple of query and node embeddings dictionaries.
        """
        # Load the provided embeddings of query and nodes
        # Note that they utilized 'text-embedding-ada-002' for embeddings
        emb_model = 'text-embedding-ada-002'
        query_emb_url = 'https://drive.google.com/uc?id=1MshwJttPZsHEM2cKA5T13SIrsLeBEdyU'
        node_emb_url = 'https://drive.google.com/uc?id=16EJvCMbgkVrQ0BuIBvLBp-BYPaye-Edy'

        # Prepare respective directories to store the embeddings
        emb_dir = os.path.join(self.local_dir, emb_model)
        query_emb_dir = os.path.join(emb_dir, "query")
        node_emb_dir = os.path.join(emb_dir, "doc")
        os.makedirs(query_emb_dir, exist_ok=True)
        os.makedirs(node_emb_dir, exist_ok=True)
        query_emb_path = os.path.join(query_emb_dir, "query_emb_dict.pt")
        node_emb_path = os.path.join(node_emb_dir, "candidate_emb_dict.pt")

        # Download the embeddings if they do not exist in the local directory
        if not os.path.exists(query_emb_path) or not os.path.exists(node_emb_path):
            # Download the query embeddings
            gdown.download(query_emb_url, query_emb_path, quiet=False)

            # Download the node embeddings
            gdown.download(node_emb_url, node_emb_path, quiet=False)

        # Load the embeddings
        query_emb_dict = torch.load(query_emb_path)
        node_emb_dict = torch.load(node_emb_path)

        return query_emb_dict, node_emb_dict

    def load_data(self):
        """
        Load the StarkQAPrimeKG dataset into pandas DataFrame of QA pairs,
        dictionary of split indices, and node information.
        """
        print("Loading StarkQAPrimeKG dataset...")
        self.starkqa, self.starkqa_split_idx, self.starkqa_node_info = self._load_stark_repo()

        print("Loading StarkQAPrimeKG embeddings...")
        self.query_emb_dict, self.node_emb_dict = self._load_stark_embeddings()

    def get_starkqa(self) -> pd.DataFrame:
        """
        Get the dataframe of StarkQAPrimeKG dataset, containing the QA pairs.

        Returns:
            pd.DataFrame: The QA pairs dataframe of StarkQAPrimeKG dataset.
        """
        return self.starkqa

    def get_starkqa_split_indicies(self) -> dict:
        """
        Get the split indices of StarkQAPrimeKG dataset.

        Returns:
            dict: The split indices of StarkQAPrimeKG dataset.
        """
        return self.starkqa_split_idx

    def get_starkqa_node_info(self) -> dict:
        """
        Get the node information of StarkQAPrimeKG dataset.

        Returns:
            dict: The node information of StarkQAPrimeKG dataset.
        """
        return self.starkqa_node_info

    def get_query_embeddings(self) -> dict:
        """
        Get the query embeddings of StarkQAPrimeKG dataset.

        Returns:
            dict: The query embeddings of StarkQAPrimeKG dataset.
        """
        return self.query_emb_dict

    def get_node_embeddings(self) -> dict:
        """
        Get the node embeddings of StarkQAPrimeKG dataset.

        Returns:
            dict: The node embeddings of StarkQAPrimeKG dataset.
        """
        return self.node_emb_dict
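The StarkQA loader follows the same construct-then-load pattern; a short sketch of the accessors defined above, with the quoted sizes taken from the tests later in this PR:

from aiagents4pharma.talk2knowledgegraphs.datasets.starkqa_primekg import StarkQAPrimeKG

# Downloads the QA files from the HuggingFace Hub and the
# text-embedding-ada-002 embeddings from Google Drive on first use.
starkqa = StarkQAPrimeKG(local_dir="../data/starkqa_primekg/")
starkqa.load_data()

qa_df = starkqa.get_starkqa()                     # QA pairs (11,204 rows per the tests)
split_idx = starkqa.get_starkqa_split_indicies()  # 'train', 'val', 'test', 'test-0.1'
node_info = starkqa.get_starkqa_node_info()
query_emb = starkqa.get_query_embeddings()        # 1536-dim ada-002 vectors
node_emb = starkqa.get_node_embeddings()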
New packaging file (43 lines) defining the aiagents4pharma-talk2knowledgegraphs distribution:

[build-system]
requires = ["setuptools", "wheel"]
build-backend = "setuptools.build_meta"

[project]
name = "aiagents4pharma-talk2knowledgegraphs"
description = "AI Agents for drug discovery, drug development, and other pharmaceutical R&D"
readme = "README.md"
requires-python = ">=3.10"
classifiers = [
    "Programming Language :: Python :: 3",
    "License :: OSI Approved :: MIT License",
    "Operating System :: OS Independent",
]
dependencies = [
    "coverage==7.6.4",
    "gdown==5.2.0",
    "huggingface_hub==0.26.5",
    "joblib==1.4.2",
    "pandas==2.2.3",
    "pydantic==2.9.2",
    "pylint==3.3.1",
    "pytest==8.3.3",
    "torch==2.5.1",
    "tqdm==4.66.6",
    "mkdocs==1.6.1",
    "mkdocs-jupyter==0.25.1",
    "mkdocs-material==9.5.47",
    "mkdocstrings-python==1.12.2",
    "mkdocs-include-markdown-plugin==7.1.2",
    "mkdocstrings==0.27.0"
]
dynamic = ["version"]

[tool.setuptools.dynamic]
version = {file = "release_version.txt"}

# find packages
[tool.setuptools]
packages = ["aiagents4pharma",
    "aiagents4pharma.talk2knowledgegraphs",
    "aiagents4pharma.talk2knowledgegraphs.datasets",
    "aiagents4pharma.talk2knowledgegraphs.utils"]
One new empty file is also added (no content to display).
aiagents4pharma/talk2knowledgegraphs/tests/test_biobridge_primekg_loader.py (new file, 242 lines):

"""
Test cases for biobridge_primekg.py
"""

import os
import shutil
import pytest
from ..datasets.biobridge_primekg import BioBridgePrimeKG

# Remove the data folder for testing if it exists
PRIMEKG_LOCAL_DIR = "../data/primekg_test/"
LOCAL_DIR = "../data/biobridge_primekg_test/"
shutil.rmtree(LOCAL_DIR, ignore_errors=True)

@pytest.fixture(name="biobridge_primekg")
def biobridge_primekg_fixture():
    """
    Fixture for creating an instance of BioBridgePrimeKG.
    """
    return BioBridgePrimeKG(primekg_dir=PRIMEKG_LOCAL_DIR,
                            local_dir=LOCAL_DIR)

def test_download_primekg(biobridge_primekg):
    """
    Test the loading method of the BioBridge-PrimeKG class by downloading data from repository.
    """
    # Load BioBridge-PrimeKG data
    biobridge_primekg.load_data()
    primekg_nodes = biobridge_primekg.get_primekg().get_nodes()
    primekg_edges = biobridge_primekg.get_primekg().get_edges()
    biobridge_data_config = biobridge_primekg.get_data_config()
    biobridge_emb_dict = biobridge_primekg.get_node_embeddings()
    biobridge_triplets = biobridge_primekg.get_primekg_triplets()
    biobridge_splits = biobridge_primekg.get_train_test_split()
    biobridge_node_info = biobridge_primekg.get_node_info_dict()

    # Check if the local directories exist
    assert os.path.exists(biobridge_primekg.primekg_dir)
    assert os.path.exists(biobridge_primekg.local_dir)
    # Check if downloaded and processed files exist
    # PrimeKG files
    files = ["nodes.tab", "primekg_nodes.tsv.gz",
             "edges.csv", "primekg_edges.tsv.gz"]
    for file in files:
        path = f"{biobridge_primekg.primekg_dir}/{file}"
        assert os.path.exists(path)
    # BioBridge data config
    assert os.path.exists(f"{biobridge_primekg.local_dir}/data_config.json")
    # BioBridge embeddings
    files = [
        "protein.pkl",
        "mf.pkl",
        "cc.pkl",
        "bp.pkl",
        "drug.pkl",
        "disease.pkl",
        "embedding_dict.pkl"
    ]
    for file in files:
        path = f"{biobridge_primekg.local_dir}/embeddings/{file}"
        assert os.path.exists(path)
    # BioBridge processed files
    files = [
        "protein.csv",
        "mf.csv",
        "cc.csv",
        "bp.csv",
        "drug.csv",
        "disease.csv",
        "triplet_full.tsv.gz",
        "triplet_full_altered.tsv.gz",
        "node_train.tsv.gz",
        "triplet_train.tsv.gz",
        "node_test.tsv.gz",
        "triplet_test.tsv.gz",
    ]
    for file in files:
        path = f"{biobridge_primekg.local_dir}/processed/{file}"
        assert os.path.exists(path)
    # Check processed PrimeKG dataframes
    # Nodes
    assert primekg_nodes is not None
    assert len(primekg_nodes) > 0
    assert primekg_nodes.shape[0] == 129375
    # Edges
    assert primekg_edges is not None
    assert len(primekg_edges) > 0
    assert primekg_edges.shape[0] == 8100498
    # Check processed BioBridge data config
    assert biobridge_data_config is not None
    assert len(biobridge_data_config) > 0
    assert len(biobridge_data_config['node_type']) == 10
    assert len(biobridge_data_config['relation_type']) == 18
    assert len(biobridge_data_config['emb_dim']) == 6
    # Check processed BioBridge embeddings
    assert biobridge_emb_dict is not None
    assert len(biobridge_emb_dict) > 0
    assert len(biobridge_emb_dict) == 85466
    # Check processed BioBridge triplets
    assert biobridge_triplets is not None
    assert len(biobridge_triplets) > 0
    assert biobridge_triplets.shape[0] == 3904610
    assert list(biobridge_splits.keys()) == ['train', 'node_train', 'test', 'node_test']
    assert len(biobridge_splits['train']) == 3510930
    assert len(biobridge_splits['node_train']) == 76486
    assert len(biobridge_splits['test']) == 393680
    assert len(biobridge_splits['node_test']) == 8495
    # Check node info dictionary
    assert list(biobridge_node_info.keys()) == ['gene/protein',
                                                'molecular_function',
                                                'cellular_component',
                                                'biological_process',
                                                'drug',
                                                'disease']
    assert len(biobridge_node_info['gene/protein']) == 19162
    assert len(biobridge_node_info['molecular_function']) == 10966
    assert len(biobridge_node_info['cellular_component']) == 4013
    assert len(biobridge_node_info['biological_process']) == 27478
    assert len(biobridge_node_info['drug']) == 6948
    assert len(biobridge_node_info['disease']) == 44133

def test_load_existing_primekg(biobridge_primekg):
    """
    Test the loading method of the BioBridge-PrimeKG class by loading existing data
    from the local directory.
    """
    # Load BioBridge-PrimeKG data
    biobridge_primekg.load_data()
    primekg_nodes = biobridge_primekg.get_primekg().get_nodes()
    primekg_edges = biobridge_primekg.get_primekg().get_edges()
    biobridge_data_config = biobridge_primekg.get_data_config()
    biobridge_emb_dict = biobridge_primekg.get_node_embeddings()
    biobridge_triplets = biobridge_primekg.get_primekg_triplets()
    biobridge_splits = biobridge_primekg.get_train_test_split()
    biobridge_node_info = biobridge_primekg.get_node_info_dict()

    # Check if the local directories exist
    assert os.path.exists(biobridge_primekg.primekg_dir)
    assert os.path.exists(biobridge_primekg.local_dir)
    # Check if downloaded and processed files exist
    # PrimeKG files
    files = ["nodes.tab", "primekg_nodes.tsv.gz",
             "edges.csv", "primekg_edges.tsv.gz"]
    for file in files:
        path = f"{biobridge_primekg.primekg_dir}/{file}"
        assert os.path.exists(path)
    # BioBridge data config
    assert os.path.exists(f"{biobridge_primekg.local_dir}/data_config.json")
    # BioBridge embeddings
    files = [
        "protein.pkl",
        "mf.pkl",
        "cc.pkl",
        "bp.pkl",
        "drug.pkl",
        "disease.pkl",
        "embedding_dict.pkl"
    ]
    for file in files:
        path = f"{biobridge_primekg.local_dir}/embeddings/{file}"
        assert os.path.exists(path)
    # BioBridge processed files
    files = [
        "protein.csv",
        "mf.csv",
        "cc.csv",
        "bp.csv",
        "drug.csv",
        "disease.csv",
        "triplet_full.tsv.gz",
        "triplet_full_altered.tsv.gz",
        "node_train.tsv.gz",
        "triplet_train.tsv.gz",
        "node_test.tsv.gz",
        "triplet_test.tsv.gz",
    ]
    for file in files:
        path = f"{biobridge_primekg.local_dir}/processed/{file}"
        assert os.path.exists(path)
    # Check processed PrimeKG dataframes
    # Nodes
    assert primekg_nodes is not None
    assert len(primekg_nodes) > 0
    assert primekg_nodes.shape[0] == 129375
    # Edges
    assert primekg_edges is not None
    assert len(primekg_edges) > 0
    assert primekg_edges.shape[0] == 8100498
    # Check processed BioBridge data config
    assert biobridge_data_config is not None
    assert len(biobridge_data_config) > 0
    assert len(biobridge_data_config['node_type']) == 10
    assert len(biobridge_data_config['relation_type']) == 18
    assert len(biobridge_data_config['emb_dim']) == 6
    # Check processed BioBridge embeddings
    assert biobridge_emb_dict is not None
    assert len(biobridge_emb_dict) > 0
    assert len(biobridge_emb_dict) == 85466
    # Check processed BioBridge triplets
    assert biobridge_triplets is not None
    assert len(biobridge_triplets) > 0
    assert biobridge_triplets.shape[0] == 3904610
    assert list(biobridge_splits.keys()) == ['train', 'node_train', 'test', 'node_test']
    assert len(biobridge_splits['train']) == 3510930
    assert len(biobridge_splits['node_train']) == 76486
    assert len(biobridge_splits['test']) == 393680
    assert len(biobridge_splits['node_test']) == 8495
    # Check node info dictionary
    assert list(biobridge_node_info.keys()) == ['gene/protein',
                                                'molecular_function',
                                                'cellular_component',
                                                'biological_process',
                                                'drug',
                                                'disease']
    assert len(biobridge_node_info['gene/protein']) == 19162
    assert len(biobridge_node_info['molecular_function']) == 10966
    assert len(biobridge_node_info['cellular_component']) == 4013
    assert len(biobridge_node_info['biological_process']) == 27478
    assert len(biobridge_node_info['drug']) == 6948
    assert len(biobridge_node_info['disease']) == 44133

# def test_load_existing_primekg_with_negative_triplets(biobridge_primekg):
#     """
#     Test the loading method of the BioBridge-PrimeKG class by loading existing data in local.
#     In addition, it builds negative triplets for training data.
#     """
#     # Load BioBridge-PrimeKG data
#     # Using 1 negative sample per positive triplet
#     biobridge_primekg.load_data(build_neg_triplest=True, n_neg_samples=1)
#     biobridge_neg_triplets = biobridge_primekg.get_primekg_triplets_negative()
#
#     # Check if the local directories exist
#     assert os.path.exists(biobridge_primekg.primekg_dir)
#     assert os.path.exists(biobridge_primekg.local_dir)
#     # Check if downloaded and processed files exist
#     path = f"{biobridge_primekg.local_dir}/processed/triplet_train_negative.tsv.gz"
#     assert os.path.exists(path)
#     # Check processed BioBridge triplets
#     assert biobridge_neg_triplets is not None
#     assert len(biobridge_neg_triplets) > 0
#     assert biobridge_neg_triplets.shape[0] == 3510930
#     assert len(biobridge_neg_triplets.negative_tail_index[0]) == 1
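The 553-line biobridge_primekg.py diff is not rendered on this page, but the tests above pin down its public surface. A hedged usage sketch based solely on the fixture and the accessors exercised by the tests (nothing beyond those calls is shown in this PR):

from aiagents4pharma.talk2knowledgegraphs.datasets.biobridge_primekg import BioBridgePrimeKG

# Reuses a PrimeKG download (primekg_dir) and stores BioBridge artifacts in local_dir.
biobridge = BioBridgePrimeKG(primekg_dir="../data/primekg/",
                             local_dir="../data/biobridge_primekg/")
biobridge.load_data()

primekg_nodes = biobridge.get_primekg().get_nodes()
config = biobridge.get_data_config()         # node_type / relation_type / emb_dim
emb_dict = biobridge.get_node_embeddings()   # dict of per-node embeddings
triplets = biobridge.get_primekg_triplets()  # full triplet dataframe
splits = biobridge.get_train_test_split()    # 'train', 'node_train', 'test', 'node_test'
node_info = biobridge.get_node_info_dict()   # per-node-type info dictionaries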
aiagents4pharma/talk2knowledgegraphs/tests/test_primekg_loader.py (new file, 73 lines):

"""
Test cases for primekg.py
"""

import os
import shutil
import pytest
from ..datasets.primekg import PrimeKG

# Remove the data folder for testing if it exists
LOCAL_DIR = "../data/primekg_test/"
shutil.rmtree(LOCAL_DIR, ignore_errors=True)

@pytest.fixture(name="primekg")
def primekg_fixture():
    """
    Fixture for creating an instance of PrimeKG.
    """
    return PrimeKG(local_dir=LOCAL_DIR)

def test_download_primekg(primekg):
    """
    Test the loading method of the PrimeKG class by downloading PrimeKG from the server.
    """
    # Load PrimeKG data
    primekg.load_data()
    primekg_nodes = primekg.get_nodes()
    primekg_edges = primekg.get_edges()

    # Check if the local directory exists
    assert os.path.exists(primekg.local_dir)
    # Check if downloaded and processed files exist
    files = ["nodes.tab", f"{primekg.name}_nodes.tsv.gz",
             "edges.csv", f"{primekg.name}_edges.tsv.gz"]
    for file in files:
        path = f"{primekg.local_dir}/{file}"
        assert os.path.exists(path)
    # Check processed PrimeKG dataframes
    # Nodes
    assert primekg_nodes is not None
    assert len(primekg_nodes) > 0
    assert primekg_nodes.shape[0] == 129375
    # Edges
    assert primekg_edges is not None
    assert len(primekg_edges) > 0
    assert primekg_edges.shape[0] == 8100498

def test_load_existing_primekg(primekg):
    """
    Test the loading method of the PrimeKG class by loading existing PrimeKG
    from the local directory.
    """
    # Load PrimeKG data
    primekg.load_data()
    primekg_nodes = primekg.get_nodes()
    primekg_edges = primekg.get_edges()

    # Check if the local directory exists
    assert os.path.exists(primekg.local_dir)
    # Check if downloaded and processed files exist
    files = ["nodes.tab", f"{primekg.name}_nodes.tsv.gz",
             "edges.csv", f"{primekg.name}_edges.tsv.gz"]
    for file in files:
        path = f"{primekg.local_dir}/{file}"
        assert os.path.exists(path)
    # Check processed PrimeKG dataframes
    # Nodes
    assert primekg_nodes is not None
    assert len(primekg_nodes) > 0
    assert primekg_nodes.shape[0] == 129375
    # Edges
    assert primekg_edges is not None
    assert len(primekg_edges) > 0
    assert primekg_edges.shape[0] == 8100498
aiagents4pharma/talk2knowledgegraphs/tests/test_starkqa_primekg_loader.py (new file, 116 lines):

"""
Test cases for starkqa_primekg.py
"""

import os
import shutil
import pytest
from ..datasets.starkqa_primekg import StarkQAPrimeKG

# Remove the data folder for testing if it exists
LOCAL_DIR = "../data/starkqa_primekg_test/"
shutil.rmtree(LOCAL_DIR, ignore_errors=True)

@pytest.fixture(name="starkqa_primekg")
def starkqa_primekg_fixture():
    """
    Fixture for creating an instance of StarkQAPrimeKG.
    """
    return StarkQAPrimeKG(local_dir=LOCAL_DIR)

def test_download_starkqa_primekg(starkqa_primekg):
    """
    Test the loading method of the StarkQAPrimeKG class by downloading files
    from HuggingFace Hub.
    """
    # Load StarkQA PrimeKG data
    starkqa_primekg.load_data()
    starkqa_df = starkqa_primekg.get_starkqa()
    primekg_node_info = starkqa_primekg.get_starkqa_node_info()
    split_idx = starkqa_primekg.get_starkqa_split_indicies()
    query_embeddings = starkqa_primekg.get_query_embeddings()
    node_embeddings = starkqa_primekg.get_node_embeddings()

    # Check if the local directory exists
    assert os.path.exists(starkqa_primekg.local_dir)
    # Check if downloaded files exist in the local directory
    files = ['qa/prime/split/test-0.1.index',
             'qa/prime/split/test.index',
             'qa/prime/split/train.index',
             'qa/prime/split/val.index',
             'qa/prime/stark_qa/stark_qa.csv',
             'qa/prime/stark_qa/stark_qa_human_generated_eval.csv',
             'skb/prime/processed.zip']
    for file in files:
        path = f"{starkqa_primekg.local_dir}/{file}"
        assert os.path.exists(path)
    # Check dataframe
    assert starkqa_df is not None
    assert len(starkqa_df) > 0
    assert starkqa_df.shape[0] == 11204
    # Check node information
    assert primekg_node_info is not None
    assert len(primekg_node_info) == 129375
    # Check split indices
    assert list(split_idx.keys()) == ['train', 'val', 'test', 'test-0.1']
    assert len(split_idx['train']) == 6162
    assert len(split_idx['val']) == 2241
    assert len(split_idx['test']) == 2801
    assert len(split_idx['test-0.1']) == 280
    # Check query embeddings
    assert query_embeddings is not None
    assert len(query_embeddings) == 11204
    assert query_embeddings[0].shape[1] == 1536
    # Check node embeddings
    assert node_embeddings is not None
    assert len(node_embeddings) == 129375
    assert node_embeddings[0].shape[1] == 1536

def test_load_existing_starkqa_primekg(starkqa_primekg):
    """
    Test the loading method of the StarkQAPrimeKG class by loading existing files
    in the local directory.
    """
    # Load StarkQA PrimeKG data
    starkqa_primekg.load_data()
    starkqa_df = starkqa_primekg.get_starkqa()
    primekg_node_info = starkqa_primekg.get_starkqa_node_info()
    split_idx = starkqa_primekg.get_starkqa_split_indicies()
    query_embeddings = starkqa_primekg.get_query_embeddings()
    node_embeddings = starkqa_primekg.get_node_embeddings()

    # Check if the local directory exists
    assert os.path.exists(starkqa_primekg.local_dir)
    # Check if downloaded and processed files exist
    files = ['qa/prime/split/test-0.1.index',
             'qa/prime/split/test.index',
             'qa/prime/split/train.index',
             'qa/prime/split/val.index',
             'qa/prime/stark_qa/stark_qa.csv',
             'qa/prime/stark_qa/stark_qa_human_generated_eval.csv',
             'skb/prime/processed.zip']
    for file in files:
        path = f"{starkqa_primekg.local_dir}/{file}"
        assert os.path.exists(path)
    # Check dataframe
    assert starkqa_df is not None
    assert len(starkqa_df) > 0
    assert starkqa_df.shape[0] == 11204
    # Check node information
    assert primekg_node_info is not None
    assert len(primekg_node_info) == 129375
    # Check split indices
    assert list(split_idx.keys()) == ['train', 'val', 'test', 'test-0.1']
    assert len(split_idx['train']) == 6162
    assert len(split_idx['val']) == 2241
    assert len(split_idx['test']) == 2801
    assert len(split_idx['test-0.1']) == 280
    # Check query embeddings
    assert query_embeddings is not None
    assert len(query_embeddings) == 11204
    assert query_embeddings[0].shape[1] == 1536
    # Check node embeddings
    assert node_embeddings is not None
    assert len(node_embeddings) == 129375
    assert node_embeddings[0].shape[1] == 1536
Four new one-line documentation stubs (one file each) register the dataset modules with mkdocstrings:

::: aiagents4pharma.talk2knowledgegraphs.datasets.biobridge_primekg

::: aiagents4pharma.talk2knowledgegraphs.datasets.dataset

::: aiagents4pharma.talk2knowledgegraphs.datasets.primekg

::: aiagents4pharma.talk2knowledgegraphs.datasets.starkqa_primekg
Three tutorial notebooks are added (large diffs not rendered):

docs/notebooks/talk2knowledgegraphs/tutorial_biobridge_primekg_loader.ipynb (4,648 additions)
docs/notebooks/talk2knowledgegraphs/tutorial_primekg_loader.ipynb (665 additions)
docs/notebooks/talk2knowledgegraphs/tutorial_starkqa_primekg_loader.ipynb (2,704 additions)
Review comment: "Fine to include the relation embeddings in a subsequent PR as mentioned in another comment."