Source code and supplementary materials for "LEGO-GraphRAG: Modularizing Graph-based Retrieval-Augmented Generation for Design Space Exploration".
LEGO-GraphRAG is a modular framework for graph-based retrieval-augmented generation (GraphRAG) that supports systematic exploration of its design space. This repository contains the source code and supplementary materials for the project, including data preprocessing, experiments, and evaluation tools.
.
├── requirements.txt
├── Preprocess
├── SEPRAlign
├── Instance
├── CostRecord
├── Evaluation
├── Results
├── ExistingInstances
├── ApproximatePPR
└── FineTune
The system is divided into several independently executed modules, each serving a specific purpose:

- Preprocess: Handles data preprocessing tasks, including cleaning, formatting, and preparing datasets for further processing.
- Instance: Generates LEGO-GraphRAG instances and runs the Instance Experiments.
- SEPRAlign: Aligns subgraph-extraction and path-retrieval modules to run the Module Experiments.
- CostRecord: Records various runtime costs, such as computational resource consumption by LLMs and EEMs.
- Evaluation: Provides tools to evaluate the performance of different components and generate visualizations, including charts and graphs.
- Results: Stores all the results reported in our paper.
- ExistingInstances: Reproduces instance alignment experiments, including:
  - KELP
  - RoG
  - ToG
- ApproximatePPR: Reproduces personalized PageRank (PPR) experiments, including:
  - Fora
  - TopPPR
- FineTune: Includes scripts and configurations for fine-tuning EEMs and LLMs.
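To make the modular design concrete, the sketch below shows how an instance is conceptually assembled from a subgraph-extraction step, a path-retrieval step, and an LLM generation step. It is purely illustrative: the toy graph, data structures, and function names are hypothetical and do not correspond to the repository's actual API.

```python
# Illustrative sketch of the modular pipeline; names and structures are hypothetical.
from collections import deque

# Toy knowledge graph as an adjacency list of (relation, object) edges.
KG = {
    "Einstein": [("born_in", "Ulm"), ("field", "Physics")],
    "Ulm": [("located_in", "Germany")],
    "Physics": [("studies", "Matter")],
}

def extract_subgraph(graph, seeds, hops=2):
    """Subgraph-extraction module: keep entities within `hops` of the seed entities
    (in the framework this role is played by PPR-, EEM-, or LLM-based methods)."""
    keep, frontier = set(seeds), set(seeds)
    for _ in range(hops):
        frontier = {obj for s in frontier for _, obj in graph.get(s, [])} - keep
        keep |= frontier
    return {s: [(r, o) for r, o in edges if o in keep]
            for s, edges in graph.items() if s in keep}

def retrieve_paths(subgraph, seeds, max_hops=2):
    """Path-retrieval module: enumerate short relation paths starting at the seeds."""
    paths, queue = [], deque((s, [s]) for s in seeds)
    while queue:
        node, path = queue.popleft()
        for rel, obj in subgraph.get(node, []):
            new_path = path + [rel, obj]
            paths.append(" -> ".join(new_path))
            if (len(new_path) - 1) // 2 < max_hops:
                queue.append((obj, new_path))
    return paths

def generate_answer(llm, question, paths):
    """Generation module: prompt the deployed LLM with the retrieved paths."""
    prompt = "Paths:\n" + "\n".join(paths) + f"\n\nQuestion: {question}\nAnswer:"
    return llm(prompt)

# Chaining the modules; `llm` would be a callable wrapping the deployed endpoint.
subgraph = extract_subgraph(KG, ["Einstein"])
paths = retrieve_paths(subgraph, ["Einstein"])
print(paths)
```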
- Clone this repository:
git clone https://github.com/gzy02/LEGO-GraphRAG.git
- Navigate to the project directory and install dependencies:
cd LEGO-GraphRAG
pip install -r requirements.txt
cd Instance
- Prepare the graph and datasets by following the instructions in the Preprocess module.
- Deploy an LLM using the provided script:
./vllm_qwen.sh
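Once the server is up, it can be sanity-checked before running any experiments. The snippet below is a minimal sketch that assumes the script launches a vLLM OpenAI-compatible server at http://localhost:8000 serving a Qwen model; the URL and model name are assumptions and should be adjusted to your deployment.

```python
# Minimal sanity check of the deployed endpoint. The URL and model name below
# are assumptions; adjust them to match your vllm_qwen.sh configuration.
import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "Qwen/Qwen2-7B-Instruct",  # hypothetical model name
        "messages": [{"role": "user", "content": "Say hello."}],
        "temperature": 0.0,
        "max_tokens": 16,
    },
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])
```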
- Change the config.py file to the desired configuration (an illustrative example follows this list). The keywords should include:
  - llm_url: the URL of the deployed LLM
  - reasoning_model: the reasoning model served by the deployed LLM
  - reasoning_dataset: the dataset used in the experiment
  - subgraph_list: the subgraphs used in the experiment
  - emb_model_dir: the directory of the embedding model
  - temperature: the sampling temperature for LLM generation
  - max_tokens: the maximum number of tokens generated by the LLM
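The following is a minimal sketch of what such a configuration might look like; every value is a placeholder assumption and must be replaced with your own URL, model names, dataset identifiers, and paths.

```python
# config.py -- illustrative placeholder values only; replace with your own setup.
llm_url = "http://localhost:8000/v1"        # URL of the deployed LLM server
reasoning_model = "Qwen/Qwen2-7B-Instruct"  # hypothetical model name
reasoning_dataset = "webqsp"                # hypothetical dataset identifier
subgraph_list = ["ppr_subgraph"]            # hypothetical subgraph identifiers
emb_model_dir = "/path/to/embedding/model"  # directory of the embedding model
temperature = 0.0                           # sampling temperature for generation
max_tokens = 256                            # maximum tokens generated per call
```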
- Run the Instance Experiments:
  - Subgraph Extraction Experiments:
    python SE_PPR.py
    python SE_EEM.py
    python SE_LLM.py
  - Path Retrieval Experiments:
    python Instances.py
- Generate the results:
python Generate.py
This project is licensed under the terms specified in the LICENSE file.