Source code and supplementary materials for "LEGO-GraphRAG: Modularizing Graph-based Retrieval-Augmented Generation for Design Space Exploration".
LEGO-GraphRAG is a modular framework for graph-based retrieval-augmented generation (GraphRAG) that supports systematic exploration of its design space. This repository contains the source code and supplementary materials for the project, including data preprocessing, experiments, and evaluation tools.
.
├── requirements.txt
├── Preprocess
├── SEPRAlign
├── Instance
├── CostRecord
├── Evaluation
├── Results
├── ExistingInstances
├── ApproximatePPR
└── FineTune
The system is divided into several independently executed modules, each serving a specific purpose:

- Preprocess: Handles data preprocessing tasks, including cleaning, formatting, and preparing datasets for further processing.
- Instance: Generates LEGO-GraphRAG instances and runs the Instance Experiments.
- SEPRAlign: Aligns subgraph-extraction and path-retrieval modules to run the Module Experiments.
- CostRecord: Records various runtime costs, such as computational resource consumption by LLMs and EEMs.
- Evaluation: Provides tools to evaluate the performance of different components and generate visualizations, including charts and graphs.
- Results: Stores all the results reported in our paper.
- ExistingInstances: Reproduces instance alignment experiments, including:
  - KELP
  - RoG
  - ToG
- ApproximatePPR: Reproduces personalized PageRank (PPR) experiments, including:
  - Fora
  - TopPPR
- FineTune: Includes scripts and configurations for fine-tuning EEMs and LLMs.
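To make the modular design concrete, the sketch below shows how an instance is conceptually assembled from a subgraph-extraction step, a path-retrieval step, and an LLM generation step. It is purely illustrative: the toy graph, data structures, and function names are hypothetical and do not correspond to the repository's actual API.

```python
# Illustrative sketch of the modular pipeline; names and structures are hypothetical.
from collections import deque

# Toy knowledge graph as an adjacency list of (relation, object) edges.
KG = {
    "Einstein": [("born_in", "Ulm"), ("field", "Physics")],
    "Ulm": [("located_in", "Germany")],
    "Physics": [("studies", "Matter")],
}

def extract_subgraph(graph, seeds, hops=2):
    """Subgraph-extraction module: keep entities within `hops` of the seed entities
    (in the framework this role is played by PPR-, EEM-, or LLM-based methods)."""
    keep, frontier = set(seeds), set(seeds)
    for _ in range(hops):
        frontier = {obj for s in frontier for _, obj in graph.get(s, [])} - keep
        keep |= frontier
    return {s: [(r, o) for r, o in edges if o in keep]
            for s, edges in graph.items() if s in keep}

def retrieve_paths(subgraph, seeds, max_hops=2):
    """Path-retrieval module: enumerate short relation paths starting at the seeds."""
    paths, queue = [], deque((s, [s]) for s in seeds)
    while queue:
        node, path = queue.popleft()
        for rel, obj in subgraph.get(node, []):
            new_path = path + [rel, obj]
            paths.append(" -> ".join(new_path))
            if (len(new_path) - 1) // 2 < max_hops:
                queue.append((obj, new_path))
    return paths

def generate_answer(llm, question, paths):
    """Generation module: prompt the deployed LLM with the retrieved paths."""
    prompt = "Paths:\n" + "\n".join(paths) + f"\n\nQuestion: {question}\nAnswer:"
    return llm(prompt)

# Chaining the modules; `llm` would be a callable wrapping the deployed endpoint.
subgraph = extract_subgraph(KG, ["Einstein"])
paths = retrieve_paths(subgraph, ["Einstein"])
print(paths)
```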
- Clone this repository:
git clone https://github.com/gzy02/LEGO-GraphRAG.git
- Navigate to the project directory and install dependencies:
cd LEGO-GraphRAG
pip install -r requirements.txt
cd Instance
- Prepare the graph and datasets by following the instructions in the Preprocess module.
- Deploy an LLM using the provided script:
./vllm_qwen.sh
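Once the server is up, it can be sanity-checked before running any experiments. The snippet below is a minimal sketch that assumes the script launches a vLLM OpenAI-compatible server at http://localhost:8000 serving a Qwen model; the URL and model name are assumptions and should be adjusted to your deployment.

```python
# Minimal sanity check of the deployed endpoint. The URL and model name below
# are assumptions; adjust them to match your vllm_qwen.sh configuration.
import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "Qwen/Qwen2-7B-Instruct",  # hypothetical model name
        "messages": [{"role": "user", "content": "Say hello."}],
        "temperature": 0.0,
        "max_tokens": 16,
    },
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])
```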
- Change the config.py file to the desired configuration (an illustrative example follows this list). The keywords should include:
  - llm_url: the URL of the deployed LLM
  - reasoning_model: the reasoning model served by the deployed LLM
  - reasoning_dataset: the dataset used in the experiment
  - subgraph_list: the subgraphs used in the experiment
  - emb_model_dir: the directory of the embedding model
  - temperature: the sampling temperature for LLM generation
  - max_tokens: the maximum number of tokens generated by the LLM
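The following is a minimal sketch of what such a configuration might look like; every value is a placeholder assumption and must be replaced with your own URL, model names, dataset identifiers, and paths.

```python
# config.py -- illustrative placeholder values only; replace with your own setup.
llm_url = "http://localhost:8000/v1"        # URL of the deployed LLM server
reasoning_model = "Qwen/Qwen2-7B-Instruct"  # hypothetical model name
reasoning_dataset = "webqsp"                # hypothetical dataset identifier
subgraph_list = ["ppr_subgraph"]            # hypothetical subgraph identifiers
emb_model_dir = "/path/to/embedding/model"  # directory of the embedding model
temperature = 0.0                           # sampling temperature for generation
max_tokens = 256                            # maximum tokens generated per call
```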
- Run the Instance Experiments:
  - Subgraph Extraction Experiments:
    python SE_PPR.py
    python SE_EEM.py
    python SE_LLM.py
  - Path Retrieval Experiments:
    python Instances.py
- Generate the results:
python Generate.py
This project is licensed under the terms specified in the LICENSE file.