Oaken is an acceleration solution that achieves high accuracy and high performance simultaneously through algorithm-hardware co-design, leveraging an online-offline hybrid KV cache quantization algorithm and dedicated quantization/dequantization hardware modules.
This repository provides source code for evaluating the inference accuracy of Oaken and other baselines. Running the evaluation code requires Python 3.10 or later and CUDA 12.1 or later.
We provide a Dockerfile to build a container with Python 3.10, CUDA 12.1, and Miniconda pre-installed. Run the following commands to build the Docker image and create the corresponding container.
$ docker build --force-rm -t oaken-ae-img .
$ docker run \
--name oaken-ae-container \
-it \
--gpus '"device=[GPU LIST]"' \
oaken-ae-img
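For example, to make GPUs 0 and 1 visible inside the container:
$ docker run \
--name oaken-ae-container \
-it \
--gpus '"device=0,1"' \
oaken-ae-img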
- Open `src/model.py`.
- Set `MODEL_STORAGE_PREFIX` and `MODEL_STORATE_POSTFIX` to the directory where Huggingface models are stored (see the configuration sketch after the download commands below).
- Huggingface model directory names should follow the format `{model_name}-{model_size}`, such as `llama2-7b`.
- Put your Huggingface access token in the variable `HF_TOKEN` in `model_downloader.py`. (Some of the models may require an individual access grant.)
- Select the models to download by setting the `DOWNLOAD_*` variables in `model_downloader.py`.
- Run the following commands to download LLMs from Huggingface.
$ pip install huggingface_hub
$ python3 model_downloader.py
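For reference, the configuration might look like the sketch below. The storage path, the token value, and the `DOWNLOAD_LLAMA2_7B` flag are placeholders only; check `src/model.py` and `model_downloader.py` for the variables that are actually defined and for how the prefix and postfix are combined.

# src/model.py (sketch; placeholder values)
MODEL_STORAGE_PREFIX = "/data/huggingface"   # parent directory holding the model folders
MODEL_STORATE_POSTFIX = ""                   # suffix appended when building the model path

# model_downloader.py (sketch; placeholder values)
HF_TOKEN = "hf_xxxxxxxxxxxxxxxx"             # your Huggingface access token
DOWNLOAD_LLAMA2_7B = True                    # hypothetical DOWNLOAD_* flag; enable the models you need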
Oaken, KVQuant, QServe, and Tender share the same installation commands.
$ git submodule update --init --recursive
$ conda create -n oaken python=3.10
$ conda activate oaken
$ pip install torch protobuf sentencepiece
$ pushd transformers
$ pip install -e .
$ popd
$ pushd lm-evaluation-harness
$ pip install -e .
$ popd
$ pushd kvquant/quant
$ pip install -e .
$ popd
$ pip install flash-attn --no-build-isolation
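Optionally, you can verify that the installed PyTorch build can see your GPUs before running the evaluations:
$ python3 -c "import torch; print(torch.__version__, torch.cuda.is_available())"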
You can run the entire accuracy evaluation to get the results for Table 2 with the following commands. Please configure the list of models and the number of GPUs to use for the evaluation. Note that running the entire accuracy evaluation with these scripts takes a very long time.
$ python3 scripts/accuracy_oaken.py
$ python3 scripts/accuracy_kvquant.py
$ python3 scripts/accuracy_qserve.py
$ python3 scripts/accuracy_tender.py
Use the following command to get the results for Figure 12(a).
$ python3 explore_oaken.py
The following sections describe how to run individual models and benchmarks.
You can run offline profiling for Oaken with the following command.
$ python3 oaken_preprocess_activation.py \
-m [MODEL NAME] \
-s [MODEL SIZE] \
-t wikitext \
-f 0.04 0.9 0.06 \
-o quantizer/oaken/[OUTPUT QUANTIZER].json
Example command:
$ python3 oaken_preprocess_activation.py \
-m llama2 \
-s 7b \
-t wikitext \
-f 0.04 0.9 0.06 \
-o quantizer/oaken/llama2-7b.json
Run the profiling process for KVQuant with the following command.
Set `[MODEL PATH]` to the directory where you downloaded the Huggingface model.
$ cd kvquant/quant
$ CUDA_VISIBLE_DEVICES=0 python3 llama_simquant.py \
[MODEL PATH] \
--abits 4 \
--nsamples 16 \
--seqlen 2048 \
--nuq \
--quantize \
--include_sparse \
--sparsity-threshold 0.99 \
--quantizer-path quantizer/kvquant/[OUTPUT QUANTIZER].pickle
Please refer to the detailed instructions at https://github.com/SqueezeAILab/KVQuant/tree/main/quant.
Run the profiling process for QServe with the following command.
$ python3 qserve_preprocess_activation.py \
-m [MODEL NAME] \
-s [MODEL SIZE] \
-t wikitext \
-o quantizer/qserve/[OUTPUT QUANTIZER].json
You should download the `val.jsonl.zst` file with the following command.
$ wget https://huggingface.co/datasets/mit-han-lab/pile-val-backup/resolve/main/val.jsonl.zst
You can run preprocessing for Tender with the following command.
$ python3 tender_preprocess_activation.py \
-m [MODEL NAME] \
-s [MODEL SIZE] \
-d val.jsonl.zst \
-o quantizer/tender/[OUTPUT QUANTIZER].pt
Use the `--quant-method` option to select the quantization method.
Use the `--gpu-start-idx` and `--gpu-count` options to choose which GPUs will be used for the evaluation.
Specify the path of the quantizer file generated by the commands above, if required by the quantization algorithm.
$ python3 eval_perplexity.py \
-m [MODEL NAME] \
-s [MODEL SIZE] \
-t wikitext \
-q quantizer/[QUANTIZER PATH] \
--quant-method [oaken | qserve | kvquant | tender] \
--gpu-start-idx 0 \
--gpu-count 1
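Example command (Oaken on llama2-7b, using the quantizer generated above):
$ python3 eval_perplexity.py \
-m llama2 \
-s 7b \
-t wikitext \
-q quantizer/oaken/llama2-7b.json \
--quant-method oaken \
--gpu-start-idx 0 \
--gpu-count 1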
Use the `--quant-method` option to select the quantization method.
Use the `-t` option to select the evaluation dataset.
Use the `--gpu-start-idx` and `--gpu-count` options to choose which GPUs will be used for the evaluation.
$ python3 eval_workload.py \
-m [MODEL NAME] \
-s [MODEL SIZE] \
-t [piqa | winogrande | hellaswag] \
-q quantizer/[QUANTIZER PATH] \
--quant-method [oaken | qserve | kvquant | tender] \
--gpu-start-idx 0 \
--gpu-count 1
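Example command (Oaken on llama2-7b with the piqa task):
$ python3 eval_workload.py \
-m llama2 \
-s 7b \
-t piqa \
-q quantizer/oaken/llama2-7b.json \
--quant-method oaken \
--gpu-start-idx 0 \
--gpu-count 1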
$ conda create -n kivi python=3.10
$ conda activate kivi
$ cd KIVI
$ pip install -e .
$ cd quant
$ pip install -e .
$ cd ../..
$ pip install datasets
$ pushd transformers
$ pip install -e .
$ popd
$ pushd lm-evaluation-harness
$ pip install -e .
$ popd
You can run the entire accuracy evaluation to get the results for Table 2 with the following command. Please configure the list of models and the number of GPUs to use for the evaluation. Note that running the entire accuracy evaluation with this script takes a very long time.
$ python3 scripts/accuracy_kivi.py
$ python3 eval_perplexity.py \
-m [MODEL NAME] \
-s [MODEL SIZE] \
-t wikitext \
--quant-method kivi \
--gpu-start-idx 0 \
--gpu-count 1
$ python3 eval_workload.py \
-m [MODEL NAME] \
-s [MODEL SIZE] \
-t [piqa | winogrande | hellaswag] \
--quant-method kivi \
--gpu-start-idx 0 \
--gpu-count 1
$ conda create -n atom python=3.10
$ conda activate atom
$ cd Atom/model
$ pip install -r requirements.txt
$ cd Atom
$ ./scripts/run_atom_ppl.sh [MODEL NAME] [# of GPUs]
Example command:
$ ./scripts/run_atom_ppl.sh llama2-7b 1
$ ./scripts/run_atom_ppl.sh llama2-70b 4
$ cd Atom
$ ./scripts/run_atom_zeroshot_acc.sh [MODEL NAME] [# of GPUs]
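Example command:
$ ./scripts/run_atom_zeroshot_acc.sh llama2-7b 1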
Configure the environment variables with the following commands.
$ export PATH="/usr/local/cuda-[CUDA_VERSION]/bin:${PATH}"
$ export LD_LIBRARY_PATH="/usr/local/cuda-[CUDA_VERSION]/lib64:${LD_LIBRARY_PATH}"
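For example, with CUDA 12.1:
$ export PATH="/usr/local/cuda-12.1/bin:${PATH}"
$ export LD_LIBRARY_PATH="/usr/local/cuda-12.1/lib64:${LD_LIBRARY_PATH}"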
Use a sampling rate smaller than 1.0 by passing the `--sample-rate` option or by modifying the `SAMPLING_RATE` variable in the script.
Modify `freqs = (inv_freq_expanded.float() @ position_ids_expanded.float()).transpose(1, 2)` in `src/transformers/models/llama/modeling_llama.py` to `freqs = (inv_freq_expanded.float().to(position_ids_expanded.device) @ position_ids_expanded.float()).transpose(1, 2)`.
Install the package with the following command.
$ pip install peft==0.10.0