This repo contains our project "TPC-DS Benchmark Using DuckDB" for the course "Data Warehouses" at Université Libre de Bruxelles (ULB). In this project, we implement the TPC-DS benchmark on the DuckDB database management system.
- Clone the repo:

      git clone https://github.com/hieunm44/dw-tpcds-duckdb.git
      cd dw-tpcds-duckdb

- Install the `duckdb` package:

      pip install duckdb

- Go to https://www.tpc.org/tpc_documents_current_versions/current_specifications5.asp and download `TPC-DS_Tools_v3.2.0.zip`, then unzip it to a folder `tpcds-kit`.
- Check the document `TPC-DS_v3.2.0.pdf` (also from the link above) for details about the TPC-DS benchmark.
- Give full access permission to the data folders:

      chmod 777 generated_data
      chmod 777 generated_queries
      chmod 777 refreshed_data

We only show examples for scale factor 1. Other scale factors can be run similarly.
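As a rough illustration of how another scale factor could be handled, the sketch below builds the `dsdgen` argument list used in this README for an arbitrary scale. The helper name `dsdgen_command` and the list of scale factors are hypothetical; the flags are copied from the scale-1 invocation, and only the `-scale` value and the `scale_<n>` folder name change.

```python
from pathlib import Path


def dsdgen_command(scale: int, out_root: str = "../../generated_data") -> list[str]:
    """Build the dsdgen argument list for one scale factor.

    The flags mirror the scale-1 call in this README; only the -scale
    value and the scale_<n> output folder differ.
    """
    out_dir = Path(out_root) / f"scale_{scale}"
    return [
        "./dsdgen",
        "-scale", str(scale),
        "-dir", out_dir.as_posix(),
        "-suffix", ".csv",
        "-verbose", "Y",
        "-force", "Y",
    ]


if __name__ == "__main__":
    # Hypothetical scale factors, chosen only for illustration.
    for scale in (1, 2, 5):
        print(" ".join(dsdgen_command(scale)))
```

The returned list can be passed to `subprocess.run(cmd, cwd="tpcds-kit/tools", check=True)` once the tools have been built. The other scripts in this repo may also need their scale-specific paths adjusted.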
- Data generation

      cd tpcds-kit/tools
      make LINUX_CC=gcc-9 OS=LINUX
      ./dsdgen -scale 1 -dir "../../generated_data/scale_1" -suffix .csv -verbose Y -force Y
      # Note: the folder path must not contain spaces

  Then 25 `.csv` files will be generated in the folder `generated_data/scale_1`.
- Query generation

      ./dsqgen -directory "../query_templates/" -input "../query_templates/templates.lst" -dialect netezza -scale 1 -output_dir "../../generated_queries" -verbose Y

  Then a file `query_0.sql` containing 99 queries will be created in the folder `generated_queries`. Now we split all queries into 99 separate files, named from `query_1.sql` to `query_99.sql`:

      cd ../../generated_queries
      python3 split_queries.py

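The splitting step can be sketched as follows. This is not the repo's `split_queries.py`, whose contents are not shown here. It is a minimal illustration that assumes each query in `query_0.sql` is followed by a comment line of the form `-- end query N in stream 0 using template ...`. Check that marker against your own generated file before relying on it. The function names are hypothetical.

```python
import re
from pathlib import Path

# Assumed marker format; verify it against your own query_0.sql.
END_MARKER = re.compile(r"^-- end query \d+ in stream \d+.*$", re.MULTILINE)


def split_queries(text: str) -> list[str]:
    """Split the combined dsqgen output into one string per query."""
    chunks = END_MARKER.split(text)
    return [chunk.strip() + "\n" for chunk in chunks if chunk.strip()]


def write_queries(text: str, out_dir: Path) -> int:
    """Write query_1.sql, query_2.sql, ... and return how many were written."""
    queries = split_queries(text)
    for number, query in enumerate(queries, start=1):
        (out_dir / f"query_{number}.sql").write_text(query)
    return len(queries)


if __name__ == "__main__":
    source = Path("query_0.sql")
    if source.exists():
        count = write_queries(source.read_text(), Path("."))
        print(f"wrote {count} query files")
```

Splitting on the end marker keeps everything between two markers together, so a template that expands to more than one SQL statement stays in a single file.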
- Create database

  Go to the folder `src/create_database`, then run:

      python3 create_db.py

  A DuckDB database file `scale_1.db` will be created in the folder `created_db`.
- Load data and load test

  Go to the folder `src/load_data`, then run:

      python3 load_data.py

  This script loads the data from the generated `.csv` files into the tables of our database, then reports the load time.
- Power test

  Go to the folder `src/test`, then run:

      python3 power_test.py

  This script measures the running time of each individual query and the total time of running all 99 queries.
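The kind of measurement done in the power test can be sketched as below. This is an illustration under stated assumptions, not the repo's `power_test.py`: the function name `time_queries` and the injected `run_query` callable are hypothetical, chosen so that the timing logic stays independent of the database driver.

```python
import time
from typing import Callable


def time_queries(
    queries: dict[str, str],
    run_query: Callable[[str], object],
) -> dict[str, float]:
    """Run each query once, in order, and return elapsed seconds per name."""
    timings: dict[str, float] = {}
    for name, sql in queries.items():
        start = time.perf_counter()
        run_query(sql)  # the caller decides how the SQL is executed
        timings[name] = time.perf_counter() - start
    return timings


def total_time(timings: dict[str, float]) -> float:
    """Sum of the individual query times."""
    return sum(timings.values())
```

With the `duckdb` package, `run_query` could be something like `lambda sql: con.execute(sql).fetchall()`, where `con` comes from `duckdb.connect(...)` pointed at the `scale_1.db` file. Fetching the rows inside the timed call makes the measurement cover result retrieval as well as execution.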
- Throughput test

      python3 throughput_test.py

  The script will return the throughput test time.
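The TPC-DS specification describes the throughput test as several query streams running concurrently. One way this could be measured is sketched below. It is not the repo's `throughput_test.py`: the function name `run_streams`, the thread-per-stream design, and the `run_query(stream_id, sql)` callable are all assumptions made for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence


def run_streams(
    streams: Sequence[Sequence[str]],
    run_query: Callable[[int, str], object],
) -> float:
    """Run every stream in its own thread; return wall-clock seconds."""

    def run_one(stream_id: int) -> None:
        # Queries inside one stream run sequentially, in the given order.
        for sql in streams[stream_id]:
            run_query(stream_id, sql)

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=max(1, len(streams))) as pool:
        # list(...) waits for all streams and re-raises any worker exception.
        list(pool.map(run_one, range(len(streams))))
    return time.perf_counter() - start
```

The `stream_id` argument lets the caller keep one database handle per stream. With DuckDB's Python client, for example, each thread would typically work on its own cursor obtained from `con.cursor()` rather than sharing a single connection object.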
- Maintenance test

  Generate the dataset again as refreshed data:

      ./dsdgen -scale 1 -dir "../../refreshed_data/scale_1" -suffix .dat -update 1 -verbose Y -force Y

  Then 23 `.dat` files will be generated in the folder `refreshed_data/scale_1`.

  Next, run the two files `src/create_database/create_mtnc.py` and `src/load_data/load_mtnc.py` to create the database and load the dataset again for the maintenance test.

  Finally, go to the folder `src/test` and run the test:

      python3 maintenance_test.py

  The script will give the time for running the maintenance functions.