Data Warehouses Project 1 - TPC-DS Benchmark Using DuckDB

Overview

This repository contains our project "TPC-DS Benchmark Using DuckDB" for the course "Data Warehouses" at Université Libre de Bruxelles (ULB). In this project, we implement the TPC-DS benchmark on the DuckDB database management system.

Setup

Clone the repo

git clone https://github.com/hieunm44/dw-tpcds-duckdb.git
cd dw-tpcds-duckdb

Install the duckdb package
```
pip install duckdb
```
Go to https://www.tpc.org/tpc_documents_current_versions/current_specifications5.asp and download TPC-DS_Tools_v3.2.0.zip, then unzip it into a folder named tpcds-kit.
See the document TPC-DS_v3.2.0.pdf (available from the same link) for details about the TPC-DS benchmark.

Give full access permission to data folders

chmod 777 generated_data
chmod 777 generated_queries
chmod 777 refreshed_data

Usage

We only show examples for scale factor 1. Other scale factors can be run the same way by changing the scale value and the folder names.

Data generation

cd tpcds-kit/tools
make LINUX_CC=gcc-9 OS=LINUX
./dsdgen -scale 1 -dir "../../generated_data/scale_1" -suffix .csv -verbose Y -force Y
# Note: the folder path must not contain spaces

Then 25 .csv files will be generated in the folder generated_data/scale_1.
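As a quick sanity check after generation, the output folder passed to dsdgen via -dir can be listed to confirm the expected number of files. The following is a minimal sketch; the helper name summarize_output is hypothetical and not part of this repo.

```python
from pathlib import Path


def summarize_output(data_dir, suffix=".csv"):
    """Return {file name: size in bytes} for generated files with the given suffix."""
    return {
        path.name: path.stat().st_size
        for path in sorted(Path(data_dir).iterdir())
        if path.is_file() and path.suffix == suffix
    }
```

Calling len(summarize_output("generated_data/scale_1")) from the repo root should then match the file count stated above, and a size of 0 for any file would point to an interrupted run.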

Query generation

./dsqgen -directory "../query_templates/" -input "../query_templates/templates.lst" -dialect netezza -scale 1 -output_dir "../../generated_queries" -verbose Y

Then a file query_0.sql containing 99 queries will be created in the folder generated_queries. Now we split all queries into 99 separate files, named from query_1.sql to query_99.sql.

cd ../../generated_queries
python3 split_queries.py
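
The contents of split_queries.py are not shown in this README. A minimal sketch of such a splitter is given below. It assumes that dsqgen wraps each query in comment lines of the form "-- start query N in stream 0 ..." and "-- end query N in stream 0 ..."; check your own query_0.sql, since that marker format is an assumption here. The function names are illustrative.

```python
import re
from pathlib import Path

# Assumed marker format, e.g.
# "-- start query 12 in stream 0 using template query34.tpl"
START = re.compile(r"^--\s*start query (\d+)\b", re.IGNORECASE)
END = re.compile(r"^--\s*end query (\d+)\b", re.IGNORECASE)


def split_queries(text):
    """Return {query_number: sql_text} for each start/end marker pair."""
    queries, current, buffer = {}, None, []
    for line in text.splitlines():
        start = START.match(line)
        if start:
            current, buffer = int(start.group(1)), []
        elif END.match(line) and current is not None:
            queries[current] = "\n".join(buffer).strip() + "\n"
            current = None
        elif current is not None:
            buffer.append(line)
    return queries


def write_queries(source="query_0.sql", out_dir="."):
    """Write query_1.sql ... query_N.sql from the combined file."""
    queries = split_queries(Path(source).read_text())
    for number, sql in queries.items():
        Path(out_dir, f"query_{number}.sql").write_text(sql)
    return len(queries)
```

Keying on the marker comments rather than on semicolons keeps templates that expand to several statements together in one file.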

Create database
Go to folder src/create_database, then run:
```
python3 create_db.py
```
A DuckDB database file scale_1.db will be created in the folder created_db.
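How create_db.py builds the schema is not described here. One plausible approach, sketched below, is to run a DDL script statement by statement against a new DuckDB file. The function names and both path arguments are hypothetical, and the statement splitter is deliberately naive: it only suits plain DDL with no ';' inside string literals.

```python
def split_statements(script):
    """Drop '--' comment lines, then split the script on ';'."""
    lines = [
        line for line in script.splitlines()
        if not line.lstrip().startswith("--")
    ]
    return [part.strip() for part in "\n".join(lines).split(";") if part.strip()]


def create_database(db_path, ddl_path):
    """Create (or open) a DuckDB file and run every statement from the DDL script."""
    import duckdb  # imported lazily so split_statements works without DuckDB

    with open(ddl_path) as handle:
        statements = split_statements(handle.read())
    con = duckdb.connect(db_path)
    try:
        for statement in statements:
            con.execute(statement)
    finally:
        con.close()
    return len(statements)
```

duckdb.connect creates the database file if it does not exist yet, so a call such as create_database("scale_1.db", "schema.sql") would produce the file in one step, where schema.sql stands for whatever DDL file you use.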
Load data and load test
Go to folder src/load_data, then run:
```
python3 load_data.py
```
This script loads the data from the generated .csv files into the tables of the database and reports the load time.
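The loading step can be sketched as one COPY statement per file, timed individually. This is an illustration under assumptions, not the contents of load_data.py: it assumes each .csv file is named after its target table and that the files are pipe-delimited, as dsdgen produces them. dsdgen rows also end with a trailing '|', so extra CSV options may be needed depending on your DuckDB version. The helper names are hypothetical.

```python
import time
from pathlib import Path


def copy_statement(table, csv_path):
    """Build a COPY statement for one pipe-delimited dsdgen file."""
    return f"COPY {table} FROM '{csv_path}' (DELIMITER '|')"


def load_all(execute, data_dir):
    """Load every .csv in data_dir into the table named after the file.

    `execute` is any callable that runs one SQL string, for example the
    execute method of a DuckDB connection. Returns seconds per table.
    """
    timings = {}
    for csv_file in sorted(Path(data_dir).glob("*.csv")):
        start = time.perf_counter()
        execute(copy_statement(csv_file.stem, csv_file.as_posix()))
        timings[csv_file.stem] = time.perf_counter() - start
    return timings
```

Summing the returned values gives the overall load time, while the per-table breakdown shows which fact tables dominate it.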
Power test
Go to folder src/test, then run:
```
python3 power_test.py
```
This script measures the running time of each individual query as well as the total time for running all 99 queries.
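The measurement itself amounts to timing each query in sequence and summing the results. A minimal sketch follows; the function name and the shape of the `queries` argument are assumptions and may differ from what power_test.py does.

```python
import time


def power_test(execute, queries):
    """Run queries one after another; return (per-query seconds, total seconds).

    `execute` runs one SQL string and should fetch the full result so the
    timing covers the whole query. `queries` maps a label to SQL text.
    """
    timings = {}
    for label, sql in queries.items():
        start = time.perf_counter()
        execute(sql)
        timings[label] = time.perf_counter() - start
    return timings, sum(timings.values())
```

With DuckDB, `execute` could be `lambda sql: con.execute(sql).fetchall()` for a connection `con` opened on scale_1.db, and `queries` could be built by reading the query_1.sql to query_99.sql files.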
Throughput test
```
python3 throughput_test.py
```
The script reports the throughput test time.
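In TPC-DS, the throughput test runs several query streams concurrently and measures the elapsed wall-clock time. The sketch below shows one way to do that with threads. It is an assumption about the general approach, not the contents of throughput_test.py, and the names are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def throughput_test(make_executor, streams):
    """Run each stream (a list of SQL strings) in its own thread.

    `make_executor` returns a fresh callable per stream that runs one SQL
    string. Returns (wall-clock seconds, queries completed per stream).
    """
    def run_stream(stream):
        execute = make_executor()
        for sql in stream:
            execute(sql)
        return len(stream)

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=len(streams)) as pool:
        completed = list(pool.map(run_stream, streams))
    elapsed = time.perf_counter() - start
    return elapsed, completed
```

With DuckDB, `make_executor` would typically create one cursor per stream via `con.cursor()`, since a single connection object should not be shared across threads. Each stream would hold the 99 queries in its own order.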
Maintenance test
Generate the refresh dataset with dsdgen (run from tpcds-kit/tools):
```
./dsdgen -scale 1 -dir "../../refreshed_data/scale_1" -suffix .dat -update 1 -verbose Y -force Y
```
Then 23 .dat files will be generated in the folder refreshed_data/scale_1.
Next, run the two scripts src/create_database/create_mtnc.py and src/load_data/load_mtnc.py to create the database and load the dataset again for the maintenance test.
Finally, go to folder src/test and run the test:
```
python3 maintenance_test.py
```
The script reports the time taken to run the maintenance functions.

