finalizing the repo
cole_foster@brown.edu committed Sep 3, 2024
1 parent 58d0be2 commit beb5a84
Showing 9 changed files with 253 additions and 9 deletions.
7 changes: 2 additions & 5 deletions .gitignore
@@ -1,8 +1,5 @@
data/
result/
results/
dev/
slurm*
run.sh
res.csv
result_*
run*
res*
38 changes: 37 additions & 1 deletion README.md
@@ -1 +1,37 @@
# Team HSP Submission to SISAP 2024 Indexing Challenge
# Submission to the SISAP 2024 Indexing Challenge
This repository is a submission to the [SISAP 2024 Indexing Challenge](https://sisap-challenges.github.io/) [1]. The implementation is submitted under the team name `HSP` and is accompanied by a short paper to appear in the SISAP Conference Proceedings:

> Cole Foster, Edgar Chavez, and Benjamin Kimia. "Top-Down Construction of Locally Monotonic Graphs for Similarity Search." In International Conference on Similarity Search and Applications, 2024.
#### Overview
This approach tests the performance of the HSP Graph, a monotonic graph, for graph-based similarity search. Since the exact HSP Graph has quadratic construction complexity, we instead focus on preserving monotonicity locally, i.e., on building a *locally monotonic graph*. This submission defines a top-down construction that builds a high-quality approximation of the HSP Graph, where the graph on each layer of the hierarchy facilitates the construction of the graph on the layer below. The approach leverages a hierarchical partitioning of the dataset: each layer contains a set of pivots sampled from the dataset according to a scaling factor `s`, and each layer is partitioned into the domains of the pivots on the layer above. The graph on each layer is constructed in a distributed fashion, where each node performs a graph-search over the coarser-layer graph to identify `p` nearby domains and then selects its neighbors from that region. The hyperparameters `s` and `p` play a major role in the overall index construction time and graph quality; a simplified sketch of the construction appears after the diagram below.

![main diagram](assets/main-diagram.png)
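
For illustration, here is a minimal brute-force sketch of the top-down construction in Python. The function names and the brute-force scans are illustrative stand-ins for the optimized C++ implementation in `src/` (and Euclidean distance stands in for the challenge's inner-product measure); this is not the repository's actual API:

~~~python
import numpy as np

def hsp_test(q, candidates, X):
    # HSP neighbor selection for point q: repeatedly promote the closest
    # remaining candidate to a neighbor, then discard every candidate that
    # lies closer to that neighbor than to the query
    active = sorted((c for c in candidates if c != q),
                    key=lambda c: np.linalg.norm(X[q] - X[c]))
    neighbors = []
    while active:
        n = active.pop(0)
        neighbors.append(n)
        active = [c for c in active
                  if np.linalg.norm(X[q] - X[c]) < np.linalg.norm(X[n] - X[c])]
    return neighbors

def build_hierarchy(X, s=10, p=3):
    # sample pivot layers at rate 1/s until the top layer is small
    rng = np.random.default_rng(0)
    layers = [np.arange(len(X))]
    while len(layers[-1]) > s:
        layers.append(rng.choice(layers[-1], len(layers[-1]) // s, replace=False))
    layers.reverse()  # layers[0] is now the coarsest

    # tiny top layer: exact HSP graph over its pivots
    graphs = [{int(q): hsp_test(int(q), layers[0].tolist(), X) for q in layers[0]}]
    for lvl in range(1, len(layers)):
        pivots = layers[lvl - 1]
        # partition this layer into the domains of the coarser pivots
        assign = {int(j): int(pivots[np.argmin(np.linalg.norm(X[j] - X[pivots], axis=1))])
                  for j in layers[lvl]}
        g = {}
        for q in layers[lvl]:
            # brute-force stand-in for graph-search over graphs[lvl - 1]:
            # find the p closest pivots, then restrict to their domains
            close = {int(v) for v in
                     sorted(pivots, key=lambda v: np.linalg.norm(X[q] - X[v]))[:p]}
            region = [j for j, v in assign.items() if v in close]
            g[int(q)] = hsp_test(int(q), region, X)
        graphs.append(g)
    return graphs[-1]  # bottom-layer graph covers the whole dataset
~~~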


#### The LAION Dataset
The LAION dataset [2] is a large-scale, open-source collection of text-image pairs. The challenge uses a subset of 100M text-image pairs from the English subset of the LAION dataset, represented as 768D floating-point vectors, with similarity measured by the inner product.

The dataset can be obtained from the [Indexing Challenge Website](https://sisap-challenges.github.io/2024/datasets/#public_queries__2).
The example script `search.py` automatically downloads the dataset and queries to the `data/` folder. This implementation uses the negative inner product as the distance measure, $d(x,y) = 1 - \langle x, y \rangle$.
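
As a quick illustration (not part of the repository), this distance can be computed directly with NumPy:

~~~python
import numpy as np

x = np.random.rand(768).astype(np.float32)
y = np.random.rand(768).astype(np.float32)

d = 1.0 - np.inner(x, y)  # d(x, y) = 1 - <x, y>
# minimizing d is equivalent to maximizing the inner product <x, y>
~~~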

#### Running
The source code is written in C++ and builds on the skeleton of [hnswlib](https://github.com/nmslib/hnswlib.git) for efficiency. The implementation also provides Python bindings; see the example in `search/search.py` for how to build the index and search over it.
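
A minimal usage sketch in the spirit of `search/search.py` might look like the following. The module name `GraphHierarchy` and the hnswlib-style method names (`init_index`, `add_items`, `knn_query`) are assumptions inferred from the sources in `src/`; consult `search/search.py` for the actual API:

~~~python
import numpy as np
import GraphHierarchy  # hypothetical module name; see search/search.py

# toy data in place of the LAION vectors
data = np.random.rand(10_000, 768).astype(np.float32)

index = GraphHierarchy.Index(space='ip', dim=768)  # inner-product space (assumed signature)
index.init_index(max_elements=data.shape[0])
index.add_items(data)

queries = np.random.rand(100, 768).astype(np.float32)
labels, distances = index.knn_query(queries, k=10)  # k nearest neighbors per query
~~~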

#### Evaluation
The `search.py` script automatically saves the results of the search to the `result/` directory. The performance may be evaluated using the scripts in the `eval/` directory, which are copied from the [SISAP 2023 evaluation repository](https://github.com/sisap-challenges/sisap23-laion-challenge-evaluation/tree/master).

The recall of all results can be measured with the `eval/eval.py` script, which aggregates them into a single CSV file, `res.csv`.
~~~bash
python3 eval/eval.py
~~~

The script `eval/plot.py` plots the recall-throughput performance for a single dataset size. It takes the results CSV and the dataset size as input and saves the plot as a PNG, e.g., `result_300K.png`.
~~~bash
python3 eval/plot.py res.csv --size 300K
~~~

#### References
> [1] Tellez, Eric S., Martin Aumüller, and Vladimir Mic. "Overview of the SISAP 2024 Indexing Challenge." In International Conference on Similarity Search and Applications. Cham: Springer Nature Switzerland, 2024.
>
> [2] Schuhmann, Christoph, Romain Beaumont, Richard Vencu, et al. "LAION-5B: An Open Large-Scale Dataset for Training Next Generation Image-Text Models." arXiv preprint arXiv:2210.08402, 2022.
Binary file added assets/main-diagram.png
84 changes: 84 additions & 0 deletions eval/eval.py
@@ -0,0 +1,84 @@
import h5py
import numpy as np
import os
import csv
from pathlib import Path
from urllib.request import urlretrieve

'''
Evaluation script copied from: https://github.com/sisap-challenges/sisap23-laion-challenge-evaluation.git
'''

data_directory = "data"
def download(src, dst):
if not os.path.exists(dst):
os.makedirs(Path(dst).parent, exist_ok=True)
print('downloading %s -> %s...' % (src, dst))
urlretrieve(src, dst)

def get_groundtruth(size="300K"):

# public queries gt file
url = f"http://ingeotec.mx/~sadit/sisap2024-data/gold-standard-dbsize={size}--public-queries-2024-laion2B-en-clip768v2-n=10k.h5"

# private queries gt file
#url = f"http://ingeotec.mx/~sadit/sisap2024-data/gold-standard-dbsize=100M--private-queries-2024-laion2B-en-clip768v2-n=10k-epsilon=0.2.h5"

out_fn = os.path.join(data_directory, f"groundtruth-{size}.h5")
download(url, out_fn)
gt_f = h5py.File(out_fn, "r")
true_I = np.array(gt_f['knns'])
gt_f.close()
return true_I

def get_all_results(dirname):
for root, _, files in os.walk(dirname):
for fn in files:
if os.path.splitext(fn)[-1] != ".h5":
continue
            try:
                # the context manager closes the file even if the consumer
                # stops iterating early
                with h5py.File(os.path.join(root, fn), "r") as f:
                    yield f
            except OSError:
                print("Unable to read", fn)

def get_recall(I, gt, k):
assert k <= I.shape[1]
assert len(I) == len(gt)

n = len(I)
recall = 0
for i in range(n):
recall += len(set(I[i, :k]) & set(gt[i, :k]))
return recall / (n * k)

def return_h5_str(f, param):
    # return the stored value as a string when present, else 0
    if param not in f:
        return 0
    x = f[param][()]
    if isinstance(x, np.bytes_):
        return x.decode()
    return x


if __name__ == "__main__":
true_I_cache = {}

columns = ["data", "size", "algo", "buildtime", "querytime", "params", "recall"]
with open('res.csv', 'w', newline='') as csvfile:
writer = csv.DictWriter(csvfile, fieldnames=columns)
writer.writeheader()
for res in get_all_results("result"):
        try:
            size = res.attrs["size"]
            d = dict(res.attrs)
        except KeyError:
            # fall back to datasets when metadata is not stored as attributes
            size = res["size"][()].decode()
            d = {k: return_h5_str(res, k) for k in columns}
if size not in true_I_cache:
true_I_cache[size] = get_groundtruth(size)
recall = get_recall(np.array(res["knns"]), true_I_cache[size], 10)
d['recall'] = recall
print(d["data"], d["algo"], d["params"], "=>", recall)
writer.writerow(d)
104 changes: 104 additions & 0 deletions eval/plot.py
@@ -0,0 +1,104 @@
# This is based on https://github.com/matsui528/annbench/blob/main/plot.py
import argparse
import csv
import matplotlib
matplotlib.use('agg')
import matplotlib.pyplot as plt
import sys
from itertools import cycle

'''
Evaluation script copied from: https://github.com/sisap-challenges/sisap23-laion-challenge-evaluation.git
'''

marker = cycle(('p', '^', 'h', 'x', 'o', 's', '*', '+', 'D', '1', 'X'))
linestyle = cycle((':', '-', '--'))

def draw(lines, xlabel, ylabel, title, filename, with_ctrl, width, height):
"""
Visualize search results and save them as an image
Args:
lines (list): search results. list of dict.
xlabel (str): label of x-axis, usually "recall"
ylabel (str): label of y-axis, usually "query per sec"
title (str): title of the result_img
filename (str): output file name of image
with_ctrl (bool): show control parameters or not
width (int): width of the figure
height (int): height of the figure
"""
plt.figure(figsize=(width, height))

for line in lines:
for key in ["xs", "ys", "label", "ctrls"]:
assert key in line

for line in lines:
plt.plot(line["xs"], line["ys"], label=line["label"], marker=next(marker), linestyle=next(linestyle))
if with_ctrl:
for x, y, ctrl in zip(line["xs"], line["ys"], line["ctrls"]):
plt.annotate(text=str(ctrl), xy=(x, y),
xytext=(x, y+50))

plt.xlabel(xlabel)
plt.ylabel(ylabel)
plt.grid(which="both")
plt.yscale("log")
plt.legend(bbox_to_anchor=(1.05, 1.0), loc="upper left")
plt.title(title)
plt.savefig(filename, bbox_inches='tight')
plt.cla()

def get_pareto_frontier(line):
    # sort points by QPS (descending), then walk down, keeping each point
    # that improves recall; the kept points form the pareto frontier
    data = sorted(zip(line["ys"], line["xs"], line["ctrls"]), reverse=True)
    line["xs"] = []
    line["ys"] = []
    line["ctrls"] = []

    cur = 0
    for y, x, label in data:
        if x > cur:
            cur = x
            line["xs"].append(x)
            line["ys"].append(y)
            line["ctrls"].append(label)

    return line

if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("csvfile")
parser.add_argument(
"--size",
default="100K"
)
args = parser.parse_args()

with open(args.csvfile, newline="") as csvfile:
reader = csv.DictReader(csvfile)
data = list(reader)

lines = {}
for res in data:
if res["size"] != args.size:
continue
dataset = res["data"]
algo = res["algo"]
label = dataset + algo
if label not in lines:
lines[label] = {
"xs": [],
"ys": [],
"ctrls": [],
"label": label,
}
lines[label]["xs"].append(float(res["recall"]))
lines[label]["ys"].append(10000/float(res["querytime"])) # FIX query size hardcoded
        try:
            run_identifier = res["params"].split("query=")[1]
        except IndexError:
            run_identifier = res["params"]
lines[label]["ctrls"].append(run_identifier)

draw([get_pareto_frontier(line) for line in lines.values()],
"Recall", "QPS (1/s)", "Result", f"result_{args.size}.png", False, 10, 8)
1 change: 0 additions & 1 deletion src/.gitignore
@@ -1,4 +1,3 @@
hnswlib.egg-info/
GraphHierarchy.egg-info/
build/
dist/
16 changes: 14 additions & 2 deletions src/GraphHierarchy/graph-hierarchy.hpp
@@ -1,4 +1,16 @@
#pragma once

/**
* @file graph-hierarchy.hpp
* @author Cole Foster (cole_foster@brown.edu)
* @date 2024-09-03
*
* Submission to the SISAP 2024 Indexing Challenge
* This implementation uses many of the performance optimizations from HNSWLIB (https://github.com/nmslib/hnswlib.git)
* - vectorization of distance computations in distances.h
* - graph and dataset stored together for better cache locality
* - cache prefetching in search function
*/
#include <omp.h>

#include <algorithm>
@@ -266,6 +278,7 @@ class GraphHierarchy {
}

// perform the hsp test to get the hsp neighbors of the node
// use maxK to constrain the hsp test to the maxK closest elements in the set
std::vector<uint> HSPTest(uint const query, std::vector<uint> const& set, int maxK = 0) const {
std::vector<uint> neighbors{};
char* queryPtr = getDataByIndex(query);
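        // the HSP test, in brief: repeatedly promote the closest remaining
        // element of `set` to a neighbor, then discard every element that lies
        // closer to that neighbor than to the query; maxK > 0 restricts the
        // test to the maxK closest elements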
@@ -519,7 +532,7 @@ class GraphHierarchy {
|=======================================================================================================================
||
||
|| APPROXIMATE GRAPH CONSTRUCTION
|| TOP-DOWN GRAPH CONSTRUCTION
||
||
|=======================================================================================================================
@@ -983,7 +996,6 @@ class GraphHierarchy {
// - perform search using graph
std::priority_queue<std::pair<float, uint>> search(const void* query_ptr_v, int k = 1) {
char* query_ptr = (char*) query_ptr_v;
// if (num_levels_ == 0) std::runtime_error("No hierarchy initialized\n");

// find entry-point in graph by top-down traversal of hierarchical partitioning
// tStart = std::chrono::high_resolution_clock::now();
11 changes: 11 additions & 0 deletions src/GraphHierarchy/visited_list_pool.h
@@ -1,5 +1,16 @@
#pragma once

/*
|=======================================================================================================================
||
|| VISITED LIST POOL
||
|| * DIRECTLY OBTAINED FROM HNSWLIB (https://github.com/nmslib/hnswlib.git)
|| * used as a visited list during graph search
|| * handles multithreading without initializing more arrays than necessary
||
|=======================================================================================================================
*/
#include <mutex>
#include <string.h>
#include <deque>
Expand Down
1 change: 1 addition & 0 deletions src/python_bindings/bindings.cpp
@@ -153,6 +153,7 @@ class Index {
}
alg_ = NULL;
num_threads_default = std::thread::hardware_concurrency();
printf(" * default num threads: %d\n", num_threads_default);
}

~Index() {
