Commit beb5a84 (parent 58d0be2)
cole_foster@brown.edu committed Sep 3, 2024
Showing 9 changed files with 253 additions and 9 deletions.
#### `.gitignore`

```diff
@@ -1,8 +1,5 @@
 data/
-result/
-results/
 dev/
 slurm*
-run.sh
-res.csv
-result_*
+run*
+res*
```
#### `README.md`

```diff
@@ -1 +1,37 @@
-# Team HSP Submission to SISAP 2024 Indexing Challenge
+# Submission to the SISAP 2024 Indexing Challenge
```
This repository is a submission to the [SISAP 2024 Indexing Challenge](https://sisap-challenges.github.io/) [1]. The implementation is submitted under the team name `HSP` and is accompanied by a short paper to appear in the SISAP Conference Proceedings:

> Cole Foster, Edgar Chavez, and Benjamin Kimia. "Top-Down Construction of Locally Monotonic Graphs for Similarity Search." In International Conference on Similarity Search and Applications, 2024.
#### Overview

This approach tests the performance of the HSP Graph, a monotonic graph, for graph-based similarity search. Since the exact HSP Graph has quadratic construction complexity, we instead focus on preserving monotonicity locally, i.e., on building a *locally monotonic graph*. This submission defines a top-down construction that produces a high-quality approximation of the HSP Graph, where the graph on each layer of a hierarchy facilitates the construction of the graph on the layer below. The approach leverages a hierarchical partitioning of the dataset: each layer contains a set of pivots sampled from the dataset according to a scaling factor `s`, and each layer is partitioned into the domains of the pivots on the layer above. The graph on each layer undergoes a distributed construction, where each node performs graph-search over the coarser-layer graph to identify `p` nearby domains and then selects its neighbors from that region. The hyperparameters `s` and `p` play a major role in the overall index construction time and graph quality.
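As a rough illustration, the half-space proximal (HSP) test that motivates this construction can be sketched in a few lines of Python. This is a minimal sketch of the standard HSP neighbor rule as described in the similarity-search literature, not the repository's C++ implementation, and the function and variable names are illustrative:

~~~python
def hsp_neighbors(x, candidates, dist):
    """Sketch of the half-space proximal (HSP) test for a point x.

    Greedily accept the nearest remaining candidate as a neighbor, then
    discard every candidate that is closer to that neighbor than to x.
    `candidates` should not contain x itself; `dist` is any dissimilarity,
    e.g. the negative inner product described below.
    """
    active = list(candidates)
    neighbors = []
    while active:
        y = min(active, key=lambda c: dist(x, c))   # nearest remaining point
        neighbors.append(y)
        # keep only points that remain closer to x than to the new neighbor y
        active = [c for c in active if dist(x, c) < dist(y, c)]
    return neighbors
~~~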
#### The LAION Dataset

The LAION dataset [2] is a large-scale, open-source collection of text-image pairs. The challenge uses a subset of 100M text-image pairs from the English subset of the LAION dataset, and extracts 768D floating-point vectors to serve as the dataset, with similarity measured by the inner product.

The dataset can be obtained from the [Indexing Challenge Website](https://sisap-challenges.github.io/2024/datasets/#public_queries__2). The example script `search.py` automatically downloads the dataset and queries to the `data/` folder. This implementation uses the negative inner product as a distance measure, $d(x,y) = 1 - \langle x, y \rangle$.
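For concreteness, this distance is straightforward to compute with NumPy (a minimal sketch; the vectors here are random placeholders):

~~~python
import numpy as np

def neg_inner_product(x, y):
    """d(x, y) = 1 - <x, y>, the distance used by this submission."""
    return 1.0 - np.dot(x, y)

# random placeholder vectors standing in for 768D embeddings
x = np.random.rand(768).astype(np.float32)
y = np.random.rand(768).astype(np.float32)
print(neg_inner_product(x, y))
~~~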
#### Running

The source code is written in C++ and leverages the skeleton of [hnswlib](https://github.com/nmslib/hnswlib.git) to maximize efficiency. The implementation also has Python bindings. See the example in `search/search.py` for how to build the index and search over it.
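The sketch below shows roughly what usage might look like. The module name `GraphHierarchy` is an assumption (inferred from the bindings' `.gitignore`), and the constructor and method names are hypothetical, so defer to `search/search.py` for the actual API:

~~~python
import numpy as np
import GraphHierarchy  # assumed module name; see search/search.py for the real API

# placeholder data: database and query vectors
X = np.random.rand(10_000, 768).astype(np.float32)
Q = np.random.rand(100, 768).astype(np.float32)

# hypothetical API in the style of hnswlib, which this implementation builds on
index = GraphHierarchy.Index(space="ip", dim=768)  # inner-product space
index.build(X)                                     # s and p would be set here
labels, distances = index.search(Q, k=10)          # top-10 ids and distances
~~~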
#### Evaluation

The `search.py` script automatically saves the results of the search to the `result/` directory. The performance may be evaluated using the scripts in the `eval/` directory, which are copied from [here](https://github.com/sisap-challenges/sisap23-laion-challenge-evaluation/tree/master).

The recall of all results can be measured using the `eval/eval.py` script, which collects them into a single CSV file, `res.csv`:

~~~bash
python3 eval/eval.py
~~~

The script `eval/plot.py` plots the recall-throughput performance for a single dataset size. It takes the result CSV and the dataset size as input and saves the plot as a PNG, e.g., `result_300K.png`:

~~~bash
python3 eval/plot.py res.csv --size 300K
~~~

#### References

> [1] Tellez, Eric S., Martin Aumüller, and Vladimir Mic. "Overview of the SISAP 2024 Indexing Challenge." In International Conference on Similarity Search and Applications. Cham: Springer Nature Switzerland, 2024.
>
> [2] Schuhmann, C., Beaumont, R., Vencu, R., Gordon, C., Wightman, R., Cherti, M., ... & Jitsev, J. (2022). LAION-5B: An open large-scale dataset for training next generation image-text models. arXiv preprint arXiv:2210.08402.
#### `eval/eval.py` (new file, 84 lines)
```python
import h5py
import numpy as np
import os
import csv
from pathlib import Path
from urllib.request import urlretrieve

'''
Evaluation script copied from: https://github.com/sisap-challenges/sisap23-laion-challenge-evaluation.git
'''

data_directory = "data"

def download(src, dst):
    if not os.path.exists(dst):
        os.makedirs(Path(dst).parent, exist_ok=True)
        print('downloading %s -> %s...' % (src, dst))
        urlretrieve(src, dst)

def get_groundtruth(size="300K"):
    # public queries gt file
    url = f"http://ingeotec.mx/~sadit/sisap2024-data/gold-standard-dbsize={size}--public-queries-2024-laion2B-en-clip768v2-n=10k.h5"

    # private queries gt file
    #url = f"http://ingeotec.mx/~sadit/sisap2024-data/gold-standard-dbsize=100M--private-queries-2024-laion2B-en-clip768v2-n=10k-epsilon=0.2.h5"

    out_fn = os.path.join(data_directory, f"groundtruth-{size}.h5")
    download(url, out_fn)
    gt_f = h5py.File(out_fn, "r")
    true_I = np.array(gt_f['knns'])
    gt_f.close()
    return true_I

def get_all_results(dirname):
    # walk the results directory and yield every readable HDF5 file
    for root, _, files in os.walk(dirname):
        for fn in files:
            if os.path.splitext(fn)[-1] != ".h5":
                continue
            try:
                f = h5py.File(os.path.join(root, fn), "r")
                yield f
                f.close()
            except OSError:
                print("Unable to read", fn)

def get_recall(I, gt, k):
    # fraction of the true k nearest neighbors recovered, averaged over queries
    assert k <= I.shape[1]
    assert len(I) == len(gt)

    n = len(I)
    recall = 0
    for i in range(n):
        recall += len(set(I[i, :k]) & set(gt[i, :k]))
    return recall / (n * k)

def return_h5_str(f, param):
    if param not in f:
        return 0
    x = f[param][()]
    if type(x) == np.bytes_:
        return x.decode()
    return x

if __name__ == "__main__":
    true_I_cache = {}

    columns = ["data", "size", "algo", "buildtime", "querytime", "params", "recall"]
    with open('res.csv', 'w', newline='') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=columns)
        writer.writeheader()
        for res in get_all_results("result"):
            try:
                # metadata stored as HDF5 attributes
                size = res.attrs["size"]
                d = dict(res.attrs)
            except KeyError:
                # fall back to metadata stored as datasets
                size = res["size"][()].decode()
                d = {k: return_h5_str(res, k) for k in columns}
            if size not in true_I_cache:
                true_I_cache[size] = get_groundtruth(size)
            recall = get_recall(np.array(res["knns"]), true_I_cache[size], 10)
            d['recall'] = recall
            print(d["data"], d["algo"], d["params"], "=>", recall)
            writer.writerow(d)
```
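For reference, a result file that this script can read would look roughly like the following. This is a sketch inferred from the reads above; the attribute names match the CSV columns, while the values and file path are illustrative:

```python
import os
import h5py
import numpy as np

os.makedirs("result", exist_ok=True)

# hypothetical search output: 10,000 queries x 10 neighbor ids
knns = np.random.randint(0, 300_000, size=(10_000, 10))

with h5py.File("result/example-run.h5", "w") as f:
    f.create_dataset("knns", data=knns)
    # metadata read back by eval.py (values illustrative)
    f.attrs["data"] = "laion2B-en-clip768v2"
    f.attrs["size"] = "300K"
    f.attrs["algo"] = "HSP"
    f.attrs["buildtime"] = 123.4
    f.attrs["querytime"] = 5.6
    f.attrs["params"] = "example-params"
```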
#### `eval/plot.py` (new file, 104 lines)
```python
# This is based on https://github.com/matsui528/annbench/blob/main/plot.py
import argparse
import csv
import matplotlib
matplotlib.use('agg')
import matplotlib.pyplot as plt
from itertools import cycle

'''
Evaluation script copied from: https://github.com/sisap-challenges/sisap23-laion-challenge-evaluation.git
'''

marker = cycle(('p', '^', 'h', 'x', 'o', 's', '*', '+', 'D', '1', 'X'))
linestyle = cycle((':', '-', '--'))

def draw(lines, xlabel, ylabel, title, filename, with_ctrl, width, height):
    """
    Visualize search results and save them as an image
    Args:
        lines (list): search results. list of dict.
        xlabel (str): label of x-axis, usually "recall"
        ylabel (str): label of y-axis, usually "query per sec"
        title (str): title of the result_img
        filename (str): output file name of image
        with_ctrl (bool): show control parameters or not
        width (int): width of the figure
        height (int): height of the figure
    """
    plt.figure(figsize=(width, height))

    for line in lines:
        for key in ["xs", "ys", "label", "ctrls"]:
            assert key in line

    for line in lines:
        plt.plot(line["xs"], line["ys"], label=line["label"],
                 marker=next(marker), linestyle=next(linestyle))
        if with_ctrl:
            for x, y, ctrl in zip(line["xs"], line["ys"], line["ctrls"]):
                plt.annotate(text=str(ctrl), xy=(x, y), xytext=(x, y + 50))

    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.grid(which="both")
    plt.yscale("log")
    plt.legend(bbox_to_anchor=(1.05, 1.0), loc="upper left")
    plt.title(title)
    plt.savefig(filename, bbox_inches='tight')
    plt.cla()

def get_pareto_frontier(line):
    # keep only the points on the recall/QPS Pareto frontier: scan from
    # highest QPS down, keeping each point that improves on the best recall
    data = sorted(zip(line["ys"], line["xs"], line["ctrls"]), reverse=True)
    line["xs"] = []
    line["ys"] = []
    line["ctrls"] = []

    cur = 0
    for y, x, label in data:
        if x > cur:
            cur = x
            line["xs"].append(x)
            line["ys"].append(y)
            line["ctrls"].append(label)

    return line

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("csvfile")
    parser.add_argument("--size", default="100K")
    args = parser.parse_args()

    with open(args.csvfile, newline="") as csvfile:
        reader = csv.DictReader(csvfile)
        data = list(reader)

    lines = {}
    for res in data:
        if res["size"] != args.size:
            continue
        dataset = res["data"]
        algo = res["algo"]
        label = dataset + algo
        if label not in lines:
            lines[label] = {
                "xs": [],
                "ys": [],
                "ctrls": [],
                "label": label,
            }
        lines[label]["xs"].append(float(res["recall"]))
        lines[label]["ys"].append(10000 / float(res["querytime"]))  # FIX query size hardcoded
        try:
            run_identifier = res["params"].split("query=")[1]
        except IndexError:
            run_identifier = res["params"]
        lines[label]["ctrls"].append(run_identifier)

    draw([get_pareto_frontier(line) for line in lines.values()],
         "Recall", "QPS (1/s)", "Result", f"result_{args.size}.png", False, 10, 8)
```
#### `.gitignore` (Python bindings)

```diff
@@ -1,4 +1,3 @@
-hnswlib.egg-info/
+GraphHierarchy.egg-info/
 build/
 dist/
-
```