finalizing the repo
cole_foster@brown.edu committed Sep 3, 2024
1 parent 58d0be2 commit beb5a84
Showing 9 changed files with 253 additions and 9 deletions.
7 changes: 2 additions & 5 deletions .gitignore
@@ -1,8 +1,5 @@
data/
result/
results/
dev/
slurm*
run.sh
res.csv
result_*
run*
res*
38 changes: 37 additions & 1 deletion README.md
@@ -1 +1,37 @@
# Team HSP Submission to SISAP 2024 Indexing Challenge
# Submission to the SISAP 2024 Indexing Challenge
This repository is a submission to the [SISAP 2024 Indexing Challenge](https://sisap-challenges.github.io/) [1]. The implementation is submitted under the team name `HSP` and is accompanied by a short paper to appear in the SISAP Conference Proceedings:

> Cole Foster, Edgar Chavez, and Benjamin Kimia. "Top-Down Construction of Locally Monotonic Graphs for Similarity Search." In International Conference on Similarity Search and Applications, 2024.
#### Overview
This approach tests the performance of the HSP Graph, a monotonic graph, for graph-based similarity search. Since the exact HSP Graph has quadratic construction complexity, we instead focus on preserving monotonicity locally, i.e., on building a *locally monotonic graph*. This submission defines a top-down construction that builds a high-quality approximation of the HSP Graph, where the graph on each layer of the hierarchy facilitates the construction of the graph on the layer below. The approach leverages a hierarchical partitioning of the dataset: each layer contains a set of pivots sampled from the dataset according to a scaling factor `s`, and each layer is partitioned into the domains of the pivots on the layer above. The graph on each layer is constructed in a distributed fashion, where each node performs a graph-search over the coarser-layer graph to identify `p` nearby domains and then selects its neighbors from that region. The hyperparameters `s` and `p` play a major role in the overall index construction time and graph quality; a simplified sketch of the construction appears after the diagram below.

![main diagram](assets/main-diagram.png)
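
For illustration, here is a minimal brute-force sketch of the top-down construction in Python. The function names and the brute-force scans are illustrative stand-ins for the optimized C++ implementation in `src/` (and Euclidean distance stands in for the challenge's inner-product measure); this is not the repository's actual API:

~~~python
import numpy as np

def hsp_test(q, candidates, X):
    # HSP neighbor selection for point q: repeatedly promote the closest
    # remaining candidate to a neighbor, then discard every candidate that
    # lies closer to that neighbor than to the query
    active = sorted((c for c in candidates if c != q),
                    key=lambda c: np.linalg.norm(X[q] - X[c]))
    neighbors = []
    while active:
        n = active.pop(0)
        neighbors.append(n)
        active = [c for c in active
                  if np.linalg.norm(X[q] - X[c]) < np.linalg.norm(X[n] - X[c])]
    return neighbors

def build_hierarchy(X, s=10, p=3):
    # sample pivot layers at rate 1/s until the top layer is small
    rng = np.random.default_rng(0)
    layers = [np.arange(len(X))]
    while len(layers[-1]) > s:
        layers.append(rng.choice(layers[-1], len(layers[-1]) // s, replace=False))
    layers.reverse()  # layers[0] is now the coarsest

    # tiny top layer: exact HSP graph over its pivots
    graphs = [{int(q): hsp_test(int(q), layers[0].tolist(), X) for q in layers[0]}]
    for lvl in range(1, len(layers)):
        pivots = layers[lvl - 1]
        # partition this layer into the domains of the coarser pivots
        assign = {int(j): int(pivots[np.argmin(np.linalg.norm(X[j] - X[pivots], axis=1))])
                  for j in layers[lvl]}
        g = {}
        for q in layers[lvl]:
            # brute-force stand-in for graph-search over graphs[lvl - 1]:
            # find the p closest pivots, then restrict to their domains
            close = {int(v) for v in
                     sorted(pivots, key=lambda v: np.linalg.norm(X[q] - X[v]))[:p]}
            region = [j for j, v in assign.items() if v in close]
            g[int(q)] = hsp_test(int(q), region, X)
        graphs.append(g)
    return graphs[-1]  # bottom-layer graph covers the whole dataset
~~~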


#### The LAION Dataset
The LAION dataset [2] is a large-scale, open-source collection of text-image pairs. The challenge uses a subset of 100M text-image pairs from the English subset of the LAION dataset, represented as 768D floating-point vectors, with similarity measured by the inner product.

The dataset can be obtained from the [Indexing Challenge Website](https://sisap-challenges.github.io/2024/datasets/#public_queries__2).
The example script `search.py` automatically downloads the dataset and queries to the `data/` folder. This implementation uses the negative inner product as the distance measure, $d(x,y) = 1 - \langle x, y \rangle$.
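
As a quick illustration (not part of the repository), this distance can be computed directly with NumPy:

~~~python
import numpy as np

x = np.random.rand(768).astype(np.float32)
y = np.random.rand(768).astype(np.float32)

d = 1.0 - np.inner(x, y)  # d(x, y) = 1 - <x, y>
# minimizing d is equivalent to maximizing the inner product <x, y>
~~~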

#### Running
The source code is written in C++ and builds on the skeleton of [hnswlib](https://github.com/nmslib/hnswlib.git) for efficiency. The implementation also provides Python bindings; see the example in `search/search.py` for how to build the index and search over it.
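
A minimal usage sketch in the spirit of `search/search.py` might look like the following. The module name `GraphHierarchy` and the hnswlib-style method names (`init_index`, `add_items`, `knn_query`) are assumptions inferred from the sources in `src/`; consult `search/search.py` for the actual API:

~~~python
import numpy as np
import GraphHierarchy  # hypothetical module name; see search/search.py

# toy data in place of the LAION vectors
data = np.random.rand(10_000, 768).astype(np.float32)

index = GraphHierarchy.Index(space='ip', dim=768)  # inner-product space (assumed signature)
index.init_index(max_elements=data.shape[0])
index.add_items(data)

queries = np.random.rand(100, 768).astype(np.float32)
labels, distances = index.knn_query(queries, k=10)  # k nearest neighbors per query
~~~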

#### Evaluation
The `search.py` script automatically saves the results of the search to the `result/` directory. The performance may be evaluated using the scripts in the `eval/` directory, which are copied from the [SISAP 2023 evaluation repository](https://github.com/sisap-challenges/sisap23-laion-challenge-evaluation/tree/master).

The recall of all results can be measured with the `eval/eval.py` script, which aggregates them into a single CSV file, `res.csv`.
~~~bash
python3 eval/eval.py
~~~

The script `eval/plot.py` plots the recall-throughput performance for a single dataset size. It takes the results CSV and the dataset size as input and saves the plot as a PNG, e.g., `result_300K.png`.
~~~bash
python3 eval/plot.py res.csv --size 300K
~~~

#### References
> [1] Tellez, Eric S., Martin Aumüller, and Vladimir Mic. "Overview of the SISAP 2024 Indexing Challenge." In International Conference on Similarity Search and Applications. Cham: Springer Nature Switzerland, 2024.
>
> [2] Schuhmann, Christoph, Romain Beaumont, Richard Vencu, et al. "LAION-5B: An Open Large-Scale Dataset for Training Next Generation Image-Text Models." arXiv preprint arXiv:2210.08402, 2022.
Binary file added assets/main-diagram.png
84 changes: 84 additions & 0 deletions eval/eval.py
@@ -0,0 +1,84 @@
import h5py
import numpy as np
import os
import csv
from pathlib import Path
from urllib.request import urlretrieve

'''
Evaluation script copied from: https://github.com/sisap-challenges/sisap23-laion-challenge-evaluation.git
'''

data_directory = "data"
def download(src, dst):
if not os.path.exists(dst):
os.makedirs(Path(dst).parent, exist_ok=True)
print('downloading %s -> %s...' % (src, dst))
urlretrieve(src, dst)

def get_groundtruth(size="300K"):

# public queries gt file
url = f"http://ingeotec.mx/~sadit/sisap2024-data/gold-standard-dbsize={size}--public-queries-2024-laion2B-en-clip768v2-n=10k.h5"

# private queries gt file
#url = f"http://ingeotec.mx/~sadit/sisap2024-data/gold-standard-dbsize=100M--private-queries-2024-laion2B-en-clip768v2-n=10k-epsilon=0.2.h5"

out_fn = os.path.join(data_directory, f"groundtruth-{size}.h5")
download(url, out_fn)
gt_f = h5py.File(out_fn, "r")
true_I = np.array(gt_f['knns'])
gt_f.close()
return true_I

def get_all_results(dirname):
for root, _, files in os.walk(dirname):
for fn in files:
if os.path.splitext(fn)[-1] != ".h5":
continue
            try:
                # the context manager closes the file even if the consumer
                # stops iterating early
                with h5py.File(os.path.join(root, fn), "r") as f:
                    yield f
            except OSError:
                print("Unable to read", fn)

def get_recall(I, gt, k):
assert k <= I.shape[1]
assert len(I) == len(gt)

n = len(I)
recall = 0
for i in range(n):
recall += len(set(I[i, :k]) & set(gt[i, :k]))
return recall / (n * k)

def return_h5_str(f, param):
    # return the stored value as a string when present, else 0
    if param not in f:
        return 0
    x = f[param][()]
    if isinstance(x, np.bytes_):
        return x.decode()
    return x


if __name__ == "__main__":
true_I_cache = {}

columns = ["data", "size", "algo", "buildtime", "querytime", "params", "recall"]
with open('res.csv', 'w', newline='') as csvfile:
writer = csv.DictWriter(csvfile, fieldnames=columns)
writer.writeheader()
for res in get_all_results("result"):
        try:
            size = res.attrs["size"]
            d = dict(res.attrs)
        except KeyError:
            # fall back to datasets when metadata is not stored as attributes
            size = res["size"][()].decode()
            d = {k: return_h5_str(res, k) for k in columns}
if size not in true_I_cache:
true_I_cache[size] = get_groundtruth(size)
recall = get_recall(np.array(res["knns"]), true_I_cache[size], 10)
d['recall'] = recall
print(d["data"], d["algo"], d["params"], "=>", recall)
writer.writerow(d)
104 changes: 104 additions & 0 deletions eval/plot.py
@@ -0,0 +1,104 @@
# This is based on https://github.com/matsui528/annbench/blob/main/plot.py
import argparse
import csv
import matplotlib
matplotlib.use('agg')
import matplotlib.pyplot as plt
import sys
from itertools import cycle

'''
Evaluation script copied from: https://github.com/sisap-challenges/sisap23-laion-challenge-evaluation.git
'''

marker = cycle(('p', '^', 'h', 'x', 'o', 's', '*', '+', 'D', '1', 'X'))
linestyle = cycle((':', '-', '--'))

def draw(lines, xlabel, ylabel, title, filename, with_ctrl, width, height):
"""
Visualize search results and save them as an image
Args:
lines (list): search results. list of dict.
xlabel (str): label of x-axis, usually "recall"
ylabel (str): label of y-axis, usually "query per sec"
title (str): title of the result_img
filename (str): output file name of image
with_ctrl (bool): show control parameters or not
width (int): width of the figure
height (int): height of the figure
"""
plt.figure(figsize=(width, height))

for line in lines:
for key in ["xs", "ys", "label", "ctrls"]:
assert key in line

for line in lines:
plt.plot(line["xs"], line["ys"], label=line["label"], marker=next(marker), linestyle=next(linestyle))
if with_ctrl:
for x, y, ctrl in zip(line["xs"], line["ys"], line["ctrls"]):
plt.annotate(text=str(ctrl), xy=(x, y),
xytext=(x, y+50))

plt.xlabel(xlabel)
plt.ylabel(ylabel)
plt.grid(which="both")
plt.yscale("log")
plt.legend(bbox_to_anchor=(1.05, 1.0), loc="upper left")
plt.title(title)
plt.savefig(filename, bbox_inches='tight')
plt.cla()

def get_pareto_frontier(line):
    # sort points by QPS (descending), then walk down, keeping each point
    # that improves recall; the kept points form the pareto frontier
    data = sorted(zip(line["ys"], line["xs"], line["ctrls"]), reverse=True)
    line["xs"] = []
    line["ys"] = []
    line["ctrls"] = []

    cur = 0
    for y, x, label in data:
        if x > cur:
            cur = x
            line["xs"].append(x)
            line["ys"].append(y)
            line["ctrls"].append(label)

    return line

if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("csvfile")
parser.add_argument(
"--size",
default="100K"
)
args = parser.parse_args()

with open(args.csvfile, newline="") as csvfile:
reader = csv.DictReader(csvfile)
data = list(reader)

lines = {}
for res in data:
if res["size"] != args.size:
continue
dataset = res["data"]
algo = res["algo"]
label = dataset + algo
if label not in lines:
lines[label] = {
"xs": [],
"ys": [],
"ctrls": [],
"label": label,
}
lines[label]["xs"].append(float(res["recall"]))
lines[label]["ys"].append(10000/float(res["querytime"])) # FIX query size hardcoded
        try:
            run_identifier = res["params"].split("query=")[1]
        except IndexError:
            run_identifier = res["params"]
lines[label]["ctrls"].append(run_identifier)

draw([get_pareto_frontier(line) for line in lines.values()],
"Recall", "QPS (1/s)", "Result", f"result_{args.size}.png", False, 10, 8)
1 change: 0 additions & 1 deletion src/.gitignore
@@ -1,4 +1,3 @@
hnswlib.egg-info/
GraphHierarchy.egg-info/
build/
dist/
16 changes: 14 additions & 2 deletions src/GraphHierarchy/graph-hierarchy.hpp
@@ -1,4 +1,16 @@
#pragma once

/**
* @file graph-hierarchy.hpp
* @author Cole Foster (cole_foster@brown.edu)
* @date 2024-09-03
*
* Submission to the SISAP 2024 Indexing Challenge
* This implementation uses many of the performance optimizations from HNSWLIB (https://github.com/nmslib/hnswlib.git)
* - vectorization of distance computations in distances.h
* - graph and dataset stored together for better cache locality
* - cache prefetching in search function
*/
#include <omp.h>

#include <algorithm>
@@ -266,6 +278,7 @@ class GraphHierarchy {
}

// perform the hsp test to get the hsp neighbors of the node
// use maxK to constrain the hsp test to the maxK closest elements in the set
std::vector<uint> HSPTest(uint const query, std::vector<uint> const& set, int maxK = 0) const {
std::vector<uint> neighbors{};
char* queryPtr = getDataByIndex(query);
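        // the HSP test, in brief: repeatedly promote the closest remaining
        // element of `set` to a neighbor, then discard every element that lies
        // closer to that neighbor than to the query; maxK > 0 restricts the
        // test to the maxK closest elements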
@@ -519,7 +532,7 @@ class GraphHierarchy {
|=======================================================================================================================
||
||
|| APPROXIMATE GRAPH CONSTRUCTION
|| TOP-DOWN GRAPH CONSTRUCTION
||
||
|=======================================================================================================================
@@ -983,7 +996,6 @@ class GraphHierarchy {
// - perform search using graph
std::priority_queue<std::pair<float, uint>> search(const void* query_ptr_v, int k = 1) {
char* query_ptr = (char*) query_ptr_v;
// if (num_levels_ == 0) std::runtime_error("No hierarchy initialized\n");

// find entry-point in graph by top-down traversal of hierarchical partitioning
// tStart = std::chrono::high_resolution_clock::now();
11 changes: 11 additions & 0 deletions src/GraphHierarchy/visited_list_pool.h
@@ -1,5 +1,16 @@
#pragma once

/*
|=======================================================================================================================
||
|| VISITED LIST POOL
||
|| * DIRECTLY OBTAINED FROM HNSWLIB (https://github.com/nmslib/hnswlib.git)
|| * used as a visited list during graph search
|| * handles multithreading without initializing more arrays than necessary
||
|=======================================================================================================================
*/
#include <mutex>
#include <string.h>
#include <deque>
Expand Down
1 change: 1 addition & 0 deletions src/python_bindings/bindings.cpp
@@ -153,6 +153,7 @@ class Index {
}
alg_ = NULL;
num_threads_default = std::thread::hardware_concurrency();
printf(" * default num threads: %d\n", num_threads_default);
}

~Index() {
