Research · 2025
FreeThrow
Learning-Augmented Vector Search · Harvard University · co-authored with Milad Rezaei Hajidehi
Advisor
Michael Mitzenmacher
Stack
C++ · Python · FAISS
Result
2–3× throughput · New Pareto frontier
Status
Preparing submission
HNSW — fixed entry · arbitrary start
FreeThrow — learned portal · query-adaptive start

Both methods search the same HNSW graph. HNSW enters at a fixed global vertex (top layer) and descends through all layers — many hops. FreeThrow's MLP maps the query embedding to a portal: a node in the dense base layer adjacent to the answer, reducing traversal to a single hop.

§ 1 The Bottleneck
Problem
50–97%
Root Cause
FreeThrow

Most retrieval systems work the same way: an embedding model turns a query into a vector, and a vector index finds the nearest stored neighbors. The embedding model is well-studied and heavily optimized. The index is usually an afterthought.

When I started profiling these pipelines, the numbers were striking — vector search was eating 50–97% of total runtime, sometimes 10× slower than the embedding step it was supposed to complement.

The culprit is how HNSW, the de facto standard, begins every search: from the same fixed node, every time, regardless of the query. It then walks the graph greedily until it converges. The walk is serial, can't use a GPU, and its length is entirely determined by how unlucky that starting point was.

The idea behind FreeThrow was simple: what if we just learned where to start? We train a small MLP to predict a good entry point — a portal — for each query, running it on the same GPU right after embedding. In practice it adds under 1% overhead and cuts traversal length by 2–3×.
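The predictor's shape can be sketched in a few lines of NumPy. This is an illustrative stand-in, not the trained model: the input dimension d=960 and the 10,000-portal output match the numbers reported later for ImageNet/SIFT1M, while the hidden width h=256 and the random weights are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, h, p_max = 960, 256, 10_000   # d and p_max from the text; h = 256 is an assumption

# Randomly initialized weights stand in for the trained parameters.
W1 = (0.02 * rng.normal(size=(d, h))).astype(np.float32)
b1 = np.zeros(h, np.float32)
gamma, beta = np.ones(h, np.float32), np.zeros(h, np.float32)  # batch-norm scale/shift
W2 = (0.02 * rng.normal(size=(h, p_max))).astype(np.float32)
b2 = np.zeros(p_max, np.float32)

def predict_portals(E, topk=5):
    """Two-layer MLP (R^d -> R^h -> R^Pmax): return top-k portal ids per query."""
    Z = E @ W1 + b1
    # Batch norm, using batch statistics here for brevity (inference would
    # use stored running statistics).
    Z = gamma * (Z - Z.mean(0)) / (Z.std(0) + 1e-5) + beta
    A = np.maximum(Z, 0.0)                    # ReLU
    logits = A @ W2 + b2
    return np.argsort(-logits, axis=1)[:, :topk]

E = rng.normal(size=(4, d)).astype(np.float32)  # a batch of 4 query embeddings
portal_ids = predict_portals(E)                 # shape (4, 5)
```

Because the forward pass is two dense matmuls, it batches trivially alongside embedding inference on the same GPU, which is where the sub-1% overhead comes from.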

Problem
Graph traversal dominates pipeline runtime
Index
HNSW — default in Qdrant, Weaviate, FAISS
Bottleneck
Serial greedy walk, can't use GPU
Pipeline runtime breakdown · MobileNet image embeddings (Figure 2)

Measured on an embedding-retrieval pipeline using MobileNet for image embeddings. Disk-based retrieval (Qdrant): 97.2% of runtime is vector search. In-memory retrieval (FAISS): 82.7% of runtime is vector search.

§ 2 Graph Traversal Race

Both methods search the same HNSW graph for the same query. HNSW starts from a fixed, arbitrary entry node. FreeThrow predicts a portal — a starting node near the query — cutting the greedy walk by 2–3×. The ball reaches the end when the traversal converges.
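The race can be reproduced on a toy scale. Here a brute-force k-NN graph stands in for the HNSW base layer, and an oracle that returns the true nearest node stands in for the learned portal predictor; all sizes are illustrative assumptions. Starting from the portal, the greedy walk converges with no hops at all.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 8)).astype(np.float32)   # toy base vectors (assumed sizes)

# Toy k-NN graph standing in for the HNSW base layer.
k = 8
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
neighbors = np.argsort(d2, axis=1)[:, 1:k + 1]     # skip column 0 (self)

def greedy_search(q, entry):
    """Greedy descent: hop to the neighbor closest to q until nothing improves."""
    cur = entry
    best = float(np.linalg.norm(X[cur] - q))
    hops = 0
    while True:
        cand = neighbors[cur]
        dists = np.linalg.norm(X[cand] - q, axis=1)
        j = int(np.argmin(dists))
        if dists[j] >= best:
            return cur, hops
        cur, best, hops = int(cand[j]), float(dists[j]), hops + 1

q = rng.normal(size=8).astype(np.float32)
node_fixed, hops_fixed = greedy_search(q, entry=0)        # fixed global entry
portal = int(np.argmin(np.linalg.norm(X - q, axis=1)))    # oracle portal: true nearest node
node_portal, hops_portal = greedy_search(q, entry=portal) # converges immediately
```

The loop also makes the serial bottleneck visible: each hop depends on the result of the previous one, so the walk cannot be parallelized or offloaded to a GPU; shortening it is the only lever.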

HNSW — Random Entry default · fixed start node
FreeThrow — Portal learned · query-adaptive start
Representative numbers — actual improvement varies by dataset and recall target
HNSW traversal path
FreeThrow traversal path
§ 3 Recall / Throughput Pareto Frontier
Image similarity search · ImageNet · hover to inspect

FreeThrow dominates HNSW across all recall levels, establishing a new Pareto frontier. Every point on the HNSW curve can be improved by 2–3× in throughput at no recall cost, or, equivalently, matched in throughput at a higher recall. For very high recall (>99%), the two methods converge as search approaches an exhaustive scan.

§ 4 Implementation
  • FAISS integration. FreeThrow is implemented inside FAISS's C++ HNSW engine. FAISS normally fixes the graph entry point at the global centroid; we expose a single parameter to override it with a caller-supplied portal index. Index structure is unchanged — M=32, efConstruction=80 — so the modification is a one-line patch to the search routine. The implementation is open-source and drop-in compatible with existing FAISS HNSW pipelines.
  • Zero-cost gate. The portal predictor is a two-layer MLP (Rd → Rh → RPmax) with batch normalization and ReLU. It runs on the same GPU immediately after embedding inference — no CPU round-trip, no extra batching. Measured prediction latency on ImageNet (MobileNetV3, 960-dim): 0.0072 ms/item vs. 0.2393 ms/item for the embedding model itself — a 33× ratio. On Wikipedia with MiniLM the ratio is 156×; with DistilBERT, 226×. End-to-end overhead is 0.45–3% of total search time.
  • Training data generation. Four strategies are evaluated for producing labeled (embedding, optimal-portal) pairs: (1) existing dataset vectors run through standard HNSW search to record the best-hop entry, (2) Gaussian perturbations of existing vectors to cover the neighborhood, (3) real query logs from the dataset when available, and (4) random vectors as a coverage baseline. Strategies 1–3 yield comparable predictor accuracy; random vectors underperform, confirming that proximity to the data manifold matters. Training completes in 10–30 minutes with top-5 accuracy exceeding 99% and top-1 exceeding 90%.
  • Why it works at scale. In high-dimensional spaces each portal serves a compact, well-separated region of the embedding manifold, so a small set of portals covers the entire dataset with high probability. On SIFT1M (128-dim), 10,000 portals with radius r=7 suffice — far fewer than the theoretical worst-case bound. Gains grow with dimensionality: ImageNet (960-dim) reaches 3.2× speedup at 50% recall and 1.73× at 90%, versus 2× on SIFT1M, because higher-dimensional neighborhoods are more geometrically regular and easier to predict.
  • Theoretical framing. Optimal portal set selection is NP-complete; we prove this via reduction from Set Cover. A greedy BFS heuristic provides a practical approximation with the guarantee HG(P) ≤ 3r − 1 hops from any query to its nearest portal, where r is the coverage radius. On SIFT1M the empirical r=7 is an order of magnitude below the theoretical worst-case, confirming the heuristic is effective in practice.
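The portal-selection step can be illustrated with a plain greedy Set Cover pass. This is a simplification of the BFS-based heuristic (metric coverage instead of graph hops), and the 300-point Gaussian dataset and radius r=2.5 are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 8)).astype(np.float32)  # illustrative dataset
r = 2.5                                           # illustrative coverage radius

# covers[i]: the set of points within radius r of candidate portal i.
dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
covers = [set(np.flatnonzero(row <= r).tolist()) for row in dist]

def greedy_portals():
    """Greedy Set Cover: repeatedly pick the candidate covering the most
    still-uncovered points. Every point covers itself, so this terminates."""
    uncovered = set(range(len(X)))
    portals = []
    while uncovered:
        best = max(range(len(X)), key=lambda i: len(covers[i] & uncovered))
        portals.append(best)
        uncovered -= covers[best]
    return portals

P = greedy_portals()   # every point is within radius r of some portal in P
```

The classic greedy Set Cover argument gives a ln(n)-factor approximation on the number of portals, which is why a small portal set suffices in practice despite the problem being NP-complete.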