Both methods search the same HNSW graph. HNSW enters at a fixed global vertex (top layer) and descends through all layers — many hops. FreeThrow's MLP maps the query embedding to a portal: a node in the dense base layer adjacent to the answer, reducing traversal to a single hop.
Cause
Most retrieval systems work the same way: an embedding model turns a query into a vector, and a vector index finds the nearest stored neighbors. The embedding model is well-studied and heavily optimized. The index is usually an afterthought.
When I started profiling these pipelines, the numbers were striking — vector search was eating 50–97% of total runtime, sometimes 10× slower than the embedding step it was supposed to complement.
The culprit is how HNSW, the de facto standard, begins every search: from the same fixed node, every time, regardless of the query. It then walks the graph greedily until it converges. The walk is serial, can't use a GPU, and its length is entirely determined by how unlucky that starting point was.
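To make that cost concrete, here is a toy sketch of the greedy walk in plain NumPy, not FAISS's implementation: a k-NN graph over random vectors stands in for the HNSW base layer, and the `entry` argument is the only thing that differs between a fixed start and a portal start.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the HNSW base layer: a K-nearest-neighbor graph over random vectors.
n, d, K = 500, 32, 12
X = rng.normal(size=(n, d)).astype(np.float32)
sq = (X * X).sum(1)
nbrs = np.argsort(sq[:, None] + sq[None, :] - 2 * X @ X.T, axis=1)[:, 1:K + 1]

def greedy_walk(q, entry):
    """Serial greedy descent: hop to the closest neighbor until none improves."""
    cur = entry
    cur_d = ((X[cur] - q) ** 2).sum()
    hops = 0
    while True:
        cand = nbrs[cur]
        cand_d = ((X[cand] - q) ** 2).sum(1)
        best = cand_d.argmin()
        if cand_d[best] >= cur_d:   # local minimum: the walk has converged
            return cur, hops
        cur, cur_d, hops = cand[best], cand_d[best], hops + 1

q = rng.normal(size=d).astype(np.float32)
truth = int(((X - q) ** 2).sum(1).argmin())

node_fixed, hops_fixed = greedy_walk(q, entry=0)           # fixed global entry
portal = int(nbrs[truth][0])                               # pretend the MLP predicted this
node_portal, hops_portal = greedy_walk(q, entry=portal)    # portal entry
print(f"fixed entry: {hops_fixed} hops, portal entry: {hops_portal} hops")
```

The walk is inherently sequential: each hop needs the previous hop's result, which is why it resists GPU batching. A portal adjacent to the answer can converge almost immediately, while the fixed start must cross the graph.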
The idea behind FreeThrow was simple: what if we just learned where to start? We train a small MLP to predict a good entry point — a portal — for each query, running it on the same GPU right after embedding. In practice it adds under 1% overhead and cuts traversal length by 2–3×.
Measured on an embedding-retrieval pipeline using MobileNet for image embeddings. Disk-based retrieval (Qdrant): 97.2% of runtime is vector search. In-memory retrieval (FAISS): 82.7% of runtime is vector search.
Both methods search the same HNSW graph for the same query. HNSW starts from a fixed, arbitrary entry node. FreeThrow predicts a portal — a starting node near the query — cutting the greedy walk by 2–3×. The ball reaches the end when the traversal converges.
FreeThrow dominates HNSW across all recall levels, establishing a new Pareto frontier. Every point on the HNSW curve can be moved to 2–3× higher throughput at no recall cost or, equivalently, to higher recall at the same throughput. For very high recalls (>99%), the two methods converge as search approaches an exhaustive scan.
- FAISS integration. FreeThrow is implemented inside FAISS's C++ HNSW engine. FAISS normally fixes the graph entry point at the global centroid; we expose a single parameter to override it with a caller-supplied portal index. Index structure is unchanged — M=32, efConstruction=80 — so the modification is a one-line patch to the search routine. The implementation is open-source and drop-in compatible with existing FAISS HNSW pipelines.
- Zero-cost gate. The portal predictor is a two-layer MLP (ℝ^d → ℝ^h → ℝ^{P_max}) with batch normalization and ReLU. It runs on the same GPU immediately after embedding inference — no CPU round-trip, no extra batching. Measured prediction latency on ImageNet (MobileNetV3, 960-dim): 0.0072 ms/item vs. 0.2393 ms/item for the embedding model itself — a 33× ratio. On Wikipedia with MiniLM the ratio is 156×; with DistilBERT, 226×. End-to-end overhead is 0.45–3% of total search time.
- Training data generation. Four strategies are evaluated for producing labeled (embedding, optimal-portal) pairs: (1) existing dataset vectors run through standard HNSW search to record the best-hop entry, (2) Gaussian perturbations of existing vectors to cover the neighborhood, (3) real query logs from the dataset when available, and (4) random vectors as a coverage baseline. Strategies 1–3 yield comparable predictor accuracy; random vectors underperform, confirming that proximity to the data manifold matters. Training completes in 10–30 minutes with top-5 accuracy exceeding 99% and top-1 exceeding 90%.
- Why it works at scale. In high-dimensional spaces each portal serves a compact, well-separated region of the embedding manifold, so a small set of portals covers the entire dataset with high probability. On SIFT1M (128-dim), 10,000 portals with radius r=7 suffice — far fewer than the theoretical worst-case bound. Gains grow with dimensionality: ImageNet (960-dim) reaches 3.2× speedup at 50% recall and 1.73× at 90%, versus 2× on SIFT1M, because higher-dimensional neighborhoods are more geometrically regular and easier to predict.
- Theoretical framing. Optimal portal set selection is NP-complete; we prove this via reduction from Set Cover. A greedy BFS heuristic provides a practical approximation with the guarantee H_G(P) ≤ 3r − 1 hops from any portal to its nearest query, where r is the coverage radius. On SIFT1M the empirical r=7 is an order of magnitude below the theoretical worst-case bound, confirming the heuristic is effective in practice.
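The two-layer predictor described above can be sketched as a NumPy forward pass. The weights here are random and untrained, the hidden width h = 256 is an assumption (only the input and output dimensions are fixed above), and batch statistics stand in for learned running statistics.

```python
import numpy as np

rng = np.random.default_rng(1)

# Dimensions: embedding dim d, hidden width h (assumed), portal count P_max.
d, h, P_max = 960, 256, 10_000

# Two-layer MLP: R^d -> R^h (batchnorm + ReLU) -> R^{P_max} logits over portals.
W1, b1 = rng.normal(0, 0.02, (d, h)).astype(np.float32), np.zeros(h, np.float32)
W2, b2 = rng.normal(0, 0.02, (h, P_max)).astype(np.float32), np.zeros(P_max, np.float32)
gamma, beta = np.ones(h, np.float32), np.zeros(h, np.float32)   # batchnorm params

def predict_portals(E, k=5):
    """Map a batch of query embeddings to their top-k candidate portal ids."""
    z = E @ W1 + b1
    z = gamma * (z - z.mean(0)) / np.sqrt(z.var(0) + 1e-5) + beta   # batch normalization
    z = np.maximum(z, 0.0)                                          # ReLU
    logits = z @ W2 + b2
    return np.argsort(-logits, axis=1)[:, :k]                       # top-k portals per query

E = rng.normal(size=(8, d)).astype(np.float32)
portals = predict_portals(E)
print(portals.shape)   # (8, 5)
```

A forward pass this small is two GEMMs plus elementwise ops, which is why it batches for free alongside the embedding model on the same GPU.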
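Training-pair generation via strategy (2), Gaussian perturbation, might look like the following sketch. One simplification to flag: the label here is the perturbed query's nearest portal in embedding space, whereas the post records the best-hop entry found by an actual HNSW search; `sigma` and the portal set are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy corpus and portal set (here just a random subset of corpus points).
X = rng.normal(size=(1000, 64)).astype(np.float32)
P = X[rng.choice(len(X), size=100, replace=False)]

def make_training_pairs(X, P, sigma=0.1, per_vector=4):
    """Strategy 2: perturb each stored vector with Gaussian noise to cover its
    neighborhood, then label each perturbed query with a portal.
    Simplification: the label is the nearest portal in embedding space, not the
    best-hop entry recorded from a real HNSW search."""
    Q = np.repeat(X, per_vector, axis=0)
    Q = Q + rng.normal(0, sigma, Q.shape).astype(np.float32)
    d2 = (Q * Q).sum(1)[:, None] + (P * P).sum(1)[None, :] - 2 * Q @ P.T
    return Q, d2.argmin(1)

Q, y = make_training_pairs(X, P)
print(Q.shape, y.shape)
```

Because perturbed queries sit near the data manifold by construction, this strategy behaves like strategies (1) and (3); random vectors lack that proximity, which is consistent with their weaker predictor accuracy.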
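The greedy heuristic can be sketched as classic greedy set cover over r-hop BFS balls: repeatedly pick the node whose ball covers the most still-uncovered nodes. The graph below is a toy k-NN graph rather than a real HNSW base layer, and r is set to 2 so the loop stays fast.

```python
import numpy as np
from collections import deque

rng = np.random.default_rng(3)

# Toy k-NN graph standing in for the HNSW base layer.
n, d, K, r = 300, 16, 8, 2
X = rng.normal(size=(n, d)).astype(np.float32)
sq = (X * X).sum(1)
nbrs = np.argsort(sq[:, None] + sq[None, :] - 2 * X @ X.T, axis=1)[:, 1:K + 1]

def ball(v, r):
    """All nodes within r hops of v (BFS over out-edges)."""
    seen, frontier = {v}, deque([(v, 0)])
    while frontier:
        u, depth = frontier.popleft()
        if depth == r:
            continue
        for w in map(int, nbrs[u]):
            if w not in seen:
                seen.add(w)
                frontier.append((w, depth + 1))
    return seen

# Greedy set cover: add the portal whose r-ball covers the most uncovered
# nodes, until every node is within r hops of some portal.
balls = [ball(v, r) for v in range(n)]
uncovered, portals = set(range(n)), []
while uncovered:
    best = max(range(n), key=lambda v: len(balls[v] & uncovered))
    portals.append(best)
    uncovered -= balls[best]
print(len(portals), "portals cover all", n, "nodes within", r, "hops")
```

The loop always terminates because every node's ball contains the node itself; the familiar set-cover greedy analysis is what yields the approximation guarantee cited above.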