Research · 2025
LayoutDB
Disk-Resident Vector Search with In-Memory Performance · Harvard University
Stack: C++17 · AVX-512 · FAISS · mmap
Result: 100× over IVF-Flat · 98% recall · 18% RAM
Dataset: 2M vectors · 768-dim · Wikitext-2
Context: CS 265 Big Data Systems · 2025
IVF-Flat: full read · every candidate from disk
LayoutDB: prefix prune · disk only for survivors

Both methods use IVF clustering to select nprobe clusters. IVF-Flat reads every candidate vector in full from NVMe. LayoutDB keeps a compact m-dimensional prefix of each vector in DRAM and checks it first — only vectors that pass the prefix filter trigger a disk read of the suffix.

§ 1 The Disk–Memory Problem

LLM applications depend on vector databases to retrieve relevant context and prevent hallucinations. A modern LLM like DeepSeek V3 is trained on trillions of tokens — storing embeddings for even its training set at 1600 dimensions would require roughly 380 TB of RAM. Purely in-memory indexes like HNSW are infeasible at that scale.

Disk-based solutions exist, but they suffer a fundamental I/O bottleneck. IVF-Flat keeps full vectors on NVMe and must read every candidate in a cluster in full. On a 1M-vector dataset with nprobe=10 and 1000-vector clusters, that is 10,000 reads per query; at 100 μs per disk access, a single query costs a full second in data movement alone, before any distance computation.

Quantization-based methods like IVF-PQ compress vectors into 8–32-byte codes, trading storage for recall. A true neighbor can be missed by the approximate code distance and never reach the refinement stage. Practitioners must inflate nprobe or expansion factors to recover recall — incurring extra I/O without a principled guarantee of correctness.

The key observation: only the first few dozen dimensions are needed to accept or reject most candidates. If the partial distance on the prefix already exceeds the current k-th best distance, the full vector can never be a neighbor — no disk read required. LayoutDB stores just enough in DRAM to eliminate 98% of candidates before touching disk.

DeepSeek V3: ~380 TB for 1600-d embeddings
IVF-Flat: >1 s data movement per query on a 1M dataset
IVF-PQ: lossy codes cause false negatives
§ 2 Candidate Pruning Race

Both methods scan the same 1000 candidates across nprobe=10 clusters. IVF-Flat fetches every vector in full from NVMe — 1000 disk reads, nothing skipped. LayoutDB streams m-dimensional prefix tiles from DRAM: if the partial distance already exceeds the current k-th best, the candidate is eliminated instantly. Only ~20 survivors trigger a disk read.

IVF-Flat (Full Disk Scan): baseline · every candidate read from NVMe
LayoutDB (Prefix Prune): hybrid · DRAM prefix → NVMe suffix only for survivors
Close-query workload: IVF-Flat 157 ms · LayoutDB (Tiled+Snapshot) 23 ms
§ 3 Performance Results
Average query latency (ms) · Wikitext-2 (768-dim) · nprobe=1 · recall@1

LayoutDB (Tiled+Snapshot) is 6.8× faster than IVF-Flat on close-query workloads and 3× faster on normal queries, while maintaining identical recall@1. Prefix pruning is conservative: raw-prefix distance is a strict lower bound on full L2, guaranteeing zero false negatives. Snapshot mode uses a JLL-scaled projection (expected value matches full distance) with an empirically tuned scaling factor.

Workload   Mode             Avg. Latency   Recall@1   vs IVF-Flat
Close      IVF-Flat         157.1 ms       1.00       baseline
Close      TiledStorage      42.2 ms       1.00       3.7×
Close      Tiled+Snapshot    23.0 ms       1.00       6.8×
Normal     IVF-Flat         159.0 ms       0.90       baseline
Normal     TiledStorage     118.8 ms       0.90       1.3×
Normal     Tiled+Snapshot    52.9 ms       0.90       3.0×
§ 4 Implementation
  • Header-only C++17, ~2k lines. AVX-512 SIMD kernels accelerate prefix distance computation. Zero-copy disk access via mmap — the OS page cache handles hot data automatically. No background compaction or training required after index construction.
  • IVF clustering via FAISS. An offline prepareData phase calls FAISS's k-means to partition the dataset into nlist Voronoi cells. Each cluster writes two binary files: tiles.bin (m-dimensional prefix tiles, laid out row-by-row for SIMD scanning) and rows.bin (the remaining d−m suffix dimensions). Tiles are sized to land on hardware page boundaries so a single page read fetches 16–64 tiles.
  • Two prefix modes. Raw-prefix stores the first m dimensions verbatim — the partial squared distance ‖q1:m − x1:m‖² is a strict lower bound on the full squared distance, guaranteeing zero false negatives. Snapshot mode stores the projection Px of each vector under an m×d matrix P with orthonormal rows; the scaled distance ‖Pq − Px‖² · d/m has expected value equal to the full squared distance, enabling aggressive JLL-backed filtering at the cost of a small, controllable false-negative rate.
  • Three configurable RAM knobs. tile_cache_size pins a fraction of prefix tiles in DRAM. anchor_cache_size keeps full hot vectors as in-memory "anchors" — zero disk traffic for the most-accessed data. optimize_query reorders the scan to process all in-memory vectors first, tightening the k-th best threshold before any disk is touched, which increases selectivity on disk-resident candidates.
  • At 18% RAM, 98% recall, 1400 QPS. With modest caching (18% of dataset floats in RAM), LayoutDB discards over 99.99% of candidates at prefix time, issues one small read for survivors, and achieves P99 latencies approaching an all-in-memory FAISS baseline — while scaling to datasets that far exceed available DRAM. This represents a 100× throughput improvement over IVF-Flat under simulated disk delays.