Both methods use IVF clustering to select nprobe clusters. IVF-Flat reads every candidate vector in full from NVMe. LayoutDB keeps a compact m-dimensional prefix of each vector in DRAM and checks it first — only vectors that pass the prefix filter trigger a disk read of the suffix.
LLM applications depend on vector databases to retrieve relevant context and reduce hallucinations. A modern LLM like DeepSeek V3 is trained on trillions of tokens — storing embeddings for even its training set at 1600 dimensions would require roughly 380 TB of RAM. Purely in-memory indexes like HNSW are infeasible at that scale.
Disk-based solutions exist, but they suffer a fundamental I/O bottleneck. IVF-Flat keeps full vectors on NVMe and must read every candidate in a cluster in full. At 100 μs per disk access, a 1M-vector dataset with nprobe=10 and 1000-vector clusters costs a full second per query in data movement alone — before any distance computation.
Quantization-based methods like IVF-PQ compress vectors into 8–32-byte codes, trading recall for storage savings. A true neighbor can be misranked by the approximate code distance and never reach the refinement stage. Practitioners must inflate nprobe or expansion factors to recover recall — incurring extra I/O without a principled guarantee of correctness.
The key observation: only the first few dozen dimensions are needed to accept or reject most candidates. If the partial distance on the prefix already exceeds the current k-th best distance, the full vector can never be a neighbor — no disk read required. LayoutDB stores just enough in DRAM to eliminate 98% of candidates before touching disk.
Both methods scan the same 1000 candidates across nprobe=10 clusters. IVF-Flat fetches every vector in full from NVMe — 1000 disk reads, nothing skipped. LayoutDB streams m-dimensional prefix tiles from DRAM: if the partial distance already exceeds the current k-th best, the candidate is eliminated instantly. Only ~20 survivors trigger a disk read.
LayoutDB (Tiled+Snapshot) is 6.8× faster than IVF-Flat on close-query workloads and 3× faster on normal queries, while maintaining identical recall@1. Prefix pruning is conservative: raw-prefix distance is a strict lower bound on full L2, guaranteeing zero false negatives. Snapshot mode uses a JLL-scaled projection (expected value matches full distance) with an empirically tuned scaling factor.
| Workload | Mode | Avg. Latency | Recall@1 | vs IVF-Flat |
|---|---|---|---|---|
| Close | IVF-Flat | 157.1 ms | 1.00 | — |
| Close | TiledStorage | 42.2 ms | 1.00 | 3.7× |
| Close | Tiled+Snapshot | 23.0 ms | 1.00 | 6.8× |
| Normal | IVF-Flat | 159.0 ms | 0.90 | — |
| Normal | TiledStorage | 118.8 ms | 0.90 | 1.3× |
| Normal | Tiled+Snapshot | 52.9 ms | 0.90 | 3.0× |
- Header-only C++17, ~2k lines. AVX-512 SIMD kernels accelerate prefix distance computation. Zero-copy disk access via `mmap` — the OS page cache handles hot data automatically. No background compaction or training required after index construction.
- IVF clustering via FAISS. An offline `prepareData` phase calls FAISS's k-means to partition the dataset into nlist Voronoi cells. Each cluster writes two binary files: `tiles.bin` (m-dimensional prefix tiles, laid out row-by-row for SIMD scanning) and `rows.bin` (the remaining d−m suffix dimensions). Tiles are sized to land on hardware page boundaries so a single page read fetches 16–64 tiles.
- Two prefix modes. Raw-prefix stores the first m dimensions verbatim — the partial squared distance ‖q1:m − x1:m‖² is a strict lower bound on the full squared distance, guaranteeing zero false negatives. Snapshot mode stores the projection Px, where P is an m×d matrix with orthonormal rows; the scaled distance ‖Pq − Px‖² · D/m has expected value equal to the full squared distance, enabling aggressive JLL-backed filtering at the cost of a small, controllable false-negative rate.
- Three configurable RAM knobs. `tile_cache_size` pins a fraction of prefix tiles in DRAM. `anchor_cache_size` keeps full hot vectors as in-memory "anchors" — zero disk traffic for the most-accessed data. `optimize_query` reorders the scan to process all in-memory vectors first, tightening the k-th best threshold before any disk is touched, which increases selectivity on disk-resident candidates.
- At 18% RAM: 98% recall, 1400 QPS. With modest caching (18% of dataset floats in RAM), LayoutDB discards over 99.99% of candidates at prefix time, issues one small read per survivor, and achieves P99 latencies approaching an all-in-memory FAISS baseline — while scaling to datasets that far exceed available DRAM. This represents a 100× throughput improvement over IVF-Flat under simulated disk delays.