Research · 2025
Retrieval-Augmented Models
Dynamic LoRA Expert Retrieval for Continual Learning · Harvard University
Stack
Python · PyTorch · vLLM · HuggingFace PEFT
Result
+10% absolute · +25% relative over base model
Benchmark
CodeFlowBench · Codeforces 800-rated problems
Context
CS 2821 Final Project · 2025
PHASE 1 — embedding space · k-Means clustering
PHASE 2 — query → retrieve → attach LoRA → generate

Left (Phase 1 — offline): 200 training problems are embedded and partitioned into K = 10 clusters via k-Means. Stars mark the stored centroid of each cluster, which serves as the retrieval key for its LoRA expert. Right (Phase 2 — online): at inference time, the test input is embedded and its cosine similarity to all centroids is computed. The nearest expert is retrieved and its LoRA weights are merged into the frozen base model for generation. On an incorrect output, only that expert's weights are updated — all other experts and the base model remain unchanged.

§ 1 The Idea
Problem · PEFT · MoE · Limits · RAM

The starting point was a simple observation: LLMs are good generalists but poor specialists. Qwen 2.5-Coder is a strong model, but on Codeforces 800-level problems it solves only 39%. The domain has specific stylistic and algorithmic patterns that a generalist has no particular reason to internalize. Fine-tuning the whole 7B model is expensive, requires storing an entirely new copy of the weights, and risks catastrophic forgetting — adapting to new data by overwriting old knowledge.

LoRA makes fine-tuning cheap by keeping the base model completely frozen and learning only a pair of small low-rank matrices per layer. The adapters are a tiny fraction of the model's size, so you can realistically maintain a whole library of them simultaneously. That opens up a different question: instead of one fine-tuned model, what if you had many small experts and picked the right one per input?
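The frozen-base-plus-low-rank-deltas idea can be sketched as a single linear layer. This is a minimal illustration, not the HuggingFace PEFT implementation: the class name, rank, and scaling convention here are ours, but the structure (frozen W, trainable A and B, output y = xWᵀ + (α/r)·xAᵀBᵀ) is standard LoRA.

```python
import torch

class LoRALinear(torch.nn.Module):
    """Illustrative LoRA layer: frozen weight W plus a trainable
    low-rank delta B·A, scaled by alpha/r (standard LoRA convention)."""

    def __init__(self, in_features, out_features, r=16, alpha=32):
        super().__init__()
        # Base weight stays frozen — it never receives gradients.
        self.weight = torch.nn.Parameter(
            torch.randn(out_features, in_features), requires_grad=False
        )
        # A is initialized with small noise, B with zeros, so the
        # adapter starts as an exact no-op on the base model.
        self.A = torch.nn.Parameter(torch.randn(r, in_features) * 0.02)
        self.B = torch.nn.Parameter(torch.zeros(out_features, r))
        self.scale = alpha / r

    def forward(self, x):
        # Frozen path + scaled low-rank update.
        return x @ self.weight.T + self.scale * (x @ self.A.T) @ self.B.T
```

Because B starts at zero, attaching a freshly initialized adapter leaves the base model's behavior unchanged — training only ever moves the low-rank delta.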

Mixture of Experts does exactly that — route each input to a specialist. But in standard MoE architectures like Switch Transformer or Mixtral, the routing gate is learned end-to-end alongside the experts during pre-training. Once trained, it's fixed. You can't add a new expert without retraining the router, and correcting one expert risks disturbing everything else. That makes it a poor fit for continual learning, where the task distribution keeps evolving.

RAM replaces the learned gate with retrieval. We embed 200 training problems using text-embedding-3-large, cluster them into K = 10 groups via k-Means — each cluster representing a latent problem subtype — and train one LoRA expert per cluster. The centroid of each cluster becomes the retrieval key stored in the LoRA Store. At inference, we embed the test problem, find the nearest centroid by cosine similarity, and attach that expert to the frozen base model. If the output is wrong and a ground-truth answer exists, we run a single gradient update on the retrieved expert only — the base model and the other nine experts are untouched. This is how the system accumulates specialized knowledge over time without ever forgetting what it already knows.
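The retrieval step described above — embed the test problem, score it against all stored centroids by cosine similarity, pick the nearest expert — reduces to a few lines. A hedged sketch (function and variable names are ours; `centroids` stands in for the keys in the LoRA Store):

```python
import numpy as np

def retrieve_expert(query_emb: np.ndarray, centroids: np.ndarray) -> int:
    """Return the index of the expert whose stored centroid is most
    similar to the query embedding, by cosine similarity.

    query_emb: (d,) embedding of the test problem.
    centroids: (K, d) matrix of cluster-centroid retrieval keys.
    """
    q = query_emb / np.linalg.norm(query_emb)
    c = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    sims = c @ q               # cosine similarity to each of the K keys
    return int(np.argmax(sims))  # k* = argmax_k cos(v_test, mu_k)
```

The returned index is then used to look up and attach the corresponding LoRA weights; with K = 10 this lookup is a trivially cheap dense dot product, not a learned gate.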

Base Model
Qwen 2.5-Coder-7B-Instruct
Experts
K = 10 LoRA adapters
Retrieval
Cosine similarity · cluster centroids
Embedding
OpenAI text-embedding-3-large
§ 2 Results
Best-of-5 accuracy · CodeFlowBench · 200 Codeforces 800-rated problems

Evaluated on 200 Codeforces 800-level problems using Best-of-5 sampling — a problem is solved if at least one of 5 generated candidates passes all hidden test cases. The full RAM system with continual learning solved 98/200 (49%), a +10-percentage-point / +25% relative improvement over the Qwen 2.5-Coder base. The ablations are telling: plugging in the worst LoRA — the one least similar to the query — still beats the base model by 6 points. This means the gains aren't purely from retrieval precision; Codeforces problems have specific stylistic and algorithmic patterns that any LoRA tuned on the domain picks up, regardless of fit. Retrieving the right expert adds roughly another 1.5 points on top of that, and the online continual-learning update (a single gradient step on the retrieved expert when the output is wrong) adds a further 2.5. These effects are separable and stack cleanly.

Policy                      Solved   Acc (%)   Δ (pts)
Base Model (Qwen 2.5)         78      39.0       —
RAM (Least similar LoRA)      90      45.0      +6.0
RAM (Median LoRA)             92      46.0      +7.0
RAM-No-Learning               93      46.5      +7.5
RAM (Problem Embedding)       97      48.5      +9.5
RAM-Learning (Ours)           98      49.0     +10.0
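The Best-of-5 scoring rule used throughout the table can be stated precisely in a few lines. A sketch under our own naming — `passes_all_tests` stands in for the hidden-test judge, which the benchmark provides:

```python
def solved_best_of_n(candidates, passes_all_tests) -> bool:
    """A problem counts as solved if at least one sampled candidate
    program passes every hidden test case."""
    return any(passes_all_tests(c) for c in candidates)

def accuracy_pct(num_solved: int, num_problems: int) -> float:
    """Solve rate as a percentage of the evaluation set."""
    return 100.0 * num_solved / num_problems
```

For example, 98 solved out of 200 problems gives `accuracy_pct(98, 200) == 49.0`, matching the RAM-Learning row above.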
§ 3 Method
  • Phase 1 — building the expert library. We embed all 200 training problems into a high-dimensional semantic space using text-embedding-3-large and run k-Means to partition them into K = 10 clusters. The intuition is that each cluster captures a latent skill or subtype in the problem space — not an explicitly labeled category, just whatever structure the embedding model finds. For each cluster, we train a distinct LoRA adapter exclusively on that cluster's problems, using cross-entropy loss with the base model frozen throughout. The centroid of each cluster is stored as the retrieval key alongside the trained weights in the LoRA Store — a simple key-value structure where keys are centroid vectors and values are LoRA parameters.
  • Phase 2 — retrieval and inference. When a test problem arrives, we embed it with the same model, compute cosine similarity against all 10 stored centroids, and retrieve the adapter with the highest score: k* = argmax_k cos(v_test, μ_k). That adapter is attached to the frozen base model, and generation proceeds with Best-of-5 sampling. Attaching a LoRA is lightweight — it doesn't modify the base weights, just adds the low-rank deltas on top of them during the forward pass.
  • Continual learning via local updates. When the model gets a problem wrong and a ground-truth answer is available, we perform a single gradient update restricted to the retrieved expert: φ_{k*} ← φ_{k*} − η ∇ L_CE(y_gold, x_test). The base model and the other nine experts are never touched. We also update the centroid μ_{k*} to reflect the new problem being added to that cluster. Because updates are strictly local, catastrophic forgetting is prevented by construction — there's nothing to interfere with, since each expert owns a disjoint slice of the problem space. For the online phase we use batch size 1 and a single gradient step, and since we're fine-tuning an already-trained adapter rather than learning from scratch, it adapts without overfitting to a single example.
  • Unsupervised expert creation. A key property of this setup is that the clustering is entirely unsupervised — we never label what "type" a problem is. We just let the embedding model and k-Means find meaningful structure. The fact that even the least-similar LoRA outperforms the base model by 6 points confirms that every expert learned something useful: all 10 have internalized domain-specific patterns, and the retrieval step adds further precision on top of that baseline lift.
  • Centroids vs. raw embedding retrieval. We ran an ablation where retrieval uses the raw problem embedding nearest-neighbor instead of centroids. It scores 48.5% vs 49.0% for centroids. The gap is small but the direction is consistent with the hypothesis that centroids act as a regularizer: they represent the implicit class of the problem rather than its surface wording, so retrieval is less susceptible to keyword overlap between domains.
  • Infrastructure. The frozen base model runs through vLLM for maximum throughput. LoRA training, management, and dynamic swapping use HuggingFace Transformers and PEFT. Training ran on NVIDIA A100 40GB GPUs with multi-GPU sharding; the system is architecturally compatible with 70B+ models for future scaling.
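The local update from Phase 2 can be sketched concretely. This is our illustration, not the project's exact training code: the learning rate, the `loss_fn` closure, and the running-mean form of the centroid update are assumptions. What it does reflect from the method is that exactly one expert's parameters receive a gradient, and the centroid key is refreshed when the problem joins its cluster.

```python
import torch

def local_update(expert_params, loss_fn, lr=1e-4) -> float:
    """One SGD step on the retrieved expert's LoRA parameters only
    (batch size 1, single step); everything else stays frozen."""
    opt = torch.optim.SGD(expert_params, lr=lr)
    opt.zero_grad()
    loss = loss_fn()   # e.g. cross-entropy of y_gold given x_test
    loss.backward()
    opt.step()         # phi_{k*} <- phi_{k*} - lr * grad
    return float(loss)

def update_centroid(mu_k, new_emb, n_k):
    """Fold the new problem's embedding into cluster k's centroid as a
    running mean over the n_k problems already assigned to it
    (one plausible realization of the centroid refresh)."""
    return (n_k * mu_k + new_emb) / (n_k + 1)
```

Because `expert_params` contains only the retrieved adapter's A/B matrices, the optimizer physically cannot touch the base model or the other nine experts — the locality guarantee is enforced by what is passed in, not by masking.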