CrystalCache: Borrowing from Memory Psychology to Manage the KV Cache
When a language model reads a long document, every token leaves a trace in the KV cache. The cache grows linearly, memory does not, and at some point we have to throw things away. The interesting question is what to throw away.
Most cache compression methods answer this with a single signal — attention weight, recency, or a learned importance score. CrystalCache asks a different question: when humans remember things from a long passage, what survives, and why? The answer from memory psychology is that survival has two independent reasons, and we can borrow that structure for KV cache management.
This post walks through the architecture — what we kept from the biological framework, what we deliberately threw away, and how the pieces fit together.
1. The two reasons a memory persists
The Structural Crystallization framework describes biological memory as having two independent survival dimensions:
- Encoding shock: a single intense experience locks a memory in. The flashbulb memory of a shocking event survives without rehearsal.
- Consolidation through association: repeated cross-referencing with other memories builds a structural backbone. Concepts you encounter from many angles become impossible to forget.
These two dimensions matter independently. A surprising one-time fact survives because of shock. A routine concept that shows up everywhere survives because of consolidation. Either path is enough.
This maps cleanly onto the problem of which KV entries to keep:
- A “needle” in a haystack (“the secret code is 7492”, buried in unrelated filler) is salient at encoding but isolated structurally.
- A document backbone concept (a legal statute referenced throughout a court ruling) is unremarkable at any single point but is referenced everywhere.
Both should survive eviction, for completely different reasons. A scoring formula that only captures one will fail on the other.
2. What we borrowed, what we threw away
The biological framework comes with a lot of mechanisms. Some of them model real defects of biological memory — distortion on recall, fragility during reconsolidation, decay even of intense memories without rehearsal. KV vectors have none of these defects. They are exact digital storage. Importing the defect-modeling machinery would add complexity without any corresponding phenomenon to model.
So we took only the structural insights:
| Borrowed | Discarded |
|---|---|
| Two independent survival dimensions | Fidelity loss (F) — KV vectors don’t deform on access |
| Encoding shock as a one-shot variable | Labilization window (R) — there is no “internal recall” event for KV |
| Consolidation through cross-references | D × M_i coupling in dissolution — a high-shock token shouldn’t decay just because it lacks connections |
| Branch competition under resource limits | Access-mode classification — its only purpose was driving F and R |
The result is a leaner system that keeps the structural intuition from psychology without simulating biology’s failure modes.
3. The two dimensions, made concrete
M_i: encoding shock
M_i is set once at prefill and never changes. It captures how much a token group stands out from its surrounding context at the moment it was encoded.
It combines two signals:
Attention salience — how much attention this token receives from others during the prefill forward pass. We use received attention (column-sum of the attention matrix), averaged across the top-3 attention heads. Top-3 avoids dilution from indifferent heads while staying robust to single-head outliers.
Token uniqueness — a Von Restorff isolation effect. A token appearing once in 6000 tokens of context is far more distinctive than one appearing 200 times:
uniqueness(token) = 1 / (1 + log(1 + count_in_context))
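The per-token value is a weighted sum of the two signals:
M_i(token) = α × attention_salience + β × token_uniqueness
Default α = β = 0.5. These are the w_attention and w_uniqueness weights listed in section 8; this α is unrelated to the α in the score formula of section 4.
A trunk’s M_i is the mean of its top-3 token M_i values, then normalized to [0, 1] across all trunks in the context.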
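To make the M_i recipe concrete, here is a minimal Python sketch under stated assumptions: attention salience arrives as a precomputed per-token list already scaled to [0, 1], and the cross-trunk normalization is min-max, which the text does not specify. All function names are illustrative.

```python
import math
from collections import Counter

def token_uniqueness(tokens):
    """Von Restorff-style uniqueness: 1 / (1 + log(1 + count_in_context))."""
    counts = Counter(tokens)
    return [1.0 / (1.0 + math.log(1.0 + counts[t])) for t in tokens]

def token_mi(salience, uniqueness, w_attention=0.5, w_uniqueness=0.5):
    """Per-token M_i as a weighted sum of the two signals."""
    return [w_attention * s + w_uniqueness * u
            for s, u in zip(salience, uniqueness)]

def trunk_mi(member_mi, top_k=3):
    """Trunk M_i: mean of the top-k member token values (pre-normalization)."""
    top = sorted(member_mi, reverse=True)[:top_k]
    return sum(top) / len(top)

def normalize_01(values):
    """Scale to [0, 1] across trunks. Min-max is an assumption of this sketch."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]
```

A token seen once gets uniqueness 1 / (1 + ln 2) ≈ 0.59, while one seen 200 times gets roughly 0.16, which is the gap the isolation effect is meant to capture.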
D: consolidation through association
In the original biological framework, D accumulates over time through repeated referenced access — many study sessions, spread across contexts. In one-shot prefill there is no temporal repetition, so we compute D from the spatial structure of cross-trunk attention. A trunk referenced by many different trunks has been “consolidated” through multiple independent associations, which is the structural analogue of repeated study.
The computation:
1. Build the trunk graph
    For each pair of trunks (A, B):
        cross_edges = co-attention edges between members of A and B
        strength = mean(edge_weights) × sqrt(edge_count / (|A| × |B|))
        if strength > 0.05: add_edge(A, B, strength)
2. Weighted degree per trunk
    weighted_degree(T) = sum of strengths of edges incident to T
3. Map to D via z-scored sigmoid
    D = sigmoid((weighted_degree − μ) / σ × steepness)
The sigmoid is doing real work here. A linear mapping would compress most trunks into a narrow band — most trunks have similar degree. The sigmoid amplifies the tails, so the highly connected backbone trunks get D close to 1.0 and the isolated trunks close to 0.0. Default steepness = 2.0.
Z-scoring before the sigmoid means D has meaningful variance regardless of the absolute degree distribution of any particular document.
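The three steps can be sketched in plain Python as follows. This is an illustration, not a reference implementation: the input layout (a dict from trunk pairs to lists of co-attention weights), the use of the population standard deviation, and the neutral D = 0.5 fallback when all degrees are equal are assumptions of the sketch.

```python
import math

def edge_strength(edge_weights, size_a, size_b):
    """mean(edge_weights) * sqrt(edge_count / (|A| * |B|))."""
    if not edge_weights:
        return 0.0
    mean_w = sum(edge_weights) / len(edge_weights)
    return mean_w * math.sqrt(len(edge_weights) / (size_a * size_b))

def build_trunk_graph(cross_edges, sizes, min_edge_strength=0.05):
    """cross_edges: {(a, b): [co-attention weights]}; sizes: {trunk: token count}."""
    graph = {}
    for (a, b), weights in cross_edges.items():
        s = edge_strength(weights, sizes[a], sizes[b])
        if s > min_edge_strength:
            graph[(a, b)] = s
    return graph

def consolidation(graph, trunk_ids, steepness=2.0):
    """D per trunk: sigmoid of the z-scored weighted degree times steepness."""
    degree = {t: 0.0 for t in trunk_ids}
    for (a, b), s in graph.items():
        degree[a] += s
        degree[b] += s
    values = list(degree.values())
    mu = sum(values) / len(values)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in values) / len(values))
    if sigma == 0.0:
        # Assumption: with no spread in degree, every trunk gets a neutral D.
        return {t: 0.5 for t in trunk_ids}
    return {t: 1.0 / (1.0 + math.exp(-(d - mu) / sigma * steepness))
            for t, d in degree.items()}

# Trunk 0 is referenced by three others; the weak (1, 2) edge is filtered out.
sizes = {0: 4, 1: 4, 2: 4, 3: 4}
cross_edges = {(0, 1): [0.8, 0.6], (0, 2): [0.7, 0.9],
               (0, 3): [0.5, 0.5], (1, 2): [0.01]}
graph = build_trunk_graph(cross_edges, sizes)
D = consolidation(graph, sizes.keys())
```

In this toy graph the hub trunk lands near the top of the sigmoid while the least connected trunk falls well below 0.5, which is the tail separation described above.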
Why these dimensions are independent
A few worked examples make the independence concrete:
"The secret code for Project Aurora is 7492"
    M_i: HIGH (rare tokens, semantic outlier in filler)
    D: LOW (no other trunk references it)
    → Survives via M_i path
"The defendant argued the statute violated the First Amendment"
    M_i: LOW (legal language consistent with surroundings)
    D: HIGH (referenced by evidence, ruling, precedent trunks)
    → Survives via D path
"The importance of understanding complex systems cannot be overstated"
    M_i: LOW (common words, repeated phrasing)
    D: LOW (self-repetition doesn't make diverse cross-trunk edges)
    → Evicted: both paths weak
Either path alone is enough to survive. Both weak means you go.
4. The score formula: max, not multiplication
score(trunk) = max(D, α × normalize(log(1 + M_i)))
The biological framework uses multiplication because biological memory really does work that way — an unconsolidated memory fades even if the initial encoding was intense. But the KV cache doesn’t need to simulate that defect.
Multiplication couples the dimensions: D = 0 kills a trunk regardless of how distinctive it is. That would kill every needle. We want exactly the opposite: a trunk should survive if it has any one strong reason to live, rather than be evicted because it does not have every reason at once.
max gives two independent paths. The α parameter controls the relative strength of the M_i path versus the D path: α > 1 favors needles, α < 1 favors document comprehension. Default 1.0.
The log(1 + M_i) then normalize step puts the M_i path on a comparable scale to D ∈ [0, 1] so the max makes sense.
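A small Python sketch makes the difference between max and multiplication visible. The D and M_i values below are made up to mirror the needle, backbone, and filler examples from section 3, and min-max is assumed for the normalize step, which the text leaves unspecified.

```python
import math

def mi_path(trunk_mi_values):
    """normalize(log(1 + M_i)) across trunks; min-max is an assumption."""
    logged = [math.log1p(m) for m in trunk_mi_values]
    lo, hi = min(logged), max(logged)
    if hi == lo:
        return [0.0] * len(logged)
    return [(v - lo) / (hi - lo) for v in logged]

def scores(d_values, trunk_mi_values, alpha=1.0):
    """score = max(D, alpha * normalize(log(1 + M_i)))."""
    return [max(d, alpha * m)
            for d, m in zip(d_values, mi_path(trunk_mi_values))]

# Illustrative trunks: needle, backbone, filler.
D = [0.05, 0.95, 0.10]
MI = [8.0, 1.0, 0.5]

max_scores = scores(D, MI)
# The multiplicative alternative, shown only for comparison.
product_scores = [d * m for d, m in zip(D, mi_path(MI))]
```

With max, the needle and the backbone both rank far above the filler. With the product, the needle's low D drags its score down to 0.05, which is the failure the section describes.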
5. Branch dissolution: trunks aren’t kept or evicted, they erode
Most cache compression methods make a binary decision per unit: keep or drop. CrystalCache does something more graduated, mapped from the multi-branch crystallization idea — under resource pressure, the weak branches of a memory detach first while the strong branches persist.
In KV terms: a sentence has function words and content words. Under pressure, the function words should drop first, leaving the proposition intact, even if the trunk as a whole is reduced.
This runs as two stages:
Stage 1: trunk-level budget allocation
Rank all trunks by score = max(D, α × normalize(log(1 + M_i)))
    Top 30%: keep_ratio = 1.0 (fully preserved)
    Middle 40%: keep_ratio = 0.5–0.8 (partial dissolution)
    Bottom 30%: keep_ratio = 0.0 (fully evicted)
Tier boundaries shift to satisfy the global budget.
The boundaries are not fixed thresholds — at high budget pressure (20% retention), the bottom tier expands aggressively. At low pressure (50% retention), most trunks are top or middle tier.
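As a rough illustration of Stage 1, the sketch below assigns keep_ratio by rank. Two things are assumptions rather than anything the text specifies: the middle tier is interpolated linearly from 0.8 down to 0.5 by rank, and the budget-driven shifting of tier boundaries is left out. The tier fractions are exposed as parameters so a caller could move them.

```python
def assign_keep_ratios(scores, top_frac=0.3, bottom_frac=0.3,
                       mid_hi=0.8, mid_lo=0.5):
    """Return one keep_ratio per trunk, in the same order as `scores`."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    n_top = round(n * top_frac)
    n_bottom = round(n * bottom_frac)
    n_mid = n - n_top - n_bottom
    ratios = [0.0] * n
    for rank, idx in enumerate(order):
        if rank < n_top:
            ratios[idx] = 1.0  # fully preserved
        elif rank < n_top + n_mid:
            # Assumption: linear interpolation across the middle tier.
            pos = rank - n_top
            frac = pos / (n_mid - 1) if n_mid > 1 else 0.0
            ratios[idx] = mid_hi - frac * (mid_hi - mid_lo)
        else:
            ratios[idx] = 0.0  # fully evicted
    return ratios

example_scores = [0.9, 0.1, 0.8, 0.2, 0.7, 0.3, 0.6, 0.4, 0.5, 0.0]
example_ratios = assign_keep_ratios(example_scores)
```

Under tighter budgets one would raise bottom_frac and lower top_frac until the retained token count fits; how CrystalCache performs that search is not described here, so the sketch stops at fixed tiers.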
Stage 2: trunk-local token selection
For each trunk with 0 < keep_ratio < 1:
    k = round(trunk.size × keep_ratio)
    If k < min_keep (=3): evict the entire trunk instead of leaving fragments
    Otherwise: rank member tokens by token-level M_i, keep top-k, drop the rest
A subtle but important detail: the within-trunk ranking is trunk-local, not global. A “weak” token in a high-M_i trunk (“for” in “secret code for Aurora”) might have higher absolute M_i than a “strong” token in a low-M_i trunk. A global ranking would strip low-M_i trunks bare while leaving high-M_i trunks completely untouched, defeating the whole point of graduated dissolution.
A worked example, showing what survives at different budgets:
"The secret code for Project Aurora is 7492."
Token M_i:
    The=0.3 secret=5.8 code=8.1 for=0.9
    Project=2.3 Aurora=3.5 is=0.7 7492=6.4 .=0.3
keep_ratio = 0.5 (5 of 9):
    Keep: code, 7492, secret, Aurora, Project
    Surviving: "secret code Project Aurora 7492"
    → Core proposition preserved.
keep_ratio = 0.3 (3 of 9):
    Keep: code, 7492, secret
    Surviving: "secret code 7492"
    → Minimal but the key fact survives.
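The worked example can be reproduced with a few lines of Python. This sketch assumes that a trunk whose k falls below min_keep is evicted whole, and it rounds half up so that 9 × 0.5 gives 5 as in the example (Python's built-in round would give 4). The function name is illustrative.

```python
import math

def dissolve_trunk(tokens, token_mi, keep_ratio, min_keep=3):
    """Keep the top-k tokens of one trunk by token-level M_i, in original order."""
    # Round half up; built-in round() uses banker's rounding (round(4.5) == 4).
    k = math.floor(len(tokens) * keep_ratio + 0.5)
    if k < min_keep:
        return []  # evict the whole trunk rather than leave fragments
    # The ranking is local to this trunk, never compared across trunks.
    top = sorted(range(len(tokens)), key=lambda i: token_mi[i], reverse=True)[:k]
    return [tokens[i] for i in sorted(top)]

tokens = ["The", "secret", "code", "for", "Project", "Aurora", "is", "7492", "."]
mi = [0.3, 5.8, 8.1, 0.9, 2.3, 3.5, 0.7, 6.4, 0.3]

half = dissolve_trunk(tokens, mi, 0.5)    # secret code Project Aurora 7492
third = dissolve_trunk(tokens, mi, 0.3)   # secret code 7492
gone = dissolve_trunk(tokens, mi, 0.2)    # k = 2 < min_keep, so nothing survives
```

Sorting the surviving indices before returning keeps the tokens in reading order, which is what lets the remaining fragment still read as a proposition.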
Sink tokens (the first few positions of the sequence) and the recent window are protected — never subject to dissolution or eviction, same as standard practice.
6. Decode-phase dynamics
Everything above happens at prefill. During decode, D evolves over time. This part of the system is mostly latent in short-decode benchmarks but becomes load-bearing in long-running scenarios.
D decay
Every decode step, every alive trunk’s D decays slightly:
decay_rate = β₀ × exp(−D × M_i / D_c)
D_new = D − decay_rate × D × dt
This is where we use the biological dissolution equation faithfully. High D and high M_i together suppress decay, because the exponent scales with their product:
- A trunk with D = 0.9 and M_i = 8.0 barely decays (half-life ~270 steps)
- A trunk with D = 0.3 and M_i = 1.0 decays rapidly (half-life ~14 steps)
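Over thousands of decode steps, this creates progressive differentiation: important trunks stay important, marginal ones fade. Only D decays. M_i is fixed at prefill, so a high-shock trunk keeps its M_i path in the score even after its D has faded.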
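The decay rule is simple enough to simulate directly. The sketch below applies the two update equations with the default β₀ = 0.05 and D_c = 2.0 and assumes dt = 1 per decode step. The half-life it yields depends on all three, so treat any numbers it prints as illustrative rather than as the figures quoted above.

```python
import math

def decay_step(d, m_i, beta_0=0.05, d_c=2.0, dt=1.0):
    """One decode step: D_new = D - beta_0 * exp(-D * M_i / D_c) * D * dt."""
    rate = beta_0 * math.exp(-d * m_i / d_c)
    return d - rate * d * dt

def simulate(d, m_i, steps, **kwargs):
    """Apply decay_step repeatedly with no renewal in between."""
    for _ in range(steps):
        d = decay_step(d, m_i, **kwargs)
    return d

strong = simulate(0.9, 8.0, steps=50)   # high D, high M_i
weak = simulate(0.3, 1.0, steps=50)     # low D, low M_i
```

After 50 steps the strong trunk has lost only a small fraction of its D while the weak one has lost most of it. Because the rate itself grows as D shrinks, the decay accelerates once a trunk starts to fade.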
Breath cycle: renewal via resonance
Every 64 decode steps:
- Detect activated trunks from key-norm η (an inverse-key-norm proxy for attention)
- Activated trunks: D += renewal_rate × (1 − D)
- BFS resonance from activated trunks along the trunk graph: neighbors get a smaller pulse of renewal
- Re-evaluate the budget and apply branch dissolution if over
The resonance step is the mechanism by which D updates structurally — when a trunk gets used, its neighbors in the association graph also recover some D. This is the closest the system gets to true online learning of the importance structure.
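One way to picture the renewal and resonance steps is the following Python sketch. Several details are assumptions, since the text only says neighbors get "a smaller pulse": the pulse reuses the renewal form with resonance_rate in place of renewal_rate, each trunk is renewed at most once per cycle, and the hypothetical max_hops parameter bounds how far the BFS travels.

```python
from collections import deque

def breath_cycle(d, graph, activated, renewal_rate=0.30,
                 resonance_rate=0.15, max_hops=1):
    """d: {trunk: D}; graph: {trunk: [neighbor trunks]}; activated: set of trunks."""
    new_d = dict(d)
    # Direct renewal: D += renewal_rate * (1 - D)
    for t in activated:
        new_d[t] = d[t] + renewal_rate * (1.0 - d[t])
    # BFS outward from the activated trunks along the trunk graph.
    seen = set(activated)
    queue = deque((t, 0) for t in activated)
    while queue:
        t, hops = queue.popleft()
        if hops == max_hops:
            continue
        for nb in graph.get(t, ()):
            if nb in seen:
                continue
            seen.add(nb)
            # Assumed pulse form: same shape as renewal, smaller rate.
            new_d[nb] = d[nb] + resonance_rate * (1.0 - d[nb])
            queue.append((nb, hops + 1))
    return new_d

d = {"a": 0.5, "b": 0.2, "c": 0.4}
graph = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
one_hop = breath_cycle(d, graph, {"a"})
two_hop = breath_cycle(d, graph, {"a"}, max_hops=2)
```

Because both updates move D toward 1 by a fraction of the remaining gap, D stays in [0, 1] without clamping.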
When this matters
In short-decode benchmarks (50–200 tokens), dissolution barely runs. The system is effectively a one-shot static scorer at prefill. Dissolution becomes load-bearing in:
- Multi-turn conversations (thousands of decode steps between turns)
- Streaming generation (continuous KV growth requiring periodic compression)
- Long-form reasoning (extended chain-of-thought with shifting relevance)
The architecture supports both regimes with no configuration change — dissolution is always running, it just doesn’t have time to produce visible effects in short-decode runs.
7. The full pipeline
Prefill
    Phase 1: Chunked forward + signal collection
        For each 128-token chunk (eager attention):
            Extract received attention → token attention_salience
            Extract co-attention edges (top-8 per token, threshold 0.3)
            Free attention tensors immediately
    Phase 1.5: M_i computation
        For each token:
            attention_salience: from Phase 1 (normalized)
            token_uniqueness: 1/(1 + log(1 + count))
            M_i(token) = α × salience + β × uniqueness
    Phase 2: Build trunks
        Split by sentence boundaries
        Merge adjacent sentences with strong co-attention
        Split oversized trunks (> 32 tokens)
        Trunk M_i = mean(top-3 of member token M_i values)
    Phase 3: Trunk graph + D
        Aggregate co-attention edges into trunk-level edges
        weighted_degree per trunk = sum of edge strengths
        D = sigmoid((weighted_degree − μ) / σ × steepness)
    Phase 4: Branch dissolution + eviction
        score = max(D, α × normalize(log(1 + M_i)))
        Stage 1: rank trunks → assign keep_ratio
        Stage 2: within each trunk, keep top-k tokens by token M_i
        Sink + recent window untouched
        KV compaction: index_select retained positions, update pos_map
Decode
    Every step:
        model(new_token, past_key_values, SDPA)
        D decay for all alive trunks
        Register new token position in pos_map
    Every 64 steps (breath cycle):
        Key-norm η → trunk activation detection
        Activated trunks: D renewal
        Resonance BFS: propagate D renewal to neighbors
        Hebbian evolution: strengthen co-activated edges
        Re-evaluate budget → branch dissolution if over
        KV compaction + pos_map update
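The compaction step that closes both phases can be sketched without any tensor library. Here pos_map is assumed to be a list mapping each cache slot to the original token position it holds, which is a guess at its role. Plain lists stand in for the KV tensors, and in a PyTorch implementation the list comprehension would be an index_select along the sequence dimension.

```python
def retained_slots(n_slots, kept_slots, sink_tokens=4, recent_window=128):
    """Union of protected slots (sink + recent window) and dissolution survivors."""
    protected = set(range(min(sink_tokens, n_slots)))
    protected |= set(range(max(0, n_slots - recent_window), n_slots))
    return sorted(protected | set(kept_slots))

def compact(cache_rows, pos_map, slots):
    """Gather the retained slots and carry their original positions along.

    cache_rows: per-slot KV entries (any objects here).
    pos_map: slot index -> original token position (assumed layout).
    """
    new_rows = [cache_rows[s] for s in slots]
    new_pos_map = [pos_map[s] for s in slots]
    return new_rows, new_pos_map

# Toy cache of 10 slots holding original positions 100..109.
rows = list("abcdefghij")
pos_map = list(range(100, 110))
slots = retained_slots(10, kept_slots=[5], sink_tokens=2, recent_window=3)
rows2, pos_map2 = compact(rows, pos_map, slots)
```

Keeping the original positions alongside the compacted rows means later cycles can still address tokens by where they sat in the input, even after several rounds of eviction.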
8. Parameters that matter
mi:
  w_attention: 0.5 # weight for attention_salience
  w_uniqueness: 0.5 # weight for Von Restorff
  mi_top_k: 3 # top-k token M_i for trunk M_i
crystallization:
  sigmoid_steepness: 2.0 # controls D distribution spread
  min_edge_strength: 0.05 # trunk graph edge threshold
score:
  alpha: 1.0 # M_i path weight in max(D, α × norm(log(1 + M_i)))
dissolution:
  min_keep_tokens: 3 # below this, evict entire trunk
decode_dissolution:
  beta_0: 0.05 # base decay rate
  D_c: 2.0 # characteristic value
  renewal_rate: 0.30 # D recovery on direct activation
  resonance_rate: 0.15 # D recovery from resonance pulse
cache:
  budget_ratio: 0.5
  sink_tokens: 4
  recent_window: 128
  prefill_chunk_size: 128
  decode_eviction_interval: 64
9. Closing thoughts
The interesting move in this design is not any individual mechanism — it’s what we didn’t import. It’s tempting, when borrowing from a rich theoretical framework, to take everything. The framework comes with a coherent story for why each piece exists, and removing pieces feels like throwing away signal.
But the story exists in service of modeling biological memory, with all its defects. KV cache management has a different problem. The vectors are exact, there is no recall-induced fragility, and there is no reason a high-shock token should decay just because nothing else points to it.
The structural insight is the borrowed part. Two independent dimensions, max instead of multiplication, graduated dissolution instead of binary keep-or-drop, weak branches detaching first under pressure. Everything else is engineering that fits the actual problem in front of us, not the original biological one.
Memory psychology gave us a good vocabulary. The implementation is its own thing.