CrystalCache: Borrowing from Memory Psychology to Manage the KV Cache

When a language model reads a long document, every token leaves a trace in the KV cache. The cache grows linearly, memory does not, and at some point we have to throw things away. The interesting question is what to throw away.

Most cache compression methods answer this with a single signal — attention weight, recency, or a learned importance score. CrystalCache asks a different question: when humans remember things from a long passage, what survives, and why? The answer from memory psychology is that survival has two independent reasons, and we can borrow that structure for KV cache management.

This post walks through the architecture — what we kept from the biological framework, what we deliberately threw away, and how the pieces fit together.

1. The two reasons a memory persists

The Structural Crystallization framework describes biological memory as having two independent survival dimensions:

Encoding shock: a single intense experience locks a memory in. The flashbulb memory of a shocking event survives without rehearsal.
Consolidation through association: repeated cross-referencing with other memories builds a structural backbone. Concepts you encounter from many angles become impossible to forget.

These two dimensions matter independently. A surprising one-time fact survives because of shock. A routine concept that shows up everywhere survives because of consolidation. Either path is enough.

This maps cleanly onto the problem of which KV entries to keep:

A “needle” in a haystack — the secret code is 7492 buried in unrelated filler — is salient at encoding but isolated structurally.
A document backbone concept — a legal statute referenced throughout a court ruling — is unremarkable at any single point but is referenced everywhere.

Both should survive eviction, for completely different reasons. A scoring formula that only captures one will fail on the other.

2. What we borrowed, what we threw away

The biological framework comes with a lot of mechanisms. Some of them model real defects of biological memory — distortion on recall, fragility during reconsolidation, decay even of intense memories without rehearsal. KV vectors have none of these defects. They are exact digital storage. Importing the defect-modeling machinery would add complexity without any corresponding phenomenon to model.

So we took only the structural insights:

Borrowed	Discarded
Two independent survival dimensions	Fidelity loss (F) — KV vectors don’t deform on access
Encoding shock as a one-shot variable	Labilization window (R) — there is no “internal recall” event for KV
Consolidation through cross-references	D × M_i coupling in dissolution — a high-shock token shouldn’t decay just because it lacks connections
Branch competition under resource limits	Access-mode classification — its only purpose was driving F and R

The result is a leaner system that keeps the structural intuition from psychology without simulating biology’s failure modes.

3. The two dimensions, made concrete

M_i: encoding shock

M_i is set once at prefill and never changes. It captures how much a token group stands out from its surrounding context at the moment it was encoded.

It combines two signals:

Attention salience — how much attention this token receives from others during the prefill forward pass. We use received attention (column-sum of the attention matrix), averaged across the top-3 attention heads. Top-3 avoids dilution from indifferent heads while staying robust to single-head outliers.

Token uniqueness — a Von Restorff isolation effect. A token appearing once in 6000 tokens of context is far more distinctive than one appearing 200 times:

uniqueness(token) = 1 / (1 + log(1 + count_in_context))

A trunk’s M_i is the mean of its top-3 token values, then normalized to [0, 1] across all trunks in the context.

M_i = α × attention_salience + β × token_uniqueness

Default α = β = 0.5.

D: consolidation through association

In the original biological framework, D accumulates over time through repeated referenced access — many study sessions, spread across contexts. In one-shot prefill there is no temporal repetition, so we compute D from the spatial structure of cross-trunk attention. A trunk referenced by many different trunks has been “consolidated” through multiple independent associations, which is the structural analogue of repeated study.

The computation:

1. Build the trunk graph
   For each pair of trunks (A, B):
       cross_edges = co-attention edges between members of A and B
       strength = mean(edge_weights) × sqrt(edge_count / (|A| × |B|))
       if strength > 0.05: add_edge(A, B, strength)

2. Weighted degree per trunk
   weighted_degree(T) = sum of strengths of edges incident to T

3. Map to D via z-scored sigmoid
   D = sigmoid((weighted_degree − μ) / σ × steepness)

The sigmoid is doing real work here. A linear mapping would compress most trunks into a narrow band — most trunks have similar degree. The sigmoid amplifies the tails, so the highly connected backbone trunks get D close to 1.0 and the isolated trunks close to 0.0. Default steepness = 2.0.

Z-scoring before the sigmoid means D has meaningful variance regardless of the absolute degree distribution of any particular document.

Why these dimensions are independent

A few worked examples make the independence concrete:

"The secret code for Project Aurora is 7492"
  M_i: HIGH  (rare tokens, semantic outlier in filler)
  D:   LOW   (no other trunk references it)
  → Survives via M_i path

"The defendant argued the statute violated the First Amendment"
  M_i: LOW   (legal language consistent with surroundings)
  D:   HIGH  (referenced by evidence, ruling, precedent trunks)
  → Survives via D path

"The importance of understanding complex systems cannot be overstated"
  M_i: LOW   (common words, repeated phrasing)
  D:   LOW   (self-repetition doesn't make diverse cross-trunk edges)
  → Evicted: both paths weak

Either path alone is enough to survive. Both weak means you go.

4. The score formula: max, not multiplication

score(trunk) = max(D, α × normalize(log(1 + M_i)))

The biological framework uses multiplication because biological memory really does work that way — an unconsolidated memory fades even if the initial encoding was intense. But the KV cache doesn’t need to simulate that defect.

Multiplication couples the dimensions: D = 0 kills a trunk regardless of how distinctive it is. That would kill every needle. We want exactly the opposite — a trunk should survive if it has any strong reason to live, not be killed because it lacks every reason at once.

max gives two independent paths. The α parameter controls the relative strength of the M_i path versus the D path: α > 1 favors needles, α < 1 favors document comprehension. Default 1.0.

The log(1 + M_i) then normalize step puts the M_i path on a comparable scale to D ∈ [0, 1] so the max makes sense.

5. Branch dissolution: trunks aren’t kept or evicted, they erode

Most cache compression methods make a binary decision per unit: keep or drop. CrystalCache does something more graduated, mapped from the multi-branch crystallization idea — under resource pressure, the weak branches of a memory detach first while the strong branches persist.

In KV terms: a sentence has function words and content words. Under pressure, the function words should drop first, leaving the proposition intact, even if the trunk as a whole is reduced.

This runs as two stages:

Stage 1: trunk-level budget allocation

Rank all trunks by score = max(D, α × normalize(log(1 + M_i)))

Top 30%:    keep_ratio = 1.0   (fully preserved)
Middle 40%: keep_ratio = 0.5–0.8  (partial dissolution)
Bottom 30%: keep_ratio = 0.0   (fully evicted)

Tier boundaries shift to satisfy the global budget.

The boundaries are not fixed thresholds — at high budget pressure (20% retention), the bottom tier expands aggressively. At low pressure (50% retention), most trunks are top or middle tier.

Stage 2: trunk-local token selection

For each trunk with 0 < keep_ratio < 1:
    k = max(min_keep, round(trunk.size × keep_ratio))
    Rank member tokens by token-level M_i
    Keep top-k, drop the rest
    If k < min_keep (=3): evict the entire trunk instead of leaving fragments

A subtle but important detail: the within-trunk ranking is trunk-local, not global. A “weak” token in a high-M_i trunk (for in secret code for Aurora) might have higher absolute M_i than a “strong” token in a low-M_i trunk. Globally ranking would strip low-M_i trunks bare while leaving high-M_i trunks completely untouched — defeating the whole point of graduated dissolution.

A worked example, showing what survives at different budgets:

"The secret code for Project Aurora is 7492."

Token M_i:
  The=0.3  secret=5.8  code=8.1  for=0.9
  Project=2.3  Aurora=3.5  is=0.7  7492=6.4  .=0.3

keep_ratio = 0.5 (5 of 9):
  Keep: code, 7492, secret, Aurora, Project
  Surviving: "secret code Project Aurora 7492"
  → Core proposition preserved.

keep_ratio = 0.3 (3 of 9):
  Keep: code, 7492, secret
  Surviving: "secret code 7492"
  → Minimal but the key fact survives.

Sink tokens (the first few positions of the sequence) and the recent window are protected — never subject to dissolution or eviction, same as standard practice.

6. Decode-phase dynamics

Everything above happens at prefill. During decode, D evolves over time. This part of the system is mostly latent in short-decode benchmarks but becomes load-bearing in long-running scenarios.

D decay

Every decode step, every alive trunk’s D decays slightly:

decay_rate = β₀ × exp(−D × M_i / D_c)
D_new = D − decay_rate × D × dt

This is where we use the biological dissolution equation faithfully. High D and high M_i both suppress decay:

A trunk with D = 0.9 and M_i = 8.0 barely decays (half-life ~270 steps)
A trunk with D = 0.3 and M_i = 1.0 decays rapidly (half-life ~14 steps)

Over thousands of decode steps, this creates progressive differentiation — important trunks stay important, marginal ones fade.

Breath cycle: renewal via resonance

Every 64 decode steps:

Detect activated trunks from key-norm η (an inverse-key-norm proxy for attention)
Activated trunks: D += renewal_rate × (1 − D)
BFS resonance from activated trunks along the trunk graph: neighbors get a smaller pulse of renewal
Re-evaluate the budget and apply branch dissolution if over

The resonance step is the mechanism by which D updates structurally — when a trunk gets used, its neighbors in the association graph also recover some D. This is the closest the system gets to true online learning of the importance structure.

When this matters

In short-decode benchmarks (50–200 tokens), dissolution barely runs. The system is effectively a one-shot static scorer at prefill. Dissolution becomes load-bearing in:

Multi-turn conversations (thousands of decode steps between turns)
Streaming generation (continuous KV growth requiring periodic compression)
Long-form reasoning (extended chain-of-thought with shifting relevance)

The architecture supports both regimes with no configuration change — dissolution is always running, it just doesn’t have time to produce visible effects in short-decode runs.

7. The full pipeline

Prefill

Phase 1 — Chunked forward + signal collection
  For each 128-token chunk (eager attention):
    Extract received attention → token attention_salience
    Extract co-attention edges (top-8 per token, threshold 0.3)
    Free attention tensors immediately

Phase 1.5 — M_i computation
  For each token:
    attention_salience: from Phase 1 (normalized)
    token_uniqueness: 1/(1 + log(1 + count))
  M_i(token) = α × salience + β × uniqueness

Phase 2 — Build trunks
  Split by sentence boundaries
  Merge adjacent sentences with strong co-attention
  Split oversized trunks (> 32 tokens)
  Trunk M_i = mean(top-3 of member token M_i values)

Phase 3 — Trunk graph + D
  Aggregate co-attention edges into trunk-level edges
  weighted_degree per trunk = sum of edge strengths
  D = sigmoid((weighted_degree − μ) / σ × steepness)

Phase 4 — Branch dissolution + eviction
  score = max(D, α × normalize(log(1 + M_i)))
  Stage 1: rank trunks → assign keep_ratio
  Stage 2: within each trunk, keep top-k tokens by token M_i
  Sink + recent window untouched
  KV compaction: index_select retained positions, update pos_map

Decode

Every step:
  model(new_token, past_key_values, SDPA)
  D decay for all alive trunks
  Register new token position in pos_map

Every 64 steps (breath cycle):
  Key-norm η → trunk activation detection
  Activated trunks: D renewal
  Resonance BFS: propagate D renewal to neighbors
  Hebbian evolution: strengthen co-activated edges
  Re-evaluate budget → branch dissolution if over
  KV compaction + pos_map update

8. Parameters that matter

mi:
  w_attention: 0.5         # weight for attention_salience
  w_uniqueness: 0.5        # weight for Von Restorff
  mi_top_k: 3              # top-k token M_i for trunk M_i

crystallization:
  sigmoid_steepness: 2.0   # controls D distribution spread
  min_edge_strength: 0.05  # trunk graph edge threshold

score:
  alpha: 1.0               # M_i path weight in max(D, α × norm(log(1 + M_i)))

dissolution:
  min_keep_tokens: 3       # below this, evict entire trunk

decode_dissolution:
  beta_0: 0.05             # base decay rate
  D_c: 2.0                 # characteristic value
  renewal_rate: 0.30       # D recovery on direct activation
  resonance_rate: 0.15     # D recovery from resonance pulse

cache:
  budget_ratio: 0.5
  sink_tokens: 4
  recent_window: 128

prefill_chunk_size: 128
decode_eviction_interval: 64

9. Closing thoughts

The interesting move in this design is not any individual mechanism — it’s what we didn’t import. It’s tempting, when borrowing from a rich theoretical framework, to take everything. The framework comes with a coherent story for why each piece exists, and removing pieces feels like throwing away signal.

But the story exists in service of modeling biological memory, with all its defects. KV cache management has a different problem. The vectors are exact, there is no recall-induced fragility, and there is no reason a high-shock token should decay just because nothing else points to it.

The structural insight is the borrowed part. Two independent dimensions, max instead of multiplication, graduated dissolution instead of binary keep-or-drop, weak branches detaching first under pressure. Everything else is engineering that fits the actual problem in front of us, not the original biological one.

Memory psychology gave us a good vocabulary. The implementation is its own thing.