Your AI Remembers Everything Except the Thing You Keep Telling It
Every AI agent starts with a system prompt.
It might be a few sentences instructing the model to respond formally, or thousands of tokens of business context, product knowledge, and behavioral guardrails. Either way, every single request your application sends includes it. Word for word, token for token, every time.
And every single time, the GPU recomputes it from scratch.
If you're running a support bot handling ten thousand conversations per day, you're paying to recompute the same system prompt ten thousand times. A five hundred token prefix becomes five million tokens of repeated inference work that produces the exact same result every time. The model already “understands” that context. The infrastructure just isn't allowed to reuse it.
Prefix caching addresses this specific problem, and it's already in many production inference engines. For certain workloads it works remarkably well. The catch is that most real AI usage doesn't look like the workload prefix caching was optimized for, and the gap between where it works and where people assume it works is costing teams more than they're measuring yet.
What prefix caching is good at
During inference, an LLM processes tokens sequentially and builds a KV cache representing the attention state for everything it has read so far. If two requests begin with an identical token sequence, that KV state can be reused instead of recomputed.
The system prompt is a great example. Compute the KV cache for those tokens once, store the resulting blocks, and subsequent requests skip directly to the part of the prompt that's actually new.
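As a rough sketch of that reuse (not any engine's real API — `compute_kv` is a stand-in for the expensive GPU prefill step), the logic amounts to caching the prefill result under a hash of the shared prefix, so only the request-specific suffix is computed each time:

```python
import hashlib

# Hypothetical sketch: cache the attention state for a shared prefix so
# only the new part of each request is computed. compute_kv stands in
# for the real prefill an engine like vLLM performs on the GPU.
kv_cache = {}

def compute_kv(tokens):
    # Placeholder for the expensive prefill over `tokens`.
    return f"kv_state({len(tokens)} tokens)"

def prefill(prompt_tokens, prefix_len):
    """Reuse cached KV state for the first prefix_len tokens if present."""
    prefix = tuple(prompt_tokens[:prefix_len])
    key = hashlib.sha256(repr(prefix).encode()).hexdigest()
    if key not in kv_cache:
        kv_cache[key] = compute_kv(prefix)   # pay the prefix cost once
    suffix = prompt_tokens[prefix_len:]      # only this part is new work
    return kv_cache[key], compute_kv(suffix)

system = list(range(500))                    # a 500-token system prompt
_, s1 = prefill(system + [1000, 1001], 500)  # first request: prefix miss
_, s2 = prefill(system + [2000, 2001], 500)  # later requests: prefix hit
```

After the first request, every subsequent call with the same 500-token prefix skips straight to its two-token suffix.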
For enterprise RAG deployments the savings are even more dramatic. When many users query the same document or knowledge base, large portions of their request context are identical. Without prefix caching, every request pays full recomputation cost for that shared material. With it, you effectively pay once.
Engines like vLLM and SGLang already implement this optimization. For static shared prefixes, it is a clean and effective solution.
The block alignment problem
KV cache state isn't stored as one continuous object per session. It's divided into fixed-size token blocks (16 tokens per block is vLLM's default). Prefix caching works by hashing these blocks and checking whether a matching hash already exists in the cache.
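A toy version of that scheme, assuming chained SHA-256 hashes over 16-token blocks so that a block's identity depends on its entire prefix (real engines differ in detail), looks like this:

```python
import hashlib

BLOCK = 16  # vLLM's default block size

def block_hashes(tokens):
    """Hash each full 16-token block, chaining in the previous block's
    hash so a block's identity depends on everything before it."""
    hashes, prev = [], ""
    for i in range(0, len(tokens) - len(tokens) % BLOCK, BLOCK):
        h = hashlib.sha256((prev + repr(tokens[i:i + BLOCK])).encode()).hexdigest()
        hashes.append(h)
        prev = h
    return hashes  # the trailing partial block is deliberately excluded

a = block_hashes(list(range(40)))  # 40 tokens -> 2 full blocks, 8 left over
b = block_hashes(list(range(48)))  # same prefix, now 3 full blocks
```

The two requests share their first two block hashes, so those blocks are reusable; the 8 leftover tokens in the first request produce no cacheable block at all.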
For static prefixes like system prompts, reuse is nearly perfect. Tokens and block hashes remain stable and cache hits are consistent.
For a growing conversation, that alignment breaks almost every turn.
As users add turns, the total token sequence grows. Conversations rarely end cleanly on block boundaries, and the partial block at the end of a turn can't be cached because its contents change as soon as new tokens arrive. When the next turn is added, those tail tokens move into the middle of the sequence, and if anything about the earlier context shifts between turns, the blocks downstream of that change hash differently, causing misses to cascade forward.
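A small self-contained simulation makes the cost concrete. Assuming chained SHA-256 hashes over full 16-token blocks and made-up per-turn token counts, we can count how many tokens must be prefilled again on each turn because they weren't covered by a cached full block:

```python
import hashlib

BLOCK = 16

def full_block_hashes(tokens):
    """Chained hashes over full 16-token blocks; partial tail excluded."""
    hashes, prev = [], ""
    for i in range(0, len(tokens) - len(tokens) % BLOCK, BLOCK):
        h = hashlib.sha256((prev + repr(tokens[i:i + BLOCK])).encode()).hexdigest()
        hashes.append(h)
        prev = h
    return hashes

cached, history, recomputed = set(), [], 0
for turn_len in [23, 37, 18, 41]:  # hypothetical tokens added per turn
    history += [len(history) + i for i in range(turn_len)]
    hashes = full_block_hashes(history)
    hits = sum(h in cached for h in hashes)
    # Every token not covered by a cached full block is prefilled again:
    recomputed += len(history) - hits * BLOCK
    cached.update(hashes)
```

In this run the conversation ends at 119 tokens, but 152 tokens of prefill were performed along the way: the trailing partial block of each turn is recomputed on the next one. The longer and more uneven the turns, the wider that gap grows.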
This means prefix caching often delivers lower hit rates for conversational workloads than teams expect, especially as sessions become longer and more dynamic.
Conversations don't behave like cache workloads
Prefix caching works best when the beginning of a request remains stable. Conversations do the opposite.
Each turn reshapes the token sequence. Boundaries shift. Reuse opportunities get smaller and smaller over time.
Many benchmarks are designed to maximize reuse by repeating identical prompts or using large static prefixes. Real conversations accumulate meaning turn by turn. Context evolves, and reuse becomes structurally harder.
That makes conversational inference fundamentally different from traditional caching problems. Reuse isn't just a function of block size or hashing strategy. It depends on how state is structured and carried forward across a session.
Prefix caching solved the easy case: shared static context.
The harder (and more common) case is long-lived interaction. Supporting those workloads efficiently requires infrastructure that understands conversational structure, not just token sequences.
For decades, cache systems have treated data as opaque blobs. If two byte sequences matched, reuse was possible. If they didn't, recomputation was inevitable.
Conversational inference turns that upside down. The cost of serving AI is increasingly determined not just by how much data is processed, but by how that data evolves over time.
Prefix caching reduced the cost of repeating the past. The next generation of infrastructure will need to manage the cost of remembering it.
At Momento, we're actively researching how conversational state impacts inference efficiency at scale. If you're running long-lived AI workloads and seeing similar challenges, we'd love to compare notes.