
KV Cache Isn't a Caching Problem


The industry is debating where to store KV cache. That's the wrong debate.

You step away from a conversation with your AI assistant to grab a coffee. Ten minutes later you come back, ask a follow-up question, and notice it feels slower. That spinner runs a little longer than usual. The model seems to be thinking harder than it should for what felt like a simple question.

It is thinking harder. It forgot everything while you were gone, and now it's recomputing from scratch. That wasted work costs real money, and the leading proposed solution, tiered KV cache storage, is solving the wrong problem.

Why LLM KV cache is nothing like traditional caching

Caching infrastructure is generally built around small objects. One kilobyte, four kilobytes. Millions of transactions per second. The engineers who built systems like DynamoDB or Valkey spent years optimizing for that profile: minimize per-object latency, maximize IOPS, keep tiny things moving insanely fast.

That experience is a genuinely useful background for thinking about KV cache. The first principles around utilization, eviction policy, and hit rate all still apply. But the workload characteristics are so different that the solutions don't transfer cleanly, and assuming they do is where the current thinking goes wrong.

A single user's conversation context can run to multiple gigabytes of KV cache, and larger models produce larger caches still. When many users are querying the same document or knowledge base, you're moving the same enormous objects repeatedly, often to the same destinations. The transaction rate drops dramatically because the object size has ballooned.
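To make the scale concrete, the cache footprint follows directly from the model's attention dimensions: keys plus values, for every layer, head, and token in the context. A back-of-envelope sketch, using assumed (not measured) dimensions for a 70B-class model with grouped-query attention:

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Total KV cache size: keys + values (the factor of 2) across all
    layers, KV heads, and tokens, at the given element width (fp16 = 2 bytes)."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem

# Illustrative 70B-class config (assumed numbers, not any specific model):
# 80 layers, 8 KV heads under grouped-query attention, head_dim 128, fp16.
size = kv_cache_bytes(num_layers=80, num_kv_heads=8, head_dim=128, seq_len=32_768)
print(f"{size / 2**30:.1f} GiB")  # 10.0 GiB for a single 32k-token context
```

That's roughly 320 KB per token, so a long conversation lands squarely in the multi-gigabyte range, and models without grouped-query attention multiply it further.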

At that scale the bottleneck shifts away from IOPS entirely. You're trying to saturate network throughput and move large objects as efficiently as possible. Your network card becomes the constraint, and the choice between RAM, SSD, and remote storage starts to matter far less than most people assume going in. The muscle memory from traditional cache tuning points you toward the wrong optimizations.

We’re optimizing for the wrong variable

A GPU can’t do anything with a multi-GB KV cache object until the entire object has arrived.

This means time to last byte (TTLB) is a metric worth optimizing for, and many teams aren't tracking it.

Instead, the focus tends to land on storage tier selection because that's the familiar lever. RAM is fast, SSD is slower, remote storage is slower still. Pick the fastest tier you can afford and call it solved. But when you're moving objects measured in gigabytes over a network, the read latency of your storage medium gets swamped by transfer time. A faster storage tier shaves microseconds off the read. The network transfer takes milliseconds regardless.
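A rough sketch of that arithmetic makes the point. The tier latencies and link speed below are assumed round numbers, not benchmarks; what matters is the ratio between first-byte latency and wire time:

```python
def transfer_ms(object_bytes, link_gbps):
    """Wire time to move an object over a link, ignoring protocol overhead."""
    return object_bytes * 8 / (link_gbps * 1e9) * 1000

obj = 10 * 2**30  # a 10 GiB KV cache object (assumed size)
wire = transfer_ms(obj, link_gbps=100)  # ~859 ms on a 100 Gbps NIC

# Assumed, illustrative first-byte latencies per tier:
for tier, read_latency_ms in [("RAM", 0.001), ("NVMe SSD", 0.1), ("object store", 20)]:
    print(f"{tier:12s}  first byte {read_latency_ms:>7.3f} ms   TTLB ~{read_latency_ms + wire:.0f} ms")
```

Under these assumptions every tier lands within a few percent of the same time to last byte, because the transfer dominates. The tier changes the first term; the network sets the second, and the second is orders of magnitude larger.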

This is why remote storage options like object stores become genuinely viable for KV cache even though they'd be disqualifying for traditional low-latency cache workloads. Throughput-bound is throughput-bound. Once you accept that, the storage debate largely resolves itself.

What doesn't resolve itself is prefetch timing. If you can predict which KV cache objects a GPU will need next and begin loading them before that GPU is free, the transfer time effectively disappears from the critical path. By the time the GPU finishes its current request, the context is already sitting there ready. A hundred milliseconds of load time is invisible to a human but devastatingly expensive to a GPU billing by the hour. Compress that window through intelligent prefetching and you've reclaimed utilization that no storage tier selection would have recovered.
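A minimal sketch of the overlap, with sleeps standing in for transfer and decode times (all durations and function names are assumed, not a real serving stack): while the GPU works on the current request, an I/O worker fetches the predicted next context so it's resident before the GPU goes idle.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def load_kv_cache(request_id):
    """Stand-in for a network transfer of a multi-GB KV cache object."""
    time.sleep(0.1)  # ~100 ms transfer (assumed)
    return f"context-{request_id}"

def gpu_compute(context):
    """Stand-in for the GPU serving a request with its context resident."""
    time.sleep(0.2)  # decode time (assumed)
    return f"done({context})"

def serve(requests, predict_next):
    """Overlap the next context load with the current request's compute."""
    with ThreadPoolExecutor(max_workers=1) as io:
        pending = io.submit(load_kv_cache, requests[0])
        for i, _ in enumerate(requests):
            ctx = pending.result()  # already resident if the prediction held
            nxt = predict_next(i)
            if nxt is not None:
                pending = io.submit(load_kv_cache, nxt)  # prefetch during compute
            yield gpu_compute(ctx)

reqs = ["a", "b", "c"]
results = list(serve(reqs, lambda i: reqs[i + 1] if i + 1 < len(reqs) else None))
```

Serially this would cost load plus compute per request; with the overlap, only the first load sits on the critical path. The hard part, which this sketch waves away in `predict_next`, is the prediction itself.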

GPU utilization is a prefetch problem

Building infrastructure that operates at sub-100 microsecond latencies means going below the application layer into OS tuning, network stack configuration, and hardware selection. One thing that process teaches quickly is that the bottleneck is almost never where you expect it to be. You optimize the obvious thing, measure, and discover the constraint has moved somewhere you weren't watching.

KV cache has the same character. The industry landed on storage tiering as the primary lever because it's visible and tractable. Pick better hardware, spend more, go faster. That logic works up to a point and then stops working, because the actual constraint was never the storage medium.

Getting GPU utilization right comes down to knowing which context to load and having it waiting before the GPU ever goes idle. Storage tier is an input to that problem, not the answer to it.

At Momento, we've spent years solving latency problems that live below the application layer. KV cache for LLMs is the next one. If you're working on GPU infrastructure and want to compare notes, we'd like to hear from you.