
What Hyperscale Caching Taught Us About GPU Utilization

Lessons from ultra-low-latency systems are reshaping LLM inference.

There's a quiet revolution happening at the intersection of two worlds that don't often talk to each other: high-performance caching systems and large language model inference.

At Momento, we've built the world's fastest hyperscale cache, engineered to respond in under 100 microseconds. Now, we're translating decades of experience with low-latency distributed systems to address one of AI's most pressing infrastructure challenges — keeping GPUs busy and bills from spiraling out of control.

Caching's Hidden Superpower

Before diving into GPUs, it's worth appreciating what caching actually does at a systems level. Good caching improves database utilization, helps users get answers faster, and reduces the total number of database servers you need to run. It's good for the user experience and the balance sheet.

But when you move from databases to AI inference, that same principle becomes dramatically more consequential. In the world of inference, utilization determines whether you have enough money, GPUs, and power to operate your product at all.

The KV Cache Problem

Here's where things get technically interesting. When a large language model processes your prompt and its surrounding context, it doesn't just read the text — it transforms it. The model encodes that input into tensors known as the KV (key-value) cache. A prompt that might be less than a megabyte of plain text explodes into gigabytes of computed tensor data. That computation takes real GPU cycles.
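To make that blow-up concrete, here's a back-of-envelope sketch. The model shape below (80 layers, 8 KV heads, head dimension 128, fp16 weights) is an illustrative assumption roughly in the shape of a large open-weight model, not a description of any particular deployment:

```rust
/// Estimate KV cache size in bytes for a transformer using grouped-query
/// attention. Parameters here are illustrative, not tied to a specific model.
fn kv_cache_bytes(layers: u64, kv_heads: u64, head_dim: u64, seq_len: u64, bytes_per_elem: u64) -> u64 {
    // Two tensors (keys and values) per layer, per KV head, per token.
    2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem
}

fn main() {
    // 80 layers, 8 KV heads, head dim 128, fp16 (2 bytes), 32k-token context.
    let bytes = kv_cache_bytes(80, 8, 128, 32_768, 2);
    // A ~100 KB prompt of text turns into roughly 10 GiB of tensors.
    println!("{:.1} GiB", bytes as f64 / (1u64 << 30) as f64); // prints "10.0 GiB"
}
```

At that shape, every token costs about 320 KiB of KV cache, which is why long contexts dominate GPU memory so quickly.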

Now consider what happens in a multi-turn conversation. You ask a question, get an answer, and come back with a follow-up. From the GPU's perspective, unless your new request lands on the exact same GPU that handled your previous message — and unless that GPU still has your KV cache sitting in its local memory — it has to recompute all of those tensors from scratch. Every time. It's the equivalent of a database that forgets every query the moment you look away.

If the node has that precomputed KV cache in local DRAM or on NVMe, retrieving it is dramatically cheaper than regenerating it on the GPU. So far, so good. But this optimization introduces a new problem.
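A rough way to frame that trade-off: fetching wins whenever moving the bytes is faster than redoing the prefill. A minimal sketch, where the bandwidth and throughput figures are illustrative assumptions rather than measurements:

```rust
/// Back-of-envelope check: is fetching a precomputed KV cache faster than
/// recomputing it via prefill? All inputs are illustrative assumptions.
fn fetch_is_cheaper(
    cache_bytes: f64,            // size of the KV cache to move
    link_bytes_per_sec: f64,     // effective node-to-node bandwidth
    prompt_tokens: f64,          // tokens that would need re-prefilling
    prefill_tokens_per_sec: f64, // GPU prefill throughput
) -> bool {
    (cache_bytes / link_bytes_per_sec) < (prompt_tokens / prefill_tokens_per_sec)
}

fn main() {
    // 10 GiB cache over a ~12.5 GB/s effective link (100 Gb/s), versus
    // re-prefilling 32k tokens at 5k tokens/s: ~0.9 s to fetch, ~6.6 s to recompute.
    let fetch_wins = fetch_is_cheaper(10.0 * (1u64 << 30) as f64, 12.5e9, 32_768.0, 5_000.0);
    println!("fetch wins: {fetch_wins}"); // prints "fetch wins: true"
}
```

The crossover point shifts with link speed and prefill throughput, which is why the transfer path has to be genuinely fast for this to pay off at all.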

The Load Imbalance Trap

Forcing every request from the same user to route to the same GPU — the naive solution to keeping KV caches accessible — creates serious load imbalance. Some GPUs get hammered while others sit idle. This is the same problem that plagued early web server architectures, and it has the same symptoms: degraded performance, higher time-to-first-token, and unpredictable user experience.

The instinct is to do load leveling — redirect some requests to less-loaded GPUs. But here's the catch: if your inference fleet is already running hot trying to keep up with incoming requests, asking a new GPU to recompute those tensors from scratch can push the system into congested failure. You've solved the load problem by creating a compute problem.

This is the loop that global KV caching is designed to break.

Global KV Caching: The Missing Piece

The insight is straightforward but the engineering is not: instead of recomputing KV caches on a new GPU, give that GPU the ability to fetch the precomputed tensors from wherever they already exist in the fleet — quickly enough that it's worth doing.

This is where Momento's experience with sub-100-microsecond caching becomes directly applicable. The same three principles that power high-performance distributed caches turn out to be exactly what inference fleets need:

1. Smart routing. Route each request to the best available node based on real-time signals — not just load, but also data locality. Where does the relevant KV cache already live?

2. Intelligent placement. Put data where it's most likely to be needed. This requires an intimate understanding of how your specific workloads behave and how data moves through the system.

3. Fast data movement. None of this works if moving a KV cache between nodes takes too long. The transfer has to be fast enough to be worth doing — faster than recomputing from scratch. At gigabytes per KV cache, that's a real engineering challenge.
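One way these principles can meet in code is a placement-aware scheduler: score each node by its current load plus a penalty when the session's KV cache would have to be refetched or recomputed. A minimal sketch with hypothetical names and an untuned penalty weight:

```rust
#[derive(Clone)]
struct Node {
    queue_depth: u32,   // in-flight requests, a crude proxy for load
    has_kv_cache: bool, // does this node already hold the session's KV cache?
}

/// Pick the node with the lowest effective cost, trading load against
/// KV-cache locality. `miss_penalty` stands in for the extra prefill or
/// transfer work on a cache miss; its weight here is illustrative.
fn pick_node(nodes: &[Node], miss_penalty: u32) -> usize {
    nodes
        .iter()
        .enumerate()
        .min_by_key(|(_, n)| n.queue_depth + if n.has_kv_cache { 0 } else { miss_penalty })
        .map(|(i, _)| i)
        .expect("fleet must be non-empty")
}

fn main() {
    let nodes = vec![
        Node { queue_depth: 3, has_kv_cache: true },  // warm but busier
        Node { queue_depth: 1, has_kv_cache: false }, // idle but cold
    ];
    // With a high miss penalty, locality wins; with a low one, load wins.
    println!("{} {}", pick_node(&nodes, 5), pick_node(&nodes, 1)); // prints "0 1"
}
```

In a real fleet the penalty would itself be derived from measured transfer and prefill costs, but even this toy version shows how routing and placement stop being separate decisions.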

In traditional caching, these principles eliminate database bottlenecks. In inference, they eliminate idle GPUs.

Rust, Benchmarks, and Early Results

To rigorously validate these ideas, we built new benchmarking tools in Rust, a natural choice for low-latency, high-performance systems. The early numbers from our experiments are promising. GPUs are staying busier. Time-to-first-token (TTFT) is down by more than 50%.

The broader implications for the industry are significant. So far, AI infrastructure investment has focused largely on bigger, faster chips and on specialized, expensive networking hardware. Meanwhile, the data movement layer that orchestrates how KV caches get placed, routed, and transferred across a fleet remains underexplored. That's the gap this work is targeting.

Why This Matters

GPU time is among the most expensive compute on the planet right now. Any improvement in utilization has outsized financial and environmental impact at scale. Fortunately, the proven expertise to close that gap already exists in the form of classic distributed systems engineering.

The 100-microsecond world of caching and the multi-second world of LLM inference look very different on the surface. But underneath, they're solving the same problem: get the right data to the right place at the right time, fast enough to matter.

We've spent decades obsessing over cache placement, routing logic, and sub-millisecond transfer speeds. Now, we're excited to bring this expertise to inference!