
GPUs are the most expensive resource in tech. We’re using them badly.

GPUs cost $2–4 per hour and AI fleets run hundreds of them. With sticky session routing, you're probably wasting half of them.

Every time you send a message to an AI assistant, somewhere a GPU wakes up and gets to work.

GPUs weren't built for this. They were designed to render video game frames — massively parallel math machines built to push millions of pixels simultaneously. But the matrix multiplication at the heart of graphics is the same math that powers neural networks. The most important piece of hardware in the AI era is essentially a repurposed graphics card.

A very, very expensive repurposed graphics card.

A single H100, NVIDIA's current workhorse for AI inference, runs $2–4 per hour in the cloud. Serious deployments run hundreds of them. For the largest AI providers, GPU infrastructure is the number on the income statement that keeps the CFO up at night.

So you'd expect we'd be pretty good at using them efficiently.

We're not.

What happens inside a GPU when you chat with AI

When you open a conversation and type your first message, the GPU doesn't just process those words in isolation. It processes them in context, meaning it understands each word in relation to every other word. That's what makes modern AI feel coherent rather than like a fancy autocomplete.

The architecture behind every major LLM does this through an attention mechanism. For every token in your conversation, the model computes a set of numbers called Keys and Values. These capture the meaning and relationships of that token within the full context. When generating the next word, the model consults all of them.

This is expensive to compute. The longer your conversation gets, the more expensive it becomes.

So the serving infrastructure does the sensible thing: it saves those Key and Value matrices rather than recomputing them on every message. This saved state is called the KV cache. It lives in the GPU's high-bandwidth memory (HBM).
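The mechanics are easier to see in code. Here's a toy single-head attention loop in NumPy (dimensions, weights, and the six-token "conversation" are all illustrative, not any real model or serving stack): each new token computes only its own Key and Value row, appends it to the cache, and reuses everything already there.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy head dimension

# Toy projection weights (a real model learns these per layer and head).
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

def attend(q, K, V):
    """One query attending over all cached Keys/Values (softmax-weighted sum)."""
    scores = K @ q / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

K_cache = np.empty((0, d))  # the "KV cache": grows one row per token
V_cache = np.empty((0, d))

conversation = rng.standard_normal((6, d))  # six token embeddings
outputs = []
for x in conversation:
    # Only the new token's Key/Value are computed; earlier rows are reused.
    K_cache = np.vstack([K_cache, (Wk @ x)[None, :]])
    V_cache = np.vstack([V_cache, (Wv @ x)[None, :]])
    outputs.append(attend(Wq @ x, K_cache, V_cache))

# Recomputing all Keys/Values from scratch gives the same answer --
# the cache only saves work, it doesn't change the result.
K_full = conversation @ Wk.T
V_full = conversation @ Wv.T
assert np.allclose(outputs[-1], attend(Wq @ conversation[-1], K_full, V_full))
```

The final assertion is the whole point: dropping the cache doesn't change the output, it just forces the system to pay the prefill cost again for every token already seen.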

When you send your next message, the system picks up where it left off. Fast, efficient, no redundant work. 

Until it isn't.


Why GPU utilization is uneven across a fleet

The KV cache created for your conversation lives on a specific GPU. Not in a shared pool or in a database. On one particular piece of hardware in one particular datacenter.

When you send your next message, the infrastructure routes it to that exact GPU. Otherwise it has to recompute everything from scratch, burning time and money to rebuild context it already had.

This is called sticky session routing. Route returning sessions to the GPU that already has their context cached. Avoid redundant computation. It sounds completely reasonable.
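In its simplest form, sticky routing is just a deterministic hash from session to GPU. A toy sketch (the fleet size and hashing scheme are illustrative; production routers are more elaborate, but the failure mode is the same):

```python
import hashlib

NUM_GPUS = 8  # illustrative fleet size

def route(session_id: str) -> int:
    """Sticky routing: the same session always lands on the same GPU,
    so its KV cache is found where it was left."""
    digest = hashlib.sha256(session_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_GPUS

# A returning user hits the same GPU every time...
assert route("user-42") == route("user-42")

# ...but nothing here balances load: sessions land wherever the hash says,
# regardless of how much cache or traffic each GPU already carries.
```

Notice what's missing: the router never looks at how busy a GPU is. Placement is decided by cache location, not by capacity.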

Now consider what this looks like across a fleet of hundreds of GPUs serving millions of users.

Some users have long, active conversations. Their GPU is loaded with gigabytes of KV cache, processing frequent requests, and running hot. Other GPUs are holding cached context for users who started a conversation at 9am and haven't returned. Those GPUs are warm but idle, memory occupied, unavailable for new work.

Human behavior is fundamentally unpredictable. A conversation can go quiet for an hour and explode back into activity. A single deep technical thread can consume 10GB or more of GPU memory, committing nearly half a GPU’s capacity to one user’s conversation history.
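That 10GB figure is easy to sanity-check with back-of-envelope math. Assuming a 70B-class model shape with grouped-query attention (80 layers, 8 KV heads, head dimension 128, fp16 values — all illustrative; real deployments vary):

```python
# Per-token KV-cache size: one Key and one Value vector per layer per KV head.
layers, kv_heads, head_dim = 80, 8, 128  # assumed 70B-class model shape
bytes_per_value = 2                      # fp16

kv_per_token = 2 * layers * kv_heads * head_dim * bytes_per_value  # K and V
print(kv_per_token)                      # 327,680 bytes = 320 KiB per token

tokens = 32_000                          # one long technical conversation
print(kv_per_token * tokens / 1e9)       # ~10.5 GB of HBM for one session
```

At roughly a third of a megabyte per token, a 32K-token thread really does pin down around 10GB of the most expensive memory in the datacenter.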

The problem compounds further when you consider how AI is actually deployed in enterprise. Many users ask questions about the same large document, knowledge base, or dataset. Every one of those users forces the GPU to reprocess the same source material from scratch. The context isn’t changing between users, but the computation is run again anyway.
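One well-known mitigation here is prefix caching: keying cached KV state by the content of the shared prompt prefix rather than by session. A toy sketch of the idea (all names invented; real systems like vLLM cache actual KV tensors in fixed-size blocks rather than strings):

```python
import hashlib

prefix_cache = {}   # hash of a shared prefix -> (stand-in for) KV tensors
prefills = 0        # counts how many times we pay the expensive prefill

def encode(tokens):
    """Stand-in for the expensive prefill that builds KV state."""
    global prefills
    prefills += 1
    return f"kv-for-{len(tokens)}-tokens"

def prefill(shared_doc, user_question):
    """Reuse KV state for a prefix many users share; compute only the rest."""
    key = hashlib.sha256(" ".join(shared_doc).encode()).hexdigest()
    if key not in prefix_cache:
        prefix_cache[key] = encode(shared_doc)   # computed once, ever
    doc_kv = prefix_cache[key]
    question_kv = encode(user_question)          # per-user part only
    return doc_kv, question_kv

doc = ["the", "quarterly", "report"] * 1000      # same document for everyone
for q in (["summarize", "it"], ["key", "risks"], ["revenue"]):
    prefill(doc, q)

# The 3,000-token document was prefilled once, not three times:
# one shared-prefix prefill plus three small per-question prefills.
assert prefills == 4
```

But this only works when the requests that share a prefix land where that prefix is cached, which loops straight back to the routing problem.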

The infrastructure team can't rebalance without either evicting the cache, meaning losing all that expensive computation, or paying full recomputation cost to rebuild it elsewhere. Neither option is good. So the hotspots persist.


This is a harder problem than it looks

Strip away the GPUs and the transformer math and this is something the infrastructure world has seen before. A stateful system where work gets routed based on where data lives, creating hotspots and imbalance, with no clean way to rebalance without either moving large objects or discarding work.

Distributed databases, session stores, and CDNs have all had to solve versions of this. The tooling matured, the patterns became well understood, and eventually the problem got boring in the best possible way.

The KV cache problem isn't there yet. The scale is different: individual cache objects are measured in gigabytes, and they live in memory that costs more per byte than anything else in the datacenter. But the shape of the problem is familiar to anyone who has spent time in the infrastructure space.

The research community is actively working on this, and the infrastructure world is starting to pay attention. Those two things meeting in the middle is usually when hard problems stop being hard. Until then, the hotspots persist.