Inference Is No Longer a Single Execution Model — It's a Routing Problem
Engineering
For three years, every LLM serving system has been built on the same assumption: a person types a prompt, the model streams back tokens, and the system's job is to make that feel fast. vLLM, TGI, SGLang — they're all optimized for one execution model. Low time-to-first-token. Low inter-token latency. One kind of workload, one kind of serving.
That assumption is now wrong.
Agents broke the serving model
The fastest-growing inference workload isn't a human in a chat window. It's an autonomous agent running on a 30-minute heartbeat, working through a multi-step task while its owner sleeps. OpenClaw hit 247,000 GitHub stars in months. Gartner says 40% of enterprise apps will include AI agents by the end of this year. Major cloud providers report that batch and offline inference already accounts for the majority of total serving capacity.
Agent workloads are structurally different from chat:
Most tokens don't need low latency. An agent heartbeat doesn't care about time-to-first-token. It cares about throughput and cost. But when the human checks in, that same session needs to feel responsive. One session, two completely different requirements — separated by time, not by type.
Contexts are large and growing. An agent that starts at 2K tokens may reach 64K over a day of accumulated work. A hundred concurrent agents at 32K context each generate 100 GB of KV cache for a 70B model. That doesn't fit on one GPU.
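The exact KV footprint depends heavily on layer count, KV-head count, and cache precision, but the scaling is easy to sketch. Assuming a Llama-style 70B with 80 layers, 8 KV heads (GQA), head dimension 128, and an fp16 cache (all assumptions for illustration, not a statement about any particular deployment):

```python
def kv_cache_bytes(tokens: int, layers: int = 80, kv_heads: int = 8,
                   head_dim: int = 128, dtype_bytes: int = 2) -> int:
    """One K and one V vector per layer, per KV head, per cached token."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes * tokens

per_token = kv_cache_bytes(1)            # 320 KiB per token under these assumptions
per_agent = kv_cache_bytes(32 * 1024)    # ~10 GiB at 32K context
print(per_token, per_agent / 2**30)
```

Under these assumptions a single 32K-context agent holds about 10 GiB of cache, so the aggregate across a fleet of agents scales linearly into the hundreds of gigabytes and beyond. Compressed-cache designs shrink the constant, but not the linear growth, and either way the total quickly exceeds a single GPU's memory.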
The volume is always on. This isn't bursty traffic. It's sustained, 24/7 inference — and most of it is running on expensive GPU hardware that is wildly overprovisioned for what the workload actually needs.
Pool isolation is a dead end
The industry's current answer is pool isolation: one cluster for interactive requests, another for batch. Two copies of the same model, on the same hardware, running the same engine, at two different price points.
This breaks in predictable ways. The interactive pool is oversized for its actual load. The batch pool can't handle an urgent request. Sessions can't move between pools. And the server CPUs sitting next to those GPUs — high-end Xeons and EPYCs running at 1–5% utilization — are completely wasted.
The problem isn't scheduling. The problem is that every serving system runs one execution model. The same algorithm, on the same hardware, for every session. vLLM, SGLang, and TGI cannot express the idea that different sessions should be served with different strategies. Existing schedulers can reorder requests. They cannot change how inference is executed.
The shift the industry needs is from monolithic serving to heterogeneous execution. Not a better scheduler — a new abstraction layer.
Execution strategy is the new primitive
Think about what happened to compute scheduling. Before Kubernetes, you deployed one application on one server. Kubernetes introduced the abstraction of scheduling workloads across heterogeneous resources based on their requirements. It didn't make servers faster — it made the mapping between workloads and resources intelligent.
Inference needs the same shift. Not faster kernels on the same architecture, but an abstraction that routes sessions to fundamentally different execution strategies based on what each session actually needs.
We've built this. We call it Weave.
Weave is not a scheduling layer on top of existing inference engines. It's a new serving model where execution strategy is first-class. When a session arrives, Weave doesn't just decide where to run it — it decides how. Which parallelism scheme. Which hardware class. How computation is distributed across devices. These aren't different queues backed by the same code. They're structurally different strategies with different performance profiles — selected per-session, in microseconds, based on SLA requirements and context.
A latency-sensitive interactive session runs standard single-GPU decode — fast, familiar, optimized for responsiveness. A throughput-tolerant background agent session runs on a distributed ring of CPU nodes using a completely different execution path. Same model, same weights, identical outputs. Different execution strategy, matched to what the workload requires.
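As an illustration of what per-session strategy selection looks like in code (the types and policy below are a hypothetical sketch, not Weave's actual API):

```python
from dataclasses import dataclass
from enum import Enum, auto

class Strategy(Enum):
    GPU_SINGLE_DECODE = auto()   # latency-optimized: standard single-GPU decode
    CPU_RING_DECODE = auto()     # throughput-optimized: distributed CPU ring

@dataclass
class Session:
    context_tokens: int
    interactive: bool            # is a human currently watching this session?

def choose_strategy(s: Session) -> Strategy:
    """Hypothetical routing policy: SLA class decides the execution strategy."""
    if s.interactive:
        return Strategy.GPU_SINGLE_DECODE   # minimize TTFT and inter-token latency
    return Strategy.CPU_RING_DECODE         # background work: maximize tokens/dollar

# A background agent heartbeat routes to cheap throughput capacity...
agent = Session(context_tokens=64_000, interactive=False)
print(choose_strategy(agent).name)
# ...and the same session can be re-routed to GPU decode when its owner checks in.
```

The point of the sketch is that the routing decision is per-session and time-varying, not a static queue assignment: flipping `interactive` moves the same session, with the same weights and identical outputs, onto a structurally different execution path.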
The industry already accepted that prefill and decode should be disaggregated because they have different compute profiles. Weave extends this to a second axis: sessions with different service-level requirements should run on different strategies. This is disaggregation applied to a new dimension.
Technical depth: Our Weave whitepaper describes the execution strategy taxonomy, the routing model, and how strategies map to hardware. Read the Weave whitepaper →
Distributed CPU decode: the missing execution strategy
One of the execution strategies Weave enables is distributed decode on CPU nodes — the idle server CPUs already sitting alongside your GPUs. This is the strategy that fills the biggest gap in today's serving landscape: what do you do with throughput-tolerant workloads whose aggregate KV cache exceeds single-node memory?
The key building block is a variant of ring attention where KV caches stay stationary and queries rotate — an approach recently demonstrated for GPU inference by Yang et al. at Meta. We've extended this idea to a different hardware regime entirely: server CPUs on commodity 10 Gbps networks.
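The correctness of the stationary-KV scheme rests on a standard fact about softmax attention: per-shard partial results can be merged exactly using a log-sum-exp correction. A toy single-head sketch (node layout and names are illustrative; real implementations pipeline this per head group over large shards):

```python
import math

def attend_shard(q, keys, values):
    """Single-head attention over one node's local KV shard.
    Returns the shard-local output and its log-sum-exp, so partials
    from different shards can later be merged exactly."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) / z for d in range(dim)]
    return out, m + math.log(z)

def merge_partials(partials):
    """Exactly combine the (out, lse) pairs from every shard the query visited."""
    m = max(lse for _, lse in partials)
    coeffs = [math.exp(lse - m) for _, lse in partials]
    z = sum(coeffs)
    dim = len(partials[0][0])
    return [sum(c * out[d] for (out, _), c in zip(partials, coeffs)) / z
            for d in range(dim)]

# Two "nodes", each holding a stationary KV shard; the query hops the ring.
node_shards = [
    ([[1.0, 0.0]], [[1.0, 2.0]]),   # node 0: (keys, values)
    ([[0.0, 1.0]], [[3.0, 4.0]]),   # node 1
]
q = [1.0, 1.0]
partials = [attend_shard(q, ks, vs) for ks, vs in node_shards]
print(merge_partials(partials))     # identical to attention over the full cache
```

Because only the small query (and a running partial) travels between nodes while the large KV shards stay put, the per-hop network traffic is tiny, which is what makes the commodity-network regime plausible.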
Why does this work? During decode, the attention computation under GQA has high arithmetic intensity — it's compute-bound on CPU, not bandwidth-bound. That means we can pipeline head-by-head and hide the ring communication behind the computation. The math works out: on Xeon/EPYC hardware with 10 Gbps networking, there's a 5–6× margin between compute time and transfer time per head group. The ring communication effectively disappears into the computation.
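The 5–6× margin quoted above is the authors' figure for their specific hardware; as a sanity check, the overlap argument can be modeled with back-of-envelope numbers (1 TFLOP/s of CPU compute, 200 GB/s DRAM bandwidth, a 10 Gbps link, all assumptions for illustration):

```python
def decode_attn_times(shard_tokens: int, q_per_kv: int = 8, head_dim: int = 128,
                      cpu_flops: float = 1e12, mem_bw: float = 2e11,
                      link_bw: float = 10e9 / 8):
    """Timing model for one GQA head group on one ring node per decode step.
    All hardware numbers are assumptions, not measurements."""
    kv_bytes = 4 * shard_tokens * head_dim            # fp16 K and V for this group
    flops = 4 * q_per_kv * shard_tokens * head_dim    # QK^T and AV, 2 FLOPs each
    # Arithmetic intensity = flops / kv_bytes = q_per_kv FLOPs per byte (8 here),
    # above this machine's ~5 FLOPs/byte balance, hence compute-bound under GQA.
    compute_s = max(flops / cpu_flops, kv_bytes / mem_bw)
    wire_bytes = 2 * q_per_kv * head_dim * 2          # rotating fp16 query + partial
    transfer_s = wire_bytes / link_bw
    return compute_s, transfer_s

c, t = decode_attn_times(32 * 1024)
print(f"compute {c * 1e6:.0f} us vs ring hop {t * 1e6:.1f} us per head group")
```

Even under these rough assumptions the per-hop transfer is a small fraction of the per-group compute, which is the precondition for hiding the ring communication entirely behind head-by-head pipelining.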
The result is a decode strategy that turns idle CPUs into productive throughput capacity for background agent workloads — at near-zero marginal cost. We'll publish a detailed technical paper on our CPU-optimized distributed decode design in the coming weeks.
What this unlocks
Idle CPUs become productive. Those Xeons and EPYCs at 1–5% utilization on your GPU servers can run background agent decode at near-zero marginal cost. This is capacity you're already paying for.
GPUs focus on what matters. Interactive sessions get dedicated GPU capacity without competing against background work. Utilization goes up because you stop reserving GPU headroom for traffic that doesn't need it.
Large contexts become routine. Distributed decode spreads KV across nodes. A 256K-token context that won't fit on a single GPU is handled by adding ring nodes. The handoff from prefill is faster, not slower, because the scatter is parallelized.
Hardware diversity becomes an asset. Different GPU generations, CPU-only nodes, mixed configurations — Weave routes to what's available. You stop waiting for new GPU shipments to scale your agent workloads.
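To make the "add ring nodes" point concrete, here is a sizing sketch under the same Llama-70B-style assumptions used earlier (layer and head counts, and the per-node memory budget, are illustrative):

```python
import math

def ring_nodes_needed(context_tokens: int, node_kv_budget_bytes: int,
                      layers: int = 80, kv_heads: int = 8,
                      head_dim: int = 128, dtype_bytes: int = 2) -> int:
    """Ring nodes required for one context, with KV sharded evenly across nodes
    (Llama-70B-style model dimensions assumed for illustration)."""
    per_token = 2 * layers * kv_heads * head_dim * dtype_bytes
    return math.ceil(per_token * context_tokens / node_kv_budget_bytes)

# e.g. a 256K-token context (~80 GiB of fp16 KV under these assumptions)
# with ~64 GiB of spare DRAM per CPU node:
print(ring_nodes_needed(256 * 1024, 64 * 2**30))
```

Capacity scales by changing one integer: a context that outgrows its current shard layout gets more ring nodes, rather than a bigger GPU.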
Where this is going
The age of "every token needs low latency" is ending. The workloads that will dominate inference over the next two years — autonomous agents, background processing, always-on multi-step workflows — are structurally different from the chatbot traffic that today's serving systems were designed for.
The systems that serve this era will not be faster versions of today's monolithic engines. They will be heterogeneous execution platforms that route workloads to strategies the way Kubernetes routes containers to nodes — based on what each workload needs, across whatever hardware is available.
This is how inference systems will be built going forward. Weave is our first step toward that architecture — with more execution strategies, including our CPU-optimized distributed decode, coming soon.
Ready to Get Started?
OpenInfer is now available! Sign up today to gain access and experience these performance gains for yourself.

