NVIDIA shipped Dynamo, and it’s built on the same idea as OpenInfer: inference isn’t just a runtime anymore, it’s a control plane. Disaggregated prefill and decode, cache-aware routing, planning across a fleet. That’s the problem now, and the biggest name in compute building for it says the category is real.
Dynamo makes a bet, though, about the fleet underneath it: NVIDIA GPUs in a cluster you can grow, a uniform pool in the cloud or a datacenter with room to spare. That’s a real and common setup, and on it Dynamo is excellent. It’s just not the infrastructure a lot of enterprises actually have. Theirs is a mix of CPUs, GPUs, and NPUs, spread across private datacenters, factory floors, and air-gapped rooms, and a lot of it sits idle. OpenInfer is built for that one. It comes in two parts: the OpenInfer runtime, which runs inference on your hardware, and Weave, the control plane that runs the fleet.
The runtime: built for the hardware you already own
The OpenInfer runtime runs inference across whatever you’ve got (CPU, GPU, NPU) and routes each workload to the hardware that serves it best at that moment. No new hardware to buy, no cloud to depend on, and your data never leaves your infrastructure. It drops in behind what you already run. Call it from a framework like LangChain, or put it where a runtime like Ollama or vLLM sits today. You change an endpoint, not your code. No rewrites.
In practice that means putting hardware to work that usually sits idle: spare CPUs, a mix of accelerators, even an air-gapped room with no cloud account in it.
Dynamo is optimized for NVIDIA GPUs and deploys as a Kubernetes framework you stand up and run. On a uniform NVIDIA GPU cluster, that’s a clean fit. On a mixed fleet that’s already in the building, getting full use out of every box is harder.
Put your idle compute to work
Most enterprises sit on more compute than they use. The gap isn’t hardware, it’s the software to use it.
When a model needs more room and you have capacity, Weave can scale it out on Kubernetes like any other workload, the same thing Dynamo does. The difference shows up when the fleet is full. To feed a busy model from an idle one, Dynamo shifts whole workers: it drains the idle model’s replicas to free the GPUs, then starts the busy model’s on them. That works, but every shift is a teardown and a cold start.
The OpenInfer runtime doesn’t move workers, it rebalances compute. Each node keeps several models resident at once and shifts the GPU between them on demand, so a busy model draws more while an idle one eases back. Nothing is torn down, nothing cold-starts, and nothing is unloaded, so there’s no hit to whatever is still serving. Weave drives this across the whole fleet: capacity follows the models that need it, the moment they need it, and idle metal turns into capacity without a restart.
A control plane that runs on your terms
Weave is light on purpose. There’s no central cache index and no streaming event plane to keep in sync across the fleet. That’s exactly what lets it run over heterogeneous nodes you don’t centrally manage, in rooms with no cloud and no outside connection at all.
And it’s more than routing. Weave is the operating surface for your inference. It’s multi-tenant, so teams and orgs share one fleet with their own models and policies, and it puts model and fleet management in one place. You also get full observability: every request traced, stored, and searchable, all on infrastructure you control and data that never leaves it. Dynamo hands that back to you: Kubernetes manifests, Grafana dashboards you wire up yourself, tenants split by namespace. That’s a serving framework doing its job and leaving the rest to someone else. Weave does that job for you, so the fleet is one place to run, govern, and watch instead of a pile of namespaces and dashboards.
Or run our runtime under Dynamo
The runtime and Weave aren’t welded together. The OpenInfer runtime is a backend, so it runs under Dynamo just as well as under Weave. If you’ve standardized on Dynamo, keep it, and let the OpenInfer runtime broaden the hardware your inference runs on: execution across CPUs, GPUs, and NPUs, each workload routed to the device that serves it best, all under NVIDIA’s control plane.
So it isn’t all or nothing. Run the whole OpenInfer stack, run the runtime under NVIDIA’s control plane, or start with one and add the other later. Adopt it on your schedule, not ours.
Bottom line
If you run a uniform GPU cluster in the cloud and grow it on demand, Dynamo is a strong choice, and a sign the whole category is real. If you run your own mixed hardware (on-prem, air-gapped, multi-tenant, half of it idle), OpenInfer is built for that. Same problem, two honest answers, aimed at different fleets.
We built ours for the fleet you already have. If that’s the one you’re running, we’d like to talk.