Run the small models in parallel on the CPU


Frontier model (GPU): one large model, kept at the matmul. Frontier throughput: 98%.

The small models, in parallel on the CPU: input classifier, router, tool-call augmenter, safety guardrail, conversation steering, draft / speculative, summarizer, reranker.

Every turn passes through a tier of small models. Run them in parallel on the CPU rather than chaining them onto the GPU, or the big model's throughput collapses.
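As a rough illustration of the parallel arrangement, the sketch below fans one turn out to several small models on a CPU thread pool and gathers their results, leaving the GPU free for the large model. Everything in it is an assumption made for illustration: the function names, the toy logic standing in for real inference, and the choice of a thread pool (which only helps if the real models release the GIL in native code; otherwise a process pool would be the closer fit).

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for small models. In a real system each would be a
# CPU inference call; the toy logic here only makes the sketch runnable.
def input_classifier(turn: str) -> str:
    return "question" if turn.rstrip().endswith("?") else "statement"

def router(turn: str) -> str:
    # Toy rule: longer turns go to the frontier model.
    return "frontier" if len(turn.split()) > 3 else "small"

def safety_guardrail(turn: str) -> bool:
    return "forbidden" not in turn.lower()

def summarizer(turn: str) -> str:
    # Toy summary: the first five words.
    return " ".join(turn.split()[:5])

SMALL_MODELS = {
    "input_classifier": input_classifier,
    "router": router,
    "safety_guardrail": safety_guardrail,
    "summarizer": summarizer,
}

def run_small_tier(turn: str, pool: ThreadPoolExecutor) -> dict:
    """Submit every small model at once, then collect the results by name."""
    futures = {name: pool.submit(fn, turn) for name, fn in SMALL_MODELS.items()}
    return {name: future.result() for name, future in futures.items()}

if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=len(SMALL_MODELS)) as cpu_pool:
        results = run_small_tier("Is this a long enough question?", cpu_pool)
        for name, value in results.items():
            print(f"{name}: {value}")
```

Because all the small models are submitted before any result is awaited, the tier's latency is bounded by its slowest member, not by the sum of all of them, and none of the work queues behind the large model on the GPU.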