[Figure: "Chain it all to the GPU" vs. "Small models in parallel on the CPU"]
Frontier model (GPU): one large model, kept at the matmul. Frontier throughput: 98%.
The small models, in parallel on the CPU:
Input classifier
Router
Tool-call augmenter
Safety guardrail
Conversation steering
Draft / speculative
Summarizer
Reranker
Every turn passes through a tier of small models. Run them in parallel on the CPU rather than chaining them to the GPU, or the big model's throughput collapses.
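As a rough illustration of the fan-out described above, the following minimal Python sketch runs a few stand-in small models concurrently in a thread pool. Everything here is an assumption made for illustration: the function names, the toy rules inside them, and the `time.sleep` calls that stand in for CPU inference. It does not model the GPU or the frontier model at all; it only contrasts running the tier concurrently with running it one model after another.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for small models. Each sleep simulates inference latency;
# the rules are toy placeholders, not real classifiers.
def input_classifier(text: str) -> str:
    time.sleep(0.05)
    return "question" if text.rstrip().endswith("?") else "statement"

def safety_guardrail(text: str) -> bool:
    time.sleep(0.05)
    return "forbidden" not in text.lower()

def router(text: str) -> str:
    time.sleep(0.05)
    return "frontier" if len(text.split()) > 3 else "small"

SMALL_MODELS = {
    "input_classifier": input_classifier,
    "safety_guardrail": safety_guardrail,
    "router": router,
}

def run_small_tier(text: str, pool: ThreadPoolExecutor) -> dict:
    """Submit every small model at once, then collect the results by name."""
    futures = {name: pool.submit(fn, text) for name, fn in SMALL_MODELS.items()}
    return {name: future.result() for name, future in futures.items()}

def run_chained(text: str) -> dict:
    """Baseline: run the same models one after another."""
    return {name: fn(text) for name, fn in SMALL_MODELS.items()}

if __name__ == "__main__":
    prompt = "What is the capital of France?"

    start = time.perf_counter()
    chained = run_chained(prompt)
    chained_s = time.perf_counter() - start

    with ThreadPoolExecutor(max_workers=len(SMALL_MODELS)) as pool:
        start = time.perf_counter()
        parallel = run_small_tier(prompt, pool)
        parallel_s = time.perf_counter() - start

    print(parallel)
    print(f"chained: {chained_s:.3f}s, parallel: {parallel_s:.3f}s")
```

Threads overlap here only because `time.sleep` releases Python's global interpreter lock. Real CPU-bound inference would need a runtime that releases the lock during compute, or separate worker processes.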