Small is not weak
A 7-billion-parameter model in 2026 can match or beat what a 70-billion-parameter model could do in 2024 on many everyday tasks. The progress comes mostly from better data and better training, not raw scale. For a lot of product surface area, small is now the right answer.
Where small wins
- Routing. Pick which agent or tool handles a request.
- Classification. Tag, score, and bucket inputs.
- Extraction. Pull structured fields out of messy text.
- Summarisation. Compress a thread or document.
- Rewriting. Tone changes, grammar, light editing.
Production pattern: use a small model to decide what to do, and call a large model only for the part that truly needs reasoning. The cost savings are dramatic, and quality often improves.
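The pattern above can be sketched in a few lines. This is a minimal illustration, not a description of any particular stack: the task labels, the `Route` type, and `small_model_route` are hypothetical names, and the keyword lookup is only a stand-in for a real small-model call so that the sketch runs offline.

```python
import re
from dataclasses import dataclass

# Hypothetical task labels a small model is trusted to handle on its own.
SIMPLE_TASKS = {"classify", "extract", "summarise", "rewrite"}

# Assumption: a keyword table stands in for the small model's routing decision.
KEYWORDS = {
    "summarise": "summarise",
    "summarize": "summarise",
    "extract": "extract",
    "tag": "classify",
    "rewrite": "rewrite",
}


@dataclass
class Route:
    task: str   # label chosen by the router
    model: str  # "small" or "large"


def small_model_route(request: str) -> str:
    """Stand-in for a small-model call that returns a task label."""
    words = re.findall(r"[a-z]+", request.lower())
    for word in words:
        if word in KEYWORDS:
            return KEYWORDS[word]
    # Anything the router does not recognise is treated as needing reasoning.
    return "reason"


def plan(request: str) -> Route:
    """Decide which model tier should handle the request."""
    task = small_model_route(request)
    model = "small" if task in SIMPLE_TASKS else "large"
    return Route(task=task, model=model)


if __name__ == "__main__":
    for req in ("Summarise this thread", "Plan a migration across three services"):
        print(req, "->", plan(req))
```

In a real system the lookup would be replaced by a call to the small model, ideally constrained to emit one of a fixed set of labels, with the large model as the fallback for anything outside that set.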
Where small still loses
- Multi-step reasoning over long context.
- Open-ended writing that needs taste.
- Tool-heavy agents with branching plans.
How we deploy them
For server workloads we run small models on a single GPU with vLLM and batch heavily. For local inference we use Ollama on Mac, llama.cpp on Linux servers, and ONNX Runtime on Windows. For mobile we lean on Phi and Gemma quantised to 4-bit.
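To see why 4-bit quantisation matters for mobile and single-GPU deployment, a back-of-envelope estimate of weight storage helps. The sketch below assumes weights dominate memory and ignores the KV cache, activations, and the per-block scale metadata that real quantisation formats add, so actual footprints will be somewhat higher. The function name is illustrative.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: int) -> float:
    """Approximate weight storage in GB (1 GB = 10^9 bytes).

    Assumption: weights only; runtime overhead is not included.
    """
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


if __name__ == "__main__":
    for bits in (16, 8, 4):
        print(f"7B parameters at {bits}-bit: ~{weight_memory_gb(7, bits):.1f} GB")
```

By this estimate a 7-billion-parameter model needs roughly 14 GB for weights at 16-bit and roughly 3.5 GB at 4-bit, which is the difference between needing a dedicated GPU and fitting alongside other apps on a recent laptop or high-end phone.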
A simple rule
If you can write the task spec on a sticky note, a small model can probably do it. If you cannot, you need a bigger model or a smarter pipeline.
Related reading
For the model shortlist, see Open-Source LLMs That Matter. For the production architecture, see Designing AI Applications That Survive Production.




