Small LLMs Are Eating the Edge

Written by Mubashir
25 April 2026
5 min read
Small is not weak

A 7-billion-parameter model in 2026 can outperform, on many practical tasks, what a 70-billion-parameter model could do in 2024. The progress comes mostly from better data and better training, not raw scale. For a lot of product surface area, small is now the right answer.

Where small wins

  • Routing. Pick which agent or tool handles a request.
  • Classification. Tag, score, and bucket inputs.
  • Extraction. Pull structured fields out of messy text.
  • Summarisation. Compress a thread or document.
  • Rewriting. Tone changes, grammar, light editing.

Production pattern: Use a small model to decide what to do, and only call a large model for the part that truly needs reasoning. The cost savings are dramatic and quality often improves.
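The pattern above can be sketched as a small cascade. This is a minimal illustration, not our production code: the `Cascade` class, the `SIMPLE_LABELS` set, and the stand-in functions are all hypothetical names, and the three callables would be replaced by real model clients.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical task labels that the small model is trusted to handle itself.
SIMPLE_LABELS = {"classify", "extract", "summarise", "rewrite"}


@dataclass
class Cascade:
    route: Callable[[str], str]  # small model: request -> task label
    small: Callable[[str], str]  # small model: answers simple tasks
    large: Callable[[str], str]  # large model: answers everything else

    def handle(self, request: str) -> tuple[str, str]:
        """Return (model_used, answer) for a single request."""
        label = self.route(request)
        if label in SIMPLE_LABELS:
            return "small", self.small(request)
        # Unknown or reasoning-heavy labels escalate to the large model.
        return "large", self.large(request)


# Stand-in router so the sketch runs without any model server (assumption:
# a real router would be a small-model call that returns one label).
def fake_route(request: str) -> str:
    return "summarise" if request.lower().startswith("summarise") else "reason"


cascade = Cascade(
    route=fake_route,
    small=lambda r: f"[small] {r}",
    large=lambda r: f"[large] {r}",
)

print(cascade.handle("Summarise this support thread"))
print(cascade.handle("Plan a migration across three services"))
```

The default path escalates: anything the router does not recognise goes to the large model, so a routing mistake costs money rather than quality.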

Where small still loses

  • Multi-step reasoning over long context.
  • Open-ended writing that needs taste.
  • Tool-heavy agents with branching plans.

How we deploy them

For server workloads we run small models on a single GPU with vLLM and batch heavily. For on-device inference we use Ollama on Mac, llama.cpp on Linux, and ONNX Runtime on Windows. For mobile we lean on Phi and Gemma quantised to 4-bit.
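To see why 4-bit quantisation matters on constrained hardware, here is a back-of-the-envelope estimate of weight memory. It is a rough sketch: the helper name `weight_memory_gb` is ours, the 7B size is just an illustrative figure, and the estimate ignores the KV cache, activations, and quantisation metadata such as scales, so real footprints are somewhat larger.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate memory for model weights alone, in gigabytes (10^9 bytes)."""
    bytes_total = params_billions * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9


# Illustrative 7B model at 16-bit versus 4-bit weights.
fp16_gb = weight_memory_gb(7, 16)
int4_gb = weight_memory_gb(7, 4)

print(f"16-bit: {fp16_gb:.1f} GB, 4-bit: {int4_gb:.1f} GB")
```

Going from 16-bit to 4-bit cuts weight memory by a factor of four, which is roughly the difference between needing a dedicated GPU and fitting in the memory of a laptop or a high-end phone.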

A simple rule

If you can write the task spec on a sticky note, a small model can probably do it. If you cannot, you need a bigger model or a smarter pipeline.

For the model shortlist, see Open-Source LLMs That Matter. For the production architecture, see Designing AI Applications That Survive Production.


Mubashir

DevOps & Cloud Engineer

Runs Kiebot’s CI/CD, Kubernetes, and observability stack. Writes about pragmatic DevOps for small engineering orgs.