Why open weights matter again
Two years ago, "open" meant slow, weak, and fit only for research. In 2026 the top open models clear the bar for most production tasks. They also unlock three things a hosted API cannot fully give you: full data privacy, predictable cost, and the freedom to fine-tune the weights yourself.
The shortlist
| Model family | Where it wins | Where it loses |
|---|---|---|
| Llama 3.x | General-purpose chat, tool use | Reasoning on hard math |
| Mistral and Mixtral | Cost-effective inference, multilingual | Long-context tasks |
| Qwen | Multilingual workloads, especially Chinese | Niche English idioms |
| DeepSeek | Code, reasoning | Open-ended creative writing |
| Phi | Small, on-device tasks | Complex multi-step reasoning |
| Gemma | Embeddable in client apps | Agentic workloads |
When to actually pick open
- Hard privacy. Customer data cannot leave your VPC. Open weights end the conversation.
- Predictable scale. You will serve millions of low-margin requests, and the per-token price of a hosted API does not work.
- Specialised domain. You have proprietary data worth fine-tuning on. Open weights make that possible.
- Edge deployment. You need inference on a phone, a kiosk, or a factory floor without a network.
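The "predictable scale" argument comes down to arithmetic, and a rough break-even sketch makes it concrete. The model below is an assumption made for illustration: every name and input is hypothetical, the prices are placeholders rather than real vendor quotes, and it ignores idle GPU time, so treat the result as an optimistic lower bound on self-hosting cost.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CostInputs:
    # All figures are hypothetical placeholders, not real vendor prices.
    hosted_price_per_million_tokens: float  # USD per 1M tokens on a hosted API
    gpu_cost_per_hour: float                # USD per GPU-hour when self-hosting
    tokens_per_second_per_gpu: float        # sustained throughput of one GPU
    fixed_ops_cost_per_month: float         # the "operations tax": on-call, monitoring


def hosted_monthly_cost(tokens_per_month: float, c: CostInputs) -> float:
    """Hosted cost scales linearly with volume."""
    return tokens_per_month / 1_000_000 * c.hosted_price_per_million_tokens


def self_hosted_monthly_cost(tokens_per_month: float, c: CostInputs) -> float:
    """GPU-hours needed for the volume, plus the fixed operations cost."""
    gpu_hours = tokens_per_month / (c.tokens_per_second_per_gpu * 3600)
    return gpu_hours * c.gpu_cost_per_hour + c.fixed_ops_cost_per_month


def break_even_tokens_per_month(c: CostInputs) -> Optional[float]:
    """Monthly volume above which self-hosting is cheaper, or None if never."""
    hosted_per_token = c.hosted_price_per_million_tokens / 1_000_000
    self_per_token = c.gpu_cost_per_hour / (c.tokens_per_second_per_gpu * 3600)
    if hosted_per_token <= self_per_token:
        return None  # the hosted API is cheaper per token at any volume
    return c.fixed_ops_cost_per_month / (hosted_per_token - self_per_token)


if __name__ == "__main__":
    inputs = CostInputs(
        hosted_price_per_million_tokens=2.0,
        gpu_cost_per_hour=2.5,
        tokens_per_second_per_gpu=1000.0,
        fixed_ops_cost_per_month=5000.0,
    )
    volume = break_even_tokens_per_month(inputs)
    if volume is None:
        print("Hosted is cheaper at every volume under these assumptions.")
    else:
        print(f"Break-even at roughly {volume / 1e9:.2f}B tokens per month.")
```

Plug in your own measured throughput and negotiated prices. If the break-even volume is far above what you expect to serve, the per-token argument for open weights does not apply to you yet.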
Where hosted still wins
For frontier reasoning tasks and for fast iteration during a prototype, a hosted model is still the cheaper, faster choice. Pick open when you have a real reason to pay the operations tax.
Serving them safely
Pick vLLM or TGI for GPU-backed inference at scale. Use Ollama for laptops and lightweight servers. For multi-tenant production, isolate workloads at the GPU level, set strict context-length limits, and rate-limit per tenant. Open weights remove some risks, such as data leaving your network; they add others, because securing and operating the serving stack becomes your job.
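Two of the controls above, context-length limits and per-tenant rate limits, can be enforced in a thin admission layer in front of whichever inference server you run. The following is a minimal sketch under stated assumptions: `TenantGate` and `TokenBucket` are hypothetical names rather than part of vLLM, TGI, or Ollama, the limiter is an in-memory token bucket suitable for a single process only, and GPU-level isolation is out of scope.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class TokenBucket:
    capacity: float        # maximum burst, in requests
    refill_per_sec: float  # sustained rate, in requests per second
    tokens: float          # requests currently available
    last_refill: float     # clock reading at the last refill

    def try_take(self, now: float) -> bool:
        """Refill based on elapsed time, then spend one request if available."""
        elapsed = max(0.0, now - self.last_refill)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.last_refill = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class TenantGate:
    """Hypothetical admission check: context-length cap plus per-tenant rate limit."""

    def __init__(
        self,
        max_context_tokens: int,
        burst: float,
        rate_per_sec: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_context_tokens = max_context_tokens
        self.burst = burst
        self.rate_per_sec = rate_per_sec
        self._clock = clock
        self._buckets: Dict[str, TokenBucket] = {}

    def admit(self, tenant_id: str, prompt_tokens: int, max_new_tokens: int) -> str:
        # Reject oversized requests first so they never consume rate budget.
        if prompt_tokens + max_new_tokens > self.max_context_tokens:
            return "rejected: context too long"
        now = self._clock()
        bucket = self._buckets.get(tenant_id)
        if bucket is None:
            # Each tenant gets its own bucket, so one tenant cannot starve another.
            bucket = TokenBucket(self.burst, self.rate_per_sec, self.burst, now)
            self._buckets[tenant_id] = bucket
        if not bucket.try_take(now):
            return "rejected: rate limit"
        return "admitted"
```

A caller would construct one gate per model endpoint, for example `TenantGate(max_context_tokens=8192, burst=10, rate_per_sec=2.0)`, and forward the request to the inference server only when `admit` returns `"admitted"`. Running several replicas would require moving the bucket state into a shared store.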
Related reading
For the smaller end of the spectrum, see Small LLMs Are Eating the Edge. For the rest of the stack, see The LLM Stack in 2026.




