The stack at a glance
An LLM call is one HTTPS request. An LLM application is everything around that one call. The boring layers around the model, summarized below, are what keep your product alive after launch.
| Layer | Purpose | Common choices in 2026 |
|---|---|---|
| 1. Model | The reasoning engine | Claude Sonnet 4.6, GPT, Gemini, Llama, Mistral |
| 2. Orchestration | Sequence calls, handle tools | LangGraph, custom code |
| 3. Retrieval | Bring private context to the model | pgvector, Pinecone, Weaviate, Qdrant |
| 4. Evals | Measure quality over time | Custom eval harness, Braintrust, LangSmith |
| 5. Observability | Trace and debug every call | Langfuse, OpenTelemetry |
| 6. Guardrails | Safety, policy, cost limits | Structured outputs, regex, classifier models |
Why people pick the wrong model
Teams often pick whichever model scored best on a public benchmark. That model can still have the wrong latency, the wrong cost curve, or the wrong tool-use semantics for the job. A better approach is to define three or four "must pass" scenarios, run them across two or three candidate models, and pick the one that fits your workload's shape rather than the one that tops the leaderboard.
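As a rough illustration of the "must pass" idea, the sketch below runs a few scenarios against several candidates and keeps only those that pass every one within a latency budget. Everything here is hypothetical: `Scenario`, `run_must_pass`, `qualifying`, and the two stub candidates are invented for the example, and a real harness would wrap actual model clients and also track cost per call.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical: a candidate is any function from prompt to reply text.
# In practice each one would wrap a real model client.
Candidate = Callable[[str], str]


@dataclass
class Scenario:
    name: str
    prompt: str
    passes: Callable[[str], bool]  # pass/fail check on the reply text
    max_seconds: float             # latency budget for this scenario


def run_must_pass(
    scenarios: List[Scenario], candidates: Dict[str, Candidate]
) -> Dict[str, Dict[str, bool]]:
    """Return {candidate name: {scenario name: passed}}."""
    results: Dict[str, Dict[str, bool]] = {}
    for cand_name, call in candidates.items():
        row: Dict[str, bool] = {}
        for s in scenarios:
            start = time.perf_counter()
            reply = call(s.prompt)
            elapsed = time.perf_counter() - start
            # A scenario passes only if the reply is acceptable AND on time.
            row[s.name] = s.passes(reply) and elapsed <= s.max_seconds
        results[cand_name] = row
    return results


def qualifying(results: Dict[str, Dict[str, bool]]) -> List[str]:
    """Candidates that passed every must-pass scenario."""
    return [name for name, row in results.items() if all(row.values())]


# Offline stubs so the sketch runs without network access.
def stub_terse(prompt: str) -> str:
    return "42"


def stub_chatty(prompt: str) -> str:
    return "Great question! The answer is probably 42, give or take."


scenarios = [
    Scenario("numeric_only", "What is 6 * 7? Reply with digits only.",
             passes=lambda reply: reply.strip().isdigit(), max_seconds=5.0),
    Scenario("mentions_42", "What is 6 * 7?",
             passes=lambda reply: "42" in reply, max_seconds=5.0),
]

results = run_must_pass(scenarios, {"terse": stub_terse, "chatty": stub_chatty})
print(qualifying(results))  # prints ['terse']
```

The chatty stub gets the arithmetic right but fails the format scenario. A leaderboard score would not surface that kind of mismatch.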
Practical tip: Keep the model choice swappable behind a thin interface. Six months from now there will likely be a better, cheaper option, and you will not want to rewrite the world.
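One possible shape for such a thin interface, sketched with Python's `typing.Protocol`. The names `ChatModel`, `Completion`, `EchoModel`, and `summarize` are assumptions made up for this example, not part of any vendor SDK. A real adapter would call the vendor client inside `complete` and translate the response.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class Completion:
    text: str
    input_tokens: int
    output_tokens: int


class ChatModel(Protocol):
    """The only surface application code is allowed to depend on."""

    def complete(self, system: str, user: str) -> Completion: ...


class EchoModel:
    """Offline stand-in. A real adapter would call a vendor SDK here
    and translate its response into a Completion."""

    def complete(self, system: str, user: str) -> Completion:
        text = f"echo: {user}"
        return Completion(
            text=text,
            # Whitespace word counts stand in for real token counts.
            input_tokens=len(system.split()) + len(user.split()),
            output_tokens=len(text.split()),
        )


def summarize(model: ChatModel, document: str) -> str:
    # Application code sees ChatModel only, never a vendor client.
    return model.complete("Summarize in one sentence.", document).text


print(summarize(EchoModel(), "hello world"))  # prints "echo: hello world"
```

Swapping models then means writing one new adapter class. Call sites such as `summarize` stay unchanged.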
Observability is not optional
If you cannot replay a failed conversation in seconds, you cannot debug your product. Capture the full request, the resolved context, the tool calls, the responses, the costs, and the eval scores. Then make all of that searchable. The teams that ship reliable LLM features all share this discipline.
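To make the capture list concrete, here is a minimal sketch of a trace record and an in-memory store that supports replay by conversation and search by eval score. The `Trace` fields mirror the items named above. `TraceStore`, its methods, and the sample data are hypothetical, and a production setup would more likely write to a database or a tracing backend than to a Python list.

```python
import json
import time
import uuid
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List


@dataclass
class Trace:
    conversation_id: str
    request: Dict[str, Any]           # full request as sent to the model
    resolved_context: List[str]       # retrieved passages after templating
    tool_calls: List[Dict[str, Any]]
    response: str
    cost_usd: float
    eval_scores: Dict[str, float]
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    timestamp: float = field(default_factory=time.time)


class TraceStore:
    """Hypothetical store: one JSON line per trace, held in memory."""

    def __init__(self) -> None:
        self._lines: List[str] = []

    def record(self, trace: Trace) -> None:
        self._lines.append(json.dumps(asdict(trace)))

    def _load(self) -> List[Trace]:
        return [Trace(**json.loads(line)) for line in self._lines]

    def replay(self, conversation_id: str) -> List[Trace]:
        """All traces for one conversation, oldest first."""
        hits = [t for t in self._load() if t.conversation_id == conversation_id]
        return sorted(hits, key=lambda t: t.timestamp)

    def below(self, metric: str, threshold: float) -> List[Trace]:
        """Traces whose score on `metric` is under `threshold`."""
        return [
            t for t in self._load()
            if metric in t.eval_scores and t.eval_scores[metric] < threshold
        ]


store = TraceStore()
store.record(Trace(
    conversation_id="c1",
    request={"user": "What is the refund policy?"},
    resolved_context=["Refunds are accepted within 30 days."],
    tool_calls=[],
    response="You have 30 days to request a refund.",
    cost_usd=0.0021,
    eval_scores={"groundedness": 0.95},
))
store.record(Trace(
    conversation_id="c2",
    request={"user": "Can I return opened items?"},
    resolved_context=[],
    tool_calls=[{"name": "search_policy", "args": {"q": "opened items"}}],
    response="Yes, always.",
    cost_usd=0.0018,
    eval_scores={"groundedness": 0.2},
))

for t in store.below("groundedness", 0.5):
    print(t.conversation_id, t.response)  # prints: c2 Yes, always.
```

Storing each trace as a flat JSON record keeps it easy to load into whatever search tool the team already uses.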
Related reading
For the design pattern, see Designing AI Applications That Survive Production. For the retrieval layer, see RAG in Production.




