The 80/20 of production AI
Almost any team can wire an LLM to a chat box and demo it on a Friday. The hard part starts on Monday, when real users send unexpected inputs, costs balloon, and someone in legal asks where the data goes.
After shipping multiple AI applications into production, I keep returning to the same five-layer architecture. It is not glamorous. It works.
The five layers
| Layer | Purpose |
|---|---|
| 1. Intent | Decide what the user is asking for. Classify, route, refuse if out of scope. |
| 2. Context | Fetch the right private data via RAG, structured queries, or tool calls. |
| 3. Reasoning | Call the LLM with a structured prompt. Constrain the output format strictly. |
| 4. Verify | Check the output: schema valid, no PII, no hallucinated entities, within policy. |
| 5. Act | Return to the user, call a tool, or escalate to a human. Log everything. |
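To make the table concrete, here is a minimal sketch of the five layers as one request handler. Everything in it is an illustrative assumption rather than a prescribed implementation: the names `classify_intent`, `fetch_context`, `verify`, and `handle_request` are hypothetical, the keyword router and canned context stand in for a real classifier and retrieval step, and `llm` is any callable that takes a prompt and returns text. Logging is left out for brevity.

```python
import json
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Result:
    status: str  # "answered", "refused", or "escalated"
    body: str

IN_SCOPE = {"billing", "orders"}  # hypothetical intents for this sketch

def classify_intent(text: str) -> str:
    # Layer 1 (Intent): toy keyword router; a real system might use a classifier.
    lowered = text.lower()
    if "invoice" in lowered or "charge" in lowered:
        return "billing"
    if "order" in lowered:
        return "orders"
    return "out_of_scope"

def fetch_context(intent: str) -> str:
    # Layer 2 (Context): made-up data standing in for RAG, a query, or a tool call.
    return {"billing": "Invoice 17 is paid.", "orders": "Order 42 shipped."}[intent]

def verify(raw: str) -> Optional[dict]:
    # Layer 4 (Verify): accept only a JSON object with a string "answer" field.
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if isinstance(data, dict) and isinstance(data.get("answer"), str):
        return data
    return None

def handle_request(text: str, llm: Callable[[str], str]) -> Result:
    intent = classify_intent(text)
    if intent not in IN_SCOPE:
        return Result("refused", "Sorry, that is outside what I can help with.")
    context = fetch_context(intent)
    prompt = f'Context: {context}\nQuestion: {text}\nReply as JSON: {{"answer": "..."}}'
    raw = llm(prompt)  # Layer 3 (Reasoning): structured prompt, constrained output
    checked = verify(raw)
    if checked is None:
        # Layer 5 (Act): output failed verification, so hand off to a human.
        return Result("escalated", "Routing to a human agent.")
    return Result("answered", checked["answer"])
```

Passing the model in as a plain callable keeps the handler testable with a stub, which is also what makes it easy to run under an eval set.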
Evals are your unit tests
In deterministic software, unit tests are how you change code without fear. In AI applications, evals do the same job. An eval is a small dataset of input/expected-output pairs that you use to grade a new prompt, model, or chain.
Build the eval set before the prompt. Treat it as a permanent artifact. Run it in CI on every PR that touches a prompt, a model, or a chain.
📊 What a good eval set looks like
Cover three categories: happy path, edge cases (empty, ambiguous, adversarial), and refuse cases (out-of-scope, unsafe). Aim for 50–200 examples to start, then grow the set as you ship.
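As a rough sketch of what running such a set might look like, the snippet below scores a system per category. The names `EvalCase` and `run_evals` are hypothetical, and the grader is a case-insensitive exact match, which is an assumption made for brevity; real evals often need fuzzier or model-based grading.

```python
from dataclasses import dataclass
from typing import Callable, Iterable

@dataclass(frozen=True)
class EvalCase:
    category: str  # "happy", "edge", or "refuse"
    prompt: str
    expected: str

def run_evals(cases: Iterable[EvalCase], system: Callable[[str], str]) -> dict:
    """Return the pass rate per category for the given system under test."""
    totals: dict = {}
    passed: dict = {}
    for case in cases:
        totals[case.category] = totals.get(case.category, 0) + 1
        # Assumption: exact match after trimming and lowercasing counts as a pass.
        if system(case.prompt).strip().lower() == case.expected.strip().lower():
            passed[case.category] = passed.get(case.category, 0) + 1
    return {cat: passed.get(cat, 0) / n for cat, n in totals.items()}
```

In CI, one option is to fail the build when any category's pass rate drops below a threshold the team has agreed on, so a regression in refuse cases cannot hide behind a strong happy-path score.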
Guardrails, not gates
A guardrail is something the system applies on every request. A gate is a one-time approval. Production AI lives or dies on guardrails.
- Input guardrails: PII redaction, length limits, prompt-injection patterns.
- Output guardrails: schema validation, refusal detection, jailbreak detection.
- Cost guardrails: token caps per user, per session, and per day. Yes, at all three levels.
- Human-in-the-loop: require approval for irreversible actions (delete, send, pay).
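The input and cost guardrails from the list above could be sketched roughly as follows. The limits, the email-only redaction pattern, and the injection phrases are placeholder assumptions, and `check_input`, `TokenBudget`, and `GuardrailError` are hypothetical names; production PII and injection detection need far more than a regex and two phrases.

```python
import re
from collections import defaultdict

MAX_INPUT_CHARS = 4000  # assumed limit, tune per application
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")  # crude email matcher
INJECTION_HINTS = ("ignore previous instructions", "reveal your system prompt")

class GuardrailError(Exception):
    """Raised when a request violates a guardrail."""

def check_input(text: str) -> str:
    # Runs on every request: length limit, injection hints, then redaction.
    if len(text) > MAX_INPUT_CHARS:
        raise GuardrailError("input too long")
    lowered = text.lower()
    if any(hint in lowered for hint in INJECTION_HINTS):
        raise GuardrailError("possible prompt injection")
    return EMAIL.sub("[EMAIL]", text)

class TokenBudget:
    """Tracks token spend per (user, day); in-memory only in this sketch."""

    def __init__(self, daily_cap: int) -> None:
        self.daily_cap = daily_cap
        self.used = defaultdict(int)

    def charge(self, user_id: str, day: str, tokens: int) -> None:
        key = (user_id, day)
        if self.used[key] + tokens > self.daily_cap:
            raise GuardrailError("daily token cap reached")
        self.used[key] += tokens
```

The same `TokenBudget` shape can be keyed by session instead of day; a real deployment would back it with shared storage rather than process memory.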
Graceful degradation
Your model provider will have an outage. Your eval scores will drop. Plan for it.
- Fall back to a cheaper or smaller model when the primary times out.
- Cache common answers with a vector-similarity cache; users tolerate "served from cache" far better than "service unavailable."
- Show a clear "AI temporarily unavailable" state instead of a half-broken UI.
Architect tip: Treat the LLM as an unreliable external service from sprint one. Teams that skip this find out at 3 AM, when their primary provider rate-limits them.
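One way to sketch the degradation chain described in this section: try each model in order, fall back to a cache, and finally return a clear unavailable message. The function name `answer_with_fallback` is hypothetical, the models are plain callables, and the cache here is an exact-match dictionary standing in for the vector-similarity cache mentioned above, purely to keep the example self-contained.

```python
from typing import Callable, Dict, Sequence, Tuple

UNAVAILABLE = "AI temporarily unavailable. Please try again shortly."

def answer_with_fallback(
    prompt: str,
    models: Sequence[Callable[[str], str]],
    cache: Dict[str, str],
) -> Tuple[str, str]:
    """Return (answer, source), where source says which path served the request."""
    for index, model in enumerate(models):
        try:
            return model(prompt), f"model_{index}"
        except (TimeoutError, ConnectionError):
            # Primary (or earlier fallback) failed; try the next, cheaper model.
            continue
    # Assumption: exact-match lookup on a normalized key, not vector similarity.
    cached = cache.get(prompt.strip().lower())
    if cached is not None:
        return cached, "cache"
    return UNAVAILABLE, "unavailable"
```

Returning the source alongside the answer lets the UI label a response as served from cache and lets the logs show how often each fallback path is taken.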
Observability that helps
For every AI request, log: the input, the resolved context, the prompt template version, the model name, the output, the eval scores, the cost. When something goes wrong (and it will), the trace from input to output is what saves you.
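A trace record with the fields listed above might look like the following sketch. The `AITrace` dataclass and `to_log_line` helper are hypothetical names, and `trace_id` and `timestamp` are extra fields assumed here so that records can be correlated and ordered; they are not part of the list above.

```python
import json
import time
import uuid
from dataclasses import asdict, dataclass, field

@dataclass
class AITrace:
    input: str
    context: str          # the resolved context passed to the model
    prompt_version: str   # version of the prompt template
    model: str
    output: str
    eval_scores: dict
    cost_usd: float
    # Assumed extras for correlating and ordering records:
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    timestamp: float = field(default_factory=time.time)

def to_log_line(trace: AITrace) -> str:
    # One JSON object per line keeps the trace easy to grep and to ingest.
    return json.dumps(asdict(trace), sort_keys=True)
```

If inputs can contain personal data, it is probably safer to log the redacted text produced by the input guardrails than the raw user message.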
Related reading
For the bigger paradigm shift, see Thinking in AI. For the developer-tool side, see The AI Pair Programmer.



