Why RAG, in one paragraph
Large language models know a great deal from their training data and nothing about your company's private documents. Retrieval-augmented generation (RAG) fixes that. You retrieve the relevant chunks of your data, you put them into the prompt, and the model answers the question with grounded context. Done well, a RAG system feels like a colleague who has read every document you own.
What RAG is not
RAG is not fine-tuning. Fine-tuning teaches a model new behaviour. RAG gives the model new knowledge at runtime. You almost always want RAG first; you reach for fine-tuning when the model needs to adopt a domain's own dialect, meaning its vocabulary, tone, or output format, rather than simply look up its facts.
The five layers of a serious RAG system
- Ingestion. Pull in your documents, normalise them, chunk them well. The biggest quality wins live here, not in the model.
- Embeddings and index. Vectorise each chunk and store it in a vector database with metadata you can filter on.
- Retrieval. Take the user query, retrieve the top candidates with hybrid search (vector plus keyword), then rerank.
- Synthesis. Send the question and retrieved chunks to the model with a tight, structured prompt.
- Evaluation. Run an eval set every time anything changes. RAG quality silently drifts when nobody looks.
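
The evaluation layer is the easiest one to skip, so here is a minimal sketch of what "run an eval set every time anything changes" could look like for the retrieval step. Everything named here is an assumption for illustration: the eval set is a list of (question, expected chunk ID) pairs, `retrieve` is whatever function your pipeline exposes that returns ranked chunk IDs, and the 0.02 tolerance is an arbitrary placeholder.

```python
from typing import Callable

# Hypothetical eval case: a question paired with the ID of the chunk
# that a correct retrieval step should surface.
EvalCase = tuple[str, str]


def hit_rate_at_k(
    cases: list[EvalCase],
    retrieve: Callable[[str], list[str]],
    k: int = 5,
) -> float:
    """Fraction of cases whose expected chunk ID appears in the top-k results."""
    if not cases:
        return 0.0
    hits = 0
    for question, expected_id in cases:
        if expected_id in retrieve(question)[:k]:
            hits += 1
    return hits / len(cases)


def passes_baseline(score: float, baseline: float, tolerance: float = 0.02) -> bool:
    """True unless the score dropped more than `tolerance` below the baseline."""
    return score >= baseline - tolerance
```

Wiring `passes_baseline` into CI is one way to make a drop in retrieval quality fail loudly instead of drifting silently. It only measures retrieval; answer quality needs its own checks.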
Chunking is where most RAG systems die
Naïve fixed-size chunking shreds your data into context-free strips. Use structure-aware chunking instead: split on headings for documents, on functions for code, on rows for tables. Aim for 300 to 800 tokens per chunk, and let neighbouring chunks overlap enough that a sentence spanning a boundary appears whole in at least one of them.
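
As a rough illustration of structure-aware chunking for documents, the sketch below splits on headings first and only falls back to overlapping windows when a section is too long. It assumes Markdown-style `#` headings and uses whitespace-separated words as a crude stand-in for tokens; a real pipeline would count with the embedding model's tokenizer. The function names and the default overlap of 50 are hypothetical.

```python
import re


def split_on_headings(text: str) -> list[tuple[str, str]]:
    """Return (heading, body) pairs for a document with Markdown-style headings."""
    sections: list[tuple[str, str]] = []
    heading = ""
    body: list[str] = []
    for line in text.splitlines():
        if re.match(r"^#{1,6}\s", line):
            # Close the previous section before starting a new one.
            if heading or any(part.strip() for part in body):
                sections.append((heading, "\n".join(body).strip()))
            heading = line.lstrip("#").strip()
            body = []
        else:
            body.append(line)
    if heading or any(part.strip() for part in body):
        sections.append((heading, "\n".join(body).strip()))
    return sections


def chunk_section(
    heading: str, body: str, max_tokens: int = 800, overlap: int = 50
) -> list[str]:
    """Split one section into windows that share `overlap` words with their neighbour.

    Words approximate tokens here. The heading is prepended to every chunk
    and is not counted against max_tokens.
    """
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be non-negative and smaller than max_tokens")
    words = body.split()
    if len(words) <= max_tokens:
        return [f"{heading}\n{body}".strip()]
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(words), step):
        window = words[start:start + max_tokens]
        chunks.append(f"{heading}\n{' '.join(window)}".strip())
        if start + max_tokens >= len(words):
            break  # the last window already reaches the end of the section
    return chunks
```

Repeating the heading in every chunk is a deliberate choice: it keeps a window from the middle of a long section from losing the context that says what the section is about.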
Engineering tip: Tag every chunk with source, section, version, and timestamp. When a user asks "where did that come from", the citation should be a one-click answer.
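
One possible shape for that metadata is sketched below. The `Chunk` class, its field names, and the citation format are illustrative assumptions, not a schema any particular vector database requires.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Chunk:
    """A chunk of text plus the metadata needed to cite and filter it."""

    text: str
    source: str          # e.g. file name or URL of the original document
    section: str         # heading the chunk was taken from
    version: str         # document version at ingestion time
    updated_at: datetime  # when the source was last modified

    def citation(self) -> str:
        """Human-readable pointer back to where the chunk came from."""
        return (
            f"{self.source}, {self.section} "
            f"(version {self.version}, updated {self.updated_at:%Y-%m-%d})"
        )


chunk = Chunk(
    text="Employees accrue leave monthly.",
    source="handbook.pdf",
    section="Leave policy",
    version="3",
    updated_at=datetime(2024, 3, 1, tzinfo=timezone.utc),
)
print(chunk.citation())
```

Storing the same fields as filterable metadata alongside the vector means the citation costs nothing extra at query time.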
Hybrid retrieval and reranking
Pure vector search misses exact strings such as product codes and error messages. Pure keyword search misses paraphrasing. Run both, merge the candidates, then rerank with a cross-encoder and keep the final top five. This single change usually lifts answer quality more than swapping models.
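
The merge step can be done in several ways; the sketch below uses reciprocal rank fusion (RRF), which is one common choice and not necessarily what any given system uses. The two input lists are assumed to be chunk IDs already ranked by your vector and keyword searches. `word_overlap` is a toy stand-in for a real cross-encoder so the example runs without a model; in practice you would pass a function that calls your reranker.

```python
from typing import Callable


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked ID lists: each appearance adds 1 / (k + rank) to an ID's score."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for deterministic output.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


def rerank(
    query: str,
    candidates: list[str],
    texts: dict[str, str],
    score: Callable[[str, str], float],
    top_n: int = 5,
) -> list[str]:
    """Re-order candidates by a (query, text) relevance score and keep the top_n."""
    ranked = sorted(candidates, key=lambda d: score(query, texts[d]), reverse=True)
    return ranked[:top_n]


def word_overlap(query: str, text: str) -> float:
    """Toy scorer (assumption): number of shared lowercase words."""
    return float(len(set(query.lower().split()) & set(text.lower().split())))


vector_hits = ["a", "b", "c"]
keyword_hits = ["c", "a", "d"]
texts = {
    "a": "annual leave policy",
    "b": "expense claims",
    "c": "parental leave policy details",
    "d": "office map",
}
merged = reciprocal_rank_fusion([vector_hits, keyword_hits])
best = rerank("parental leave", merged, texts, word_overlap, top_n=2)
```

RRF only looks at ranks, so it sidesteps the problem that vector similarity scores and keyword scores live on different scales.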
When RAG fails, here is why
| Symptom | Root cause | Fix |
|---|---|---|
| "I do not know" on questions you know are in the docs | Bad chunking or low recall | Inspect retrieved chunks; widen top_k; add hybrid keyword search |
| Confidently wrong answers | Retrieved chunks were irrelevant | Add a reranker; tighten the prompt to "answer only from context" |
| Stale answers | Index is not refreshing | Schedule re-embedding; track updated_at per chunk |
| High cost per query | Too much context being sent | Cap context tokens; cache common answers via semantic cache |
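
Two of the fixes in the table, the "answer only from context" instruction and the cap on context tokens, can live in the same place: the function that assembles the prompt. The sketch below is one hedged way to do it. The prompt wording, the `build_prompt` name, and the 2000-token default are assumptions, and word counts stand in for a real tokenizer.

```python
def build_prompt(
    question: str,
    chunks: list[tuple[str, str]],
    max_context_tokens: int = 2000,
) -> str:
    """Assemble a grounded prompt from (citation, text) pairs ordered best-first.

    Chunks are added until the next one would exceed the budget, so the
    lowest-ranked material is what gets dropped.
    """
    used = 0
    blocks = []
    for number, (citation, text) in enumerate(chunks, start=1):
        cost = len(text.split())  # crude word count standing in for tokens
        if used + cost > max_context_tokens:
            break
        blocks.append(f"[{number}] {citation}\n{text}")
        used += cost
    context = "\n\n".join(blocks)
    return (
        "Answer only from the context below. "
        "If the context does not contain the answer, say you do not know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
    )
```

Numbering the blocks gives the model something concrete to cite, which pairs naturally with per-chunk source metadata.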
Related Kiebot products
Kiework, our conversational HR platform, runs a RAG pipeline over your policies and HR documents. Open Agent takes the same pattern further by chaining retrieval with tool calls and human-in-the-loop checkpoints.
Where to read next
For the broader architecture, see Designing AI Applications That Survive Production. For the testing side, see Eval-Driven QA.


