Why RAG, in one paragraph

Large language models know everything in their training data and nothing about your company. RAG fixes that. You retrieve the relevant chunks of your data, you stuff them into the prompt, and the model answers the question with grounded context. Done well, a RAG system feels like a colleague who has read every document you own.

What RAG is not

RAG is not fine-tuning. Fine-tuning teaches a model new behaviour. RAG gives the model new knowledge at runtime. You almost always want RAG first; you reach for fine-tuning when the model needs to think in a domain dialect.

The five layers of a serious RAG system

Ingestion. Pull in your documents, normalise them, chunk them well. The biggest quality wins live here, not in the model.
Embeddings and index. Vectorise each chunk and store it in a vector database with metadata you can filter on.
Retrieval. Take the user query, retrieve the top candidates with hybrid search (vector plus keyword), then rerank.
Synthesis. Send the question and retrieved chunks to the model with a tight, structured prompt.
Evaluation. Run an eval set every time anything changes. RAG quality silently drifts when nobody looks.

Chunking is where most RAG systems die

Naïve fixed-size chunking shreds your data into context-free strips. Use structure-aware chunking. Split on headings for documents, on functions for code, on rows for tables. Aim for 300 to 800 tokens per chunk and keep meaningful overlap between neighbours.

Engineering tip: Tag every chunk with source, section, version, and timestamp. When a user asks "where did that come from", the citation should be a one-click answer.

Hybrid retrieval and reranking

Pure vector search misses exact strings. Pure keyword search misses paraphrasing. Run both, merge the candidates, then rerank with a cross-encoder for the final top five. This single change usually lifts answer quality more than swapping models.

When RAG fails, here is why

Symptom	Root cause	Fix
"I do not know" on questions you know are in the docs	Bad chunking or low recall	Inspect retrieved chunks; widen `top_k`; add hybrid keyword search
Confidently wrong answers	Retrieved chunks were irrelevant	Add a reranker; tighten the prompt to "answer only from context"
Stale answers	Index is not refreshing	Schedule re-embedding; track `updated_at` per chunk
High cost per query	Too much context being sent	Cap context tokens; cache common answers via semantic cache

Kiework, our conversational HR platform, runs a RAG pipeline over your policies and HR documents. Open Agent takes the same pattern further by chaining retrieval with tool calls and human-in-the-loop checkpoints.

Where to read next

For the broader architecture, see Designing AI Applications That Survive Production. For the testing side, see Eval-Driven QA.

RAG in Production: A Practical Guide for 2026

Why RAG, in one paragraph

What RAG is not

The five layers of a serious RAG system

Chunking is where most RAG systems die

Hybrid retrieval and reranking

When RAG fails, here is why

Where to read next

Sujith PS

Want to ship something like this on your product?

Table of Contents

Related Insights

Reactive Programming: A Deep Dive into Modern Architecture

User Experience (UX) Testing

Gemini CLI vs Claude Code: An Engineer's Comparison

Related Insights

Reactive Programming: A Deep Dive into Modern Architecture

User Experience (UX) Testing

Gemini CLI vs Claude Code: An Engineer's Comparison

RAG in Production: A Practical Guide for 2026

Why RAG, in one paragraph

What RAG is not

The five layers of a serious RAG system

Chunking is where most RAG systems die

Hybrid retrieval and reranking

When RAG fails, here is why

Related Kiebot products

Where to read next

Sujith PS

Want to ship something like this on your product?

Table of Contents

Related Insights

Reactive Programming: A Deep Dive into Modern Architecture

User Experience (UX) Testing

Gemini CLI vs Claude Code: An Engineer's Comparison

Related Insights

Reactive Programming: A Deep Dive into Modern Architecture

User Experience (UX) Testing

Gemini CLI vs Claude Code: An Engineer's Comparison