What Is RAG (Retrieval-Augmented Generation)?
RAG combines retrieval and generation: it fetches relevant context from your data first, then uses that context when the model writes an answer.
Answer-first: formula, assumptions, example
Question: What is RAG and how does it change AI unit cost?
Formula: effective_input_tokens = base_prompt_tokens + retrieved_chunks * tokens_per_chunk
- Assumption: retrieval quality controls how many chunks you need per request.
- Assumption: retrieved context tokens are a major cost driver and must be measured explicitly.
- Assumption: chunking and reranking settings affect both quality and unit economics.
Example: a 500-token base prompt + (6 chunks * 200 tokens each) gives 1,700 effective input tokens before any output tokens are counted.
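The formula above can be sketched as a small helper; the function name and parameters mirror the formula's terms and are illustrative, not part of any specific billing API:

```python
def effective_input_tokens(base_prompt_tokens: int,
                           retrieved_chunks: int,
                           tokens_per_chunk: int) -> int:
    # effective_input_tokens = base_prompt_tokens + retrieved_chunks * tokens_per_chunk
    return base_prompt_tokens + retrieved_chunks * tokens_per_chunk

# Worked example from the text: 500 + 6 * 200
print(effective_input_tokens(500, 6, 200))  # → 1700
```

Note how chunk count and chunk size multiply: doubling either one adds the same 1,200 tokens of retrieval overhead on top of the base prompt.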
How RAG Works
- Embed and index your knowledge base in a vector store.
- Retrieve the most relevant chunks for each user query, optionally reranking them for relevance.
- Build the prompt with instructions + retrieved context.
- Generate an answer, optionally citing the retrieved sources.
When To Use It
- You need answers grounded in private or frequently changing data.
- You want fresher outputs without retraining a model.
- You need better answer reliability than prompt-only approaches.
Common Mistakes
- Ignoring retrieval token overhead and only tracking model output tokens.
- Adding too many chunks without validating quality gains.
- Skipping a caching strategy and paying full cost for repeated requests.
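The caching mistake is easy to quantify. A rough sketch, assuming the 1,700-token example above and a hypothetical illustrative price (not a real provider rate):

```python
PRICE_PER_1K_INPUT = 0.003  # assumed illustrative rate, not any vendor's actual price
EFFECTIVE_INPUT = 1700      # effective input tokens from the worked example above

def input_cost(requests: int, cache_hit_rate: float = 0.0) -> float:
    # Cache hits skip retrieval and generation, so only misses pay input cost.
    billed_requests = requests * (1 - cache_hit_rate)
    return billed_requests * EFFECTIVE_INPUT / 1000 * PRICE_PER_1K_INPUT

print(input_cost(100_000))        # no cache: every request pays full input cost
print(input_cost(100_000, 0.40))  # 40% of requests answered from cache
```

Under these assumed numbers, a 40% hit rate cuts the input-token bill by 40%, which is why cache hit rate belongs next to chunk count in any RAG unit-cost model.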
Open companion tool: RAG Cost per User
Related reads: RAG cost components explained, what is reranking in RAG, chunk size and chunk count.