What Is RAG (Retrieval-Augmented Generation)?

RAG combines retrieval and generation: it fetches relevant context from your data first, then uses that context when the model writes an answer.

Answer-first: formula, assumptions, example

Question: What is RAG and how does it change AI unit cost?

Formula: effective_input_tokens = base_prompt_tokens + retrieved_chunks * tokens_per_chunk

  • Assumption: retrieval quality controls how many chunks you need per request.
  • Assumption: retrieved context tokens are a major cost driver and must be measured explicitly.
  • Assumption: chunking and reranking settings affect both quality and unit economics.

Example: a 500-token base prompt + (6 chunks * 200 tokens per chunk) = 1,700 effective input tokens per request, before any output tokens are counted.
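The formula and worked example can be sketched directly; the per-token price below is a placeholder assumption, not a real vendor rate.

```python
def effective_input_tokens(base_prompt_tokens: int,
                           retrieved_chunks: int,
                           tokens_per_chunk: int) -> int:
    """Effective input size per request, per the formula above."""
    return base_prompt_tokens + retrieved_chunks * tokens_per_chunk

# Worked example from the text: 500 base + 6 chunks * 200 tokens each.
tokens = effective_input_tokens(500, 6, 200)
print(tokens)  # 1700

# Hypothetical input price per 1K tokens, just to show the unit-cost impact.
PRICE_PER_1K_INPUT = 0.0005  # assumption for illustration only
print(round(tokens / 1000 * PRICE_PER_1K_INPUT, 6))
```

Plugging in your own chunk count and chunk size shows how quickly retrieval, not output, can dominate the bill.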

How RAG Works

  1. Embed and index your knowledge base in a vector store.
  2. Retrieve the most relevant chunks for each user query.
  3. Build the prompt with instructions + retrieved context.
  4. Generate an answer, then optionally rerank or cite sources.
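The four steps above can be sketched end to end. This is a toy, assuming a keyword-overlap scorer in place of real embeddings and a vector store, and a placeholder string in place of the model call; names like `knowledge_base` and `build_prompt` are illustrative, not a real API.

```python
# Step 1 stand-in: a tiny in-memory "index" instead of an embedded vector store.
knowledge_base = [
    "RAG retrieves context before generation.",
    "Chunk size affects retrieval quality and cost.",
    "Reranking reorders retrieved chunks by relevance.",
]

def score(query: str, chunk: str) -> int:
    # Step 2 stand-in: lowercase word overlap, not real embedding similarity.
    return len(set(query.lower().split()) & set(chunk.lower().split()))

def retrieve(query: str, k: int = 2) -> list[str]:
    # Return the k highest-scoring chunks for this query.
    return sorted(knowledge_base, key=lambda c: score(query, c), reverse=True)[:k]

def build_prompt(query: str, chunks: list[str]) -> str:
    # Step 3: instructions + retrieved context + the user question.
    context = "\n".join(f"- {c}" for c in chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

prompt = build_prompt("How does chunk size affect cost?",
                      retrieve("chunk size cost"))
print(prompt)  # Step 4 would send this prompt to the model.
```

A production pipeline swaps the scorer for embedding similarity against a vector store and adds reranking and citation of the chunks it used.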

When To Use It

  • You need answers grounded in private or frequently changing data.
  • You want fresher outputs without retraining a model.
  • You need more reliable, source-grounded answers than prompt-only approaches can deliver.

Common Mistakes

  • Ignoring retrieval token overhead and only tracking model output tokens.
  • Adding too many chunks without validating quality gains.
  • Skipping cache strategy and paying full cost for repeated requests.
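The cache point above is cheap to fix. A minimal sketch, assuming exact-match reuse on a normalized query is acceptable (production systems often use semantic caching instead); the `answer` function here is a placeholder for the real RAG pipeline.

```python
# Exact-match request cache keyed on a normalized query string.
cache: dict[str, str] = {}
calls = 0  # counts how many times we actually pay for retrieval + generation

def answer(query: str) -> str:
    global calls
    key = " ".join(query.lower().split())  # normalize case and whitespace
    if key in cache:
        return cache[key]          # cache hit: zero retrieval or model tokens
    calls += 1                     # cache miss: pay the full per-request cost
    result = f"answer for: {key}"  # stand-in for the real RAG pipeline
    cache[key] = result
    return result

answer("What is RAG?")
answer("what is  RAG?")  # normalizes to the same key: no second paid call
print(calls)  # 1
```

Even this naive cache turns repeated requests from full-price calls into free lookups; measure your hit rate before reaching for anything fancier.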

Open companion tool: RAG Cost per User

Related reads: RAG cost components explained, what is reranking in RAG, chunk size and chunk count.