What Is RAG?
RAG (Retrieval-Augmented Generation) combines retrieval and generation: it first fetches relevant context from your data, then passes that context to the model when generating an answer. In many agent workflows, RAG is the retrieval layer that feeds planning, tool use, or synthesis.
Question
What is RAG inside an AI agent and how does it change unit cost?
Quick answer
Formula: effective_input_tokens = base_prompt_tokens + retrieved_chunks * tokens_per_chunk
- Assumption: retrieval quality controls how many chunks you need per request.
- Assumption: retrieved context tokens are a major cost driver and must be measured explicitly.
- Assumption: chunking and reranking settings affect both quality and unit economics.
Example: base prompt 500 + (6 chunks * 200) gives 1,700 effective input tokens before output.
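The formula and the worked example above can be sketched directly; the per-1K price below is a placeholder assumption, not a real provider rate.

```python
def effective_input_tokens(base_prompt_tokens: int, retrieved_chunks: int, tokens_per_chunk: int) -> int:
    """effective_input_tokens = base_prompt_tokens + retrieved_chunks * tokens_per_chunk"""
    return base_prompt_tokens + retrieved_chunks * tokens_per_chunk

tokens = effective_input_tokens(500, 6, 200)
print(tokens)  # 1700

# Hypothetical $/1K-input-token rate, purely for illustration.
price_per_1k_input = 0.003
print(round(tokens / 1000 * price_per_1k_input, 4))  # 0.0051
```

Measuring this number per request, rather than only output tokens, makes the retrieval overhead visible in your unit economics.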
How RAG Works
- Embed and index your knowledge base in a vector store.
- Retrieve the most relevant chunks for each user query.
- Build the prompt with instructions + retrieved context.
- Generate an answer, then optionally rerank or cite sources.
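The four steps above can be sketched end to end. This is a toy illustration, not a production pipeline: the bag-of-words "embedding" and in-memory list stand in for a real embedding model and vector store, and the names (`retrieve`, `build_prompt`) are ours, not a library API.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy stand-in for a real embedding model: word counts.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    # Rank all chunks by similarity to the query; keep the top k.
    q = embed(query)
    return sorted(corpus, key=lambda doc: cosine(q, embed(doc)), reverse=True)[:k]

def build_prompt(query: str, chunks: list[str]) -> str:
    # Instructions + retrieved context, ready for the generation call.
    context = "\n".join(f"- {c}" for c in chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

corpus = [
    "Invoices are due within 30 days of receipt.",
    "The support line is open on weekdays.",
    "Refunds require the original receipt.",
]
chunks = retrieve("when are invoices due", corpus)
print(build_prompt("When are invoices due?", chunks))
```

In a real system the generate step would send this prompt to a model, and a reranker could reorder `chunks` before the prompt is built.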
When To Use It
- You need answers grounded in private or frequently changing data.
- You want fresher outputs without retraining a model.
- You need more reliable, source-grounded answers than prompt-only approaches provide.
RAG Inside Agent Workflows
- RAG often supplies the context window before an agent calls tools or synthesizes a final answer.
- Retrieval depth and chunk size can dominate cost even when the agent takes multiple steps.
- Better retrieval quality can reduce how often an agent needs expensive retries or extra tool calls.
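A rough model makes the second point above concrete: when retrieved context is re-sent at every agent step, retrieval depth multiplies across the run. All numbers here are illustrative assumptions, not measurements.

```python
def agent_run_tokens(steps: int, base_prompt: int = 500, chunks: int = 6,
                     tokens_per_chunk: int = 200, output_per_step: int = 300) -> tuple[int, int]:
    # Assumes the retrieved context is included in every step's prompt.
    input_tokens = steps * (base_prompt + chunks * tokens_per_chunk)
    output_tokens = steps * output_per_step
    return input_tokens, output_tokens

# Four-step run with 6 chunks vs 3 chunks of 200 tokens each.
print(agent_run_tokens(steps=4, chunks=6))  # (6800, 1200)
print(agent_run_tokens(steps=4, chunks=3))  # (4400, 1200)
```

Halving chunk count cuts input tokens by about a third here while output cost is unchanged, which is why retrieval settings, not output length, often dominate agent unit cost.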
Common Mistakes
- Ignoring retrieval token overhead and only tracking model output tokens.
- Adding too many chunks without validating quality gains.
- Skipping cache strategy and paying full cost for repeated requests.
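The last mistake can be sketched as a minimal request cache keyed on the normalized query. This is a simplified illustration under stated assumptions: a real cache would add TTLs, invalidation on index updates, and possibly semantic (embedding-based) matching.

```python
import hashlib

cache: dict[str, str] = {}

def cached_answer(query: str, answer_fn) -> str:
    # Normalize before hashing so trivially different phrasings hit the cache.
    key = hashlib.sha256(query.strip().lower().encode()).hexdigest()
    if key not in cache:
        cache[key] = answer_fn(query)  # pay retrieval + generation cost once
    return cache[key]

calls = []
def expensive_answer(q: str) -> str:
    calls.append(q)  # stands in for the full RAG pipeline call
    return f"answer to: {q}"

cached_answer("When are invoices due?", expensive_answer)
cached_answer("  when are invoices due?", expensive_answer)  # cache hit
print(len(calls))  # 1
```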
Open companion tools: AI Workflow Cost, Retrieval Cost, Rerank Cost
Related reads: What Is an AI Agent?, RAG Cost Components Explained, What Is Reranking in RAG?, How To Choose Chunk Size and Chunk Count