Context Bloat in RAG
Context bloat is non-task prompt overhead that raises token spend without improving user outcomes.
Answer-first: formula, assumptions, example
Question: How do I trim context bloat in RAG without hurting answer quality?
Formula: candidate_input_tokens = task_tokens + target_non_task_tokens
Formula: cost_delta = cost_candidate - cost_baseline
- Assumption: task tokens stay constant between baseline and candidate runs.
- Assumption: non-task tokens are measured on production-like requests.
- Assumption: quality is validated with sampled prompts before global rollout.
Example: a baseline input of 1800 tokens includes 850 non-task tokens, so task tokens are 950. Trimming non-task tokens from 850 to 300 lowers candidate input to 950 + 300 = 1250 tokens before pricing is applied.
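The two formulas and the example can be checked with a short Python sketch. The function names and the price of 3.00 USD per million input tokens are assumptions made up for illustration; substitute your provider's actual rate.

```python
def candidate_input_tokens(baseline_input, baseline_non_task, target_non_task):
    """candidate_input_tokens = task_tokens + target_non_task_tokens."""
    # Task tokens are assumed constant between baseline and candidate runs.
    task_tokens = baseline_input - baseline_non_task
    return task_tokens + target_non_task


def input_cost(tokens, price_per_million):
    """Input cost for one request at a flat per-token price."""
    return tokens / 1_000_000 * price_per_million


# Hypothetical price, not a real quote from any provider.
PRICE_PER_MILLION_USD = 3.00

baseline = 1800
candidate = candidate_input_tokens(baseline, baseline_non_task=850, target_non_task=300)

# cost_delta = cost_candidate - cost_baseline (negative means savings)
cost_delta = input_cost(candidate, PRICE_PER_MILLION_USD) - input_cost(baseline, PRICE_PER_MILLION_USD)

print(candidate)
print(cost_delta)
```

A negative cost_delta means the candidate is cheaper per request. Multiply by request volume to estimate savings over a period.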
Where Bloat Usually Comes From
- Repeated system instructions copied into every request.
- Long stale conversation history with low relevance to the active task.
- Boilerplate retrieval snippets that are never used by the model output.
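One way to see which of these sources dominates is to tally non-task tokens per source across a sample of requests. The sketch below is illustrative only: the request layout (keys "system", "history", "retrieval", "task") is hypothetical, and the whitespace word count is a crude stand-in for a real tokenizer.

```python
from collections import Counter


def approx_tokens(text):
    # Crude approximation: whitespace-separated words.
    # Replace with your model's tokenizer for real measurements.
    return len(text.split())


def overhead_by_source(requests):
    """Sum approximate non-task tokens per source across sampled requests."""
    totals = Counter()
    for request in requests:
        for source, text in request.items():
            if source != "task":  # everything except the task itself is overhead
                totals[source] += approx_tokens(text)
    return totals


# Hypothetical sample; real input would come from logged production requests.
sample_requests = [
    {
        "system": "You are a helpful assistant.",
        "history": "hi there how are you",
        "retrieval": "About us. Contact. Terms.",
        "task": "What is the refund window?",
    },
    {
        "system": "You are a helpful assistant.",
        "history": "",
        "retrieval": "Refunds are accepted within 30 days.",
        "task": "What is the refund window?",
    },
]

totals = overhead_by_source(sample_requests)
for source, tokens in totals.most_common():
    print(source, tokens)
```

The largest totals point at the overhead blocks worth trimming first.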
Practical Trim Sequence
- Measure baseline non-task tokens on real traffic slices.
- Remove one overhead block at a time and replay eval prompts.
- Ship only changes that lower cost without quality regressions.
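The sequence above can be sketched as a loop that drops one overhead block at a time and keeps only the removals that hold quality. Everything here is an assumption for illustration: the block names, the token counts, the quality scores, and stub_eval, which stands in for replaying your own eval prompts.

```python
def evaluate_trims(blocks, eval_fn, baseline_quality, tolerance=0.0):
    """Try removing each overhead block alone; return the removals that are safe.

    blocks: dict mapping block name to its non-task token count.
    eval_fn: callable taking the remaining blocks and returning a quality score.
    """
    safe = []
    for name, tokens_saved in blocks.items():
        # Candidate prompt: all blocks except the one under test.
        remaining = {k: v for k, v in blocks.items() if k != name}
        quality = eval_fn(remaining)
        # Keep the change only if it saves tokens and quality does not regress.
        if tokens_saved > 0 and quality >= baseline_quality - tolerance:
            safe.append((name, tokens_saved, quality))
    return safe


# Hypothetical overhead blocks with measured baseline token counts.
blocks = {
    "duplicate_system_text": 250,
    "stale_history": 300,
    "retrieved_passages": 400,
}


def stub_eval(remaining):
    # Stand-in for a real eval replay: quality drops without retrieved passages.
    return 0.90 if "retrieved_passages" in remaining else 0.70


safe_trims = evaluate_trims(blocks, stub_eval, baseline_quality=0.90)
for name, tokens_saved, quality in safe_trims:
    print(name, tokens_saved, quality)
```

Each block is tested in isolation, so a combination of safe removals may still regress. Replaying the eval prompts once more on the combined candidate before rollout covers that case.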