Context Bloat in RAG

Context bloat is non-task prompt overhead that raises token spend without improving user outcomes.

Answer-first: formula, assumptions, example

Question: How do I trim context bloat in RAG without hurting answer quality?

Formula: candidate_input_tokens = task_tokens + target_non_task_tokens, where target_non_task_tokens is the non-task overhead that remains after trimming

Formula: cost_delta = cost_candidate - cost_baseline (a negative cost_delta means the candidate is cheaper)

  • Assumption: task tokens stay constant between baseline and candidate runs.
  • Assumption: non-task tokens are measured on production-like requests.
  • Assumption: quality is validated with sampled prompts before global rollout.

Example: a baseline input of 1800 tokens includes 850 non-task tokens, so task tokens are 1800 - 850 = 950. Trimming the non-task tokens to 300 gives a candidate input of 950 + 300 = 1250 tokens, before pricing is applied.
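
The two formulas and the example can be sketched in Python as follows. This is a minimal illustration, not a definitive implementation: the function names, the flat per-million-token input price, and the value 3.00 are all assumptions made for the sketch and do not reflect any provider's actual pricing.

```python
def candidate_input_tokens(task_tokens: int, target_non_task_tokens: int) -> int:
    """candidate_input_tokens = task_tokens + target_non_task_tokens"""
    return task_tokens + target_non_task_tokens


def input_cost(input_tokens: int, price_per_million_tokens: float) -> float:
    """Cost of one request's input under an assumed flat per-token price."""
    return input_tokens * price_per_million_tokens / 1_000_000


def cost_delta(baseline_tokens: int, candidate_tokens: int,
               price_per_million_tokens: float) -> float:
    """cost_delta = cost_candidate - cost_baseline (negative means cheaper)."""
    return (input_cost(candidate_tokens, price_per_million_tokens)
            - input_cost(baseline_tokens, price_per_million_tokens))


# Numbers from the example above.
baseline_input = 1800
baseline_non_task = 850
target_non_task = 300

# Task tokens are held constant between baseline and candidate runs.
task_tokens = baseline_input - baseline_non_task
candidate_input = candidate_input_tokens(task_tokens, target_non_task)

# Hypothetical price, chosen only to make the arithmetic concrete.
HYPOTHETICAL_PRICE_PER_MILLION = 3.00
delta = cost_delta(baseline_input, candidate_input, HYPOTHETICAL_PRICE_PER_MILLION)

print(f"task tokens: {task_tokens}")
print(f"candidate input tokens: {candidate_input}")
print(f"cost delta per request: {delta:.6f}")
```

Because the price is a single multiplier here, the relative saving depends only on the token counts. Substitute a real rate to get an absolute figure.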

Where Bloat Usually Comes From

  • Repeated system instructions copied into every request.
  • Long stale conversation history with low relevance to the active task.
  • Boilerplate retrieval snippets that are never used by the model output.

Practical Trim Sequence

  1. Measure baseline non-task tokens on real traffic slices.
  2. Remove one overhead block at a time and replay eval prompts.
  3. Ship only changes that lower cost without quality regressions.
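
The sequence above can be sketched as a small loop. This is one possible reading under stated assumptions: `evaluate` is a hypothetical callback that replays your eval prompts with a given set of blocks removed and returns a quality score (higher is better), and the block names and token counts are made up for illustration.

```python
from typing import Callable, Dict, FrozenSet, List


def trim_overhead(
    overhead_blocks: Dict[str, int],
    evaluate: Callable[[FrozenSet[str]], float],
    tolerance: float = 0.0,
) -> List[str]:
    """Return the names of overhead blocks that can be removed without
    dropping the eval score below the baseline (minus `tolerance`)."""
    removed: FrozenSet[str] = frozenset()
    # Step 1: baseline score with nothing removed.
    baseline_score = evaluate(removed)
    # Step 2: remove one block at a time, largest first, and replay evals.
    by_size = sorted(overhead_blocks, key=lambda name: overhead_blocks[name],
                     reverse=True)
    for name in by_size:
        trial = removed | {name}
        # Step 3: keep the removal only if quality does not regress.
        if evaluate(trial) >= baseline_score - tolerance:
            removed = trial
    return sorted(removed)


# Hypothetical measured non-task tokens per block.
blocks = {
    "repeated_system_instructions": 400,
    "stale_history": 300,
    "boilerplate_snippets": 150,
}


def fake_evaluate(removed_blocks: FrozenSet[str]) -> float:
    # Stand-in for a real eval replay: pretend that dropping the
    # conversation history hurts quality and nothing else does.
    return 0.8 if "stale_history" in removed_blocks else 0.9


safe_to_remove = trim_overhead(blocks, fake_evaluate)
tokens_saved = sum(blocks[name] for name in safe_to_remove)
print(safe_to_remove, tokens_saved)
```

Trying the largest blocks first is a design choice of this sketch, not something the sequence requires. It surfaces the biggest savings early. A real eval would also need enough sampled prompts for the score comparison to be meaningful.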
