Context Bloat in RAG

Context bloat is non-task prompt overhead that raises token spend in RAG systems and agent workflows without improving user outcomes.

Question

How do I trim context bloat in an agent or RAG workflow without hurting answer quality?

Quick answer

Formula: candidate_input_tokens = task_tokens + target_non_task_tokens

Formula: cost_delta = cost_candidate - cost_baseline

  • Assumption: task tokens stay constant between baseline and candidate runs.
  • Assumption: non-task tokens are measured on production-like requests.
  • Assumption: quality is validated with sampled prompts before global rollout.

Example: a baseline input of 1800 tokens with 850 non-task tokens, trimmed to a 300-token overhead budget, yields a candidate input of 1250 tokens (1800 - 850 + 300) before pricing is applied.
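The arithmetic above can be sketched as a short script. The numbers come from the example; the per-token price is an assumption for illustration, so substitute your provider's actual rate.

```python
def candidate_input_tokens(task_tokens: int, target_non_task_tokens: int) -> int:
    """candidate_input_tokens = task_tokens + target_non_task_tokens."""
    return task_tokens + target_non_task_tokens

baseline_input = 1800      # total input tokens before trimming (example above)
baseline_non_task = 850    # measured non-task overhead tokens
target_non_task = 300      # overhead budget after trimming

# Assumption: task tokens stay constant between baseline and candidate runs.
task_tokens = baseline_input - baseline_non_task  # 950

candidate_input = candidate_input_tokens(task_tokens, target_non_task)  # 1250

# Assumed input price in USD per token; check your provider's pricing page.
price_per_token = 0.000003
cost_baseline = baseline_input * price_per_token
cost_candidate = candidate_input * price_per_token
cost_delta = cost_candidate - cost_baseline  # negative means savings
```

A negative `cost_delta` confirms the trim saves money; quality still needs separate validation on sampled prompts before rollout.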

Where Bloat Usually Comes From

  • Repeated system instructions copied into every request.
  • Long stale conversation history with low relevance to the active task.
  • Boilerplate retrieval snippets that are never used by the model output.

Practical Trim Sequence

  1. Measure baseline non-task tokens on real traffic slices.
  2. Remove one overhead block at a time and replay eval prompts.
  3. Ship only changes that lower cost without quality regressions.
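Step 3 amounts to a two-sided gate: cost must drop and quality must not regress. A minimal sketch, assuming eval results are recorded as pass/fail per replayed prompt (the function name and inputs are illustrative):

```python
def should_ship(baseline_passes: list[bool], candidate_passes: list[bool],
                cost_baseline: float, cost_candidate: float) -> bool:
    """Ship a trim only if it lowers cost without lowering the eval pass rate."""
    base_rate = sum(baseline_passes) / len(baseline_passes)
    cand_rate = sum(candidate_passes) / len(candidate_passes)
    return cost_candidate < cost_baseline and cand_rate >= base_rate
```

In practice you would also want a minimum sample size and a tolerance band on the pass rate, since small replay sets are noisy.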
