RAG or Long Prompt

Which is cheaper for this workflow: retrieval or a larger context window?

How this tool works

This calculator runs the same deterministic economics model twice: once for a baseline RAG workflow and once for a candidate long-context design that removes retrieval, reranking, vector lookups, and embedding refresh, so you can see when larger windows actually beat the retrieval stack.


  1. Set the current RAG baseline model and the candidate long-context model.
  2. Keep shared workload assumptions explicit, then set long-context prompt size.
  3. Compare cost, margin, break-even price, and retrieval-stack deltas before changing architecture.

Formula

cost_delta = cost_long_context - cost_rag

request_input_tokens_delta = long_context_input_tokens - (base_prompt_tokens + retrieved_chunks * tokens_per_chunk)
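The two formulas can be sketched directly in code. This is a minimal illustration, not the calculator's implementation; the example token counts assume, for illustration only, a 900-token base prompt and 6 retrieved chunks of 220 tokens each, which together match the headline scenario's 2,220 RAG input tokens.

```python
# Minimal sketch of the two formulas above. The specific token counts in the
# example call are illustrative assumptions, not stated calculator inputs.

def request_input_tokens_delta(long_context_input_tokens: int,
                               base_prompt_tokens: int,
                               retrieved_chunks: int,
                               tokens_per_chunk: int) -> int:
    """Extra prompt tokens the long-context request carries vs. the RAG request."""
    rag_input_tokens = base_prompt_tokens + retrieved_chunks * tokens_per_chunk
    return long_context_input_tokens - rag_input_tokens

def cost_delta(cost_long_context: float, cost_rag: float) -> float:
    """Negative means the long-context path is cheaper."""
    return cost_long_context - cost_rag

# Assumed decomposition: 2,600 long-context tokens vs. 900 + 6 * 220 = 2,220.
print(request_input_tokens_delta(2600, 900, 6, 220))  # 380
```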

Assumptions and Units

  • Currency: USD
  • Token unit: token
  • Long-context candidate removes retrieval, reranking, vector-query, and embedding-refresh terms
  • Pricing source: daily pricing snapshot in repo, no runtime scraping

Example Scenario

If a deep-research workflow currently uses wide retrieval but a long-context prompt could replace the stack, compare whether the extra prompt tokens are cheaper than retrieval, reranking, and monthly refresh overhead.

Related resources: Retrieval Cost, Rerank Cost, Cache Savings, Prompt Overhead, Indexing Cost, Deep Research Cost per Seat, RAG vs Long-Context: Cost Tradeoffs, Context Bloat in RAG, How Many Tokens Per Request?, What Is RAG?.

Pricing snapshot: 2026-04-29
Baseline: OpenAI / GPT-5 Mini
Candidate: OpenAI / GPT-5.2

Step 1 Baseline and Candidate Models

Set the current RAG model as baseline and the long-context alternative as candidate.

Step 2 Quick Mode

Set shared workload assumptions, then size the long-context prompt.

Compare wide docs retrieval against a larger single prompt before changing search architecture.

Optional Advanced assumptions

Tune the baseline retrieval stack and candidate cache/infra only after Quick Mode is close.


Save and track this scenario

Track pricing drift on this scenario and get an email if the latest result changes.

How tracking works

After you click Save and track, we carry this exact calculator state into the tracked-scenarios page so you can sign in and confirm the save.

We save your assumptions and the pricing snapshot used for this result.

When a newer pricing snapshot lands, we recompute the same scenario, show what changed, and email you if the latest result moved.

1 tracked scenario free, then $12/mo or $120/yr for up to 25 tracked scenarios.

Decision Signal

Long-context path lowers modeled cost

Switching to GPT-5.2 and removing the retrieval stack changes cost per user per month by -$0.2397 and gross margin by +0.8%.

  • RAG request input tokens: 2,220
  • Long-context prompt tokens: 2,600
  • Retrieval-stack delta: -$0.9807
  • Monthly cost impact: -$191.73

Top Cost Drivers

Baseline sensitivity: cost change when one variable is increased by 10%.

  • Requests Per User Month: 10.0%
  • Rerank Docs: 9.2%
  • Cache Hit Rate: -6.1%

Totals

Baseline vs candidate totals under the same task assumptions.

Monthly gross profit impact uses 800 active users.

Metric                         RAG baseline   Long context   Delta
Cost per request               $0.0108        $0.00681       -$0.00399
Cost per user / month          $0.648         $0.4084        -$0.2397
Gross margin %                 97.8%          98.6%          +0.8%
Break-even price               $0.648         $0.4084        -$0.2397
Monthly gross profit impact                                  +$191.73
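The monthly figures follow mechanically from the per-request costs. As a sketch: the 800 active users are stated above, while 60 requests per user per month is inferred from the displayed figures ($0.648 / $0.0108 = 60), so treat it as an assumption rather than a stated input.

```python
# Rolling per-request costs up to the monthly business impact.
# 60 requests/user/month is inferred, 800 active users is stated.
# Small differences vs. the report come from rounding of displayed costs.

requests_per_user_month = 60
active_users = 800

cost_per_request = {"rag": 0.0108, "long_context": 0.00681}

cost_per_user_month = {k: v * requests_per_user_month
                       for k, v in cost_per_request.items()}
delta_per_user = cost_per_user_month["long_context"] - cost_per_user_month["rag"]
monthly_profit_impact = -delta_per_user * active_users  # cost down -> profit up

print(round(cost_per_user_month["rag"], 4))  # 0.648
print(round(delta_per_user, 4))              # -0.2394
print(round(monthly_profit_impact, 2))       # 191.52
```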

Component Breakdown

Each component is computed independently, then compared across both architectures.
Amounts are USD per user per month; each architecture's components sum to its cost-per-user total above.

  • Generation: model input/output token spend for requests.
  • Retrieval: extra model input spend from retrieved context chunks.
  • Reranking: reranker cost based on docs scored per request.
  • Embeddings Ingestion: amortized per-user share of the fixed monthly corpus embedding refresh cost.
  • Vector DB: vector database query cost across all requests.
  • Cache: savings from cache hits; negative means lower total cost.
  • Infra: non-model infra overhead per request.

Component              RAG baseline   Long context   Delta
Generation             $0.0435        $0.483         +$0.4395
Retrieval              $0.0198        $0             -$0.0198
Reranking              $0.96          $0             -$0.96
Embeddings Ingestion   $0             $0             $0
Vector DB              $0.0009        $0             -$0.0009
Cache                  -$0.3972       -$0.0896       +$0.3075
Infra                  $0.021         $0.015         -$0.006
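A quick cross-check that the component rows are internally consistent: summing each column reproduces the cost-per-user totals from the table above.

```python
# Cross-check: per-user monthly components sum to the totals table
# ($0.648 RAG, $0.4084 long context). Values are copied from the breakdown.
components = {
    "rag": {"generation": 0.0435, "retrieval": 0.0198, "reranking": 0.96,
            "embeddings": 0.0, "vector_db": 0.0009, "cache": -0.3972,
            "infra": 0.021},
    "long_context": {"generation": 0.483, "retrieval": 0.0, "reranking": 0.0,
                     "embeddings": 0.0, "vector_db": 0.0, "cache": -0.0896,
                     "infra": 0.015},
}
totals = {arch: round(sum(parts.values()), 4)
          for arch, parts in components.items()}
print(totals)  # {'rag': 0.648, 'long_context': 0.4084}
```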
Sensitivity Ranking

Baseline sensitivity: cost change when one variable is increased by 10%.

Variable                     Description                                                              Delta cost %
Requests Per User Month      User activity level per month                                            10.0%
Rerank Docs                  Docs reranked per request                                                9.2%
Cache Hit Rate               Fraction of requests served by cache                                     -6.1%
Output Tokens                Generated tokens per request                                             0.3%
Retrieved Chunks             Retrieved chunk count per request                                        0.2%
Tokens Per Chunk             Average chunk size in tokens                                             0.2%
Input Tokens                 Prompt-side tokens per request                                           0.1%
Vector Queries Per Request   Vector query count per request                                           0.0%
Monthly Active Users         Active-user estimate used to amortize fixed monthly embedding refresh   -0.0%
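The ranking above is a standard one-at-a-time sensitivity: bump one input by 10%, recompute total cost, and report the percentage change. A generic sketch, where `cost_model` is a stand-in for the calculator's full baseline model rather than its actual implementation:

```python
# One-at-a-time +10% sensitivity sketch. `cost_model` is any callable that
# maps named inputs to a total cost; the toy model below is illustrative.

def sensitivity(cost_model, baseline_inputs, bump=0.10):
    base_cost = cost_model(**baseline_inputs)
    out = {}
    for name, value in baseline_inputs.items():
        bumped = dict(baseline_inputs, **{name: value * (1 + bump)})
        out[name] = (cost_model(**bumped) - base_cost) / base_cost * 100
    return out

# Toy model: cost scales linearly with request volume, so a +10% bump in
# requests moves cost by +10%, matching the top row of the ranking.
toy = lambda requests, cost_per_request: requests * cost_per_request
print(sensitivity(toy, {"requests": 60, "cost_per_request": 0.0108}))
```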

Assumptions and Units

Explicit assumptions to keep architecture comparisons reproducible and auditable.
  • Currency: USD
  • Token unit: token
  • Pricing snapshot: 2026-04-29
  • Baseline model: OpenAI / GPT-5 Mini
  • Candidate model: OpenAI / GPT-5.2
  • Architecture rule: long-context candidate removes retrieval, reranking, vector-query, and embedding-refresh terms
  • Volume basis: business totals and fixed monthly terms use monthly active users as the denominator

Recommended Next Step

Use this section to turn the architecture comparison into concrete implementation checks.

If the long-context path still looks viable after quality and latency checks, review infra constraints next.

Sources and Snapshot

Pricing comes from the current dated snapshot.

Active Pricing Rows

RAG baseline

OpenAI / GPT-5 Mini

  • Input tokens: $0.25 / 1M
  • Output tokens: $2 / 1M

Long context

OpenAI / GPT-5.2

  • Input tokens: $1.75 / 1M
  • Output tokens: $14 / 1M

Shared retrieval defaults

  • Embedding input: $0.02 / 1M
  • Rerank docs: $1 / 1K
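The Generation component can be reproduced from these pricing rows. This sketch assumes 250 output tokens per request, 60 requests per user per month, and a 900-token RAG base prompt; all three are inferred to match the report's figures, not stated inputs.

```python
# Reproducing the Generation rows from the pricing snapshot above.
# Output tokens (250), requests/user/month (60), and the 900-token RAG base
# prompt are inferred assumptions chosen to match the reported figures.

PRICES = {  # USD per 1M tokens
    "gpt-5-mini": {"input": 0.25, "output": 2.00},
    "gpt-5.2": {"input": 1.75, "output": 14.00},
}

def generation_cost_per_user_month(model, input_tokens, output_tokens,
                                   requests_per_user_month=60):
    p = PRICES[model]
    per_request = (input_tokens * p["input"]
                   + output_tokens * p["output"]) / 1_000_000
    return per_request * requests_per_user_month

# RAG baseline counts only the base prompt here; retrieved-chunk tokens are
# billed under the separate Retrieval component.
print(round(generation_cost_per_user_month("gpt-5-mini", 900, 250), 4))  # 0.0435
print(round(generation_cost_per_user_month("gpt-5.2", 2600, 250), 4))    # 0.483
```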