RAG system cost estimator
RAG systems incur charges for embeddings, vector search infrastructure, and chat completions. Teams that model only chat completions underestimate true spend.
Token usage patterns in RAG
Each query might retrieve five chunks of 600 tokens each, which adds 3,000 prompt tokens before the system prompt and the question are counted. Without caching, every repeat of the same query pays that cost again.
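The arithmetic behind that example can be sketched in a few lines. The system prompt and question sizes below are hypothetical placeholders, not figures from this page; substitute measured values from your own tokenizer.

```python
def prompt_tokens(chunks: int, tokens_per_chunk: int,
                  system_tokens: int = 0, question_tokens: int = 0) -> int:
    """Prompt tokens billed for one RAG query."""
    return chunks * tokens_per_chunk + system_tokens + question_tokens


# Five chunks of 600 tokens, as in the example above.
retrieved_only = prompt_tokens(5, 600)

# Hypothetical sizes for the system prompt (300) and the question (50).
per_query = prompt_tokens(5, 600, system_tokens=300, question_tokens=50)

# Asking the same question three times without caching triples the bill.
uncached_repeat = per_query * 3

print(retrieved_only, per_query, uncached_repeat)
```

Retrieved context dominates the prompt in this sketch, which is why top-k and chunk size are usually the first levers to examine.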
RAG scenarios
| Scenario | Prompt tokens | Output tokens | Model | Est. cost / request (USD) |
|---|---|---|---|---|
| Support knowledge base | 3800 | 350 | GPT-4o mini | $0.0008 |
| Legal research assistant | 9500 | 700 | GPT-4o | $0.0308 |
| Developer docs bot | 5200 | 500 | Claude 3.5 Haiku | $0.0062 |
Figures use rates from config/models.php; confirm against your provider before billing decisions.
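A minimal sketch of how per-request figures like these can be derived. The per-million-token rates in `RATES_PER_MILLION` are assumptions for illustration, not the contents of config/models.php; they were chosen because they reproduce the table's figures, and they should be checked against current provider pricing. `Decimal` is used so that rounding to four places behaves predictably.

```python
from decimal import Decimal, ROUND_HALF_UP

# ASSUMED rates in USD per 1M tokens: (input, output). Verify before use.
RATES_PER_MILLION = {
    "GPT-4o mini": (Decimal("0.15"), Decimal("0.60")),
    "GPT-4o": (Decimal("2.50"), Decimal("10.00")),
    "Claude 3.5 Haiku": (Decimal("0.80"), Decimal("4.00")),
}


def cost_per_request(model: str, prompt_tokens: int, output_tokens: int) -> Decimal:
    """Cost of one request in USD, rounded half-up to four decimal places."""
    input_rate, output_rate = RATES_PER_MILLION[model]
    total = (prompt_tokens * input_rate + output_tokens * output_rate) / Decimal(1_000_000)
    return total.quantize(Decimal("0.0001"), rounding=ROUND_HALF_UP)


scenarios = [
    ("Support knowledge base", "GPT-4o mini", 3800, 350),
    ("Legal research assistant", "GPT-4o", 9500, 700),
    ("Developer docs bot", "Claude 3.5 Haiku", 5200, 500),
]
for name, model, prompt, output in scenarios:
    print(f"{name}: ${cost_per_request(model, prompt, output)}")
```

Note that prompt tokens account for most of the cost in every scenario here, even though output tokens are priced several times higher per token.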
Monthly estimates
Production RAG assistant: 7,000 questions per weekday.
- Per request: about $0.0009 (rounded)
- Monthly (7,000 req/day × 22 weekdays): $140.91, which corresponds to an unrounded per-request cost of $0.000915
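The monthly projection is a straight multiplication, sketched below. One detail worth noticing: 7,000 × 22 × $0.0009 is $138.60, while the $140.91 total corresponds to a per-request cost of $0.000915, so the displayed per-request figure appears to be rounded. The function name and the 22-day default are illustrative choices.

```python
def monthly_cost(cost_per_request: float, requests_per_day: int,
                 billable_days: int = 22) -> float:
    """Projected monthly spend for a steady daily request volume."""
    return cost_per_request * requests_per_day * billable_days


from_rounded = monthly_cost(0.0009, 7000)      # displayed per-request figure
from_unrounded = monthly_cost(0.000915, 7000)  # rate implied by the monthly total

print(f"{from_rounded:.2f}")    # 138.60
print(f"{from_unrounded:.2f}")  # 140.91
```

At this volume a rounding difference in the fifth decimal place moves the monthly total by more than two dollars, so carry unrounded per-request costs through any projection.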
Infrastructure considerations
Vector index size, replication factor, and embedding dimensions drive hosting bills independent of chat tokens.
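As a rough illustration of how those three factors combine, the sketch below estimates raw vector storage. It assumes 4-byte float32 values and treats index overhead (graph links, metadata) as a simple multiplier; both are assumptions, and real vector databases differ. The corpus size and dimension count in the example are hypothetical.

```python
def index_size_gb(num_vectors: int, dimensions: int,
                  bytes_per_value: int = 4,
                  replication_factor: int = 1,
                  overhead: float = 1.0) -> float:
    """Approximate storage in GB (10^9 bytes) for a replicated vector index."""
    raw_bytes = num_vectors * dimensions * bytes_per_value
    return raw_bytes * replication_factor * overhead / 1e9


# Hypothetical corpus: 1M chunks, 1536-dimensional embeddings, 2 replicas.
print(f"{index_size_gb(1_000_000, 1536, replication_factor=2):.3f} GB")

# Halving the dimensions halves the footprint at the same replication.
print(f"{index_size_gb(1_000_000, 768, replication_factor=2):.3f} GB")
```

None of these bytes depend on query volume, which is why this line item belongs in the budget separately from chat tokens.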
Model recommendations
Mini tiers suffice when retrieval quality is high; upgrade models when answers need synthesis across conflicting chunks.
Optimization recommendations
Tune top-k, use rerankers sparingly, compress chunks, and deduplicate sources at index time.
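Deduplicating at index time can be as simple as hashing normalized chunk text, as in the sketch below. This catches only exact duplicates after case and whitespace normalization; near-duplicate detection would need a different technique, and the function name is illustrative.

```python
import hashlib


def dedupe_chunks(chunks: list[str]) -> list[str]:
    """Keep the first occurrence of each chunk, ignoring case and whitespace differences."""
    seen: set[str] = set()
    unique: list[str] = []
    for chunk in chunks:
        # Normalize so trivial formatting differences hash identically.
        normalized = " ".join(chunk.lower().split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(chunk)
    return unique


chunks = [
    "Reset your password.",
    "reset  your PASSWORD.",
    "Contact support.",
]
print(dedupe_chunks(chunks))
```

Removing duplicates before embedding saves ingestion cost and keeps top-k slots from being spent on repeated text.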
ROI examples
RAG wins when it reduces time-to-answer for specialists. Quantify both the time saved and the reduction in errors.
Budget guidance
Separate one-time ingestion from steady-state query costs; ingestion spikes during content migrations.
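One way to keep the two budget lines apart is to compute them with separate functions, as sketched here. The embedding rate, corpus size, and churn fraction are hypothetical placeholders, not figures from this page.

```python
def ingestion_cost(corpus_tokens: int, embedding_rate_per_million: float) -> float:
    """One-time cost (USD) of embedding an entire corpus."""
    return corpus_tokens * embedding_rate_per_million / 1_000_000


def refresh_cost(corpus_tokens: int, changed_fraction: float,
                 embedding_rate_per_million: float) -> float:
    """Recurring cost (USD) of re-embedding only the changed share of the corpus."""
    changed_tokens = int(corpus_tokens * changed_fraction)
    return ingestion_cost(changed_tokens, embedding_rate_per_million)


# Hypothetical inputs: 50M-token corpus, $0.02 per 1M embedding tokens,
# 5% of content changing per month.
one_time = ingestion_cost(50_000_000, 0.02)
monthly_refresh = refresh_cost(50_000_000, 0.05, 0.02)

print(f"one-time ingestion: ${one_time:.2f}")
print(f"monthly refresh:    ${monthly_refresh:.2f}")
```

A content migration is effectively a second ingestion event, so budget it as a repeat of the one-time line rather than folding it into the steady-state figure.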
FAQ: RAG operating costs
Short answers mirror the structured data on this page for search engines and readers.
- How expensive are embedding refreshes? They scale with corpus size and embedding dimensions, so schedule incremental updates.
- Does reranking add tokens? Cross-encoder rerankers may run on GPUs rather than consume LLM tokens; account for that compute separately.
- What chunk size is cheapest? Smaller chunks reduce per-query tokens but can hurt answer quality, so test retrieval metrics before shrinking them.
- Can caching help RAG? Yes, for repeated FAQs: cache final answers and define invalidation rules.