RAG system cost estimator

RAG systems incur costs in three places: embeddings, vector search infrastructure, and chat completions. Teams that model only chat completions underestimate true spend.

Token usage patterns in RAG

A typical query might retrieve five chunks of 600 tokens each (3,000 tokens) plus a system prompt. Without caching, every repeated query pays that prompt cost again.
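
As a rough illustration of that arithmetic, the sketch below totals the prompt tokens for one query. The five chunks of 600 tokens come from the text; the 200-token system prompt and 50-token question are assumed sizes, and the function name is hypothetical.

```python
def prompt_tokens_per_query(chunks: int, tokens_per_chunk: int,
                            system_prompt_tokens: int,
                            question_tokens: int = 0) -> int:
    """Total prompt tokens sent to the chat model for one RAG query."""
    return chunks * tokens_per_chunk + system_prompt_tokens + question_tokens


# Five chunks of 600 tokens (from the text); 200 and 50 are assumed sizes.
per_query = prompt_tokens_per_query(5, 600, system_prompt_tokens=200,
                                    question_tokens=50)
print(per_query)      # 3250

# Without caching, asking the same question three times pays the prompt three times.
print(per_query * 3)  # 9750
```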

RAG scenarios

Scenario                  Prompt tokens  Output tokens  Model             Est. cost / request
Support knowledge base    3800           350            GPT-4o mini       $0.0008
Legal research assistant  9500           700            GPT-4o            $0.0308
Developer docs bot        5200           500            Claude 3.5 Haiku  $0.0062

Figures use rates from config/models.php; confirm against your provider before billing decisions.

Monthly estimates

  • Production RAG assistant: 7,000 questions per weekday.

    Per request (rounded): $0.0009
    Monthly (7,000 req/day × 22 days): $140.91
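
The monthly projection is simple multiplication, sketched below. Note that the rounded $0.0009 gives $138.60, not $140.91. The displayed monthly figure is consistent with an unrounded per-request cost of about $0.000915 ($140.91 ÷ 154,000 requests), which is an inference from the numbers above rather than a value stated on this page.

```python
def monthly_cost(requests_per_day: int, days: int, cost_per_request: float) -> float:
    """Project monthly spend from daily volume and a per-request cost."""
    return round(requests_per_day * days * cost_per_request, 2)


print(monthly_cost(7000, 22, 0.0009))    # 138.6  (using the rounded display value)
print(monthly_cost(7000, 22, 0.000915))  # 140.91 (using the inferred unrounded value)
```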

Infrastructure considerations

Vector index size, replication factor, and embedding dimensions drive hosting bills independent of chat tokens.
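
A back-of-the-envelope sizing sketch follows. It assumes 4-byte float32 values and counts only raw vector storage; real indexes add metadata and graph overhead that varies by engine, modeled here as an optional `overhead` fraction. The corpus size and dimension count in the example are hypothetical.

```python
def index_size_gb(num_vectors: int, dimensions: int,
                  bytes_per_value: int = 4,      # assumption: float32 embeddings
                  replication_factor: int = 1,
                  overhead: float = 0.0) -> float:
    """Approximate stored size in GB (1 GB = 1e9 bytes) across all replicas."""
    raw_bytes = num_vectors * dimensions * bytes_per_value
    return raw_bytes * (1 + overhead) * replication_factor / 1e9


# Hypothetical corpus: 1M chunks, 1536-dimensional vectors, 2 replicas.
print(round(index_size_gb(1_000_000, 1536, replication_factor=2), 3))  # 12.288
```

Halving the dimensions or the replication factor halves this figure, regardless of how many chat tokens the system consumes.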

Model recommendations

Mini tiers suffice when retrieval quality is high; upgrade models when answers need synthesis across conflicting chunks.

Optimization recommendations

Tune top-k, use rerankers sparingly, compress chunks, and deduplicate sources at index time.
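
One way to deduplicate at index time is to hash a normalized form of each chunk and skip repeats, as in the minimal sketch below. It catches only exact duplicates after lowercasing and whitespace normalization; near-duplicate detection would need a different technique. The function name is hypothetical.

```python
import hashlib


def dedupe_chunks(chunks: list[str]) -> list[str]:
    """Keep the first occurrence of each chunk, comparing normalized text."""
    seen: set[str] = set()
    unique: list[str] = []
    for text in chunks:
        # Normalize case and whitespace so trivial variants hash identically.
        normalized = " ".join(text.lower().split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique


docs = ["Reset your password.", "reset  your PASSWORD.", "Contact support."]
print(dedupe_chunks(docs))  # ['Reset your password.', 'Contact support.']
```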

ROI examples

RAG pays off when it reduces time-to-answer for specialists. Quantify both the time saved and the reduction in errors.

Budget guidance

Separate one-time ingestion costs from steady-state query costs; ingestion costs spike during content migrations.
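
The split can be kept explicit in a budget model, as in the sketch below. The corpus size, embedding rate, and monthly query cost are placeholder values, not figures from this page or any provider.

```python
def embedding_cost(tokens: int, rate_per_million: float) -> float:
    """One-time cost to embed a corpus of the given token count."""
    return tokens / 1_000_000 * rate_per_million


def budget(one_time: float, monthly: float, months: int) -> dict[str, float]:
    """Report ingestion and steady-state spend as separate budget lines."""
    recurring = monthly * months
    return {"one_time": one_time, "recurring": recurring,
            "total": one_time + recurring}


# Placeholder inputs: 50M-token corpus, $0.10 per 1M embedding tokens,
# $150 per month of query traffic, 12-month horizon.
ingest = embedding_cost(50_000_000, 0.10)
print(budget(round(ingest, 2), 150.0, 12))
```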

FAQ: RAG operating costs

Short answers mirror the structured data on this page for search engines and readers.

How expensive are embedding refreshes?
They scale with corpus size and embedding dimensions. Schedule incremental updates so that only new or changed content is re-embedded.

Does reranking add tokens?
Not necessarily. Cross-encoder rerankers may run on GPUs instead of consuming LLM tokens; account for that compute separately.

What chunk size is cheapest?
Smaller chunks reduce per-query tokens but can hurt answer quality. Test retrieval metrics before shrinking them.

Can caching help RAG?
Yes, for repeated FAQs. Cache final answers and define invalidation rules.
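
A minimal sketch of answer caching with two invalidation rules: a time-to-live and a version counter that is bumped whenever the index is refreshed. The class and method names are hypothetical, and matching is on normalized question text only, so paraphrased questions will miss the cache.

```python
import time


class AnswerCache:
    """Cache final answers; entries expire by TTL or when the index changes."""

    def __init__(self, ttl_seconds: float, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self.index_version = 0
        self._store: dict[str, tuple[str, float, int]] = {}

    @staticmethod
    def _key(question: str) -> str:
        # Normalize case and whitespace so trivial variants share an entry.
        return " ".join(question.lower().split())

    def get(self, question: str):
        entry = self._store.get(self._key(question))
        if entry is None:
            return None
        answer, stored_at, version = entry
        expired = self.clock() - stored_at > self.ttl
        stale = version != self.index_version
        return None if (expired or stale) else answer

    def put(self, question: str, answer: str) -> None:
        self._store[self._key(question)] = (answer, self.clock(), self.index_version)

    def invalidate_all(self) -> None:
        """Call after re-indexing so old answers are no longer served."""
        self.index_version += 1


cache = AnswerCache(ttl_seconds=3600)
cache.put("How do I reset my password?", "Use the account settings page.")
print(cache.get("how do I  reset my PASSWORD?"))  # Use the account settings page.
cache.invalidate_all()
print(cache.get("How do I reset my password?"))   # None
```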

Estimate RAG chat token costs

Set prompt tokens near your average retrieved chunk total plus instructions.

Prefilled for this page’s scenario. Pricing loads from config/models.php and /api/pricing.

Calculator

Cost = ((prompt ÷ 1000 × P_in) + (completion ÷ 1000 × P_out)) × requests, where P_in and P_out are the input and output prices per 1,000 tokens.
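
The formula translates directly into code. In the sketch below, the per-1,000-token rates are assumptions chosen because they reproduce the $0.0008 figure for the support knowledge base scenario; the page's actual rates load from config/models.php.

```python
def request_cost(prompt_tokens: int, completion_tokens: int,
                 p_in: float, p_out: float, requests: int = 1) -> float:
    """Cost = (prompt / 1000 * p_in + completion / 1000 * p_out) * requests."""
    per_request = prompt_tokens / 1000 * p_in + completion_tokens / 1000 * p_out
    return per_request * requests


# Assumed rates in dollars per 1,000 tokens (not taken from this page).
P_IN, P_OUT = 0.00015, 0.0006

# Support knowledge base scenario: 3800 prompt tokens, 350 output tokens.
print(round(request_cost(3800, 350, P_IN, P_OUT), 4))  # 0.0008
```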

Multi-model comparison

Toggle models to compare the same workload. The cheapest option is highlighted.

Monthly cost simulator

Project from average daily requests (uses tokens above).

Uses primary model rates for projections.

Token estimator

Rough heuristic: ~4 characters ≈ 1 token for Latin text (indicative only).
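
The heuristic amounts to a character count divided by four, as sketched below. It is indicative only; a real tokenizer will give different counts, especially for code or non-Latin text.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate: about 4 characters per token for Latin text."""
    return math.ceil(len(text) / chars_per_token)


print(estimate_tokens(""))                                     # 0
print(estimate_tokens("Summarize the refund policy."))         # 7
```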

API budget planner

Set a monthly cap to see how many identical requests fit (primary model).
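
The planner's calculation can be sketched as a floor division of the cap by the per-request cost. The $500 cap is a hypothetical input; the $0.0308 per-request cost is the legal research assistant figure from the scenarios table.

```python
import math


def max_requests(monthly_cap: float, cost_per_request: float) -> int:
    """Number of identical requests that fit under a monthly budget cap."""
    if cost_per_request <= 0:
        raise ValueError("cost_per_request must be positive")
    return math.floor(monthly_cap / cost_per_request)


print(max_requests(500.0, 0.0308))  # 16233
```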

Prompt optimization analyzer

Collapse whitespace and tighten wording to preview savings at the primary model.
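
The whitespace half of that analysis is easy to sketch; tightening wording needs human or model judgment and is not attempted here. The token delta uses the same rough 4-characters-per-token assumption as the token estimator, and the function names are hypothetical.

```python
import math
import re


def collapse_whitespace(prompt: str) -> str:
    """Replace every run of whitespace with one space and trim the ends."""
    return re.sub(r"\s+", " ", prompt).strip()


def token_delta(original: str, shortened: str, chars_per_token: float = 4.0) -> int:
    """Estimated tokens saved, using a rough characters-per-token heuristic."""
    def estimate(text: str) -> int:
        return math.ceil(len(text) / chars_per_token)
    return estimate(original) - estimate(shortened)


raw = "Answer   using\n\n   only the   provided context."
short = collapse_whitespace(raw)
print(short)  # Answer using only the provided context.
print(token_delta(raw, short))
```

Be careful with prompts where whitespace is meaningful, such as code or indented examples; collapsing it there changes the content.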

Fine-tuning cost sketch

Order-of-magnitude helper: training tokens × epochs × rate + storage.
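
That helper can be sketched as below. All rates and sizes in the example are placeholders; providers price training and hosting differently, so treat the result as an order-of-magnitude figure.

```python
def fine_tune_cost(training_tokens: int, epochs: int,
                   train_rate_per_million: float,
                   storage_per_month: float = 0.0,
                   months: int = 1) -> float:
    """Training tokens x epochs x rate, plus storage for the given months."""
    training = training_tokens * epochs / 1_000_000 * train_rate_per_million
    return training + storage_per_month * months


# Placeholder inputs: 2M training tokens, 3 epochs, $10 per 1M trained tokens,
# $5 per month of storage.
print(fine_tune_cost(2_000_000, 3, 10.0, storage_per_month=5.0))  # 65.0
```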

Team usage calculator

Multiply per-person daily volume by team size (primary model).

Cost per feature

Price a single product surface (e.g., one chat turn or one generated article).

Uses prompt & completion tokens from the calculator for one invocation.

Share & export

Serialize inputs in the URL hash or copy a text summary.
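
One plausible way to serialize inputs into a URL hash is a sorted query-string encoding, sketched below with the standard library. The parameter names (`prompt`, `completion`, `model`) are hypothetical; the page's actual hash format is not documented here.

```python
from urllib.parse import parse_qsl, urlencode


def to_hash(inputs: dict) -> str:
    """Encode calculator inputs as a URL fragment with keys in sorted order."""
    return "#" + urlencode(sorted(inputs.items()))


def from_hash(fragment: str) -> dict:
    """Decode a fragment back into a dict; all values come back as strings."""
    return dict(parse_qsl(fragment.lstrip("#")))


fragment = to_hash({"prompt": 3800, "completion": 350, "model": "gpt-4o-mini"})
print(fragment)             # #completion=350&model=gpt-4o-mini&prompt=3800
print(from_hash(fragment))  # {'completion': '350', 'model': 'gpt-4o-mini', 'prompt': '3800'}
```

Sorting the keys keeps the fragment stable, so identical inputs always produce the same shareable link.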

Calculation history

Stored in your browser only (LocalStorage).