Should I include tax?

It depends on geography and procurement rules. Label pre-tax and post-tax views separately so reviewers never have to argue about which figure they are reading.

How granular should routes be?

Finer granularity exposes outliers but costs engineering time. Start per service, then subdivide top spenders.

What about free retries?

They are rarely truly free to the business. Model worst-case retry policies explicitly.
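One way to make a retry policy explicit is to compute both the expected and the worst-case number of billed attempts per user action. The sketch below is illustrative: the function names are hypothetical, and it assumes each attempt fails independently and that every retry resends the full prompt and is billed in full.

```python
def expected_attempts(failure_rate: float, max_retries: int) -> float:
    """Expected billed attempts per user action.

    Assumes each attempt fails independently with probability failure_rate
    and the client retries up to max_retries times.
    """
    if not 0 <= failure_rate <= 1:
        raise ValueError("failure_rate must be in [0, 1]")
    # Attempt k+1 only happens if the first k attempts all failed.
    return sum(failure_rate ** k for k in range(max_retries + 1))


def worst_case_attempts(max_retries: int) -> int:
    """Upper bound: every allowed retry fires and is billed in full."""
    return 1 + max_retries


# Hypothetical policy: 5% failures, up to 3 retries.
expected_overhead = expected_attempts(0.05, 3) - 1  # extra spend as a fraction
worst_overhead = worst_case_attempts(3) - 1
```

The gap between the two overhead figures is the range a forecast should show finance, rather than a single number that assumes retries cost nothing.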

Can I use averages from my logs?

Use percentiles too. Means hide fat tails that dominate invoices.
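A minimal sketch of that comparison, using only the Python standard library. The sample data is made up, and `token_summary` is a hypothetical helper name.

```python
import statistics


def token_summary(token_counts):
    """Return the mean plus p50/p95/p99 for a list of per-call token counts."""
    # quantiles(n=100) returns the 99 cut points p1..p99.
    cuts = statistics.quantiles(token_counts, n=100, method="inclusive")
    return {
        "mean": statistics.fmean(token_counts),
        "p50": cuts[49],
        "p95": cuts[94],
        "p99": cuts[98],
    }


# Hypothetical sample: 95 short calls and 5 very long ones.
sample = [400] * 95 + [12_000] * 5
summary = token_summary(sample)
```

In this made-up sample the mean is 980 tokens while the median is 400, and the 5% of long calls account for about 61% of all tokens billed.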

How do I fold embeddings in?

Treat embedding jobs as separate line items with their own token histograms.
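As a sketch, each line item can carry its own histogram of tokens per job and its own rate. The line item names and the embedding rate below are placeholders, and the example assumes embedding jobs are billed on input tokens only.

```python
def histogram_cost_usd(token_histogram, rate_per_million):
    """Price a histogram that maps tokens-per-job to number of jobs."""
    total_tokens = sum(tokens * jobs for tokens, jobs in token_histogram.items())
    return total_tokens * rate_per_million / 1_000_000


# Hypothetical monthly line items, each with its own histogram and rate.
line_items = {
    "chat_prompt": histogram_cost_usd({800: 50_000}, 2.00),
    "chat_completion": histogram_cost_usd({300: 50_000}, 6.00),
    # Placeholder embedding rate; check the vendor rate card.
    "embedding_reindex": histogram_cost_usd({256: 200_000, 1024: 20_000}, 0.10),
}
total_usd = sum(line_items.values())
```

Keeping the embedding job on its own row means a reindexing spike shows up as a change in one line rather than as unexplained drift in the chat numbers.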

What documentation should I keep?

Keep the versioned prompt template, tokenizer counts, rate card date, and exchange rate snapshot for each major forecast.
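One possible shape for that record is a small immutable structure serialized next to the forecast. The field names and all values below are placeholders, not a required schema.

```python
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class ForecastSnapshot:
    prompt_template_version: str
    prompt_tokens: int       # tokenizer count for the prompt
    completion_tokens: int   # tokenizer count for the completion
    rate_card_date: str      # ISO date of the vendor price list used
    fx_pair: str             # currency pair, e.g. "USD/EUR"
    fx_rate: float           # exchange rate snapshot for that pair


# Placeholder values for illustration only.
snapshot = ForecastSnapshot(
    prompt_template_version="v12",
    prompt_tokens=800,
    completion_tokens=300,
    rate_card_date="2024-01-15",
    fx_pair="USD/EUR",
    fx_rate=0.92,
)
record = json.dumps(asdict(snapshot), sort_keys=True)
```

Storing one such record per major forecast lets an auditor reproduce the number without asking which prompt or rate card was in effect.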


How to calculate LLM inference cost

Quick answer

Start with prompt tokens and completion tokens for a single successful call, multiply each bucket by its published per-million or per-thousand dollar rate, sum the two sides, then multiply by successful requests per day and working days per month. Adjust for retries, caching discounts, and currency. Divide by success rate if partial failures trigger repeat calls so you forecast end-user actions, not just HTTP 200s.

Inference cost is the steady-state tax on shipped features. Unlike training, it repeats every time a user clicks a button. That makes it predictable once you measure token distributions honestly. Many teams start with a spreadsheet, but spreadsheets hide assumptions; calculators make them explicit.

This walkthrough targets engineering leads who need a reproducible method finance can audit. Pair the math with the AI Token Cost Calculator so you can iterate on model choice and token caps quickly.

Step one: normalize one representative call

Pick a happy-path example that matches production logging. Strip PII for documentation but keep lengths realistic. Record tokenizer counts for prompt and completion separately because rates differ.

If you have multiple user personas, build three archetypes instead of one mythical average. Weight them by traffic share when you roll up monthly numbers.
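The roll-up can be sketched as a traffic-weighted sum. The three archetypes and their per-call costs below are hypothetical, as is the helper name.

```python
def blended_call_usd(archetypes):
    """Traffic-weighted per-call cost across archetypes.

    Each archetype is a (traffic_share, per_call_usd) pair; shares must sum to 1.
    """
    total_share = sum(share for share, _ in archetypes)
    if abs(total_share - 1.0) > 1e-9:
        raise ValueError(f"traffic shares sum to {total_share}, expected 1.0")
    return sum(share * cost for share, cost in archetypes)


# Hypothetical archetypes: light, typical, and heavy users.
blended = blended_call_usd([
    (0.60, 0.0012),
    (0.30, 0.0034),
    (0.10, 0.0150),
])
```

Here the heavy archetype is only 10% of traffic but contributes nearly half of the blended cost, which is the kind of detail a single average would hide.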

Step two: scale to traffic and time

Multiply per-call dollars by expected successes per hour, then by hours per day and by active days per month. Seasonality matters for consumer apps; B2B tools may be weekday heavy. Do not forget cron jobs and background summarizers, which add calls outside interactive hours.

Add growth headroom explicitly instead of hiding it in rounding. Finance prefers transparent buffers they can trim.
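A sketch of this step that returns the baseline and the buffered figure separately, so the headroom stays visible. Parameter names and traffic numbers are illustrative, and background calls are assumed to cost the same per call as interactive ones, which may not hold for long summarization jobs.

```python
def monthly_forecast_usd(per_call_usd, successes_per_hour, hours_per_day,
                         days_per_month, background_calls_per_month=0,
                         growth_headroom=0.0):
    """Return (baseline, with_headroom) so the buffer is never hidden."""
    interactive_calls = successes_per_hour * hours_per_day * days_per_month
    baseline = per_call_usd * (interactive_calls + background_calls_per_month)
    return baseline, baseline * (1 + growth_headroom)


# Hypothetical B2B tool: 200 successes/hour, 10 hours/day, 22 weekdays,
# plus 6,000 cron-driven calls and a 20% growth buffer.
baseline, buffered = monthly_forecast_usd(0.0034, 200, 10, 22, 6_000, 0.20)
```

With these inputs the baseline covers 50,000 calls, and the 20% buffer appears as its own line that finance can trim without touching the traffic assumptions.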

Worked example (illustrative numbers)

Suppose a call uses 800 prompt tokens at $2 per million and 300 completion tokens at $6 per million. The prompt side is $0.0016 and the completion side is $0.0018, totaling $0.0034 per success. At 50,000 successes per month, that is $170 before discounts.

Monthly ≈ per_call_usd × successes_per_month × (1 + retry_overhead_fraction).
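The same arithmetic can be checked in a few lines of Python. Decimal is used here to avoid floating-point noise in small dollar amounts; the 5% retry overhead is an assumed value for illustration and is not part of the example above.

```python
from decimal import Decimal

MILLION = Decimal(1_000_000)

prompt_usd = Decimal(800) * Decimal("2") / MILLION      # 0.0016
completion_usd = Decimal(300) * Decimal("6") / MILLION  # 0.0018
per_call_usd = prompt_usd + completion_usd              # 0.0034

successes_per_month = 50_000
retry_overhead_fraction = Decimal("0.05")  # assumed; not from the example

before_retries = per_call_usd * successes_per_month            # 170
monthly_usd = before_retries * (1 + retry_overhead_fraction)   # 178.5
```

Under that assumed overhead, retries add $8.50 a month on top of the $170 baseline.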

Common mistakes

Using character counts divided by four instead of tokenizer counts.
Forgetting output tokens when the UI only displays prompt length.
Modeling retries as free when clients automatically resend full prompts.
Mixing currencies without documenting exchange assumptions.
Averaging token counts across wildly different endpoints into one useless number.

Tips for credible forecasts

Snapshot tokenizer outputs alongside latency metrics in observability.
Version your assumptions when model upgrades change defaults.
Publish per-feature unit costs internally to align PM and finance.
Each month, cross-check calculator outputs against one week of invoices.
Separate sandbox spend from production in tagging.
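The invoice cross-check from the tips above can be sketched as a relative error. The helper name is hypothetical, and scaling one week linearly to a month is a simplifying assumption that ignores seasonality.

```python
def forecast_error(forecast_monthly_usd, invoiced_week_usd, days_in_month=30):
    """Relative gap between the forecast and one invoiced week scaled to a month.

    Positive means the invoices run above the forecast.
    """
    invoiced_monthly = invoiced_week_usd * days_in_month / 7
    return (invoiced_monthly - forecast_monthly_usd) / forecast_monthly_usd


# Hypothetical check: $170 forecast versus $42 invoiced in one week.
gap = forecast_error(170, 42)
```

In this made-up case the week scales to $180 for the month, so invoices run about 6% above the forecast.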


Turn these ideas into concrete dollars

Compare models, simulate monthly traffic, and export shareable estimates in seconds. Numbers follow your config/models.php rates so you can mirror vendor tables before you commit to architecture.
