FAQ guide

How to calculate LLM inference cost

Quick answer

Start with the prompt tokens and completion tokens for a single successful call, multiply each bucket by its published per-million or per-thousand dollar rate, and sum the two sides. Multiply that per-call cost by successful requests per day and by active days per month. Then adjust for retries, caching discounts, and currency. If failed calls trigger repeat calls, divide by the success rate so the forecast covers every billed attempt behind an end-user action, not just the HTTP 200s.

Introduction

Inference cost is the steady-state tax on shipped features. Unlike training, it repeats every time a user clicks a button. That makes it predictable once you measure token distributions honestly. Many teams start with a spreadsheet, but spreadsheets hide assumptions; calculators make them explicit.

This walkthrough targets engineering leads who need a reproducible method finance can audit. Pair the math with the AI Token Cost Calculator so you can iterate on model choice and token caps quickly.

Step one: normalize one representative call

Pick a happy-path example that matches production logging. Strip PII for documentation but keep lengths realistic. Record tokenizer counts for prompt and completion separately because rates differ.
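The per-call arithmetic can be sketched as follows. This is a minimal illustration, not a vendor API: the function names, token counts, and rates are hypothetical, and the only assumption is that prompt and completion tokens are billed at separate per-million rates.

```python
def per_call_usd(prompt_tokens: int, completion_tokens: int,
                 prompt_rate_per_million: float,
                 completion_rate_per_million: float) -> float:
    """Dollar cost of one successful call, pricing each bucket separately."""
    prompt_cost = prompt_tokens * prompt_rate_per_million / 1_000_000
    completion_cost = completion_tokens * completion_rate_per_million / 1_000_000
    return prompt_cost + completion_cost


def per_thousand_to_per_million(rate_per_thousand: float) -> float:
    """Convert a per-thousand-token rate so both buckets share one unit."""
    return rate_per_thousand * 1_000


# Hypothetical call: 1,200 prompt tokens at $3 per million and
# 400 completion tokens quoted at $0.015 per thousand.
cost = per_call_usd(
    1_200, 400,
    prompt_rate_per_million=3.0,
    completion_rate_per_million=per_thousand_to_per_million(0.015),
)
print(f"{cost:.4f}")
```

Converting every rate to one unit before multiplying avoids the common slip of mixing per-thousand and per-million prices in the same sum.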

If you have multiple user personas, build three archetypes instead of one mythical average. Weight them by traffic share when you roll up monthly numbers.
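A traffic-weighted roll-up of archetypes might look like the sketch below. The archetype names, traffic shares, and per-call costs are invented for illustration; replace them with values from your own logs.

```python
def blended_per_call_usd(archetypes: list[dict]) -> float:
    """Traffic-weighted average cost per call across archetypes."""
    total_share = sum(a["traffic_share"] for a in archetypes)
    # Shares must cover all traffic, otherwise the blend is silently skewed.
    if abs(total_share - 1.0) > 1e-9:
        raise ValueError(f"traffic shares sum to {total_share}, expected 1.0")
    return sum(a["traffic_share"] * a["per_call_usd"] for a in archetypes)


# Hypothetical archetypes: shares and per-call costs are placeholders.
archetypes = [
    {"name": "quick lookup", "traffic_share": 0.6, "per_call_usd": 0.002},
    {"name": "standard chat", "traffic_share": 0.3, "per_call_usd": 0.005},
    {"name": "long document", "traffic_share": 0.1, "per_call_usd": 0.020},
]
blended = blended_per_call_usd(archetypes)
print(f"{blended:.4f}")
```

Keeping the archetypes as separate rows, rather than averaging token counts first, preserves the detail needed when one persona's share of traffic changes.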

Step two: scale to traffic and time

Multiply per-call dollars by expected successes per hour, then by active hours per day, then by active days per month. Seasonality matters for consumer apps; B2B tools may be weekday heavy. Do not forget cron jobs and background summarizers, which spend tokens even when no user is clicking.

Add growth headroom explicitly instead of hiding it in rounding. Finance prefers transparent buffers they can trim.
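One way to keep the headroom visible is to report it as its own line item, as in this sketch. The function name, the traffic figures, and the 20 percent buffer are assumptions chosen only to show the shape of the calculation.

```python
def scale_to_month(per_call_usd: float, successes_per_hour: float,
                   hours_per_day: float, days_per_month: float,
                   growth_headroom: float = 0.0) -> dict:
    """Scale a per-call cost to a month, reporting headroom separately."""
    base = per_call_usd * successes_per_hour * hours_per_day * days_per_month
    headroom = base * growth_headroom
    return {"base_usd": base, "headroom_usd": headroom,
            "total_usd": base + headroom}


# Hypothetical weekday-heavy B2B tool: 100 successes per hour,
# 10 active hours per day, 22 working days, 20% growth buffer.
estimate = scale_to_month(0.0034, 100, 10, 22, growth_headroom=0.20)
for label, usd in estimate.items():
    print(f"{label}: {usd:.2f}")
```

Background jobs can be run through the same function with their own hourly volume and added as separate rows, so they are not lost inside the interactive traffic estimate.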

Worked example (illustrative numbers)

Suppose a call uses 800 prompt tokens at $2.00 per million and 300 completion tokens at $6.00 per million. The prompt side is $0.0016 and the completion side is $0.0018, totaling $0.0034 per success. At 50,000 successes per month, that is $170 before discounts.

Monthly ≈ per_call_usd × successes_per_month × (1 + retry_overhead_fraction).
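The worked example and the monthly formula can be reproduced in a few lines. The sketch below also shows one hedged way to derive retry_overhead_fraction from a success rate: it assumes every failed attempt is billed at full cost and retried until it succeeds, with attempts independent of each other, so expected attempts per success are 1 / success_rate. The 95 percent success rate is a hypothetical input, not a figure from this guide.

```python
def retry_overhead_from_success_rate(success_rate: float) -> float:
    """Extra billed attempts per success, assuming retry-until-success
    and full-price failures (expected attempts = 1 / success_rate)."""
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    return 1 / success_rate - 1


def monthly_with_retries(per_call_usd: float, successes_per_month: int,
                         retry_overhead_fraction: float = 0.0) -> float:
    """Monthly = per_call_usd * successes_per_month * (1 + retry overhead)."""
    return per_call_usd * successes_per_month * (1 + retry_overhead_fraction)


# Numbers from the worked example: 800 prompt tokens at $2 per million,
# 300 completion tokens at $6 per million, 50,000 successes per month.
per_call = 800 * 2.0 / 1_000_000 + 300 * 6.0 / 1_000_000
no_retries = monthly_with_retries(per_call, 50_000)

# Hypothetical 95% success rate, used only to illustrate the overhead term.
overhead = retry_overhead_from_success_rate(0.95)
with_retries = monthly_with_retries(per_call, 50_000, overhead)

print(f"{no_retries:.2f} {with_retries:.2f}")
```

Under these assumptions, multiplying by (1 + retry_overhead_fraction) and dividing by the success rate are the same adjustment, so apply only one of them.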

Common mistakes

  • Using character counts divided by four instead of tokenizer counts.
  • Forgetting output tokens when the UI only displays prompt length.
  • Modeling retries as free when clients automatically resend full prompts.
  • Mixing currencies without documenting exchange assumptions.
  • Averaging token counts across wildly different endpoints into one useless number.

Tips for credible forecasts

  • Log tokenizer counts for prompt and completion alongside latency metrics in your observability stack.
  • Version your assumptions when model upgrades change defaults.
  • Publish per-feature unit costs internally to align PM and finance.
  • Each month, cross-check calculator outputs against one week of invoiced usage.
  • Separate sandbox spend from production in tagging.
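The invoice cross-check from the tips above can be reduced to a single drift figure. This is a sketch under stated assumptions: the function name, the dollar amounts, and the 10 percent tolerance are hypothetical, and both inputs are assumed to cover the same week and the same production-only tags.

```python
def forecast_drift(forecast_usd: float, invoiced_usd: float) -> float:
    """Relative gap between forecast and invoice; positive means over-forecast."""
    if invoiced_usd <= 0:
        raise ValueError("invoiced_usd must be positive")
    return (forecast_usd - invoiced_usd) / invoiced_usd


# Hypothetical week: the calculator predicted $42.00, the invoice shows $40.00.
TOLERANCE = 0.10  # assumed threshold; choose one your finance team accepts
drift = forecast_drift(42.00, 40.00)
needs_review = abs(drift) > TOLERANCE
print(f"drift={drift:.1%} needs_review={needs_review}")
```

A drift that stays outside the tolerance usually points to a stale assumption, such as changed token caps or an unmodeled background job, rather than to a pricing error.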

Turn these ideas into concrete dollars

Compare models, simulate monthly traffic, and export shareable estimates in seconds. Numbers follow your config/models.php rates so you can mirror vendor tables before you commit to architecture.