FAQ guide
What is an input token versus an output token?
Quick answer
Input tokens are everything you send in the prompt: the context the model reads before answering. Output tokens are the new text the model generates, including visible content and sometimes hidden formatting tokens, such as those that stop sequences interact with. Providers meter the two separately because generation runs one autoregressive step per token, and that compute accumulates with every token produced. Price lists carry both columns because their marginal costs differ in production.
Introduction
Think of input tokens as the question packet and output tokens as the streamed reply. Both pass through tokenizer stages, but the API logs them under distinct counters so finance can see whether spend is prompt bloat or verbose answers. Monitoring both helps squads fix the right bottleneck.
Tool calling and JSON modes still produce output tokens even when the user only sees a short bubble. Those structured completions may be shorter than prose, yet they are still billed as completion tokens.
Mechanical differences
During input processing the model attends across the provided context in parallel, subject to hardware and context-window limits. During output generation each new token typically depends on the tokens already chosen, which serializes work at inference.
Stop sequences halt generation early, trimming completion tokens and saving money when used thoughtfully.
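The serial loop and the effect of a stop sequence can be sketched with a toy generator. This is an illustration only, not any provider's implementation: `generate` and `make_scripted_model` are hypothetical names, the "model" replays a fixed script, and stop sequences are treated as single tokens, whereas real ones are strings that may span several tokens.

```python
from typing import Callable, List, Sequence


def generate(
    next_token: Callable[[List[str]], str],
    prompt_tokens: Sequence[str],
    stop_sequences: Sequence[str] = (),
    max_output_tokens: int = 64,
) -> List[str]:
    """Toy autoregressive loop: every step sees the prompt plus all prior output."""
    output: List[str] = []
    for _ in range(max_output_tokens):
        # Each new token depends on the tokens already chosen, so steps run in series.
        token = next_token(list(prompt_tokens) + output)
        if token in stop_sequences:
            break  # halt early; in this sketch the stop token itself is not kept
        output.append(token)
    return output


def make_scripted_model(script: Sequence[str], prompt_len: int) -> Callable[[List[str]], str]:
    """Hypothetical stand-in for a model: replays a script, then repeats its last token."""
    def next_token(context: List[str]) -> str:
        step = len(context) - prompt_len  # number of output tokens produced so far
        return script[min(step, len(script) - 1)]
    return next_token


prompt = ["What", "is", "6", "x", "7", "?"]
script = ["The", "answer", "is", "42", ".", "END", "and", "so", "on"]
model = make_scripted_model(script, prompt_len=len(prompt))

with_stop = generate(model, prompt, stop_sequences=["END"])
without_stop = generate(model, prompt, max_output_tokens=20)

print(len(with_stop), len(without_stop))  # prints: 5 20
```

With the stop sequence the reply ends after five tokens. Without it the loop runs until the cap, which is the runaway case that a maximum token limit exists to bound.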
Pricing implications
When output price exceeds input price, rewriting prompts to elicit shorter grounded answers can beat paying for sprawling completions. When input price dominates, audit attachments, logs, and few-shot examples baked into every call.
Some platforms offer discounts on cached input segments that repeat, which targets prompt-side spend rather than fresh completions.
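To make the trade-off concrete, here is a minimal cost sketch. The rates are invented for illustration and do not come from any vendor's price list, and `Rates` and `call_cost` are hypothetical names. It assumes cached tokens are reported as a subset of input tokens, which is one common convention but not universal.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rates:
    """Hypothetical prices in dollars per one million tokens."""
    input_per_million: float
    output_per_million: float
    cached_input_per_million: float


def call_cost(rates: Rates, input_tokens: int, output_tokens: int,
              cached_input_tokens: int = 0) -> float:
    """Dollar cost of one call; cached tokens are assumed to be part of input_tokens."""
    if not 0 <= cached_input_tokens <= input_tokens:
        raise ValueError("cached_input_tokens must be between 0 and input_tokens")
    fresh_input = input_tokens - cached_input_tokens
    total = (
        fresh_input * rates.input_per_million
        + cached_input_tokens * rates.cached_input_per_million
        + output_tokens * rates.output_per_million
    )
    return total / 1_000_000


# Made-up rates: output costs four times input, cached input costs a quarter of input.
rates = Rates(input_per_million=2.0, output_per_million=8.0, cached_input_per_million=0.5)

uncached = call_cost(rates, input_tokens=12_000, output_tokens=800)
cached = call_cost(rates, input_tokens=12_000, output_tokens=800, cached_input_tokens=10_000)

print(f"uncached: ${uncached:.4f}  cached: ${cached:.4f}")
```

In this made-up case the prompt dominates the bill even though output is priced higher per token, and caching the repeated 10,000 tokens roughly halves the cost without touching the completion side.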
Embeddings and classification
Endpoints that return vectors instead of free text generally still count input tokens and may lack a separate completion meter.
Illustrative split
A combined system and user message of five hundred tokens, followed by a two-hundred-token assistant reply, counts as five hundred input tokens and two hundred output tokens for billing, before any caching adjustments.
Common confusions
- Labeling prior assistant turns as output of the current call; history is usually input on the next turn.
- Assuming hidden chain-of-thought is billed the same way everywhere; whether a provider exposes it or charges for it varies widely by policy and product.
- Assuming client-side whitespace trimming reduces billed tokens, even though the server may rebuild the chat template and count tokens differently.
- Treating discounted cached input tokens as if they were free instead of reading the effective rate on your invoice.
- Comparing bills across providers without normalizing how each labels system, developer, and tool-bearing segments.
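The first confusion is easiest to see with numbers. The sketch below assumes the client resends the full conversation on every call and ignores any template or role-marker overhead a real server might add. `usage_per_call` is a hypothetical helper, and the token counts are invented.

```python
from typing import List, Tuple


def usage_per_call(system_tokens: int,
                   turns: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Return (input_tokens, output_tokens) for each call in a multi-turn chat.

    turns holds (user_tokens, assistant_tokens) per turn. Each call's input is
    the system prompt plus all earlier turns plus the new user message.
    """
    history = system_tokens
    usage = []
    for user_tokens, assistant_tokens in turns:
        input_tokens = history + user_tokens
        usage.append((input_tokens, assistant_tokens))
        # The reply just generated becomes part of the input on the next call.
        history = input_tokens + assistant_tokens
    return usage


calls = usage_per_call(system_tokens=100, turns=[(50, 200), (30, 150), (20, 100)])
for number, (input_tokens, output_tokens) in enumerate(calls, start=1):
    print(f"call {number}: input={input_tokens} output={output_tokens}")

total_input = sum(i for i, _ in calls)
total_output = sum(o for _, o in calls)
print(total_input, total_output)  # prints: 1080 450
```

The assistant generated 450 tokens in total, but each reply is re-read as input on later calls, so input grows from 150 to 380 to 550 across the three calls.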
Operational tips
- Plot input versus output over time to see whether compression should target prompts or answers.
- Teach product teams that every extra paragraph in the UI spec may balloon completion tokens.
- Use maximum token caps in risky endpoints to prevent runaway generation bills.
- Compare rates whenever you adopt a new model; spreads between input and output can flip.
- Store usage objects from each response in your observability pipeline for audits.
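As a rough sketch of the first and last tips, the snippet below rolls stored usage records up by day and reports what share of tokens came from the completion side. The record layout and the field names `timestamp`, `input_tokens`, and `output_tokens` are assumptions for illustration; providers name these fields differently, so map them from whatever usage object your API returns.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping


def daily_totals(records: Iterable[Mapping]) -> Dict[str, Dict[str, int]]:
    """Sum input and output tokens per calendar day (ISO 8601 timestamps assumed)."""
    totals: Dict[str, Dict[str, int]] = defaultdict(
        lambda: {"input_tokens": 0, "output_tokens": 0}
    )
    for record in records:
        day = record["timestamp"][:10]  # "YYYY-MM-DD" prefix of the timestamp
        totals[day]["input_tokens"] += record["input_tokens"]
        totals[day]["output_tokens"] += record["output_tokens"]
    return dict(totals)


def output_share(day_total: Mapping[str, int]) -> float:
    """Fraction of the day's tokens that were output; 0.0 when there was no traffic."""
    total = day_total["input_tokens"] + day_total["output_tokens"]
    return day_total["output_tokens"] / total if total else 0.0


# Invented sample records standing in for logged usage objects.
records = [
    {"timestamp": "2024-05-01T09:00:00Z", "input_tokens": 1200, "output_tokens": 300},
    {"timestamp": "2024-05-01T17:30:00Z", "input_tokens": 800, "output_tokens": 700},
    {"timestamp": "2024-05-02T08:15:00Z", "input_tokens": 500, "output_tokens": 1500},
]

totals = daily_totals(records)
for day in sorted(totals):
    print(day, totals[day], f"output share {output_share(totals[day]):.2f}")
```

A rising output share suggests trimming answers or tightening token caps, while a falling one points at prompt size. Weighting each side by its rate turns the same series into a spend view.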