Do retries double output charges?

If a client abandons a partial stream and issues a fresh request, you generally pay for each attempt according to what the server processed. Implement idempotency and backoff carefully.
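A capped retry loop with exponential backoff is one way to bound how many billed attempts a single logical request can trigger. The sketch below is a minimal illustration, not a provider SDK: `call_with_backoff`, `TransientError`, and the injectable `sleep` parameter are hypothetical names chosen for this example.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for errors worth retrying (e.g. timeouts)."""


def call_with_backoff(
    request: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `request`, retrying on TransientError with exponential backoff.

    Each attempt may be billed for what the server processed, so the cap
    on attempts is also a cap on worst-case spend.
    """
    for attempt in range(max_attempts):
        try:
            return request()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            sleep(base_delay * (2 ** attempt))  # 0.5s, 1s, 2s, ...
    raise ValueError("max_attempts must be at least 1")
```

If your provider supports an idempotency key, reusing one key across attempts is the usual complement to this loop; support varies, so check the provider's documentation.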

Are reasoning traces billed as output?

Products that expose extended thinking may bill those segments as additional tokens. Read model-specific documentation rather than assuming parity with standard chat models.

Why is my input huge with RAG?

Retrieval augmented generation injects many document chunks into the prompt, inflating input tokens. Tune chunk sizes and top-k selections explicitly for cost and quality.
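To see how chunk size and top-k drive prompt size, a back-of-the-envelope estimate is often enough. The function below is a rough sketch under simple assumptions: every chunk has about the same token count, and `per_chunk_overhead` (a hypothetical parameter) covers separators or citation markers your template adds. All numbers are made up for illustration.

```python
def estimate_rag_input_tokens(
    base_prompt_tokens: int,
    top_k: int,
    chunk_tokens: int,
    per_chunk_overhead: int = 0,
) -> int:
    """Rough input-token estimate for one RAG call."""
    return base_prompt_tokens + top_k * (chunk_tokens + per_chunk_overhead)


# Compare two retrieval settings with illustrative numbers.
small = estimate_rag_input_tokens(base_prompt_tokens=300, top_k=4, chunk_tokens=256)
large = estimate_rag_input_tokens(base_prompt_tokens=300, top_k=10, chunk_tokens=512)
print(small, large)  # 1324 5420
```

Real counts depend on the provider's tokenizer, so treat the result as a planning figure and confirm against the usage data returned with each response.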

Can I send empty output to save money?

You cannot avoid output charges by requesting zero tokens in a chat completion; the model must emit something unless the API rejects the call. Classification endpoints may return compact labels.

Do tool results count as input next turn?

Yes, tool outputs become part of later context and are tokenized as input on subsequent steps in an agent loop.
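The growth is easiest to see by tracking the context size call by call. This sketch assumes the full history is resent on every step with no caching or truncation; `input_tokens_per_call` is a hypothetical helper, and the token counts are invented for the example.

```python
from typing import List


def input_tokens_per_call(initial_prompt: int, tool_rounds: List[int]) -> List[int]:
    """Input tokens read by each call in an agent loop.

    `tool_rounds[i]` is the number of tokens (assistant tool call plus the
    tool result) appended to the context after call i.
    """
    per_call = [initial_prompt]
    context = initial_prompt
    for added in tool_rounds:
        context += added          # the tool exchange joins the history
        per_call.append(context)  # the next call reads all of it as input
    return per_call


calls = input_tokens_per_call(initial_prompt=400, tool_rounds=[150, 300])
print(calls, sum(calls))  # [400, 550, 850] 1800
```

Two tool rounds add only 450 tokens of new material, yet the loop reads 1,800 input tokens in total because earlier context is counted again on each call.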

How do batch APIs treat the input/output split?

Batch APIs still classify tokens as input or output; they mainly change throughput and sometimes price. Inspect the examples in the official batch guides for your provider.

What is an input versus an output token?

Quick answer

Input tokens are everything you send in the prompt context that the model reads before answering. Output tokens are the new text the model generates, including visible content and, on some providers, formatting or control tokens that are not shown to the user. Providers meter the two separately because generation proceeds autoregressively, one token at a time, and that compute accumulates. Price lists show both columns because their marginal costs differ in production.

Think of input tokens as the question packet and output tokens as the streamed reply. Both pass through the tokenizer, but the API logs them under distinct counters so finance can see whether spend comes from prompt bloat or verbose answers. Monitoring both helps teams fix the right bottleneck.

Tool calling and JSON modes still produce output tokens even when the user only sees a short bubble. Those structured completions may be shorter than prose yet remain the billed completion side.

Mechanical differences

During input processing the model attends across the provided context in parallel, up to the limits of the hardware and the context window. During output generation each new token typically depends on the tokens already chosen, which serializes the work at inference time.

Stop sequences halt generation early, trimming completion tokens and saving money when used thoughtfully.
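The effect of a stop sequence can be mimicked with a toy token stream. This is a simulation only, not how any particular inference server is implemented: `generate_until_stop` is a hypothetical helper, the tokens are plain strings, and whether the stop sequence itself is billed depends on the provider.

```python
from typing import Iterable, Tuple


def generate_until_stop(tokens: Iterable[str], stop: str) -> Tuple[str, int]:
    """Consume tokens one by one and halt once `stop` appears in the text.

    Returns the text before the stop sequence and the number of tokens
    consumed, mimicking how generation ends early.
    """
    text = ""
    used = 0
    for token in tokens:
        text += token
        used += 1
        idx = text.find(stop)
        if idx != -1:
            return text[:idx], used  # drop the stop sequence and what follows
    return text, used


stream = ["Answer", ":", " 42", "\n", "Explanation", " follows"]
print(generate_until_stop(stream, stop="\n"))  # ('Answer: 42', 4)
```

Here four of the six tokens are consumed, so the trailing explanation is never generated or paid for.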

Pricing implications

When output price exceeds input price, rewriting prompts to elicit shorter grounded answers can beat paying for sprawling completions. When input price dominates, audit attachments, logs, and few-shot examples baked into every call.

Some platforms offer discounts on cached input segments that repeat, which targets prompt-side spend rather than fresh completions.
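The arithmetic behind these trade-offs fits in a few lines. The sketch below assumes rates quoted in dollars per million tokens and a single discounted rate for cached input; `estimate_cost_usd` and every rate shown are placeholders for illustration, not any vendor's actual prices.

```python
from typing import Optional


def estimate_cost_usd(
    input_tokens: int,
    output_tokens: int,
    input_rate: float,
    output_rate: float,
    cached_input_tokens: int = 0,
    cached_input_rate: Optional[float] = None,
) -> float:
    """Estimate the cost of one call; rates are USD per million tokens."""
    if cached_input_tokens > input_tokens:
        raise ValueError("cached_input_tokens cannot exceed input_tokens")
    if cached_input_rate is None:
        cached_input_rate = input_rate  # no discount configured
    fresh_input = input_tokens - cached_input_tokens
    total = (
        fresh_input * input_rate
        + cached_input_tokens * cached_input_rate
        + output_tokens * output_rate
    )
    return total / 1_000_000


# Placeholder rates: $2 input, $8 output, $0.50 cached input per million tokens.
no_cache = estimate_cost_usd(10_000, 2_000, input_rate=2.0, output_rate=8.0)
with_cache = estimate_cost_usd(
    10_000, 2_000, input_rate=2.0, output_rate=8.0,
    cached_input_tokens=6_000, cached_input_rate=0.5,
)
print(round(no_cache, 6), round(with_cache, 6))  # 0.036 0.027
```

With these placeholder rates the 2,000 output tokens cost almost as much as the 10,000 input tokens, and caching lowers only the input share of the bill.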

Embeddings and classification

Endpoints that return vectors instead of free text still count input tokens generally and may lack a separate completion meter.

Illustrative split

A system message plus user message totaling five hundred tokens, followed by a two-hundred-token assistant reply, is billed as five hundred input tokens and two hundred output tokens, before any caching adjustments.
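The same split can be written as a small bookkeeping function. This is an illustrative sketch: `classify_call` is a hypothetical helper, and the 120/380 division between system and user tokens is an assumption made only so the total comes to five hundred.

```python
from typing import Dict, List, Tuple

Message = Tuple[str, int]  # (role, token_count)


def classify_call(context: List[Message], reply_tokens: int) -> Dict[str, int]:
    """Everything sent with the call is input; the new reply is output."""
    return {
        "input_tokens": sum(count for _, count in context),
        "output_tokens": reply_tokens,
    }


# First turn: 120 + 380 = 500 tokens in, 200 tokens out.
turn_one = [("system", 120), ("user", 380)]
first = classify_call(turn_one, reply_tokens=200)

# Second turn: the earlier reply is now history, so it counts as input.
turn_two = turn_one + [("assistant", 200), ("user", 50)]
second = classify_call(turn_two, reply_tokens=90)
print(first, second)
```

On the second turn the earlier two-hundred-token reply is counted on the input side, bringing input to 750 tokens.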

Common confusions

Labeling prior assistant turns as output of the current call; history is usually input on the next turn.
Assuming hidden chain-of-thought is billed the same way everywhere; whether reasoning tokens are exposed or charged varies widely by policy and product.
Assuming whitespace trimming on the client removes billed tokens even though the server may reconstruct chat templates differently.
Treating discounted cached input tokens as if they were free instead of reading the effective rate on your invoice.
Comparing bills across providers without normalizing how each labels system, developer, and tool-bearing segments.

Operational tips

Plot input versus output over time to see whether compression should target prompts or answers.
Teach product teams that every extra paragraph in the UI spec may balloon completion tokens.
Use maximum token caps in risky endpoints to prevent runaway generation bills.
Compare rates whenever you adopt a new model; spreads between input and output can flip.
Store usage objects from each response in your observability pipeline for audits.
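A stored usage log makes the first and last tips concrete. The sketch below assumes each response's usage has already been reduced to an endpoint name plus two counters; `UsageRecord` and `output_share_by_endpoint` are hypothetical names, and real usage objects differ by provider.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass
class UsageRecord:
    endpoint: str
    input_tokens: int
    output_tokens: int


def output_share_by_endpoint(records: Iterable[UsageRecord]) -> Dict[str, float]:
    """Fraction of all tokens that were output, per endpoint."""
    totals: Dict[str, Tuple[int, int]] = {}
    for record in records:
        inp, out = totals.get(record.endpoint, (0, 0))
        totals[record.endpoint] = (
            inp + record.input_tokens,
            out + record.output_tokens,
        )
    return {
        endpoint: (out / (inp + out) if inp + out else 0.0)
        for endpoint, (inp, out) in totals.items()
    }


log = [
    UsageRecord("chat", 900, 100),
    UsageRecord("chat", 1100, 900),
    UsageRecord("summarize", 3000, 1000),
]
shares = output_share_by_endpoint(log)
print(shares)
```

An endpoint with a low output share is a candidate for prompt trimming or caching, while a high share points toward shorter answers or a tighter maximum token cap.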
