What is the difference between prompt and completion tokens?
Quick answer
Prompt tokens measure all text the model processes as context for a request, including system and developer instructions and any history you attach. Completion tokens measure the newly generated text the model emits after that context, up to the stop condition or token limit. APIs report them on separate counters, often billed at different rates. The names come from older text-completion interfaces rather than chat roles, but they map cleanly to input and output in usage objects.
Introduction
Older documentation used prompt and completion language while newer dashboards may say input and output. Conceptually they refer to the same split unless a provider documents an exception for niche endpoints. Consistency in your internal wiki reduces onboarding friction.
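One way to keep internal reporting consistent is to normalize both vocabularies into a single record before logging. The sketch below is illustrative only: the field names `prompt_tokens`/`completion_tokens` and `input_tokens`/`output_tokens` are assumed here as the two conventions, and `TokenUsage` and `normalize_usage` are hypothetical helpers, so check your provider's usage object before relying on them.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class TokenUsage:
    prompt_tokens: int      # context the model read
    completion_tokens: int  # text the model generated

    @property
    def total_tokens(self) -> int:
        return self.prompt_tokens + self.completion_tokens


def normalize_usage(usage: Mapping[str, int]) -> TokenUsage:
    """Map either naming convention onto one internal record.

    The key names are assumptions; adjust them to your provider's schema.
    """
    if "prompt_tokens" in usage or "completion_tokens" in usage:
        return TokenUsage(usage.get("prompt_tokens", 0),
                          usage.get("completion_tokens", 0))
    if "input_tokens" in usage or "output_tokens" in usage:
        return TokenUsage(usage.get("input_tokens", 0),
                          usage.get("output_tokens", 0))
    raise ValueError(f"unrecognized usage fields: {sorted(usage)}")


older_style = normalize_usage({"prompt_tokens": 2000, "completion_tokens": 300})
newer_style = normalize_usage({"input_tokens": 2000, "output_tokens": 300})
```

Both calls produce the same record, so dashboards built on it do not need to know which vocabulary the upstream API used.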
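Function calling adds structured material to both counters. Tool definitions sent with the request typically count as prompt tokens, and the tool call the model emits counts as completion tokens. Once a tool returns its result, that result becomes additional prompt material in the next request.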
Lifecycle of a chat call
You assemble messages, convert them to tokens, and submit them. The model samples completion tokens until it hits a token limit or a stop sequence. Logged totals should reconcile with your local tokenizer estimates plus any tokens the server adds, such as message formatting.
When you store transcripts, label whether each segment originated from the user, tool, or assistant to debug future token growth.
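The labeling advice above can be sketched as a small aggregation over stored segments. Everything here is hypothetical: the `Segment` record, the origin labels, and the token counts are made up for illustration, and the counts are assumed to have been recorded when each segment was stored.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class Segment(NamedTuple):
    origin: str  # "system", "user", "tool", or "assistant"
    tokens: int  # token count recorded when the segment was stored


def tokens_by_origin(segments: Iterable[Segment]) -> Counter:
    """Sum stored token counts per origin label."""
    totals = Counter()
    for seg in segments:
        totals[seg.origin] += seg.tokens
    return totals


# Made-up transcript for illustration.
transcript = [
    Segment("system", 120),
    Segment("user", 40),
    Segment("assistant", 300),
    Segment("tool", 850),
    Segment("user", 25),
]

totals = tokens_by_origin(transcript)
largest_origin, largest_count = totals.most_common(1)[0]
```

In this made-up transcript the tool output is the largest contributor, which is the kind of signal that points to trimming tool results before they are fed back as context.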
Billing nuances
If output pricing exceeds input pricing, compressing verbose answers yields outsized savings. If prompt spend dominates because of retrieved context, invest in selecting fewer, more relevant passages.
Partial completions still bill for emitted tokens even if the client disconnects mid-stream, unless the provider's policy states otherwise.
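A rough sketch of the arithmetic behind these trade-offs follows. The rates are hypothetical placeholders (1.00 per million prompt tokens, 4.00 per million completion tokens), not any vendor's prices, and `request_cost` is an illustrative helper.

```python
def request_cost(prompt_tokens: int, completion_tokens: int,
                 prompt_rate_per_million: float,
                 completion_rate_per_million: float) -> float:
    """Cost of one request when the two counters are billed at separate rates."""
    return (prompt_tokens * prompt_rate_per_million
            + completion_tokens * completion_rate_per_million) / 1_000_000


# Hypothetical rates for illustration only.
PROMPT_RATE = 1.00
COMPLETION_RATE = 4.00

baseline = request_cost(2000, 300, PROMPT_RATE, COMPLETION_RATE)

# Remove the same number of tokens (150) from each side and compare savings.
saved_by_shorter_answer = baseline - request_cost(2000, 150, PROMPT_RATE, COMPLETION_RATE)
saved_by_shorter_prompt = baseline - request_cost(1850, 300, PROMPT_RATE, COMPLETION_RATE)
```

Under these assumed rates, trimming 150 completion tokens saves four times as much as trimming 150 prompt tokens. When the prompt is much larger than the reply, as with heavy retrieval, the prompt side can still dominate total spend despite its lower rate.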
Minimal illustration
With a two-thousand-token prompt and a three-hundred-token reply, usage charts should display two thousand prompt tokens and three hundred completion tokens for that interaction.
Common mix-ups
- Counting assistant messages in history as completions for the current billable call.
- Forgetting tool outputs feed the next prompt, not the prior completion bucket.
- Mislabeling embeddings calls, which may expose only an input-style metric.
- Bookmarking dashboards that combine generations across unrelated keys and then misattributing completion spend.
- Ignoring that partial streams still bill completion tokens for text emitted before the cancellation took effect.
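The first two mix-ups are easier to see with a small simulation. This is a simplified sketch: it assumes the full history is resent on every call and ignores any per-message formatting tokens a server might add. The function name and token counts are made up for illustration.

```python
def simulate_calls(turns):
    """Return (prompt_tokens, completion_tokens) for each API call.

    `turns` is a list of (new_input_tokens, reply_tokens) pairs, one per call.
    Assumes the whole prior history is resent as context each time.
    """
    history_tokens = 0
    usage = []
    for new_input_tokens, reply_tokens in turns:
        # Everything sent with this call is prompt, including earlier replies.
        prompt_tokens = history_tokens + new_input_tokens
        usage.append((prompt_tokens, reply_tokens))
        # The new reply joins the history and is read as prompt next time.
        history_tokens = prompt_tokens + reply_tokens
    return usage


calls = simulate_calls([
    (200, 50),   # system + user message; the model replies with a tool call
    (400, 120),  # tool output attached as new input; the model writes the answer
    (30, 80),    # follow-up user message
])

total_prompt = sum(p for p, _ in calls)
total_completion = sum(c for _, c in calls)
```

Each earlier assistant reply is counted once as completion when it is generated and again as prompt on every later call that includes it, which is why prompt totals grow faster than completion totals in long conversations.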
Operational tips
- Mirror official usage field names in your telemetry to ease support tickets.
- Build dashboards that compare prompt versus completion across environments.
- Educate PMs with concrete examples from your own logs, not generic diagrams.
- Automate anomaly detection on sudden prompt inflation.
- Version templates to know which copy correlated with token spikes.
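As one possible starting point for the anomaly detection tip, the sketch below flags points where prompt tokens jump well above a rolling median. The window size, threshold factor, and sample series are arbitrary assumptions to tune against your own logs, and `prompt_inflation_alerts` is a hypothetical helper.

```python
from statistics import median


def prompt_inflation_alerts(prompt_token_series, window=5, factor=1.5):
    """Return indices whose value exceeds `factor` times the median
    of the preceding `window` values."""
    alerts = []
    for i in range(window, len(prompt_token_series)):
        baseline = median(prompt_token_series[i - window:i])
        if baseline > 0 and prompt_token_series[i] > factor * baseline:
            alerts.append(i)
    return alerts


# Made-up daily average prompt tokens per request.
daily_avg_prompt_tokens = [1000, 1020, 990, 1010, 1005, 1015, 1900, 1950]
alert_indices = prompt_inflation_alerts(daily_avg_prompt_tokens)
```

Pairing each alert with the template version deployed at that time makes it easier to tell whether a copy change or a retrieval change caused the spike.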