Why your LLM bill keeps surprising you, where the expensive tokens are actually hiding, and the trade-offs worth thinking about before you ship a feature that hits the API on every page view.

Most LLM cost surprises come from one mistake: assuming a token is roughly a word. It isn't. A 500-word English email is around 625 tokens. A 500-character base64-encoded image fragment is also around 625 tokens. A 500-word JSON blob with whitespace and repeated field names? Closer to 1,200. Same wall-clock typing time. Very different bills at the end of the month.

If your feature hits an LLM on every page view and you haven't actually counted what each call costs in tokens, you don't have a cost forecast — you have a budget that explodes when traffic shows up. Here are the five places where token counts surprise developers, what the numbers look like at scale, and what to do about each before it bites.

1. Tokens aren't words. They're not characters either.

Tokenizers split text into sub-word units, and the rules depend on the model. The string "GPT-4" is 4 tokens in OpenAI's tokenizer ("G", "PT", "-", "4") and 3 in Anthropic's. The string "supercalifragilisticexpialidocious" is 9 tokens in cl100k (OpenAI) and 11 in Claude's tokenizer. Same characters, different costs.

What this means in practice:

Run your actual prompts through the actual tokenizer for the actual model you're using before you ship. The token counter handles the major models. Anthropic also publishes their tokenizer behaviour in their model documentation.

2. System prompts are paid for on every single request

This is the one that catches everyone. You write a beautiful 2,000-token system prompt with examples, edge cases, and tone guidance. You ship the feature. It works. Then someone in finance asks why the API bill is €4,200 this month.

Math: 2,000 tokens × 50,000 requests/day × 30 days = 3 billion input tokens/month. At Claude Sonnet 4.6 pricing of $3/million input tokens, that's $9,000/month — just for the system prompt. The user's actual question is the cheap part.
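The arithmetic above is easy to keep in a helper so it can be rerun whenever the prompt or the traffic forecast changes. This is a minimal sketch that assumes a flat per-million input price and ignores output tokens and any caching discount. The function name and parameters are illustrative, not part of any SDK.

```python
def monthly_input_cost(tokens_per_request: int,
                       requests_per_day: int,
                       usd_per_million_tokens: float,
                       days: int = 30) -> float:
    """Estimate monthly spend on input tokens for a fixed-size prompt."""
    total_tokens = tokens_per_request * requests_per_day * days
    return total_tokens / 1_000_000 * usd_per_million_tokens


# The system-prompt example from the text: 2,000 tokens, 50,000 requests/day, $3/M.
system_prompt_cost = monthly_input_cost(2_000, 50_000, 3.0)
print(f"${system_prompt_cost:,.0f}/month")  # $9,000/month
```

Halving the system prompt halves this figure, which is usually the cheapest optimisation available.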

Three things to do, in order of effort:

3. JSON in / JSON out has hidden whitespace cost

If you're round-tripping JSON through the API — sending structured input, asking for structured output — whitespace and indentation get billed at full token rates.

Pretty-printed, ~30 tokens:

{
  "user_id": 12345,
  "preferences": {
    "language": "en",
    "timezone": "UTC"
  }
}

Minified, ~22 tokens (-27%):

{"user_id":12345,"preferences":{"language":"en","timezone":"UTC"}}

27% might not sound like much, but multiply it across a million calls. Worse: field names repeat. If you have an array of 500 objects each with the same eight keys, those eight key strings get tokenized 500 times. Abbreviated field names cost less. "timezone" is 2 tokens; "tz" is 1.
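Both savings can be applied just before the payload is sent. The sketch below uses only the standard library. The helper names and the key map are hypothetical, and character counts are printed only as a rough proxy; the real saving should be measured with the tokenizer for the model in use.

```python
import json


def compact_json(obj) -> str:
    """Serialize without the spaces json.dumps adds after ',' and ':'."""
    return json.dumps(obj, separators=(",", ":"))


def shorten_keys(records: list[dict], key_map: dict[str, str]) -> list[dict]:
    """Rename top-level keys in each record; keys not in key_map are kept."""
    return [{key_map.get(k, k): v for k, v in r.items()} for r in records]


payload = {"user_id": 12345, "preferences": {"language": "en", "timezone": "UTC"}}
pretty = json.dumps(payload, indent=2)
minified = compact_json(payload)
print(len(pretty), len(minified))  # character counts, not token counts

# Hypothetical key map for a repeated-record array.
rows = [{"timezone": "UTC", "language": "en"}] * 3
short_rows = shorten_keys(rows, {"timezone": "tz", "language": "lang"})
print(compact_json(short_rows))
```

If keys are shortened, the prompt has to tell the model what the short names mean, so the saving only pays off when the array is long enough to outweigh that explanation.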

If you're using tool-use mode (function calling), the schema you provide also counts. Trim descriptions to what's necessary for the model to use the tool correctly, not what would help a human read it. For converting between schema formats, the JSON Schema to Pydantic converter handles a common pattern.

4. Vision tokens are not free

Multimodal calls add a per-image cost most teams forget to model. Anthropic's vision pricing works out to roughly width × height / 750 tokens, so a 1092×1092 image costs about 1,590 tokens, billed the same as about 1,600 tokens of prose. OpenAI's scales by detail level. Both are billed at input rates.

If you've built a "user uploads photo of receipt, we extract line items" feature:

10,000 receipts/day × 3,000 tokens = 30 million input tokens/day. At a Claude Sonnet input rate of $3/million tokens, that's about $90/day, or roughly $2,700/month, in API input costs alone. Compare that to your per-user revenue from the feature. The economics often demand caching, batching, or a cheaper OCR pipeline as the first pass, with the LLM reserved for the hard cases.
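One way to picture the OCR-first approach is a confidence gate: only receipts the cheap pipeline is unsure about go to the LLM. The sketch below is illustrative only. `OcrResult`, the 0.9 threshold, and the 3,000-token figure are assumptions for the example, not any OCR library's API.

```python
from dataclasses import dataclass


@dataclass
class OcrResult:
    text: str
    confidence: float  # 0.0 to 1.0, as reported by a hypothetical OCR step


def needs_llm(result: OcrResult, threshold: float = 0.9) -> bool:
    """Route to the LLM only when the cheap pipeline is unsure."""
    return result.confidence < threshold


def llm_input_tokens(results: list[OcrResult],
                     tokens_per_receipt: int = 3_000,
                     threshold: float = 0.9) -> int:
    """Input tokens spent if only low-confidence receipts reach the LLM."""
    hard_cases = sum(1 for r in results if needs_llm(r, threshold))
    return hard_cases * tokens_per_receipt


batch = [OcrResult("TOTAL 12.40", 0.97), OcrResult("T0TAL 1Z.4O", 0.52),
         OcrResult("TOTAL 8.10", 0.99), OcrResult("TOTA 3.3", 0.81)]
print(llm_input_tokens(batch))  # 2 of 4 receipts go to the LLM -> 6000
```

The threshold is the knob to tune: lower it and the LLM bill shrinks, but more OCR mistakes slip through unreviewed.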

5. Conversations bloat fast

In a chat-style application, every turn includes the entire prior conversation as input. Turn 10 sends the previous 9 turns plus the new question. The input for each request grows linearly with the turn number, so the total cost of a conversation grows roughly quadratically with its length.
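The growth is easy to see with a short simulation. This sketch assumes every turn adds the same number of tokens to the history and that the full history is resent each time; real conversations vary, and the function name is illustrative.

```python
def cumulative_input_tokens(turns: int, tokens_per_turn: int,
                            system_tokens: int = 0) -> int:
    """Total input tokens billed across a conversation that resends its history."""
    total = 0
    history = 0
    for _ in range(turns):
        history += tokens_per_turn        # the new turn joins the history
        total += system_tokens + history  # the whole history is sent as input
    return total


ten = cumulative_input_tokens(10, 500)
twenty = cumulative_input_tokens(20, 500)
print(ten, twenty)  # 27500 105000
```

Doubling the conversation from 10 to 20 turns nearly quadruples the input tokens billed, which is why long sessions dominate the bill even when the average message is short.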

Mitigations:

If you're A/B testing different prompts to see which one costs less, prompt diff shows what actually changed and where.

The discipline

Token counting isn't an optimization concern that you address after you've shipped. It's a planning concern that goes in before you write the first prompt. Three habits that pay back the time spent learning them:

  1. Always know your per-request token count before shipping. Run a real example through a tokenizer, write the number in your design doc, multiply by expected traffic.
  2. Log token counts per request alongside latency and error rate. Anthropic and OpenAI both return usage fields. Push them into your metrics like any other operational signal.
  3. Set a hard monthly budget in your provider's dashboard. The "I'll watch it" approach has landed a lot of indie devs with surprise five-figure bills.
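For the second habit, the usage fields can be normalised into one shape before they reach your metrics. The sketch below assumes the usage object has already been converted to a plain dict. It accepts both the `input_tokens`/`output_tokens` names (Anthropic) and the `prompt_tokens`/`completion_tokens` names (OpenAI chat completions). The `record_usage` helper and the in-memory `Counter` are stand-ins for whatever metrics client you actually use.

```python
from collections import Counter


def extract_usage(usage: dict) -> tuple[int, int]:
    """Return (input_tokens, output_tokens) from either naming convention."""
    input_tokens = usage.get("input_tokens", usage.get("prompt_tokens", 0))
    output_tokens = usage.get("output_tokens", usage.get("completion_tokens", 0))
    return input_tokens, output_tokens


def record_usage(totals: Counter, feature: str, usage: dict) -> None:
    """Accumulate per-feature token counts; swap in a real metrics client here."""
    input_tokens, output_tokens = extract_usage(usage)
    totals[f"{feature}.input_tokens"] += input_tokens
    totals[f"{feature}.output_tokens"] += output_tokens
    totals[f"{feature}.requests"] += 1


totals = Counter()
record_usage(totals, "summarise", {"input_tokens": 2150, "output_tokens": 310})
record_usage(totals, "summarise", {"prompt_tokens": 1980, "completion_tokens": 295})
print(dict(totals))
```

Tagging counts by feature rather than by endpoint makes it obvious which product decision is responsible when the total moves.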

The tooling we ship at Toolhub is browser-only, no LLM in the loop — but if your team is shipping LLM-backed features, the token counter is the first stop. Get the cost picture before the bill picture.
