How Prompt Caching Saves You Money

Elena Fischer · May 29, 2026 · 4 min read

If your application sends the same long preamble over and over, you are paying to process those identical tokens again on every single call. Prompt caching fixes this, and it is one of the highest-leverage cost optimizations available to an LLM developer.

This article explains what caching does, where the savings come from, and how to structure prompts on Model Database to take advantage of it.

The repeated-context problem

Most production prompts have two parts: a large stable section and a small variable section. The stable part might be a system prompt, a style guide, few-shot examples, or a chunk of reference documentation. The variable part is the user's actual question.

Imagine a support assistant whose system prompt is 3,000 tokens of policies and examples. If the user message is only 50 tokens, then about 98% of your input on every request (3,000 of 3,050 tokens) is identical to the last one. Without caching, you pay to process all 3,050 tokens every time.
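To make the proportion concrete, here is a minimal Python sketch of that arithmetic. The function name `stable_share` is purely illustrative; the token counts are the ones from the support-assistant example.

```python
def stable_share(stable_tokens: int, variable_tokens: int) -> float:
    """Fraction of input tokens that repeat from one request to the next."""
    total = stable_tokens + variable_tokens
    if total == 0:
        return 0.0
    return stable_tokens / total


# Figures from the support-assistant example: 3,000 stable + 50 variable tokens.
share = stable_share(3_000, 50)
print(f"{share:.1%} of each request is the repeated prefix")
```

Running the same calculation against your own endpoints is a quick way to see which ones have enough repeated context to be worth restructuring.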

What caching actually saves

Caching lets the provider reuse the already-processed representation of that stable prefix instead of recomputing it. The practical effect: repeated input tokens are billed at a reduced rate compared with fresh input tokens. You still pay full price for the variable suffix and for output tokens, but the big static block becomes much cheaper after the first call.

Illustrative math: suppose you make 50,000 calls a day that each share a 3,000-token prefix. That is 150 million repeated input tokens daily. Even a partial discount on that volume is a meaningful line item, and it costs you nothing but a small change in how you order your messages.
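The sketch below turns that volume into a dollar estimate. The price per million tokens and the cached-token multiplier are assumptions chosen only to make the arithmetic visible; they are not Model Database rates, so substitute the figures from the pricing page. It also ignores the first uncached call and any cache expiry, so treat the result as an upper bound.

```python
def daily_prefix_cost(calls_per_day: int,
                      prefix_tokens: int,
                      price_per_million_usd: float,
                      cached_multiplier: float = 1.0) -> float:
    """Daily cost in USD of the shared prefix alone."""
    tokens = calls_per_day * prefix_tokens
    return tokens / 1_000_000 * price_per_million_usd * cached_multiplier


# ASSUMPTIONS for illustration only, not real rates:
PRICE_PER_MILLION_USD = 3.00   # hypothetical fresh-input price
CACHED_MULTIPLIER = 0.5        # hypothetical: cached tokens billed at half price

uncached = daily_prefix_cost(50_000, 3_000, PRICE_PER_MILLION_USD)
cached = daily_prefix_cost(50_000, 3_000, PRICE_PER_MILLION_USD, CACHED_MULTIPLIER)

print(f"repeated tokens per day: {50_000 * 3_000:,}")
print(f"prefix cost without caching: ${uncached:,.2f}")
print(f"prefix cost with caching:    ${cached:,.2f}")
print(f"estimated daily saving:      ${uncached - cached:,.2f}")
```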

Structure prompts for cache hits

Caching works on prefixes, so the golden rule is: put the stable content first and the variable content last. Anything that changes between requests should live at the end of the prompt.

{
  "model": "anthropic/claude-sonnet-4-6",
  "messages": [
    {"role": "system", "content": "<large stable policy + examples>"},
    {"role": "user", "content": "<short variable question>"}
  ]
}

If you interleave a timestamp, a request ID, or a random token early in the prompt, you break the prefix and lose the hit. Keep volatile values out of the cached region.
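As a rough illustration of why ordering matters, the following Python sketch builds the same request two ways and compares how much of the serialized payload two consecutive requests share. The helper names are hypothetical, and comparing characters of the JSON body is only a proxy: this article does not specify how the provider matches prefixes, so use it as a sanity check rather than a prediction of cache hits.

```python
import json
import os

STABLE_SYSTEM_PROMPT = "<large stable policy + examples>"
MODEL = "anthropic/claude-sonnet-4-6"


def build_request(question: str, volatile_note: str = "") -> dict:
    """Stable content first; anything volatile is appended at the very end."""
    user_content = f"{question}\n\n{volatile_note}" if volatile_note else question
    return {
        "model": MODEL,
        "messages": [
            {"role": "system", "content": STABLE_SYSTEM_PROMPT},
            {"role": "user", "content": user_content},
        ],
    }


def build_request_bad(question: str, timestamp: str) -> dict:
    """Anti-pattern: a timestamp at the start of the system prompt."""
    return {
        "model": MODEL,
        "messages": [
            {"role": "system", "content": f"[{timestamp}] {STABLE_SYSTEM_PROMPT}"},
            {"role": "user", "content": question},
        ],
    }


def shared_prefix_chars(a: dict, b: dict) -> int:
    """Length of the common leading substring of two serialized requests."""
    return len(os.path.commonprefix([json.dumps(a), json.dumps(b)]))


good = shared_prefix_chars(
    build_request("How do refunds work?", "sent 2026-05-29T10:00:00Z"),
    build_request("Where is my order?", "sent 2026-05-29T10:00:01Z"),
)
bad = shared_prefix_chars(
    build_request_bad("How do refunds work?", "2026-05-29T10:00:00Z"),
    build_request_bad("Where is my order?", "2026-05-29T10:00:01Z"),
)
print(f"shared prefix, stable-first: {good} chars")
print(f"shared prefix, timestamp-first: {bad} chars")
```

In the stable-first version the shared region covers the entire system prompt; in the timestamp-first version it ends inside the timestamp, before the stable block even begins.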

Confirm the savings with the charge headers

You do not have to take caching on faith. Model Database returns the exact cost of each call, so you can send the same request twice and watch the second one come in cheaper.

curl -sD - https://modeldatabase.com/v1/chat/completions \
  -H "Authorization: Bearer mdb_live_..." \
  -H "Content-Type: application/json" \
  -d @request.json | grep -i x-mdb-charged
# first call:  X-MDB-Charged-USD: 0.0041
# second call: X-MDB-Charged-USD: 0.0017

Log X-MDB-Charged-USD across a sample of traffic and you can measure your real cache hit rate as a dollar figure, not a guess.
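One way to do that logging is sketched below in Python: it pulls every `X-MDB-Charged-USD` value out of captured response headers and compares repeat calls with the first one. The header name and the two sample values come from the curl example above; the function names and the idea of treating the first call as the uncached baseline are assumptions made for illustration.

```python
import re
from statistics import mean

# Matches lines such as "X-MDB-Charged-USD: 0.0041" in a raw header dump.
CHARGE_HEADER = re.compile(
    r"^x-mdb-charged-usd:\s*([0-9]*\.?[0-9]+)\s*$",
    re.IGNORECASE | re.MULTILINE,
)


def extract_charges(header_dump: str) -> list:
    """Return every charged amount found in the header text, in order."""
    return [float(value) for value in CHARGE_HEADER.findall(header_dump)]


def repeat_discount(charges: list):
    """Average saving of later calls relative to the first (assumed uncached) call."""
    if len(charges) < 2 or charges[0] == 0:
        return None
    return 1 - mean(charges[1:]) / charges[0]


# Sample values copied from the curl output shown earlier.
captured = (
    "X-MDB-Charged-USD: 0.0041\r\n"
    "Content-Type: application/json\r\n"
    "X-MDB-Charged-USD: 0.0017\r\n"
)
charges = extract_charges(captured)
print(f"calls: {len(charges)}, total: ${sum(charges):.4f}")
print(f"repeat calls are {repeat_discount(charges):.0%} cheaper than the first")
```

Because the charge covers the variable suffix and output tokens as well, the figure reflects the saving on the whole call, not the discount on cached tokens alone.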

Design patterns that maximize hits

When caching does not help

Caching is worthless if every request is unique. A one-shot creative writing tool with no shared system prompt has nothing to reuse. The technique shines specifically when a large block of context repeats, so spend your effort on the endpoints where that is true and measure the rest with the charge headers.

Want to see caching pay off in real dollars? Send a couple of repeated requests and compare the headers, then track the trend on your dashboard. Model and rate details are on the pricing page.
