The cheapest LLM call is the one you never make. A caching layer in front of your model is often the single highest-return optimization in an LLM application: it cuts cost, slashes latency, and reduces load on the API, all at once. This article shows how to build one that works on Model Database.
Application caching versus prompt caching
These are complementary, not the same thing. Prompt caching happens at the model and discounts a repeated prompt prefix, but you still make the call. Application caching stores the full response in your own store and skips the call entirely on a hit. This article is about the latter, which you control completely.
Exact-match caching
The simplest cache keys on the exact request. Normalize the inputs, hash them, and store the response. On a repeat, return the stored answer immediately, with no model call and no charge.
import hashlib
import json

import openai

client = openai.OpenAI(base_url="https://modeldatabase.com/v1",
                       api_key="mdb_live_...")

def cache_key(model, messages):
    # sort_keys keeps the JSON, and so the hash, independent of dict key order
    blob = json.dumps({"m": model, "msgs": messages}, sort_keys=True)
    return "llm:" + hashlib.sha256(blob.encode()).hexdigest()

def cached_chat(store, model, messages):
    # store: any key-value client with get/set and a TTL;
    # get is assumed to return None on a miss
    key = cache_key(model, messages)
    hit = store.get(key)
    if hit is not None:  # an empty string is still a valid cached answer
        return hit  # no model call, no charge
    r = client.chat.completions.create(model=model, messages=messages)
    out = r.choices[0].message.content
    store.set(key, out, ttl=86400)  # keep for 24 hours
    return out
This alone removes exact-duplicate requests for as long as the cached entry lives. For FAQ bots, repeated document questions, and shared prompts, the hit rate can be substantial.
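To try the function above without standing up Redis or similar, a minimal in-memory store is enough. The sketch below is an assumption-laden stand-in, not a production cache: `InMemoryTTLStore` and `normalize_messages` are hypothetical names, the store is single-process and not thread-safe, and the normalizer assumes every message's `content` is a plain string.

```python
import time

class InMemoryTTLStore:
    """Dict-backed store with the get/set(ttl=...) shape cached_chat expects."""

    def __init__(self, clock=time.monotonic):
        self._data = {}      # key -> (value, expiry timestamp)
        self._clock = clock  # injectable so tests can control time

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._data[key]  # lazily evict expired entries
            return None
        return value

    def set(self, key, value, ttl):
        self._data[key] = (value, self._clock() + ttl)

def normalize_messages(messages):
    """Trim and collapse whitespace so trivially different requests share a key.

    Assumes each message is a dict with string "role" and "content" fields.
    """
    return [
        {"role": m["role"], "content": " ".join(m["content"].split())}
        for m in messages
    ]
```

Pass `normalize_messages(messages)` rather than the raw list when building the key. How aggressive to be (lowercasing, stripping punctuation) depends on whether those differences can change the right answer.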
Semantic caching for near-duplicates
Users rarely phrase the same question identically. "How do I reset my password?" and "I forgot my password, what now?" deserve the same answer. A semantic cache embeds the query, looks for a stored entry within a similarity threshold, and reuses it.
- Embed the incoming query to a vector.
- Search your vector store for the nearest cached query.
- If similarity exceeds a threshold, return the cached response.
- Otherwise call the model and store the new query, vector, and answer.
Set the threshold carefully: too loose and you serve wrong answers, too tight and you lose hits. Tune it on real traffic.
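The four steps above can be sketched in plain Python. Everything here is illustrative: `toy_embed` is a hypothetical bag-of-words stand-in for a real embedding model, the linear scan stands in for a vector store, and `SemanticCache`, `semantic_chat`, and `call_model` are made-up names.

```python
import math
import re
from collections import Counter

def toy_embed(text):
    """Stand-in embedding: sparse word-count vector. Swap in a real model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse (dict-like) vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

class SemanticCache:
    def __init__(self, embed, threshold):
        self.embed = embed
        self.threshold = threshold
        self.entries = []  # list of (vector, query, answer)

    def lookup(self, query):
        """Return the answer of the nearest stored query, or None if too far."""
        vec = self.embed(query)
        best_score, best_answer = 0.0, None
        for stored_vec, _query, answer in self.entries:  # linear scan
            score = cosine(vec, stored_vec)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self.threshold else None

    def add(self, query, answer):
        self.entries.append((self.embed(query), query, answer))

def semantic_chat(cache, query, call_model):
    hit = cache.lookup(query)
    if hit is not None:
        return hit                 # close enough: reuse the stored answer
    answer = call_model(query)     # miss: pay for one model call
    cache.add(query, answer)
    return answer
```

A word-count vector only catches overlapping wording, so it would miss a true paraphrase such as "I forgot my password, what now?". That gap is the reason to use a real embedding model, and the threshold then has to be retuned for that model's similarity scale.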
Choose a sensible TTL
Cached answers can go stale. Match the time-to-live to how fast the underlying truth changes:
- Stable knowledge (definitions, policies): long TTL, days or weeks.
- Slowly changing (product docs): hours.
- Volatile or personalized (account-specific, time-sensitive): short TTL or do not cache at all.
Never store a response that contains one user's private data under a key another user could hit. Scope personalized entries by user ID.
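One way to keep these rules in a single place is a small policy table plus a key-scoping helper. This is a sketch under assumptions: the category names, the specific durations, and the `ttl_for` and `scoped_key` helpers are all hypothetical, and the right values depend on your own data.

```python
# Illustrative TTLs in seconds; the numbers are assumptions, not recommendations.
TTL_SECONDS = {
    "stable": 7 * 24 * 3600,  # definitions, policies: about a week
    "slow": 6 * 3600,         # product docs: a few hours
    "volatile": 60,           # time-sensitive: about a minute
}

def ttl_for(category):
    """Return a TTL in seconds, or None meaning 'do not cache'.

    Unknown categories fall through to None so the safe default is no caching.
    """
    return TTL_SECONDS.get(category)

def scoped_key(base_key, user_id=None):
    """Prefix personalized entries with the user ID so users never share them."""
    if user_id is None:
        return base_key  # shared, non-personal entry
    return f"user:{user_id}:{base_key}"
```

For example, `scoped_key("llm:abc", user_id="42")` yields a key that a request from any other user cannot produce, while shared FAQ answers keep the unscoped key and its higher hit rate.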
Measure the win with the headers
Prove the cache is paying off by tracking cost with and without it. Every miss carries an X-MDB-Charged-USD value; every hit costs zero. Your savings are simply the charges you avoided.
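# resp is assumed to be the raw HTTP response for the miss, so its headers are
# readable (with the openai SDK, for example, the object returned by
# client.chat.completions.with_raw_response.create(...))
if hit:
    metrics.incr("llm.cache.hit")
else:
    metrics.incr("llm.cache.miss")
    metrics.add("llm.cost",
                float(resp.headers["X-MDB-Charged-USD"]))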
Illustrative math: if 40% of 100,000 daily calls hit the cache, you avoid 40,000 model calls every day, and the dollar value of those skipped X-MDB-Charged-USD charges is your direct saving.
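To put a dollar figure on that, multiply the avoided calls by the average charge you observe on misses. The snippet below is only arithmetic; `daily_savings` is a hypothetical helper, and the $0.002 average charge is a made-up number chosen to keep the math easy, not a real rate.

```python
def daily_savings(total_calls, hit_rate, avg_charge_usd):
    """Return (avoided_calls, avoided_usd) for one day of traffic."""
    avoided_calls = round(total_calls * hit_rate)
    avoided_usd = round(avoided_calls * avg_charge_usd, 2)
    return avoided_calls, avoided_usd

# 100,000 calls a day, 40% hit rate, assumed $0.002 average charge per miss
avoided, saved = daily_savings(100_000, 0.40, 0.002)
print(f"{avoided} calls avoided, about ${saved:.2f} saved per day")
```

With those assumed inputs, 40,000 avoided calls at $0.002 each come to about $80 a day. Substitute the mean of your own recorded X-MDB-Charged-USD values for the placeholder.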
Watch the failure modes
- Stale answers: bound them with TTL and offer a way to bypass the cache.
- Cache stampede: when a popular entry expires, many requests miss at once. Use a short lock or single-flight so only one request refills it.
- Non-determinism: creative outputs vary by design; cache only where a stable answer is acceptable.
- Privacy: key personalized responses by user and never leak across accounts.
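The stampede guard is the least obvious of these, so here is a minimal single-flight sketch for one process using threads. It assumes a store with the same `get`/`set(key, value, ttl=...)` shape used earlier; `DictStore` and `SingleFlight` are hypothetical names, `DictStore` ignores the TTL for brevity, and a multi-process deployment would need a distributed lock instead.

```python
import threading

class DictStore:
    """Tiny stand-in store; ignores ttl to keep the sketch short."""
    def __init__(self):
        self._data = {}
    def get(self, key):
        return self._data.get(key)
    def set(self, key, value, ttl):
        self._data[key] = value

class SingleFlight:
    """Let only one thread per key recompute a missing cache entry."""

    def __init__(self):
        self._locks = {}                 # key -> lock (never pruned in this sketch)
        self._guard = threading.Lock()   # protects the _locks dict itself

    def _lock_for(self, key):
        with self._guard:
            return self._locks.setdefault(key, threading.Lock())

    def get_or_compute(self, store, key, compute, ttl):
        hit = store.get(key)
        if hit is not None:
            return hit                   # fast path: no locking on a hit
        with self._lock_for(key):
            hit = store.get(key)         # re-check: another thread may have refilled it
            if hit is not None:
                return hit
            value = compute()            # only one thread per key gets here
            store.set(key, value, ttl=ttl)
            return value
```

Here `compute` would wrap the model call. Threads that arrive while the refill is in progress wait on the per-key lock and then read the fresh entry instead of each paying for their own call.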
Start small
You do not need a vector database on day one. Begin with an exact-match cache on your highest-volume endpoint, measure the hit rate and avoided charges, and add semantic matching only where the data shows it would pay off. A few dozen lines of code often return the biggest cost win in the whole system.
Add a cache, then watch your avoided charges add up on your dashboard. Compare model rates on the pricing page.