Adding a Caching Layer to Your LLM App

Elena Fischer | Dec 20, 2025 | 4 min read

The cheapest LLM call is the one you never make. A caching layer in front of your model is often the single highest-return optimization in an LLM application: it cuts cost, slashes latency, and reduces load on the API, all at once. This article shows how to build one that works on Model Database.

Application caching versus prompt caching

These are complementary, not the same thing. Prompt caching happens at the model and discounts a repeated prompt prefix, but you still make the call. Application caching stores the full response in your own store and skips the call entirely on a hit. This article is about the latter, which you control completely.

Exact-match caching

The simplest cache keys on the exact request. Normalize the inputs, hash them, and store the response. On a repeat, return the stored answer instantly for zero cost.

import hashlib
import json

import openai

client = openai.OpenAI(base_url="https://modeldatabase.com/v1",
                       api_key="mdb_live_...")

def cache_key(model, messages):
    # sort_keys gives a stable serialization, so equal requests hash equally
    blob = json.dumps({"m": model, "msgs": messages}, sort_keys=True)
    return "llm:" + hashlib.sha256(blob.encode()).hexdigest()

def cached_chat(store, model, messages):
    # store: any key-value client exposing get(key) and set(key, value, ttl=seconds)
    key = cache_key(model, messages)
    hit = store.get(key)
    if hit is not None:                # an empty string is still a valid hit
        return hit                     # zero cost, instant
    r = client.chat.completions.create(model=model, messages=messages)
    out = r.choices[0].message.content
    store.set(key, out, ttl=86400)     # keep for 24 hours
    return out

This alone removes exact-duplicate traffic for as long as the entries stay in the cache. For FAQ bots, repeated document questions, and shared prompts, the hit rate can be substantial.
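To try `cached_chat` without any infrastructure, the `store` argument can be a small in-process object. The following is a minimal sketch, not a production store: `InMemoryStore` and its injectable `clock` parameter are hypothetical names introduced here, and the only assumption is the `get(key)` / `set(key, value, ttl=...)` interface that `cached_chat` already uses.

```python
import time


class InMemoryStore:
    """Dict-backed store with per-entry expiry (illustrative only)."""

    def __init__(self, clock=time.monotonic):
        self._data = {}       # key -> (expires_at, value)
        self._clock = clock   # injectable so expiry can be tested

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._data[key]   # lazily evict expired entries on read
            return None
        return value

    def set(self, key, value, ttl):
        # ttl is in seconds, matching the ttl=86400 call in cached_chat
        self._data[key] = (self._clock() + ttl, value)
```

Passing `InMemoryStore()` as `store` is enough for a single process. Once several workers serve the same traffic, a shared store with built-in expiry is likely a better fit, since an in-process dict cannot share hits across processes.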

Semantic caching for near-duplicates

Users rarely phrase the same question identically. "How do I reset my password?" and "I forgot my password, what now?" deserve the same answer. A semantic cache embeds the query, looks for a stored entry within a similarity threshold, and reuses it.

Set the threshold carefully: too loose and you serve wrong answers, too tight and you lose hits. Tune it on real traffic.
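The lookup described above can be sketched in a few lines. This is a rough illustration under stated assumptions: `SemanticCache` is a hypothetical class, `embed` is any function you supply that maps text to a vector of floats (for example a call to an embedding model), the default threshold of 0.9 is a placeholder and not a recommendation, and the linear scan stands in for whatever vector index you might use at scale.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SemanticCache:
    def __init__(self, embed, threshold=0.9):
        self.embed = embed          # assumed: text -> list of floats
        self.threshold = threshold  # assumed value; tune on real traffic
        self.entries = []           # list of (vector, answer)

    def get(self, query):
        query_vec = self.embed(query)
        best_score, best_answer = 0.0, None
        for vec, answer in self.entries:   # linear scan, fine for small caches
            score = cosine(query_vec, vec)
            if score > best_score:
                best_score, best_answer = score, answer
        # reuse the closest stored answer only if it clears the threshold
        return best_answer if best_score >= self.threshold else None

    def set(self, query, answer):
        self.entries.append((self.embed(query), answer))
```

Logging `best_score` for both hits and misses is one way to gather the data needed to tune the threshold on real traffic.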

Choose a sensible TTL

Cached answers can go stale. Match the time-to-live to how fast the underlying truth changes.

Never cache anything that mixes one user's private data into a key another user could hit. Scope personalized entries by user ID.
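One way to scope personalized entries is to fold the user ID into the hashed key, so two users sending identical messages can never share an entry. The sketch below is a variant of the article's `cache_key`; the function name `scoped_cache_key`, the `user_id` parameter, and the `"scope"` field are assumptions made for illustration.

```python
import hashlib
import json


def scoped_cache_key(model, messages, user_id=None):
    # Shared (non-personalized) entries and per-user entries live in
    # separate scopes; the "user:" prefix keeps a user ID from ever
    # colliding with the literal "shared" scope.
    scope = "shared" if user_id is None else f"user:{user_id}"
    blob = json.dumps({"m": model, "msgs": messages, "scope": scope},
                      sort_keys=True)
    return "llm:" + hashlib.sha256(blob.encode()).hexdigest()
```

Requests that contain no private data can omit `user_id` and keep benefiting from hits across all users.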

Measure the win with the headers

Prove the cache is paying off by tracking cost with and without it. Every miss carries an X-MDB-Charged-USD value; every hit costs zero. Your savings are simply the charges you avoided.

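# `hit` is the cache lookup result; `resp` is the raw HTTP response of a miss
if hit:
    metrics.incr("llm.cache.hit")
else:
    metrics.incr("llm.cache.miss")
    # the charge for this call, as reported in the response header
    metrics.add("llm.cost",
                float(resp.headers["X-MDB-Charged-USD"]))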

Illustrative math: if 40% of 100,000 daily calls hit the cache, you avoid 40,000 model calls every day, and the dollar value of those skipped X-MDB-Charged-USD charges is your direct saving.
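The arithmetic above can be written as a small helper. This is only a sketch: `estimate_daily_savings` is a hypothetical function, and the average charge per call is a figure you would take from your own recorded `X-MDB-Charged-USD` values, not a published rate.

```python
def estimate_daily_savings(daily_calls, hit_rate, avg_charge_usd):
    """Return (avoided_calls, avoided_usd) for one day of traffic.

    avg_charge_usd is assumed to be the mean charge you observe on misses.
    """
    avoided_calls = round(daily_calls * hit_rate)
    return avoided_calls, avoided_calls * avg_charge_usd


# 40% of 100,000 calls, with a made-up average charge of $0.002 per call
calls, usd = estimate_daily_savings(100_000, 0.40, 0.002)
print(f"{calls} calls avoided, about ${usd:.2f} saved per day")
```

The estimate treats hits as costing the same as the average miss would have, which is a simplification; if cached prompts tend to be shorter or longer than average, the real saving will differ.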

Watch the failure modes

Start small

You do not need a vector database on day one. Begin with an exact-match cache on your highest-volume endpoint, measure the hit rate and avoided charges, and add semantic matching only where the data shows it would pay off. A few dozen lines of code often returns the biggest cost win in the whole system.

Add a cache, then watch your avoided charges add up on your dashboard. Compare model rates on the pricing page.
