Getting an LLM feature to work is one thing. Getting it to work at millions of requests a day, reliably and affordably, is a different engineering problem. At that scale, small inefficiencies multiply into large bills and small failure rates turn into constant fires.
This article walks through the architecture and habits that let you scale on Model Database without losing control of cost or reliability.
At scale, per-request waste is the enemy
A penny of overhead per call is nothing in a demo and a real budget at a million calls. Before scaling out, scale down the unit cost: trim prompts, cap max_tokens, route easy traffic to cheaper models like openai/gpt-4o-mini or google/gemini-2.0-flash, and cache repeated work. Every optimization you make to a single request is multiplied by your entire volume.
Separate real-time from background work
Not all requests need to be instant. Split your traffic into two lanes:
- Interactive: user is waiting. Prioritize latency, stream responses, keep prompts lean.
- Asynchronous: jobs like enrichment, summarization, and indexing. Push these through a queue at a controlled rate so they fill the gaps rather than competing with user traffic.
This separation lets you size each lane independently and prevents a batch job from starving live users.
Build on a queue and a worker pool
A durable queue in front of a pool of workers is the backbone of scale. It gives you buffering for spikes, controlled concurrency, natural retry points, and a place to enforce priority.
import asyncio, openai
client = openai.AsyncOpenAI(base_url="https://modeldatabase.com/v1",
api_key="mdb_live_...")
sem = asyncio.Semaphore(50)
async def worker(queue):
while True:
job = await queue.get()
async with sem:
try:
r = await client.chat.completions.create(**job.payload)
await job.complete(r)
except Exception:
await job.requeue_with_backoff()
queue.task_done()
Scale throughput by adjusting the semaphore and the number of workers, not by rewriting logic.
Cache aggressively
At scale, duplicate and near-duplicate requests are guaranteed. A shared cache, keyed on the normalized prompt and model, removes a real fraction of calls entirely. Even a modest hit rate on millions of requests is a large saving, and cache reads are far faster than model calls, so latency improves too.
Instrument every call with the headers
You cannot manage spend at scale from a monthly invoice. Log X-MDB-Charged-USD on every response, tagged by feature and model, and aggregate it continuously.
resp = client.chat.completions.with_raw_response.create(...)
emit_metric("llm.cost", float(resp.headers["X-MDB-Charged-USD"]),
tags={"feature": feature, "model": model})
Watch X-MDB-Balance-USD as a fleet-wide fuel gauge. Since a depleted balance returns HTTP 402, automate top-ups or alerts well before you reach zero so prepaid credit never becomes an outage.
Plan for failure as routine
At a million requests, a 0.1% failure rate is a thousand failures a day. That is normal, so make recovery automatic: retries with exponential backoff and jitter, idempotent jobs keyed by input so retries never double-charge, dead-letter handling for poison inputs, and graceful degradation when a model or balance is unavailable.
Let the cost cap protect the fleet
One malformed prompt replicated across thousands of workers can do real damage. The per-request cost cap on Model Database stops any single call from running away, which at scale is a critical safety rail rather than a nicety. Combine it with input validation so anomalies are blocked early and cheaply.
Scale in steps and measure
Do not jump from a thousand to a million requests overnight. Increase load in stages, and at each step check three numbers: cost per request from the headers, error rate, and latency. If cost per request holds steady as volume grows, your architecture is sound. If it creeps up, find the inefficiency before you scale further.
Ready to grow? Top up credit and monitor your fleet on your dashboard, and compare model rates on the pricing page.