Cost & Scaling

Designing for Rate Limits and Backoff

Elena Fischer | Feb 18, 2026 | 4 min read

Rate limits are not an obstacle to engineer around grudgingly. They are a contract that keeps a shared service fast and fair for everyone, and a well-designed client treats them as a normal part of operation. The applications that scale smoothly are the ones that expect limits and back off gracefully.

Here is how to design a client that stays healthy under load on Model Database.

Why limits exist

Any high-traffic API caps how fast a single client can send requests so that one noisy caller cannot degrade the service for others. When you exceed your allowance you typically receive an HTTP 429 Too Many Requests response. Never react by retrying immediately; that only makes the congestion worse.

Respect the response, do not fight it

A 429 is information, not failure. It is telling you to slow down. Many APIs include a Retry-After header indicating how long to wait; honor it when present, and otherwise fall back to exponential backoff.

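import random
import time

import openai

client = openai.OpenAI(base_url="https://modeldatabase.com/v1",
                       api_key="mdb_live_...")

def call_with_backoff(**kw):
    attempts = 6
    for attempt in range(attempts):
        try:
            return client.chat.completions.create(**kw)
        except openai.RateLimitError as e:
            if attempt == attempts - 1:
                raise RuntimeError("exhausted retries") from e
            # Prefer the server's Retry-After (in seconds) when present;
            # otherwise double the wait each attempt: 1s, 2s, 4s, ...
            try:
                wait = float(e.response.headers.get("retry-after"))
            except (TypeError, ValueError):
                wait = 2 ** attempt
            time.sleep(wait + random.random())  # jitter de-synchronizes clients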
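
In HTTP generally, a Retry-After value can be either a number of seconds or an HTTP date. Which form Model Database sends is not stated here, so the helper below is a sketch that accepts both and returns None when the header is missing or unreadable. The name parse_retry_after and its now parameter are hypothetical, introduced only for illustration.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def parse_retry_after(value, now=None):
    """Return seconds to wait, or None if the header is absent or unparseable."""
    if not value:
        return None
    value = value.strip()
    if value.isdigit():  # delta-seconds form, e.g. "120"
        return float(value)
    try:
        when = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:  # treat a naive date as UTC
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    # A date in the past means "retry now", so never return a negative wait.
    return max(0.0, (when - now).total_seconds())
```

A caller can fall back to exponential backoff whenever this returns None.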

Exponential backoff with jitter

Two rules make backoff work. Exponential growth means each retry waits twice as long as the one before: 1s, 2s, 4s, 8s. Jitter means adding a small random amount to each wait so that many clients retrying at once do not synchronize into repeating waves. Without jitter, a fleet of workers can fall into lockstep and keep hammering the API on the same schedule.
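
The two rules fit in a few lines. This is a minimal sketch, not a prescribed schedule: the function name backoff_delay, the 30-second cap, and the one-second jitter window are assumptions chosen for illustration. The cap keeps late retries from waiting unboundedly long.

```python
import random


def backoff_delay(attempt, base=1.0, cap=30.0, jitter=1.0, rng=random):
    """Seconds to wait before retry number `attempt` (0-based)."""
    # Exponential growth: base, 2*base, 4*base, ... but never above `cap`.
    exponential = min(cap, base * 2 ** attempt)
    # Jitter: add a random extra in [0, jitter] so clients drift apart.
    return exponential + rng.uniform(0, jitter)
```

With the defaults, attempts 0 through 3 wait roughly 1, 2, 4, and 8 seconds, each plus up to one second of jitter.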

Throttle proactively, not just reactively

Backoff handles limits after you hit them. It is better to avoid hitting them in the first place by capping your own send rate. A client-side semaphore caps how many requests are in flight at once, and a token-bucket limiter caps how many you send per second; either keeps you comfortably under the ceiling.

import asyncio
sem = asyncio.Semaphore(8)   # max concurrent in-flight

async def guarded(coro_fn):
    async with sem:
        return await coro_fn()

Tune the limit empirically: raise it until you start seeing occasional 429s, then settle just below that point. Most workloads have a sweet spot where throughput is high and rate errors are rare.
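
A semaphore bounds concurrency, not requests per second. Where the limit is expressed as a rate, a token bucket is the usual complement. The class below is a sketch under assumptions: the TokenBucket name and its rate and capacity parameters are illustrative, and the right numbers depend on your plan's actual limits.

```python
import asyncio
import time


class TokenBucket:
    """Allow about `rate` requests per second, with bursts up to `capacity`."""

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)  # start full so a short burst is allowed
        self.updated = time.monotonic()
        self._lock = asyncio.Lock()

    async def acquire(self):
        # The lock makes waiters take tokens one at a time, in order.
        async with self._lock:
            while True:
                now = time.monotonic()
                # Refill in proportion to elapsed time, never above capacity.
                self.tokens = min(self.capacity,
                                  self.tokens + (now - self.updated) * self.rate)
                self.updated = now
                if self.tokens >= 1:
                    self.tokens -= 1
                    return
                # Sleep just long enough for one token to accumulate.
                await asyncio.sleep((1 - self.tokens) / self.rate)
```

Create the bucket inside the running event loop and call await bucket.acquire() before each request. It composes with the semaphore: one bounds the rate, the other bounds concurrency.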

Use a queue for bursty traffic

Real traffic is spiky. Rather than letting a burst slam the API, put work on a queue and drain it at a controlled rate. This smooths peaks into a steady stream, keeps you under your limit, and gives you a natural place to apply priorities, so user-facing requests jump ahead of background jobs.
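
One way to sketch the queue idea is with asyncio.PriorityQueue and a single worker that paces itself. The names USER, BACKGROUND, submit, and drain are hypothetical, and a production system would more likely use a durable queue than an in-process one.

```python
import asyncio
import itertools

USER, BACKGROUND = 0, 1     # lower number drains first
_order = itertools.count()  # tie-breaker keeps FIFO order within a priority


async def submit(queue, priority, job):
    await queue.put((priority, next(_order), job))


async def drain(queue, handle, per_second):
    """Pull jobs in priority order, pausing 1/per_second between jobs."""
    while True:
        _, _, job = await queue.get()
        try:
            await handle(job)
        finally:
            queue.task_done()
        await asyncio.sleep(1 / per_second)
```

Start drain as a background task with asyncio.create_task, enqueue work with submit, and await queue.join() when you need to know the backlog is empty.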

Distinguish your error types

Not every error should be retried the same way. Build a small policy that gives each class of error its own handling.

Conflating these wastes money and hides real bugs. A 402 in particular is a billing signal, not a transient blip, so treat it differently from a 429.
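
As an illustration, such a policy can be a single function from status code to action. Only the 429 and 402 rows follow from the text; treating 5xx responses as retryable and other 4xx responses as non-retryable is a common convention assumed here, not something documented for Model Database. The Action and classify names are hypothetical.

```python
from enum import Enum


class Action(Enum):
    RETRY_WITH_BACKOFF = "retry_with_backoff"
    STOP_AND_ALERT = "stop_and_alert"
    FAIL_FAST = "fail_fast"


def classify(status):
    """Map an HTTP status code to a handling policy (assumed mapping)."""
    if status == 429:
        return Action.RETRY_WITH_BACKOFF  # rate limited: slow down, then retry
    if status == 402:
        return Action.STOP_AND_ALERT      # billing signal: retrying cannot help
    if 500 <= status < 600:
        return Action.RETRY_WITH_BACKOFF  # assumed transient server fault
    return Action.FAIL_FAST               # other codes: fix the request instead
```

Keeping the mapping in one place makes it easy to audit and to adjust once you know how the service actually behaves.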

Keep an eye on the headers

The X-MDB-Charged-USD and X-MDB-Balance-USD headers help here too. A falling balance combined with rising 429s tells a story about load, and watching both lets you scale workers and top up credit before either becomes a user-visible problem.
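
A small helper can turn those headers into numbers you can chart or alert on. This sketch assumes the header values are plain decimal strings such as "12.50", which the text does not confirm. The function names and the thresholds are hypothetical.

```python
def read_spend_headers(headers):
    """Return (charged_usd, balance_usd); each is None if absent or malformed."""
    def as_float(name):
        try:
            return float(headers.get(name))
        except (TypeError, ValueError):
            return None
    return as_float("X-MDB-Charged-USD"), as_float("X-MDB-Balance-USD")


def health_signals(balance_usd, recent_429s, min_balance=5.0, max_429s=10):
    """Flag the two conditions worth acting on early (thresholds are assumed)."""
    return {
        "low_balance": balance_usd is not None and balance_usd < min_balance,
        "rate_limited": recent_429s > max_429s,
    }
```

With the openai Python SDK, response headers are reachable through client.chat.completions.with_raw_response.create(...), which returns an object exposing .headers and .parse(). Pass that headers object straight to read_spend_headers.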

The payoff

A client built with backoff, jitter, proactive throttling, a queue, and typed error handling barely notices rate limits. It absorbs spikes, recovers from blips, and keeps spend predictable, which is exactly what you want when traffic grows.

Build it, then use your dashboard to watch it stay smooth under load. Review limits and rates on the pricing page.
