Designing for Rate Limits and Backoff

Rate limits are not an obstacle to engineer around grudgingly. They are a contract that keeps a shared service fast and fair for everyone, and a well-designed client treats them as a normal part of operation. The applications that scale smoothly are the ones that expect limits and back off gracefully.

Here is how to design a client that stays healthy under load on Model Database.

Why limits exist

Any high-traffic API caps how fast a single client can send requests so that one noisy caller cannot degrade the service for others. When you exceed your allowance you typically receive an HTTP 429 Too Many Requests response. Retrying immediately is never the right reaction, because it only makes the congestion worse.

Respect the response, do not fight it

A 429 is information, not failure. It is telling you to slow down. Many APIs include a Retry-After header indicating how long to wait; honor it when present, and otherwise fall back to exponential backoff.

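import random
import time

import openai

client = openai.OpenAI(base_url="https://modeldatabase.com/v1",
                       api_key="mdb_live_...")
MAX_ATTEMPTS = 6

def call_with_backoff(**kw):
    for attempt in range(MAX_ATTEMPTS):
        try:
            return client.chat.completions.create(**kw)
        except openai.RateLimitError as e:
            if attempt == MAX_ATTEMPTS - 1:
                break  # no point sleeping before giving up
            # Retry-After is a response header, not an attribute of the exception.
            # This handles only the delay-in-seconds form, not the HTTP-date form.
            header = e.response.headers.get("retry-after", "")
            wait = float(header) if header.isdigit() else 2 ** attempt
            time.sleep(wait + random.random())  # jitter of up to one second
    raise RuntimeError("exhausted retries")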

Exponential backoff with jitter

Two rules make backoff work. Exponential growth means each retry waits longer: 1s, 2s, 4s, 8s. Jitter means adding a small random amount so that many clients retrying at once do not synchronize into repeating waves. Without jitter, a fleet of workers can fall into lockstep and hammer the API on the same schedule forever.
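The two rules can be isolated in a small helper that is easy to test on its own. This is a minimal sketch, not a prescribed implementation: the function name backoff_delay and the base, cap, and jitter parameters are illustrative assumptions, and the cap in particular is an addition that stops the wait from growing without bound.

```python
import random

def backoff_delay(attempt, base=1.0, cap=30.0, jitter=1.0):
    """Return how many seconds to wait before retry number `attempt` (0-based).

    base, cap and jitter are illustrative defaults, not service-defined values.
    """
    # Exponential growth: base, 2*base, 4*base, ... but never above cap.
    exponential = min(cap, base * (2 ** attempt))
    # Jitter: a small random extra so parallel clients drift out of lockstep.
    return exponential + random.uniform(0, jitter)

if __name__ == "__main__":
    for attempt in range(4):
        print(attempt, round(backoff_delay(attempt), 2))
```

With the defaults, the first four waits fall in the ranges 1 to 2, 2 to 3, 4 to 5, and 8 to 9 seconds.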

Throttle proactively, not just reactively

Backoff handles limits after you hit them. Better is to avoid hitting them in the first place by capping your own send rate. A client-side semaphore or token-bucket limiter keeps you comfortably under the ceiling.

import asyncio
sem = asyncio.Semaphore(8)   # max concurrent in-flight

async def guarded(coro_fn):
    async with sem:
        return await coro_fn()

Tune the limit empirically: raise it until you start seeing occasional 429s, then settle just below that point. Most workloads have a sweet spot where throughput is high and rate errors are rare.
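The semaphore above caps how many requests are in flight; a token bucket caps how many are started per second. The sketch below is one simple way to write one, under stated assumptions: the class name, the rate and capacity parameters, and the injectable clock are all illustrative, and the class is not thread-safe.

```python
import time

class TokenBucket:
    """Allow up to `rate` requests per second, with bursts up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate              # tokens added per second
        self.capacity = capacity      # maximum tokens that can accumulate
        self.tokens = capacity        # start full so an initial burst is allowed
        self.clock = clock            # injectable for testing
        self.last = clock()

    def _refill(self):
        now = self.clock()
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now

    def try_acquire(self):
        """Take one token if available; return False instead of blocking."""
        self._refill()
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

    def wait_time(self):
        """Seconds until the next token is available (0.0 if one is ready)."""
        self._refill()
        return max(0.0, (1 - self.tokens) / self.rate)
```

A caller would check try_acquire() before each request and sleep for wait_time() when it returns False.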

Use a queue for bursty traffic

Real traffic is spiky. Rather than letting a burst slam the API, put work on a queue and drain it at a controlled rate. This smooths peaks into a steady stream, keeps you under your limit, and gives you a natural place to apply priorities, so user-facing requests jump ahead of background jobs.

Enqueue every request instead of calling directly.
Workers pull from the queue at a bounded concurrency.
Failed items go back with a delay rather than blocking the line.
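The three steps above can be sketched with asyncio's built-in PriorityQueue. Everything named here (USER, BACKGROUND, enqueue, worker, drain, and the retry_delay and max_attempts parameters) is hypothetical, and handle stands in for whatever coroutine actually calls the API. The sketch drops a job after max_attempts failures; a real system would probably log or dead-letter it.

```python
import asyncio
import itertools

USER, BACKGROUND = 0, 1      # lower number is served first
_seq = itertools.count()     # tie-breaker: keeps equal priorities first-in, first-out
_pending = set()             # holds references to delayed requeue tasks

def enqueue(queue, priority, job, attempts=0):
    queue.put_nowait((priority, next(_seq), attempts, job))

async def _requeue_later(queue, priority, job, attempts, delay):
    await asyncio.sleep(delay)              # wait without occupying a worker
    enqueue(queue, priority, job, attempts)
    queue.task_done()                       # the failed attempt is now accounted for

async def worker(queue, handle, retry_delay=1.0, max_attempts=3):
    while True:
        priority, _, attempts, job = await queue.get()
        try:
            await handle(job)
        except Exception:
            if attempts + 1 < max_attempts:
                task = asyncio.create_task(
                    _requeue_later(queue, priority, job, attempts + 1, retry_delay))
                _pending.add(task)
                task.add_done_callback(_pending.discard)
                continue                    # move on to the next job immediately
        queue.task_done()                   # succeeded, or gave up after max_attempts

async def drain(jobs, handle, concurrency=4, **worker_opts):
    """Enqueue (priority, job) pairs and process them at bounded concurrency."""
    queue = asyncio.PriorityQueue()
    for priority, job in jobs:
        enqueue(queue, priority, job)
    workers = [asyncio.create_task(worker(queue, handle, **worker_opts))
               for _ in range(concurrency)]
    await queue.join()                      # returns once every job is finished
    for w in workers:
        w.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
```

The concurrency argument plays the same role as the semaphore size earlier: it bounds how many requests can be in flight at once.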

Distinguish your error types

Not every error should be retried the same way. Build a small policy:

429 rate limit: back off and retry, it will likely succeed.
402 zero balance: retrying will not help until you top up credit, so pause and alert.
Other 4xx (for example 400 bad request): the request itself is wrong; fix it, do not retry.
5xx server error: retry with backoff, but cap the attempts.

Conflating these wastes money and hides real bugs. A 402 in particular is a billing signal, not a transient blip, so treat it differently from a 429.
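One way to keep the policy in a single place is a small function that maps a status code to an action. This is a sketch of the four rules above; the Action enum and the classify function are illustrative names, not part of any SDK.

```python
from enum import Enum

class Action(Enum):
    BACKOFF_AND_RETRY = "back off and retry"
    PAUSE_AND_ALERT = "pause and alert until credit is topped up"
    FIX_REQUEST = "fix the request, do not retry"
    RETRY_CAPPED = "retry with backoff, capped attempts"

def classify(status_code):
    """Map an HTTP error status to the retry policy described above."""
    if status_code == 429:
        return Action.BACKOFF_AND_RETRY
    if status_code == 402:
        return Action.PAUSE_AND_ALERT
    if 400 <= status_code < 500:
        return Action.FIX_REQUEST       # any other client error
    if 500 <= status_code < 600:
        return Action.RETRY_CAPPED
    raise ValueError(f"not an error status: {status_code}")
```

The order of the checks matters: 429 and 402 are themselves 4xx codes, so they have to be matched before the general client-error branch.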

Keep an eye on the headers

The X-MDB-Charged-USD and X-MDB-Balance-USD response headers help here too. A balance that is falling while 429s are rising means load is growing, and watching both lets you scale workers and top up credit before either becomes a user-visible problem.
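A small helper can turn those two headers into numbers you can alert on. This sketch assumes the header values are plain decimal strings such as "4.50", which the text does not specify; the function names and the 5.00 threshold are illustrative.

```python
from decimal import Decimal

def read_billing_headers(headers):
    """Return (charged, balance) as Decimals, or None where a header is absent.

    Assumes values are plain decimal strings; adjust if the format differs.
    """
    def parse(name):
        raw = headers.get(name)
        return Decimal(raw) if raw is not None else None
    return parse("X-MDB-Charged-USD"), parse("X-MDB-Balance-USD")

def should_top_up(balance, threshold=Decimal("5.00")):
    """True when a known balance has dropped below the alert threshold."""
    return balance is not None and balance < threshold
```

The headers argument can be any mapping with a get method. A plain dict is case-sensitive, whereas the header objects returned by most HTTP libraries are not.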

The payoff

A client built with backoff, jitter, proactive throttling, a queue, and typed error handling barely notices rate limits. It absorbs spikes, recovers from blips, and keeps spend predictable, which is exactly what you want when traffic grows.

Build it, then watch it stay smooth under load on your dashboard. Review limits and rates on the pricing page.
