Cost & Scaling

Batching and Concurrency for Throughput

Elena Fischer · Mar 30, 2026 · 4 min read

When you need to process thousands of items through an LLM, the difference between a job that finishes in minutes and one that crawls for hours is how you handle batching and concurrency. Done well, you maximize throughput without tripping rate limits or wasting money on retries.

This article covers practical patterns for pushing volume through Model Database efficiently.

Batching versus concurrency

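These two terms get conflated, so let us separate them: batching packs several items into a single request, while concurrency keeps several requests in flight at the same time.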

Concurrency is where most of your throughput gains come from, because LLM calls are dominated by waiting on the network and the model, not on your CPU.

Run requests concurrently

Firing requests one at a time leaves your throughput bound by latency. With async concurrency you keep many calls in flight at once.

import asyncio, openai
client = openai.AsyncOpenAI(base_url="https://modeldatabase.com/v1",
                           api_key="mdb_live_...")

sem = asyncio.Semaphore(10)  # cap in-flight requests

async def worker(item):
    async with sem:
        r = await client.chat.completions.create(
            model="openai/gpt-4o-mini",
            max_tokens=200,
            messages=[{"role":"user","content":item}])
        return r.choices[0].message.content

async def run(items):
    return await asyncio.gather(*(worker(i) for i in items))

The semaphore is the important part. It caps how many requests run at once so you stay within rate limits instead of flooding the API and triggering errors.
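To watch the cap work without spending anything, here is a minimal sketch that swaps the API call for a stub and records the peak number of overlapping calls. The names `fake_call`, `bounded_run`, and `stats` are hypothetical and exist only for this illustration; the 10 ms sleep stands in for network and model latency.

```python
import asyncio

async def fake_call(item, stats):
    # Stand-in for a real API request; tracks how many calls overlap.
    stats["in_flight"] += 1
    stats["peak"] = max(stats["peak"], stats["in_flight"])
    await asyncio.sleep(0.01)  # simulated network and model latency
    stats["in_flight"] -= 1
    return item.upper()

async def bounded_run(items, limit):
    sem = asyncio.Semaphore(limit)  # at most `limit` calls in flight
    stats = {"in_flight": 0, "peak": 0}

    async def worker(item):
        async with sem:
            return await fake_call(item, stats)

    # gather returns results in the same order as the inputs
    results = await asyncio.gather(*(worker(i) for i in items))
    return results, stats["peak"]

results, peak = asyncio.run(bounded_run([f"item {n}" for n in range(50)], limit=10))
print(len(results), "results, peak in flight:", peak)
```

The peak never exceeds the limit, however many items are queued. Changing `limit` is the only knob you need to turn when tuning.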

Pack multiple items into one prompt, carefully

For very short items you can sometimes process several in a single call by asking for a structured list back. This amortizes the system prompt across many items.

{
  "model": "google/gemini-2.0-flash",
  "messages": [{"role":"user","content":
    "Classify each line as spam or ham. Return JSON array.\n1. ...\n2. ...\n3. ..."}]
}

Use this judiciously. Oversized batches risk truncated output, harder error recovery (one bad item can spoil the whole batch), and bumping into the per-request cost cap. For most workloads, modest concurrency of single-item calls is simpler and more robust.
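If you do pack items, it helps to check the reply before trusting it. The sketch below assumes the model returns a bare JSON array of strings, one label per input line; `parse_labels` is a hypothetical helper, not part of any SDK. It returns `None` whenever the reply is truncated, the wrong length, or contains an unexpected label.

```python
import json

def parse_labels(raw, expected_count, allowed=("spam", "ham")):
    """Return a list of labels, or None if the reply is unusable."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None  # truncated or non-JSON output
    if not isinstance(data, list) or len(data) != expected_count:
        return None  # the model dropped or added items
    labels = [str(x).strip().lower() for x in data]
    if any(label not in allowed for label in labels):
        return None  # a label outside the expected set
    return labels

print(parse_labels('["spam", "ham", "ham"]', 3))  # ['spam', 'ham', 'ham']
print(parse_labels('["spam", "ham"', 3))          # None (truncated JSON)
```

When the helper returns `None`, one reasonable fallback is to resend that batch as single-item calls so one bad item does not cost you the rest.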

Tune the concurrency level with the headers

There is an ideal number of in-flight requests for your account and workload. Find it empirically: raise the semaphore limit until throughput stops improving or errors appear, then back off. The charge headers let you confirm you are not paying more per item as you scale up.

X-MDB-Charged-USD: 0.0003
X-MDB-Balance-USD: 92.41
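One way to use these headers while tuning is to average the charge across a run and compare that figure between concurrency levels. This is a sketch under assumptions: it takes the response headers as plain dictionaries (with the `openai` Python SDK they can be read through `with_raw_response`), the helper names are hypothetical, and the sample values are made up.

```python
def charged_usd(headers):
    """Read the per-request charge from a mapping of response headers."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    lowered = {k.lower(): v for k, v in headers.items()}
    return float(lowered.get("x-mdb-charged-usd", 0.0))

def cost_per_item(all_headers):
    """Average charge across a run; compare this between concurrency levels."""
    charges = [charged_usd(h) for h in all_headers]
    return sum(charges) / len(charges) if charges else 0.0

# Made-up sample headers from two responses in one run.
run_headers = [
    {"X-MDB-Charged-USD": "0.0003", "X-MDB-Balance-USD": "92.41"},
    {"X-MDB-Charged-USD": "0.0005", "X-MDB-Balance-USD": "92.4095"},
]
print(f"{cost_per_item(run_headers):.4f} USD per item")
```

If the per-item figure climbs as you raise the limit, retries are the likely cause, and that is the signal to back off.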

Make retries cheap and safe

At volume, occasional failures are normal. Wrap each call in retry logic with exponential backoff and jitter so a transient error does not become a thundering herd.

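import asyncio
import random

async def with_retry(coro_fn, tries=4):
    # coro_fn is a zero-argument callable that returns a fresh coroutine
    for n in range(tries):
        try:
            return await coro_fn()
        except Exception:
            if n == tries - 1:
                raise  # out of attempts: surface the last error
            # exponential backoff (1s, 2s, 4s, ...) plus up to 1s of jitter
            await asyncio.sleep((2 ** n) + random.random())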

Make your work items idempotent and key your results by input, so a retry never double-charges you for output you already have.
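A result cache keyed on input can be as small as a dictionary and a hash. The sketch below is illustrative: `cache_key`, `cached_call`, and the `fake_call` stub are hypothetical names, and the in-memory dict would need to be swapped for a file or database if the job must survive a restart.

```python
import asyncio
import hashlib

def cache_key(model, item):
    # Hash the model and input together so the same text sent to a
    # different model gets its own entry.
    return hashlib.sha256(f"{model}\n{item}".encode("utf-8")).hexdigest()

async def cached_call(cache, model, item, call_fn):
    key = cache_key(model, item)
    if key not in cache:
        cache[key] = await call_fn(item)  # only pay on a cache miss
    return cache[key]

async def demo():
    calls = []

    async def fake_call(item):  # hypothetical stand-in for the API request
        calls.append(item)
        return item[::-1]

    cache = {}
    first = await cached_call(cache, "openai/gpt-4o-mini", "hello", fake_call)
    second = await cached_call(cache, "openai/gpt-4o-mini", "hello", fake_call)
    return first, second, len(calls)

print(asyncio.run(demo()))  # ('olleh', 'olleh', 1)
```

The second lookup is served from the cache, so the stub is called only once. Two identical items dispatched at the same moment can still both miss, so deduplicating the input list before dispatch is a cheap complement.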

Mind the cost of parallel failure

Concurrency multiplies mistakes. If a bug sends a malformed prompt, ten parallel workers send it ten times. Validate inputs before dispatch, keep max_tokens tight, and rely on the per-request cost cap as a backstop so no single call in the batch can run away.
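Validation before dispatch can be a plain function that splits the input into items worth sending and items to reject. This is a sketch: `validate_items` is a hypothetical helper, and the `max_chars=4000` default is an arbitrary placeholder you would set from your own prompt budget.

```python
def validate_items(items, max_chars=4000):
    """Split items into (valid, rejected) before any request is sent."""
    valid, rejected = [], []
    for index, item in enumerate(items):
        if not isinstance(item, str):
            rejected.append((index, "not a string"))
        elif not item.strip():
            rejected.append((index, "empty"))
        elif len(item) > max_chars:
            rejected.append((index, "too long"))
        else:
            valid.append(item)
    return valid, rejected

valid, rejected = validate_items(["buy now!!!", "", None, "x" * 5000])
print(valid)     # ['buy now!!!']
print(rejected)  # [(1, 'empty'), (2, 'not a string'), (3, 'too long')]
```

Only `valid` goes to the workers; logging `rejected` with its indexes makes it easy to trace a bad record back to its source.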

A simple recipe

For most batch jobs: use single-item calls, a semaphore around 5 to 20 in flight, retries with backoff, a result cache keyed on input, and a tight max_tokens. That combination gets you high throughput, predictable cost, and clean recovery without much code.

Spin up a batch job and watch throughput and spend together on your dashboard. Compare model rates first on the pricing page.
