
10 Ways to Cut Your LLM Costs

Elena Fischer · Jun 18, 2026 · 4 min read

LLM bills have a way of creeping up quietly. A prototype that cost a few cents a day becomes a production feature serving thousands of requests, and suddenly finance is asking questions. The good news: most LLM spend is controllable once you know where it goes.

Here are ten practical levers you can pull today on Model Database, the single OpenAI-compatible API that proxies hundreds of models with prepaid credit.

1. Pick the cheapest model that passes your eval

The biggest cost lever is model choice. A frontier model like anthropic/claude-opus-4-8 is overkill for classification or short rewrites. Start with something small such as openai/gpt-4o-mini or google/gemini-2.0-flash, run your evaluation set, and only step up if quality demands it.
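As a rough sketch of that selection loop, the helper below walks candidates from cheapest to most expensive and returns the first one that clears a quality bar. The `run_eval` callback, the threshold, and the relative cost figures are all assumptions for illustration; plug in your own eval harness and real rates.

```python
from typing import Callable, Optional


def cheapest_passing_model(
    candidates: list[tuple[str, float]],
    run_eval: Callable[[str], float],
    threshold: float,
) -> Optional[str]:
    """Return the cheapest model whose eval score meets the threshold.

    candidates: (model_id, relative_cost) pairs. The costs are placeholders
        used only for ordering, not real prices.
    run_eval: hypothetical callback that scores a model on your eval set.
    """
    for model_id, _cost in sorted(candidates, key=lambda pair: pair[1]):
        if run_eval(model_id) >= threshold:
            return model_id
    return None  # nothing passed; revisit the prompt or the threshold
```

Evaluating in cost order means you stop paying for eval runs as soon as a model is good enough.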

2. Trim your prompts

Every token in the prompt is billed on every call. Long system prompts, redundant instructions, and pasted boilerplate add up fast. Audit your prompt and cut anything that does not change the output.

3. Cap the output length

Output tokens usually cost more than input tokens. Set max_tokens to a realistic ceiling so a model cannot ramble into an expensive 2,000-token essay when you needed three sentences.

{
  "model": "openai/gpt-4o-mini",
  "max_tokens": 256,
  "messages": [{"role": "user", "content": "Summarize in 2 sentences."}]
}

4. Read the charge headers on every response

Model Database returns the exact cost of each call in the X-MDB-Charged-USD response header, alongside your remaining balance in X-MDB-Balance-USD, so you never have to guess. Log these and you get per-request cost telemetry for free.

curl -sD - https://modeldatabase.com/v1/chat/completions \
  -H "Authorization: Bearer mdb_live_..." \
  -H "Content-Type: application/json" \
  -d '{"model":"openai/gpt-4o-mini","messages":[{"role":"user","content":"hi"}]}' \
  | grep -i x-mdb
# X-MDB-Charged-USD: 0.0000123
# X-MDB-Balance-USD: 49.872
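If you call the API from Python, the same two headers can be pulled out of whatever header mapping your HTTP client exposes. This is a minimal sketch: the function name and the returned field names are made up for illustration, and only the two header names come from the example above.

```python
from typing import Mapping, Optional


def extract_charge(headers: Mapping[str, str]) -> dict[str, Optional[float]]:
    """Parse the cost headers from a response into floats.

    Header names are matched case-insensitively. A missing or malformed
    value becomes None instead of raising.
    """
    lowered = {name.lower(): value for name, value in headers.items()}

    def as_float(name: str) -> Optional[float]:
        raw = lowered.get(name)
        if raw is None:
            return None
        try:
            return float(raw)
        except ValueError:
            return None

    return {
        "charged_usd": as_float("x-mdb-charged-usd"),
        "balance_usd": as_float("x-mdb-balance-usd"),
    }
```

Write the returned values to your usual logs or metrics next to the endpoint name, and you can rank endpoints by spend later.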

5. Cache repeated work

If two users ask the same question, do not pay twice. A simple key-value cache keyed on the normalized prompt can eliminate a surprising share of traffic for FAQ-style features.
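Here is one way such a cache might look, assuming an in-memory dictionary is enough and that lowercasing plus whitespace collapsing is an acceptable normalization for your prompts. `call_model` is a hypothetical function standing in for your real API call.

```python
import hashlib
from typing import Callable


def normalize(prompt: str) -> str:
    """Lowercase and collapse whitespace so trivial variants share a key."""
    return " ".join(prompt.lower().split())


class PromptCache:
    """In-memory cache keyed on model plus normalized prompt."""

    def __init__(self, call_model: Callable[[str, str], str]) -> None:
        self._call_model = call_model  # hypothetical: (model, prompt) -> answer
        self._store: dict[str, str] = {}
        self.hits = 0

    def ask(self, model: str, prompt: str) -> str:
        raw_key = f"{model}\n{normalize(prompt)}"
        key = hashlib.sha256(raw_key.encode("utf-8")).hexdigest()
        if key in self._store:
            self.hits += 1
            return self._store[key]
        answer = self._call_model(model, prompt)  # only paid on a miss
        self._store[key] = answer
        return answer
```

In production you would likely swap the dictionary for a shared store with an expiry, so stale answers age out.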

6. Set a per-request cost cap

Runaway prompts, accidental loops, and adversarial inputs can balloon a single call. Model Database enforces a per-request cost cap so one bad request cannot drain your balance. Keep it tight for predictable spend.

7. Batch where latency allows

Background jobs such as tagging, enrichment, and summarization do not need instant responses. Group them and run them with controlled concurrency so you smooth out load and avoid retry storms that cost money.
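Controlled concurrency can be sketched with a semaphore from Python's standard asyncio library. The `worker` coroutine is a placeholder for whatever makes your model call, and the default limit of 4 is an arbitrary starting point, not a recommendation.

```python
import asyncio
from typing import Any, Awaitable, Callable, Iterable


async def run_batch(
    items: Iterable[Any],
    worker: Callable[[Any], Awaitable[Any]],
    max_concurrency: int = 4,
) -> list[Any]:
    """Run worker over items with at most max_concurrency calls in flight.

    Results come back in the same order as the input items.
    """
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(item: Any) -> Any:
        async with semaphore:  # wait here while the limit is reached
            return await worker(item)

    return await asyncio.gather(*(guarded(item) for item in items))
```

Call it with `asyncio.run(run_batch(jobs, tag_document, max_concurrency=8))`, where `tag_document` is your own coroutine.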

8. Stream to fail fast

With "stream": true you can stop generation the moment you have what you need, instead of paying for tokens you discard.

{
  "model": "anthropic/claude-sonnet-4-6",
  "stream": true,
  "messages": [{"role": "user", "content": "Draft a reply"}]
}
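On the client side, stopping early can be as simple as the loop below. It is a sketch that assumes your client library hands you the stream as an iterable of text chunks, and that closing the iterator ends the request; check how your HTTP client behaves before relying on it. The `done` predicate is a hypothetical callback you supply.

```python
from typing import Callable, Iterable


def read_until(chunks: Iterable[str], done: Callable[[str], bool]) -> str:
    """Accumulate streamed text and stop as soon as done(text) is true."""
    text = ""
    iterator = iter(chunks)
    try:
        for chunk in iterator:
            text += chunk
            if done(text):
                break  # we have what we need; stop reading
    finally:
        # Generators and many stream objects expose close(); call it if present.
        close = getattr(iterator, "close", None)
        if callable(close):
            close()
    return text
```

For example, `read_until(stream, lambda t: t.count(".") >= 3)` stops after roughly three sentences.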

9. Use retrieval instead of stuffing context

Dumping an entire document into every prompt is expensive. Retrieve only the relevant passages and pass those. You cut input tokens and often improve accuracy at the same time.
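To make the idea concrete, here is a deliberately naive retriever that ranks passages by how many words they share with the question. Real systems usually use embeddings or a search index instead; word overlap is a stand-in chosen so the sketch runs with the standard library alone.

```python
import re


def top_passages(question: str, passages: list[str], k: int = 2) -> list[str]:
    """Return the k passages sharing the most words with the question."""

    def words(text: str) -> set[str]:
        return set(re.findall(r"[a-z0-9]+", text.lower()))

    question_words = words(question)
    ranked = sorted(
        passages,
        key=lambda passage: len(question_words & words(passage)),
        reverse=True,
    )
    return ranked[:k]
```

Only the returned passages go into the prompt, so input tokens scale with `k` rather than with the size of the document.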

10. Route by difficulty

Send easy requests to a cheap model and escalate only the hard ones. A lightweight classifier or even a length heuristic can route most traffic to google/gemini-2.0-flash while reserving anthropic/claude-opus-4-8 for the genuinely complex cases.
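A length heuristic of that kind might look like the sketch below. The 600-character cutoff and the hint words are invented for illustration; tune both against your own traffic, or replace the function with a trained classifier.

```python
CHEAP_MODEL = "google/gemini-2.0-flash"
STRONG_MODEL = "anthropic/claude-opus-4-8"

# Illustrative phrases that tend to signal a harder request (assumption).
HARD_HINTS = ("prove", "refactor", "step by step", "analyze")


def pick_model(prompt: str, max_easy_chars: int = 600) -> str:
    """Route long or hint-bearing prompts to the strong model."""
    text = prompt.lower()
    if len(prompt) > max_easy_chars or any(hint in text for hint in HARD_HINTS):
        return STRONG_MODEL
    return CHEAP_MODEL
```

Log which branch each request takes, so you can check how much traffic actually lands on the cheap model.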

Putting it together

None of these tricks requires a rewrite. Start by logging X-MDB-Charged-USD for a day, find your most expensive endpoints, and apply the levers that fit. Suppose a feature makes 100,000 calls a month at 1,500 input and 300 output tokens each. Halving the prompt alone removes 75 million input tokens a month, and dropping to a smaller model lowers the rate on every token that remains, all without touching your product logic.
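To put numbers on the 100,000-call scenario, a small calculator helps. The per-million-token rates used in the note below are hypothetical placeholders, not Model Database prices; substitute the real rates from the pricing page.

```python
def monthly_cost(
    calls: int,
    input_tokens: int,
    output_tokens: int,
    input_usd_per_million: float,
    output_usd_per_million: float,
) -> float:
    """Estimate monthly spend in USD for a fixed per-call token profile."""
    usd_per_call_times_million = (
        input_tokens * input_usd_per_million
        + output_tokens * output_usd_per_million
    )
    return calls * usd_per_call_times_million / 1_000_000
```

With assumed rates of $3.00 input and $15.00 output per million tokens, `monthly_cost(100_000, 1_500, 300, 3.00, 15.00)` comes to about $900 a month. Halving the prompt and moving to a model at an assumed $0.15 and $0.60 per million, `monthly_cost(100_000, 750, 300, 0.15, 0.60)`, comes to about $29.25.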

Ready to tune your spend? Top up and watch your usage on your dashboard, or compare model rates on the pricing page.
