A Practical Guide to Retrieval-Augmented Generation

Devon Pratt · Jun 14, 2026 · 4 min read

Retrieval-Augmented Generation (RAG) is the most reliable way to ground a language model in facts it was never trained on: your docs, your tickets, your product catalog. Instead of fine-tuning, you retrieve the relevant context at query time and hand it to the model alongside the question.

This guide walks through a working RAG pipeline against the Model Database API, which is OpenAI-compatible, so you only change the base URL and your model ID.

The core loop

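Every RAG system, no matter how elaborate, runs the same four steps: chunk the source documents, embed and index the chunks, retrieve the top-k chunks for each query, and generate an answer from them.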

The retrieval quality, not the model, is usually what makes or breaks the answer. A great model fed irrelevant context produces confident nonsense.

Chunking that respects meaning

Naive fixed-size chunking splits sentences mid-thought. Prefer splitting on structure (headings, paragraphs) and aim for 300–800 tokens per chunk with a small overlap so context isn't severed at boundaries. Store metadata (source, title, URL) with every chunk so you can cite it later.
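As a rough illustration of the chunking advice above, here is a minimal sketch that splits on paragraph boundaries, packs paragraphs into chunks, and carries a small overlap forward. The function name `chunk_document`, its parameters, and the `id` format are hypothetical, and token counts are approximated by whitespace-separated words; a real pipeline would count with the tokenizer of its embedding model.

```python
import re

def chunk_document(text, source, title, url, max_tokens=500, overlap_tokens=50):
    """Pack whole paragraphs into chunks of roughly max_tokens words.

    Assumption: one whitespace-separated word stands in for one token.
    A single paragraph longer than max_tokens is kept whole, so chunks
    can exceed the limit in that case.
    """
    # Split on blank lines so chunk boundaries fall between paragraphs.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

    chunks = []
    current = []  # words accumulated for the chunk being built
    for para in paragraphs:
        words = para.split()
        if current and len(current) + len(words) > max_tokens:
            chunks.append(" ".join(current))
            # Seed the next chunk with the tail of the previous one.
            current = current[-overlap_tokens:] if overlap_tokens > 0 else []
        current.extend(words)
    if current:
        chunks.append(" ".join(current))

    # Attach metadata to every chunk so answers can cite their source.
    return [
        {"id": f"{source}#{i}", "source": source, "title": title,
         "url": url, "text": chunk_text}
        for i, chunk_text in enumerate(chunks)
    ]
```

Splitting on headings as well as blank lines would follow the same pattern; only the regular expression changes.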

Generation with retrieved context

Once you have the top-k chunks, build a prompt that clearly separates instructions, context, and the question. Tell the model to answer only from the context and to say when it doesn't know.

from openai import OpenAI

client = OpenAI(
    base_url="https://modeldatabase.com/v1",
    api_key="mdb_live_...",
)

def answer(question, chunks):
    context = "\n\n".join(
        f"[{c['title']}]\n{c['text']}" for c in chunks
    )
    system = (
        "You answer strictly from the provided context. "
        "If the context does not contain the answer, say "
        "'I don't have that information.' Cite the [title] you used."
    )
    resp = client.chat.completions.create(
        model="anthropic/claude-sonnet-4-6",
        temperature=0.1,
        messages=[
            {"role": "system", "content": system},
            {"role": "user",
             "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return resp.choices[0].message.content

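# retrieved_chunks: the top-k results from your retriever, each a dict
# with at least "title" and "text" keys.
print(answer("What is the refund window?", retrieved_chunks))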

The low temperature keeps the model from improvising. The explicit "say you don't know" instruction gives it a sanctioned way out when the context falls short, which makes a hallucination-prone setup considerably more trustworthy.

Retrieval tactics that matter

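# Query rewriting: have a small model turn the user's raw input into a
# clear search query before retrieval. Reuses `client` from the example
# above; `raw_query` is the user's original text.
rewrite = client.chat.completions.create(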
    model="openai/gpt-4o-mini",
    messages=[{
        "role": "user",
        "content": f"Rewrite this as a clear search query: {raw_query}",
    }],
).choices[0].message.content

Citations and trust

Because you control the context, you can require the model to cite which chunk it used. Pass stable identifiers in the context and parse them out of the response to render source links. If a claim has no citation, treat it as suspect. This is the single biggest difference between a demo and something you'd put in front of customers.
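One way to parse citations out of a response might look like the sketch below. It assumes each chunk was labelled in the context with a bracketed stable identifier such as `[refund-policy]`, and that the model places the citation before the sentence's closing punctuation. The helper names `extract_citations` and `uncited_sentences` and the `chunks_by_id` mapping are hypothetical, not part of any API.

```python
import re

# Matches a bracketed identifier such as [refund-policy].
CITATION = re.compile(r"\[([^\[\]]+)\]")

def extract_citations(answer_text, chunks_by_id):
    """Return (sources, unknown_ids) for the identifiers cited in the answer."""
    sources, unknown, seen = [], [], set()
    for cited_id in CITATION.findall(answer_text):
        if cited_id in seen:
            continue
        seen.add(cited_id)
        chunk = chunks_by_id.get(cited_id)
        if chunk is None:
            # The model cited something that was never in the context.
            unknown.append(cited_id)
        else:
            sources.append({"id": cited_id, "title": chunk["title"],
                            "url": chunk["url"]})
    return sources, unknown

def uncited_sentences(answer_text):
    """Return sentences with no citation, to be treated as suspect."""
    # Naive sentence split on ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", answer_text.strip())
    return [s for s in sentences if s and not CITATION.search(s)]
```

The `sources` list can be rendered as links, while `unknown` identifiers and uncited sentences are candidates for a warning or a retry.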

Honest limitations

RAG does not eliminate hallucination; it constrains it. If retrieval returns nothing relevant, the model may still guess. Mitigate this by setting a similarity threshold and returning a graceful "no answer found" when nothing clears it. RAG also adds latency (an embedding call and a vector search ahead of the generation call), and the context window caps how much you can include, so retrieval precision is not optional.
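The similarity threshold can be sketched as follows, assuming an in-memory index of `(vector, chunk)` pairs and cosine similarity as the score. The names `retrieve`, `answer_or_fallback`, and `NO_ANSWER` are hypothetical, and the `min_score` default of 0.75 is an arbitrary placeholder that would need tuning against your own embedding model and data.

```python
import math

NO_ANSWER = "No answer found in the knowledge base."

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def retrieve(query_vec, index, k=4, min_score=0.75):
    """Return up to k chunks whose similarity to the query clears min_score."""
    scored = [(cosine(query_vec, vec), chunk) for vec, chunk in index]
    scored = [pair for pair in scored if pair[0] >= min_score]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:k]]

def answer_or_fallback(question, query_vec, index, generate):
    """Call generate(question, chunks) only when retrieval found something."""
    chunks = retrieve(query_vec, index)
    if not chunks:
        # Nothing cleared the threshold: skip generation rather than guess.
        return NO_ANSWER
    return generate(question, chunks)
```

Here `generate` could be the `answer` function from the generation section. Returning early also saves the cost of a generation call on queries the knowledge base cannot answer.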

Ready to build? Grab an API key from your dashboard and read the full request reference in the docs to wire retrieval into your own stack.
