A Practical Guide to Retrieval-Augmented Generation

Retrieval-Augmented Generation (RAG) is one of the most reliable ways to ground a language model in facts it was never trained on: your docs, your tickets, your product catalog. Instead of fine-tuning, you retrieve the relevant context at query time and hand it to the model alongside the question.

This guide walks through a working RAG pipeline against the Model Database API. The API is OpenAI-compatible, so existing OpenAI client code works once you change the base URL, API key, and model ID.

The core loop

Every RAG system, no matter how elaborate, follows the same four steps:

Ingest: split source documents into chunks and store them.
Index: embed each chunk into a vector and keep it in a vector store.
Retrieve: embed the user query, find the nearest chunks.
Generate: stuff those chunks into the prompt and ask the model.

Retrieval quality, not the model, is usually what makes or breaks the answer. A great model fed irrelevant context produces confident nonsense.
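As a rough illustration of the index and retrieve steps, the sketch below uses a toy bag-of-words vector in place of a real embedding model and a plain Python list in place of a vector store. The function names (`embed`, `cosine`, `build_index`, `retrieve`) and the sample chunks are made up for this example; in a real pipeline you would swap `embed` for calls to an embedding model.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model (assumption): word-count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def build_index(chunks):
    # Index: keep each chunk next to its vector.
    return [(chunk, embed(chunk["text"])) for chunk in chunks]

def retrieve(index, query, k=2):
    # Retrieve: embed the query, rank chunks by similarity, keep the top k.
    q = embed(query)
    ranked = sorted(index, key=lambda pair: cosine(q, pair[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

# Made-up sample chunks, for illustration only.
chunks = [
    {"title": "Refunds", "text": "Refunds are accepted within the refund window."},
    {"title": "Shipping", "text": "Orders ship in two business days."},
]
index = build_index(chunks)
top = retrieve(index, "refund window", k=1)
print(top[0]["title"])
```

The shape stays the same when the toy parts are replaced: every chunk gets a vector once at index time, and every query gets one at request time.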

Chunking that respects meaning

Naive fixed-size chunking splits sentences mid-thought. Prefer splitting on structure (headings, paragraphs) and aim for 300–800 tokens per chunk with a small overlap so context isn't severed at boundaries. Store metadata (source, title, URL) with every chunk so you can cite it later.
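One way to sketch structure-aware chunking is to pack whole paragraphs into chunks and carry a short tail of words into the next chunk as overlap. The function name `chunk_document`, its parameters, and the use of whitespace-separated words as a stand-in for tokens are all assumptions for this example; a real pipeline would count tokens with the tokenizer that matches its embedding model.

```python
def chunk_document(text, source, title, url, max_tokens=500, overlap=50):
    """Pack paragraphs into chunks of roughly max_tokens words.

    Words approximate tokens here (assumption). A single paragraph longer
    than max_tokens is kept whole, so a chunk can exceed the target.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], []
    for para in paragraphs:
        words = para.split()
        if current and len(current) + len(words) > max_tokens:
            chunks.append(current)
            # Carry the tail of the finished chunk forward as overlap.
            current = current[-overlap:] if overlap else []
        current.extend(words)
    if current:
        chunks.append(current)
    # Attach metadata to every chunk so answers can cite their source later.
    return [
        {"text": " ".join(words), "source": source, "title": title,
         "url": url, "position": i}
        for i, words in enumerate(chunks)
    ]

doc = "a1 a2 a3 a4\n\nb1 b2 b3 b4\n\nc1 c2 c3 c4"
pieces = chunk_document(doc, "faq.md", "FAQ", "https://example.com/faq",
                        max_tokens=8, overlap=2)
for piece in pieces:
    print(piece["position"], piece["text"])
```

Splitting only at paragraph boundaries means no sentence is cut mid-thought, at the cost of uneven chunk sizes.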

Generation with retrieved context

Once you have the top-k chunks, build a prompt that clearly separates instructions, context, and the question. Tell the model to answer only from the context and to say when it doesn't know.

from openai import OpenAI

client = OpenAI(
    base_url="https://modeldatabase.com/v1",
    api_key="mdb_live_...",
)

def answer(question, chunks):
    context = "\n\n".join(
        f"[{c['title']}]\n{c['text']}" for c in chunks
    )
    system = (
        "You answer strictly from the provided context. "
        "If the context does not contain the answer, say "
        "'I don't have that information.' Cite the [title] you used."
    )
    resp = client.chat.completions.create(
        model="anthropic/claude-sonnet-4-6",
        temperature=0.1,
        messages=[
            {"role": "system", "content": system},
            {"role": "user",
             "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return resp.choices[0].message.content

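# retrieved_chunks comes from your retrieval step: a list of dicts
# with "title" and "text" keys, one per top-k chunk.
print(answer("What is the refund window?", retrieved_chunks))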

The low temperature discourages the model from improvising. The explicit "say you don't know" instruction gives the model a sanctioned way out when the context falls short, which makes the setup much less prone to hallucination.

Retrieval tactics that matter

Hybrid search: combine vector similarity with keyword (BM25) matching. Pure vectors miss exact identifiers like SKUs or error codes.
Re-ranking: retrieve around 20 candidates, then have a model score each one against the query and keep the best 5. This boosts precision for the cost of one extra scoring pass.
Query rewriting: expand or rephrase vague queries before embedding them. You can use a fast, cheap model for this.

rewrite = client.chat.completions.create(
    model="openai/gpt-4o-mini",
    messages=[{
        "role": "user",
        "content": f"Rewrite this as a clear search query: {raw_query}",
    }],
).choices[0].message.content
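For the hybrid search tactic above, the two result lists have to be merged somehow. The article does not prescribe a method; one common option is reciprocal rank fusion (RRF), sketched here. The function name, the document IDs, and the constant `k=60` (a conventional default for RRF) are assumptions for illustration.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists of document IDs with reciprocal rank fusion.

    Each list contributes 1 / (k + rank) per document, so a document that
    ranks well in several lists rises to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical results from a vector search and a BM25 keyword search.
vector_hits = ["doc-a", "doc-b", "doc-c"]
keyword_hits = ["doc-c", "doc-a", "doc-d"]
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(fused)
```

RRF uses only rank positions, so it avoids having to normalize vector similarity scores and BM25 scores onto one scale.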

Citations and trust

Because you control the context, you can require the model to cite which chunk it used. Pass stable identifiers in the context and parse them out of the response to render source links. If a claim has no citation, treat it as suspect. This is the single biggest difference between a demo and something you'd put in front of customers.
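A minimal sketch of the parsing step, assuming the model cites sources as bracketed identifiers such as `[refund-policy]` and that you keep a lookup from identifier to chunk metadata. The function name `extract_citations`, the identifiers, and the URLs are hypothetical.

```python
import re

def extract_citations(answer_text, chunks_by_id):
    """Split bracketed IDs in the answer into known source links and unknown IDs."""
    links, unknown = [], []
    for cited_id in re.findall(r"\[([^\[\]]+)\]", answer_text):
        if cited_id in chunks_by_id:
            link = (cited_id, chunks_by_id[cited_id]["url"])
            if link not in links:
                links.append(link)
        elif cited_id not in unknown:
            # The model cited something that was never in the context.
            unknown.append(cited_id)
    return links, unknown

# Hypothetical lookup built from the chunks that were sent as context.
chunks_by_id = {
    "refund-policy": {"url": "https://example.com/refunds"},
    "shipping-faq": {"url": "https://example.com/shipping"},
}
reply = "Refunds are covered in the policy [refund-policy]. See also [pricing-page]."
links, unknown = extract_citations(reply, chunks_by_id)
print(links)
print(unknown)
```

An identifier that appears in the answer but not in the lookup is worth flagging in the same way as a claim with no citation at all.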

Honest limitations

RAG does not eliminate hallucination; it constrains it. If retrieval returns nothing relevant, a model may still guess. Mitigate by setting a similarity threshold and returning a graceful "no answer found" when nothing clears it. RAG also adds latency (an embedding call plus a generation call) and the context window caps how much you can include, so retrieval precision is not optional.
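The threshold mitigation could look roughly like this, assuming your retriever returns `(score, chunk)` pairs where a higher score means a closer match. The names `filter_by_threshold` and `answer_or_fallback`, the fallback message, and the `0.75` cutoff are placeholders; a workable threshold depends on your embedding model and needs tuning against real queries.

```python
NO_ANSWER = "No answer found."

def filter_by_threshold(scored_chunks, threshold=0.75, k=5):
    """Keep at most k chunks whose similarity score clears the threshold."""
    keep = [(score, chunk) for score, chunk in scored_chunks if score >= threshold]
    keep.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in keep[:k]]

def answer_or_fallback(question, scored_chunks, generate, threshold=0.75):
    """Call generate(question, chunks) only when something relevant was found."""
    relevant = filter_by_threshold(scored_chunks, threshold=threshold)
    if not relevant:
        # Skip the generation call entirely rather than let the model guess.
        return NO_ANSWER
    return generate(question, relevant)

# Stub generator for demonstration; in practice pass your own answer function.
def fake_generate(question, chunks):
    return f"answered from {len(chunks)} chunk(s)"

weak = [(0.21, {"title": "Shipping"})]
strong = [(0.91, {"title": "Refunds"}), (0.40, {"title": "Shipping"})]
print(answer_or_fallback("What is the refund window?", weak, fake_generate))
print(answer_or_fallback("What is the refund window?", strong, fake_generate))
```

Returning early also saves the latency and cost of a generation call that would have had nothing useful to work with.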

Ready to build? Grab an API key from your dashboard and read the full request reference in the docs to wire retrieval into your own stack.
