Upgrading Site Search With LLMs

Keyword search punishes users for not knowing the exact term on the page. "How do I get my money back" returns nothing when your docs say "refund policy." Semantic, LLM-powered search understands intent and answers in plain language. This article upgrades site search using Model Database, combining semantic retrieval with a generated answer.

We'll build hybrid search (keywords plus embeddings) and add an answer layer that responds directly while still linking to sources.

The two-layer design

Modern site search has a retrieval layer and an optional answer layer:

Retrieval finds the most relevant pages using semantic similarity, ideally blended with keyword matching.
Answering uses an LLM to synthesize a direct response from the top results, with citations back to the source pages.

Retrieval makes search forgiving; the answer layer turns a list of links into an actual answer.

Indexing content for semantic search

Split each page into passages and store them with embeddings in a vector database. Keep the URL and title with each passage so you can link back to the source page. Refresh the index when pages change. The embedding model and vector store are your choice; Model Database powers the query understanding and answer generation.

def to_passages(page):
    # split() is a chunking helper you supply; size and overlap set the window.
    return [
        {"url": page["url"], "title": page["title"], "text": p}
        for p in split(page["body"], size=800, overlap=120)
    ]
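`to_passages` relies on a `split` helper that the article leaves to you. A minimal sketch, assuming `size` and `overlap` are measured in characters (a token-based splitter would work just as well), might look like this:

```python
def split(text, size=800, overlap=120):
    """Cut text into overlapping character windows (assumed semantics)."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + size]
        if chunk.strip():  # skip windows that are only whitespace
            chunks.append(chunk)
        if start + size >= len(text):
            break  # this window already reached the end of the text
    return chunks
```

The overlap means a sentence that straddles a boundary still appears whole in at least one passage, at the cost of some duplicated text in the index.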

Understanding the query

Short, messy queries get better results if you expand them first. A fast model can rewrite a query into a cleaner search string and extract filters.

from openai import OpenAI
import json

client = OpenAI(
    base_url="https://modeldatabase.com/v1",
    api_key="mdb_live_...",
)

def parse_query(q):
    resp = client.chat.completions.create(
        model="openai/gpt-4o-mini",
        messages=[
            {"role": "system", "content":
             "Rewrite the search query for clarity and extract intent. "
             "Return JSON: {\"search\": str, \"intent\": str}."},
            {"role": "user", "content": q},
        ],
        response_format={"type": "json_object"},
        temperature=0,
    )
    return json.loads(resp.choices[0].message.content)

This step runs on every query, so it needs to be cheap and fast; a small model such as openai/gpt-4o-mini is a reasonable choice.
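Even with a JSON response format, the output can occasionally be malformed or missing a field, and a parsing error should never break search. One possible guard, assuming a hypothetical `coerce_parsed` helper that falls back to the user's raw query:

```python
import json

def coerce_parsed(raw, original_query):
    """Return {"search", "intent"}, falling back to the user's raw query."""
    fallback = {"search": original_query, "intent": "unknown"}
    try:
        data = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return fallback  # not valid JSON, or no content at all
    if not isinstance(data, dict):
        return fallback
    search = data.get("search")
    if not isinstance(search, str) or not search.strip():
        return fallback
    intent = data.get("intent")
    return {
        "search": search.strip(),
        "intent": intent if isinstance(intent, str) else "unknown",
    }
```

In `parse_query`, the final line could then return `coerce_parsed(resp.choices[0].message.content, q)` instead of calling `json.loads` directly.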

Hybrid retrieval

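Combine keyword scores (BM25) and vector similarity to get the strengths of both: keywords nail exact terms and product names, embeddings catch paraphrases. A simple weighted blend works well, provided both score lists are on a comparable scale first. Raw BM25 scores are unbounded while cosine similarities fall between -1 and 1, so an unnormalized blend lets keyword scores dominate.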

def hybrid_rank(keyword_hits, vector_hits, alpha=0.5):
    scores = {}
    for doc, s in keyword_hits:
        scores[doc] = scores.get(doc, 0) + (1 - alpha) * s
    for doc, s in vector_hits:
        scores[doc] = scores.get(doc, 0) + alpha * s
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)
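One common way to put both lists on the same scale is min-max normalization per result list before blending. The sketch below assumes a hypothetical `min_max` helper and made-up scores; rank-based fusion is another option if score distributions are very skewed.

```python
def min_max(hits):
    """Rescale (doc, score) pairs to [0, 1] within one result list."""
    if not hits:
        return []
    scores = [s for _, s in hits]
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # All scores equal: treat every hit as equally strong.
        return [(doc, 1.0) for doc, _ in hits]
    return [(doc, (s - lo) / (hi - lo)) for doc, s in hits]

def hybrid_rank(keyword_hits, vector_hits, alpha=0.5):
    # Same blend as in the article, repeated so this example runs standalone.
    scores = {}
    for doc, s in keyword_hits:
        scores[doc] = scores.get(doc, 0) + (1 - alpha) * s
    for doc, s in vector_hits:
        scores[doc] = scores.get(doc, 0) + alpha * s
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)

# Illustrative scores only: BM25 on one scale, cosine similarity on another.
keyword_hits = [("pricing", 12.0), ("refunds", 8.0), ("faq", 4.0)]
vector_hits = [("refunds", 0.75), ("returns", 0.5), ("pricing", 0.25)]
ranked = hybrid_rank(min_max(keyword_hits), min_max(vector_hits))
```

Here "refunds" ranks first because it scores well in both lists, even though "pricing" has the highest raw BM25 score.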

The answer layer

Take the top passages and have the model write a concise answer that cites its sources. Grounding the answer in retrieved text reduces the risk of invented details and lets readers check each claim against its source.

def answer(query, passages):
    ctx = "\n\n".join(
        f"[{i+1}] {p['title']} ({p['url']}): {p['text']}"
        for i, p in enumerate(passages)
    )
    resp = client.chat.completions.create(
        model="anthropic/claude-sonnet-4-6",
        messages=[
            {"role": "system", "content":
             "Answer the query from the passages only, in 2-4 sentences. "
             "Cite sources as [1], [2]. If unsure, say so and list the "
             "most relevant links. Never invent facts or URLs."},
            {"role": "user", "content":
             f"Query: {query}\n\nPassages:\n{ctx}"},
        ],
        temperature=0.1,
    )
    return resp.choices[0].message.content

Always render the underlying results too. The generated answer is a convenience; the links are the ground truth, and users should be able to click through to verify.
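To make click-through verification easy, the `[1]`, `[2]` markers in the answer can be mapped back to the passages they refer to. A small sketch, assuming a hypothetical `cited_sources` helper and the same passage order that was passed to `answer`:

```python
import re

def cited_sources(answer_text, passages):
    """Return the passages cited as [n] in the answer, in first-use order."""
    seen = []
    for match in re.finditer(r"\[(\d+)\]", answer_text):
        idx = int(match.group(1)) - 1  # citation markers are 1-based
        if 0 <= idx < len(passages) and idx not in seen:
            seen.append(idx)
    return [passages[i] for i in seen]
```

Markers that point past the end of the passage list are dropped, which also filters out citations the model made up.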

Performance and cost

Cache popular queries: many searches repeat. Cache the answer keyed by the normalized query to cut latency and tokens.
Stream the answer: set stream=True so text appears immediately while the user scans the results list.
Make answers optional: only invoke the answer layer when retrieval confidence is high; otherwise just show results and skip the model call entirely.
Tune the blend: adjust the keyword/vector weight per content type and measure click-through.
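The caching and gating tips above can be combined in a few lines. This is a sketch under stated assumptions: the in-memory dict, the `normalize` rules, and the `min_score` threshold of 0.6 (which presumes blended scores in the 0 to 1 range) are all illustrative, not recommendations.

```python
import re

_cache = {}  # in-memory only; a real deployment would want expiry

def normalize(query):
    """Lowercase, strip punctuation, and collapse whitespace for cache keys."""
    cleaned = re.sub(r"[^\w\s]", " ", query.lower())
    return " ".join(cleaned.split())

def should_answer(ranked, min_score=0.6):
    """Gate the answer layer on the top blended score (threshold is a guess)."""
    return bool(ranked) and ranked[0][1] >= min_score

def cached_answer(query, ranked, generate):
    """Return a cached answer, generate one, or None to show results only."""
    key = normalize(query)
    if key in _cache:
        return _cache[key]
    if not should_answer(ranked):
        return None  # low confidence: skip the model call entirely
    _cache[key] = generate(query)
    return _cache[key]
```

Here `ranked` is the output of `hybrid_rank` and `generate` is any callable that produces the answer text. Cached answers should be invalidated when the underlying pages are re-indexed, otherwise they can cite stale content.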

Choosing models

Query parsing is cheap work for a small model, while the answer layer benefits from stronger synthesis, so anthropic/claude-sonnet-4-6 is a good default there. Build a set of real queries with expected top results and evaluate as you tune. Because every model lives behind the same Model Database endpoint, you can shift the answer layer to a cheaper model and compare quality with a one-line change, paying only for what you use.
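A query set with expected top results can be scored with a simple hit-rate metric. The sketch below assumes a hypothetical `search_fn` that takes a query and returns a ranked list of URLs, and an evaluation set of `(query, expected_url)` pairs:

```python
def hit_rate_at_k(eval_set, search_fn, k=5):
    """Fraction of queries whose expected URL appears in the top k results."""
    if not eval_set:
        return 0.0
    hits = 0
    for query, expected_url in eval_set:
        top_urls = search_fn(query)[:k]
        if expected_url in top_urls:
            hits += 1
    return hits / len(eval_set)
```

Running this before and after changing the blend weight or swapping a model gives a quick signal on whether retrieval got better or worse; answer quality still needs a separate review.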

Start building with a key and credit from your dashboard, and see streaming and model details in the docs.
