Keyword search punishes users for not knowing the exact term on the page. "How do I get my money back" returns nothing when your docs say "refund policy." Semantic, LLM-powered search understands intent and answers in plain language. This article upgrades site search using Model Database, combining semantic retrieval with a generated answer.
We'll build hybrid search (keywords plus embeddings) and add an answer layer that responds directly while still linking to sources.
The two-layer design
Modern site search has a retrieval layer and an optional answer layer:
- Retrieval finds the most relevant pages using semantic similarity, ideally blended with keyword matching.
- Answering uses an LLM to synthesize a direct response from the top results, with citations back to the source pages.
Retrieval makes search forgiving; the answer layer turns a list of links into an actual answer.
Indexing content for semantic search
Split each page into passages and store them with embeddings in a vector database. Keep the URL and title with each passage so you can link back, and refresh the index when pages change. The embedding model and vector store are your choice; Model Database powers the query understanding and answer generation.
def to_passages(page):
    # split() is a chunking helper you supply; it is not defined in this article.
    return [
        {"url": page["url"], "title": page["title"], "text": p}
        for p in split(page["body"], size=800, overlap=120)
    ]
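The `split` helper used above is left undefined. A minimal sketch follows, assuming `size` and `overlap` are measured in characters (the article does not say; a token-based splitter would be an equally reasonable choice):

```python
def split(text, size=800, overlap=120):
    """Cut text into overlapping windows. Assumption: units are characters."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap  # how far each window advances
    passages = []
    for start in range(0, len(text), step):
        chunk = text[start:start + size].strip()
        if chunk:
            passages.append(chunk)
        if start + size >= len(text):
            break  # this window already reached the end of the text
    return passages
```

The overlap means a sentence that straddles a boundary appears whole in at least one passage, at the cost of some duplicated text in the index.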
Understanding the query
Short, messy queries get better results if you clean them up first. A fast model can rewrite a query into a clearer search string and extract the user's intent.
from openai import OpenAI
import json

client = OpenAI(
    base_url="https://modeldatabase.com/v1",
    api_key="mdb_live_...",
)

def parse_query(q):
    resp = client.chat.completions.create(
        model="openai/gpt-4o-mini",
        messages=[
            {"role": "system", "content":
                "Rewrite the search query for clarity and extract intent. "
                "Return JSON: {\"search\": str, \"intent\": str}."},
            {"role": "user", "content": q},
        ],
        response_format={"type": "json_object"},
        temperature=0,
    )
    return json.loads(resp.choices[0].message.content)
This step runs on every query, so it needs to be fast and cheap; a small model like openai/gpt-4o-mini is a sensible choice.
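Because query parsing sits in front of every search, a malformed model reply should degrade to a plain search rather than an error. The sketch below shows one way to do that; `safe_parse` and the `"unknown"` intent label are hypothetical names, not part of the article's code:

```python
import json

def safe_parse(raw, original_query):
    """Parse the model's JSON reply; fall back to the user's own query on any problem."""
    fallback = {"search": original_query, "intent": "unknown"}
    try:
        data = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return fallback  # not JSON at all, or no content returned
    if not isinstance(data, dict):
        return fallback  # valid JSON, but not the expected object
    search = data.get("search")
    if not isinstance(search, str) or not search.strip():
        return fallback  # missing or empty rewritten query
    intent = data.get("intent")
    return {
        "search": search.strip(),
        "intent": intent if isinstance(intent, str) else "unknown",
    }
```

In `parse_query`, this would replace the bare `json.loads(...)` call, with `q` passed as the fallback query.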
Hybrid retrieval
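Combine keyword scores (BM25) and vector similarity for the best of both worlds: keywords nail exact terms and product names, embeddings catch paraphrases. A simple weighted blend works well, provided both score sets are on a comparable scale first: raw BM25 scores are unbounded, while cosine similarity is not, so an unnormalized blend lets one side dominate.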
def hybrid_rank(keyword_hits, vector_hits, alpha=0.5):
    # Inputs are (doc, score) pairs with scores on a comparable scale.
    # alpha weights the vector side; (1 - alpha) weights the keyword side.
    scores = {}
    for doc, s in keyword_hits:
        scores[doc] = scores.get(doc, 0) + (1 - alpha) * s
    for doc, s in vector_hits:
        scores[doc] = scores.get(doc, 0) + alpha * s
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)
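The blend above adds raw scores, which only makes sense when both lists share a scale. One possible way to get there is min-max normalization per result list, sketched below. `min_max` and `hybrid_rank_normalized` are illustrative names, and min-max is one option among several (rank-based fusion is a common alternative):

```python
def min_max(hits):
    """Rescale (doc, score) pairs to the range [0, 1] within one result list."""
    if not hits:
        return []
    scores = [s for _, s in hits]
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # All scores equal: treat every hit as equally strong.
        return [(doc, 1.0) for doc, _ in hits]
    return [(doc, (s - lo) / (hi - lo)) for doc, s in hits]


def hybrid_rank_normalized(keyword_hits, vector_hits, alpha=0.5):
    """Blend normalized keyword and vector scores; alpha weights the vector side."""
    scores = {}
    for doc, s in min_max(keyword_hits):
        scores[doc] = scores.get(doc, 0.0) + (1 - alpha) * s
    for doc, s in min_max(vector_hits):
        scores[doc] = scores.get(doc, 0.0) + alpha * s
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)


# Example: BM25-style scores on the left, cosine-style scores on the right.
ranked = hybrid_rank_normalized(
    [("a", 12.0), ("b", 3.0)],
    [("b", 0.9), ("c", 0.5)],
    alpha=0.7,
)
```

Note that min-max maps the weakest hit in each list to 0, so a document that appears only at the bottom of one list contributes nothing from that side.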
The answer layer
Take the top passages and have the model write a concise answer that cites its sources. Grounding the answer in retrieved text keeps it tied to what your pages actually say and makes every claim checkable against a cited passage.
def answer(query, passages):
    ctx = "\n\n".join(
        f"[{i+1}] {p['title']} ({p['url']}): {p['text']}"
        for i, p in enumerate(passages)
    )
    resp = client.chat.completions.create(
        model="anthropic/claude-sonnet-4-6",
        messages=[
            {"role": "system", "content":
                "Answer the query from the passages only, in 2-4 sentences. "
                "Cite sources as [1], [2]. If unsure, say so and list the "
                "most relevant links. Never invent facts or URLs."},
            {"role": "user", "content":
                f"Query: {query}\n\nPassages:\n{ctx}"},
        ],
        temperature=0.1,
    )
    return resp.choices[0].message.content
Always render the underlying results too. The generated answer is a convenience; the links are the ground truth, and users should be able to click through to verify.
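Since the prompt asks for citations as [1], [2], it is cheap to check that each marker points at a passage that was actually supplied before rendering source links. A small sketch, with `check_citations` as a hypothetical helper name:

```python
import re

def check_citations(answer_text, passages):
    """Return (cited_passages, invalid_numbers) for the [n] markers in the answer."""
    cited, invalid = [], []
    seen = set()
    for match in re.finditer(r"\[(\d+)\]", answer_text):
        n = int(match.group(1))
        if n in seen:
            continue  # report each citation number once
        seen.add(n)
        if 1 <= n <= len(passages):
            cited.append(passages[n - 1])  # markers are 1-based
        else:
            invalid.append(n)  # the model cited a passage it was never given
    return cited, invalid
```

If `invalid` is non-empty, one option is to drop the generated answer and show only the results list. This check confirms that a citation exists, not that the cited passage supports the claim.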
Performance and cost
- Cache popular queries: many searches repeat. Cache the answer keyed by the normalized query to cut latency and tokens.
- Stream the answer: set stream=True so text appears immediately while the user scans the results list.
- Make answers optional: only invoke the answer layer when retrieval confidence is high; otherwise just show results and skip the model call entirely.
- Tune the blend: adjust the keyword/vector weight per content type and measure click-through.
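Two of the points above, caching by normalized query and skipping the answer when retrieval confidence is low, can be sketched together. Everything here is illustrative: `normalize_query`, `search`, the in-memory dict, and the `min_score=0.6` threshold are assumptions, and the threshold presumes scores normalized to [0, 1].

```python
_answer_cache = {}  # assumption: a simple in-process dict; no expiry or size limit


def normalize_query(q):
    """Lowercase and collapse whitespace so trivially different queries share a key."""
    return " ".join(q.lower().split())


def search(query, retrieve, generate, min_score=0.6):
    """retrieve(query) -> [(passage, score)] sorted best first;
    generate(query, passages) -> answer string."""
    ranked = retrieve(query)
    result = {"results": ranked, "answer": None}
    if not ranked or ranked[0][1] < min_score:
        return result  # low confidence: show links only, no model call
    key = normalize_query(query)
    if key not in _answer_cache:
        _answer_cache[key] = generate(query, [p for p, _ in ranked])
    result["answer"] = _answer_cache[key]
    return result
```

A real deployment would likely want expiry and invalidation when the index is refreshed, so cached answers do not outlive the pages they cite.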
Choosing models
Query parsing is cheap work for a small model, while the answer layer benefits from stronger synthesis, so anthropic/claude-sonnet-4-6 is a good default there. Build a set of real queries with expected top results and evaluate as you tune. Because every model lives behind the same Model Database endpoint, you can shift the answer layer to a cheaper model and compare quality with a one-line change, paying only for what you use.
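An evaluation set of real queries with expected top results can be scored with a few lines. The sketch below assumes one expected URL per query and a `search_fn` that returns a ranked list of URLs; `evaluate` and the metric names are illustrative choices, not something the article prescribes:

```python
def evaluate(cases, search_fn, k=5):
    """cases: [(query, expected_url)]; search_fn(query) -> ranked list of URLs.

    Returns recall@k and mean reciprocal rank (computed over the top k only).
    """
    hits, rr_total = 0, 0.0
    for query, expected in cases:
        top = search_fn(query)[:k]
        if expected in top:
            hits += 1
            rr_total += 1.0 / (top.index(expected) + 1)  # rank is 1-based
    n = len(cases)
    return {
        "recall_at_k": hits / n if n else 0.0,
        "mrr": rr_total / n if n else 0.0,
    }
```

Running this before and after a change to the blend weight or the answer model gives a concrete number to compare, rather than a spot check of a few queries.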
Start building with a key and credit from your dashboard, and see streaming and model details in the docs.