Keyword search matches strings; semantic search matches meaning. Ask "how do I get my money back" and a semantic system finds the "refund policy" page even though they share no words. The technology underneath is embeddings: numeric vectors that place similar meanings near each other in space. This article covers the fundamentals you need to build semantic search.
What an embedding is
An embedding model maps text to a fixed-length vector, say 1,536 numbers. Texts with similar meaning land close together; unrelated texts land far apart. "Distance" is usually measured with cosine similarity, which compares the angle between vectors. Search becomes a geometry problem: embed the query, find the nearest stored vectors.
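For reference, here is the standard definition of cosine similarity, followed by a small worked example. The two-dimensional vectors are made up purely for illustration; real embeddings have hundreds or thousands of dimensions.

```latex
\[
\cos(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert}
\]
% Worked example with illustrative vectors a = (1, 0) and b = (1, 1):
\[
\cos(\mathbf{a}, \mathbf{b}) = \frac{1 \cdot 1 + 0 \cdot 1}{1 \cdot \sqrt{2}} = \frac{1}{\sqrt{2}} \approx 0.707
\]
```

The score ranges from -1 to 1. A value of 1 means the vectors point in the same direction, and 0 means they are orthogonal.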
The pipeline
- Chunk your documents into passages of a few hundred tokens.
- Embed each chunk and store the vector plus the original text and metadata.
- Embed the query at search time.
- Rank stored chunks by similarity to the query vector and return the top matches.
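The indexing half of these steps can be sketched as below. This is a minimal sketch under stated assumptions, not a reference implementation: `chunk_words` and `build_store` are hypothetical names, word counts stand in for token counts, and `embed` is whatever function wraps your embedding model (it is passed in, so no particular API is assumed).

```python
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]


def chunk_words(text: str, size: int = 200, overlap: int = 40) -> List[str]:
    """Split text into windows of `size` words; neighbours share `overlap` words.

    Assumption: words approximate tokens. Swap in a real tokenizer if you
    need exact token budgets.
    """
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this window already reached the end of the text
    return chunks


def build_store(docs: Dict[str, str], embed: Callable[[str], Vector], **chunk_args) -> List[dict]:
    """Chunk every document and keep vector, text, and metadata side by side."""
    store = []
    for doc_id, text in docs.items():
        for i, chunk in enumerate(chunk_words(text, **chunk_args)):
            store.append({"vec": embed(chunk), "text": chunk, "doc_id": doc_id, "chunk": i})
    return store
```

Each record carries a `vec` and a `text` field, so a list built this way has the shape a simple in-memory search over stored chunks would expect. The overlap is a design choice: it reduces the chance that a sentence split across a chunk boundary becomes unfindable.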
Computing similarity
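Once you have vectors, cosine similarity reduces to a plain dot product if the vectors are normalized to unit length. For small collections you can score every item in memory; once that brute-force scan becomes too slow for your latency budget (the exact point depends on vector size and hardware), reach for a dedicated vector database that handles approximate nearest-neighbor search efficiently.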
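import numpy as np

def cosine(a, b):
    # Cosine similarity: dot product divided by the product of the vector lengths.
    a, b = np.array(a), np.array(b)
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def search(query_vec, store, k=5):
    # Brute force: score every stored item against the query, then keep the k best.
    scored = [(cosine(query_vec, item["vec"]), item) for item in store]
    scored.sort(key=lambda x: x[0], reverse=True)
    return [item for _, item in scored[:k]]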
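The "dot product on normalized vectors" idea can be made explicit by normalizing each vector once, at indexing time, so that search only needs dot products. The sketch below is dependency-free for clarity; `normalize`, `dot`, and `search_normalized` are illustrative names, and it assumes every stored `vec` has already been passed through `normalize`.

```python
import heapq
import math
from typing import List, Sequence


def normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def search_normalized(query_vec, store, k=5):
    """Top-k items by dot product, best first.

    Assumes every item["vec"] is already unit length, so the dot product
    equals the cosine similarity.
    """
    q = normalize(query_vec)
    return heapq.nlargest(k, store, key=lambda item: dot(q, item["vec"]))
```

With NumPy, the same idea becomes a single matrix-vector product over a matrix of stored unit vectors, which is usually much faster than scoring items one at a time.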
From search to answers
Semantic search is the retrieval half of retrieval-augmented generation (RAG). To turn matches into a written answer, feed the top chunks to a chat model on Model Database, using the same OpenAI-compatible client you already use.
from openai import OpenAI

client = OpenAI(base_url="https://modeldatabase.com/v1", api_key="mdb_live_...")

# query_vec is the embedded form of user_query; store is the list of embedded chunks.
hits = search(query_vec, store, k=5)
context = "\n\n".join(h["text"] for h in hits)

resp = client.chat.completions.create(
    model="google/gemini-2.0-flash",
    messages=[
        {"role": "system", "content": "Answer from the context only."},
        {"role": "user", "content": f"{context}\n\nQ: {user_query}"},
    ],
)
print(resp.choices[0].message.content)
Practical tips
- Normalize text before embedding: trim boilerplate, collapse whitespace. Noise in, noise out.
- Embed at the right granularity: too-large chunks blur meaning; too-small chunks lose context. Test a few sizes against real queries.
- Use one embedding model consistently: vectors from different models aren't comparable. If you change models, re-embed everything.
- Add metadata filters: combine vector similarity with structured filters (language, date, product) to narrow results.
- Consider hybrid search: blend semantic scores with keyword matching so exact identifiers aren't lost.
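The last two tips can be combined in one small function. This is a sketch under loud assumptions: `keyword_score` and `hybrid_search` are hypothetical helpers, the keyword score is a simple term-overlap fraction rather than a proper ranking function such as BM25, and the blend weight `alpha` is a value you would need to tune against real queries.

```python
import re
from typing import Callable, Iterable, List, Optional, Tuple


def keyword_score(query: str, text: str) -> float:
    """Fraction of distinct query terms that appear verbatim in the text."""
    terms = set(re.findall(r"[\w-]+", query.lower()))
    words = set(re.findall(r"[\w-]+", text.lower()))
    return len(terms & words) / len(terms) if terms else 0.0


def hybrid_search(
    query: str,
    scored_items: Iterable[Tuple[float, dict]],
    alpha: float = 0.7,
    where: Optional[Callable[[dict], bool]] = None,
    k: int = 5,
) -> List[dict]:
    """Blend semantic and keyword scores over items that pass a metadata filter.

    scored_items: (semantic_score, item) pairs, e.g. cosine scores from vector search.
    alpha: weight on the semantic score; 1 - alpha goes to the keyword score (assumed, tune it).
    where: optional predicate on the item, e.g. lambda it: it["lang"] == "en".
    """
    blended = []
    for sem, item in scored_items:
        if where is not None and not where(item):
            continue  # structured filter: drop items with the wrong metadata
        score = alpha * sem + (1 - alpha) * keyword_score(query, item["text"])
        blended.append((score, item))
    blended.sort(key=lambda pair: pair[0], reverse=True)
    return [item for _, item in blended[:k]]
```

For a query such as "SKU-1234 return", a chunk that contains the exact identifier can outrank a chunk with a slightly higher semantic score but no literal match, which is the behavior the hybrid tip is after.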
Honest limitations
Embeddings capture similarity, not truth or recency: a vector can't tell you that a document is outdated. Similarity also isn't relevance. The nearest chunk may be on-topic without actually answering the question, which is why re-ranking and score thresholds matter. And quality depends heavily on the embedding model and your chunking, so budget time to tune both against real queries rather than expecting the defaults to be optimal.
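A score threshold is the simplest of these safeguards. The sketch below assumes you have a list of (score, item) pairs from your similarity step; `filter_by_threshold` is a hypothetical helper, and the default `min_score` of 0.3 is a placeholder, since useful cutoffs vary by embedding model and have to be found empirically.

```python
def filter_by_threshold(scored, min_score=0.3, k=5):
    """Keep at most k (score, item) pairs whose score reaches min_score, best first.

    min_score is an assumed placeholder; calibrate it on real queries.
    """
    kept = [pair for pair in scored if pair[0] >= min_score]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return kept[:k]
```

When this returns an empty list, it is often better to tell the user that nothing relevant was found than to pass weak context to the chat model.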
Wire retrieved context into a chat model with a key from your dashboard, and see the chat completions reference in the docs.