Embeddings and Semantic Search Basics

Devon Pratt · Jan 25, 2026 · 4 min read

Keyword search matches strings; semantic search matches meaning. Ask "how do I get my money back" and a semantic system finds the "refund policy" page even though they share no words. The technology underneath is embeddings: numeric vectors that place similar meanings near each other in space. This article covers the fundamentals you need to build semantic search.

What an embedding is

An embedding model maps text to a fixed-length vector, say 1,536 numbers. Texts with similar meaning land close together; unrelated texts land far apart. "Distance" is usually measured with cosine similarity, which compares the angle between vectors. Search becomes a geometry problem: embed the query, find the nearest stored vectors.
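To make the geometry concrete, here is a small sketch using made-up 3-dimensional vectors. The numbers and variable names are invented for illustration and are not output from any embedding model; real embeddings have hundreds or thousands of dimensions. It shows that cosine similarity depends on the direction of a vector, not its length.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between a and b (1.0 = same direction, 0.0 = orthogonal)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical toy "embeddings"; the values are assumptions, not model output.
refund = [1.0, 2.0, 0.0]
money_back = [2.0, 4.0, 0.0]   # same direction as refund, twice the length
weather = [0.0, 0.0, 3.0]      # orthogonal to both

same_direction = cosine_similarity(refund, money_back)  # close to 1.0
unrelated = cosine_similarity(refund, weather)          # 0.0
```

Doubling a vector's length leaves its score unchanged, which is why many pipelines normalize vectors once and compare with a plain dot product afterward.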

The pipeline

Computing similarity

Once you have vectors, cosine similarity reduces to a dot product, provided the vectors are normalized to unit length. For small collections you can do this in memory; beyond a few thousand items, reach for a dedicated vector database that handles approximate nearest-neighbor search efficiently.

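import numpy as np

def cosine(a, b):
    # Cosine similarity: dot product divided by the product of the vector lengths.
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def search(query_vec, store, k=5):
    # Score every stored item against the query, then keep the k best.
    scored = [(cosine(query_vec, item["vec"]), item) for item in store]
    scored.sort(key=lambda x: x[0], reverse=True)
    return [item for _, item in scored[:k]]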
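For in-memory collections, one common variation is to normalize every stored vector once and score the whole collection with a single matrix-vector product. The sketch below is one possible way to do that, not the only one; `build_index` and `search_index` are hypothetical helper names, the data is a toy example, and the code assumes no stored vector is all zeros.

```python
import numpy as np

def build_index(vectors):
    """Stack vectors into a matrix and scale each row to unit length."""
    matrix = np.asarray(vectors, dtype=float)
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    return matrix / norms  # assumes no all-zero rows

def search_index(query_vec, index, k=5):
    """Return (row, score) pairs for the k most similar rows, best first."""
    q = np.asarray(query_vec, dtype=float)
    q = q / np.linalg.norm(q)
    scores = index @ q                    # cosine similarity for every row at once
    top = np.argsort(scores)[::-1][:k]    # row indices sorted by descending score
    return [(int(i), float(scores[i])) for i in top]

# Toy 2-dimensional data, invented for illustration.
index = build_index([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
results = search_index([2.0, 0.1], index, k=2)
```

The returned row numbers can be used to look up the original items in whatever list the vectors were built from.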

From search to answers

Semantic search is the retrieval half of RAG (retrieval-augmented generation). To turn matches into a written answer, feed the top chunks to a chat model on Model Database, using the same OpenAI-compatible client you already use.

from openai import OpenAI

client = OpenAI(base_url="https://modeldatabase.com/v1", api_key="mdb_live_...")

# query_vec is the embedding of user_query; store and search() come from the previous section.
hits = search(query_vec, store, k=5)
# Join the retrieved chunks into one context string, separated by blank lines.
context = "\n\n".join(h["text"] for h in hits)

resp = client.chat.completions.create(
    model="google/gemini-2.0-flash",
    messages=[
        {"role": "system", "content": "Answer from the context only."},
        {"role": "user", "content": f"{context}\n\nQ: {user_query}"},
    ],
)
print(resp.choices[0].message.content)

Practical tips

Honest limitations

Embeddings capture similarity, not truth or recency: a vector can't tell you that a document is outdated. Similarity also isn't relevance. The nearest chunk may be on-topic without actually answering the question, which is why re-ranking and thresholds matter. And quality depends heavily on the embedding model and your chunking, so budget time to tune both against real queries rather than expecting the defaults to be optimal.
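As a minimal sketch of the threshold idea, the snippet below drops hits whose similarity falls under a cutoff before they reach the chat model. The helper name `filter_by_score`, the sample chunks, and the 0.3 cutoff are all assumptions for illustration; a sensible cutoff depends on the embedding model and should be tuned against real queries.

```python
def filter_by_score(scored_hits, min_score=0.3):
    """Keep (score, item) pairs whose similarity is at least min_score."""
    return [(score, item) for score, item in scored_hits if score >= min_score]

# Hypothetical scored results as (similarity, chunk) pairs, best first.
scored_hits = [
    (0.82, {"text": "Refunds are issued to the original payment method."}),
    (0.41, {"text": "Contact support to start a return."}),
    (0.12, {"text": "Our office is closed on public holidays."}),
]

kept = filter_by_score(scored_hits, min_score=0.3)

# If nothing clears the cutoff, it is usually better to say so than to answer
# from weak matches.
fallback = None if kept else "I couldn't find that in the documentation."
```

Filtering after scoring keeps the search function itself simple and lets you adjust the cutoff without re-embedding anything.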

Wire retrieved context into a chat model with a key from your dashboard, and see the chat completions reference in the docs.
