The embedding model learned from billions of text examples to place semantically similar text at nearby positions in 768-dimensional space. So "scurvy" and "vitamin C deficiency" end up close together, even though they share no words. That's why semantic search beats keyword matching.
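A minimal sketch of that claim, assuming the sentence-transformers library and the 768-dim all-mpnet-base-v2 model (the specific model is an assumption; any embedding model behaves the same way):

```python
from sentence_transformers import SentenceTransformer, util

# all-mpnet-base-v2 produces 768-dim embeddings; swap in your own model
model = SentenceTransformer("all-mpnet-base-v2")

emb = model.encode(
    ["scurvy", "vitamin C deficiency", "software license"],
    normalize_embeddings=True,  # unit-length vectors, so dot product == cosine
)

print(util.cos_sim(emb[0], emb[1]))  # high: same concept, zero shared words
print(util.cos_sim(emb[0], emb[2]))  # low: unrelated concept
```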
If your vectors are L2-normalized, cosine similarity and dot product are equivalent (cosine equals dot when both vectors have norm 1). Most modern embedding models (OpenAI's, bge, e5) ship normalized vectors, so dot product is faster and gives identical rankings. Use cosine when you can't guarantee normalization or when comparing across embedding sources with different norms. Use raw dot product only when norm itself carries meaning — e.g., some recommender embeddings encode confidence in the magnitude.
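A quick numpy check of that equivalence, using random vectors rather than any particular model:

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = rng.normal(size=768), rng.normal(size=768)

def cosine(x, y):
    return x @ y / (np.linalg.norm(x) * np.linalg.norm(y))

# L2-normalize both vectors to unit length
a_n = a / np.linalg.norm(a)
b_n = b / np.linalg.norm(b)

print(np.isclose(cosine(a, b), a_n @ b_n))     # True: cosine of raw == dot of normalized
print(np.isclose(cosine(a_n, b_n), a_n @ b_n)) # True: normalization doesn't change cosine
```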
Without normalization, longer documents tend to have larger embedding norms, so dot product favors them regardless of relevance. Cosine cancels this by dividing out the norms, but at the cost of an extra sqrt-and-divide per comparison. The standard pipeline is: embed, L2-normalize at index time, then use dot product (or "inner product" mode in your vector DB) at query time — same accuracy as cosine, half the math. The bug to watch for: forgetting to normalize the query vector at search time. Pure inner-product rankings survive that (the query's norm scales every score equally), but the scores are no longer cosines, so any similarity threshold, score fusion, or cross-query score comparison silently goes wrong.
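The pipeline as a sketch, again assuming sentence-transformers and a placeholder corpus; the model name and query are stand-ins for your own:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-mpnet-base-v2")
corpus = ["Scurvy is caused by a lack of vitamin C.", "Reset your password from the login page."]

# Index time: embed once, L2-normalize once, store the normalized matrix.
corpus_emb = model.encode(corpus, normalize_embeddings=True)   # shape (N, 768), unit norm

# Query time: normalize the query the same way, then one matrix-vector product.
q = model.encode(["why do sailors get scurvy?"], normalize_embeddings=True)[0]
scores = corpus_emb @ q                 # dot product == cosine, since both sides are unit norm
top_k = np.argsort(-scores)[:10]        # indices of the best-matching chunks
```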
Higher dimensions (1536, 3072) capture more nuance and tend to score higher on benchmarks but cost linearly more in storage, RAM, and query time — a billion 3072-dim float32 vectors is 12 TB before any index overhead. Lower dimensions (384, 768) are 4–8x cheaper and the quality gap narrows on domain-specific corpora. Matryoshka embeddings (MRL) give you both: train at high dimension but truncate to lower dimension at serve time with graceful quality degradation. Practical rule: start at 768, only move to 1536+ if eval shows it actually helps your task.
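The MRL serve-time trick is just "keep the leading dimensions, then re-normalize"; a sketch with random stand-in vectors (the truncation is only meaningful for models actually trained with Matryoshka loss):

```python
import numpy as np

def truncate_mrl(embeddings, dim):
    """Keep the first `dim` dimensions of Matryoshka embeddings and re-normalize to unit length."""
    cut = embeddings[:, :dim]
    return cut / np.linalg.norm(cut, axis=1, keepdims=True)

# Stand-in for real (N, 3072) MRL embeddings from your model
full = np.random.default_rng(0).normal(size=(1000, 3072)).astype(np.float32)
small = truncate_mrl(full, 768)   # (1000, 768): 4x less storage and compute per query
```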
Most common reasons in order: (1) chunking split the answer across boundaries so no single chunk contains the full context; (2) the query's vocabulary doesn't overlap semantically with the document's (a question phrased in user-speak vs a chunk in technical jargon — embeddings help here but aren't magic); (3) the right chunk is in the index but ranked just below your top-k cutoff; (4) embedding model mismatch (you re-embedded the corpus with a new model but kept the old query encoder); (5) for rare exact tokens like product SKUs, dense retrieval just doesn't work — needs BM25 hybrid.
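For failure mode (5), the usual fix is hybrid retrieval: run BM25 and dense search separately, then merge the two rankings. Reciprocal rank fusion is a common way to do the merge; a sketch with toy result lists in place of real retriever output:

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge several ranked lists of doc ids into one fused ranking."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] += 1.0 / (k + rank + 1)  # earlier rank -> larger contribution
    return sorted(scores, key=scores.get, reverse=True)

bm25_ids = ["sku-4421", "doc-7", "doc-3"]   # exact-token matches BM25 finds easily
dense_ids = ["doc-3", "doc-9", "doc-7"]     # semantic matches from the vector index
print(reciprocal_rank_fusion([bm25_ids, dense_ids])[:3])
```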
Check the MTEB leaderboard for your task type (retrieval, clustering, classification) at your size budget. Open-source bge-large, e5-mistral, and gte-large are competitive with the best proprietary models on most retrieval tasks. Proprietary models (Voyage, Cohere, OpenAI text-embedding-3-large) often edge ahead on specialized tasks (code, multilingual) and are operationally simpler: one API call, no GPU. Self-hosting wins on cost at high volume (roughly 100M+ embeddings/month), data residency, and the ability to fine-tune. The decision is rarely about peak quality; it's about ops fit.
Build a small (200–500 example) labeled set of (query, relevant_doc_id) pairs and compute recall@k and MRR for each candidate model. The query side should mirror real user queries, not paraphrases of doc text — that's the bias trap that makes every embedding model look great. For unlabeled data, bootstrap with an LLM: prompt it with each chunk and ask it to generate a plausible question, then check that the question retrieves its source chunk in the top-5. The MTEB leaderboard tells you the average; your gold set tells you the truth for your domain.
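The metrics themselves are a few lines; a sketch where retrieve() is a placeholder for your actual vector search:

```python
def evaluate(gold_pairs, retrieve, k=5):
    """gold_pairs: (query, relevant_doc_id) tuples; retrieve(query, k) returns ranked doc ids."""
    hits, rr_sum = 0, 0.0
    for query, relevant_id in gold_pairs:
        ranked = retrieve(query, k)
        if relevant_id in ranked:
            hits += 1
            rr_sum += 1.0 / (ranked.index(relevant_id) + 1)  # reciprocal rank of the hit
    n = len(gold_pairs)
    return {f"recall@{k}": hits / n, f"mrr@{k}": rr_sum / n}

# Toy check with a fake retriever; swap in your real search call.
gold = [("why do sailors get scurvy?", "doc-3"), ("reset my password", "doc-9")]
fake_retrieve = lambda q, k: ["doc-3", "doc-1"] if "scurvy" in q else ["doc-2", "doc-4"]
print(evaluate(gold, fake_retrieve))  # recall@5 = 0.5, mrr@5 = 0.5
```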