Vector embeddings represent text as points in high-dimensional space where semantic similarity maps to geometric proximity. Retrieval-augmented generation (RAG) uses this property to find relevant documents at query time and inject them into the model’s context before generation. The result is a model that can answer questions about documents it was never trained on.
Analysis Briefing
- Topic: Vector embeddings, semantic search, and RAG architecture
- Analyst: Mike D (@MrComputerScience)
- Context: A technical briefing developed with Claude Sonnet 4.6
- Source: Pithy Cyborg | AI News Made Simple
- Key Question: How does turning text into vectors make it possible to search by meaning rather than keywords?
What Vector Embeddings Actually Represent
An embedding model converts a piece of text into a dense vector, typically 768 to 3,072 floating point numbers. These numbers are not interpretable individually. Their meaning emerges from their relationships: texts with similar semantic content produce vectors that are geometrically close in that high-dimensional space.
The classic demonstration is word arithmetic. The vector for “king” minus “man” plus “woman” lands close to the vector for “queen.” The geometry encodes semantic relationships. A query about “how to fix a segmentation fault” will have a vector close to a document explaining memory addressing errors, even if the document never uses the word “segmentation fault.”
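A toy illustration of that arithmetic, using hand-made two-dimensional vectors (the axis labels here are invented for illustration; real embedding dimensions are learned and not individually interpretable):

```python
# Toy 2-d "embeddings" with invented axes: [royalty, femaleness].
vecs = {
    "king":  [0.9, 0.1],
    "queen": [0.9, 0.9],
    "man":   [0.1, 0.1],
    "woman": [0.1, 0.9],
}

def nearest(v):
    # Closest word by squared Euclidean distance over the toy vocabulary.
    return min(vecs, key=lambda w: sum((a - b) ** 2 for a, b in zip(vecs[w], v)))

# king - man + woman lands almost exactly on queen's vector.
target = [k - m + w for k, m, w in zip(vecs["king"], vecs["man"], vecs["woman"])]
print(nearest(target))  # queen
```

Real embedding models exhibit the same behavior in hundreds or thousands of dimensions, which is what makes meaning-based retrieval possible.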
This is what keyword search cannot do. A BM25 index finds documents containing your exact query terms. An embedding index finds documents semantically related to your query even when the vocabulary is completely different.
How Vector Databases Enable Fast Similarity Search
Storing embeddings in a flat list and computing cosine similarity against every document is too slow at scale. A million 1,536-dimensional vectors require roughly 6GB of memory at 32-bit precision, and an exhaustive search over them takes seconds. Vector databases solve this with approximate nearest neighbor (ANN) indexes.
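For intuition, that brute-force baseline is just cosine similarity against every stored vector. A pure-Python sketch with toy three-dimensional vectors and invented document IDs:

```python
import math

def cosine(a, b):
    # Cosine similarity: dot product divided by the product of vector norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

store = {
    "doc_a": [0.9, 0.1, 0.0],
    "doc_b": [0.1, 0.9, 0.2],
}

def brute_force_search(query, k=1):
    # Scores every document: O(n) work per query, the cost ANN indexes avoid.
    ranked = sorted(store, key=lambda d: cosine(store[d], query), reverse=True)
    return ranked[:k]

print(brute_force_search([1.0, 0.0, 0.1]))  # ['doc_a']
```

At a handful of documents this is instant; at millions of high-dimensional vectors, the per-query linear scan is what forces the move to ANN indexes.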
HNSW (Hierarchical Navigable Small World) is the dominant ANN algorithm. It builds a multi-layer graph where each node connects to its nearest neighbors. Search starts at the top layer and navigates down to the closest match. The result is sub-millisecond search on millions of vectors with recall rates above 95%.
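The navigation step can be sketched in miniature as a greedy walk over a single neighbor graph (a real HNSW adds the layer hierarchy and smarter edge selection; the points and edges here are invented):

```python
import math

# Toy points and a hand-built neighbor graph (one layer, invented data).
points = {0: (0, 0), 1: (1, 0), 2: (0, 1), 3: (2, 2), 4: (3, 3)}
graph = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3]}

def greedy_search(query, entry=0):
    # Hop to whichever neighbor is closest to the query; stop at a local minimum.
    current = entry
    while True:
        best = min(graph[current], key=lambda n: math.dist(points[n], query))
        if math.dist(points[best], query) < math.dist(points[current], query):
            current = best
        else:
            return current

print(greedy_search((2.9, 3.1)))  # 4, the true nearest neighbor
```

Each hop discards most of the dataset, which is why search cost grows roughly logarithmically with collection size rather than linearly.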
The pattern end to end with Qdrant and OpenAI embeddings (requires the qdrant-client and openai packages and an OpenAI API key):

```python
import openai
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct

client = QdrantClient(":memory:")  # in-memory for demo

# Create collection
client.create_collection(
    collection_name="docs",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE)
)

def embed(text: str) -> list[float]:
    response = openai.embeddings.create(
        input=text,
        model="text-embedding-3-small"
    )
    return response.data[0].embedding

# Index documents
documents = [
    "Segmentation faults occur when a program accesses memory it doesn't own.",
    "Buffer overflows write past the end of an allocated memory region.",
    "Stack overflow happens when recursion depth exceeds the call stack limit.",
]
points = [
    PointStruct(id=i, vector=embed(doc), payload={"text": doc})
    for i, doc in enumerate(documents)
]
client.upsert(collection_name="docs", points=points)

# Query
query = "how to fix a memory access error in C"
results = client.search(
    collection_name="docs",
    query_vector=embed(query),
    limit=2
)
for r in results:
    print(f"Score: {r.score:.3f} | {r.payload['text']}")
```
Popular vector databases include Qdrant, Pinecone, Weaviate, and pgvector (a PostgreSQL extension). For most applications under a million documents, pgvector on an existing Postgres instance is the simplest path.
The Full RAG Pipeline From Query to Answer
A RAG pipeline has four steps. Indexing runs offline: chunk your documents, embed each chunk, store chunks and embeddings in the vector database. Retrieval runs at query time: embed the user’s query, search for the top-k most similar chunks, return the chunk texts.
Augmentation injects retrieved chunks into the prompt as context. Generation sends the augmented prompt to the language model.
```python
def rag_query(user_query: str, top_k: int = 3) -> str:
    # Retrieve
    query_vector = embed(user_query)
    results = client.search(
        collection_name="docs",
        query_vector=query_vector,
        limit=top_k
    )
    context = "\n\n".join(r.payload["text"] for r in results)

    # Augment and generate
    response = openai.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {user_query}"}
        ]
    )
    return response.choices[0].message.content
```
The most common RAG failure is retrieving chunks that are semantically similar to the query but do not actually answer it. The fix is hybrid search: combine semantic similarity with BM25 keyword scoring and use a reranker model to score the merged results by actual relevance to the question.
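One standard way to merge the keyword and semantic result lists before reranking is reciprocal rank fusion (a sketch: the constant 60 is the conventional default, and the document IDs are invented):

```python
def reciprocal_rank_fusion(result_lists, k=60):
    # Each document scores sum(1 / (k + rank)) across the lists it appears in,
    # so documents ranked well by both retrievers rise to the top.
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

semantic = ["doc3", "doc1", "doc7"]  # from the vector index
keyword = ["doc1", "doc9", "doc3"]   # from BM25
fused = reciprocal_rank_fusion([semantic, keyword])
print(fused)  # doc1 and doc3, present in both lists, rank highest
```

The fused list then goes to the reranker, which scores each candidate against the question directly.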
What This Means For You
- Chunk documents at semantically meaningful boundaries (paragraphs, sections) rather than fixed character counts, because a chunk that cuts mid-sentence will produce a misleading embedding that retrieves incorrectly.
- Use a reranker on top of ANN retrieval for any production RAG system, because ANN finds semantically similar chunks and a reranker finds chunks that actually answer the question, and those are not the same thing.
- Start with pgvector on Postgres before adopting a dedicated vector database, because the operational complexity of a new infrastructure component is rarely justified until you have more than a million documents and retrieval latency requirements below 10ms.
- Evaluate retrieval quality separately from generation quality, because a bad answer from a RAG system is usually a retrieval failure rather than a generation failure, and conflating them makes debugging impossible.
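A minimal retrieval-only metric is recall@k over a hand-labeled set of query and relevant-chunk pairs (a sketch with invented chunk IDs):

```python
def recall_at_k(retrieved, relevant, k):
    # Fraction of the labeled relevant chunks that appear in the top-k results.
    hits = set(retrieved[:k]) & set(relevant)
    return len(hits) / len(relevant)

retrieved = ["c4", "c1", "c9", "c2"]  # retriever output, in rank order
relevant = {"c1", "c2"}               # hand-labeled ground truth for this query
print(recall_at_k(retrieved, relevant, k=3))  # 0.5: only c1 made the top 3
```

Scoring this across a few dozen labeled queries tells you whether a bad answer should be debugged in the retriever or in the prompt and model.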
Enjoyed this deep dive? Join my inner circle:
- Pithy Cyborg | AI News Made Simple → AI news made simple without hype.
