RAG (Retrieval-Augmented Generation)
Definition
RAG (Retrieval-Augmented Generation) augments an LLM with documents retrieved at query time, typically from a vector database. The LLM grounds its answer in the retrieved text instead of relying purely on its training data.
What it means
A RAG system has three components: an embedding model that converts documents and queries to vectors, a vector database (Pinecone, Weaviate, pgvector, etc.) that stores those vectors and retrieves the most similar documents, and an LLM that synthesizes the retrieved context into an answer. The chunking strategy (commonly 500-1500 tokens with overlap) and reranking (often via Cohere Rerank or hybrid BM25 retrieval) heavily affect quality.
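The pipeline above can be sketched in a few dozen lines. This is a toy illustration, not a production recipe: a bag-of-words count stands in for a learned embedding model, a plain list stands in for a vector database, and the function names (`chunk`, `embed`, `retrieve`) are illustrative rather than from any library.

```python
import math
import re

def chunk(text, size=500, overlap=50):
    """Split text into overlapping chunks. Whitespace words stand in
    for real tokenizer tokens; overlap preserves context at boundaries."""
    tokens = text.split()
    chunks, step = [], size - overlap
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + size]))
        if start + size >= len(tokens):
            break
    return chunks

def embed(text):
    """Toy bag-of-words 'embedding'. A real system would call a
    learned embedding model here."""
    vec = {}
    for word in re.findall(r"\w+", text.lower()):
        vec[word] = vec.get(word, 0) + 1
    return vec

def cosine(a, b):
    dot = sum(a[k] * b.get(k, 0) for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, chunks, k=2):
    """Rank stored chunks by similarity to the query; a vector
    database does this at scale with approximate nearest neighbors."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

docs = [
    "pgvector adds vector similarity search to Postgres.",
    "Fine-tuning bakes knowledge into model weights.",
    "Chunk overlap preserves context across chunk boundaries.",
]
question = "vector search in Postgres"
top = retrieve(question, docs, k=1)
# The retrieved text is pasted into the LLM prompt to ground the answer.
prompt = f"Answer using only this context:\n{top[0]}\n\nQuestion: {question}"
```

The final `prompt` string is what gets sent to the LLM; grounding comes entirely from the retrieved chunk, which is why retrieval quality dominates RAG quality.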
Why it matters
RAG is how most production AI products use private data without retraining the model. It's cheaper than fine-tuning, easier to update, and the retrieved sources are auditable. The downside: poorly tuned RAG retrieves irrelevant chunks, leading to hallucinations.
Frequently asked questions
RAG vs fine-tuning?
RAG: retrieve at query time, easy to update, sources auditable. Fine-tuning: bake knowledge into model weights, faster inference, but expensive and hard to update. Most production systems start with RAG and add fine-tuning only when retrieval quality plateaus.
Which vector database?
pgvector for most teams (it's just a Postgres extension). Pinecone for managed scale. Weaviate for hybrid and multi-modal search. Qdrant for self-hosting and speed.
Related terms
- Embeddings — Dense numerical vectors that represent the meaning of text (or images, audio) such that semantic similarity corresponds to vector closeness. They're the foundation of RAG, semantic search, recommendation, and clustering.
- Context window — The maximum amount of text (in tokens) an AI model can process in a single request, combining your system prompt, conversation history, and output. Past the limit, the model can't "see" earlier content.