RAG (Retrieval-Augmented Generation)
Definition
RAG (Retrieval-Augmented Generation) augments an LLM with documents retrieved at query time, typically from a vector database. The LLM grounds its answer in the retrieved text instead of relying purely on its training data.
What it means
A RAG system has three components: an embedding model that converts documents and queries to vectors, a vector database (Pinecone, Weaviate, pgvector, etc.) that stores those vectors and retrieves the most similar documents, and an LLM that synthesizes the retrieved context into an answer. The chunking strategy (commonly 500-1500 tokens with overlap) and reranking (often via Cohere Rerank or hybrid BM25 retrieval) heavily affect quality.
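The pipeline above can be sketched in a few dozen lines. This is a toy illustration, not a production recipe: a bag-of-words count stands in for a learned embedding model, a plain list stands in for a vector database, and the function names (`chunk`, `embed`, `retrieve`) are illustrative rather than from any library.

```python
import math
import re

def chunk(text, size=500, overlap=50):
    """Split text into overlapping chunks. Whitespace words stand in
    for real tokenizer tokens; overlap preserves context at boundaries."""
    tokens = text.split()
    chunks, step = [], size - overlap
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + size]))
        if start + size >= len(tokens):
            break
    return chunks

def embed(text):
    """Toy bag-of-words 'embedding'. A real system would call a
    learned embedding model here."""
    vec = {}
    for word in re.findall(r"\w+", text.lower()):
        vec[word] = vec.get(word, 0) + 1
    return vec

def cosine(a, b):
    dot = sum(a[k] * b.get(k, 0) for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, chunks, k=2):
    """Rank stored chunks by similarity to the query; a vector
    database does this at scale with approximate nearest neighbors."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

docs = [
    "pgvector adds vector similarity search to Postgres.",
    "Fine-tuning bakes knowledge into model weights.",
    "Chunk overlap preserves context across chunk boundaries.",
]
question = "vector search in Postgres"
top = retrieve(question, docs, k=1)
# The retrieved text is pasted into the LLM prompt to ground the answer.
prompt = f"Answer using only this context:\n{top[0]}\n\nQuestion: {question}"
```

The final `prompt` string is what gets sent to the LLM; grounding comes entirely from the retrieved chunk, which is why retrieval quality dominates RAG quality.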
Why it matters
RAG is how most production AI products use private data without retraining the model. It's cheaper than fine-tuning, easier to update, and the retrieved sources are auditable. The downside: poorly tuned RAG retrieves irrelevant chunks, leading to hallucinations.
Frequently asked questions
RAG vs fine-tuning?
RAG: retrieve at query time, easy to update, sources auditable. Fine-tuning: bake knowledge into model weights, faster inference, but expensive and hard to update. Most production systems start with RAG and add fine-tuning only when retrieval quality plateaus.
Which vector database?
pgvector for most teams (it's just a Postgres extension). Pinecone for managed scale. Weaviate for hybrid and multi-modal search. Qdrant for self-hosting and speed.
Related terms
- Embeddings — Dense numerical vectors that represent the meaning of text (or images, audio) such that semantic similarity corresponds to vector closeness. They're the foundation of RAG, semantic search, recommendation, and clustering.
- Context window — The maximum amount of text (in tokens) an AI model can process in a single request, combining your system prompt, conversation history, and output. Past the limit, the model can't "see" earlier content.