Context window
Definition
The context window is the maximum amount of text, measured in tokens, that an AI model can process in a single request: your system prompt, conversation history, and the model's output combined. Past the limit, the model can't 'see' earlier content.
What it means
Context windows are measured in tokens (roughly 4 characters or 0.75 words each). Claude Sonnet 4.6 and Opus 4.7 have 1M tokens; Gemini 2.5/3 Pro have 2M; GPT-5 has 400k; DeepSeek V3.2 has 128k. The window includes EVERY token: system prompt + chat history + user message + tool definitions + the model's response. Quality also degrades near the maximum, so most practitioners operate at 50-70% of the rated context for production reliability.
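To make the accounting concrete, here is a minimal sketch of budgeting a request against a window with a safety margin. The helper names (estimate_tokens, fits_window) and the 60% safety factor are illustrative assumptions, not any provider's API, and the ~4-characters-per-token rule is only a heuristic; real tokenizers give exact counts.

```python
CHARS_PER_TOKEN = 4   # rough average for English prose (heuristic, not exact)
SAFETY_FACTOR = 0.6   # operate near 60% of rated context, mid of the 50-70% range

def estimate_tokens(text: str) -> int:
    """Crude token estimate; a real tokenizer would be exact."""
    return len(text) // CHARS_PER_TOKEN

def fits_window(system_prompt: str, history: list[str], user_msg: str,
                tool_defs: str, max_output_tokens: int,
                window: int = 400_000) -> bool:
    # The window covers EVERY token: all inputs PLUS the reserved output budget.
    used = (estimate_tokens(system_prompt)
            + sum(estimate_tokens(m) for m in history)
            + estimate_tokens(user_msg)
            + estimate_tokens(tool_defs)
            + max_output_tokens)
    return used <= window * SAFETY_FACTOR
```

Reserving the output budget up front matters because the model's response competes for the same window as everything you send in.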
Why it matters
Picking a model with too small a context window forces you to chunk documents, drop RAG context, or break agent loops. Conversely, paying for a 2M-token model when you actually use 50k is wasted spend. Right-sizing the window to your real workload is one of the biggest AI-cost levers.
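As a sketch of what right-sizing can look like in practice, you could pick the smallest tier whose window comfortably covers your observed 95th-percentile request size. The recommend_window helper and the tier names below are illustrative assumptions, not any vendor's pricing API:

```python
import statistics

def recommend_window(observed_request_tokens: list[int],
                     tiers: dict[str, int]) -> str:
    """Pick the cheapest window tier that covers the p95 request size
    while keeping real usage under ~60% of the rated window."""
    p95 = statistics.quantiles(observed_request_tokens, n=20)[18]  # 95th pct
    needed = int(p95 / 0.6)
    for name, window in sorted(tiers.items(), key=lambda kv: kv[1]):
        if window >= needed:
            return name
    return max(tiers, key=tiers.get)  # nothing fits; take the largest

# Example: telemetry shows requests clustering around 40-60k tokens.
tiers = {"128k": 128_000, "400k": 400_000, "1M": 1_000_000, "2M": 2_000_000}
print(recommend_window([42_000, 55_000, 48_000, 61_000, 39_000] * 20, tiers))
# -> "128k"
```

With requests around 40-60k tokens, even the 128k tier leaves headroom; paying for a 1M or 2M window would buy nothing.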
Frequently asked questions
How big is 1M tokens?
About 750,000 words — roughly 7-8 average books, or a full medium-sized codebase.
What happens when I exceed the window?
Most APIs truncate the oldest content; some refuse the request outright. Either way, content past the limit is invisible to the model.
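If you would rather control what gets dropped than let the API decide, a common client-side approach is a sliding window that discards the oldest messages first. A minimal sketch, using the same rough 4-characters-per-token estimate (truncate_to_window is an illustrative name, not a library function):

```python
def truncate_to_window(messages: list[str], budget_tokens: int,
                       est=lambda s: len(s) // 4) -> list[str]:
    """Drop the OLDEST messages until the rest fits the token budget,
    mirroring what many chat APIs do implicitly at the window limit."""
    kept, total = [], 0
    for msg in reversed(messages):   # walk newest-first
        t = est(msg)
        if total + t > budget_tokens:
            break                    # everything older is dropped
        kept.append(msg)
        total += t
    return list(reversed(kept))      # restore chronological order
```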
Related terms
- Token: The basic unit of text an LLM reads and produces. Roughly 4 characters or 0.75 words on average for English; longer for code, shorter for languages with lots of subword tokens. APIs bill by token.
- Prompt caching: A feature where the AI provider stores frequently reused prompt prefixes (system messages, RAG context, few-shot examples) and bills cached reads at ~10% of normal input cost.
- RAG (Retrieval-Augmented Generation): Augments an LLM with documents retrieved at query time, typically from a vector database. The LLM grounds its answer in the retrieved text instead of relying purely on training data.