Prompt caching
Definition
Prompt caching is a provider-side feature that stores frequently reused prompt prefixes (system messages, RAG context, few-shot examples) and bills cached reads at roughly 10% of the normal input-token price.
What it means
Anthropic, OpenAI, and Google Gemini all support prompt caching as of 2026, but the implementations differ:
- Anthropic: explicit cache_control breakpoints, with a 5-minute default TTL (1-hour optional).
- OpenAI: automatic caching of prompt prefixes of 1024 tokens or more, retained for roughly 5-10 minutes.
- Gemini: explicit context caching with a 1-hour TTL.
Cache hits cost roughly 10% of normal input tokens (sometimes 25% on Gemini).
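To make the explicit style concrete, here is a minimal sketch using the Anthropic Python SDK's cache_control breakpoint. The model id and prompt contents are placeholders, not taken from this glossary; check the provider docs for the exact minimum-prefix and TTL rules that apply to you.

```python
# Minimal sketch: explicit prompt caching with the Anthropic Python SDK.
# Model id and prompt text are placeholders; verify parameter names against
# the SDK version you actually run.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

LONG_STABLE_SYSTEM_PROMPT = "..."  # thousands of tokens of instructions, examples, RAG context

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # example model id
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LONG_STABLE_SYSTEM_PROMPT,
            # Everything up to and including this block is cached (~5-minute TTL by default).
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "What changed in the Q3 report?"}],
)

# usage reports cache_creation_input_tokens on the first call (cache write)
# and cache_read_input_tokens on later calls (billed at the reduced rate).
print(response.usage)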
Why it matters
For agentic workloads, RAG, and any app with a stable system prompt, prompt caching can cut input costs by 80-90%, making it the single biggest cost lever most teams miss. Getting cache hits is mostly a matter of prompt structure: keep the stable parts (system prompt, examples, RAG context) at the start and put dynamic per-request content at the end, as in the sketch below.
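As an illustration of that ordering, here is a minimal sketch against OpenAI's Chat Completions API, where caching is automatic as long as the leading portion of the prompt stays byte-for-byte identical across calls. The model id and the SYSTEM_PROMPT / FEW_SHOT_EXAMPLES names are placeholders.

```python
# Minimal sketch: structure the prompt so the stable prefix stays cacheable.
# Model id, SYSTEM_PROMPT, and FEW_SHOT_EXAMPLES are placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = "..."      # stable instructions
FEW_SHOT_EXAMPLES = "..."  # stable examples; keep byte-identical across requests
STABLE_PREFIX = SYSTEM_PROMPT + "\n\n" + FEW_SHOT_EXAMPLES

def ask(dynamic_question: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # example model id
        messages=[
            {"role": "system", "content": STABLE_PREFIX},   # stable part first -> eligible for caching
            {"role": "user", "content": dynamic_question},  # per-request content last
        ],
    )
    # On cache hits, resp.usage.prompt_tokens_details.cached_tokens should be > 0.
    return resp.choices[0].message.content
```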
Frequently asked questions
How much does it save?
On cache-friendly workloads (agent loops, RAG, repeated few-shot prompts), 70-90% off the input bill. Use the prompt cache savings calculator to estimate yours.
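For a rough sense of the arithmetic behind those numbers, here is a back-of-the-envelope sketch. The $3-per-million price and 85% hit rate are made-up inputs, and it ignores the cache-write surcharge some providers add on the first request.

```python
# Back-of-the-envelope savings estimate; all prices and rates are example inputs.
def estimated_input_cost(total_input_tokens: int, cached_fraction: float,
                         price_per_mtok: float, cached_rate: float = 0.10) -> float:
    """Input cost when `cached_fraction` of tokens are billed at `cached_rate` of full price."""
    cached = total_input_tokens * cached_fraction
    fresh = total_input_tokens - cached
    return (fresh + cached * cached_rate) * price_per_mtok / 1_000_000

baseline   = estimated_input_cost(50_000_000, 0.00, 3.00)  # no caching
with_cache = estimated_input_cost(50_000_000, 0.85, 3.00)  # 85% of tokens hit the cache
print(f"${baseline:.2f} -> ${with_cache:.2f} ({1 - with_cache / baseline:.1%} saved)")
# -> $150.00 -> $35.25 (76.5% saved)
```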
Does it work cross-provider?
No — each provider's cache is separate. If you switch from Claude to GPT, you start fresh.
Related terms
- Token: A token is the basic unit of text an LLM reads and produces. Roughly 4 characters or 0.75 words on average for English; longer for code, shorter for languages with lots of subword tokens. APIs bill by token.
- Context window: The context window is the maximum amount of text (in tokens) an AI model can process in a single request, combining your system prompt, conversation history, and output. Past the limit, the model can't 'see' earlier content.