Prompt caching
Definition
Prompt caching is a provider-side feature that stores frequently reused prompt prefixes (system messages, RAG context, few-shot examples) and bills cached reads at roughly 10% of the normal input-token price.
What it means
Anthropic, OpenAI, and Google Gemini all support prompt caching as of 2026, but the implementations differ:
- Anthropic: explicit cache_control breakpoints, with a 5-minute default TTL (1-hour optional).
- OpenAI: automatic caching of prompt prefixes of 1024 tokens or more, retained for roughly 5-10 minutes.
- Gemini: explicit context caching with a 1-hour TTL.
Cache hits cost roughly 10% of normal input tokens (sometimes 25% on Gemini).
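To make the explicit style concrete, here is a minimal sketch using the Anthropic Python SDK's cache_control breakpoint. The model id and prompt contents are placeholders, not taken from this glossary; check the provider docs for the exact minimum-prefix and TTL rules that apply to you.

```python
# Minimal sketch: explicit prompt caching with the Anthropic Python SDK.
# Model id and prompt text are placeholders; verify parameter names against
# the SDK version you actually run.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

LONG_STABLE_SYSTEM_PROMPT = "..."  # thousands of tokens of instructions, examples, RAG context

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # example model id
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LONG_STABLE_SYSTEM_PROMPT,
            # Everything up to and including this block is cached (~5-minute TTL by default).
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "What changed in the Q3 report?"}],
)

# usage reports cache_creation_input_tokens on the first call (cache write)
# and cache_read_input_tokens on later calls (billed at the reduced rate).
print(response.usage)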
Why it matters
For agentic workloads, RAG, and any app with a stable system prompt, prompt caching can cut input costs by 80-90%, making it the single biggest cost lever most teams miss. Getting cache hits is mostly a matter of prompt structure: keep the stable parts (system prompt, examples, RAG context) at the start and put dynamic per-request content at the end, as in the sketch below.
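As an illustration of that ordering, here is a minimal sketch against OpenAI's Chat Completions API, where caching is automatic as long as the leading portion of the prompt stays byte-for-byte identical across calls. The model id and the SYSTEM_PROMPT / FEW_SHOT_EXAMPLES names are placeholders.

```python
# Minimal sketch: structure the prompt so the stable prefix stays cacheable.
# Model id, SYSTEM_PROMPT, and FEW_SHOT_EXAMPLES are placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = "..."      # stable instructions
FEW_SHOT_EXAMPLES = "..."  # stable examples; keep byte-identical across requests
STABLE_PREFIX = SYSTEM_PROMPT + "\n\n" + FEW_SHOT_EXAMPLES

def ask(dynamic_question: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # example model id
        messages=[
            {"role": "system", "content": STABLE_PREFIX},   # stable part first -> eligible for caching
            {"role": "user", "content": dynamic_question},  # per-request content last
        ],
    )
    # On cache hits, resp.usage.prompt_tokens_details.cached_tokens should be > 0.
    return resp.choices[0].message.content
```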
Frequently asked questions
How much does it save?
On cache-friendly workloads (agent loops, RAG, repeated few-shot prompts), 70-90% off the input bill. Use the prompt cache savings calculator to estimate yours.
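For a rough sense of the arithmetic behind those numbers, here is a back-of-the-envelope sketch. The $3-per-million price and 85% hit rate are made-up inputs, and it ignores the cache-write surcharge some providers add on the first request.

```python
# Back-of-the-envelope savings estimate; all prices and rates are example inputs.
def estimated_input_cost(total_input_tokens: int, cached_fraction: float,
                         price_per_mtok: float, cached_rate: float = 0.10) -> float:
    """Input cost when `cached_fraction` of tokens are billed at `cached_rate` of full price."""
    cached = total_input_tokens * cached_fraction
    fresh = total_input_tokens - cached
    return (fresh + cached * cached_rate) * price_per_mtok / 1_000_000

baseline   = estimated_input_cost(50_000_000, 0.00, 3.00)  # no caching
with_cache = estimated_input_cost(50_000_000, 0.85, 3.00)  # 85% of tokens hit the cache
print(f"${baseline:.2f} -> ${with_cache:.2f} ({1 - with_cache / baseline:.1%} saved)")
# -> $150.00 -> $35.25 (76.5% saved)
```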
Does it work cross-provider?
No — each provider's cache is separate. If you switch from Claude to GPT, you start fresh.
Related terms
- Token: A token is the basic unit of text an LLM reads and produces. Roughly 4 characters or 0.75 words on average for English; longer for code, shorter for languages with lots of subword tokens. APIs bill by token.
- Context window: The context window is the maximum amount of text (in tokens) an AI model can process in a single request, combining your system prompt, conversation history, and output. Past the limit, the model can't 'see' earlier content.