Skip to content
Free Tool Arena

Glossary · Definition

Prompt caching

Prompt caching is a feature where the AI provider stores frequently reused prompt prefixes (system messages, RAG context, few-shot examples) and bills cached reads at ~10% of normal input cost.

Updated June 2026 · 4 min read
100% in-browserNo downloadsNo sign-upMalware-freeHow we keep this safe →

Definition

Prompt caching is a feature where the AI provider stores frequently reused prompt prefixes (system messages, RAG context, few-shot examples) and bills cached reads at ~10% of normal input cost.

What it means

Anthropic, OpenAI, and Google Gemini all support prompt caching as of 2026. Implementation differs slightly: Anthropic uses explicit cache_control breakpoints with 5-min default TTL (1-hour optional). OpenAI auto-caches prefixes ≥1024 tokens for 5-10 min. Gemini has explicit context caching with 1-hour TTL. Cache hits cost roughly 10% of normal input tokens — sometimes 25% on Gemini.

Advertisement

Why it matters

For agentic workloads, RAG, and any app with stable system prompts, prompt caching can cut input costs 80-90%. It's the single biggest cost lever most teams miss. The fix is structural: keep stable parts (system prompt, examples, RAG context) at the start; put dynamic per-request content at the end.

Related free tools

Frequently asked questions

How much does it save?

On cache-friendly workloads (agent loops, RAG, repeated few-shot prompts), 70-90% off the input bill. Use the prompt cache savings calculator to estimate yours.

Does it work cross-provider?

No — each provider's cache is separate. If you switch from Claude to GPT, you start fresh.

Related terms

Found this useful?

The tools stay free thanks to readers who chip in or spread the word.

Buy Me a Coffee