What's the actual SLA on Batch API?

All four major providers (Anthropic, OpenAI, Google, DeepSeek) commit to 24-hour completion. Most actual returns are 1-6 hours; spikes during peak demand can push toward the 24h cap. If you need guaranteed faster turnaround, you must use real-time API at full price.

Are all model variants supported in batch?

Most are, but check provider docs. Anthropic supports Sonnet, Haiku, Opus in batch. OpenAI supports GPT-4o, GPT-4o-mini, o1, o3-mini in batch. Google supports Gemini 1.5/2.x Pro and Flash in batch. DeepSeek supports V3 and R1 in batch. Some specialty endpoints (Anthropic’s computer-use, OpenAI’s real-time API, vision-only models) are not batchable.

Does the 50% discount apply to cached input?

Provider-dependent. Anthropic prompt-caching pricing remains separate from batch — you can stack cache + batch in some cases for compounded savings. OpenAI’s Batch + cached input give similar layered discounts. Read the per-provider pricing pages carefully; the savings can be substantial when stacked.

How do I switch a workload to batch?

Three steps: (1) tag your async workloads — anything that doesn't need a live response. (2) Modify the API endpoint URL — instead of POSTing to /v1/messages or /v1/chat/completions, you upload a JSONL file of requests to /v1/batches. (3) Poll for completion or set up a webhook. Most SDKs (Anthropic Python, OpenAI Python) have built-in batch helpers.

Are there minimum batch sizes?

No strict minimums, but the per-batch overhead means very small batches (1-10 requests) don’t save much in operational time. Sweet spot is 100-10,000 requests per batch. Anthropic caps at 100,000 per batch; OpenAI/Google have similar high caps. Split larger workloads across multiple batches.

What about rate limits?

Batch API has separate rate limits from real-time API at all four providers — typically much higher daily token caps because the workload is async. Anthropic publishes batch-specific rate limits in their console. Plan accordingly: batch is great for huge volumes that would exceed real-time RPM/TPM caps.

AI & Prompt Tools · Free tool

Batch API Savings Calculator

Anthropic, OpenAI, Gemini, and DeepSeek all offer 50% off via batch APIs. Calculate your savings on bulk classification, embeddings, and evals.

Updated June 2026

Input tokens (k)/callOutput tokens (k)/callTotal calls in batch

Average savings switching to batch

$5,015.63 per batch (50% off)

Provider	Real-time	Batch	SLA	Savings
Claude (Anthropic)	$18,750	$9,375	24h	$9,375
OpenAI (GPT-5)	$13,750	$6,875	24h	$6,875
Gemini 2.5 Pro	$6,875	$3,437.5	24h	$3,437.5
DeepSeek (off-peak)	$750	$375	8h	$375

When to use batch: async jobs that don’t need a same-second response — bulk classification, summarization, embedding generation, evals. Submit a JSONL of requests, get a results JSONL back within 24h (most return in 1-6h). 50% savings for the price of patience.

Found this useful?Email Buy Me a Coffee

What it does

The major LLM providers — Anthropic, OpenAI, Google, DeepSeek — all offer a Batch API variant that trades synchronous response time for a flat 50% discount on input and output tokens. The economic logic: batch jobs let providers schedule inference opportunistically across cluster capacity, packing requests into otherwise-idle GPU slots and amortizing infrastructure differently than the real-time path. For customers, the tradeoff is response time — batch jobs typically return in 1-6 hours, with a 24-hour SLA cap. So the question for any workload is: do you actually need the response in the next second, or could you accept “sometime within 24 hours” for half the cost?

The calculator takes your monthly token volume (input + output, per provider) and shows the dollar savings of switching eligible workloads to batch. For a workload spending $5,000/month on Sonnet at standard rates, batching the asynchronous portions would save up to $2,500/month — meaningful for any AI-heavy product. Workloads that batch well: bulk classification or labeling (every record is independent, doesn’t need live response), nightly summarization of documents/conversations/transactions, embedding generation for vector indexes, prompt evals and benchmarks (you’re testing across hundreds of variants), training-data synthesis, and content moderation queues where 1-6 hour latency is acceptable.

What does NOT batch: any user-facing synchronous interaction (chat, search, completion-as-you-type), real-time agents, streaming responses, anything triggered by a user click and showing a loading spinner. Most production LLM apps split into hot and cold paths: hot path uses real-time API for user-facing requests, cold path uses batch for asynchronous work. Done right, this can cut overall AI costs by 30-60% with no UX degradation. Provider-specific notes: Anthropic batch caps at 100K requests per batch, returns within 24h; OpenAI batch returns within 24h; Google batch returns within 24h; DeepSeek batch is similar with slightly tighter SLAs.

Embed this tool on your siteShow snippet

Paste this snippet into any page. Loads on-demand (lazy), no tracking scripts, and sized to most dashboards. Replace the height to fit your layout.

<iframe src="https://freetoolarena.com/embed/batch-api-savings-calculator" width="100%" height="720" frameborder="0" loading="lazy" title="Batch API Savings Calculator" style="border:1px solid #e2e8f0;border-radius:12px;max-width:720px;"></iframe>

Embed docs →

How to use it

Enter your monthly input + output token volume per provider.
Mark which workloads can tolerate 1-24h latency (bulk classification, embeddings, summarization, evals).
Read the 50% savings calculation across all four providers.
Compare to current spend — split-path architectures (hot real-time + cold batch) typically save 30-60% overall.
Plan the migration: tag your async workloads, queue them through the batch endpoint instead of streaming API.

When to use this tool

Estimating savings before adopting Batch API for cold-path workloads.
Justifying a batch-pipeline architecture to engineering leadership with concrete dollar numbers.
Comparing batch economics across the 4 major providers (Anthropic, OpenAI, Google, DeepSeek).
Annual budget planning — projecting AI spend with split hot/cold architecture.

When not to use it

Real-time user-facing workloads — never batch what users wait for in a UI.
Streaming responses (chat) — batch endpoints don’t support streaming output.
Workloads requiring tool use / function calling with multiple synchronous turns — batch is single-request only.
Tiny token volumes (<$50/month) — savings are real but operational complexity often isn’t worth it for small spend.

Common use cases

Onboarding a colleague who needs the same calculation/conversion
Verifying a number or output before passing it on
Quick calculation during a typical workday
Pre-decision sanity-check on inputs and outputs

Frequently asked questions

What's the actual SLA on Batch API?: All four major providers (Anthropic, OpenAI, Google, DeepSeek) commit to 24-hour completion. Most actual returns are 1-6 hours; spikes during peak demand can push toward the 24h cap. If you need guaranteed faster turnaround, you must use real-time API at full price.
Are all model variants supported in batch?: Most are, but check provider docs. Anthropic supports Sonnet, Haiku, Opus in batch. OpenAI supports GPT-4o, GPT-4o-mini, o1, o3-mini in batch. Google supports Gemini 1.5/2.x Pro and Flash in batch. DeepSeek supports V3 and R1 in batch. Some specialty endpoints (Anthropic’s computer-use, OpenAI’s real-time API, vision-only models) are not batchable.
Does the 50% discount apply to cached input?: Provider-dependent. Anthropic prompt-caching pricing remains separate from batch — you can stack cache + batch in some cases for compounded savings. OpenAI’s Batch + cached input give similar layered discounts. Read the per-provider pricing pages carefully; the savings can be substantial when stacked.
How do I switch a workload to batch?: Three steps: (1) tag your async workloads — anything that doesn't need a live response. (2) Modify the API endpoint URL — instead of POSTing to /v1/messages or /v1/chat/completions, you upload a JSONL file of requests to /v1/batches. (3) Poll for completion or set up a webhook. Most SDKs (Anthropic Python, OpenAI Python) have built-in batch helpers.
Are there minimum batch sizes?: No strict minimums, but the per-batch overhead means very small batches (1-10 requests) don’t save much in operational time. Sweet spot is 100-10,000 requests per batch. Anthropic caps at 100,000 per batch; OpenAI/Google have similar high caps. Split larger workloads across multiple batches.
What about rate limits?: Batch API has separate rate limits from real-time API at all four providers — typically much higher daily token caps because the workload is async. Anthropic publishes batch-specific rate limits in their console. Plan accordingly: batch is great for huge volumes that would exceed real-time RPM/TPM caps.

Learn more

Explore more ai & prompt tools tools

100% in-browserNo downloadsNo sign-upMalware-freeHow we keep this safe →