Inference (AI)
Definition
Inference is the process of running a trained AI model to generate predictions or outputs — distinct from training (which builds the model) or fine-tuning (which adapts it).
What it means
When you call an AI API, you pay for inference. When you self-host an LLM on your own GPU, you're running inference. The main cost factors:
- Model size: bigger models cost more per token to run.
- Quantization: lower bit-widths make inference cheaper.
- Batch size: larger batches yield better throughput per dollar.
- Context length: longer contexts consume more memory.
Common optimization stacks: vLLM, TensorRT-LLM, llama.cpp, and exo (multi-machine). Specialized inference hardware: Groq LPUs and Cerebras wafer-scale chips.
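A rough way to see how these factors interact: the sketch below estimates weight memory and KV-cache memory from parameter count, bit-width, and context length. The formulas are back-of-envelope assumptions (they ignore grouped-query attention, activations, and runtime overhead), and the 70B layer/hidden sizes are illustrative, not any specific model's.

```python
# Back-of-envelope inference memory estimate. Everything here is a
# rule-of-thumb: it ignores grouped-query attention, activation memory,
# and runtime overhead, so treat the outputs as rough floors.

def weight_memory_gb(params_billion: float, bits: int) -> float:
    """Memory for the weights alone: parameter count x bits per weight."""
    return params_billion * 1e9 * bits / 8 / 1e9

def kv_cache_gb(layers: int, hidden: int, context_tokens: int, bits: int = 16) -> float:
    """KV cache for one sequence: 2 tensors (K and V) per layer."""
    return 2 * layers * hidden * context_tokens * bits / 8 / 1e9

# Hypothetical 70B-class dense model; layer/hidden counts are illustrative.
for bits in (16, 8, 4):
    print(f"70B weights at {bits}-bit: {weight_memory_gb(70, bits):.0f} GB")
# -> 140 GB at FP16, 70 GB at 8-bit, 35 GB at 4-bit

print(f"KV cache at 8k context: {kv_cache_gb(80, 8192, 8_192):.1f} GB")
```

This is why quantization and context length dominate self-hosting decisions: the same 70B model that needs 140 GB at FP16 fits in roughly 35 GB at 4-bit, while every open sequence adds its own KV cache on top.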
Why it matters
Inference is the production cost driver of AI. Training costs are headline-grabbing but mostly absorbed by labs. Inference costs scale with usage and bite teams shipping AI features. Optimizing inference (quantization, caching, batch APIs, model right-sizing) is where the dollars are saved.
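To make that scaling concrete, here is a minimal sketch of a monthly bill under per-token API pricing. The request volume and both price points are hypothetical placeholders, chosen only to show how model right-sizing moves the total.

```python
# Minimal sketch of how inference cost scales with usage. All prices and
# volumes below are hypothetical placeholders, not quotes from any provider.

def monthly_cost(requests_per_day: int, in_tokens: int, out_tokens: int,
                 in_price_per_m: float, out_price_per_m: float) -> float:
    """Monthly bill in dollars, given per-million-token prices."""
    per_request = (in_tokens * in_price_per_m + out_tokens * out_price_per_m) / 1e6
    return per_request * requests_per_day * 30

# Same workload, frontier-priced model vs a right-sized smaller model.
workload = dict(requests_per_day=50_000, in_tokens=2_000, out_tokens=500)
print(f"Frontier model:    ${monthly_cost(**workload, in_price_per_m=3.00, out_price_per_m=15.00):,.0f}/mo")
print(f"Right-sized model: ${monthly_cost(**workload, in_price_per_m=0.25, out_price_per_m=1.25):,.0f}/mo")
# -> roughly $20,250/mo vs $1,688/mo for the same traffic
```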
Frequently asked questions
Cheapest inference in 2026?
DeepSeek V3.2 at $0.27 per 1M input tokens and $1.10 per 1M output tokens. Self-hosted on Hyperspace pods if you have the hardware. Off-peak DeepSeek pricing is even cheaper.
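At those quoted rates, per-request cost is straightforward arithmetic; the 2,000-in/500-out request shape below is an illustrative assumption:

```python
# Per-request cost at the quoted DeepSeek V3.2 rates: $0.27 per 1M input
# tokens, $1.10 per 1M output tokens. The token counts are illustrative.
in_tokens, out_tokens = 2_000, 500
cost = (in_tokens * 0.27 + out_tokens * 1.10) / 1e6
print(f"${cost:.6f} per request")  # ~$0.0011, i.e. roughly 900 requests per dollar
```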
Fastest inference?
Groq + Cerebras at 500-2,500 tokens/sec. Standard API providers run 30-100 tokens/sec.
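Those throughput figures map directly onto wall-clock time for a response; a quick comparison for a 1,000-token generation, using representative speeds from each range (time-to-first-token and network overhead excluded):

```python
# Time to generate a 1,000-token response at representative speeds from
# each range above (ignores time-to-first-token and network overhead).
for label, tok_per_s in [("Standard API (~60 tok/s)", 60),
                         ("Groq/Cerebras (~1,500 tok/s)", 1_500)]:
    print(f"{label}: {1_000 / tok_per_s:.1f} s")
# -> 16.7 s vs 0.7 s
```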
Related terms
- Fine-tuning: Fine-tuning is the process of further training a pretrained model on your specific data, baking in style, format, or domain knowledge that's hard to achieve with prompting alone.
- Quantization: Quantization compresses AI model weights from 16-bit floats (FP16) to lower bit-widths — Q8, Q5, Q4, Q3 — letting larger models fit on smaller hardware at modest quality cost.
- MoE (Mixture of Experts): MoE (Mixture of Experts) is an AI architecture where the model has many specialized sub-networks ('experts') and only activates a few per token. Lets the model be huge in total parameters but cheap to run.