Inference (AI)
Definition
Inference is the process of running a trained AI model to generate predictions or outputs — distinct from training (which builds the model) or fine-tuning (which adapts it).
What it means
When you call an AI API, you pay for inference. When you self-host an LLM on your own GPU, you're running inference. The main cost factors:
- Model size: bigger models cost more per token to run.
- Quantization: lower bit-widths make inference cheaper.
- Batch size: larger batches yield better throughput per dollar.
- Context length: longer contexts consume more memory.
Common optimization stacks: vLLM, TensorRT-LLM, llama.cpp, and exo (multi-machine). Specialized inference hardware: Groq LPUs and Cerebras wafer-scale chips.
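A rough way to see how these factors interact: the sketch below estimates weight memory and KV-cache memory from parameter count, bit-width, and context length. The formulas are back-of-envelope assumptions (they ignore grouped-query attention, activations, and runtime overhead), and the 70B layer/hidden sizes are illustrative, not any specific model's.

```python
# Back-of-envelope inference memory estimate. Everything here is a
# rule-of-thumb: it ignores grouped-query attention, activation memory,
# and runtime overhead, so treat the outputs as rough floors.

def weight_memory_gb(params_billion: float, bits: int) -> float:
    """Memory for the weights alone: parameter count x bits per weight."""
    return params_billion * 1e9 * bits / 8 / 1e9

def kv_cache_gb(layers: int, hidden: int, context_tokens: int, bits: int = 16) -> float:
    """KV cache for one sequence: 2 tensors (K and V) per layer."""
    return 2 * layers * hidden * context_tokens * bits / 8 / 1e9

# Hypothetical 70B-class dense model; layer/hidden counts are illustrative.
for bits in (16, 8, 4):
    print(f"70B weights at {bits}-bit: {weight_memory_gb(70, bits):.0f} GB")
# -> 140 GB at FP16, 70 GB at 8-bit, 35 GB at 4-bit

print(f"KV cache at 8k context: {kv_cache_gb(80, 8192, 8_192):.1f} GB")
```

This is why quantization and context length dominate self-hosting decisions: the same 70B model that needs 140 GB at FP16 fits in roughly 35 GB at 4-bit, while every open sequence adds its own KV cache on top.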
Why it matters
Inference is the production cost driver of AI. Training costs are headline-grabbing but mostly absorbed by labs. Inference costs scale with usage and bite teams shipping AI features. Optimizing inference (quantization, caching, batch APIs, model right-sizing) is where the dollars are saved.
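To make that scaling concrete, here is a minimal sketch of a monthly bill under per-token API pricing. The request volume and both price points are hypothetical placeholders, chosen only to show how model right-sizing moves the total.

```python
# Minimal sketch of how inference cost scales with usage. All prices and
# volumes below are hypothetical placeholders, not quotes from any provider.

def monthly_cost(requests_per_day: int, in_tokens: int, out_tokens: int,
                 in_price_per_m: float, out_price_per_m: float) -> float:
    """Monthly bill in dollars, given per-million-token prices."""
    per_request = (in_tokens * in_price_per_m + out_tokens * out_price_per_m) / 1e6
    return per_request * requests_per_day * 30

# Same workload, frontier-priced model vs a right-sized smaller model.
workload = dict(requests_per_day=50_000, in_tokens=2_000, out_tokens=500)
print(f"Frontier model:    ${monthly_cost(**workload, in_price_per_m=3.00, out_price_per_m=15.00):,.0f}/mo")
print(f"Right-sized model: ${monthly_cost(**workload, in_price_per_m=0.25, out_price_per_m=1.25):,.0f}/mo")
# -> roughly $20,250/mo vs $1,688/mo for the same traffic
```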
Frequently asked questions
Cheapest inference in 2026?
DeepSeek V3.2 at $0.27 per 1M input tokens and $1.10 per 1M output tokens. Self-hosted on Hyperspace pods if you have the hardware. Off-peak DeepSeek pricing is even cheaper.
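At those quoted rates, per-request cost is straightforward arithmetic; the 2,000-in/500-out request shape below is an illustrative assumption:

```python
# Per-request cost at the quoted DeepSeek V3.2 rates: $0.27 per 1M input
# tokens, $1.10 per 1M output tokens. The token counts are illustrative.
in_tokens, out_tokens = 2_000, 500
cost = (in_tokens * 0.27 + out_tokens * 1.10) / 1e6
print(f"${cost:.6f} per request")  # ~$0.0011, i.e. roughly 900 requests per dollar
```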
Fastest inference?
Groq + Cerebras at 500-2,500 tokens/sec. Standard API providers run 30-100 tokens/sec.
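Those throughput figures map directly onto wall-clock time for a response; a quick comparison for a 1,000-token generation, using representative speeds from each range (time-to-first-token and network overhead excluded):

```python
# Time to generate a 1,000-token response at representative speeds from
# each range above (ignores time-to-first-token and network overhead).
for label, tok_per_s in [("Standard API (~60 tok/s)", 60),
                         ("Groq/Cerebras (~1,500 tok/s)", 1_500)]:
    print(f"{label}: {1_000 / tok_per_s:.1f} s")
# -> 16.7 s vs 0.7 s
```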
Related terms
- Fine-tuning: Fine-tuning is the process of further training a pretrained model on your specific data, baking in style, format, or domain knowledge that's hard to achieve with prompting alone.
- Quantization: Quantization compresses AI model weights from 16-bit floats (FP16) to lower bit-widths — Q8, Q5, Q4, Q3 — letting larger models fit on smaller hardware at modest quality cost.
- MoE (Mixture of Experts): MoE (Mixture of Experts) is an AI architecture where the model has many specialized sub-networks ('experts') and only activates a few per token. Lets the model be huge in total parameters but cheap to run.