Edge inference
Definition
Edge inference means running AI models close to where data is generated — on the user's device, in a CDN/edge POP, or in a regional data center — rather than at a centralized cloud location.
What it means
Edge inference comes in two flavors: (1) on-device, such as Apple Intelligence on an iPhone or small open models like Gemma running in the browser via WebGPU; (2) edge data centers, such as Cloudflare Workers AI or Vercel Edge Functions running smaller models in points of presence near users. Typical use cases: ultra-low latency (voice mode, mobile chat), privacy (data never leaves the device), offline support, and cost (offloading traffic from a centralized API).
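As a minimal sketch of the second flavor, here is a Cloudflare Worker that calls Workers AI; the model ID and `AI` binding name are illustrative, and the `Ai` type comes from @cloudflare/workers-types:

```ts
// A Cloudflare Worker that runs inference in the edge POP nearest the user.
// Assumes an `AI` binding configured in wrangler.toml; the model ID is illustrative.
export interface Env {
  AI: Ai; // type provided by @cloudflare/workers-types
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const { prompt } = (await request.json()) as { prompt: string };

    // env.AI.run() executes the model on GPUs in Cloudflare's network,
    // so the request never travels to a centralized cloud region.
    const result = await env.AI.run("@cf/meta/llama-3.1-8b-instruct", {
      prompt,
      max_tokens: 256,
    });

    return Response.json(result);
  },
};
```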
Why it matters
Edge inference is what makes 'AI everywhere' feel responsive. Apple's on-device Intelligence is the consumer-facing example; Cloudflare Workers AI and Vercel are the developer-platform examples. As small models (Phi-4, Gemma 3 12B, SmolLM3) grow more capable, on-device and edge inference become practical for more use cases.
Frequently asked questions
How small can edge models be?
Phi-4 (14B) runs on a high-end laptop. Gemma 3 12B runs on a Mac mini. SmolLM3 (3B) runs on smartphones. At the ultra-edge, TinyLlama 1.1B runs in the browser via WebGPU.
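To illustrate the browser end of that spectrum, here is a rough sketch using the WebLLM library (@mlc-ai/web-llm); the model ID is illustrative and must match one of WebLLM's prebuilt model list:

```ts
// In-browser inference over WebGPU via the WebLLM library.
// The model ID is illustrative; weights download once and are cached by the browser.
import { CreateMLCEngine } from "@mlc-ai/web-llm";

async function main() {
  // Compiles the model for WebGPU and loads quantized weights into GPU memory.
  const engine = await CreateMLCEngine("TinyLlama-1.1B-Chat-v1.0-q4f16_1-MLC");

  // OpenAI-style chat API, but everything runs locally: no tokens leave the device.
  const reply = await engine.chat.completions.create({
    messages: [{ role: "user", content: "Summarize edge inference in one sentence." }],
  });

  console.log(reply.choices[0].message.content);
}

main();
```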
Edge vs cloud cost?
Edge inference moves spending from per-token API fees to upfront infrastructure. At scale it is often cheaper, especially for high-volume or privacy-sensitive workloads.
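A back-of-envelope break-even calculation makes the trade-off concrete; every figure below is an illustrative assumption, not a quoted price:

```ts
// Back-of-envelope break-even between cloud API fees and edge hardware.
// Every figure here is an illustrative assumption, not a quoted price.
const cloudPricePerMTok = 0.5;          // USD per million tokens (assumed)
const monthlyTokens = 10_000_000_000;   // 10B tokens/month workload (assumed)
const edgeHardware = 20_000;            // upfront server/device spend (assumed)
const edgeMonthlyOpex = 500;            // power, bandwidth, maintenance (assumed)

const cloudMonthly = (monthlyTokens / 1e6) * cloudPricePerMTok; // $5,000/month
const monthlySavings = cloudMonthly - edgeMonthlyOpex;          // $4,500/month
const breakEvenMonths = edgeHardware / monthlySavings;          // ~4.4 months

console.log(`cloud: $${cloudMonthly}/mo, break-even after ~${breakEvenMonths.toFixed(1)} months`);
```

Under these assumptions the hardware pays for itself in a few months; at lower volumes, per-token cloud pricing usually wins.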
Related terms
- Inference (AI): the process of running a trained AI model to generate predictions or outputs, distinct from training (which builds the model) and fine-tuning (which adapts it).
- VRAM: Video RAM, the memory on your GPU. It determines which AI models you can run locally, since the model weights, KV cache, and activations all need to fit; it is the single most relevant hardware spec for local AI. A rough sizing sketch follows.
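The VRAM point can be estimated with simple arithmetic; all shapes and byte sizes below are illustrative assumptions, loosely matching an 8B model with grouped-query attention:

```ts
// Rough VRAM estimate: quantized weights plus KV cache (activation overhead ignored).
// All shapes and byte sizes are illustrative, loosely matching an 8B GQA model.
function estimateVramGiB(
  paramsBillions: number, bytesPerParam: number,
  layers: number, kvHeads: number, headDim: number,
  contextLen: number, bytesPerKvEntry: number,
): number {
  const weights = paramsBillions * 1e9 * bytesPerParam;
  // The KV cache stores a key and a value vector per layer, per KV head, per position.
  const kvCache = 2 * layers * kvHeads * headDim * contextLen * bytesPerKvEntry;
  return (weights + kvCache) / 2 ** 30;
}

// 8B params at 4-bit (~0.5 bytes/param), 32 layers, 8 KV heads, 8k context, fp16 cache:
console.log(estimateVramGiB(8, 0.5, 32, 8, 128, 8192, 2).toFixed(1), "GiB"); // ≈ 4.7
```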