Streaming (AI)
AI streaming sends tokens to the user as they're generated, instead of waiting for the full response. It's the reason ChatGPT, Claude, and Gemini feel fast: text appears word by word.
What it means
Without streaming, a 200-token response might take 5-10 seconds before anything appears. With streaming, the first token arrives in roughly 200-500 ms (TTFT, Time To First Token), and the user sees progress immediately. Common transports are Server-Sent Events (SSE), HTTP/2 streams, and WebSockets. All major LLM APIs support streaming via a `stream: true` flag, and frameworks like the Vercel AI SDK abstract the details.
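A minimal sketch of the SSE side of this, in Python. It assumes each event is a `data:` line carrying a JSON payload like `{"token": "..."}` and that the stream ends with a `data: [DONE]` sentinel — a common shape for LLM streaming endpoints, though the exact payload format varies by provider.

```python
import json

def parse_sse_stream(raw_lines):
    """Yield token strings from Server-Sent Events 'data:' lines.

    Hypothetical wire format: each event is 'data: {"token": "..."}',
    terminated by the sentinel 'data: [DONE]'.
    """
    for line in raw_lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and SSE comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # provider signals end of stream
        yield json.loads(payload)["token"]

# Simulated stream: tokens arrive one event at a time.
events = [
    'data: {"token": "Hello"}',
    'data: {"token": ", "}',
    'data: {"token": "world"}',
    "data: [DONE]",
]

# A real client would render each token as it arrives; here we just join them.
text = "".join(parse_sse_stream(events))
print(text)  # Hello, world
```

In a real client you would read these lines from an HTTP response body and render each token to the UI as soon as it is yielded, rather than joining at the end.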
Why it matters
Streaming dramatically changes perceived performance: the same actual generation speed can feel 5-10x faster when streamed, because users see immediate progress. It's critical for chat UX, voice mode, and any real-time application. Non-streaming is fine for batch processing, or when you need the full response before acting on it (function-call decisions, etc.).
Frequently asked questions
Always stream?
For user-facing chat, yes. For batch or programmatic use where you need the full response anyway, streaming offers no benefit, and skipping it can simplify your code.
Latency vs throughput?
Streaming wins on perceived latency (TTFT). Throughput (tokens/sec) is the same regardless.
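To make the latency-vs-throughput distinction concrete, a back-of-envelope comparison (the numbers are illustrative, not benchmarks):

```python
# Illustrative figures: a 200-token response generated at 40 tokens/sec.
tokens = 200
throughput = 40.0   # tokens per second -- identical with or without streaming
ttft = 0.3          # assumed time to first token (seconds) when streaming

# Non-streaming: nothing appears until the whole response is done.
wait_without_streaming = tokens / throughput   # 5.0 seconds of blank screen

# Streaming: text starts appearing at TTFT, even though total
# generation time is the same.
wait_with_streaming = ttft                     # 0.3 seconds to first text

perceived_speedup = wait_without_streaming / wait_with_streaming
print(wait_without_streaming, wait_with_streaming, round(perceived_speedup, 1))
```

Total generation time is identical in both cases; only the time until the user sees *something* changes, which is why streaming improves perceived latency without touching throughput.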
Related terms
- Inference (AI) — Inference is the process of running a trained AI model to generate predictions or outputs — distinct from training (which builds the model) or fine-tuning (which adapts it).
- Context window — The context window is the maximum amount of text (in tokens) an AI model can process in a single request — combining your system prompt, conversation history, and output. Past the limit, the model can't 'see' earlier content.