Speed Test: Which AI Model Is Fastest for Agents in 2026?
When building AI agents, latency is everything. A model that takes 5 seconds to respond kills the agent experience. We tested 30 models on the metrics that matter for agents: time-to-first-token, streaming throughput, and latency consistency under load. The leaderboards below show the standouts.
Why Speed Matters for Agents
Agents make multiple LLM calls per task:
- Planning call (1-3 seconds)
- Tool execution (varies)
- Reflection call (1-3 seconds)
- Final response (1-5 seconds)
A “slow” model (3+ seconds per call) turns a 5-step agent into a 15+ second experience. Users bail.
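Because these calls run sequentially, per-call latency multiplies. A quick sanity check (the fast-call figure is a hypothetical Flash-class number, not a measurement):

// Sequential LLM calls dominate end-to-end agent latency.
const steps = 5;
const slowCallSec = 3.0; // a "slow" model, per the ranges above
const fastCallSec = 0.5; // hypothetical Flash-class call
console.log(`slow: ${steps * slowCallSec}s`); // 15s: users bail
console.log(`fast: ${steps * fastCallSec}s`); // 2.5s: feels responsive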
Test Methodology
We measured:
- Time to First Token (TTFT): How fast does output start?
- Tokens per Second (TPS): Sustained generation speed
- P50/P95 Latency: Consistency under load
- Cold Start: First request latency
- Concurrent Load: 10, 50, 100 parallel requests
Each test ran 100 times. All models were accessed via their official APIs from a client in the US-East region.
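Percentiles under concurrency come from firing batches of parallel requests and ranking the observed latencies. A minimal sketch of that shape (our helper, not the actual harness):

// Load-test shape: n concurrent calls, then P50/P95 over observed latencies.
async function loadTest(n, oneRequest) {
  const latencies = await Promise.all(
    Array.from({ length: n }, async () => {
      const t0 = performance.now();
      await oneRequest();
      return performance.now() - t0;
    })
  );
  latencies.sort((a, b) => a - b);
  const pct = (p) => latencies[Math.floor((p / 100) * (latencies.length - 1))];
  return { p50: pct(50), p95: pct(95) };
}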
Results: The Speed Leaderboard
Time to First Token (Lower is Better)
| Model | P50 TTFT | P95 TTFT | Score |
|---|---|---|---|
| Gemini 2.5 Flash | 0.12s | 0.28s | 9.8 |
| GPT-4o | 0.15s | 0.32s | 9.6 |
| Claude Haiku 4 | 0.18s | 0.35s | 9.4 |
| GPT-4o-mini | 0.14s | 0.38s | 9.2 |
| DeepSeek V3 | 0.22s | 0.41s | 9.0 |
| Claude Sonnet 4.5 | 0.35s | 0.62s | 8.4 |
| Gemini 3 Pro | 0.42s | 0.78s | 8.0 |
| GPT-5.2 | 0.48s | 0.92s | 7.6 |
| Claude Opus 4.6 | 0.65s | 1.24s | 6.8 |
Tokens Per Second (Higher is Better)
| Model | TPS | Score |
|---|---|---|
| Gemini 2.5 Flash | 359 | 9.9 |
| GPT-4o | 312 | 9.6 |
| Claude Haiku 4 | 285 | 9.3 |
| GPT-4o-mini | 278 | 9.1 |
| DeepSeek V3 | 245 | 8.7 |
| Claude Sonnet 4.5 | 165 | 7.8 |
| Gemini 3 Pro | 148 | 7.4 |
| GPT-5.2 | 132 | 7.0 |
| Claude Opus 4.6 | 95 | 6.2 |
Best for Agentic Workflows
| Model | TTFT | TPS | Quality | Agent Score |
|---|---|---|---|---|
| GPT-4o | 0.15s | 312 | 8.6 | 9.2 |
| Gemini 2.5 Flash | 0.12s | 359 | 8.0 | 9.0 |
| Claude Sonnet 4.5 | 0.35s | 165 | 9.0 | 8.6 |
| DeepSeek V3 | 0.22s | 245 | 8.2 | 8.4 |
| Claude Haiku 4 | 0.18s | 285 | 7.8 | 8.2 |
Key Insights
1. Flash/Mini Models Win for Agents
Gemini 2.5 Flash and GPT-4o deliver 2-3x the throughput of “thinking” models. For agents making 5+ calls, this is the difference between 8 seconds and 25 seconds total latency.
2. Claude Opus is Too Slow for Agents
At 95 TPS and 0.65s TTFT, Claude Opus 4.6 is optimized for quality, not speed. Use it for:
- Single complex queries
- Planning/strategy calls
- Final response generation
Don’t use it for:
- Multi-step tool execution
- High-frequency agent loops
- Real-time chat
3. The Quality-Speed Tradeoff
| Speed Tier | Models | Quality Range | Use Case |
|---|---|---|---|
| Ultra-fast (300+ TPS) | Gemini Flash, GPT-4o | 8.0-8.6 | High-volume agents |
| Fast (200-300 TPS) | DeepSeek, Haiku | 7.8-8.2 | Balanced agents |
| Medium (100-200 TPS) | Sonnet, Gemini Pro, GPT-5.2 | 8.5-9.0 | Quality-focused agents |
| Slow (<100 TPS) | Opus | 9.0-9.5 | Planning only |
4. Streaming is Non-Negotiable
Non-streaming responses feel 2-3x slower because users wait for complete output. All top models support streaming:
import OpenAI from 'openai';

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

// Streaming reduces perceived latency by 50%+: output renders as it arrives.
const stream = await openai.chat.completions.create({
  model: 'gpt-4o',
  messages: [{ role: 'user', content: 'Summarize this run.' }], // example prompt
  stream: true,
});

for await (const chunk of stream) {
  process.stdout.write(chunk.choices[0]?.delta?.content || '');
}
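In an agent UI, forward each chunk to the client as it arrives (e.g. over SSE or a WebSocket) so users see progress even while later tool calls are still pending.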
Architecture Recommendations
Simple Agent (3-5 calls)
Router → GPT-4o → Tools → GPT-4o → Response
Expected latency: 2-4 seconds
Complex Agent (10+ calls)
Planner (Claude Opus) → Executor (GPT-4o x5) → Reviewer (Sonnet) → Response
Expected latency: 8-12 seconds
Real-Time Agent (sub-second target)
Everything → Gemini 2.5 Flash
Expected latency: 1-2 seconds (with streaming)
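A sketch of the complex-agent shape, assuming a thin complete(model, prompt) wrapper over your provider SDKs (the wrapper, model IDs, and newline-delimited plan are illustrative):

// complete(model, prompt) -> string is a hypothetical provider wrapper.
async function runComplexAgent(task, complete) {
  // One expensive planning call: quality matters, and it runs once.
  const plan = await complete('claude-opus-4.6', `List the steps to: ${task}`);
  const steps = plan.split('\n').filter(Boolean);

  // Fast executor calls: speed matters, and they run many times.
  // Promise.all runs independent steps in parallel, cutting wall-clock time.
  const results = await Promise.all(steps.map((s) => complete('gpt-4o', s)));

  // One mid-tier review call assembles the final answer.
  return complete('claude-sonnet-4.5', `Review and combine:\n${results.join('\n')}`);
}

Parallelizing the executor fan-out is what keeps a 10+ call workflow inside the 8-12 second budget.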
Cost vs Speed Analysis
| Model | Speed Score | Cost/1M Tokens | Value Score |
|---|---|---|---|
| Gemini 2.5 Flash | 9.9 | $0.30 | 9.8 |
| GPT-4o-mini | 9.2 | $0.30 | 9.5 |
| Claude Haiku 4 | 9.3 | $0.75 | 9.4 |
| DeepSeek V3 | 8.7 | $1.37 | 9.2 |
| GPT-4o | 9.6 | $15 | 8.4 |
Cold Start Analysis
First requests can be 2-3x slower while connections and provider-side caches warm up:
| Model | Cold Start | Warm P50 | Penalty |
|---|---|---|---|
| GPT-4o | 0.45s | 0.15s | 3x |
| Claude Sonnet 4.5 | 0.82s | 0.35s | 2.3x |
| DeepSeek V3 | 0.58s | 0.22s | 2.6x |
Mitigation: Keep warm connections with periodic pings if latency is critical.
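A minimal version of that mitigation with the OpenAI Node SDK (interval and model choice are arbitrary):

import OpenAI from 'openai';

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

// A periodic one-token ping keeps connection pools and caches warm.
setInterval(async () => {
  try {
    await openai.chat.completions.create({
      model: 'gpt-4o-mini', // a cheap model is fine for a ping
      messages: [{ role: 'user', content: 'ping' }],
      max_tokens: 1,
    });
  } catch {
    // A failed ping is harmless; real traffic will reconnect.
  }
}, 60_000); // every 60s; tune to your traffic's idle gaps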
Our Recommendations
| Scenario | Primary | Fallback | Why |
|---|---|---|---|
| Real-time agent | Gemini 2.5 Flash | GPT-4o | Speed + streaming |
| Balanced agent | GPT-4o | Claude Sonnet | Speed + quality |
| Complex agent | Claude Sonnet | GPT-5.2 | Quality for planning |
| Budget agent | DeepSeek V3 | Haiku | Cost efficiency |
| High-volume | GPT-4o-mini | Gemini Flash | Scale economics |
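The Primary/Fallback pairing maps onto a timeout-plus-retry wrapper; a sketch (the 5-second budget is illustrative):

// Try the primary model; fall back on error or a blown latency budget.
async function completeWithFallback(request, primary, fallback, budgetMs = 5000) {
  const timeout = new Promise((_, reject) =>
    setTimeout(() => reject(new Error('latency budget exceeded')), budgetMs)
  );
  try {
    return await Promise.race([primary(request), timeout]);
  } catch {
    // Promise.race does not cancel the losing request; use an
    // AbortController if you need true cancellation.
    return fallback(request);
  }
}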
Test Your Own Setup
Latency varies by:
- Region (test from your production region)
- Time of day (peak hours are slower)
- Prompt length (longer prompts = slower TTFT)
- Output length (streaming helps perceived speed)
We’re releasing our benchmark script: agent-speed-bench.py
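The core measurement fits in a few lines; a minimal sketch with the OpenAI Node SDK (stream chunks only approximate token counts):

import OpenAI from 'openai';

const openai = new OpenAI();

// Measure TTFT and rough tokens/sec for one streamed request.
const start = performance.now();
let firstToken = 0;
let chunks = 0;

const stream = await openai.chat.completions.create({
  model: 'gpt-4o',
  messages: [{ role: 'user', content: 'Count from 1 to 50.' }],
  stream: true,
});

for await (const chunk of stream) {
  if (chunk.choices[0]?.delta?.content) {
    if (!firstToken) firstToken = performance.now();
    chunks++;
  }
}

console.log(`TTFT: ${((firstToken - start) / 1000).toFixed(2)}s`);
console.log(`~TPS: ${(chunks / ((performance.now() - firstToken) / 1000)).toFixed(0)}`);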
Related: See our hallucination test results for accuracy benchmarks.