Speed Test: Which AI Model Is Fastest for Agents in 2026?
When building AI agents, latency is everything. A model that takes 5 seconds to respond kills the agent experience. We tested 30 models on the metrics that matter for agents: time-to-first-token, streaming throughput, and latency consistency under load. The leaderboards below show the standouts.
Why Speed Matters for Agents
Agents make multiple LLM calls per task:
- Planning call (1-3 seconds)
- Tool execution (varies)
- Reflection call (1-3 seconds)
- Final response (1-5 seconds)
A “slow” model (3+ seconds per call) turns a 5-step agent into a 15+ second experience. Users bail.
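Because these calls run sequentially, per-call latency multiplies. A quick sanity check (the fast-call figure is a hypothetical Flash-class number, not a measurement):

// Sequential LLM calls dominate end-to-end agent latency.
const steps = 5;
const slowCallSec = 3.0; // a "slow" model, per the ranges above
const fastCallSec = 0.5; // hypothetical Flash-class call
console.log(`slow: ${steps * slowCallSec}s`); // 15s: users bail
console.log(`fast: ${steps * fastCallSec}s`); // 2.5s: feels responsive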
Test Methodology
We measured:
- Time to First Token (TTFT): How fast does output start?
- Tokens per Second (TPS): Sustained generation speed
- P50/P95 Latency: Consistency under load
- Cold Start: First request latency
- Concurrent Load: 10, 50, 100 parallel requests
Each test ran 100 times. All models were accessed via their official APIs from a client in the US-East region.
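Percentiles under concurrency come from firing batches of parallel requests and ranking the observed latencies. A minimal sketch of that shape (our helper, not the actual harness):

// Load-test shape: n concurrent calls, then P50/P95 over observed latencies.
async function loadTest(n, oneRequest) {
  const latencies = await Promise.all(
    Array.from({ length: n }, async () => {
      const t0 = performance.now();
      await oneRequest();
      return performance.now() - t0;
    })
  );
  latencies.sort((a, b) => a - b);
  const pct = (p) => latencies[Math.floor((p / 100) * (latencies.length - 1))];
  return { p50: pct(50), p95: pct(95) };
}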
Results: The Speed Leaderboard
Time to First Token (Lower is Better)
| Model | P50 TTFT | P95 TTFT | Score |
|---|---|---|---|
| Gemini 2.5 Flash | 0.12s | 0.28s | 9.8 |
| GPT-4o | 0.15s | 0.32s | 9.6 |
| Claude Haiku 4 | 0.18s | 0.35s | 9.4 |
| GPT-4o-mini | 0.14s | 0.38s | 9.2 |
| DeepSeek V3 | 0.22s | 0.41s | 9.0 |
| Claude Sonnet 4.5 | 0.35s | 0.62s | 8.4 |
| Gemini 3 Pro | 0.42s | 0.78s | 8.0 |
| GPT-5.2 | 0.48s | 0.92s | 7.6 |
| Claude Opus 4.6 | 0.65s | 1.24s | 6.8 |
Tokens Per Second (Higher is Better)
| Model | TPS | Score |
|---|---|---|
| Gemini 2.5 Flash | 359 | 9.9 |
| GPT-4o | 312 | 9.6 |
| Claude Haiku 4 | 285 | 9.3 |
| GPT-4o-mini | 278 | 9.1 |
| DeepSeek V3 | 245 | 8.7 |
| Claude Sonnet 4.5 | 165 | 7.8 |
| Gemini 3 Pro | 148 | 7.4 |
| GPT-5.2 | 132 | 7.0 |
| Claude Opus 4.6 | 95 | 6.2 |
Best for Agentic Workflows
| Model | TTFT | TPS | Quality | Agent Score |
|---|---|---|---|---|
| GPT-4o | 0.15s | 312 | 8.6 | 9.2 |
| Gemini 2.5 Flash | 0.12s | 359 | 8.0 | 9.0 |
| Claude Sonnet 4.5 | 0.35s | 165 | 9.0 | 8.6 |
| DeepSeek V3 | 0.22s | 245 | 8.2 | 8.4 |
| Claude Haiku 4 | 0.18s | 285 | 7.8 | 8.2 |
Key Insights
1. Flash/Mini Models Win for Agents
Gemini 2.5 Flash and GPT-4o deliver 2-3x the throughput of “thinking” models. For agents making 5+ calls, this is the difference between 8 seconds and 25 seconds total latency.
2. Claude Opus is Too Slow for Agents
At 95 TPS and 0.65s TTFT, Claude Opus 4.6 is optimized for quality, not speed. Use it for:
- Single complex queries
- Planning/strategy calls
- Final response generation
Don’t use it for:
- Multi-step tool execution
- High-frequency agent loops
- Real-time chat
3. The Quality-Speed Tradeoff
| Speed Tier | Models | Quality Range | Use Case |
|---|---|---|---|
| Ultra-fast (300+ TPS) | Gemini Flash, GPT-4o | 8.0-8.6 | High-volume agents |
| Fast (200-300 TPS) | DeepSeek, Haiku | 7.8-8.2 | Balanced agents |
| Medium (100-200 TPS) | Sonnet, Gemini Pro, GPT-5.2 | 8.5-9.0 | Quality-focused agents |
| Slow (<100 TPS) | Opus | 9.0-9.5 | Planning only |
4. Streaming is Non-Negotiable
Non-streaming responses feel 2-3x slower because users wait for complete output. All top models support streaming:
import OpenAI from 'openai';

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

// Streaming reduces perceived latency by 50%+: output renders as it arrives.
const stream = await openai.chat.completions.create({
  model: 'gpt-4o',
  messages: [{ role: 'user', content: 'Summarize this run.' }], // example prompt
  stream: true,
});

for await (const chunk of stream) {
  process.stdout.write(chunk.choices[0]?.delta?.content || '');
}
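In an agent UI, forward each chunk to the client as it arrives (e.g. over SSE or a WebSocket) so users see progress even while later tool calls are still pending.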
Architecture Recommendations
Simple Agent (3-5 calls)
Router → GPT-4o → Tools → GPT-4o → Response
Expected latency: 2-4 seconds
Complex Agent (10+ calls)
Planner (Claude Opus) → Executor (GPT-4o x5) → Reviewer (Sonnet) → Response
Expected latency: 8-12 seconds
Real-Time Agent (sub-second target)
Everything → Gemini 2.5 Flash
Expected latency: 1-2 seconds (with streaming)
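A sketch of the complex-agent shape, assuming a thin complete(model, prompt) wrapper over your provider SDKs (the wrapper, model IDs, and newline-delimited plan are illustrative):

// complete(model, prompt) -> string is a hypothetical provider wrapper.
async function runComplexAgent(task, complete) {
  // One expensive planning call: quality matters, and it runs once.
  const plan = await complete('claude-opus-4.6', `List the steps to: ${task}`);
  const steps = plan.split('\n').filter(Boolean);

  // Fast executor calls: speed matters, and they run many times.
  // Promise.all runs independent steps in parallel, cutting wall-clock time.
  const results = await Promise.all(steps.map((s) => complete('gpt-4o', s)));

  // One mid-tier review call assembles the final answer.
  return complete('claude-sonnet-4.5', `Review and combine:\n${results.join('\n')}`);
}

Parallelizing the executor fan-out is what keeps a 10+ call workflow inside the 8-12 second budget.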
Cost vs Speed Analysis
| Model | Speed Score | Cost/1M Tokens | Value Score |
|---|---|---|---|
| Gemini 2.5 Flash | 9.9 | $0.30 | 9.8 |
| GPT-4o-mini | 9.2 | $0.30 | 9.5 |
| Claude Haiku 4 | 9.3 | $0.75 | 9.4 |
| DeepSeek V3 | 8.7 | $1.37 | 9.2 |
| GPT-4o | 9.6 | $15 | 8.4 |
Cold Start Analysis
First requests can be 2-3x slower while connections and provider-side caches warm up:
| Model | Cold Start | Warm P50 | Penalty |
|---|---|---|---|
| GPT-4o | 0.45s | 0.15s | 3x |
| Claude Sonnet 4.5 | 0.82s | 0.35s | 2.3x |
| DeepSeek V3 | 0.58s | 0.22s | 2.6x |
Mitigation: Keep warm connections with periodic pings if latency is critical.
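A minimal version of that mitigation with the OpenAI Node SDK (interval and model choice are arbitrary):

import OpenAI from 'openai';

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

// A periodic one-token ping keeps connection pools and caches warm.
setInterval(async () => {
  try {
    await openai.chat.completions.create({
      model: 'gpt-4o-mini', // a cheap model is fine for a ping
      messages: [{ role: 'user', content: 'ping' }],
      max_tokens: 1,
    });
  } catch {
    // A failed ping is harmless; real traffic will reconnect.
  }
}, 60_000); // every 60s; tune to your traffic's idle gaps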
Our Recommendations
| Scenario | Primary | Fallback | Why |
|---|---|---|---|
| Real-time agent | Gemini 2.5 Flash | GPT-4o | Speed + streaming |
| Balanced agent | GPT-4o | Claude Sonnet | Speed + quality |
| Complex agent | Claude Sonnet | GPT-5.2 | Quality for planning |
| Budget agent | DeepSeek V3 | Haiku | Cost efficiency |
| High-volume | GPT-4o-mini | Gemini Flash | Scale economics |
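The Primary/Fallback pairing maps onto a timeout-plus-retry wrapper; a sketch (the 5-second budget is illustrative):

// Try the primary model; fall back on error or a blown latency budget.
async function completeWithFallback(request, primary, fallback, budgetMs = 5000) {
  const timeout = new Promise((_, reject) =>
    setTimeout(() => reject(new Error('latency budget exceeded')), budgetMs)
  );
  try {
    return await Promise.race([primary(request), timeout]);
  } catch {
    // Promise.race does not cancel the losing request; use an
    // AbortController if you need true cancellation.
    return fallback(request);
  }
}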
Test Your Own Setup
Latency varies by:
- Region (test from your production region)
- Time of day (peak hours are slower)
- Prompt length (longer prompts = slower TTFT)
- Output length (streaming helps perceived speed)
We’re releasing our benchmark script: agent-speed-bench.py
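The core measurement fits in a few lines; a minimal sketch with the OpenAI Node SDK (stream chunks only approximate token counts):

import OpenAI from 'openai';

const openai = new OpenAI();

// Measure TTFT and rough tokens/sec for one streamed request.
const start = performance.now();
let firstToken = 0;
let chunks = 0;

const stream = await openai.chat.completions.create({
  model: 'gpt-4o',
  messages: [{ role: 'user', content: 'Count from 1 to 50.' }],
  stream: true,
});

for await (const chunk of stream) {
  if (chunk.choices[0]?.delta?.content) {
    if (!firstToken) firstToken = performance.now();
    chunks++;
  }
}

console.log(`TTFT: ${((firstToken - start) / 1000).toFixed(2)}s`);
console.log(`~TPS: ${(chunks / ((performance.now() - firstToken) / 1000)).toFixed(0)}`);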
Related: See our hallucination test results for accuracy benchmarks.