
Top 15 AI Models — Complete Leaderboard

Compare the top 15 AI language models ranked by coding, reasoning, and tool-use benchmarks. All pricing and specifications are verified against official provider documentation.

Verified 2026-03-10 · 15 models tracked · 11 AI providers · 4 budget options · 3 large-context models

Need the cheapest strong model?

Start with MiniMax M2.5 or Grok 4.1 Fast if cost matters more than bragging rights.

See quick picks →

Need self-hosted privacy?

2 models in this leaderboard support open-weights / local-first workflows.

Explore local models →

Want the full tradeoff?

Compare open weights vs proprietary models before you commit to an API provider.

Read the guide →
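Because the leaderboard quotes prices per million tokens, comparing models on cost comes down to simple arithmetic. A minimal sketch, using prices copied from the table below and illustrative token counts (the function name and request sizes are examples, not part of the leaderboard):

```python
# Per-request cost estimate from $/M-token prices.
# Prices are taken from the leaderboard table; token counts are illustrative.

PRICES = {  # model: (input $/M tokens, output $/M tokens)
    "MiniMax M2.5": (0.30, 1.20),
    "Grok 4.1 Fast": (0.20, 0.50),
    "Claude Opus 4.6": (5.00, 25.00),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """USD cost of one request at the listed per-million-token rates."""
    in_rate, out_rate = PRICES[model]
    return input_tokens / 1e6 * in_rate + output_tokens / 1e6 * out_rate

# Example: a 100K-token prompt with a 20K-token completion.
for model, _ in PRICES.items():
    print(f"{model}: ${request_cost(model, 100_000, 20_000):.3f}")
```

At that request shape, the budget picks come out roughly 20–30x cheaper per call than the top-ranked model, which is the tradeoff the quick picks above are pointing at.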
Scores are out of 10; prices are USD per million tokens.

| # | Model | Provider | Context | Coding | Reasoning | Tool Use | Best for | Input $/M | Output $/M | Weighted |
|---|-------|----------|---------|--------|-----------|----------|----------|-----------|-----------|----------|
| 1 | Claude Opus 4.6 | Anthropic | 200K | 9.7 | 9.8 | 9.5 | Complex reasoning, critical decisions, long-form analysis | $5.00 | $25.00 | 9.69 |
| 2 | GPT-5.4 | OpenAI | 1.05M | 9.8 | 9.5 | 9.7 | Coding, agents, tool integration | $2.50 | $15.00 | 9.67 |
| 3 | Gemini 3.1 Pro | Google | 1M | 9.5 | 9.5 | 9.3 | Multimodal tasks, long context, search integration | $1.25 | $5.00 | 9.45 |
| 4 | Claude Sonnet 4.6 | Anthropic | 200K | 9.4 | 9.3 | 9.1 | Balanced performance, production workloads, cost-efficient | $3.00 | $15.00 | 9.29 |
| 5 | GPT-5.3 Codex | OpenAI | 200K | 9.7 | 9.3 | 9.4 | Coding-focused tasks, type inference, agentic coding | $3.00 | $15.00 | 9.48 |
| 6 | GLM-5 | Zhipu AI | 205K | 9.2 | 9.3 | 9.0 | Bilingual (CN/EN), value-focused, enterprise | $0.50 | $2.00 | 9.18 |
| 7 | Llama 4 (405B) | Meta | 128K | 9.0 | 9.1 | 8.7 | Self-hosted, open source, customizable | $2.00 | $8.00 | 8.96 |
| 8 | DeepSeek V3 | DeepSeek | 128K | 8.8 | 8.9 | 8.5 | Budget coding, high-volume, cost-sensitive | $0.27 | $1.10 | 8.76 |
| 9 | GPT-5.2 | OpenAI | 128K | 9.3 | 9.2 | 9.0 | General-purpose, balanced tasks | $1.75 | $14.00 | 9.19 |
| 10 | Mistral Large 3 | Mistral | 128K | 8.9 | 9.0 | 8.6 | European compliance, multilingual, enterprise | $2.00 | $6.00 | 8.86 |
| 11 | Kimi K2.5 | Moonshot AI | 256K | 9.4 | 9.3 | 9.2 | Visual coding, long context, agent workflows | $0.60 | $2.50 | 9.32 |
| 12 | MiniMax M2.5 | MiniMax | 196K | 9.1 | 9.2 | 8.9 | Real-world productivity, cost-sensitive, high-volume | $0.30 | $1.20 | 9.08 |
| 13 | Grok 4.1 Fast | xAI | 2M | 9.0 | 9.1 | 8.8 | Long context, web search, X platform data | $0.20 | $0.50 | 8.98 |
| 14 | Qwen 3 Max | Alibaba | 262K | 9.2 | 9.1 | 8.9 | Multilingual, enterprise, Chinese language | $1.20 | $6.00 | 9.09 |
| 15 | GPT-OSS-120B | OpenAI | 128K | 9.3 | 9.2 | 9.0 | Self-hosted, privacy, customization | $0 (self-hosted) | $0 (self-hosted) | 9.19 |
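The weighted scores above are consistent with a fixed 40% coding / 35% reasoning / 25% tool-use split. That split is inferred by checking the published numbers against each row; the leaderboard does not state it, so treat the weights as an assumption. A minimal sketch:

```python
# Reproducing the leaderboard's weighted score.
# The 40/35/25 split is inferred from the published numbers, not
# documented by the site.

WEIGHTS = {"coding": 0.40, "reasoning": 0.35, "tool_use": 0.25}

def weighted_score(coding: float, reasoning: float, tool_use: float) -> float:
    """Weighted average of the three benchmark scores, rounded to 2 dp."""
    total = (coding * WEIGHTS["coding"]
             + reasoning * WEIGHTS["reasoning"]
             + tool_use * WEIGHTS["tool_use"])
    return round(total, 2)

# Spot checks against rows in the table:
print(weighted_score(9.8, 9.5, 9.7))  # GPT-5.4
print(weighted_score(9.4, 9.3, 9.1))  # Claude Sonnet 4.6
print(weighted_score(8.8, 8.9, 8.5))  # DeepSeek V3
```

Note that the displayed rank order does not strictly follow the weighted score (e.g. GPT-5.3 Codex at rank 5 scores 9.48 versus Claude Sonnet 4.6's 9.29 at rank 4), so the site appears to apply additional ordering criteria beyond this number.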