
Top 15 AI Models — Complete Leaderboard

Compare the top 15 AI language models ranked by coding, reasoning, and tool-use benchmarks. All pricing and specifications are verified against official provider documentation.

Verified 2026-03-10 · 15 models tracked · 11 AI providers · 4 budget options · 3 large-context models

Need the cheapest strong model?

Start with MiniMax M2.5 or Grok 4.1 Fast if cost matters more than bragging rights.

See quick picks →

Need self-hosted privacy?

2 models in this leaderboard support open-weights / local-first workflows.

Explore local models →

Want the full tradeoff?

Compare open weights vs proprietary models before you commit to an API provider.

Read the guide →
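Because the leaderboard quotes prices per million tokens, comparing models on cost comes down to simple arithmetic. A minimal sketch, using prices copied from the table below and illustrative token counts (the function name and request sizes are examples, not part of the leaderboard):

```python
# Per-request cost estimate from $/M-token prices.
# Prices are taken from the leaderboard table; token counts are illustrative.

PRICES = {  # model: (input $/M tokens, output $/M tokens)
    "MiniMax M2.5": (0.30, 1.20),
    "Grok 4.1 Fast": (0.20, 0.50),
    "Claude Opus 4.6": (5.00, 25.00),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """USD cost of one request at the listed per-million-token rates."""
    in_rate, out_rate = PRICES[model]
    return input_tokens / 1e6 * in_rate + output_tokens / 1e6 * out_rate

# Example: a 100K-token prompt with a 20K-token completion.
for model, _ in PRICES.items():
    print(f"{model}: ${request_cost(model, 100_000, 20_000):.3f}")
```

At that request shape, the budget picks come out roughly 20–30x cheaper per call than the top-ranked model, which is the tradeoff the quick picks above are pointing at.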
Scores are out of 10; prices are USD per million tokens.

| # | Model | Provider | Context | Coding | Reasoning | Tool Use | Best for | Input $/M | Output $/M | Weighted |
|---|-------|----------|---------|--------|-----------|----------|----------|-----------|-----------|----------|
| 1 | Claude Opus 4.6 | Anthropic | 200K | 9.7 | 9.8 | 9.5 | Complex reasoning, critical decisions, long-form analysis | $5.00 | $25.00 | 9.69 |
| 2 | GPT-5.4 | OpenAI | 1.05M | 9.8 | 9.5 | 9.7 | Coding, agents, tool integration | $2.50 | $15.00 | 9.67 |
| 3 | Gemini 3.1 Pro | Google | 1M | 9.5 | 9.5 | 9.3 | Multimodal tasks, long context, search integration | $1.25 | $5.00 | 9.45 |
| 4 | Claude Sonnet 4.6 | Anthropic | 200K | 9.4 | 9.3 | 9.1 | Balanced performance, production workloads, cost-efficient | $3.00 | $15.00 | 9.29 |
| 5 | GPT-5.3 Codex | OpenAI | 200K | 9.7 | 9.3 | 9.4 | Coding-focused tasks, type inference, agentic coding | $3.00 | $15.00 | 9.48 |
| 6 | GLM-5 | Zhipu AI | 205K | 9.2 | 9.3 | 9.0 | Bilingual (CN/EN), value-focused, enterprise | $0.50 | $2.00 | 9.18 |
| 7 | Llama 4 (405B) | Meta | 128K | 9.0 | 9.1 | 8.7 | Self-hosted, open source, customizable | $2.00 | $8.00 | 8.96 |
| 8 | DeepSeek V3 | DeepSeek | 128K | 8.8 | 8.9 | 8.5 | Budget coding, high-volume, cost-sensitive | $0.27 | $1.10 | 8.76 |
| 9 | GPT-5.2 | OpenAI | 128K | 9.3 | 9.2 | 9.0 | General-purpose, balanced tasks | $1.75 | $14.00 | 9.19 |
| 10 | Mistral Large 3 | Mistral | 128K | 8.9 | 9.0 | 8.6 | European compliance, multilingual, enterprise | $2.00 | $6.00 | 8.86 |
| 11 | Kimi K2.5 | Moonshot AI | 256K | 9.4 | 9.3 | 9.2 | Visual coding, long context, agent workflows | $0.60 | $2.50 | 9.32 |
| 12 | MiniMax M2.5 | MiniMax | 196K | 9.1 | 9.2 | 8.9 | Real-world productivity, cost-sensitive, high-volume | $0.30 | $1.20 | 9.08 |
| 13 | Grok 4.1 Fast | xAI | 2M | 9.0 | 9.1 | 8.8 | Long context, web search, X platform data | $0.20 | $0.50 | 8.98 |
| 14 | Qwen 3 Max | Alibaba | 262K | 9.2 | 9.1 | 8.9 | Multilingual, enterprise, Chinese language | $1.20 | $6.00 | 9.09 |
| 15 | GPT-OSS-120B | OpenAI | 128K | 9.3 | 9.2 | 9.0 | Self-hosted, privacy, customization | $0 (self-hosted) | $0 (self-hosted) | 9.19 |
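The weighted scores above are consistent with a fixed 40% coding / 35% reasoning / 25% tool-use split. That split is inferred by checking the published numbers against each row; the leaderboard does not state it, so treat the weights as an assumption. A minimal sketch:

```python
# Reproducing the leaderboard's weighted score.
# The 40/35/25 split is inferred from the published numbers, not
# documented by the site.

WEIGHTS = {"coding": 0.40, "reasoning": 0.35, "tool_use": 0.25}

def weighted_score(coding: float, reasoning: float, tool_use: float) -> float:
    """Weighted average of the three benchmark scores, rounded to 2 dp."""
    total = (coding * WEIGHTS["coding"]
             + reasoning * WEIGHTS["reasoning"]
             + tool_use * WEIGHTS["tool_use"])
    return round(total, 2)

# Spot checks against rows in the table:
print(weighted_score(9.8, 9.5, 9.7))  # GPT-5.4
print(weighted_score(9.4, 9.3, 9.1))  # Claude Sonnet 4.6
print(weighted_score(8.8, 8.9, 8.5))  # DeepSeek V3
```

Note that the displayed rank order does not strictly follow the weighted score (e.g. GPT-5.3 Codex at rank 5 scores 9.48 versus Claude Sonnet 4.6's 9.29 at rank 4), so the site appears to apply additional ordering criteria beyond this number.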