Eval Categories
💻 Coding: Bug fixes, refactors, API patches, diffs. Current leader: GPT-5 (8 evals)
🧠 Reasoning: Decisions, tradeoffs, leadership calls. Current leader: Claude Opus 4.1 (6 evals)
🔧 Tool-use: Docs, CLI flows, integrations. Current leader: GPT-5 (6 evals)
📚 RAG / Research: Retrieval, synthesis, audit trails. (1 eval)
🤖 Agents: Multi-step plans, tool reliability. (1 eval)
⚡ Latency & Cost: Speed, spend, and throughput tradeoffs. (1 eval)

Coding Scorecards
Bug fixes, diffs, refactors, and API patches tested across leading models.
Reasoning Scorecards
Decision quality under pressure, tradeoffs, and strategy calls.
Tool-use Scorecards
Docs-driven tasks, CLI accuracy, and security hygiene.
RAG / Research Scorecards
Retrieval accuracy, synthesis quality, and citation reliability.
Agent Reliability Scorecards
Multi-step task success, tool failures, and recovery behavior.
All Recent Scorecards
- Daily Model Eval Scorecard — 2026-04-07
- Daily Model Eval Scorecard — 2026-04-01
- Daily Model Eval Scorecard — 2026-03-30
- Daily Model Eval Scorecard — 2026-03-23
- Daily Model Eval Scorecard — 2026-03-12
- What Is Chatbot Arena ELO? The Crowdsourced AI Ranking Explained
- What Is MMLU-Pro? The Advanced AI Benchmark Explained
- What Is SWE-Bench? The AI Coding Benchmark Explained
- AI API Costs 2026: Complete Pricing Comparison
- AI Model API vs Self-Hosted: 2026 Cost Comparison
- AI Hallucinations: Why Models Make Things Up and How to Prevent Them
- AI Benchmark Results 2026: Model Performance Rankings