Eval Categories
💻
Coding
Bug fixes, refactors, API patches, diffs.
Current leader: GPT-5
23 evals
🧠
Reasoning
Decisions, tradeoffs, leadership calls.
Current leader: Claude Opus 4.1
20 evals
🔧
Tool-use
Docs, CLI flows, integrations.
Current leader: GPT-5
20 evals
📚
RAG / Research
Retrieval, synthesis, audit trails.
1 eval
🤖
Agents
Multi-step plans, tool reliability.
2 evals
⚡
Latency & Cost
Speed, spend, and throughput tradeoffs.
2 evals
Coding Scorecards
Bug fixes, diffs, refactors, and API patches tested across leading models.
Reasoning Scorecards
Decision quality under pressure, tradeoffs, and strategy calls.
Tool-use Scorecards
Docs-driven tasks, CLI accuracy, and security hygiene.
RAG / Research Scorecards
Retrieval accuracy, synthesis quality, and citation reliability.
Agent Reliability Scorecards
Multi-step task success, tool failures, and recovery behavior.
All Recent Scorecards
- Daily Model Eval Scorecard — 2026-04-11
- Daily Model Eval Scorecard — 2026-04-10
- Daily Model Eval Scorecard — 2026-04-07
- Daily Model Eval Scorecard — 2026-04-06
- Daily Model Eval Scorecard — 2026-04-05
- Daily Model Eval Scorecard — 2026-04-04
- Daily Model Eval Scorecard — 2026-04-02
- Daily Model Eval Scorecard — 2026-04-01
- Daily Model Eval Scorecard — 2026-03-31
- Daily Model Eval Scorecard — 2026-03-30
- Daily Model Eval Scorecard — 2026-03-29
- Daily Model Eval Scorecard — 2026-03-28