Recent Scorecards
Latest model evaluations across coding, reasoning, and tool-use tasks.
Why Engineers Trust This
Built for teams choosing AI models for production workloads.
Real engineering tasks
Bug fixes, build-vs-buy decisions, docs-driven workflows. No synthetic toy problems.
Rubrics you can audit
Every eval includes the exact prompt, scoring criteria, and failure cases. Fully reproducible.
Actionable recommendations
Pick the right model for each job, with a clear rationale. We track cost, latency, and reliability, not just accuracy.
How It Works
Transparent, reproducible benchmarks you can verify.
Real tasks
We run actual engineering prompts — bug fixes, tradeoffs, integrations.
Blind scoring
Two engineers independently score each output against a 10-point rubric (see the sketch after this list).
Full transparency
Every prompt, rubric, and failure case is published.
Daily updates
New scorecards every morning at 6am ET.
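As an illustration only, here is a minimal Python sketch of how two blind rubric scores might be recorded and reconciled. The schema, function names, and the disagreement threshold are all assumptions for the example, not the site's actual scoring pipeline:

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class RubricScore:
    """One engineer's blind score for a single model output (hypothetical schema)."""
    reviewer: str  # anonymized reviewer ID; reviewers never see each other's scores
    points: int    # 0-10 against the published rubric

def reconcile(scores: list[RubricScore], max_gap: int = 3) -> float:
    """Average two blind scores; escalate if they diverge too far.

    The max_gap threshold is an assumption for illustration, not a documented rule.
    """
    if len(scores) != 2:
        raise ValueError("expected exactly two blind scores")
    a, b = scores[0].points, scores[1].points
    if abs(a - b) > max_gap:
        raise ValueError(f"scores diverge by {abs(a - b)}; escalate to a third reviewer")
    return mean((a, b))

# Example: reviewers scored the same output 7 and 8 -> final score 7.5
print(reconcile([RubricScore("rev-A", 7), RubricScore("rev-B", 8)]))
```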
What We Track
| Category | Fields |
|---|---|
| Performance | Task score, error rate, rubric compliance |
| Cost | Tokens used, estimated spend, cost per task |
| Latency | P50/P95 response time, time-to-first-token |
| Reliability | Failure cases, guardrail misses, audit notes |
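To make the schema concrete, here is a hedged Python sketch of what a single scorecard row could look like. The field names and types are assumptions derived from the table above, not the site's actual data model:

```python
from dataclasses import dataclass, field

@dataclass
class ScorecardEntry:
    """One model's results on one task (hypothetical schema mirroring the table above)."""
    model: str              # e.g. "GPT-5.4 XHigh"
    task: str               # e.g. a specific bug-fix or integration prompt
    # Performance
    task_score: float       # blind-scored 0-10 rubric average
    error_rate: float       # fraction of runs producing an incorrect result
    rubric_compliant: bool  # met every rubric criterion
    # Cost
    tokens_used: int
    estimated_spend_usd: float
    cost_per_task_usd: float
    # Latency (milliseconds)
    p50_ms: float
    p95_ms: float
    ttft_ms: float          # time-to-first-token
    # Reliability
    failure_cases: list[str] = field(default_factory=list)
    guardrail_misses: int = 0
    audit_notes: str = ""
```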
Latest Articles
Deep dives on model performance, comparisons, and engineering benchmarks.
Daily Model Eval Scorecard — 2026-04-07
Head-to-head results across coding, reasoning, and tool-use tasks. Today: Gemini 3.1 Pro Preview, GPT-5.4 XHigh, Gemma 4, and Qwen 3.5.
Daily Model Eval Scorecard — 2026-04-01
Head-to-head results across coding, reasoning, and tool-use tasks. Today: Gemini 3.1 Pro Preview, GPT-5.4 XHigh, Grok 4.20 Beta, and Llama 4 Scout.
Daily Model Eval Scorecard — 2026-03-30
Head-to-head results across coding, reasoning, and tool-use tasks. Today: Gemini 3.1 Pro Preview, GPT-5.4 XHigh, Mercury 2, and Llama 4 Scout.
Provider Pages
Quick picks by provider if you already know whose API you want to use.
Start with today's winner
Go straight to the daily scorecard, then drill into task-level breakdowns for coding, reasoning, and tool-use.
View scorecards →