Daily Model Eval Scorecard — 2026-02-16
This is the daily scorecard for three practical tasks: a React state race condition, a capacity planning decision for a startup, and a multi-step GitHub Actions workflow. We test five frontier models on operator-grade workloads.
Scorecard (10-point scale)
| Model | Coding | Reasoning | Tool-use | Weighted Total |
|---|---|---|---|---|
| Claude Opus 4.6 (Adaptive) | 9.6 | 9.7 | 9.4 | 9.59 |
| GPT-5.2 (xhigh) | 9.6 | 9.5 | 9.5 | 9.54 |
| Claude Opus 4.5 | 9.4 | 9.5 | 9.2 | 9.39 |
| Gemini 3 Pro | 9.3 | 9.3 | 9.1 | 9.25 |
| GLM-5 | 9.0 | 9.1 | 8.8 | 8.99 |
Weights: coding 40%, reasoning 35%, tool-use 25%. We bias toward code correctness and decision quality because those errors are the most expensive in production.
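The weighted total can be reproduced with a few lines of JavaScript. This is a minimal sketch, not the scoring script behind this page: the function name `weightedTotal` and the `toolUse` key are illustrative, and rounding half up to two decimals is an assumption. It works in integer units so that floating-point error does not shift the last digit.

```javascript
// Weights from the scorecard, expressed in percent so the math stays in integers.
const WEIGHTS_PCT = { coding: 40, reasoning: 35, toolUse: 25 };

// Hypothetical helper: combines three 10-point scores into one weighted total.
function weightedTotal({ coding, reasoning, toolUse }) {
  // Scores have one decimal place, so convert them to whole tenths first.
  const tenths = (score) => Math.round(score * 10);

  // Sum of (tenths * percent) is the weighted total in thousandths.
  const thousandths =
    tenths(coding) * WEIGHTS_PCT.coding +
    tenths(reasoning) * WEIGHTS_PCT.reasoning +
    tenths(toolUse) * WEIGHTS_PCT.toolUse;

  // Round half up to hundredths, then scale back to the 10-point range.
  return Math.round(thousandths / 10) / 100;
}

// Claude Opus 4.6 (Adaptive): 9.6 coding, 9.7 reasoning, 9.4 tool-use.
console.log(weightedTotal({ coding: 9.6, reasoning: 9.7, toolUse: 9.4 })); // 9.59
```

The unrounded value for that row is 3.84 + 3.395 + 2.35 = 9.585, which rounds up to 9.59.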
Today’s winner
Claude Opus 4.6 (Adaptive) takes the top spot at 9.59, driven by the highest reasoning score (9.7). GPT-5.2 (xhigh) is a close second at 9.54 and posts the best tool-use score (9.5). Claude Opus 4.5 places third at 9.39, with every category between 9.2 and 9.5.
Tasks + prompts
1) Coding: React useEffect race condition
Goal: Fix a race condition between overlapping fetches in a React component.
Prompt
You have a React component that fetches user data when a userId prop changes. Users report seeing the wrong user's data briefly before it corrects. Identify the bug and provide a fixed component.
function UserProfile({ userId }) {
  const [user, setUser] = useState(null)
  useEffect(() => {
    fetch(`/api/users/${userId}`).then(res => res.json()).then(setUser)
  }, [userId])
  return <div>{user?.name}</div>
}
Rubric
- Identifies race condition / stale closure
- Cancels or ignores outdated requests in the effect's cleanup function (AbortController or an ignore flag)
- Handles loading/error states
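For reference, one shape of answer that would satisfy this rubric is sketched below. It is not taken from any model's output. The helper `startUserFetch`, its callback names, and the injectable `fetchImpl` parameter are hypothetical, introduced so the logic can run outside React. The helper returns a cleanup function, which is the same thing a `useEffect` callback would return.

```javascript
// Hypothetical helper: starts one request and returns a cleanup function.
// fetchImpl is injectable for testing and defaults to the global fetch.
function startUserFetch(userId, { onData, onError }, fetchImpl = fetch) {
  const controller = new AbortController();

  fetchImpl(`/api/users/${userId}`, { signal: controller.signal })
    .then((res) => {
      if (!res.ok) throw new Error(`HTTP ${res.status}`);
      return res.json();
    })
    .then((user) => {
      // Drop a response that arrives after cleanup has already run.
      if (!controller.signal.aborted) onData(user);
    })
    .catch((err) => {
      // An aborted request is expected; report only real failures.
      if (err.name !== 'AbortError' && !controller.signal.aborted) onError(err);
    });

  // Cleanup: cancel the in-flight request when userId changes or on unmount.
  return () => controller.abort();
}

// Inside the component (sketch; setStatus and setError are assumed state setters):
// useEffect(() => {
//   setStatus('loading');
//   return startUserFetch(userId, {
//     onData: (u) => { setUser(u); setStatus('ready'); },
//     onError: (e) => { setError(e); setStatus('error'); },
//   });
// }, [userId]);
```

Because React runs the previous effect's cleanup before the next effect starts, a slow response for an old `userId` can no longer overwrite the data for the current one.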
2) Reasoning: capacity planning decision
Goal: Recommend infrastructure scaling strategy.
Prompt
Your SaaS startup's traffic just tripled in 3 months (10K → 30K MAU). Current AWS bill is $4K/mo. You're considering: (1) vertical scaling, (2) horizontal with auto-scaling, (3) migrate to serverless. Recommend with a 12-month TCO analysis.
Rubric
- Provides concrete numbers for each option
- Considers migration cost and learning curve
- Recommends based on growth trajectory, not just current needs
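The "concrete numbers" the rubric asks for can be framed as a simple 12-month total cost of ownership calculation. The sketch below is illustrative only: the function `twelveMonthTco` and its parameters are hypothetical, the compounding monthly growth model is an assumption, and the only figure taken from the prompt is the $4K/mo starting bill. Growth rates and migration costs must come from the answer's own estimates.

```javascript
// Hypothetical helper: 12-month TCO = one-time migration cost
// plus twelve monthly bills that grow by a fixed rate each month.
function twelveMonthTco({ startMonthly, monthlyGrowth, migrationCost }) {
  let total = migrationCost;
  let bill = startMonthly;
  for (let month = 0; month < 12; month++) {
    total += bill;
    bill *= 1 + monthlyGrowth; // assumed compounding growth in the bill
  }
  return Math.round(total);
}

// Baseline sanity check: the current $4K/mo bill with no growth and no migration.
console.log(twelveMonthTco({ startMonthly: 4000, monthlyGrowth: 0, migrationCost: 0 })); // 48000
```

Running the same function once per option, each with its own starting bill, growth rate, and migration cost, gives three comparable totals and makes the assumptions behind the recommendation explicit.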
3) Tool use: GitHub Actions CI pipeline
Goal: Configure a multi-step CI pipeline with caching.
Prompt
Set up a GitHub Actions workflow that: (1) runs tests on Node 18, 20, 22, (2) caches npm dependencies, (3) uploads test coverage to Codecov, (4) runs only on changes to src/. Provide the workflow YAML.
Rubric
- Uses matrix strategy for Node versions
- Correctly configures cache with restore/save
- Uses paths filter to avoid unnecessary runs
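A workflow along the following lines would meet the rubric. Treat it as a sketch under stated assumptions rather than a reference answer: it assumes a committed `package-lock.json`, a test runner that accepts `--coverage` (Jest, for example), and a repository secret named `CODECOV_TOKEN`. Caching relies on the built-in `cache: npm` option of `actions/setup-node`, which restores and saves the npm cache keyed on the lockfile.

```yaml
name: CI

on:
  push:
    paths:
      - 'src/**'   # run only when files under src/ change
  pull_request:
    paths:
      - 'src/**'

jobs:
  test:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        node-version: [18, 20, 22]
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: ${{ matrix.node-version }}
          cache: npm   # restores and saves the npm cache
      - run: npm ci
      - run: npm test -- --coverage   # assumes the runner supports --coverage
      - uses: codecov/codecov-action@v4
        with:
          token: ${{ secrets.CODECOV_TOKEN }}   # assumed secret name
```

Note that a strict `src/**` filter also skips runs when only `package.json` or the workflow file changes, so an answer may reasonably add those paths.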
Operator takeaways
- Claude Opus 4.6 (Adaptive) leads today's scorecard and posts the top reasoning score (9.7), making it a strong pick for complex reasoning and critical decisions.
- GPT-5.2 (xhigh) ties for the top coding score (9.6) and leads tool-use (9.5), which suits coding and agentic tasks.
- Claude Opus 4.5 remains a top choice for reasoning-heavy workflows and consistent quality.
- GLM-5 delivers top-tier intelligence at a fraction of the cost with strong bilingual support.
- Gemini 3 Pro leads multimodal understanding with superior search integration.
Fact-check notes (updated February 16, 2026)
The model names on this page were validated against official provider documentation:
- Anthropic: Model list, Pricing
- OpenAI: Models, Pricing
- Google: Gemini models, Gemini pricing
- Zhipu: GLM-5, Pricing
Why we anchor against public benchmarks
Public benchmarks aren’t perfect, but they provide outside calibration:
- SWE-bench for real bug-fixing tasks (SWE-bench Leaderboards)
- HumanEval for code-generation correctness (OpenAI HumanEval)
- Chatbot Arena for crowd-sourced Elo ratings (Chatbot Arena)
These provide context, but we still run task-level evals because leadership decisions are rarely captured in a leaderboard.
What’s next
Tomorrow’s eval focuses on:
- Database migration (PostgreSQL → Neon)
- Security hardening (OAuth implementation audit)
Related: individual deep-dives on coding, reasoning, and tool-use.