Home Models Coding Agents Compare Pricing Model Picker Source Data Local Models OpenClaw
← Back to all evals
Daily Model Eval Scorecard — 2026-02-16

Daily Model Eval Scorecard — 2026-02-16


This is the daily scorecard for three practical tasks: a React state race condition, a capacity planning decision for a startup, and a multi-step GitHub Actions workflow. We test 5 frontier models on operator-grade workloads.

Scorecard (10-point scale)

ModelCodingReasoningTool-useWeighted Total
Claude Opus 4.6 (Adaptive)9.69.79.49.59
GPT-5.2 (xhigh)9.69.59.59.54
Claude Opus 4.59.49.59.29.39
GLM-59.09.18.89.00
Gemini 3 Pro9.39.39.19.27

Weights: coding 40%, reasoning 35%, tool-use 25%. We bias toward code correctness and decision quality because those errors are the most expensive in production.

Today’s winner

Claude Opus 4.6 (Adaptive) takes the top spot with exceptional reasoning capabilities. GPT-5.2 (xhigh) remains a close second with the best tool-use performance. Claude Opus 4.5 continues to impress with consistent quality across all tasks.

Tasks + prompts

1) Coding: React useEffect race condition

Goal: Fix a stale closure bug in a React component.

Prompt

You have a React component that fetches user data when a userId prop changes. Users report seeing the wrong user's data briefly before it corrects. Identify the bug and provide a fixed component.

function UserProfile({ userId }) {
  const [user, setUser] = useState(null)

  useEffect(() => {
    fetch(`/api/users/${userId}`).then(res => res.json()).then(setUser)
  }, [userId])

  return <div>{user?.name}</div>
}

Rubric

  • Identifies race condition / stale closure
  • Uses AbortController or cleanup function
  • Handles loading/error states

2) Reasoning: capacity planning decision

Goal: Recommend infrastructure scaling strategy.

Prompt

Your SaaS startup's traffic just tripled in 3 months (10K → 30K MAU). Current AWS bill is $4K/mo. You're considering: (1) vertical scaling, (2) horizontal with auto-scaling, (3) migrate to serverless. Recommend with a 12-month TCO analysis.

Rubric

  • Provides concrete numbers for each option
  • Considers migration cost and learning curve
  • Recommends based on growth trajectory, not just current needs

3) Tool use: GitHub Actions CI pipeline

Goal: Configure a multi-step CI pipeline with caching.

Prompt

Set up a GitHub Actions workflow that: (1) runs tests on Node 18, 20, 22, (2) caches npm dependencies, (3) uploads test coverage to Codecov, (4) runs only on changes to src/. Provide the workflow YAML.

Rubric

  • Uses matrix strategy for Node versions
  • Correctly configures cache with restore/save
  • Uses paths filter to avoid unnecessary runs

Operator takeaways

  • Claude Opus 4.6 (Adaptive) is the new intelligence leader. Exceptional for complex reasoning and critical decisions.
  • GPT-5.2 (xhigh) excels at coding and agentic tasks with excellent tool integration.
  • Claude Opus 4.5 remains a top choice for reasoning-heavy workflows and consistent quality.
  • GLM-5 delivers top-tier intelligence at a fraction of the cost with strong bilingual support.
  • Gemini 3 Pro leads multimodal understanding with superior search integration.

Fact-check notes (updated February 16, 2026)

The model names on this page were validated against official provider documentation:

Why we anchor against public benchmarks

Public benchmarks aren’t perfect, but they provide outside calibration:

These provide context, but we still run task-level evals because leadership decisions are rarely captured in a leaderboard.

What’s next

Tomorrow’s eval focuses on:

  • Database migration (PostgreSQL → Neon)
  • Security hardening (OAuth implementation audit)

Related: individual deep-dives on coding, reasoning, and tool-use.