Daily Model Eval Scorecard — 2026-02-13

This is the daily scorecard for three practical tasks: a React state bug, a capacity‑planning decision for Q3, and a GitHub Actions workflow setup. We test three frontier models on these operator‑grade workloads.

Scorecard (10‑point scale)

Model            Correctness   Speed   Clarity   Weighted Total
GPT‑5            9.2           8.9     9.4       9.18
Gemini 2.5 Pro   9.0           9.3     8.8       9.04
DeepSeek R1      9.4           8.2     8.5       8.74

Weights: correctness 45%, speed 25%, clarity 30%. We prioritize code correctness and decision quality because those errors are the most expensive in production.

Today’s winner

GPT‑5 edges ahead with the best balance of correctness and clarity. DeepSeek R1 leads on correctness (9.4), but its slower inference and slightly lower clarity hurt its weighted score. Gemini 2.5 Pro is the speed demon — fastest response times by a clear margin.

Tasks + prompts

1) Coding: React state bug

Goal: Fix a stale closure in a useEffect hook.

Prompt

You have a React component with a button that increments a counter and an interval that logs the count every second. Users report the logged count stays at 0 no matter how many times they click. Identify the bug and fix it.

import { useState, useEffect } from 'react'

function Counter() {
  const [count, setCount] = useState(0)

  useEffect(() => {
    const timer = setInterval(() => {
      console.log('Count is:', count)
    }, 1000)
    return () => clearInterval(timer)
  }, []) // Missing dependency?

  return <button onClick={() => setCount(count + 1)}>{count}</button>
}

Rubric

  • Identifies stale closure: count captured at initial render
  • Correctly adds count to the dependency array OR uses a functional update setCount(c => c + 1); see the sketch below
  • Explains why the fix works
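
For reference, one fix that satisfies the rubric might look like the sketch below (an illustration, not any model's graded output). It adds count to the dependency array so the interval's closure is re‑created with the latest value, and uses a functional update in the click handler for good measure.

import { useState, useEffect } from 'react'

function Counter() {
  const [count, setCount] = useState(0)

  useEffect(() => {
    const timer = setInterval(() => {
      console.log('Count is:', count) // now logs the latest value
    }, 1000)
    return () => clearInterval(timer)
  }, [count]) // re-run the effect when count changes, so the closure is never stale

  // Functional update: correct even if this handler were called from a stale closure
  return <button onClick={() => setCount(c => c + 1)}>{count}</button>
}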

2) Reasoning: capacity planning decision

Goal: Recommend infrastructure scaling for Q3 traffic growth.

Prompt

Your SaaS platform currently handles 50k daily active users on 3 m5.xlarge EC2 instances. Projected growth is 3x by Q3. You have a $15k/month infrastructure budget. Evaluate: vertical scaling vs horizontal scaling vs managed services (ECS/Fargate). Provide a decision matrix and final recommendation.

Rubric

  • Covers cost, scalability, operational overhead, and risk
  • Provides a decision matrix or tradeoff table
  • Makes a clear recommendation with justification
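
To ground the cost criterion, a back‑of‑envelope check is worth sketching. The snippet below assumes load scales roughly linearly with DAU and an illustrative on‑demand rate of about $0.19/hr for m5.xlarge; actual pricing varies by region and commitment, so treat the numbers as placeholders.

// Back-of-envelope: does naive horizontal scaling blow the $15k/month budget?
// ASSUMPTION: ~$0.19/hr on-demand for m5.xlarge; verify current regional pricing.
const hourlyRate = 0.19
const hoursPerMonth = 730

const currentInstances = 3
const growthFactor = 3        // projected 3x DAU by Q3
const monthlyBudget = 15000

// Assume load scales roughly linearly with DAU, so 3x traffic needs ~3x instances.
const projectedInstances = currentInstances * growthFactor
const projectedCost = projectedInstances * hourlyRate * hoursPerMonth

console.log(`${projectedInstances} instances ≈ $${Math.round(projectedCost)}/month (budget: $${monthlyBudget})`)
// ≈ 9 instances ≈ $1,248/month, well under budget, so the interesting tradeoffs
// are operational overhead and elasticity rather than raw compute cost.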

3) Tool‑use: CI/CD workflow setup

Goal: Set up a GitHub Actions workflow for a Node.js API.

Prompt

Create a GitHub Actions workflow that: (1) runs tests on Node 18 and 20, (2) caches npm dependencies, (3) deploys to AWS Elastic Beanstalk on main branch push. Provide the minimal YAML configuration.

Rubric

  • Uses actions/setup-node with matrix strategy
  • Includes dependency caching
  • Has correct EB deploy action or custom script
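
For calibration, a minimal workflow along the lines the rubric asks for might look like the sketch below. It assumes the package has a test script and repository secrets AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY, and uses the community einaregilsson/beanstalk-deploy action as one acceptable deploy step; the application and environment names are placeholders.

name: ci-cd

on:
  push:
    branches: [main]
  pull_request:

jobs:
  test:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        node-version: [18, 20]
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: ${{ matrix.node-version }}
          cache: npm            # caches ~/.npm keyed on package-lock.json
      - run: npm ci
      - run: npm test

  deploy:
    needs: test
    if: github.event_name == 'push' && github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: zip -r deploy.zip . -x '*.git*'
      - uses: einaregilsson/beanstalk-deploy@v22   # community action; app/env names are placeholders
        with:
          aws_access_key: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws_secret_key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          application_name: my-node-api
          environment_name: my-node-api-prod
          version_label: ${{ github.sha }}
          region: us-east-1
          deployment_package: deploy.zip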

Operator takeaways

  • GPT‑5 is the all‑rounder. Best when you need balanced performance across correctness, speed, and clarity. Ideal for mixed workloads.
  • Gemini 2.5 Pro shines on speed. Best for high‑throughput scenarios where response latency matters more than marginal correctness gains.
  • DeepSeek R1 leads on correctness. Choose this when accuracy is non‑negotiable and you can tolerate slightly slower responses.

Why we run task‑level evals

Public benchmarks are useful for outside calibration, but they don't capture real‑world production decisions. Leadership calls and production bugs don't show up on leaderboards, which is why we run our own task‑level evals on the workloads we actually operate.

What’s next

Tomorrow’s eval focuses on:

  • Database migration (PostgreSQL → CockroachDB)
  • Multi‑step research (agentic synthesis)

Related: individual deep‑dives on coding, reasoning, and tool‑use.