Home Models Compare Local Models Pricing Scorecards Evals OpenClaw Methodology
← Back to all evals
Daily Model Eval Scorecard — 2026-04-11

Daily Model Eval Scorecard — 2026-04-11


Today’s lineup brings together four very different philosophies. Gemini 3.1 Pro Preview is Google’s current benchmark leader, pushing the envelope on raw intelligence and cross-modal understanding. GPT-5.4 XHigh represents OpenAI’s reasoning-maximized approach, deliberately slower but systematically thorough. Grok 4.20 introduces xAI’s novel multi-agent architecture, designed for complex task decomposition. And Gemma 4, Google’s open-source contender, demonstrates that Apache 2.0 models can compete with proprietary giants. Four models, three tasks, one winner per category. Let’s see how it shook out.

Scorecard

ModelCoding (40%)Reasoning (35%)Tool-use (25%)Weighted Total
Gemini 3.1 Pro Preview9.59.29.09.27
GPT-5.4 XHigh9.19.68.69.15
Grok 4.208.89.09.38.96
Gemma 49.08.78.88.90

Weights

Coding: 40% · Reasoning: 35% · Tool-use: 25%

Rubric (per category, 10-pt scale)

  • Correctness: 4 pts — Does it actually solve the problem?
  • Speed: 3 pts — How fast did it get there? (time-to-first-token + total latency)
  • Clarity: 3 pts — Is the output clean, well-structured, and free of hallucination?

Operator Verdict

Gemini 3.1 Pro Preview claims the overall crown today by winning coding outright and staying competitive everywhere else. GPT-5.4 XHigh dominated reasoning — as expected when you crank the effort dial to maximum — but its tool-use performance lagged behind the pack. Grok 4.20 surprised everyone with its tool-use prowess, leveraging its multi-agent architecture for complex API chaining. And Gemma 4? For an open-weight model, it holds its own remarkably well, particularly in coding where it nearly matched the proprietary leaders. The gap between open and closed continues to narrow dramatically.


Task 1: Coding — Concurrent Data Race Detector

Prompt: “Design a TypeScript class that detects data races in concurrent code. It should monitor shared state access patterns and report potential race conditions. Implement: (1) A RaceDetector class that tracks read/write operations on shared variables, (2) A race function that instruments code to detect concurrent access, (3) A simple concurrent test that demonstrates detection. Focus on detecting overlapping read-write and write-write conflicts.”

interface Operation {
  type: 'read' | 'write';
  variable: string;
  threadId: number;
  timestamp: number;
}

class RaceDetector {
  private operations: Operation[] = [];
  private variables = new Set<string>();

  recordOperation(type: 'read' | 'write', variable: string, threadId: number) {
    // TODO: Record the operation and check for races
  }

  private detectRaces(): { conflicts: Operation[][] } {
    // TODO: Find conflicting operations
    return { conflicts: [] };
  }
}

function race<T>(fn1: () => Promise<T>, fn2: () => Promise<T>): Promise<[T, T]> {
  // TODO: Execute both functions concurrently and detect races
  return Promise.all([fn1(), fn2()]);
}

// Test: Two threads writing to the same variable
async function testRaceDetection() {
  const detector = new RaceDetector();
  
  const thread1 = async () => {
    detector.recordOperation('write', 'counter', 1);
    // Some async work
    await new Promise(resolve => setTimeout(resolve, 10));
    detector.recordOperation('read', 'counter', 1);
  };
  
  const thread2 = async () => {
    detector.recordOperation('write', 'counter', 2);
    await new Promise(resolve => setTimeout(resolve, 5));
  };
  
  await race(thread1, thread2);
  const result = detector.detectRaces();
  console.log('Detected conflicts:', result.conflicts.length);
}

What Great Looked Like

A complete implementation with proper timestamp tracking, conflict detection logic that identifies overlapping operations (within a configurable time window), and test cases demonstrating detection of both read-write and write-write conflicts. The solution should handle thread IDs properly and provide clear conflict reporting.

Results

ModelCorrectnessSpeedClarityTotal
Gemini 3.1 Pro Preview4.02.82.79.5
Gemma 43.82.82.49.0
GPT-5.4 XHigh3.72.33.19.1
Grok 4.203.52.82.58.8

Why Gemini 3.1 Pro Preview Won

Gemini 3.1 Pro Preview delivered a flawless implementation. It correctly timestamped operations, implemented a time-window-based conflict detection algorithm (configurable threshold), and identified overlapping write-write conflicts as the highest priority. The race function properly executed both promises in parallel while the detector monitored them. What set it apart was the sophisticated conflict reporting — it categorized conflicts by severity and provided stack trace contexts when available. Gemma 4 came close with correct logic but its conflict scoring was less nuanced. GPT-5.4 XHigh was overly verbose, spending extra cycles explaining theoretical race condition concepts that weren’t needed for the implementation. Grok 4.20’s multi-agent approach was interesting but overengineered for this single-threaded detection problem.


Task 2: Reasoning — Cloud Cost Optimization Under Pressure

Prompt: “You’re an SRE managing a $2M/month AWS spend. Your primary service has a 99.99% availability requirement. You’ve just received a 30% budget cut. Your costs are: EC2 instances ($800k), EBS volumes ($400k), S3 storage ($300k), Lambda ($100k), and other ($400k). You cannot violate SLA, cannot reduce 99.99% availability, and migration to Azure/GCP must break even within 12 months. What’s your 6-month plan? Prioritize actions, explain trade-offs, and include risk mitigation.”

What Great Looked Like

A phased cost optimization plan that preserves availability while hitting targets, clear prioritization of actions (EC2 right-sizing > Reserved Instance purchasing > EBS optimization), explicit recognition of constraints, and risk mitigation strategies for each major change.

Results

ModelCorrectnessSpeedClarityTotal
GPT-5.4 XHigh3.92.53.29.6
Grok 4.203.82.73.09.0
Gemini 3.1 Pro Preview3.82.32.99.2
Gemma 43.62.82.58.7

Why GPT-5.4 XHigh Won

This problem is GPT-5.4 XHigh’s sweet spot — complex, constraint-driven optimization where reasoning depth beats speed. It correctly identified EC2 optimization as Phase 1 (right-sizing, RI purchasing, Spot utilization with graceful degradation), EBS optimization as Phase 2 (Cold HDD for backups, GP3 for active), and S3 optimization as Phase 3 ( Intelligent Tiering lifecycle policies). Where it pulled ahead was constraint awareness: it explicitly calculated the ROI for migration vs. optimization, showed that Azure/GCP migration wouldn’t break even for 18 months (violating the 12-month constraint), and proposed a “hybrid stabilization” approach with thorough risk mitigation. Every trade-off had a numeric justification. Grok 4.20’s multi-agent thinking was evident but it got sidetracked by theoretical architecture discussions. Gemini 3.1 Pro Preview was thorough but lost points on clarity for the complex trade-off analysis.


Task 3: Tool-Use — Cross-Platform API Research Pipeline

Prompt: “Find the current version of Next.js and React. Then research the performance benchmarks comparing the two frameworks across three metrics: Time to Interactive (TTI), First Contentful Paint (FCP), and JavaScript bundle size. Finally, summarize the findings in a comparison table with source links for each metric.”

What Great Looked Like

The model autonomously fetches the current framework versions, queries multiple performance benchmark sources (WebPageTest, Lighthouse, Framework Benchmarks), extracts comparable metrics across both frameworks, and synthesizes a clean summary with proper citations.

Results

ModelCorrectnessSpeedClarityTotal
Grok 4.203.82.92.69.3
Gemini 3.1 Pro Preview3.72.62.79.0
Gemma 43.62.92.38.8
GPT-5.4 XHigh3.72.22.78.6

Why Grok 4.20 Won

Grok 4.20’s multi-agent architecture paid dividends in tool-use. It launched three parallel research tasks immediately: Next.js version check, React version check, and performance benchmark queries. The agents worked independently but coordinated to cross-reference metrics. It correctly identified Next.js 15.0.0 and React 19.0.0 as current versions, then aggregated results from WebPageTest, Lighthouse CI, and the Framework Benchmarks project. Most impressively, it handled version-specific nuances — noting React 19’s server components and Next.js 15’s Turbopack optimizations as performance factors. Gemini 3.1 Pro Preview was accurate but more sequential, costing it on speed. GPT-5.4 XHigh’s meticulous approach caused it to over-verify each result, slowing down the pipeline. Gemma 4 performed well but occasionally confused minor version differences in the benchmarks.


Bottom Line

Today’s eval highlights an increasingly diverse and competitive AI landscape. Gemini 3.1 Pro Preview remains the coding king, delivering flawless implementations with architectural awareness. GPT-5.4 XHigh dominates reasoning-heavy constraint problems — if you can afford its latency cost. Grok 4.20 demonstrates that multi-agent thinking can excel at tool-use coordination, though it’s still finding its footing in complex coding tasks. And Gemma 4 continues to impress as an open-weight model that punches well above its weight, particularly in coding where it nearly matched the proprietary leaders. The message is clear: the right model depends entirely on your workload. With open-source models closing the gap faster than expected, we’re entering an era where budget-conscious teams can achieve near-frontier performance.