Daily Model Eval Scorecard — 2026-04-06
Today’s lineup brings together four very different approaches to AI intelligence. GPT-5.4 is OpenAI’s flagship, betting on broad capability and ecosystem integration. Claude Opus 4.6 pushes Anthropic’s focus on deep understanding and precise instruction-following. Gemini 3.1 Pro represents Google’s aggressive pursuit of reasoning dominance. And DeepSeek V4, the latest from the open-source challenger, aims to prove that cost efficiency doesn’t mean capability compromise. Four models, three tasks, one winner per category. Let’s see how it shook out.
Scorecard
| Model | Coding (40%) | Reasoning (35%) | Tool-use (25%) | Weighted Total |
|---|---|---|---|---|
| GPT-5.4 | 9.1 | 9.4 | 8.8 | 9.05 |
| Gemini 3.1 Pro | 8.9 | 9.6 | 9.0 | 9.19 |
| Claude Opus 4.6 | 9.4 | 9.2 | 8.7 | 9.07 |
| DeepSeek V4 | 8.8 | 8.7 | 9.2 | 8.88 |
Weights
Coding: 40% · Reasoning: 35% · Tool-use: 25%
Rubric (per category, 10-pt scale)
- Correctness: 4 pts — Does it actually solve the problem?
- Speed: 3 pts — How fast did it get there? (time-to-first-token + total latency)
- Clarity: 3 pts — Is the output clean, well-structured, and free of hallucination?
Operator Verdict
Gemini 3.1 Pro takes the overall crown today by dominating reasoning outright and staying competitive in tool-use. GPT-5.4 remains the consistent all-rounder with solid coding and reasoning performance, though its tool-use lag is becoming noticeable. Claude Opus 4.6 reaffirms its coding dominance, pulling ahead in that category while maintaining strong reasoning results. DeepSeek V4 continues to impress with its cost-effective approach, particularly excelling in tool-use where it leverages efficient API patterns. The gap between proprietary and open-source models is narrowing, with DeepSeek V4 proving you don’t need enterprise pricing to compete at the frontier.
Task 1: Coding — React Server Components Bug Fix
Prompt: “Fix the data fetching race condition in this React Server Components setup. There’s a bug where getServerSession() and getUsers() both execute on the server simultaneously, causing a hydration mismatch. Provide the corrected code with an optimized pattern for sequential data fetching and explain the fix.”
// app/page.tsx
import { Suspense } from 'react';
import { getServerSession } from 'next-auth';
import { getUsers } from '@/lib/users';
async function ServerComponent() {
// Bug: parallel execution causes hydration mismatch
const session = await getServerSession();
const users = await getUsers();
return (
<div>
<h1>Welcome {session?.user?.name}</h1>
<Suspense fallback={<div>Loading users...</div>}>
<UserList users={users} />
</Suspense>
</div>
);
}
async function UserList({ users }) {
return (
<ul>
{users.map(user => (
<li key={user.id}>{user.name}</li>
))}
</ul>
);
}
export default function HomePage() {
return <ServerComponent />;
}
What Great Looked Like
A corrected implementation that properly sequences data fetching, eliminates the race condition, and explains the React Server Components optimization pattern.
Results
| Model | Correctness | Speed | Clarity | Total |
|---|---|---|---|---|
| Claude Opus 4.6 | 4.0 | 2.7 | 2.7 | 9.4 |
| GPT-5.4 | 3.9 | 2.6 | 2.6 | 9.1 |
| DeepSeek V4 | 3.8 | 2.7 | 2.3 | 8.8 |
| Gemini 3.1 Pro | 3.7 | 2.6 | 2.6 | 8.9 |
Why Claude Opus 4.6 Won
Opus 4.6 immediately identified the core issue: React Server Components require sequential data fetching to avoid hydration mismatches. It implemented the correct pattern by moving getUsers() inside the Suspense boundary and using Promise.all() for non-blocking operations where possible. The fix included proper error boundaries and clear comments explaining why React’s streaming architecture requires this sequencing. GPT-5.4 provided a solid solution but was more verbose, while DeepSeek V4’s approach technically worked but lacked the architectural context that makes it production-ready. Gemini 3.1 Pro focused too much on explaining Next.js specifics rather than the fundamental React Server Components pattern.
Task 2: Reasoning — Database Schema Migration Strategy
Prompt: “You’re migrating a legacy monolith to microservices. The current system has: (1) 100GB MySQL with 50+ tables and complex foreign keys, (2) 500GB Postgres document storage with custom indexing, (3) 1TB file storage with 10M+ small files. You need to migrate to separate services with zero downtime. Team has 6 months and 4 engineers. Which components do you migrate first, in what order, and what’s your rollback strategy? Justify technical constraints.”
What Great Looked Like
A clear migration sequence that prioritizes components by risk and complexity, with detailed rollback contingencies for each phase.
Results
| Model | Correctness | Speed | Clarity | Total |
|---|---|---|---|---|
| Gemini 3.1 Pro | 3.9 | 2.8 | 2.9 | 9.6 |
| GPT-5.4 | 3.8 | 2.5 | 2.9 | 9.4 |
| Claude Opus 4.6 | 3.7 | 2.5 | 2.8 | 9.2 |
| DeepSeek V4 | 3.5 | 2.6 | 2.6 | 8.7 |
Why Gemini 3.1 Pro Won
Gemini 3.1 Pro’s approach was masterfully structured: it identified the MySQL schema as Phase 1 (lowest risk, most predictable), Postgres document storage as Phase 2 (moderate complexity, requires careful index migration), and file storage last (highest risk due to sheer scale). What set it apart was the rollback strategy for each phase—MySQL could be rolled back instantly with a single switch, Postgres required index recreation scripts, and file storage needed a hybrid approach with versioned buckets. Most importantly, it acknowledged the 4-engineer constraint explicitly, suggesting parallel work streams where possible while maintaining clear ownership boundaries. GPT-5.4’s plan was solid but over-engineered for the timeline, while Claude Opus 4.6 focused too much on theoretical database concepts without addressing the practical scaling challenges.
Task 3: Tool-Use — Multi-API E-commerce Product Research
Prompt: “Research current smartphone market trends: 1) Find the best-selling Android flagship phone in Q1 2026 with pricing, 2) Compare it to the iPhone 16 Pro’s specifications and pricing, 3) Analyze 3 recent reviews comparing these two models, 4) Create a summary table with pros/cons for each. Include sources for all data.”
What Great Looked Like
The model autonomously finds current sales data, extracts comparable specs from official sources, fetches independent reviews, and synthesizes a balanced comparison.
Results
| Model | Correctness | Speed | Clarity | Total |
|---|---|---|---|---|
| DeepSeek V4 | 3.7 | 3.0 | 2.5 | 9.2 |
| Gemini 3.1 Pro | 3.8 | 2.6 | 2.6 | 9.0 |
| GPT-5.4 | 3.7 | 2.3 | 2.8 | 8.8 |
| Claude Opus 4.6 | 3.8 | 2.2 | 2.4 | 8.7 |
Why DeepSeek V4 Won
DeepSeek V4 demonstrated exceptional efficiency in tool-use, executing parallel searches across sales data, specifications, and review sites simultaneously. It correctly identified the Samsung Galaxy S25 Ultra as the Q1 2026 Android flagship and provided precise pricing comparison points. Where it truly excelled was in source verification—cross-checking manufacturer data with independent review sites and clearly flagging where information conflicted. The summary table was clean, with balanced pros/cons that acknowledged both devices’ strengths. Gemini 3.1 Pro was thorough but sequential, while GPT-5.4 and Claude Opus 4.6 both showed signs of over-researching, repeatedly querying similar data points as if second-guessing their results. DeepSeek’s single-pass approach yielded the best results in the shortest time.
Bottom Line
Today’s eval highlighted a fascinating shift in the AI model landscape. Gemini 3.1 Pro has clearly established itself as the reasoning champion, demonstrating superior analytical capabilities in complex technical scenarios. Claude Opus 4.6 remains unmatched in coding tasks, combining technical precision with clean implementation. GPT-5.4 continues to be the reliable all-rounder, though its tool-use performance is showing signs of age against newer models. Most notably, DeepSeek V4 proves that open-source models are closing the gap rapidly—especially in tool-use efficiency, where its cost-effective approach delivers enterprise results at a fraction of the price. The lesson here is clear: the best model choice depends entirely on your use case, and today’s open-source alternatives make economic sense even for mission-critical tasks.