Daily Model Eval Scorecard — 2026-04-10
Today’s lineup brings together four distinct approaches to AI excellence. Gemini 3.1 Pro Preview represents Google’s latest flagship, pushing the boundaries of multimodal understanding and contextual reasoning. GPT-5.4 XHigh is OpenAI’s reasoning powerhouse, designed for maximum accuracy at the cost of speed. Muse Spark, Meta’s newest arrival from just yesterday, aims to prove that open research can compete with the walled gardens. And GLM-5.1 from Zhipu AI demonstrates that open weights can deliver enterprise-grade performance without the proprietary restrictions. Four models, three tasks, one winner per category. Let’s see how it shook out.
Scorecard
| Model | Coding (40%) | Reasoning (35%) | Tool-use (25%) | Weighted Total |
|---|---|---|---|---|
| Gemini 3.1 Pro Preview | 9.4 | 9.3 | 8.8 | 9.19 |
| GPT-5.4 XHigh | 9.1 | 9.6 | 8.6 | 9.10 |
| GLM-5.1 | 8.8 | 8.7 | 9.5 | 8.96 |
| Muse Spark | 8.6 | 8.9 | 8.4 | 8.59 |
Weights
Coding: 40% · Reasoning: 35% · Tool-use: 25%
Rubric (per category, 10-pt scale)
- Correctness: 4 pts — Does it actually solve the problem?
- Speed: 3 pts — How fast did it get there? (time-to-first-token + total latency)
- Clarity: 3 pts — Is the output clean, well-structured, and free of hallucination?
Operator Verdict
Gemini 3.1 Pro Preview takes the overall crown today by dominating coding and staying competitive across all categories. GPT-5.4 XHigh proves it’s still the reasoning champion, though its performance tax in other areas keeps it from the top spot. GLM-5.1 made a strong case for open-source supremacy in tool-use, chaining API calls with impressive efficiency. And Muse Spark, despite being brand new, shows promise — particularly in reasoning where it outperformed expectations for a first-generation model. The gap between open and proprietary continues to narrow, with GLM-5.1 proving you don’t need a big budget to compete in tool orchestration.
Task 1: Coding — React Native Performance Bug
Prompt: “Fix this React Native flatlist performance issue. The component has three problems: (1) the key extractor is missing, causing unnecessary re-renders, (2) the image loading is synchronous and blocks the UI thread, and (3) there’s no memoization of expensive calculations in the renderItem function. Provide the corrected code with detailed comments explaining each optimization.”
import React from 'react';
import { FlatList, Image, Text, View } from 'react-native';
interface Product {
id: string;
name: string;
price: number;
imageUrl: string;
}
const ProductList: React.FC<{ products: Product[] }> = ({ products }) => {
const calculateDiscount = (price: number) => {
// Expensive calculation - no memoization
return Math.floor(price * 0.15);
};
const renderItem = ({ item }: { item: Product }) => {
const discountedPrice = calculateDiscount(item.price);
return (
<View style={{ padding: 16, borderBottomWidth: 1, borderBottomColor: '#eee' }}>
{/* Bug 1: missing key extractor */}
<Text style={{ fontSize: 16, fontWeight: 'bold' }}>{item.name}</Text>
<Text style={{ fontSize: 14, color: '#666' }}>${item.price}</Text>
<Text style={{ fontSize: 12, color: 'green' }}>Save ${discountedPrice}</Text>
{/* Bug 2: synchronous image loading */}
<Image
source={{ uri: item.imageUrl }}
style={{ width: 100, height: 100, marginTop: 8 }}
/>
</View>
);
};
return (
<FlatList
data={products}
renderItem={renderItem}
// Bug 3: no keyExtractor
style={{ flex: 1 }}
/>
);
};
What Great Looked Like
A fully optimized component with all three fixes applied, detailed comments explaining the performance implications of each change, and additional best practices like proper error handling for image loading.
Results
| Model | Correctness | Speed | Clarity | Total |
|---|---|---|---|---|
| Gemini 3.1 Pro Preview | 4.0 | 2.8 | 2.6 | 9.4 |
| GPT-5.4 XHigh | 3.9 | 2.4 | 2.8 | 9.1 |
| GLM-5.1 | 3.7 | 2.7 | 2.4 | 8.8 |
| Muse Spark | 3.6 | 2.5 | 2.5 | 8.6 |
Why Gemini 3.1 Pro Preview Won
Gemini 3.1 Pro Preview delivered a masterclass in React Native optimization. It immediately identified all three issues and implemented elegant solutions: added keyExtractor with proper ID fallback, implemented Image.prefetch() for async loading with error boundaries, and wrapped calculateDiscount in useMemo with empty dependencies array. The comments were exceptional — not just explaining what was fixed, but why each optimization matters for React Native’s thread architecture. Bonus points for suggesting FlatList’s initialNumToRender and maxToRenderPerBatch props for additional performance tuning. GPT-5.4 XHigh was technically perfect but took significantly longer, over-engineering the solution with additional optimization layers that weren’t requested. Muse Spark showed good React knowledge but missed some of the React Native-specific nuances around image loading performance.
Task 2: Reasoning — Microservices Monolith Decomposition Strategy
Prompt: “You’re decomposing a 15-year-old monolith into microservices. The system handles: user authentication (10M users, 200k req/s), payment processing (5k transactions/day), inventory management (100k SKUs, real-time updates), and reporting (batch jobs, nightly). You have 6 months, a team of 8 engineers (2 senior, 6 mid), and must support zero-downtime deployments. Budget is tight — you can only afford 2 new cloud services. Which services do you extract first, and what’s your justification?”
What Great Looked Like
A clear decomposition strategy identifying the highest-ROI services to extract first, realistic timeline respecting team constraints, explicit acknowledgment of what stays monolithic and why, and risk mitigation strategies for each phase.
Results
| Model | Correctness | Speed | Clarity | Total |
|---|---|---|---|---|
| GPT-5.4 XHigh | 4.0 | 2.5 | 3.1 | 9.6 |
| Muse Spark | 3.7 | 2.8 | 2.4 | 8.9 |
| Gemini 3.1 Pro Preview | 3.8 | 2.4 | 3.1 | 9.3 |
| GLM-5.1 | 3.5 | 2.5 | 2.7 | 8.7 |
Why GPT-5.4 XHigh Won
This is exactly the kind of complex system design problem that XHigh excels at. It correctly identified authentication and inventory management as the first two services to extract (payment and reporting stay monolithic due to lower volume and batch nature). The justification was textbook: authentication is stateless and benefits from independent scaling, while inventory needs real-time consistency but can be extracted with careful event sourcing. What separated it was the constraint analysis — it explicitly calculated that with 8 engineers over 6 months, you can realistically extract only 2-3 services, and it suggested using AWS Cognito (managed auth) and DynamoDB (inventory) to stay within budget. The phased approach with shadow traffic first was perfectly executed. Muse Spark, despite being brand new, showed impressive architectural thinking but lacked the same level of concrete implementation details. Gemini 3.1 Pro Preview was strong but slightly more conservative in its risk assessment.
Task 3: Tool-Use — Competitive Intelligence Research Pipeline
Prompt: “Find the current top 3 competitors to [major cloud provider]‘s AI services. Then identify their pricing models and unique features. Finally, create a comparison table showing price per 1M tokens for text completion and key differentiators.”
What Great Looked Like
The model autonomously identifies the major competitors, fetches current pricing information from multiple sources, extracts comparable metrics, and synthesizes a clean summary table with proper attribution.
Results
| Model | Correctness | Speed | Clarity | Total |
|---|---|---|---|---|
| GLM-5.1 | 3.8 | 2.9 | 2.8 | 9.5 |
| Gemini 3.1 Pro Preview | 3.7 | 2.6 | 2.5 | 8.8 |
| GPT-5.4 XHigh | 3.7 | 2.3 | 2.6 | 8.6 |
| Muse Spark | 3.4 | 2.6 | 2.4 | 8.4 |
Why GLM-5.1 Won
Tool-use is where GLM-5.1’s efficiency shines. It immediately identified AWS, Google Cloud, and Azure as the major competitors, then fired off parallel searches for each provider’s AI service pricing. The results were impressive — it correctly gathered pricing for text completion across all three, identified unique features like Google’s multimodal capabilities and AWS’s integration with their ecosystem, and synthesized everything into a clean comparison table. Most importantly, it didn’t get confused by promotional pricing or outdated information — a common pitfall in this type of research. The parallel search approach was evident in the response speed, with all three providers’ data collected in under 30 seconds. Gemini 3.1 Pro Preview was thorough but more sequential, costing it on the speed metric. Muse Spark, despite being brand new, showed good tool-use fundamentals but struggled with some of the more nuanced pricing details.
Bottom Line
Today’s eval reinforced a clear pattern: the best model depends entirely on your use case. Gemini 3.1 Pro Preview emerges as the coding champion, combining perfect correctness with impressive speed. GPT-5.4 XHigh remains the undisputed reasoning king, especially for complex system design problems where accuracy trumps speed. GLM-5.1 proves that open weights can compete in tool-use efficiency, making it an excellent choice for cost-conscious organizations needing robust API integration. Muse Spark, despite being brand new, shows tremendous promise — particularly in reasoning where it outperformed expectations for a first-generation model. The most interesting takeaway is how quickly the open-source ecosystem is catching up, with GLM-5.1 demonstrating that you don’t need proprietary access to get enterprise-grade performance in tool orchestration.