Home Models Compare Local Models Pricing Scorecards Evals OpenClaw Methodology
← Back to all evals
Daily Model Eval Scorecard — 2026-04-01

Daily Model Eval Scorecard — 2026-04-01


April Fools’ Day, but these numbers are no joke. Gemini 3.1 Pro Preview sits atop the Artificial Analysis intelligence index and brings a 1M-token context window to the table. GPT-5.4 XHigh remains OpenAI’s reasoning sledgehammer — slow, expensive, but devastatingly thorough. Grok 4.20 Beta 0309 is xAI’s latest with a massive context window and a cult following for its raw speed. And Llama 4 Scout represents Meta’s open-weight ambitions, proving you don’t need a paywall to play at this level. Four models, three tasks, one winner per category. Let’s go.

Scorecard

ModelCoding (40%)Reasoning (35%)Tool-use (25%)Weighted Total
Gemini 3.1 Pro Preview9.49.28.89.18
GPT-5.4 XHigh9.19.68.59.11
Grok 4.20 Beta 03098.58.79.38.77
Llama 4 Scout8.89.08.68.81

Weights

Coding: 40% · Reasoning: 35% · Tool-use: 25%

Rubric (per category, 10-pt scale)

  • Correctness: 4 pts — Does it actually solve the problem?
  • Speed: 3 pts — How fast did it get there? (time-to-first-token + total latency)
  • Clarity: 3 pts — Is the output clean, well-structured, and free of hallucination?

Operator Verdict

Gemini 3.1 Pro Preview takes the overall crown with a commanding coding performance and strong reasoning. Its 109 tokens/s output speed is a serious advantage — it’s nearly as fast as models a tenth of its intelligence. GPT-5.4 XHigh once again dominated reasoning with the highest single-category score we’ve seen this week, but its latency remains the Achilles’ heel. Llama 4 Scout quietly posted the best reasoning score among open-weight models, enough to edge out Grok 4.20 overall despite Grok’s tool-use prowess. The open-weight gap continues to narrow — Llama 4 Scout is within striking distance of proprietary models on every dimension.


Task 1: Coding — Race Condition in a Concurrent Cache

Prompt: “This Rust async cache has a race condition: two concurrent get_or_insert calls for the same key can both miss the cache and trigger duplicate expensive computations. Fix it so that only one computation runs per key, while keeping the API non-blocking. Provide the corrected code with comments explaining the synchronization strategy.”

use std::collections::HashMap;
use std::sync::Arc;
use tokio::sync::RwLock;

pub struct Cache<K, V> {
    store: Arc<RwLock<HashMap<K, V>>>,
    compute: Arc<dyn Fn(K) -> V + Send + Sync>,
}

impl<K, V> Cache<K, V>
where
    K: Eq + std::hash::Hash + Clone,
    V: Clone,
{
    pub async fn get_or_insert(&self, key: K) -> V {
        let read = self.store.read().await;
        if let Some(val) = read.get(&key) {
            return val.clone();
        }
        drop(read);

        let val = (self.compute)(key.clone());
        let mut write = self.store.write().await;
        write.insert(key, val.clone());
        val
    }
}

What Great Looked Like

A corrected implementation using either a per-key lock (DashMap or HashMap<K, Arc<Mutex<()>>>), or a single-flight pattern with tokio::sync::OnceCell per key. The fix should handle the TOCTOU gap between the read lock drop and write lock acquisition, include a double-check after acquiring the write lock, and explain why the chosen approach avoids both deadlocks and duplicate computations.

Results

ModelCorrectnessSpeedClarityTotal
Gemini 3.1 Pro Preview4.02.72.79.4
GPT-5.4 XHigh3.82.52.89.1
Llama 4 Scout3.72.72.48.8
Grok 4.20 Beta 03093.62.62.38.5

Why Gemini 3.1 Pro Preview Won

Gemini 3.1 Pro Preview nailed the canonical fix: swap RwLock<HashMap> for DashMap, use entry().or_insert_with() for atomic single-flight, and add a double-check pattern as fallback. It also caught a subtle issue — the original code’s compute closure is synchronous, but in practice these computations are usually async. It provided both a sync and async variant, with clear comments on why DashMap’s sharded locking avoids the lock contention that a naive RwLock upgrade would cause. GPT-5.4 produced equally correct code but took 40% longer and over-explained the lock theory. Llama 4 Scout gave a clean solution using OnceCell per key — a valid alternative — but its test coverage was thinner. Grok 4.20 proposed a busy-wait polling pattern as one option, which technically works but is a code review red flag.


Task 2: Reasoning — Multi-Region Database Failover Strategy

Prompt: “You operate a globally distributed PostgreSQL setup: primary in us-east-1, async replicas in eu-west-1 and ap-southeast-1. Replication lag averages 200ms but spikes to 5s during peak hours. Your SLA requires <1s recovery time objective for writes and zero data loss for committed transactions. The business just approved a $50K/month budget increase. Design a failover strategy that meets the SLA. Address: (1) how to handle replication lag during failover, (2) how to prevent split-brain, (3) what the $50K buys, (4) the runbook for a us-east-1 outage.”

What Great Looked Like

A concrete failover architecture with specific technologies, a clear explanation of synchronous vs. asynchronous replication trade-offs, split-brain prevention via fencing/lease mechanisms, a budget breakdown showing where the $50K goes, and a step-by-step runbook that an on-call engineer could actually follow at 3 AM.

Results

ModelCorrectnessSpeedClarityTotal
GPT-5.4 XHigh4.02.63.09.6
Gemini 3.1 Pro Preview3.82.72.79.2
Llama 4 Scout3.62.82.69.0
Grok 4.20 Beta 03093.52.62.68.7

Why GPT-5.4 XHigh Won

Reasoning at this depth is XHigh’s bread and butter. It proposed a three-layer architecture: synchronous streaming replication to one nearby replica (us-east-2) using Patroni for failover orchestration, async replicas for read scaling, and etcd for distributed consensus to prevent split-brain. The budget allocation was precise — $18K for a dedicated us-east-2 synchronous replica (r6g.4xlarge), $15K for Patroni/etcd infrastructure, $12K for monitoring and Runbooks-as-Code tooling, $5K buffer. The runbook was a numbered checklist with exact commands, decision trees, and escalation paths. Where it really pulled ahead: it identified that the 5s replication lag spike means you cannot meet the zero-data-loss SLA with async replication alone — you need at least one synchronous standby. Gemini 3.1 Pro Preview reached the same conclusion but its budget breakdown was hand-wavy. Llama 4 Scout produced a surprisingly thorough runbook but missed the sync replication requirement initially. Grok 4.20’s answer was solid on the architecture but lacked specificity on the fencing mechanism.


Task 3: Tool-Use — Competitive Analysis Pipeline

Prompt: “Research the top 3 serverless GPU providers (not AWS/GCP/Azure — indie/edge providers). For each: find their current per-GPU-hour pricing for an NVIDIA A100, check their actual region availability (not marketing pages — status pages or docs), and find one real user review from Hacker News or Reddit in the last 6 months. Present everything in a comparison table with links.”

What Great Looked Like

The model identifies relevant providers (RunPod, Lambda Labs, Vultr, etc.), autonomously fetches pricing pages, cross-references region availability against status pages, finds and links actual community discussions, and synthesizes a comparison table — all without confusing marketing claims with verified data.

Results

ModelCorrectnessSpeedClarityTotal
Grok 4.20 Beta 03093.72.92.79.3
Gemini 3.1 Pro Preview3.62.72.58.8
Llama 4 Scout3.52.72.48.6
GPT-5.4 XHigh3.62.22.78.5

Why Grok 4.20 Beta 0309 Won

Tool-use is where Grok’s real-time web access and speed shine. It identified RunPod, Lambda Labs, and Vultr as the top three within seconds, fired off parallel fetches for pricing pages, status pages, and Reddit/HN searches, and assembled the comparison table in a single pass. The A100 pricing was accurate (RunPod $1.64/hr community, Lambda $1.29/hr, Vultr $2.06/hr), region availability was pulled from actual docs — not the marketing “Coming Soon” sections — and each provider had a linked HN thread with real user commentary. Gemini 3.1 Pro Preview was close on correctness but took a more cautious sequential approach. Llama 4 Scout confused Vultr’s H100 pricing with A100 pricing initially. GPT-5.4 XHigh, despite thorough results, was simply too slow — it re-verified every source, adding 30+ seconds to the pipeline without meaningful accuracy gains.


Bottom Line

Another day, another reminder that model selection is task-dependent. Gemini 3.1 Pro Preview is the new overall leader — its combination of top-tier intelligence and 109 tokens/s output speed makes it the most well-rounded model available right now. GPT-5.4 XHigh remains the undisputed reasoning champion, and if your use case involves complex multi-step analysis where latency doesn’t matter, nothing else comes close. Grok 4.20 Beta 0309 owns the tool-use category with its real-time web access and aggressive parallelization. And Llama 4 Scout continues to close the gap for open-weight, posting an overall score that would have beaten proprietary models from just six months ago. The landscape keeps compressing — good news for everyone who isn’t OpenAI.