What Is MMLU-Pro? The Advanced AI Benchmark Explained
MMLU-Pro is an enhanced version of the classic MMLU benchmark, designed to better discriminate between advanced AI models. It features over 12,000 graduate-level questions across 14 subject areas, with 10 answer choices instead of 4, making it significantly harder and more resistant to guessing.
Published at NeurIPS 2024, MMLU-Pro was created because frontier models had begun to saturate the original MMLU, making it difficult to distinguish between them.
What MMLU-Pro Tests
MMLU-Pro evaluates multi-task language understanding across diverse academic and professional domains:
- 14 subject areas: math, physics, biology, chemistry, computer science, engineering, philosophy, law, health, psychology, business, economics, history, and other
- 12,000+ questions at graduate/professional level
- 10 answer choices per question (vs. 4 in original MMLU)
- Reasoning-focused: requires deeper analysis, not just knowledge recall
The benchmark emphasizes reasoning over recall. Many questions require multi-step thinking, and the expanded answer choices make guessing much harder.
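You can inspect these numbers yourself, since the dataset is public on Hugging Face. Here is a minimal sketch using the `datasets` library; the field names (`category`, `options`) follow the TIGER-Lab/MMLU-Pro dataset card and may change:

```python
from collections import Counter

from datasets import load_dataset

# Load the MMLU-Pro test split from Hugging Face.
test = load_dataset("TIGER-Lab/MMLU-Pro", split="test")

print(len(test))                                     # ~12,000 questions
print(Counter(row["category"] for row in test))      # the 14 subject areas
print(Counter(len(row["options"]) for row in test))  # choices per question (mostly 10)
```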
How MMLU-Pro Differs from MMLU
The original MMLU (Massive Multitask Language Understanding) was introduced in 2020 and became a standard benchmark. But by 2024, frontier models had plateaued on it:
“GPT-4 achieved 86.4% in March 2023… Most recent frontier models like GPT-4-Turbo, Gemini-1.5-Pro, Claude, and LLaMA-3-400B all settle at an accuracy between 86% - 87%.” — MMLU-Pro Paper, NeurIPS 2024
Key Differences
| Aspect | MMLU | MMLU-Pro |
|---|---|---|
| Questions | ~16,000 | 12,032 |
| Answer choices | 4 | 10 |
| Difficulty | Undergraduate | Graduate-level |
| Focus | Knowledge recall | Reasoning + knowledge |
| Guess probability | 25% | 10% |
| Prompt sensitivity | 4-5% variance | 2% variance |
Accuracy drops by 16-33 percentage points relative to the original MMLU: top models that score 86-88% on MMLU fall to 60-75% on MMLU-Pro.
Why the Expanded Choices Matter
With 4 choices, random guessing yields 25% accuracy. A model only needs to “know” the answer partially to improve significantly over random.
With 10 choices:
- Random guessing drops to 10%
- Models must truly understand the problem
- Distractors are more sophisticated and plausible
This makes MMLU-Pro more discriminative—it better separates truly capable models from those that are just good at test-taking.
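To put numbers on this: if a model can confidently rule out k distractors and guesses uniformly among the remaining choices, its expected accuracy is 1/(n − k). A quick sketch of that arithmetic:

```python
def guess_accuracy(n_choices: int, eliminated: int) -> float:
    """Expected accuracy when guessing uniformly after ruling out some distractors."""
    return 1 / (n_choices - eliminated)

# A model that rules out two implausible distractors:
print(f"{guess_accuracy(4, 2):.1%}")   # 50.0% on original MMLU
print(f"{guess_accuracy(10, 2):.1%}")  # 12.5% on MMLU-Pro
```

Partial knowledge that lifts a model to 50% on MMLU barely moves the needle on MMLU-Pro.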
Chain of Thought Matters
Unlike the original MMLU, where direct answering often worked well, MMLU-Pro rewards Chain of Thought (CoT) reasoning:
“Models utilizing Chain of Thought (CoT) reasoning achieved better performance on MMLU-Pro compared to direct answering, which is in stark contrast to the findings on the original MMLU.” — MMLU-Pro Paper
This indicates MMLU-Pro contains more complex reasoning questions where working through the problem step-by-step helps.
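To make the two evaluation styles concrete, here is a sketch of direct vs. CoT prompting. The templates and the "answer is (X)" extraction regex below are illustrative, loosely modeled on the official repo's CoT setup rather than copied from it:

```python
import re

def format_options(choices: list[str]) -> str:
    # Letter the options A through J (up to 10 choices).
    return "\n".join(f"{chr(65 + i)}. {c}" for i, c in enumerate(choices))

def direct_prompt(question: str, choices: list[str]) -> str:
    # Ask for a letter immediately -- often good enough on original MMLU.
    return f"{question}\n{format_options(choices)}\nAnswer:"

def cot_prompt(question: str, choices: list[str]) -> str:
    # Ask the model to reason first -- the style MMLU-Pro rewards.
    return (f"{question}\n{format_options(choices)}\n"
            'Think step by step, then end with "The answer is (X)".')

def extract_choice(completion: str) -> str | None:
    # Pull the final lettered answer out of a CoT completion.
    match = re.search(r"answer is \(?([A-J])\)?", completion)
    return match.group(1) if match else None
```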
Prompt Stability
MMLU-Pro is more stable under varying prompts:
- Original MMLU: 4-5% score variance across 24 different prompt styles
- MMLU-Pro: Only 2% variance across the same prompts
This reduced sensitivity makes scores more reliable and less dependent on prompt engineering.
Official Scores and Leaderboards
Primary Sources
- Paper: MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark (NeurIPS 2024, Spotlight)
- GitHub: TIGER-AI-Lab/MMLU-Pro
- Dataset: Hugging Face - TIGER-Lab/MMLU-Pro
Third-Party Leaderboards
- Artificial Analysis MMLU-Pro Leaderboard — Independently benchmarked, includes cost/performance analysis
Approximate Score Ranges (2026)
| Model Tier | MMLU-Pro Score |
|---|---|
| Top frontier models | 85-90% |
| Mid-tier models | 70-80% |
| Open-source leaders | 65-75% |
According to Artificial Analysis, top performers include Gemini 3 Pro Preview and Claude Opus 4.5 (Reasoning) at approximately 89-90%.
Note: Scores vary by evaluation methodology. Always check the source leaderboard for current, specific numbers.
Example Questions
MMLU-Pro questions span many domains and require genuine reasoning:
Math Example:
The symmetric group Sₙ has n! elements, hence it is not true that S₁₀ has 10 elements. Find the characteristic of the ring 2Z.
Options: 0, 30, 3, 10, 12, 50, 2, 100, 20, 5
Health Example:
Which of the following is the body cavity that contains the pituitary gland?
Options: Ventral, Dorsal, Buccal, Thoracic, Pericardial, Abdominal, Spinal, Pelvic, Pleural, Cranial
Business Example:
In contrast to _______, _______ aim to reward favourable behaviour by companies…
(10 fill-in-the-blank options with various combinations)
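To see what a model actually receives, here is the health example rendered as a lettered CoT prompt (a self-contained sketch in the same style as the snippet in the Chain of Thought section):

```python
question = "Which of the following is the body cavity that contains the pituitary gland?"
choices = ["Ventral", "Dorsal", "Buccal", "Thoracic", "Pericardial",
           "Abdominal", "Spinal", "Pelvic", "Pleural", "Cranial"]

# Letter the ten options A through J and append a CoT instruction.
options = "\n".join(f"{chr(65 + i)}. {c}" for i, c in enumerate(choices))
print(f'{question}\n{options}\nThink step by step, then end with "The answer is (X)".')
# Prints the question, options A. Ventral ... J. Cranial, and the instruction.
# Correct answer: J (the pituitary sits in the sella turcica of the cranial cavity).
```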
How to Interpret MMLU-Pro Scores
| Score Range | Interpretation |
|---|---|
| <50% | Struggles with graduate-level reasoning |
| 50-65% | Competent on most subjects, gaps in harder areas |
| 65-80% | Strong academic reasoning, competitive |
| 80-90% | Elite performance on complex multi-domain tasks |
| >90% | Not yet achieved; would indicate exceptional capability |
Why MMLU-Pro Matters
For Model Developers
MMLU-Pro provides better discrimination at the top of the capability distribution. When models cluster at 85-88% on original MMLU, MMLU-Pro spreads them across a wider range.
For Model Users
If you need models for:
- Academic/scientific work: MMLU-Pro is highly relevant
- Professional domains (law, medicine, engineering): Tests domain knowledge
- Complex reasoning tasks: The CoT correlation signals reasoning ability
For Benchmarking
MMLU-Pro is included in composite benchmarks like the Artificial Analysis Intelligence Index, which aggregates multiple challenging evaluations.
Limitations
- Multiple choice only: Doesn’t test open-ended generation
- English-focused: Primarily English questions
- Academic bias: Tests academic knowledge, not practical skills
- Static dataset: Like all benchmarks, may become contaminated over time
Primary Sources
- Paper: arXiv:2406.01574
- Authors: Yubo Wang, Xueguang Ma, Ge Zhang, et al. (University of Waterloo, University of Toronto)
- Venue: NeurIPS 2024, Datasets and Benchmarks Track (Spotlight)
- GitHub: TIGER-AI-Lab/MMLU-Pro
- Leaderboard: Artificial Analysis MMLU-Pro
The Bottom Line
MMLU-Pro is the more challenging successor to MMLU, designed for an era where frontier models had saturated the original benchmark. With 10 answer choices, graduate-level questions, and better prompt stability, it provides a more reliable signal of advanced reasoning capability.
For evaluating general-purpose intelligence and academic reasoning, MMLU-Pro is now the preferred benchmark over the original MMLU. Combine it with domain-specific benchmarks (SWE-Bench for coding, GPQA for science) for a complete picture.
Last updated: March 10, 2026