What Is MMLU-Pro? The Advanced AI Benchmark Explained
MMLU-Pro is an enhanced version of the classic MMLU benchmark, designed to better discriminate between advanced AI models. It features over 12,000 graduate-level questions across 14 subject areas, with 10 answer choices instead of 4, making it significantly harder and more resistant to guessing.
Published at NeurIPS 2024, MMLU-Pro was created because frontier models had begun to saturate the original MMLU, making it difficult to distinguish between them.
What MMLU-Pro Tests
MMLU-Pro evaluates multi-task language understanding across diverse academic and professional domains:
- 14 subject areas: math, physics, biology, chemistry, computer science, engineering, philosophy, law, health, psychology, business, economics, history, and other
- 12,000+ questions at graduate/professional level
- 10 answer choices per question (vs. 4 in original MMLU)
- Reasoning-focused: requires deeper analysis, not just knowledge recall
The benchmark emphasizes reasoning over recall. Many questions require multi-step thinking, and the expanded answer choices make guessing much harder.
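You can inspect these numbers yourself, since the dataset is public on Hugging Face. Here is a minimal sketch using the `datasets` library; the field names (`category`, `options`) follow the TIGER-Lab/MMLU-Pro dataset card and may change:

```python
from collections import Counter

from datasets import load_dataset

# Load the MMLU-Pro test split from Hugging Face.
test = load_dataset("TIGER-Lab/MMLU-Pro", split="test")

print(len(test))                                     # ~12,000 questions
print(Counter(row["category"] for row in test))      # the 14 subject areas
print(Counter(len(row["options"]) for row in test))  # choices per question (mostly 10)
```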
How MMLU-Pro Differs from MMLU
The original MMLU (Massive Multitask Language Understanding) was introduced in 2020 and became a standard benchmark. But by 2024, frontier models had plateaued on it:
“GPT-4 achieved 86.4% in March 2023… Most recent frontier models like GPT-4-Turbo, Gemini-1.5-Pro, Claude, and LLaMA-3-400B all settle at an accuracy between 86% - 87%.” — MMLU-Pro Paper, NeurIPS 2024
Key Differences
| Aspect | MMLU | MMLU-Pro |
|---|---|---|
| Questions | ~16,000 | 12,032 |
| Answer choices | 4 | 10 |
| Difficulty | Undergraduate | Graduate-level |
| Focus | Knowledge recall | Reasoning + knowledge |
| Guess probability | 25% | 10% |
| Prompt sensitivity | 4-5% variance | 2% variance |
Accuracy drops by 16-33 percentage points relative to the original MMLU: top models that score 86-88% on MMLU fall to 60-75% on MMLU-Pro.
Why the Expanded Choices Matter
With 4 choices, random guessing yields 25% accuracy. A model only needs to “know” the answer partially to improve significantly over random.
With 10 choices:
- Random guessing drops to 10%
- Models must truly understand the problem
- Distractors are more sophisticated and plausible
This makes MMLU-Pro more discriminative—it better separates truly capable models from those that are just good at test-taking.
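To put numbers on this: if a model can confidently rule out k distractors and guesses uniformly among the remaining choices, its expected accuracy is 1/(n − k). A quick sketch of that arithmetic:

```python
def guess_accuracy(n_choices: int, eliminated: int) -> float:
    """Expected accuracy when guessing uniformly after ruling out some distractors."""
    return 1 / (n_choices - eliminated)

# A model that rules out two implausible distractors:
print(f"{guess_accuracy(4, 2):.1%}")   # 50.0% on original MMLU
print(f"{guess_accuracy(10, 2):.1%}")  # 12.5% on MMLU-Pro
```

Partial knowledge that lifts a model to 50% on MMLU barely moves the needle on MMLU-Pro.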
Chain of Thought Matters
Unlike the original MMLU, where direct answering often worked well, MMLU-Pro rewards Chain of Thought (CoT) reasoning:
“Models utilizing Chain of Thought (CoT) reasoning achieved better performance on MMLU-Pro compared to direct answering, which is in stark contrast to the findings on the original MMLU.” — MMLU-Pro Paper
This indicates MMLU-Pro contains more complex reasoning questions where working through the problem step-by-step helps.
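To make the two evaluation styles concrete, here is a sketch of direct vs. CoT prompting. The templates and the "answer is (X)" extraction regex below are illustrative, loosely modeled on the official repo's CoT setup rather than copied from it:

```python
import re

def format_options(choices: list[str]) -> str:
    # Letter the options A through J (up to 10 choices).
    return "\n".join(f"{chr(65 + i)}. {c}" for i, c in enumerate(choices))

def direct_prompt(question: str, choices: list[str]) -> str:
    # Ask for a letter immediately -- often good enough on original MMLU.
    return f"{question}\n{format_options(choices)}\nAnswer:"

def cot_prompt(question: str, choices: list[str]) -> str:
    # Ask the model to reason first -- the style MMLU-Pro rewards.
    return (f"{question}\n{format_options(choices)}\n"
            'Think step by step, then end with "The answer is (X)".')

def extract_choice(completion: str) -> str | None:
    # Pull the final lettered answer out of a CoT completion.
    match = re.search(r"answer is \(?([A-J])\)?", completion)
    return match.group(1) if match else None
```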
Prompt Stability
MMLU-Pro is more stable under varying prompts:
- Original MMLU: 4-5% score variance across 24 different prompt styles
- MMLU-Pro: Only 2% variance across the same prompts
This reduced sensitivity makes scores more reliable and less dependent on prompt engineering.
Official Scores and Leaderboards
Primary Sources
- Paper: MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark (NeurIPS 2024, Spotlight)
- GitHub: TIGER-AI-Lab/MMLU-Pro
- Dataset: Hugging Face - TIGER-Lab/MMLU-Pro
Third-Party Leaderboards
- Artificial Analysis MMLU-Pro Leaderboard — Independently benchmarked, includes cost/performance analysis
Approximate Score Ranges (2026)
| Model Tier | MMLU-Pro Score |
|---|---|
| Top frontier models | 85-90% |
| Mid-tier models | 70-80% |
| Open-source leaders | 65-75% |
According to Artificial Analysis, top performers include Gemini 3 Pro Preview and Claude Opus 4.5 (Reasoning) at approximately 89-90%.
Note: Scores vary by evaluation methodology. Always check the source leaderboard for current, specific numbers.
Example Questions
MMLU-Pro questions span many domains and require genuine reasoning:
Math Example:
The symmetric group Sₙ has n! elements, hence it is not true that S₁₀ has 10 elements. Find the characteristic of the ring 2Z.
Options: 0, 30, 3, 10, 12, 50, 2, 100, 20, 5
Health Example:
Which of the following is the body cavity that contains the pituitary gland?
Options: Ventral, Dorsal, Buccal, Thoracic, Pericardial, Abdominal, Spinal, Pelvic, Pleural, Cranial
Business Example:
In contrast to _______, _______ aim to reward favourable behaviour by companies…
(10 fill-in-the-blank options with various combinations)
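To see what a model actually receives, here is the health example rendered as a lettered CoT prompt (a self-contained sketch in the same style as the snippet in the Chain of Thought section):

```python
question = "Which of the following is the body cavity that contains the pituitary gland?"
choices = ["Ventral", "Dorsal", "Buccal", "Thoracic", "Pericardial",
           "Abdominal", "Spinal", "Pelvic", "Pleural", "Cranial"]

# Letter the ten options A through J and append a CoT instruction.
options = "\n".join(f"{chr(65 + i)}. {c}" for i, c in enumerate(choices))
print(f'{question}\n{options}\nThink step by step, then end with "The answer is (X)".')
# Prints the question, options A. Ventral ... J. Cranial, and the instruction.
# Correct answer: J (the pituitary sits in the sella turcica of the cranial cavity).
```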
How to Interpret MMLU-Pro Scores
| Score Range | Interpretation |
|---|---|
| <50% | Struggles with graduate-level reasoning |
| 50-65% | Competent on most subjects, gaps in harder areas |
| 65-80% | Strong academic reasoning, competitive |
| 80-90% | Elite performance on complex multi-domain tasks |
| >90% | Not yet achieved; would indicate exceptional capability |
Why MMLU-Pro Matters
For Model Developers
MMLU-Pro provides better discrimination at the top of the capability distribution. When models cluster at 85-88% on original MMLU, MMLU-Pro spreads them across a wider range.
For Model Users
If you need models for:
- Academic/scientific work: MMLU-Pro is highly relevant
- Professional domains (law, medicine, engineering): Tests domain knowledge
- Complex reasoning tasks: The CoT correlation signals reasoning ability
For Benchmarking
MMLU-Pro is included in composite benchmarks like the Artificial Analysis Intelligence Index, which aggregates multiple challenging evaluations.
Limitations
- Multiple choice only: Doesn’t test open-ended generation
- English-focused: Primarily English questions
- Academic bias: Tests academic knowledge, not practical skills
- Static dataset: Like all benchmarks, may become contaminated over time
Primary Sources
- Paper: arXiv:2406.01574
- Authors: Yubo Wang, Xueguang Ma, Ge Zhang, et al. (University of Waterloo, University of Toronto)
- Venue: NeurIPS 2024, Datasets and Benchmarks Track (Spotlight)
- GitHub: TIGER-AI-Lab/MMLU-Pro
- Leaderboard: Artificial Analysis MMLU-Pro
The Bottom Line
MMLU-Pro is the more challenging successor to MMLU, designed for an era where frontier models had saturated the original benchmark. With 10 answer choices, graduate-level questions, and better prompt stability, it provides a more reliable signal of advanced reasoning capability.
For evaluating general-purpose intelligence and academic reasoning, MMLU-Pro is now the preferred benchmark over the original MMLU. Combine it with domain-specific benchmarks (SWE-Bench for coding, GPQA for science) for a complete picture.
Last updated: March 10, 2026