What Does MMLU Measure?
MMLU tests whether AI models have broad knowledge and can apply reasoning across many domains. Think of it as a comprehensive standardized test covering subjects from high school through graduate-level expertise.
The benchmark consists of nearly 16,000 multiple-choice questions across 57 subjects, organized into four categories:
- STEM: Physics, chemistry, biology, computer science, mathematics
- Humanities: History, philosophy, world religions, moral scenarios
- Social Sciences: Economics, psychology, sociology, political science
- Other: Professional fields like law, business, and medicine
MMLU doesn't just test memorization. Many questions require multi-step reasoning, applying concepts to new situations, or combining knowledge from different areas. A model that simply memorized facts would struggle to score well.
Sample Subjects in MMLU
Representative subjects include Abstract Algebra, Anatomy, Astronomy, College Computer Science, Conceptual Physics, Econometrics, Formal Logic, High School Mathematics, International Law, Machine Learning, Medical Genetics, Moral Scenarios, Philosophy, Professional Law, Virology, and World Religions, plus 41 more subjects covering the full range of academic knowledge.
How MMLU Works
The benchmark was introduced by Dan Hendrycks and colleagues at UC Berkeley and other institutions in a 2020 paper (published at ICLR 2021). Here's the methodology:
- Question Format: Each question is multiple-choice with 4 possible answers. Models must select the correct option.
- Difficulty Levels: Questions range from elementary (e.g., "Elementary Mathematics") to professional (e.g., "Professional Law").
- Scoring: Models are scored on accuracy, the percentage of questions they answer correctly, averaged across all subjects (see the sketch after this list).
- Evaluation: Models typically answer questions using few-shot prompting (seeing a few examples before answering).
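To make the scoring concrete, here is a minimal sketch of per-subject accuracy macro-averaged into a single MMLU score. The `results` structure and its field names are illustrative assumptions, not part of any official evaluation harness.

```python
from collections import defaultdict

def mmlu_score(results):
    """Compute per-subject accuracy and the macro-average across subjects.

    `results` is assumed (for illustration) to be a list of dicts with keys
    "subject", "predicted" (e.g. "B"), and "answer" (the gold option letter).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in results:
        total[r["subject"]] += 1
        if r["predicted"] == r["answer"]:
            correct[r["subject"]] += 1

    per_subject = {s: correct[s] / total[s] for s in total}
    macro_avg = sum(per_subject.values()) / len(per_subject)
    return per_subject, macro_avg

# Toy example with two subjects:
toy = [
    {"subject": "high_school_mathematics", "predicted": "A", "answer": "A"},
    {"subject": "high_school_mathematics", "predicted": "C", "answer": "B"},
    {"subject": "professional_law", "predicted": "D", "answer": "D"},
]
per_subject, score = mmlu_score(toy)
print(per_subject)  # {'high_school_mathematics': 0.5, 'professional_law': 1.0}
print(score)        # 0.75
```

Averaging per-subject accuracies (rather than pooling all questions) keeps small subjects from being drowned out by large ones.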
5-Shot Evaluation
Most MMLU evaluations use "5-shot" prompting: the model sees 5 example questions with answers before answering each test question. This gives the model context about the format and difficulty level without providing specific knowledge about the test content.
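As a rough illustration, the sketch below assembles a 5-shot prompt in the style used for MMLU evaluations. The exact header wording and the `dev_examples`/`test_q` data structures are assumptions made for this example.

```python
CHOICES = ["A", "B", "C", "D"]

def format_question(q, include_answer=True):
    """Render one question with its four options, optionally with the answer."""
    lines = [q["question"]]
    lines += [f"{letter}. {opt}" for letter, opt in zip(CHOICES, q["options"])]
    lines.append("Answer:" + (f" {q['answer']}" if include_answer else ""))
    return "\n".join(lines)

def build_5shot_prompt(subject, dev_examples, test_q):
    """Prepend 5 solved example questions to the unanswered test question."""
    header = f"The following are multiple choice questions (with answers) about {subject}.\n\n"
    shots = "\n\n".join(format_question(ex) for ex in dev_examples[:5])
    return header + shots + "\n\n" + format_question(test_q, include_answer=False)
```

The model's response after the final "Answer:" line is then compared against the gold option letter to score the question.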
Current MMLU Leaderboard (2026)
Here are representative scores from leading models. Note that MMLU scores have become compressed at the top, with many models scoring above 85%. Check our daily scorecards for current standings.
| Rank | Model | MMLU Score | Notes |
|---|---|---|---|
| 1 | Claude Opus 4.6 | ~92% | Strong across all categories |
| 2 | Gemini 2.5 Pro | ~91% | Excellent at STEM subjects |
| 3 | 5.3-Codex-Spark | ~90% | Balanced performance |
| 4 | GLM-5 | ~87% | Competitive knowledge base |
| 5 | Kimi K2.5 | ~85% | Strong value for cost |
| - | Human Expert (avg) | ~89.8% | Reference point |
Percentages are approximate. Some models may have benefited from test data appearing in training corpora.
Limitations and Criticisms
MMLU has been influential, but it has significant limitations:
- Data contamination: Many models may have seen MMLU questions during training, inflating scores. Some estimates suggest contamination affects 10-20% of questions (a simple overlap check is sketched after this list).
- Multiple-choice format: Real-world tasks rarely involve selecting from 4 options. This format doesn't test generation ability.
- English-only: MMLU tests knowledge in English only and may not reflect capability in other languages.
- Score compression: Top models now cluster within a few percentage points, making differentiation harder.
- Static knowledge: The benchmark doesn't test current events or rapidly-changing fields.
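One simple way to probe for contamination is to look for long n-gram overlaps between benchmark questions and a training corpus. The sketch below, with a hypothetical `training_corpus` iterable, shows the idea; real contamination audits are considerably more sophisticated.

```python
def ngrams(text, n=13):
    """Return the set of lowercase word n-grams in a piece of text."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def flag_contaminated(questions, training_corpus, n=13):
    """Flag benchmark questions whose n-grams also appear in training documents.

    `questions` is a list of question strings; `training_corpus` is any
    iterable of document strings (hypothetical here). A 13-gram match is a
    common, if crude, heuristic for verbatim overlap.
    """
    corpus_grams = set()
    for doc in training_corpus:
        corpus_grams |= ngrams(doc, n)
    return [q for q in questions if ngrams(q, n) & corpus_grams]
```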
A high MMLU score doesn't guarantee a model will perform well on your tasks. A model with 90% MMLU might struggle with creative writing, code debugging, or following complex instructions. Always test on your actual workload.
MMLU-Pro and Alternatives
To address some limitations, researchers have created harder variants:
- MMLU-Pro: More challenging questions, 10 options instead of 4, requiring deeper reasoning.
- GPQA: Graduate-level Google-proof questions that require expert-level reasoning.
- MMLU-Redux: A manually re-annotated subset of the original with erroneous or ambiguous questions corrected or removed.
When to Use MMLU for Model Selection
MMLU is most useful when:
- You need a general-purpose model for diverse knowledge tasks
- You're comparing models from different providers on a common metric
- You want to evaluate new or less-known models
- Your use case involves factual Q&A or knowledge retrieval
It's less useful when:
- You need coding ability (use SWE-bench instead)
- You care about conversational quality (use Chatbot Arena)
- You're comparing among top-tier models (scores are too similar)
- You need domain-specific expertise not covered by the 57 subjects
For comprehensive evaluation, combine MMLU with other benchmarks and our daily operational benchmarks that test real-world task performance.