Knowledge Benchmark

What is MMLU?

MMLU (Massive Multitask Language Understanding) is one of the most widely-used benchmarks for evaluating AI models. It tests knowledge and reasoning across 57 diverse academic subjects, from elementary math to professional law.

What Does MMLU Measure?

MMLU tests whether AI models have broad knowledge and can apply reasoning across many domains. Think of it as a comprehensive standardized test covering subjects from high school through graduate-level expertise.

The benchmark consists of nearly 16,000 multiple-choice questions across 57 subjects, organized into four categories:

  • STEM: Physics, chemistry, biology, computer science, mathematics
  • Humanities: History, philosophy, world religions, moral scenarios
  • Social Sciences: Economics, psychology, sociology, political science
  • Other: Professional fields like law, business, and medicine

Key Insight

MMLU doesn't just test memorization. Many questions require multi-step reasoning, applying concepts to new situations, or combining knowledge from different areas. A model that simply memorized facts would struggle to score well.

Sample Subjects in MMLU

Abstract Algebra
Anatomy
Business Ethics
College Chemistry
Computer Security
Econometrics
Electrical Engineering
Formal Logic
Global Facts
High School Physics
International Law
Jurisprudence
Machine Learning
Medical Genetics
Philosophy
Professional Accounting

Plus 41 more subjects covering the full range of academic knowledge.
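
If you want to see how the benchmark is organized in practice, the sketch below loads one MMLU subject with the Hugging Face datasets library and prints a question with its options. It assumes the cais/mmlu dataset on the Hugging Face Hub and its question, choices, and answer fields; adjust the identifiers for whichever distribution of the data you actually use.

```python
# Minimal sketch: inspect MMLU questions for one subject.
# Assumes the "cais/mmlu" dataset on the Hugging Face Hub, whose rows
# expose `question` (str), `choices` (list of 4 str), and `answer` (int 0-3).
from datasets import load_dataset

subject = "college_chemistry"  # one of the 57 subject configs
mmlu = load_dataset("cais/mmlu", subject, split="test")

example = mmlu[0]
print(example["question"])
for letter, choice in zip("ABCD", example["choices"]):
    print(f"  {letter}. {choice}")
print("Correct answer:", "ABCD"[example["answer"]])
```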

How MMLU Works

The benchmark was introduced in 2020 by Dan Hendrycks and collaborators at UC Berkeley and other institutions, and published at ICLR 2021. Here's the methodology; a minimal scoring sketch follows the list:

  1. Question Format: Each question is multiple-choice with 4 possible answers. Models must select the correct option.
  2. Difficulty Levels: Questions range from elementary (e.g., "Elementary Mathematics") to professional (e.g., "Professional Law").
  3. Scoring: Models are scored on accuracy—what percentage of questions they answer correctly, averaged across all subjects.
  4. Evaluation: Models typically answer questions using few-shot prompting (seeing a few examples before answering).
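
As a rough illustration of the scoring rule described above, here is a minimal Python sketch that computes per-subject accuracy and then a simple average across subjects. The ask_model function is a hypothetical stand-in for whatever call returns your model's chosen option index; it is not part of any specific library.

```python
# Minimal sketch of MMLU-style scoring: accuracy per subject,
# then a simple average across subjects, as described above.
# `ask_model` is a hypothetical stand-in for whatever API call
# returns the model's chosen option index (0-3) for a question.
from collections import defaultdict

def score_mmlu(questions, ask_model):
    """questions: iterable of dicts with 'subject', 'question',
    'choices' (4 strings), and 'answer' (correct index 0-3)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for q in questions:
        prediction = ask_model(q["question"], q["choices"])
        total[q["subject"]] += 1
        if prediction == q["answer"]:
            correct[q["subject"]] += 1
    per_subject = {s: correct[s] / total[s] for s in total}
    overall = sum(per_subject.values()) / len(per_subject)
    return overall, per_subject
```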

5-Shot Evaluation

Most MMLU evaluations use "5-shot" prompting: the model sees 5 example questions with answers before answering each test question. This gives the model context about the format and difficulty level without providing specific knowledge about the test content.
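
The exact prompt template varies between evaluation harnesses, but a 5-shot MMLU prompt is typically assembled along these lines: a subject header, five worked dev-set examples with their answers, then the test question left unanswered. The sketch below is one possible construction, not the canonical one.

```python
# Minimal sketch of a 5-shot MMLU prompt. The exact template varies
# between evaluation harnesses; this follows the common pattern of
# showing five worked dev-set examples, then the test question.
LETTERS = "ABCD"

def format_question(q, include_answer):
    lines = [q["question"]]
    lines += [f"{LETTERS[i]}. {c}" for i, c in enumerate(q["choices"])]
    lines.append("Answer:" + (f" {LETTERS[q['answer']]}" if include_answer else ""))
    return "\n".join(lines)

def build_five_shot_prompt(dev_examples, test_question, subject):
    header = f"The following are multiple choice questions (with answers) about {subject}.\n\n"
    shots = "\n\n".join(format_question(q, include_answer=True) for q in dev_examples[:5])
    return header + shots + "\n\n" + format_question(test_question, include_answer=False)
```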

Current MMLU Leaderboard (2026)

Here are representative scores from leading models. Note that MMLU scores have become compressed at the top, with many models scoring above 85%. Check our daily scorecards for current standings.

Rank  Model               MMLU Score  Notes
1     Claude Opus 4.6     ~92%        Strong across all categories
2     Gemini 2.5 Pro      ~91%        Excellent at STEM subjects
3     5.3-Codex-Spark     ~90%        Balanced performance
4     GLM-5               ~87%        Competitive knowledge base
5     Kimi K2.5           ~85%        Strong value for cost
-     Human Expert (avg)  ~89.8%      Reference point

Percentages are approximate. Some models may have benefited from test data appearing in training corpora.

Limitations and Criticisms

MMLU has been influential, but it has significant limitations:

  • Data contamination: Many models may have seen MMLU questions during training, inflating scores. Some estimates suggest contamination affects 10-20% of questions.
  • Multiple-choice format: Real-world tasks rarely involve selecting from 4 options. This format doesn't test generation ability.
  • English-only: MMLU tests knowledge in English only and may not reflect capability in other languages.
  • Score compression: Top models now cluster within a few percentage points, making differentiation harder.
  • Static knowledge: The benchmark doesn't test current events or rapidly-changing fields.

Caution

A high MMLU score doesn't guarantee a model will perform well on your tasks. A model with 90% MMLU might struggle with creative writing, code debugging, or following complex instructions. Always test on your actual workload.

MMLU-Pro and Alternatives

To address some limitations, researchers have created harder variants:

  • MMLU-Pro: More challenging questions, 10 options instead of 4, requiring deeper reasoning.
  • GPQA: Graduate-level Google-proof questions that require expert-level reasoning.
  • MMLU-Redux: A cleaned version with errors and ambiguities removed from the original.

When to Use MMLU for Model Selection

MMLU is most useful when:

  • You need a general-purpose model for diverse knowledge tasks
  • You're comparing models from different providers on a common metric
  • You want to evaluate new or less-known models
  • Your use case involves factual Q&A or knowledge retrieval

It's less useful when:

  • You need coding ability (use SWE-bench instead)
  • You care about conversational quality (use Chatbot Arena)
  • You're comparing among top-tier models (scores are too similar)
  • You need domain-specific expertise not covered by the 57 subjects

For comprehensive evaluation, combine MMLU with other benchmarks and our daily operational benchmarks that test real-world task performance.

Frequently Asked Questions

What does MMLU stand for?

MMLU stands for Massive Multitask Language Understanding. The name reflects its design: a massive dataset covering many different subjects, testing whether language models can understand and reason across diverse domains.

What is a good MMLU score?

Context matters. Human expert performance varies by subject but averages around 89.8% on MMLU. The best AI models now exceed 90% on the benchmark, though this may partly reflect test data appearing in training data. A score above 80% is strong; above 85% is competitive among leading models.

Is MMLU still relevant in 2026?

Yes, but with caveats. Many top models now score similarly high, reducing MMLU's ability to distinguish between them. However, it remains useful for evaluating new models, tracking progress over time, and assessing broad knowledge. Most evaluators now use MMLU alongside other benchmarks like MMLU-Pro (harder version) and GPQA.

Why do some models score higher on MMLU than others?

Several factors influence MMLU scores: training data (some models may have seen test questions), model size (larger models tend to score higher), and training methodology (models trained on academic text often perform better). This is why comparing raw scores can be misleading.

How is MMLU different from other benchmarks?

Unlike SWE-bench (coding) or Chatbot Arena (human preference), MMLU tests knowledge retention and reasoning in a multiple-choice format. It doesn't require models to write code or engage in conversation—it measures whether models "know" facts and can apply reasoning to select correct answers.

See Real-World Model Performance

MMLU tests knowledge, but what about actual tasks? Our daily scorecards evaluate models on coding, reasoning, and tool use.
