Chatbot Arena: Human Preference Benchmark

What is Chatbot Arena?

Chatbot Arena is a crowdsourced platform where humans vote on AI model responses in blind comparisons. These votes generate Elo ratings that rank models by actual user preference—the "taste test" of AI benchmarks.

What Does Chatbot Arena Measure?

Unlike benchmarks that test specific capabilities (coding, knowledge, math), Chatbot Arena measures overall human preference. When people use AI models in real conversations, which responses do they prefer?

This captures qualities that automated tests miss:

  • Helpfulness: Does the response actually solve the user's problem?
  • Tone: Is the model pleasant to interact with?
  • Clarity: Is the response easy to understand?
  • Creativity: Does the model produce interesting, engaging content?
  • Safety: Does the model avoid harmful or inappropriate content?
Key Insight

Chatbot Arena answers a simple but crucial question: "Which model would users rather talk to?" This is often more relevant to real-world deployment than scores on academic benchmarks.

How Chatbot Arena Works

Chatbot Arena is run by LMSYS (Large Model Systems Organization), a research group. Here's the methodology:

  1. Anonymous Battle: A user submits a prompt. Two randomly-selected models (labeled only as "Model A" and "Model B") each generate a response.
  2. Blind Vote: The user compares both responses without knowing which models produced them. They vote for the better response (or declare a tie).
  3. Identity Reveal: After voting, the model identities are revealed.
  4. Elo Update: The voting result updates each model's Elo rating.
Example Battle

Model A vs. Model B: the user votes for Model A, so Model A gains Elo points and Model B loses points.
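
In code, one battle-and-vote cycle might look like the minimal sketch below. It is illustrative only: the K-factor, the placeholder model names, and the helper functions are assumptions, not LMSYS's actual implementation.

    import random

    K = 32  # K-factor (assumed): how far a single vote can move a rating

    def expected_score(rating_a, rating_b):
        # Probability that A beats B under the Elo model.
        return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))

    def record_vote(ratings, model_a, model_b, score_a):
        # score_a is 1.0 if A wins, 0.0 if B wins, 0.5 for a tie.
        delta = K * (score_a - expected_score(ratings[model_a], ratings[model_b]))
        ratings[model_a] += delta
        ratings[model_b] -= delta

    # Every model starts at 1000 Elo (placeholder names).
    ratings = {"model-x": 1000.0, "model-y": 1000.0, "model-z": 1000.0}

    # One battle: two randomly chosen models answer the same prompt,
    # the user votes blind, then ratings update.
    model_a, model_b = random.sample(sorted(ratings), 2)
    record_vote(ratings, model_a, model_b, score_a=1.0)  # user preferred "Model A"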

Understanding Elo Ratings

Elo (developed for chess) converts head-to-head results into a single rating number:

  • Starting point: All models begin at 1000 Elo
  • Point exchanges: The winning model takes points from the losing model, so every vote is zero-sum
  • Expected outcomes: Beating a higher-rated model earns more points than beating a lower-rated one
  • Probability: A 100-point difference means the higher-rated model wins ~64% of the time
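
The ~64% figure in the last bullet falls straight out of the Elo expected-score formula, E_A = 1 / (1 + 10^((R_B - R_A) / 400)). A quick check in Python for a 100-point gap:

    # Expected win probability for the model rated 100 points higher:
    p = 1 / (1 + 10 ** (-100 / 400))
    print(round(p, 2))  # 0.64 -- the ~64% quoted above

By the same formula, a 200-point gap implies roughly a 76% win rate, and a 400-point gap roughly 91%.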

Elo Rating Scale for AI Models

  • 1400+: Top-tier
  • 1300-1400: Strong
  • 1200-1300: Competitive
  • 1100-1200: Developing

Current Chatbot Arena Leaderboard (2026)

Here are representative Elo ratings for leading models. Rankings shift as new models are released and more votes come in; check the official Arena or our daily scorecards for current standings.

Rank  Model            Elo Rating  95% CI
1     Claude Opus 4.6  ~1430       +5 / -5
2     Gemini 2.5 Pro   ~1410       +6 / -6
3     5.3-Codex-Spark  ~1395       +5 / -5
4     GLM-5            ~1360       +7 / -7
5     Kimi K2.5        ~1345       +8 / -8
6     MiniMax M2.5     ~1320       +9 / -9

Ratings are approximate and based on publicly available data. Confidence intervals (CI) indicate uncertainty—narrower is better.
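
Where do those intervals come from? One standard approach, and the one LMSYS has described for the Arena, is bootstrapping: resample the battle log with replacement many times, recompute ratings on each resample, and report the middle 95% of the results. Below is a minimal sketch assuming a simple sequential Elo fit with hypothetical helper names; the real leaderboard works from millions of votes and a Bradley-Terry-style model.

    import random

    def compute_elo(battles, k=4, base=1000.0):
        # battles: list of (model_a, model_b, score_a) tuples from the vote log.
        # k=4 is an assumption; a small K smooths out ordering effects.
        ratings = {}
        for a, b, score_a in battles:
            ra, rb = ratings.get(a, base), ratings.get(b, base)
            e_a = 1 / (1 + 10 ** ((rb - ra) / 400))
            ratings[a] = ra + k * (score_a - e_a)
            ratings[b] = rb - k * (score_a - e_a)
        return ratings

    def bootstrap_ci(battles, model, rounds=1000):
        # Resample battles with replacement; take the middle 95% of ratings.
        samples = sorted(
            compute_elo(random.choices(battles, k=len(battles))).get(model, 1000.0)
            for _ in range(rounds)
        )
        return samples[round(0.025 * rounds)], samples[round(0.975 * rounds)]

More votes shrink the resampled spread, which is why newer models carry wider intervals than established ones.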

Limitations and Criticisms

Chatbot Arena is widely respected, but has important limitations:

  • Voter bias: Users may prefer confident, verbose, or friendly responses even when they're wrong. Style can overshadow substance.
  • Population skew: Voters are mostly AI enthusiasts, researchers, and developers. They may not represent your actual users.
  • Prompt distribution: The types of questions asked may not match your use case. Coding questions, creative writing, and math each favor different models.
  • Gaming risk: Model developers could theoretically submit favorable prompts or votes, though LMSYS has safeguards.
  • New model volatility: Newer models have fewer votes, making their ratings less stable.
Caution

A model with higher Elo might lose to a lower-rated model on your specific tasks. Elo measures average preference across many users and prompts—your mileage may vary.

Category-Specific Leaderboards

Chatbot Arena now offers category-specific rankings, which are more useful than overall Elo for specific use cases:

  • Coding: Which model writes the best code?
  • Math: Which model handles mathematical reasoning best?
  • Creative writing: Which model produces the best stories and content?
  • Hard prompts: Which model handles complex, challenging questions?

When to Use Chatbot Arena for Model Selection

Chatbot Arena is most useful when:

  • You're building a general-purpose chatbot or assistant
  • User satisfaction and conversation quality are primary metrics
  • You want to compare models across a wide range of interactions
  • You care about subjective qualities like tone and creativity

It's less useful when:

  • You have specific, narrow tasks (use SWE-bench for coding)
  • You need objective correctness measures (use MMLU for knowledge)
  • Your users differ significantly from typical Arena voters
  • You need to evaluate cost, latency, or reliability

For a complete picture, combine Chatbot Arena Elo with our daily operational benchmarks that measure real task performance, cost, and speed.

Frequently Asked Questions

What is Chatbot Arena?

Chatbot Arena is a crowdsourced benchmark run by LMSYS (Large Model Systems Organization). Users chat anonymously with two AI models side-by-side and vote for which response they prefer. These votes are converted into Elo ratings, creating a leaderboard based on actual human preferences rather than automated tests.

What is Elo in Chatbot Arena?

Elo is a rating system originally developed for chess. In Chatbot Arena, each model starts at 1000 Elo. When users vote, the winning model gains points and the losing model loses points. The amount depends on the expected outcome—beating a higher-rated model earns more points. A 100-point difference means the higher-rated model is expected to win about 64% of matchups.

Is Chatbot Arena reliable?

Chatbot Arena is one of the most respected benchmarks because it reflects real human preferences rather than synthetic tests. However, it has limitations: voters may prefer confident-sounding but wrong answers, and the voting population may not represent your specific users. Use it as one signal among many.

How many votes does Chatbot Arena collect?

As of 2026, Chatbot Arena has collected over 2 million human votes. This large sample size makes the Elo ratings statistically meaningful. Models need hundreds of votes before their ratings stabilize; newer models may have more uncertain rankings.

What is the difference between Chatbot Arena and other benchmarks?

Unlike MMLU (knowledge testing) or SWE-bench (coding tasks), Chatbot Arena measures how much humans prefer one model's responses over another's. It captures qualities hard to automate: helpfulness, tone, creativity, and overall conversation quality. Think of it as a "taste test" for AI models.

Go Beyond Elo Ratings

Human preference matters, but so does task performance. Our daily scorecards test models on coding, reasoning, and real workflows—with cost and latency data.