What Does Chatbot Arena Measure?
Unlike benchmarks that test specific capabilities (coding, knowledge, math), Chatbot Arena measures overall human preference. When people use AI models in real conversations, which responses do they prefer?
This captures qualities that automated tests miss:
- Helpfulness: Does the response actually solve the user's problem?
- Tone: Is the model pleasant to interact with?
- Clarity: Is the response easy to understand?
- Creativity: Does the model produce interesting, engaging content?
- Safety: Does the model avoid harmful or inappropriate content?
Chatbot Arena answers a simple but crucial question: "Which model would users rather talk to?" This is often more relevant to real-world deployment than scores on academic benchmarks.
How Chatbot Arena Works
Chatbot Arena is run by LMSYS (Large Model Systems Organization), a research group. Here's the methodology:
- Anonymous Battle: A user submits a prompt. Two randomly-selected models (labeled only as "Model A" and "Model B") each generate a response.
- Blind Vote: The user compares both responses without knowing which models produced them. They vote for the better response (or declare a tie).
- Identity Reveal: After voting, the model identities are revealed.
- Elo Update: The voting result updates each model's Elo rating.
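The four steps above can be sketched in a few lines. This is an illustrative toy, not the Arena's actual code; `get_response` and `get_vote` are hypothetical callbacks standing in for the model APIs and the human voter.

```python
import random

def run_battle(models, prompt, get_response, get_vote):
    """One anonymous battle: sample two models, collect a blind vote,
    then reveal identities. get_response/get_vote are hypothetical callbacks."""
    model_a, model_b = random.sample(list(models), 2)
    response_a = get_response(model_a, prompt)
    response_b = get_response(model_b, prompt)
    # The voter sees only the "Model A" / "Model B" labels, never the real names.
    vote = get_vote(prompt, response_a, response_b)  # "A", "B", or "tie"
    score_a = {"A": 1.0, "B": 0.0, "tie": 0.5}[vote]
    # Identities are revealed only after the vote; the (model, model, score)
    # record then feeds the Elo update.
    return model_a, model_b, score_a
```

The key design point is that randomization and blinding happen before any response is shown, so the vote cannot be biased by brand recognition.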
Understanding Elo Ratings
Elo (developed for chess) converts head-to-head results into a single rating number:
- Starting point: All models begin at 1000 Elo
- Point exchanges: The winner gains points and the loser gives up the same number, so the basic update is zero-sum
- Expected outcomes: Beating a higher-rated model earns more points than beating a lower-rated one
- Probability: A 100-point difference means the higher-rated model wins ~64% of the time
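These rules follow from the standard Elo formulas, sketched below. The K-factor of 32 is an illustrative choice, not the Arena's exact parameter (the official leaderboard now fits a statistical model over all battles rather than updating sequentially), but the math reproduces the ~64% figure above.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the Elo model:
    E_A = 1 / (1 + 10^((R_B - R_A) / 400))."""
    return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))

def update_elo(rating_a, rating_b, score_a, k=32):
    """Updated ratings after one battle.
    score_a: 1.0 if A wins, 0.0 if A loses, 0.5 for a tie.
    Beating a stronger opponent (low expected score) earns more points."""
    e_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - e_a)
    new_b = rating_b + k * ((1 - score_a) - (1 - e_a))  # zero-sum exchange
    return new_a, new_b

# A 100-point gap gives the higher-rated model ~64% win probability:
print(round(expected_score(1100, 1000), 2))  # → 0.64
```

Note that `new_a + new_b` always equals `rating_a + rating_b`: points move between models but the total is conserved.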
Current Chatbot Arena Leaderboard (2026)
Here are representative Elo ratings for leading models. Rankings shift as new models release and battles accumulate—check the official Arena or our daily scorecards for current standings.
| Rank | Model | Elo Rating | 95% CI |
|---|---|---|---|
| 1 | Claude Opus 4.6 | ~1430 | +5 / -5 |
| 2 | Gemini 2.5 Pro | ~1410 | +6 / -6 |
| 3 | 5.3-Codex-Spark | ~1395 | +5 / -5 |
| 4 | GLM-5 | ~1360 | +7 / -7 |
| 5 | Kimi K2.5 | ~1345 | +8 / -8 |
| 6 | MiniMax M2.5 | ~1320 | +9 / -9 |
Ratings are approximate and based on publicly available data. Confidence intervals (CI) indicate uncertainty—narrower is better.
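One common way to estimate intervals like these is bootstrap resampling over the recorded battles: re-draw the battle log with replacement many times, recompute the rating each time, and take the 2.5th and 97.5th percentiles. The sketch below uses simple sequential Elo updates for illustration; the real leaderboard derives its CIs from a model fit over all battles, so treat this as a toy.

```python
import random

def bootstrap_elo_ci(battles, model, n_resamples=200, k=32, seed=0):
    """Estimate a 95% CI for one model's Elo by resampling battles.
    battles: list of (model_a, model_b, score_a) tuples,
    where score_a is 1.0 / 0.0 / 0.5 from A's perspective."""
    rng = random.Random(seed)
    finals = []
    for _ in range(n_resamples):
        # Resample the battle log with replacement.
        sample = [rng.choice(battles) for _ in battles]
        ratings = {}
        for a, b, score_a in sample:
            ra, rb = ratings.get(a, 1000.0), ratings.get(b, 1000.0)
            e_a = 1 / (1 + 10 ** ((rb - ra) / 400))
            ratings[a] = ra + k * (score_a - e_a)
            ratings[b] = rb - k * (score_a - e_a)
        finals.append(ratings.get(model, 1000.0))
    finals.sort()
    # 2.5th and 97.5th percentiles of the resampled ratings.
    return finals[int(0.025 * n_resamples)], finals[int(0.975 * n_resamples)]
```

This also shows why newer models have wider intervals: with few battles, each resample can look quite different, spreading the percentiles apart.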
Limitations and Criticisms
Chatbot Arena is widely respected, but it has important limitations:
- Voter bias: Users may prefer confident, verbose, or friendly responses even when they're wrong. Style can overshadow substance.
- Population skew: Voters are mostly AI enthusiasts, researchers, and developers. They may not represent your actual users.
- Prompt distribution: The types of questions asked may not match your use case. Coding questions, creative writing, and math each favor different models.
- Gaming risk: Model developers could theoretically submit favorable prompts or votes, though LMSYS has safeguards.
- New model volatility: Newer models have fewer votes, making their ratings less stable.
A model with higher Elo might lose to a lower-rated model on your specific tasks. Elo measures average preference across many users and prompts—your mileage may vary.
Category-Specific Leaderboards
Chatbot Arena now offers category-specific rankings, which are more useful than overall Elo for specific use cases:
- Coding: Which model writes the best code?
- Math: Which model handles mathematical reasoning best?
- Creative writing: Which model produces the best stories and content?
- Hard prompts: Which model handles complex, challenging questions?
When to Use Chatbot Arena for Model Selection
Chatbot Arena is most useful when:
- You're building a general-purpose chatbot or assistant
- User satisfaction and conversation quality are primary metrics
- You want to compare models across a wide range of interactions
- You care about subjective qualities like tone and creativity
It's less useful when:
- You have specific, narrow tasks (use SWE-bench for coding)
- You need objective correctness measures (use MMLU for knowledge)
- Your users differ significantly from typical Arena voters
- You need to evaluate cost, latency, or reliability
For a complete picture, combine Chatbot Arena Elo with our daily operational benchmarks that measure real task performance, cost, and speed.