
How ELO Ratings Work

Chatbot Arena uses the ELO rating system — the same method used in chess — to rank language models based on pairwise comparisons. Users chat with two anonymous models and vote for the better response.

  • Large-scale public voting collected since launch
  • Randomized battles prevent gaming
  • Confidence intervals show uncertainty
  • Open methodology: read the paper

ELO ratings measure human preference, not objective correctness. A model with higher ELO may still fail at specific tasks.
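
The pairwise-voting process above maps onto the standard Elo update rule. The sketch below shows that rule in isolation; the function names and the K-factor of 32 are illustrative assumptions, not Chatbot Arena's actual implementation (which fits ratings over the full vote history rather than updating one battle at a time).

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that the model rated r_a beats the model rated r_b."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, a_won: bool, k: float = 32.0) -> tuple[float, float]:
    """Return both models' updated ratings after one battle.

    k (assumed 32 here) controls how far a single vote moves the ratings;
    a tie would use a score of 0.5 instead of 0 or 1.
    """
    e_a = expected_score(r_a, r_b)        # expected win rate for A
    s_a = 1.0 if a_won else 0.0           # actual outcome for A
    new_a = r_a + k * (s_a - e_a)
    new_b = r_b + k * ((1.0 - s_a) - (1.0 - e_a))
    return new_a, new_b

# Two equally rated models: the winner gains what the loser gives up.
# elo_update(1500, 1500, True) -> (1516.0, 1484.0)
```

An upset (a low-rated model beating a high-rated one) moves the ratings more than an expected win, which is what lets the ranking converge from random battles.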

Current Leaderboard

For current model ranks, ELO scores, vote counts, and confidence intervals, use the official LM Arena leaderboard. This page explains how to interpret those numbers without republishing a static copy that can drift.

What These Numbers Mean

ELO Score
Relative strength. A 100-point gap means the higher-rated model wins ~64% of battles.
95% CI (Confidence Interval)
Uncertainty range. A model at 1500 ±5 is reliably better than one at 1480 ±10, because the intervals don't overlap; against 1490 ±10 the intervals do overlap, so the gap may be noise.
Votes
Sample size. More votes = narrower confidence interval = more reliable rating.
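
The ~64% figure for a 100-point gap falls straight out of the Elo expected-score formula. A quick check (the function name is illustrative):

```python
def win_probability(gap: float) -> float:
    """Expected win rate for the higher-rated model, given its rating lead."""
    return 1.0 / (1.0 + 10 ** (-gap / 400))

# A 100-point lead gives roughly a 64% expected win rate;
# a 0-point gap gives exactly 50%.
```

Note the curve is gentle: even a 200-point gap only predicts about a 76% win rate, so lower-rated models still win a meaningful share of battles.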

Limitations

  • Human preference ≠ truth. Models may sound confident while being wrong.
  • Bias toward verbose responses. Longer answers often score higher.
  • No task-specific ratings. Coding, math, and reasoning aren't separated.
  • New models need time. Fresh releases may have wider confidence intervals until enough battles accumulate.
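
The last point can be made concrete: a rating difference is only meaningful when the two confidence intervals do not overlap. A minimal overlap check (names are illustrative, not an official API):

```python
def reliably_better(elo_a: float, ci_a: float, elo_b: float, ci_b: float) -> bool:
    """True if model A's 95% interval lies entirely above model B's."""
    return (elo_a - ci_a) > (elo_b + ci_b)

# 1500 +/-5 vs. 1480 +/-10: [1495, 1505] and [1470, 1490] are disjoint -> reliable.
# 1500 +/-5 vs. 1490 +/-10: [1495, 1505] overlaps [1480, 1500] -> may be noise.
```

This is a conservative reading: disjoint intervals strongly suggest a real difference, while overlapping intervals mean you should wait for more votes before drawing conclusions.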

For task-specific benchmarks, see our coding benchmark and reasoning benchmark scorecards.

Other Leaderboards

ELO measures human preference. For objective, task-specific results, see the coding and reasoning benchmark scorecards linked above.

How to Cite

If you use these ratings in your work, cite the official source:

LMSYS Organization. "Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference." arXiv:2403.04132 (2024). https://lmsys.org/blog/2023-05-03-arena/