
How ELO Ratings Work

Chatbot Arena uses the ELO rating system — the same method used in chess — to rank language models based on pairwise comparisons. Users chat with two anonymous models and vote for the better response.

  • Large-scale public voting collected since launch
  • Randomized battles prevent gaming
  • Confidence intervals show uncertainty
  • Open methodology: read the paper

ELO ratings measure human preference, not objective correctness. A model with higher ELO may still fail at specific tasks.
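
The pairwise-voting process above maps onto the standard Elo update rule. The sketch below shows that rule in isolation; the function names and the K-factor of 32 are illustrative assumptions, not Chatbot Arena's actual implementation (which fits ratings over the full vote history rather than updating one battle at a time).

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that the model rated r_a beats the model rated r_b."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, a_won: bool, k: float = 32.0) -> tuple[float, float]:
    """Return both models' updated ratings after one battle.

    k (assumed 32 here) controls how far a single vote moves the ratings;
    a tie would use a score of 0.5 instead of 0 or 1.
    """
    e_a = expected_score(r_a, r_b)        # expected win rate for A
    s_a = 1.0 if a_won else 0.0           # actual outcome for A
    new_a = r_a + k * (s_a - e_a)
    new_b = r_b + k * ((1.0 - s_a) - (1.0 - e_a))
    return new_a, new_b

# Two equally rated models: the winner gains what the loser gives up.
# elo_update(1500, 1500, True) -> (1516.0, 1484.0)
```

An upset (a low-rated model beating a high-rated one) moves the ratings more than an expected win, which is what lets the ranking converge from random battles.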

Current Leaderboard

For current model ranks, ELO scores, vote counts, and confidence intervals, use the official LM Arena leaderboard. This page explains how to interpret those numbers without republishing a static copy that can drift.

What These Numbers Mean

ELO Score
Relative strength. A 100-point gap means the higher-rated model wins ~64% of battles.
95% CI (Confidence Interval)
Uncertainty range. A model at 1500 ±5 is reliably better than one at 1480 ±10, because the intervals don't overlap; against 1490 ±10 the intervals do overlap, so the gap may be noise.
Votes
Sample size. More votes = narrower confidence interval = more reliable rating.
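
The ~64% figure for a 100-point gap falls straight out of the Elo expected-score formula. A quick check (the function name is illustrative):

```python
def win_probability(gap: float) -> float:
    """Expected win rate for the higher-rated model, given its rating lead."""
    return 1.0 / (1.0 + 10 ** (-gap / 400))

# A 100-point lead gives roughly a 64% expected win rate;
# a 0-point gap gives exactly 50%.
```

Note the curve is gentle: even a 200-point gap only predicts about a 76% win rate, so lower-rated models still win a meaningful share of battles.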

Limitations

  • Human preference ≠ truth. Models may sound confident while being wrong.
  • Bias toward verbose responses. Longer answers often score higher.
  • No task-specific ratings. Coding, math, and reasoning aren't separated.
  • New models need time. Fresh releases may have wider confidence intervals until enough battles accumulate.
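
The last point can be made concrete: a rating difference is only meaningful when the two confidence intervals do not overlap. A minimal overlap check (names are illustrative, not an official API):

```python
def reliably_better(elo_a: float, ci_a: float, elo_b: float, ci_b: float) -> bool:
    """True if model A's 95% interval lies entirely above model B's."""
    return (elo_a - ci_a) > (elo_b + ci_b)

# 1500 +/-5 vs. 1480 +/-10: [1495, 1505] and [1470, 1490] are disjoint -> reliable.
# 1500 +/-5 vs. 1490 +/-10: [1495, 1505] overlaps [1480, 1500] -> may be noise.
```

This is a conservative reading: disjoint intervals strongly suggest a real difference, while overlapping intervals mean you should wait for more votes before drawing conclusions.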

For task-specific benchmarks, see our coding benchmark and reasoning benchmark scorecards.

Other Leaderboards

ELO measures human preference. For objective, task-specific results, see the coding and reasoning benchmark scorecards linked above.

How to Cite

If you use these ratings in your work, cite the official source:

LMSYS Organization. "Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference." arXiv:2403.04132 (2024). https://lmsys.org/blog/2023-05-03-arena/