
What is SWE-bench?

SWE-bench is the definitive benchmark for testing AI models on real software engineering tasks. It uses actual GitHub issues to measure whether models can understand bugs, navigate codebases, and write working patches.

What Does SWE-bench Measure?

SWE-bench evaluates an AI model's ability to perform real software engineering work. Unlike coding challenges that test algorithms in isolation, SWE-bench asks models to solve actual problems from real open-source projects.

Each task in SWE-bench comes from a real GitHub issue in popular Python repositories such as Django, Flask, Matplotlib, and scikit-learn. The model receives:

  • The original issue description (bug report or feature request)
  • The complete codebase at the version where the issue occurred
  • Any relevant context from the repository

The model must then generate a code patch that fixes the issue and passes the project's existing test suite. This is exactly what human software engineers do every day.
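
To make this concrete, each published task exposes those inputs as plain fields. The sketch below loads SWE-bench Verified from Hugging Face and prints what a model would be given; the dataset id and field names are taken from the public dataset card, so double-check them against the current release before relying on them.

```python
# Minimal sketch: inspecting what a SWE-bench task hands to the model.
# Assumes the `datasets` library and the princeton-nlp/SWE-bench_Verified
# dataset id; field names may change between releases.
from datasets import load_dataset

tasks = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")
task = tasks[0]

print(task["repo"])               # e.g. "astropy/astropy"
print(task["base_commit"])        # codebase state where the issue occurred
print(task["problem_statement"])  # the original GitHub issue text
print(task["FAIL_TO_PASS"])       # tests that must go from failing to passing
```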

Key Insight

SWE-bench is intentionally difficult. Even the best AI models in 2026 struggle to solve more than half of the tasks. This isn't a flaw—it reflects the genuine complexity of software engineering.

How SWE-bench Works

The benchmark was created by researchers at Princeton University and the University of Chicago. Here's the methodology:

  1. Issue Collection: Pull requests from 12 popular Python repositories were analyzed to find those that fixed bugs or added features in response to GitHub issues.
  2. Task Creation: Each task includes the issue text, the codebase state before the fix, and the test cases that verify the fix works.
  3. Model Evaluation: The model is given the issue and codebase, then must generate a patch. The patch is applied and tests are run.
  4. Scoring: A task is marked correct only if all relevant tests pass; partial credit is not given. (A simplified version of this apply-and-test loop is sketched below.)
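
The official harness runs each task inside per-repository Docker images, but the apply-and-test loop in steps 3 and 4 can be sketched in a few lines. Everything below is illustrative rather than the real harness: the function name and arguments are hypothetical, and the direct git apply / pytest calls are simplifications of what the harness actually does.

```python
# Illustrative sketch of steps 3-4: apply a model-generated patch, then score it.
# The real harness also re-runs the task's PASS_TO_PASS tests to catch
# regressions; this simplification checks only the FAIL_TO_PASS tests.
import subprocess

def evaluate_patch(repo_dir: str, model_patch: str, fail_to_pass: list[str]) -> bool:
    # Apply the candidate patch on top of the task's base commit.
    applied = subprocess.run(
        ["git", "apply", "-"],
        cwd=repo_dir, input=model_patch, text=True, capture_output=True,
    )
    if applied.returncode != 0:
        return False  # patch does not even apply cleanly -> task fails

    # Run only the tests the task says must flip from failing to passing.
    result = subprocess.run(
        ["python", "-m", "pytest", *fail_to_pass],
        cwd=repo_dir, capture_output=True,
    )
    # Binary scoring: every relevant test must pass; there is no partial credit.
    return result.returncode == 0
```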

SWE-bench Dataset Sizes

  • SWE-bench (Full): 2,294 tasks from 12 repositories. The original, complete dataset.
  • SWE-bench Verified: 500 tasks manually reviewed to ensure reliability. This is now the preferred subset for fair comparisons.
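
Both subsets are published as Hugging Face datasets, so the counts are easy to confirm locally. A quick sketch (the dataset ids below are taken from the public dataset cards; verify them before use):

```python
# Quick size check for the two subsets; requires the `datasets` library.
from datasets import load_dataset

full = load_dataset("princeton-nlp/SWE-bench", split="test")
verified = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")

print(len(full), len(verified))  # roughly 2294 and 500 tasks
```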

Current SWE-bench Leaderboard (2026)

Here are representative scores from leading models on SWE-bench Verified. Note that rankings change frequently as models are updated—check our daily scorecards for current standings.

Rank | Model            | SWE-bench Verified | Notes
1    | Claude Opus 4.6  | ~65%               | Strong at understanding context
2    | 5.3-Codex-Spark  | ~60%               | Excellent at code generation
3    | Gemini 2.5 Pro   | ~55%               | Good reasoning capabilities
4    | GLM-5            | ~48%               | Competitive at lower cost
5    | Kimi K2.5        | ~45%               | Strong value proposition

Percentages are approximate and based on publicly available benchmarks. Actual performance varies by evaluation method.

Limitations and Criticisms

SWE-bench is valuable, but it's not a complete picture of coding ability. Important limitations:

  • Python only: All tasks are in Python. Performance may not transfer to other languages.
  • Bug-fix focus: Most tasks are bug fixes, not feature development or architecture work.
  • Test coverage: Some tasks have incomplete test suites, meaning a model might "pass" without truly solving the problem.
  • Context length: Models with larger context windows may have advantages that don't reflect real ability.
  • Contamination risk: Some models may have seen the test data during training, inflating scores.

Caution

Don't rely solely on SWE-bench for model selection. A model with a 45% score might be better for your specific use case than one with 55%, depending on your tech stack, task types, and budget.

When to Use SWE-bench for Model Selection

SWE-bench is most useful when:

  • You're building a coding assistant that works with existing codebases
  • You need models that can understand natural language bug reports
  • Your team works primarily in Python
  • You want a rough comparison of coding capabilities between models

It's less useful when:

  • You need models for greenfield development (new projects from scratch)
  • Your work involves languages other than Python
  • You're focused on code explanation or documentation rather than patches
  • You need to compare models on speed, cost, or reliability

For a more complete picture, combine SWE-bench with MMLU (reasoning), Chatbot Arena (human preference), and our own daily operational benchmarks.

Frequently Asked Questions

What is SWE-bench?

SWE-bench (Software Engineering Benchmark) is a dataset that tests AI models on real software engineering tasks. It contains actual bug reports and feature requests from popular open-source Python projects, along with the code changes that human developers made to fix them. Models must understand the problem, navigate a codebase, and generate correct patches.

What does a good SWE-bench score look like?

SWE-bench is intentionally difficult. As of early 2026, the best models score around 50-65% on SWE-bench Verified (a cleaner subset). Scores above 40% are considered strong. The full SWE-bench dataset is even harder, with top models typically scoring 30-45%. Any score above 50% represents genuinely useful software engineering capability.

How is SWE-bench different from other coding benchmarks?

Unlike benchmarks that test isolated functions or algorithmic puzzles, SWE-bench requires models to work with real codebases spanning thousands of lines, understand issue descriptions written by humans, and produce patches that pass existing test suites. It tests practical software engineering, not just code generation.

Should I use SWE-bench scores to choose a coding assistant?

SWE-bench is one useful signal, but it has limitations. It only tests Python, focuses on bug fixes rather than feature development, and may not reflect your specific use cases. Use it alongside other benchmarks and, ideally, test models on your own codebase before committing.

What is SWE-bench Verified?

SWE-bench Verified is a curated subset of 500 tasks from the original SWE-bench dataset. Human annotators reviewed each task to ensure the test suite is reliable and the task is solvable. Many leaderboard rankings now use this subset because it provides more consistent, fair comparisons between models.

See How Models Perform Today

Our daily scorecards test models on real tasks, not just benchmarks. Get current rankings for coding, reasoning, and tool use.

View Latest Scorecards