What Does SWE-bench Measure?
SWE-bench evaluates an AI model's ability to perform real software engineering work. Unlike coding challenges that test algorithms in isolation, SWE-bench asks models to solve actual problems from real open-source projects.
Each task in SWE-bench comes from a real GitHub issue in popular Python repositories like Django, Flask, Matplotlib, and scikit-learn. The model receives:
- The original issue description (bug report or feature request)
- The complete codebase, checked out at the commit just before the fix was merged
- Any relevant context from the repository
The model must then generate a code patch that resolves the issue: the tests written for the fix must pass, and the rest of the project's test suite must keep passing. This closely mirrors what human software engineers do every day.
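To make the inputs concrete, here is roughly what a single task instance looks like. The field names below follow the publicly released dataset schema; the values are invented placeholders.

```python
# Sketch of one SWE-bench task instance. Field names follow the released
# dataset schema; the values here are invented for illustration.
task = {
    "repo": "django/django",                # source repository
    "instance_id": "django__django-12345",  # unique task identifier
    "base_commit": "<sha>",                 # commit the codebase is checked out at
    "problem_statement": "<issue text>",    # the GitHub issue shown to the model
    "patch": "<gold diff>",                 # the reference fix, hidden from the model
    "test_patch": "<test diff>",            # tests added by the original fix PR
    "FAIL_TO_PASS": ["tests/test_foo.py::test_bug"],  # must pass after the fix
    "PASS_TO_PASS": ["tests/test_foo.py::test_ok"],   # must not regress
}
```

The model sees only the issue text and the repository at `base_commit`; the gold patch and the two test lists are reserved for scoring.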
SWE-bench is intentionally difficult. Even the best AI models in 2026 leave a substantial share of tasks unsolved. This isn't a flaw; it reflects the genuine complexity of software engineering.
How SWE-bench Works
The benchmark was created by researchers at Princeton University and the University of Chicago. Here's the methodology:
- Issue Collection: Merged pull requests from 12 popular Python repositories were analyzed to find those that resolved a GitHub issue and added or modified tests.
- Task Creation: Each task includes the issue text, the codebase state before the fix, and the test cases that verify the fix works.
- Model Evaluation: The model is given the issue and codebase, then must generate a patch. The patch is applied and tests are run.
- Scoring: A task is marked resolved only if every relevant test passes; partial credit is not given (sketched below).
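A minimal sketch of that scoring rule, assuming a pytest-style test runner (the real harness runs each task in a per-repository container with repo-specific test commands):

```python
import subprocess

def evaluate(repo_dir, model_patch, fail_to_pass, pass_to_pass):
    """Apply a model-generated patch, then re-run the task's tests."""
    # Apply the candidate diff to the repo checked out at base_commit.
    subprocess.run(["git", "apply"], input=model_patch,
                   text=True, cwd=repo_dir, check=True)

    def passes(test_id):
        # pytest exits with code 0 only when the selected test passes.
        result = subprocess.run(["python", "-m", "pytest", test_id],
                                cwd=repo_dir, capture_output=True)
        return result.returncode == 0

    # All-or-nothing: the previously failing tests must now pass, and the
    # previously passing tests must not regress.
    return (all(passes(t) for t in fail_to_pass)
            and all(passes(t) for t in pass_to_pass))
```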
SWE-bench Dataset Sizes
- SWE-bench (Full): 2,294 tasks from 12 repositories. The original, complete dataset.
- SWE-bench Verified: 500 tasks that were human-validated to screen out underspecified issues and broken or overly specific tests. This is now the preferred subset for fair comparisons.
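Both variants are distributed on the Hugging Face Hub under the authors' `princeton-nlp` organization, so you can inspect the tasks yourself; a short sketch assuming the `datasets` library is installed:

```python
from datasets import load_dataset

# Each variant ships as a single "test" split.
full = load_dataset("princeton-nlp/SWE-bench", split="test")
verified = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")

print(len(full))      # 2294
print(len(verified))  # 500
```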
Current SWE-bench Leaderboard (2026)
Here are representative scores from leading models on SWE-bench Verified. Note that rankings change frequently as models are updated—check our daily scorecards for current standings.
| Rank | Model | SWE-bench Verified | Notes |
|---|---|---|---|
| 1 | Claude Opus 4.6 | ~65% | Strong at understanding context |
| 2 | 5.3-Codex-Spark | ~60% | Excellent at code generation |
| 3 | Gemini 2.5 Pro | ~55% | Good reasoning capabilities |
| 4 | GLM-5 | ~48% | Competitive at lower cost |
| 5 | Kimi K2.5 | ~45% | Strong value proposition |
Percentages are approximate and based on publicly reported results. Reported scores also vary with the agent scaffold and evaluation harness used.
Limitations and Criticisms
SWE-bench is valuable, but it's not a complete picture of coding ability. Important limitations:
- Python only: All tasks are in Python. Performance may not transfer to other languages.
- Bug-fix focus: Most tasks are bug fixes, not feature development or architecture work.
- Test coverage: Some tasks have incomplete test suites, meaning a model might "pass" without truly solving the problem.
- Context length: Models with larger context windows may gain an advantage simply by fitting more of the repository into the prompt, which reflects capacity more than coding skill.
- Contamination risk: Some models may have seen the test data during training, inflating scores.
Don't rely solely on SWE-bench for model selection. A model with a 45% score might be better for your specific use case than one with 55%, depending on your tech stack, task types, and budget.
When to Use SWE-bench for Model Selection
SWE-bench is most useful when:
- You're building a coding assistant that works with existing codebases
- You need models that can understand natural language bug reports
- Your team works primarily in Python
- You want a rough comparison of coding capabilities between models
It's less useful when:
- You need models for greenfield development (new projects from scratch)
- Your work involves languages other than Python
- You're focused on code explanation or documentation rather than patches
- You need to compare models on speed, cost, or reliability
For a more complete picture, combine SWE-bench with MMLU (reasoning), Chatbot Arena (human preference), and our own daily operational benchmarks.