SWE-bench Verified

Human-verified subset of SWE-bench. Tests whether a model can resolve real GitHub issues by editing code.

Category: coding · Metric: pass@1 · Source: www.swebench.com ↗

Leaderboard

Rank	Model	Provider	Score (%)	Measured	Source
1	Claude Opus 4.7	Anthropic	77.2	2026-04-22	↗
2	GPT-5	OpenAI	74.9	2025-08-07	↗
3	Claude Sonnet 4.6	Anthropic	72.5	2026-03-12	↗
4	o3	OpenAI	71.7	2025-04-16	↗
5	Qwen3-Coder-480B	Alibaba	69.6	2025-07-22	↗
6	Gemini 2.5 Pro	Google	63.8	2025-03-25	↗

What this benchmark measures, in detail

Different benchmarks measure different things. A model that excels on SWE-bench Verified may still underperform on real-world workloads if the benchmark's task distribution doesn't match yours. Use benchmark scores as a triage signal to narrow the field to a shortlist, then evaluate on your actual workload before committing.
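
As a rough illustration of the triage step, the sketch below keeps every model whose score falls within a fixed margin of the top score. The rows are copied from the leaderboard above; the `shortlist` helper and its 5-point `margin` are hypothetical choices made for this example, not part of the benchmark or of any official tooling.

```python
# Leaderboard rows copied from the table above: (model, score in %).
LEADERBOARD = [
    ("Claude Opus 4.7", 77.2),
    ("GPT-5", 74.9),
    ("Claude Sonnet 4.6", 72.5),
    ("o3", 71.7),
    ("Qwen3-Coder-480B", 69.6),
    ("Gemini 2.5 Pro", 63.8),
]


def shortlist(rows, margin=5.0):
    """Return the models scoring within `margin` points of the best score.

    `margin` is an illustrative cutoff (an assumption), not a recommended value.
    """
    best = max(score for _, score in rows)
    return [model for model, score in rows if best - score <= margin]


candidates = shortlist(LEADERBOARD)
print(candidates)  # ['Claude Opus 4.7', 'GPT-5', 'Claude Sonnet 4.6']
```

The shortlist is only a starting point: the models it returns would still need to be evaluated on your own workload.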

Methodology notes

Scores in the leaderboard are taken from each model's release announcement or model card, cited via the "Source" link. Where two sources disagree (which happens often for SWE-bench and IFEval), the linked primary source wins. Reproducibility varies by run for some benchmarks, notably anything graded by an LLM; treat a score as accurate to within ±2 to 3 points unless the source is a peer-reviewed result.
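
One way to apply the uncertainty band when comparing two entries is to ask whether their bands overlap. The sketch below is a crude heuristic under that assumption, not a statistical test; the `separable` helper and its default `noise` of 3.0 points (the upper end of the range above) are hypothetical.

```python
def separable(score_a, score_b, noise=3.0):
    """Return True if two scores differ by more than their combined noise bands.

    Each score is treated as score ± noise, so the bands overlap whenever
    the gap between the scores is at most 2 * noise.
    """
    return abs(score_a - score_b) > 2 * noise


# Scores taken from the leaderboard above.
print(separable(77.2, 74.9))  # False: a 2.3-point gap is inside the bands
print(separable(77.2, 63.8))  # True: a 13.4-point gap is outside them
```

Under this reading, adjacent ranks on the leaderboard are often not meaningfully different, which is another reason to treat the ranking as a shortlist rather than a verdict.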
