SWE-Bench is one of the most referenced benchmarks for coding agents because it asks models to resolve real GitHub issues against real repositories. That makes it useful, but only if you read the benchmark, its variants, and its scores with context.
This page is an independent explainer. It is not the official SWE-Bench site and it does not mirror live leaderboard numbers that may go stale.
Key facts
What it evaluates
Unlike toy coding tasks, SWE-Bench asks an agent to work inside an existing codebase and submit a patch for a real issue sourced from GitHub.
What the agent gets
The agent receives the original issue description and repository snapshot. The tests that judge success are not shown to the model.
How it is checked
A submitted patch must satisfy the issue-specific FAIL_TO_PASS tests and the regression-oriented PASS_TO_PASS tests, both of which are recorded per task (see the dataset sketch after these key facts).
Why dates matter
Variants, harness quality, contamination, and reporting norms all change over time. A raw percentage without a date tells you less than people think.
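The key facts above map onto concrete fields in the published task records. The snippet below is a minimal sketch of inspecting one task, assuming the Hugging Face datasets library and the public princeton-nlp/SWE-bench_Verified dataset; the field names follow its published schema, but confirm them against the dataset card for the release you use.

# Minimal sketch: inspect one SWE-Bench Verified task record.
# Assumes `pip install datasets` and network access to Hugging Face.
import json
from datasets import load_dataset

tasks = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")
task = tasks[0]

print(task["instance_id"])                 # e.g. "astropy__astropy-12907"
print(task["repo"], task["base_commit"])   # repository snapshot the agent starts from
print(task["problem_statement"][:300])     # original issue text the agent is shown

# The grading tests travel with the task but are not shown to the model.
# In the public release they are stored as JSON-encoded lists of test IDs.
def as_list(value):
    return json.loads(value) if isinstance(value, str) else list(value)

fail_to_pass = as_list(task["FAIL_TO_PASS"])   # must flip from failing to passing
pass_to_pass = as_list(task["PASS_TO_PASS"])   # must keep passing
print(len(fail_to_pass), "FAIL_TO_PASS tests,", len(pass_to_pass), "PASS_TO_PASS tests")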
What it is
The original SWE-Bench paper frames the benchmark around resolved GitHub issues and merged pull requests. That makes the task legible to both researchers and practitioners: agents read a problem, inspect a repo, edit code, and live or die by the project’s tests.
The official project site describes the benchmark as real-world GitHub issues plus a reproducible, Docker-based evaluation harness. That combination is why the benchmark became a default reference point in coding-agent discussions.
Real repos are messy. Some issues are underspecified, some tests are too narrow, some tasks can depend on environment details, and public benchmarks can get contaminated once their tasks and solutions have circulated long enough.
So the useful question is not “what is the latest number?” but “which variant, with which harness, on what date, and with which known caveats?”
How evaluation works
Task setup
Each task pairs the original issue with the repository state just before the human fix was merged.
What the model sees
The model can read files, reason about the codebase, and propose edits, but it does not get the hidden evaluation tests.
FAIL_TO_PASS tests
These tests fail before the real fix and should pass once the issue is properly resolved.
PASS_TO_PASS tests
These tests guard unrelated behavior so a patch cannot “solve” the issue by breaking something else. The sketch below shows how the two test sets combine into a single resolved-or-not decision.
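The decision rule itself is simple once per-test outcomes exist. The sketch below is illustrative Python, not the official harness API: the function name and test names are made up for the example.

# Illustrative decision rule: resolved only if every FAIL_TO_PASS test now
# passes and every PASS_TO_PASS test still passes after applying the patch.
def is_resolved(test_results: dict[str, bool],
                fail_to_pass: list[str],
                pass_to_pass: list[str]) -> bool:
    issue_fixed = all(test_results.get(t, False) for t in fail_to_pass)
    nothing_broken = all(test_results.get(t, False) for t in pass_to_pass)
    return issue_fixed and nothing_broken

# Example: the patch fixes the reported case but breaks an unrelated test,
# so the task counts as unresolved.
results = {
    "tests/test_bug.py::test_reported_case": True,     # FAIL_TO_PASS, now passes
    "tests/test_core.py::test_unrelated_path": False,  # PASS_TO_PASS, broken
}
print(is_resolved(results,
                  fail_to_pass=["tests/test_bug.py::test_reported_case"],
                  pass_to_pass=["tests/test_core.py::test_unrelated_path"]))  # False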
Important dates
October 2023 / ICLR 2024
The benchmark was introduced in the 2023 paper and later published at ICLR 2024, positioning repository-level issue resolution as a more realistic coding evaluation setting.
June 27, 2024
The official project site records the release of the Docker-based evaluation harness, a milestone that matters because reproducibility is part of the benchmark’s value proposition.
August 13, 2024
OpenAI and the benchmark authors released SWE-Bench Verified, a 500-task subset screened by human reviewers to remove problems judged unfair or broken.
February 23, 2026
OpenAI reported contamination and test-quality issues severe enough that it no longer recommends SWE-Bench Verified as a frontier coding capability metric.
Variants and caveats
SWE-Bench (full)
Broad and historically important, but some tasks were later criticized as underspecified or impossible to solve fairly.
SWE-Bench Lite
A filtered version intended to be easier and cheaper to evaluate, but still not equivalent to a human-validated solvable set.
SWE-Bench Verified
A 500-task subset released on August 13, 2024 after human review. It improved reliability, but OpenAI later argued it had become too contaminated and too brittle for frontier measurement.
Treat benchmark numbers as dated evidence, not timeless truth. Keep the benchmark name, subset, date, harness, and reporting source attached to any claim you repeat; the sketch below shows one lightweight way to record that.
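As a concrete habit, you can carry that context as a small structured record. The class and field names below are illustrative, not a standard schema, and the filled-in values are placeholders rather than real results.

# Hedged sketch: attach provenance to any benchmark score you repeat.
from dataclasses import dataclass

@dataclass(frozen=True)
class BenchmarkClaim:
    benchmark: str   # e.g. "SWE-Bench"
    subset: str      # e.g. "Verified", "Lite", or "full"
    score: float     # resolved rate, as a fraction
    date: str        # when the run was reported
    harness: str     # which evaluation harness and version produced it
    source: str      # URL or citation for the reported number

claim = BenchmarkClaim(
    benchmark="SWE-Bench",
    subset="Verified",
    score=0.50,                            # placeholder value, not a real result
    date="2025-01-01",                     # placeholder date
    harness="official Docker harness (version unspecified)",
    source="https://example.com/report",   # placeholder citation
)
print(claim)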
FAQ
Is SWE-Bench just another code-generation benchmark?
No. Its defining trait is repository context: real files, real issues, real tests, and patches that have to survive evaluation inside an existing project.
Why do people still cite it despite the known problems?
Because it remains historically important and operationally useful. The mistake is not citing it. The mistake is pretending a single score settles the whole question of software engineering ability.
Did the Verified subset fix those problems for good?
No. It improved quality in 2024, but OpenAI said on February 23, 2026 that contamination and test design issues now make it unsuitable for frontier model launches.
If you want a broader view of agent systems, evaluation surfaces, and production-facing AI workflows, use the hero CTA to jump to Graphify. This page stays focused on explaining the benchmark itself.
Final note
If someone quotes a benchmark percentage without the subset and the date, ask for both before you draw any conclusions.