SWE-Bench is one of the most referenced benchmarks for coding agents because it asks models to resolve real GitHub issues against real repositories. That makes it useful, but only if you read the benchmark, its variants, and its scores with context.
This page is an independent explainer. It is not the official SWE-Bench site and it does not mirror live leaderboard numbers that may go stale.
Key facts
What it evaluates
Unlike toy coding tasks, SWE-Bench asks an agent to work inside an existing codebase and submit a patch for a real issue sourced from GitHub.
What the agent gets
The agent receives the original issue description and repository snapshot. The tests that judge success are not shown to the model.
How it is checked
A submitted patch must satisfy the issue-specific FAIL_TO_PASS tests and the regression-oriented PASS_TO_PASS tests, both of which are recorded per task (see the dataset sketch after these key facts).
Why dates matter
Variants, harness quality, contamination, and reporting norms all change over time. A raw percentage without a date tells you less than people think.
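The key facts above map onto concrete fields in the published task records. The snippet below is a minimal sketch of inspecting one task, assuming the Hugging Face datasets library and the public princeton-nlp/SWE-bench_Verified dataset; the field names follow its published schema, but confirm them against the dataset card for the release you use.

# Minimal sketch: inspect one SWE-Bench Verified task record.
# Assumes `pip install datasets` and network access to Hugging Face.
import json
from datasets import load_dataset

tasks = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")
task = tasks[0]

print(task["instance_id"])                 # e.g. "astropy__astropy-12907"
print(task["repo"], task["base_commit"])   # repository snapshot the agent starts from
print(task["problem_statement"][:300])     # original issue text the agent is shown

# The grading tests travel with the task but are not shown to the model.
# In the public release they are stored as JSON-encoded lists of test IDs.
def as_list(value):
    return json.loads(value) if isinstance(value, str) else list(value)

fail_to_pass = as_list(task["FAIL_TO_PASS"])   # must flip from failing to passing
pass_to_pass = as_list(task["PASS_TO_PASS"])   # must keep passing
print(len(fail_to_pass), "FAIL_TO_PASS tests,", len(pass_to_pass), "PASS_TO_PASS tests")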
What it is
The original SWE-Bench paper frames the benchmark around resolved GitHub issues and merged pull requests. That makes the task legible to both researchers and practitioners: agents read a problem, inspect a repo, edit code, and live or die by the project’s tests.
The official project site describes the benchmark as real-world GitHub issues plus a reproducible, Docker-based evaluation harness. That combination is why the benchmark became a default reference point in coding-agent discussions.
Real repos are messy. Some issues are underspecified, some tests are too narrow, some tasks can depend on environment details, and public benchmarks can get contaminated once their tasks and solutions have circulated long enough.
So the useful question is not “what is the latest number?” but “which variant, with which harness, on what date, and with which known caveats?”
How evaluation works
Task setup
Each task pairs the original issue with the repository state just before the human fix was merged.
What the model sees
The model can read files, reason about the codebase, and propose edits, but it does not get the hidden evaluation tests.
FAIL_TO_PASS tests
These tests fail before the real fix and should pass once the issue is properly resolved.
PASS_TO_PASS tests
These tests guard unrelated behavior so a patch cannot “solve” the issue by breaking something else. The sketch below shows how the two test sets combine into a single resolved-or-not decision.
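The decision rule itself is simple once per-test outcomes exist. The sketch below is illustrative Python, not the official harness API: the function name and test names are made up for the example.

# Illustrative decision rule: resolved only if every FAIL_TO_PASS test now
# passes and every PASS_TO_PASS test still passes after applying the patch.
def is_resolved(test_results: dict[str, bool],
                fail_to_pass: list[str],
                pass_to_pass: list[str]) -> bool:
    issue_fixed = all(test_results.get(t, False) for t in fail_to_pass)
    nothing_broken = all(test_results.get(t, False) for t in pass_to_pass)
    return issue_fixed and nothing_broken

# Example: the patch fixes the reported case but breaks an unrelated test,
# so the task counts as unresolved.
results = {
    "tests/test_bug.py::test_reported_case": True,     # FAIL_TO_PASS, now passes
    "tests/test_core.py::test_unrelated_path": False,  # PASS_TO_PASS, broken
}
print(is_resolved(results,
                  fail_to_pass=["tests/test_bug.py::test_reported_case"],
                  pass_to_pass=["tests/test_core.py::test_unrelated_path"]))  # False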
Important dates
October 2023 / ICLR 2024
The benchmark was introduced in the 2023 paper and later published at ICLR 2024, positioning repository-level issue resolution as a more realistic coding evaluation setting.
June 27, 2024
The official project site records the release of the Docker-based evaluation harness, a milestone that matters because reproducibility is part of the benchmark’s value proposition.
August 13, 2024
OpenAI and the benchmark authors released SWE-Bench Verified, a 500-task subset screened by human reviewers to remove problems judged unfair or broken.
February 23, 2026
OpenAI reported contamination and test-quality issues severe enough that it no longer recommends SWE-Bench Verified as a frontier coding capability metric.
Variants and caveats
SWE-Bench (full)
Broad and historically important, but some tasks were later criticized as underspecified or impossible to solve fairly.
SWE-Bench Lite
A filtered version intended to be easier and cheaper to evaluate, but still not equivalent to a human-validated solvable set.
SWE-Bench Verified
A 500-task subset released on August 13, 2024 after human review. It improved reliability, but OpenAI later argued it had become too contaminated and too brittle for frontier measurement.
Treat benchmark numbers as dated evidence, not timeless truth. Keep the benchmark name, subset, date, harness, and reporting source attached to any claim you repeat; the sketch below shows one lightweight way to record that.
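As a concrete habit, you can carry that context as a small structured record. The class and field names below are illustrative, not a standard schema, and the filled-in values are placeholders rather than real results.

# Hedged sketch: attach provenance to any benchmark score you repeat.
from dataclasses import dataclass

@dataclass(frozen=True)
class BenchmarkClaim:
    benchmark: str   # e.g. "SWE-Bench"
    subset: str      # e.g. "Verified", "Lite", or "full"
    score: float     # resolved rate, as a fraction
    date: str        # when the run was reported
    harness: str     # which evaluation harness and version produced it
    source: str      # URL or citation for the reported number

claim = BenchmarkClaim(
    benchmark="SWE-Bench",
    subset="Verified",
    score=0.50,                            # placeholder value, not a real result
    date="2025-01-01",                     # placeholder date
    harness="official Docker harness (version unspecified)",
    source="https://example.com/report",   # placeholder citation
)
print(claim)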
FAQ
Is SWE-Bench just another code-generation benchmark?
No. Its defining trait is repository context: real files, real issues, real tests, and patches that have to survive evaluation inside an existing project.
Why do people still cite it despite the known problems?
Because it remains historically important and operationally useful. The mistake is not citing it. The mistake is pretending a single score settles the whole question of software engineering ability.
Did the Verified subset fix those problems for good?
No. It improved quality in 2024, but OpenAI said on February 23, 2026 that contamination and test design issues now make it unsuitable for frontier model launches.
If you want a broader view of agent systems, evaluation surfaces, and production-facing AI workflows, use the hero CTA to jump to Graphify. This page stays focused on explaining the benchmark itself.
Final note
If someone quotes a benchmark percentage without the subset and the date, ask for both before you draw any conclusions.