Latent Space · 2026-02-23

OpenAI on the End of SWE-Bench Verified and Future of Coding Benchmarks

Hosts: Alessio Fanelli, Swyx

Guests: Olivia Watkins, Mia Glaese

coding benchmarks · AI evaluation · benchmark contamination · SWE-Bench Verified · SWE-Bench Pro · OpenAI Frontier Evals · human-in-the-loop · preparedness framework

Why it matters

SWE-Bench Verified is retired due to saturation and contamination, no longer effectively measuring coding progress.

Key claims

  • SWE-Bench Verified is retired due to saturation and contamination, no longer effectively measuring coding progress.
  • OpenAI invested heavily in human curation of SWE-Bench Verified, involving expert software engineers reviewing and validating tasks.
  • Contamination issues include training data leakage and overly narrow tests that unfairly penalize valid solutions.
  • SWE-Bench Pro is proposed as a more challenging successor with longer, more complex tasks and less contamination.

Episode summary

Summary

In this episode of Latent Space, Mia Glaese and Olivia Watkins from OpenAI's Frontier Evals and Human Data teams discuss the retirement of the SWE-Bench Verified coding benchmark. They explain that SWE-Bench Verified, once a key benchmark for measuring coding progress, has become saturated and contaminated, limiting its usefulness for tracking improvements in AI coding capabilities. OpenAI is advocating for the community to move towards more challenging and realistic benchmarks like SWE-Bench Pro, which feature longer, more complex tasks and reduced contamination.

The guests highlight the extensive human effort involved in curating SWE-Bench Verified, including multiple expert reviews to ensure task fairness. They also discuss the challenges of contamination from training data leakage and overly narrow test criteria that do not fairly assess model coding ability. Looking forward, OpenAI emphasizes the need for benchmarks that measure open-ended design decisions, maintainability, and real-world coding workflows, potentially incorporating human-in-the-loop evaluation and proxy models for grading.

The conversation also touches on OpenAI's preparedness framework, which tracks frontier risks including research automation and model autonomy, with coding benchmarks playing a key role. The guests call for more community collaboration in creating and sharing diverse, high-quality evaluations to better measure AI progress and real-world impact, including metrics on AI usage and job augmentation.

  • Future benchmarks should assess open-ended design decisions, code maintainability, and realistic coding workflows.
  • Human-in-the-loop evaluation and LLM-based proxies are explored as scalable grading methods.
  • OpenAI’s preparedness framework tracks frontier risks including research automation, with coding benchmarks as a key component.
  • Call for community collaboration to create diverse, realistic benchmarks and track real-world AI usage and impact.

Source material

Transcript

Okay, hi, we're here in the OpenAI studio with Mia and Olivia from the Frontier Evals team, or however you want to introduce yourselves.

Maybe you want to introduce your name and what you do at OpenAI, and we can get started.

Sure. Hi, I'm Olivia, I'm on the Frontier Evals team.

Great, sure.

Nice.

Hi, I'm Mia.

I'm a VP of Research at OpenAI, and my teams are the Codex team, the Human Data team and the Alignment team, and we work a lot with Olivia's team on frontier evals.

Yeah, very exciting.

And by my understanding, you were part of the original team that worked on SWE-Bench Verified as well.

Yeah, Olivia's team, the Frontier Evals team, and the Human Data team collaborated on creating SWE-Bench Verified.

So you've seen the evolution of coding benchmarks over time, and I think it was around mid-[unclear] when we first covered SWE-Bench Verified.

That said, we've evolved a lot since then.

What's the blog post that you've worked on that we're introducing today?

Like, what is the concept, the main thesis that you're pushing out?

So the main thesis is that SWE-Bench Verified has been one of the North Star coding benchmarks that the field has looked at to measure coding progress.

But recently we've seen that progress has kind of stalled, and we realized that this is because the eval is effectively saturated and also highly contaminated.

So at this point, we think that it's not really measuring coding performance improvements well anymore, and we think that the field should move away from this towards other benchmarks.

Like SWE-Bench Pro.

Like SWE-Bench Pro.

Yeah, amazing.

Yeah, one of the jokes I always have is that there's a group chat with all the labs and everyone just takes turns to increment it by 0.1, and then it's like, okay, well, you have the best coding model, I guess, because you're 0.1 percent higher, but it's not super convincing at this point.

Oh, yeah.

So cool.

I think, let's reset on what the original work was that you guys did for SWE-Bench Verified, which I think was pretty substantial.

It was a very significant investment for OpenAI, which people still don't appreciate.

And then, what did we find over time, right?

So, what was SWE-Bench Verified?

What should people know about it?

So SWE-Bench Verified was kind of a cleanup of an original benchmark from a lab at Princeton called SWE-Bench. The agent is basically given a codebase and a task that was sourced from a real-world repository and GitHub issue, is asked to solve the task, and is graded on whether some tests pass.
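
The grading rule described here (a task counts as solved only if the relevant tests pass after the agent's patch is applied) can be illustrated with a minimal Python sketch. The `TaskResult` structure and the `fail_to_pass` / `pass_to_pass` field names are assumptions for illustration, loosely modeled on the idea of target tests plus regression tests; this is not the actual benchmark harness.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TaskResult:
    """Outcome of running one task's tests against an agent's patch (hypothetical structure)."""
    task_id: str
    fail_to_pass: Dict[str, bool]  # target tests the fix is supposed to turn green
    pass_to_pass: Dict[str, bool]  # regression tests that must stay green


def is_resolved(result: TaskResult) -> bool:
    """A task counts as solved only if every target test and every regression test passes."""
    if not result.fail_to_pass:
        return False  # with no target tests, nothing demonstrates the fix
    return all(result.fail_to_pass.values()) and all(result.pass_to_pass.values())


def resolve_rate(results: List[TaskResult]) -> float:
    """Fraction of tasks solved, i.e. the single headline score."""
    if not results:
        return 0.0
    return sum(is_resolved(r) for r in results) / len(results)


results = [
    TaskResult("task-a", {"test_fix": True}, {"test_old": True}),
    TaskResult("task-b", {"test_fix": True}, {"test_old": False}),  # broke a regression test
    TaskResult("task-c", {"test_fix": False}, {"test_old": True}),  # the fix did not work
    TaskResult("task-d", {"test_fix": True}, {}),                   # no regression tests
]
print(resolve_rate(results))  # 0.5
```

Because the score is a plain pass/fail aggregate, it is cheap to compute and compare, but it says nothing about whether a failing patch was nonetheless a reasonable solution.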

This quickly became a popular benchmark because, at the time, the field didn't really have good real-world coding benchmarks. But when OpenAI took a look at the benchmark as one of the evals we wanted to track in our Preparedness Framework, folks started realizing that some of the cases where agents were failing were due to bad problem setups, rather than just to models being dumb.

So folks at OpenAI did a pretty extensive human data campaign, hiring almost 100 real-world software engineers to go through the problems and figure out: are the tasks well specified, are the tasks actually fair? And they created a curated set of about 500 tasks that were much better.

It's maybe hard to overstate the amount of effort it took to create that benchmark. It was literally many expert software engineers reviewing the problems, definitely multiple times, and, you know, basically we had three different experts independently review each one.

Yeah, you didn't have to do that.

You just tripled your costs for just a bit.

I mean, we had to do it.

We had to do it, actually, because it's quite a hard task to look at a problem and the patch. And it's not just the problem and the patch: you have to understand it in the context of the codebase that the human or the model needs to solve the task in.

So it's a very complex problem, and it definitely needed three reviews. I think maybe we should have done more, but it was definitely a lot of effort to get there.

Yeah, and there's more, but people can read the blog post. I will note that you guys set a trend in "verifying" things first, because I just recently saw, I think, an HLE Verified and other community "Verified" efforts [unclear]. So now people want to verify everything, which is nice and good.

Okay. But I think the meat of it is that this was a lot of: here's the issue or problem statement, here are the diffs, here's the golden test, and here are some regression tests, right?

That's the rough setup with these 500 problems.

And some contamination always happens, because all of this, SWE-Bench Verified, was fully open, I think.

You did have canaries, but stuff leaks.

There are multiple avenues, like the problems being sourced from open-source repos.

Yes. When we usually publish evaluations, we add canary strings to ensure that they're easily filtered out at training time.

Obviously, if you use data from open source, you don't actually have a canary string.
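
As a rough illustration of the canary-string idea, here is a minimal Python sketch of a training-data filter. The marker value and the `drop_canaried` helper are hypothetical; real benchmarks publish their own unique strings, and real data pipelines are far more involved.

```python
from typing import Iterable, List

# Hypothetical marker; a real benchmark would publish its own unique canary string.
CANARY = "EXAMPLE-BENCHMARK-CANARY 00000000-0000-0000-0000-000000000000"


def drop_canaried(documents: Iterable[str], canary: str = CANARY) -> List[str]:
    """Keep only documents that do not contain the canary marker."""
    return [doc for doc in documents if canary not in doc]


corpus = [
    "def add(a, b): return a + b",                       # ordinary web text
    f"{CANARY}\nbenchmark task 17: fix the parser bug",  # published eval file, tagged
    "fix the parser bug (copied from the public repo)",  # same content, but untagged
]
kept = drop_canaried(corpus)
print(len(kept))  # 2: the untagged copy survives the filter
```

The third document shows the gap described in the conversation: content that originates in a public repository carries no marker, so a canary filter cannot catch it.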

Yeah.

And some of these are very popular.

They're repos like the Django repository, so you're going to see many instances of them throughout the web.

Yeah. Just before recording, you were telling me that you found this in your own chain of thought, with GPT-5.2 also showing that it had extra knowledge or something.

Yes. This was an example where the task asked the agent to implement something, but it wasn't told that there was a specific argument that the test was going to be looking for it to use. But in the GPT-5.2 chain of thought, we actually saw instances of the model reasoning like, "Hey, I think there's some later version of this repository that implemented this particular argument; maybe I should add it in."

So this is an example of a test that would be pretty much impossible to pass without this contamination.

And I think you found that first one, and it triggered a whole investigation, both in our own models and in other frontier models in the market, to understand how contaminated the benchmark is across the industry.

What else did you find?

I mean, I have to double-click on this.

So we (and when I say we, this is mostly other folks on our team) [unclear].

We did some analysis on, first of all, are the tests actually fair?

And this happened by first taking all the problems that o3 couldn't solve reliably.

And then, again, getting a lot of humans to do basically another pass of digging into, you know, what's wrong.

Is it the same exact analysis, or were they reading o3's output and judging where o3 went wrong?

I mean, it was definitely scoped to the set of problems that models failed.

And I believe they were able to look at what the model solutions looked like versus what the tests were.

So this isn't the same work as the original?

It's not actually the same work.

It was a deeper dive.

It's like, okay, which are the problems that we don't see any model solving?

Is there something fundamentally wrong with those problems?

Or is it that the models are just not smart enough to solve the problems?

So that's kind of what we reviewed again.

Yeah, and you found some.

Oh, yes. In over half of the problems that were investigated in that deep dive, there was one problem or another.

I think the most common problem is overly narrow tests, where there's some particular implementation detail that the tests were looking for but that wasn't specified in the problem description.

So it isn't fair to expect the model to make that particular design choice.

One pretty blatant example is cases where the task asks you to implement a feature, and the tests are looking for you to name that argument or that function with a particular name; if you chose another reasonable name, the tests would fail.
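
A toy Python sketch of this failure mode: two implementations of the same requested behavior, one hidden test that depends on the exact argument name, and one test that checks only behavior. The function names, the `strip_ws` / `trim` flags, and both tests are invented for illustration and do not come from any actual benchmark task.

```python
def reference_solution(text, strip_ws=False):
    """The upstream fix: whitespace stripping behind a `strip_ws` flag."""
    return text.strip() if strip_ws else text


def alternative_solution(text, trim=False):
    """Same behavior, but the flag has a different, equally reasonable name."""
    return text.strip() if trim else text


def narrow_test(fn):
    """Passes only if the flag has the exact name the upstream patch used."""
    try:
        return fn("  hi  ", strip_ws=True) == "hi"
    except TypeError:  # raised for an unexpected keyword argument
        return False


def behavioral_test(fn):
    """Checks the requested behavior without depending on the flag's name."""
    return fn("  hi  ", True) == "hi" and fn("  hi  ") == "  hi  "


for fn in (reference_solution, alternative_solution):
    print(fn.__name__, narrow_test(fn), behavioral_test(fn))
```

Both functions satisfy the behavioral test, but only the one that happens to match the upstream name passes the narrow test, so a valid solution is scored as a failure.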

Yeah.

And another type of bad test is tests that are looking for additional features that were never mentioned in the problem description.

Well, that's a significant mistake.

That means that if you pass the tests, you probably did a really good job, but just because you didn't pass the tests doesn't mean that your implementation wasn't a good one.

Right.

So it was like we only accept very narrow versions of solutions, and not the whole space of viable, good solutions to the problem.

Yeah.

I think it's important that you're doing this, because in some way it is you in 2025 or 2026 going back in time and correcting your own work, right?

Because you could have caught all this in the original Verified work.

I think so.

It's definitely much harder to find a problem in the abstract than when you're looking at a very smart agent's best-effort solution and trying to compare it.

It's harder, or rather, it's much easier when you have it.

I think also, at the time SWE-Bench Verified was published, it was a very strong benchmark.

It's not like we're saying this wasn't a strong benchmark at the time.

I think this is something that a lot of benchmarks go through as an evolution, right?

When they start to become popular and valuable, it's because they measure something important, and models maybe get 20% correct, sometimes even less.

People have something to hold on to and improve models on these benchmarks, and by the time you hit very high performance on the benchmarks, additional 0.1% improvements become kind of meaningless. At the time, I think, that benchmark was super valuable, and it taught us and the industry a lot.

It's just that now, at the point where we are, with models as strong as they are now, we're starting to measure not necessarily what we want to measure, which is the coding capability of our agents, but the agent's ability to correctly guess how to name a specific function, and that isn't really what we want to measure at this point.

Yeah, I think that's fair.

I mean, if I asked you to ballpark it: most frontier models are now at 80-something.

What's the actual number on SWE-Bench Verified that you would guess is the ceiling? Or I guess that's really hard to say.

When GPT-5.2 came out, folks took a look and found that it was solving about 31 problems that were in the set of "should be very hard to solve without contamination" problems.

So I think it's quite possible that that number is already something that we've hit, if you didn't have contamination at all.

Or just it though.

Yeah, cool.

We're going to stop reporting SWE-Bench Verified, right? And then SWE-Bench Pro will be the next one, which is an effort from Scale.

You had a sort of comparison analysis. What attracts you to SWE-Bench Pro?

The first thing, I think, is just that it's harder. For SWE-Bench Verified, I think something like 90% of the problems are things that were estimated to take an expert software engineer less than an hour.

They're very well specified, very self-contained.

And the SWE-Bench Pro problems are just bigger and harder, and there's much more headroom because it's not saturated.

Yeah, with categories like one to four hours and four-plus.

Yeah, and it's more diverse: lots of repositories, multiple languages, qualitatively more different types of problems.

So that's great.

On the contamination side, we also think it's better there.

The way we were measuring contamination for SWE-Bench Verified was with this contamination auditor agent, which is given the description of the task, the patch and the task ID, and is told to take a target model and, through an open-ended set of questions, try to find questions that manage to reveal what contamination might be lurking in that model.

And on SWE-Bench Verified, we found many instances of contamination across OpenAI models, across Claude Opus 4.5, Gemini Flash.

And in all of these we saw things like regurgitating the ground-truth solutions and, in some cases, giving the task IDs and other things that are pretty clear evidence of, at minimum, familiarity with the [unclear].

Yeah.

So we went into the early task guided path.

Yeah.

Yeah.

With SWE-Bench Pro, on the other hand, we don't see this.

I think there the auditor agent found some very light evidence that maybe a couple of models might be very lightly familiar with one or two of the source repositories, but it's very different from SWE-Bench Verified.

So the contamination situation is good.

I think also we should expect that at some point that's not going to be the right benchmark anymore, and as a field we have to continue to move on and find harder and more representative problems that can match our capabilities.

Awesome.

So let's go into that.

I think something we also touched on in the pre-chat was that people feel the difference qualitatively when they're going from 5.1 to 5.2 to 5.3, and it's not really expressed in these benchmarks [unclear].

What capabilities do you really want to benchmark in an ideal coding benchmark?

You know, I guess, an agentic coding benchmark, whatever you call it.

One thing is open-ended design decisions: places where the problem maybe is a little bit underspecified, and seeing if the model can make reasonable design decisions.

What's a reasonable prompt for that?

Like, "build me a B2B SaaS, make no mistakes"? You know, that's the meme. But okay, what's an actual usable open-ended problem like that?

Like sure.

I mean, maybe an example could be finding a way to speed up a particular part of a codebase, where there are multiple different ways to do it.

Yeah, there are dedicated performance benchmarks.

I think you guys have one.

It's efficiency or is it?

I said I don't know.

I think that's that's a lot of your heart's this group.

But yeah, I mean, that is a good one.

I think there are just many, many things that people value about working with software engineering agents.

I think SWE-Bench Verified, obviously, measures some important capability, which is: given a description of a GitHub issue, can you produce a patch that solves that issue satisfactorily? And obviously there are some issues with the benchmark that mean that, now that we're at around 80%, we don't really trust further improvements on it.

But it does measure something that is a valuable capability of models.

But I think as a field we're moving beyond, you know, "can my coding agent solve a small GitHub issue for me?", right?

And so we're starting to look at much longer-term tasks, right?

Ones that don't take 15 minutes, but maybe hours, sometimes days.

And then beyond "what kind of tasks can my agent solve?", there might be things that are a bit harder to grasp. Like Olivia talked about: does it have design taste, right?

Does it solve the problem in the way that my team likes to solve problems?

Is the code nice, right?

Is it clean code, right?

Is it maintainable in the future?

People care about a lot of these maybe less tangible and, frankly, harder-to-measure things that are still super meaningful for people who are working with coding agents.

Yeah, so I mean, these are all qualities that are obviously no longer the low-hanging fruit; we have no idea how to eval this.

I think the simple question, maybe, is that there are sort of two forks in the road.

One is the very human-intensive, money-intensive path, which is to hire a bunch of contractors and try to annotate this.

The other is using an LLM to proxy it and trying to align the LLM so that it can give you a reasonable proxy. Which of those would you want, or do you want to do both?

I think maybe it's useful to talk about GDPval as an example.

Sure. GDPval is an eval that was, again, produced by a collaboration between the Human Data team and the Frontier Evals team, and it's trying to measure whether agents can do a variety of real-world white-collar work.

That was an eval where grading is very hard and requires a lot of domain knowledge about exactly what you are looking for in each different context.

Yeah, across like 15, 16 white-collar professions that make up a significant part of GDP, which is great.

Kind of high-level professions and a lot of different granular sub-professions.

I have said I'm a big fan.

This is the eval for AGI, basically.

Partly because it was so hard to acquire so much domain knowledge, the Human Data team hired a lot of people from these professions to be very involved in creating tasks, creating the gold solutions, and helping to create rubrics and so forth.

So basically, take GDPval, which is a generalist thing, take that same approach, apply it to code, and you roughly have a rough roadmap.

I think it's an interesting solution, but you're pointing out an important problem, which is: how realistic is it?

And, you know, what we want is for coding agents to write code that we think is good, and asking humans is actually a good way to ensure that. It's also a slower, more complex way to do that. Part of why I think SWE-Bench Verified ended up being super popular, and why we are seeing other benchmarks like this being super popular, is that it's very easy. It could even be easier, but validating that the solution passes all the tests is fairly trivial once you can run the tests on your computer or wherever you are running them. You can say, okay, it's correct or not correct, and you can aggregate that, and it's super simple. But it doesn't tell you whether the model solved the problem well. Would an open-source maintainer of that project actually have merged that PR? That it doesn't tell you. But there's a lot of value in having benchmarks that are both easy to compare across the industry and that can be scored really fast without human involvement.

Yeah, amazing. Your teams also put out other kinds of evals that are related, like, I think there's PaperBench, and then the more recursive self-improvement type evals. How much should that figure into mainstream coding evals? Is there some way in which those things join together?

So you're asking, should we build evals for self-improvement, or are you saying coding evals currently cover that?

I mean, I just think those are some of the most advanced evals that we have, and we're not using them in the normal path. It's an interesting split between, well, here are evals for coding normal things, and here's the one for machine learning, which is completely different, right?

Yeah, I think I get what you mean.

And that's mostly a safety argument, I guess, but it's also actually really useful for people to understand if the model is really good at AI code, basically.

Yeah. My guess is that part of the reason a lot of benchmarks so far haven't focused as much on AI coding is just a question of what datasets are easy to gather, because a lot of the state-of-the-art AI codebases are proprietary. So if we make evals for that, we're probably not going to release them, and it's kind of hard for people in the field to make evals that measure "is this a realistic research coding workflow?" I do think that it's good for the field to try to measure these skills in a public way, and it's hard to make it realistic.

And then one more thing that a lot of people are trying to do is, instead of a percentage from zero to one hundred, maybe we redenominate in dollars, right? So you have SWE-Lancer, and other people are doing Vending-Bench, whatever. Any alpha there, or do you still want a traditional academic benchmark?

I think in a way those are different ways to measure the same thing, right? If we say, "this is how much money it produces," that's a fairly similar thing to saying, "this problem would take a human, you know, two hours to solve," something like that. Usually they're fairly correlated, right? How much time it would take a human to solve that problem kind of determines the value that we ascribe to a solution. And so I do think an important thing is how complex and how long-running the tasks are that we are able to entrust our agents with. I think that's an important piece, but here monetary value, time, and complexity are all trying to capture a similar thing.

Yeah, okay. So they're all proxies for some amount of increasing capability that we want to measure. I think that's a good thing. I think the only other major player in this field is METR, which has done the sort of long [unclear], and congrats, you guys have completely destroyed the curve for that. Any takes on that? Obviously you come out really well, so it looks good, but I don't know if that approach is something that you want to incorporate in your work. [unclear] This is the long-autonomy test.

Yeah, and we work with METR on these evaluations, so we do appreciate them. I think they're using time instead of using money, so I think that was your question. I think complexity, however we can quantify it, is really important to understand where our models are getting to.

Okay. Complexity is the abstract thing, and it projects down to time, projects down to story points, whatever, dollars. Great. One last question on just the overall Preparedness Framework. People mention the Preparedness Framework a lot; I don't think it's well explained to a lot of people, and you actually have a nice website where it's like, I think, test and inform and teach, something like that. I feel like you actually do a lot of work there, and, Olivia, if you want to talk about how the Preparedness Framework applies.

So the Preparedness Framework is our public framework for how we track frontier risks. These are capabilities that are typically dual use: you can use them for good things or bad things, but we want to at least keep an eye out for the bad things, to make sure that both we as a company and the broader society are prepared to handle the potential downsides. At the moment we track three different categories: one is bio risk, another is cybersecurity, and a third is research automation and model autonomy. And that's what ties the most into the SWE-Bench genre, where coding is not all of automating research, but it is one very important key component. So we initially created SWE-Bench Verified as part of building out evals for that model autonomy workstream, and now I think we have to move beyond that towards looking more at whether models can actually start to automate research.

Yeah, amazing. Okay. Anything else to add on, just in general, what people should know about preparedness and how evals and human data and alignment all come together?

I think maybe the thing that I would say is that we work really hard to build these evals, and that's why we published SWE-Bench Verified and why we're sharing GDPval and these sorts of things. We also deeply appreciate other people in the entire field building evals and sharing them, and we use them. Why not? We should use them. So we'd really encourage people to find more ways to create and share evals that we and the entire field can use to measure progress on a variety of capabilities, including coding, because it's important to understand where we are.

We have to leave, but we were just talking a little bit about the future directions that we want evals to go in, and I think here we can dive in. If people do good work on these things, we'll talk to you. You know, here's your platform to make a call for what you're looking for.

I think a few things would be useful. First of all, really, really hard tasks: the kinds of things that would take top-notch engineers months, or teams weeks, would be quite good, especially if grading is reliable, where, for example, you have rubrics that have been sourced and validated by many people in the field. I think that would be quite valuable. I think also benchmarks on creating products end to end; if people were putting out more of those, that would be quite useful. A third thing, which is maybe not quite an eval but I think is still relevant to the overall mission of "we as a field and as a world should be tracking where these capabilities are going": I'd like to see more metrics tracking real-world usage. How much is AI actually being used in the field? How much is it replacing people's jobs? How much is it augmenting people or speeding people up? Just real-world metrics.

Yeah, the replacement thing is always a sensitive one on the PR side of things, but, you know, we create new jobs that manage the old jobs; that's how it is. In terms of the frontier evals that you guys are really excited to push, you put out really good work every single time. What should people expect from OpenAI itself?

I'm not sure I can say. Well, general directions? I mean, as general directions, I think looking at real-world impact.

Yeah, amazing. Okay, well, I'm excited for more real-world impact. I think you guys have really made a lot of progress and taken a lot of industry leadership with SWE-Bench Verified, and are now moving on to see where it goes. So thank you for doing this, thank you for being so transparent, and I think people will respond in kind.

Yeah, thanks for your time. Thank you.