
No Priors · 2026-06-26
Noam Brown on Why Traditional Benchmarks Fail Modern AI Models
Hosts: Sarah Guo
Guests: Noam Brown
Why it matters
Traditional single-number benchmark scores are misleading because they don't account for test-time compute; GPT-5.5 looked like a small improvement on paper but was substantially better in practice due to more efficient reasoning
Key claims
- Traditional single-number benchmark scores are misleading because they don't account for test-time compute; GPT-5.5 looked like a small improvement on paper but was substantially better in practice due to more efficient reasoning
- Modern models can productively think for weeks on hard problems before performance plateaus, making 'run until plateau' infeasible as an evaluation protocol
- Responsible scaling policies and preparedness frameworks don't account for the inference budget, so a model's dangerous capability depends on how much money an actor spends running it
- Model release cycles of 2-3 months mean labs cannot exhaustively evaluate their own models before shipping the next one, and external users are also under-spending on already-released models
Episode summary
OpenAI research scientist Noam Brown argues that standard benchmark grids are misleading because they fail to control for test-time compute. He illustrates this with the GPT-5.5 release, where benchmark scores showed only modest improvements over 5.4, yet users found 5.5 substantially better because it was far more efficient with its thinking time. He proposes that evaluations should either fix a token/cost/time budget or plot performance as a function of test-time compute, rather than reporting single numbers.
A central concern is that responsible scaling policies and preparedness frameworks were designed before test-time compute became important, so they don't account for the fact that a model's capability is now a function of the budget spent on inference. A model that looks benign at $10 of compute could be far more capable at $10,000 or $10 million. Combined with model release cycles of 2-3 months, Brown argues labs cannot fully characterize their models' ceilings before the next one ships, and external users are systematically under-exploring what's already released (e.g., 5.5 could likely have disproven the Erdős unit distance conjecture with $100K of compute).
On the trajectory question, Brown pushes back on the idea of an overnight intelligence explosion, arguing that heavy reliance on test-time compute makes wall-clock time a fundamental bottleneck. He sees AI research taste as the current limiting factor and views multi-agent coordination as a still-underdeveloped frontier. On competition between labs, he describes an intensely competitive but broadly aligned research community. He closes by encouraging the industry to break out of the "bad equilibrium" of publishing uninformative benchmark grids and to embrace budget-controlled evaluations, including for routing-based systems.
- Brown rejects the 'overnight intelligence explosion' framing, arguing that test-time compute dependence makes wall-clock time a hard bottleneck on progress
- Current models accelerate but don't replace research, with 'research taste' as the binding constraint; multi-agent knowledge accumulation is an underexplored frontier
- The industry is stuck in a bad equilibrium where everyone publishes uninformative benchmark grids; Brown urges labs to instead plot performance vs. a token/cost/time axis
- On routing and consensus systems: gains should be evaluated against letting a single model think longer at the same test-time compute budget
Source material
Transcript
With GPT-3, you couldn’t scale test-time compute.
Like, if you gave it a budget of $10 million and said, "Okay, well, let's see what GPT-3 can do," it really can't do that much.
The preparedness frameworks and responsible scaling policies, they don't really account for the amount of test-time compute.
They just say, "Okay, well, what's the capability of the model?"
The problem is, we're in a world now where the capability of the model is a function of how much money you put into it, basically.
If you give it a budget of $10,000, it can do a lot more than what it can do with a budget of $10.
If you give it a budget of $10 million, it can do even more.
At what budget should you evaluate these models?
The policies that exist today don't really address that question.
Hi listeners, I'm Sarah Guo, and welcome back to No Priors.
Today, I'm here with Noam Brown, one of our godfathers of AI reasoning.
We talk about the broken state of evaluations, very large-scale test-time compute, how he thinks about recursive self-improvement, and what's next on the horizon for competition at the frontier.
Welcome.
Noam, I'm so excited to have you back.
That's great to be back, yeah.
You are our first guest.
I'm very proud of my taste in friends and researchers for the pod, given how important inference-time scaling has become to the industry.
You should be proud too, having actually pioneered it.
Played a part, yeah, among many others.
You just wrote this essay that really resonated about large-scale test-time compute and why the industry is not evaluating these models as robustly as it should be.
What was the motivation for it?
Yeah, the motivation was we released 5.5, and the initial reaction was kind of skepticism, that it was a substantially better model.
To be fair, that only lasted for a few hours before people had some time to play around with it and try it out themselves, and they saw that it was actually substantially better.
But I think a lot of the skepticism came from the benchmark grid that was published.
Basically, whenever a new model is released, there is this benchmark grid where they show all these different benchmarks on the x-axis and then the performance of different models on the y-axis.
And you can just compare different models.
It's like a single number for a model on a single benchmark.
And if you look on paper at the difference between 5.5 and 5.4 or other models, it was an improvement, but it wasn't a huge improvement.
It was only a few percentage points in some benchmarks.
So people looked at that, and they were skeptical that it was actually a better model.
Once they played around with it, the story changed.
I think the reason why it doesn't show up as so much better on the benchmarks is because the benchmarks are being presented, the benchmark results are being presented in the wrong way.
They're not controlling for the amount of test time compute that is being used on that benchmark question.
It turned out that 5.5 is just much more efficient with its thinking.
If you run it at max settings, 5.4 is thinking for a lot longer.
It takes longer to get back a response than 5.5.
And once you control for the amount of thinking time, actually you can see that 5.5 is a substantial jump over 5.4.
That is, I think, people's day-to-day experience with it.
And then when I mention this to people, the reaction, the typical question I get is like, "Okay, well, why not just have 5.5 think for as long as 5.4?"
And the question is like, "Well, how long should they think for?"
Typically, the response I get is, "Well, until the performance plateaus."
Right?
This is at some point where the performance on the benchmark is going to plateau, and you just evaluate to that point.
The thing is, the point at which it plateaus is actually really far out these days.
I mean, it's true.
In GPT-3.0 land back in 2022, the models couldn't really think productively for that long.
And so you could just run them until they plateau.
It's not that far away.
But what we're seeing today with the modern models is that 5.5 and other models can think for, if you scaffold them reasonably well, can think for weeks even before having a performance plateau on some of these benchmarks.
And so the point at which they plateau is simply too far out to reasonably test.
We all need to actually enforce either a patience limit or a budget limit from a token perspective now.
And that wasn't true a few years ago.
Exactly.
And so I think the proper way to-- and so my claim is the proper way to evaluate the models now is you either have some kind of budget for the benchmark, whether it's tokens or cost or time or whatever, or you plot the performance as a function of the amount of test time compute that's going into the model.
And then it becomes much more clear how to compare the performance between these different models.
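The budget-controlled comparison described above can be sketched in a few lines of Python. This is only an illustration: the two curves, the model names, and the `score_at_budget` helper are made up, not numbers or tooling from the episode, and linear interpolation between measured points is an assumption.

```python
from bisect import bisect_right

def score_at_budget(curve, budget):
    """Linearly interpolate a (budget, score) curve at a given budget.

    `curve` is a list of (budget, score) pairs sorted by budget.
    Budgets outside the measured range are clamped to the nearest endpoint.
    """
    budgets = [b for b, _ in curve]
    if budget <= budgets[0]:
        return curve[0][1]
    if budget >= budgets[-1]:
        return curve[-1][1]
    i = bisect_right(budgets, budget)
    (b0, s0), (b1, s1) = curve[i - 1], curve[i]
    return s0 + (s1 - s0) * (budget - b0) / (b1 - b0)

# Hypothetical data: (thinking tokens, benchmark score).
# The last point of each curve is that model's "max setting".
model_a = [(1_000, 0.40), (10_000, 0.55), (100_000, 0.70)]
model_b = [(1_000, 0.55), (10_000, 0.68), (30_000, 0.73)]

# What a single-number grid shows: each model at its own max setting.
grid_gap = model_b[-1][1] - model_a[-1][1]

# What a budget-controlled comparison shows: both models at 30,000 tokens.
budget = 30_000
matched_gap = score_at_budget(model_b, budget) - score_at_budget(model_a, budget)

print(f"gap in the grid:       {grid_gap:.3f}")
print(f"gap at matched budget: {matched_gap:.3f}")
```

With these invented numbers, the grid shows a gap of a few points while the matched-budget gap is several times larger, which is the shape of the effect described for 5.5 versus 5.4.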
Given the model evaluation cycle and the fact that performance does not asymptote for many tasks over quite a long period of time, what do you do about that issue?
The fact that some of the evals that you would want to run are both beyond the scope of budget or time that's reasonable given the current model release cycle.
I mean, I think for things like cyber, we've seen-- and actually the AISI in their evaluations has shown that the models continue to improve at 100 million tokens.
If you run them for 100 million tokens, they're still improving beyond that point.
And that can take a very long time to run.
But you also do see that the performance is not just a discontinuous jump.
It's actually like you can see the slope of improvement over those 100 million tokens.
And so you could probably do some kind of evaluation up to a certain budget and then just say, OK, well, this is what we project the performance to look like.
And I think there hasn't been a lot of research on this yet.
I actually think this would be a great paper to publish if there's any academics out there looking for something to research.
Can you predict what the performance looks like at an inference budget of, let's say, $10,000, only using inference budgets up to $10,000?
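One way to picture the projection idea: fit a trend to scores measured at small budgets and extrapolate to a budget that was never run. The sketch below assumes the score grows linearly in the logarithm of the budget, which is purely an illustrative assumption (the episode says this is still an open research question), and the data points and function names are hypothetical.

```python
import math

def fit_log_linear(points):
    """Least-squares fit of score = a + b * log10(budget)."""
    xs = [math.log10(budget) for budget, _ in points]
    ys = [score for _, score in points]
    n = len(points)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

def predict(a, b, budget):
    """Projected score at `budget`, capped at 1.0 since scores are fractions."""
    return min(1.0, a + b * math.log10(budget))

# Hypothetical measurements: (inference budget in dollars, benchmark score).
measured = [(10, 0.20), (100, 0.30), (1_000, 0.40), (10_000, 0.50)]

a, b = fit_log_linear(measured)
projected = predict(a, b, 1_000_000)
print(f"projected score at $1,000,000: {projected:.2f}")
```

Whether real benchmarks follow a trend this clean, and how far such a fit can be trusted beyond the measured range, is exactly what the proposed paper would need to establish.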
So maybe an orthogonal question for you.
Do you think users are systematically not thinking long enough with their models about problems?
What do you mean by not thinking long enough?
If you can build an agent or control the amount of test time compute being used, like there's what is done by the model itself and there's what the user can do.
Do you think that the industry is using an optimal amount of test-time compute, way undershooting it, or is it a problem in the models, where they just need to be able to do that thinking faster?
I think it depends on the problem.
I think this idea that the models, you just let them think for a week or whatever and then they respond.
It sounds nice.
And yes, the benchmarks look great, but it's not very practical when working because you ask the model a question and then you sit there for a week waiting for it to come back to you.
I think what people have found most effective is to iterate quickly with the models.
And so the thinking time, I think, needs to be flexible.
When it makes sense to respond quickly to the user, it should respond quickly.
And then when it makes sense to think for a long time and the user wants it to think for a long time, then it makes sense to think for a long time.
I think people have been striking the right balance given what they have to deal with right now.
How would you characterize, you know, there's a lot of talk about benchmark maxing and the ability to game different benchmarks.
What would you characterize the landscape of benchmarks as today?
And then do you have favorites that you think are more indicative of capability than others?
So the benchmark maxing thing is also a motivation for writing the essay, in that I think it's really easy to show you can do much better than previous models on benchmarks by just, for example, scaffolding a bunch of models together.
So if you say, OK, well, we're going to instead of just running this model once, we're going to run it five times and take the best of the five responses or like ask a judge which one it thinks is best, then you can get much higher scores than that model.
And so it's really easy to make something that looks a lot better on paper, but is actually not better once you control for the amount of test time compute.
That is one thing that I'm worried about when it comes to benchmark maxing.
I mean, it's a little misleading is the only concern that I have.
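A small worked example of why best-of-five looks better on paper. The sketch assumes independent samples and a perfect judge that always picks a correct answer when one exists; both are idealizations, and the success rate and cost figures are invented for illustration.

```python
def best_of_n_success(p_single, n):
    """Chance that at least one of n independent samples is correct.

    Assumes a perfect judge that selects a correct sample whenever one exists.
    """
    return 1.0 - (1.0 - p_single) ** n

def compare(p_single, n, cost_per_run):
    """Report score and cost for a single run versus a best-of-n scaffold."""
    return {
        "single": {"score": p_single, "cost": cost_per_run},
        "best_of_n": {
            "score": best_of_n_success(p_single, n),
            "cost": n * cost_per_run,
        },
    }

# Hypothetical model: solves the task half the time, at $2 per run.
report = compare(p_single=0.5, n=5, cost_per_run=2.0)
for name, row in report.items():
    print(f"{name:10s} score={row['score']:.3f} cost=${row['cost']:.2f}")
```

The scaffold's score is much higher, but it also spends five times the compute, so a fair comparison would pit it against the single model given that same larger budget.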
And then as far as like the benchmarks themselves, I think there is always a risk of like just optimizing for the benchmark.
And I've certainly encouraged my team and I think at OpenAI, we're pretty good about not trying to optimize for specific benchmarks.
But once you put out a benchmark, it's always at risk of just being optimized for.
And I think one way to address that is to keep a held-out private set that isn't publicly available.
The most popular fallback advice for, you know, figure out if a model is significantly better or not is to just play with it for a while.
Do you have anything more sophisticated than that that you suggest people do?
Do you create your own set of new evaluations each time besides private hold back at OpenAI?
I think everybody has their own set of questions that they like to ask the model whenever it comes out.
For me lately, it's been I use them to make poker bots and see how good they can make a poker bot.
I think it's a nice eval because there is very little open source code for making poker bots.
And there's a lot of published essays, there's a lot of published papers on it, but you really have to reason through everything.
And it's like it requires a lot of just reasoning and iteration and like a lot of small gotchas that I can kind of I've already worked through myself so I can see where the models fail along the way.
They've gotten really good at it now.
Can you describe perhaps like with your poker bot creation, like how reasoning might have progressed in model releases for you guys over a few releases?
Yeah, the early models were really bad at it, like they could not basically do anything.
And then with 5.2, I was able to work with it to make a river solver.
So that's like the final stage of poker.
And that itself was, I thought, really impressive, and I had to work with it a little bit.
But I was actually really impressed because I was able to make the river solver probably about five times faster than I would have alone.
There were a couple of things that it got tripped up on.
Blockers was always a big, big issue.
But overall, like, you know, with a bit of gentle steering, it just kind of like it kind of felt like a grad student where, OK, they would run into issues.
But at least like I would know what those issues were and know how to fix it.
I could make suggestions and it would go off and then do it.
And then pretty quickly would actually come back with something really good.
And then especially the optimization, I thought, was very impressive.
It was able to make it like 10 times faster than what I was able to do because it was just able to optimize the code so well.
The downside with 5.2 is I felt like it was gaslighting me a lot.
And I always had to be very careful checking it and making sure like, OK, is it actually doing what it said it did?
Are there any things that are like glaring issues that it's not recognizing or it's just pretending aren't issues?
I remember there was like one point where, for one of the models I was playing around with, not 5.2,
I kind of like as a unit test, I told it, OK, well, let's say I have one hundred dollars in the pot and I fold.
How much am I losing?
And the model said ninety two dollars.
And I was like, that's crazy.
I have one hundred dollars in the pot.
I just folded.
How do I not lose one hundred dollars?
And it said, oh, you know, it's ninety two.
It's close to one hundred.
It's fine.
It's no big deal.
And I was like, clearly this is a problem.
Right.
So the models did have this problem where they would gaslight you a lot.
But once we got to 5.5, I actually thought it was way better.
It was able to basically do it zero shot.
And in fact, I've been working on just doing a full scale poker solver and it's basically able to do the whole thing.
With some gentle steering from me.
And I wouldn't be surprised if, you know, six months or a year from now, the model is able to do zero shot.
An entire poker solver.
Basically my entire PhD thesis in one go.
Let's talk about the larger implications of needing to evaluate these models relative to, let's say, like speed of their reasoning or efficiency versus token volume.
Right.
Or dollar budget or whatever.
Whatever your scalar is.
Can you describe some of the larger implications in your essay, including around like safety evaluations?
Yeah, the safety evaluations thing.
It's a bit of an inconvenient truth thing where.
OK, so I guess for background, all of the labs have these things called either responsible scaling policies or preparedness frameworks.
They go by various names.
But the idea is that whenever a model is released, they go through a series of evaluations to measure.
Are there dangerous capabilities?
Could these models do things that we wouldn't want a bad actor to do?
And if the model isn't very capable, then it's no big deal.
But if it is very capable, if it could be used, for example, to make bioweapons, then you want to put in mitigations against that.
But the question is, OK, well, how do you evaluate whether the model is capable of that?
And they have like various protocols about how they do these evaluations.
But a lot of these frameworks were developed around the era of ChatGPT,
either before or after, when test-time compute scaling was not really as much of a thing.
And it made sense: with GPT-3, you couldn't scale test-time compute.
Like if you gave it a budget of 10 million dollars and said, OK, well, let's see what GPT-3 can do,
it really can't do that much more than what you could do with like 10 dollars or one dollar.
The preparedness frameworks and responsible scaling policies, they don't really account for the amount of test time compute.
They just say, OK, well, what's the capability of the model?
The problem is we're in a world now where the capability of the model is a function of how much money you put into it, basically.
If you give it a budget of 10 thousand dollars, it can do a lot more than what it can do with a budget of 10 dollars.
If you give it a budget of 10 million dollars, it can do even more.
And so at what budget should you evaluate these models?
The policies that exist today don't really address that question.
Some do better than others.
But for the most part, this is not really a factor that's being heavily considered.
Now, whether it should be released anyway, I don't want to wade into this question.
I think there's, you know, there's arguments on both sides.
But I think the important thing to recognize is that this is a question that is not being addressed; we're just kind of, you know, pretending that this issue doesn't exist.
And I think it's important to just, you know, one way or the other account for it.
Yeah, it was the mirror image of the capability question of if the models can continue to do more and more without asymptoting on some tasks at very large budgets, then they should also be able to do so for tasks we don't want them to do as a society.
Right.
And so testing for that and what budget is allocated, it also seems out of sync from the model release cycle itself.
Right.
There's been this acceleration where, you know, you get a new model
sometimes every few days or weeks at this point, versus every six months.
And you have a line in the essay where you say the only way to truly evaluate an agent on some very long running task might be to run it for a year.
And that's going to be true of both like useful and negative tasks.
Right.
And so how do you think about that versus the model release cycle?
Yeah, this is also an interesting dynamic where basically, as the models have become stronger, they're better able to operate over longer horizons.
So again, with GPT-3, if you wanted to run it for, you know, a week, there's really not much you could do to scaffold it into something useful that could actually run for a week.
But we're seeing now with the most recent models that you can actually scaffold, for example, 5.5 into doing a series of experiments that can run for weeks, for months.
Have you given your poker solver tasks like infinite budget?
Yeah, I haven't really scaffolded something together where I just tell it like, OK, just run this for four weeks. I could probably give it slash goal and just, yeah, tell it to go nuts.
But I think at this point it could 100 percent do the river solver if I just give it slash goal.
I don't think it's at the level yet where it could do like the full poker solver if I gave it just slash goal and told it, yeah, go go run for a month.
But we're going to pretty soon be at that point where I probably could just tell it like, yeah, go work on this for a month and then come back to me with a full, complete poker solver that's state of the art.
And the problem is if you want to evaluate the capabilities of a model, what it can do after running for a month, the only way to be fully sure is to actually run it for a month.
And if you want to know after six months, the only way to know fully is to run it for six months.
Now, I'll get to like things we could do to address that a little bit later, but like it's important to recognize the model release cycle is like we're releasing new models like every two or three months at this point.
And so a model comes out, it takes two or three months to push it to its limits, and then you have another model come out.
And so nobody actually knows what the ceiling of capabilities are for these models because nobody's actually run them for long enough to really tell.
When slash goal came out, for example, people started running things that took over a week to finish.
And so people actually didn't realize that this is a big deal until after a week, until a week after it was released.
I think that's going to be more and more true.
You know, the implications of that are, I think, pretty interesting because what do the labs do to like fully evaluate their models before the release?
It's actually very difficult because, yeah, you would have to.
The only way to really do the evaluation is to then delay the model release cycle.
And there's a lot of competitive pressure right now to not do that.
Do you think there's like exciting capability in the models that are already released that people have not fully explored given timeline?
I think absolutely.
I think actually a really great example is the Erdős unit distance problem.
So for the viewers that don't know, we used an internal model at OpenAI a few weeks ago to disprove the Erdős unit distance conjecture.
Now, I'm not a mathematician, but this seems like it was a pretty big deal.
And for the math community, it was like the first problem that a lot of mathematicians had really spent a lot of time on.
And the model was able to do something that they weren't able to do and do it in a way that was actually interesting and useful for mathematicians.
Honestly, it did it at a budget that was dirt cheap.
I mean, we didn't put a lot of effort into this.
We just trained a new model and we were just curious what it could do.
And we ran it through some problems.
And on this one, at a pretty low budget, it was like, oh, yeah, I think I have a disproof.
And then we were able to verify that.
Yeah, the disproof was correct.
After we announced the results, a bunch of people found that you could get the answer out of 5.5 as well.
Now, it's not as simple as just asking 5.5,
"Hey, here's the unit distance conjecture.
What's the disproof?"
You had to scaffold it a bit.
You had to like steer it a bit.
And so somebody found, OK, you ask 5.5 to list a bunch of ways that you could tackle this problem.
And then it lists one of the paths that are actually promising to get to the disproof.
And then you tell it like, OK, explore this some more.
And then if you do this enough times, it actually ends up arriving at the disproof.
Now, what this means is you could, in principle, ask 5.5 to, you know, as a general purpose scaffold, list a bunch of different strategies.
And then for each strategy, tell it to investigate that strategy.
And then it would probably be able to arrive at the disproof with a general purpose scaffold.
Now, that scaffold would be very expensive.
I mean, it would probably cost, I'd just ballpark, like a thousand, to ten, to one hundred thousand dollars.
But it would be possible, and it would have been possible for somebody to disprove the Erdős unit distance conjecture before we did, using a general purpose model.
And nobody had sufficiently explored what happens if I put one hundred thousand dollars worth of compute into 5.5.
What could it do?
And the answer is like, yeah, you probably could get stuff like that out of it.
So people should be experimenting more with the current generation in terms of.
Well, this I think is an interesting question: is it worth it to experiment with?
Because again, the model release cycle is such that every couple of months we put out a new model that's even more powerful.
And so the cost of disproving the Erdős unit distance conjecture drops by like 10 or 100 X with every model release cycle, probably in some cases more.
So you've seen the meme that's like, oh, why bother doing any engineering work when I should just wait for the next one?
Yeah, just go on vacation and come back two months later and then it's, you know, a thousand times cheaper.
So you agree with that?
Is that what you're doing right now at OpenAI?
Just waiting for the next model release?
I think, I mean, I will say we're in a period where progress is very fast and, yeah, the models are becoming more capable.
I can say, like, at OpenAI, one of the things that we're not doing.
And look, we have a lot of mathematicians.
We have a lot of physicists.
People are very excited about what these models can do right now, especially, you know, the internal models.
We are trying to encourage people to not spend all their time just going through all the open mathematical problems and physics problems and pushing the models to their limits to see what they can prove or disprove.
Because we really think the focus should be on how do we make even more capable models?
How can we get them out safely to the world as quickly as possible so that all the scientists in the world can use these models to solve the problems themselves?
So, yeah, in some sense, we are thinking about this that, yes, it's really tempting to just put all of our efforts into scaling of these models and see what they can do at their limits right now.
But really, the focus should be on how do we use these models to make even more powerful models, even more capable models that can do everything much more cost effectively?
What is changing about the direction or allocation of resources for research in your mind, given your beliefs about this very large scale, the impact of very large scale test time compute?
How does this interact with the idea of recursive self-improvement, for example, where it's a dominant idea for how any lab gets to the best capability model?
So one thing I should clarify, I don't think we're at the point where, OK, you just give it an arbitrary, extremely high inference budget, and it's just super intelligent across the board.
Slash goal.
Yeah.
Make GPT-7 or whatever.
And yeah, just go nuts.
What's between us and there then?
I think having played around with the model.
So, OK, so first of all, there are some benchmarks where the models will just not improve if they have more inference budget.
So I think a lot of factual retrieval kind of questions fall into this category of if you ask a person, when was Abraham Lincoln born?
And they don't know the date.
They could sit there.
They could think about it for a week.
If they don't have access to Wikipedia or something, they're not going to be able to do better answering that question if they thought about it for a week compared to five seconds.
Same with the model.
Actually, interestingly enough, if you give the model these kinds of factual retrieval questions and you give it a little bit of time to think, it does actually do better.
But if you give them a week, they're not suddenly going to do better at remembering dates.
So there are some benchmarks where they clearly improve with more test-time compute, and there are some where they don't.
I think on the other extreme, there are benchmarks where they kind of obviously will keep improving without limit with more test-time compute.
So the example I like to point to is Sudoku.
There's a really simple strategy for solving Sudoku, which is to just try a bunch of different random numbers and then see if they fit the criteria.
See if they match all the constraints, and if they don't, just try a different random combination of numbers.
And clearly, with enough time, you will be able to solve any Sudoku puzzle with this strategy.
You can kind of trivially see like, OK, any model could keep doing better and better if it was just given more test time compute.
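The guess-and-check strategy is simple enough to write out. The sketch below uses a 4x4 Sudoku so that random guessing finishes quickly; the puzzle, the function names, and the attempt cap are all illustrative choices, not anything from the episode.

```python
import random

def is_valid_solution(grid):
    """Check a completed 4x4 Sudoku: every row, column and 2x2 box holds 1..4."""
    target = {1, 2, 3, 4}
    rows = [set(row) for row in grid]
    cols = [{grid[r][c] for r in range(4)} for c in range(4)]
    boxes = [
        {grid[r][c] for r in range(br, br + 2) for c in range(bc, bc + 2)}
        for br in (0, 2)
        for bc in (0, 2)
    ]
    return all(group == target for group in rows + cols + boxes)

def solve_by_guessing(puzzle, rng, max_attempts=100_000):
    """Fill the blanks (zeros) with random digits until the grid is valid.

    Returns (solved_grid, attempts_used), or (None, max_attempts) on failure.
    """
    blanks = [(r, c) for r in range(4) for c in range(4) if puzzle[r][c] == 0]
    for attempt in range(1, max_attempts + 1):
        candidate = [row[:] for row in puzzle]
        for r, c in blanks:
            candidate[r][c] = rng.randint(1, 4)
        if is_valid_solution(candidate):
            return candidate, attempt
    return None, max_attempts

# Illustrative puzzle with three blanks, so 4**3 = 64 possible fillings.
puzzle = [
    [1, 2, 3, 4],
    [3, 4, 0, 2],
    [2, 0, 4, 3],
    [4, 3, 2, 0],
]

solved, attempts = solve_by_guessing(puzzle, random.Random(0))
print(f"solved after {attempts} random attempts")
```

More attempts always raise the chance of success here, which is the sense in which this kind of benchmark keeps improving with test-time compute. The search space grows exponentially with the number of blanks, so the same strategy on a full 9x9 puzzle would be impractically slow.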
And all the benchmarks kind of exist somewhere between these two extremes.
The models are not at the level where, if you just give them enough test-time compute, they will be able to do all of our jobs, just because, yeah, there are some benchmarks where they will not improve.
There are some things where they will not improve.
One thing I see for research in particular is they don't have very good research taste right now.
And so I think they're actually a very good complement to researchers. You know, I've found that I'm much more effective using these models, but they're not able to fully replace the whole research cycle.
Now, does that change with time?
Probably.
I mean, I think the models are getting better across the board.
Some things are getting better faster than others, but they're not at the point where they're fully replacing researchers with just enough test-time compute.
Can you give an example or two of asking the model to do a research task where it's just like, this is a terrible idea?
I mean, I think going back to my poker solver example, I was really impressed with the model's ability to optimize the algorithms that I had developed in my PhD.
Honestly, it was shocking to see how inefficient I was in retrospect.
And they were able to make it like, you know, ten, a hundred X faster.
And then I was like, OK, can you come up with an algorithm that is better than the algorithms that I came up with or that anybody else came up with?
You go ahead and like look at all the published work and synthesize that and then try to come up with something novel and it's not able to do it.
And I can give it a lot of time and it's still not able to do it.
Now, it's possible that if I scaffold it somehow and kind of constrain it a bit more, maybe it could eventually come up with something better.
But it would take a lot; it's not as simple as just saying, OK, please come up with a better algorithm.
And how do you think that that gets improved?
What I've seen is with every model release cycle, it does get better at this sort of thing.
It's still bad, in my opinion, but it's not as bad as it used to be.
And I wouldn't be surprised if at some point, same thing with coding, same thing with math, where there's just like this inflection point where suddenly it's actually good enough to be useful.
I wouldn't be surprised if we encounter that point for research taste as well.
Given that, what is your framing of RSI today?
Like, how should we think about it?
The models are definitely accelerating what researchers can do inside the labs, but I think they are accelerating some things and not other things.
And currently we're at the point where, OK, if something goes 100 X faster, you get bottlenecked by the things that don't go 100 X faster.
Over time, the things that we're getting bottlenecked on are going to shrink.
And there will be, I think, kind of a gradual takeoff in that respect.
But right now, it's more about transforming what researchers do rather than fully replacing the researchers.
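The bottleneck argument above has the same shape as Amdahl's law. Here is a minimal sketch, assuming (hypothetically) that some fraction of a researcher's work is accelerated by a fixed factor and the rest is not accelerated at all. The fractions in the loop are illustrative numbers, not figures from the conversation.

```python
def overall_speedup(accelerated_fraction: float, factor: float) -> float:
    """Amdahl's law: overall speedup when only part of the work gets faster.

    accelerated_fraction: share of the original time spent on work that is sped up.
    factor: how many times faster that share becomes (e.g. 100 for "100x").
    """
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must be in [0, 1]")
    if factor <= 0:
        raise ValueError("factor must be positive")
    # Time left over = untouched share + accelerated share shrunk by `factor`.
    remaining = (1.0 - accelerated_fraction) + accelerated_fraction / factor
    return 1.0 / remaining


if __name__ == "__main__":
    # Hypothetical fractions, chosen only to illustrate the shape of the effect.
    for p in (0.5, 0.9, 0.99):
        print(f"{p:.0%} of the work at 100x -> {overall_speedup(p, 100):.1f}x overall")
```

Under these assumptions, the overall gain is capped by the share of work that does not speed up, which is one way to read "you get bottlenecked by the things that don't go 100x faster."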
So that actually implies that you don't think we're close to a very fast takeoff right now.
I think fast takeoff is relative.
Things are moving very fast.
But I think there is this hypothesis that you could have basically an overnight intelligence explosion where the models discover some kind of breakthrough to make themselves smarter and then that leads to more breakthroughs that make themselves even smarter immediately.
And you have, basically in an instant, the models just becoming very superhuman across the board in moments.
And I don't think we're headed to that world largely because of the fact that the models rely so much on large scale test time compute in order to achieve their greatest intelligence.
If it requires so much test-time compute to unlock the full capabilities of the model, then that means you're bottlenecked by time.
Things can only go so fast because the models need to run for long enough to actually do something really, really powerful.
Time itself becomes a bottleneck to what we can do.
And I think that is the case right now for a lot of the labs that ultimately I think the biggest bottleneck for all of us is time.
And that's why all the researchers are working so intensely right now.
So many hours per week are being put into this because we all see what the overhang is.
We see what the capabilities are, and we're just bottlenecked by how quickly we can do things.
What do you think is on the frontier that is less explored now?
Like we've talked about multi agent before.
I think multi-agent is quite underexplored.
I think at sufficient scale there's a lot more that could be done.
But it's also one of the things that's hard to do.
A lot of research is hard to do in small scale.
I think multi-agent in particular, to fully unlock the capabilities, really requires frontier models.
I think we've seen some pretty interesting multi agent scaffolds.
I think they're able to do a lot, but I think it's really just scratching the surface of what it will be able to do.
I mean, one way that I think about it is, if you look at human civilization, it's not that humans evolved to become smarter over the past 50,000 years.
It's that humans are able to do a lot more today than they were back in caveman times because there have been billions of humans thinking for a long time and building off of each other's accumulated knowledge.
We have like very good retrieval and scaffolding versus 50,000 years ago.
It's not even I wouldn't call it a scaffold.
This is like this is a very organic emergent property of just like humans being able to accumulate knowledge, share it and build off of it.
We're not seeing that with AI models today.
They're kind of born into a world, they exist for a very short context window, and then they just disappear.
And yeah, there are things that you can kind of do to like continue them, but it's very limited.
I do think eventually we will and we're starting to see like signs that we're entering a world where they can coordinate on a large scale.
I think Moltbook and OpenClaw, when they first came out, were obviously a bit overhyped, but they were an indication of where things could go in the future.
And I do think that eventually we get to that kind of world of some sort of coordinated compounding state.
Yeah, the ability of the models to share knowledge on a more global level and be able to build on that knowledge productively.
Given this set of beliefs and your work, how would you characterize competition at the frontier between the three kingdoms if there is no overnight takeoff?
It's just researchers grinding away trying to make good high taste algorithmic and investment decisions about where to go and then compute allocation and then policy decisions and eval decisions.
It feels slightly more grounded than, I suppose, racing towards some immediate hard takeoff that nobody can catch you on.
I think the competition is very intense right now.
I do think the models that exist today are accelerating what researchers at the frontier labs can do.
There are obviously, like I said, limits to that right now, but the ability to use the models to improve the model research is a real thing.
And it is like an amplifying force.
I think that will continue to be true.
I think it'll become more true over time.
One thing that I am comforted by is that all the researchers at the frontier labs, all the frontier labs, I think, recognize what is at stake and what the risks of these models are.
And that's something that I find comforting that I think everybody really understands, like, OK, this is a pretty serious thing and it can lead to really great things or it could lead to really bad things.
And yes, there's a competitive dynamic between the labs, but like we can also try to figure out how we all get to the positive outcomes rather than the very negative outcomes.
You know, I'd be remiss not to ask, just because you have been right very early for a long time on the importance of test-time compute and reasoning as a framework.
Are there ways in which you use the models that you would encourage others to try?
Is it just goal everything?
I think for a lot of people, they worked.
I mean, this is probably not even true for your audience necessarily, but there's a lot of people that experimented with AI back in like 2023 and felt like they couldn't trust the outputs and then don't use it for really high stakes decisions.
And actually, I think the models have progressed to a point where they are very good for these kinds of things.
I mean, I ask them for tax advice, or, I bought a condo recently and I was asking it for advice on, OK, what's all the paperwork that I have to fill out and what does it all mean?
It's actually really good for these kinds of questions.
So I use it day to day for a lot of this kind of stuff.
And I think they're at a point now, and have actually been for a while, where I feel like I can just trust the outputs, arguably more than I could trust the output from a human person.
An expert human.
Yeah.
OK.
I have two final questions for you.
One is, is there something you think that the rest of the research community doesn't agree with you on or doesn't understand the importance of quite yet?
This is such a good question.
I wish I had time to think about this ahead of time.
You can just hang out with me and think about it.
Yeah.
It's weird to be like consensus now.
You were a bit salty three years ago when you were like, why don't people understand how important this is?
I still feel like it's not consensus, though, because, you know, people still don't publish the benchmarks this way.
That's true.
Yeah.
I think that's actually why I wrote inertia.
That's kind of.
Yeah, but that's kind of why I wrote the essay.
I was just like, look, I mean, we can talk about this.
But yeah, part of the motivation is that I would talk to researchers about how it makes sense to show the benchmarks with an x-axis. Whether it's tokens or cost or time, there should be an x-axis.
And everybody would say, like, yeah, that makes sense.
We should do that.
But everybody's not acting with the importance of, like, Goodhart: we have to measure the correct thing.
Well, really, the response is people expect us to publish the grid.
And then, well, why do people expect the grid to be published because everybody publishes the grid?
And so you kind of end up in this this bad equilibrium where everybody kind of knows that it's a bad equilibrium.
But like nobody wants to break out.
And I felt like, OK, if I just come out and say, look, guys, let's all recognize that we're in a bad equilibrium and let's move to this different equilibrium where we're plotting things with an x-axis, then hopefully next time there's a model release, a company can feel comfortable not publishing the grid, at least not as the very top line.
And we can have a more productive evaluation of these models.
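As a rough illustration of reporting a benchmark "with an x-axis" instead of a single grid cell, here is a minimal sketch. It assumes each evaluation run is recorded as a (token budget, solved) pair. The record format, the function names, and the sample data are all hypothetical and not taken from any lab's tooling.

```python
from collections import defaultdict
from statistics import mean


def score_by_budget(runs):
    """Turn (budget, solved) records into sorted (budget, accuracy) points."""
    buckets = defaultdict(list)
    for budget, solved in runs:
        buckets[budget].append(1.0 if solved else 0.0)
    return [(budget, mean(outcomes)) for budget, outcomes in sorted(buckets.items())]


def format_curve(label, points):
    """Render the curve as plain text: one row per test-time budget."""
    lines = [f"{label}: accuracy vs. test-time budget (tokens)"]
    for budget, accuracy in points:
        bar = "#" * round(accuracy * 20)
        lines.append(f"{budget:>9,} | {bar} {accuracy:.2f}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Made-up records for a hypothetical model; not real benchmark results.
    runs = [
        (1_000, False), (1_000, True), (1_000, False), (1_000, False),
        (10_000, True), (10_000, False), (10_000, True), (10_000, False),
        (100_000, True), (100_000, True), (100_000, True), (100_000, False),
    ]
    print(format_curve("hypothetical-model", score_by_budget(runs)))
```

The same shape works with cost or wall-clock time as the budget. What matters is that the score is reported as a function of the budget rather than at one unstated setting.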
Then a last question for you.
How do you think about companies across all of these specialized domains who feel the value that they have is essentially like the routing layer, the choice layer of, you know, my goal is composed of a bunch of discrete tasks.
Some require more intelligence, some less.
And my job as a vendor is to solve that problem or achieve the optimal outcome, taking into account the budget constraints.
And so I will manage the parallelization and how much inference to spend on it, from what model.
Because I think the frontier lab point of view is that that routing happens behind the API, behind the application, and then some of it in the model itself.
And pieces of that are clearly being externalized in all of these applications.
Yeah, I do.
I do think this is related to the fact that like benchmarks should be evaluated with an X axis of tokens or cost.
I have seen some evals recently that show, OK, with a routing layer, you can achieve much better performance by basically doing consensus on the models.
Yeah.
And I definitely believe that if you do consensus on the models, they are going to achieve better performance than any individual model.
But it's important to ask, like, are you going to do better than having that model basically think for longer?
Once you control for the amount of test-time compute, is it actually still doing better?
That's the question that you want to figure out.
That's a very principled view, which is, yes, routing is fine, but it's all subject to the same budget question.
Yeah, right.
If you put it on the same scale, then you can make an optimal decision.
And I think maybe I win.
I don't even know necessarily that I would believe that the routing does better.
But then there's still a question of is it going to do significantly better?
Is it very fragile?
Is it reflective of real world use cases compared to benchmarks?
Because one issue you could run into is that you could optimize for certain benchmarks with the routing and then show, oh, yeah, we see this big improvement on these benchmarks.
But in real world use cases, it actually ends up not being a significant improvement.
So I would say, at the very least, you want to control for test-time compute, and then you also want to have all the same skepticism about benchmarks that you would normally have.
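One way to make "control for test-time compute" concrete is to compare a routing or consensus system against a single model at the same total budget. The sketch below assumes you already have a single-model curve of (budget, accuracy) points and the ensemble's accuracy at its total budget. The log-scale interpolation and all the numbers are illustrative assumptions, not a method described in the conversation.

```python
import math


def accuracy_at(curve, budget):
    """Interpolate accuracy at `budget` on a log-budget axis.

    `curve` is a list of (budget, accuracy) points sorted by budget.
    Budgets outside the measured range are clamped to the nearest endpoint.
    """
    if budget <= curve[0][0]:
        return curve[0][1]
    if budget >= curve[-1][0]:
        return curve[-1][1]
    for (b0, a0), (b1, a1) in zip(curve, curve[1:]):
        if b0 <= budget <= b1:
            t = (math.log(budget) - math.log(b0)) / (math.log(b1) - math.log(b0))
            return a0 + t * (a1 - a0)


def matched_compute_gain(single_curve, ensemble_points):
    """Ensemble accuracy minus single-model accuracy at the same total budget."""
    return [(b, acc - accuracy_at(single_curve, b)) for b, acc in ensemble_points]


if __name__ == "__main__":
    # Hypothetical numbers: one model thinking longer vs. a consensus system
    # whose budget is the sum of tokens spent across all of its member calls.
    single = [(1_000, 0.40), (10_000, 0.60), (100_000, 0.80)]
    consensus = [(10_000, 0.65), (100_000, 0.78)]
    for budget, gain in matched_compute_gain(single, consensus):
        print(f"{budget:>8,} tokens: consensus minus single = {gain:+.2f}")
```

A positive gain at matched budget says the routing layer is adding something beyond extra compute. A gain near zero or below suggests the headline improvement mostly came from spending more. The usual benchmark skepticism still applies either way.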
Awesome.
Noam, thanks so much.
And for being on the mission for breaking us out of this false equilibrium.
That's great feedback.
Find us on Twitter at @NoPriorsPod.
Subscribe to our YouTube channel if you want to see our faces.
Follow the show on Apple Podcasts, Spotify or wherever you listen.
That way you get a new episode every week.
And sign up for emails or find transcripts for every episode at no-priors.com.