Latent Space · 2025-06-19

Noam Brown on Scaling Test-Time Compute to Multi-Agent Systems at OpenAI

Hosts: Alessio, swyx

Guests: Noam Brown

test-time compute · reasoning models · o-series · multi-agent systems · self-play · Cicero · diplomacy game · poker AI · model routers · agentic harnesses · reinforcement learning · pre-training vs post-training · Ilya Sutskever · robotics · data efficiency

Why it matters

Cicero (2022) used a ~2.7B parameter language model and was bottlenecked on LM quality.

Key claims

Leads OpenAI's multi-agent team, which is also focused on scaling test-time compute from minutes to hours/days, and has not yet publicly announced its approach
Cicero (2022) used a ~2.7B parameter language model and was bottlenecked on LM quality; today o3 and GPT-4o would pass Turing-style tests in diplomacy chat
Personally won the 2025 World Diplomacy Championship; credits insight gained from watching and debugging Cicero, but did not use the bot in tournament play
Reasoning models require sufficient pre-trained 'system 1' capability to benefit from extra thinking; chain-of-thought on small models like GPT-2 yielded almost nothing

Episode summary

Noam Brown, who leads a multi-agent team at OpenAI, joined Latent Space to discuss the trajectory of reasoning models and where the field is heading. He argued that the test-time compute paradigm (now visible in the o-series) will continue scaling rapidly, with models eventually thinking for hours, days, or longer. He pushed back on the notion that reasoning models only work in easily verifiable domains, pointing to Deep Research as evidence the approach generalizes well into fuzzy, subjective tasks. On the long-running debate about whether pre-training alone would yield superintelligence, Brown said he told Ilya Sutskever in late 2021 that the field would need an additional reasoning paradigm; Sutskever agreed it was needed but was less pessimistic about how hard it would be to find.

Brown was candid about internal OpenAI dynamics, including significant debate about whether the o-series was actually a big deal before o1 shipped. He described a colleague who left before the announcement and only recognized the paradigm's significance after seeing competitor labs pivot their entire research agendas in response. He predicted that common engineering patterns like agentic scaffolds, model routers, and Pokemon-style harnesses will be progressively washed away as base models improve, with OpenAI aiming for a single unified model that handles routing internally.

On multi-agent systems, Brown said the team is taking a 'very principled' and bitter-lesson-aligned approach distinct from existing heuristic methods, with ambitions that range from collaborative problem-solving to competitive self-play. However, he cautioned that the AlphaGo-style self-play narrative breaks down outside two-player zero-sum games, since self-play objectives become ill-defined for tasks like mathematics. He also flagged soft ceilings for test-time compute: rising inference cost and wall-clock time, the latter being especially binding for domains like drug discovery where experiments cannot be parallelized.

Deep Research cited as existence proof that reasoning models succeed in non-verifiable, subjective domains
OpenAI's stated goal is a single unified model rather than separate fast/slow models with external routers
Described internal debate pre-o1 where some leaders and departing employees did not initially grasp the significance of the reasoning paradigm
Two soft ceilings on test-time compute scaling: rising inference cost, and wall-clock time for serial experiments (especially binding in domains like drug discovery)

Source material

Transcript

[MUSIC PLAYING] Hey, everyone.

Welcome to the "Latent Space" podcast.

This is Alessio, partner and CTO at Decibel, and I'm joined by swyx, founder of Smol AI.

Hello, hello.

And we're here recording on a holiday Monday with Noam Brown from OpenAI.

Welcome.

Thank you.

So glad to have you finally join us.

A lot of people have heard you.

You've been rather generous with your time on podcasts like Lex Fridman's, and you've done a TED Talk recently.

Just talking about the thinking paradigm.

But I think maybe perhaps your most interesting recent achievement is winning the World Diplomacy Championship.

In 2022, you built sort of Cicero, which was top 10% of human players.

I guess my opening question is, how has your diplomacy playing changed since working on Cicero and now personally playing it?

When you work on these games, you kind of have to understand the game well enough to be able to debug your bot.

Because if the bot does something that's really radical and that humans typically wouldn't do, you're not sure if that's a mistake or if it's a bug in the system or it's actually just the bot being brilliant.

When we were working on diplomacy, I kind of did this deep dive, trying to understand the game better.

I played in tournaments.

I watched a lot of tutorial videos and commentary videos on games.

And over that process, I got better.

And then also seeing the bot, the way it would behave in these games.

Sometimes it would do things that humans typically wouldn't do.

And that taught me about the game as well.

When we released Cicero, we announced it in late 2022.

I still found the game really fascinating.

And so I kept up with it.

I continued to play.

And that led to me winning the World Diplomacy Championship in 2025, so just a couple months ago.

There's always a question of centaur systems where humans and machines work together.

Was there an equivalent of what happened in Go where you updated your play style?

You're asking if I used Cicero when I played in the tournament.

The answer is no.

Seeing the way the bot played and taking inspiration from that, I think, did help me in the tournament.

Yeah.

Do people now ask Turing questions every single time when they're playing diplomacy?

Ask to try to tell if the person they're playing with is a bot or-- Yeah, that's the one thing you were worried about when you started.

It was really interesting when we were working on Cicero, because we didn't have the best language models.

We were really bottlenecked on the quality of the language models.

And sometimes the bot would say bizarre things.

99% of the time, it was fine.

But then every once in a while, it would say this really bizarre thing.

It would just hallucinate about something.

Somebody would reference something that they said earlier in a conversation with the bot, and the bot would be like, I have no idea what you were talking about.

I never said that.

And then the person would be like, look, you could just scroll up in the chat.

It's literally right there.

And the bot would be like, no, you're lying.

Oh, context windows.

And when it does these kinds of things, people just shrug it off as like, oh, that's just-- the person's tired, or they're drunk, or whatever, or they're just trolling me.

But I think that's because people weren't looking for a bot.

They weren't expecting a bot to be in the games.

We were actually really scared, because we were afraid that people would figure out at one point that there's a bot in these games.

And then they would just always be on the lookout for it.

And if you're looking for it, you're able to spot it.

That's the thing.

So I think now that it's announced and that people know to look for it, I think they would have an easier time spotting it.

Now, that said, the language models have also gotten a lot better since 2022.

It's adversarial.

Yeah.

So at this point, the truth is, GPT-4o and o3, these models are passing the Turing test.

So I don't think there are really that many Turing-test questions they can ask that would actually make a difference.

And Cicero was very small, like 2.7B, right?

It was a very small language model, yeah.

It was one of the things we realized over the course of the project that, oh, yeah, you really benefit a lot from just having larger language models.

Right, yeah.

How do you think about today's perception of AI and a lot of maybe the safety discourse of, you know, you're going to build a bot that is really good at persuading people into helping them win a game?

And I think maybe today, labs want to say they don't work on that type of problem.

How do you think about that dichotomy, so to speak, between the two?

Honestly, after we released Cicero, a lot of the AI safety community was really happy with the research and the way it worked, because it was a very controllable system.

We conditioned Cicero on certain concrete actions, and that gave it a lot of steerability to say, OK, well, it's going to pursue a behavior that we can very clearly interpret and very clearly define.

It's not just like, oh, it's a language model like running loose and doing whatever it feels like.

No, it's actually pretty steerable, and there's this whole reasoning system that steers the way the language model interacts with the human.

Actually, a lot of researchers reached out to me and said, we think this is potentially a really good way to achieve safety with these systems.

I guess the last diplomacy-related question that we might have is, have you updated or tested o-series models on diplomacy, and would you expect a lot more difference?

I have not.

I think I said this on Twitter at one point.

I think this would be a great benchmark.

I would love to see all the leading bots play a game of diplomacy with each other and see who does best.

And I think a couple of people have taken inspiration from that and are actually building out these benchmarks and evaluating the models.

My understanding is that they don't do very well right now.

But I think it really is a fascinating benchmark, and I think it would be a really cool thing to try out.

Well, we're going to go a little bit into o-series now.

I think the last time you did a lot of publicity, you were just launching o1, you did your TED Talk and everything.

How have the vibes changed just in general?

You said you were very excited to learn from domain experts in chemistry, like how they review the o-series models.

How have you updated since, let's say, end of last year?

I think the trajectory was pretty clear pretty early on in the development cycle.

And I think that everything that's unfolded since then has been pretty on track for what I expected.

So I wouldn't say that my perception of where things are going has honestly changed that much.

I think we're going to continue to see-- I said before that we're going to see this paradigm continue to progress rapidly.

And I think that that's true even today, that we saw that with going from o1-preview to o1 to o3, consistent progress.

And we're going to continue to see that going forward.

And I think that we're going to see a broadening of what these models can do as well.

We're going to start seeing agentic behavior.

We're already starting to see agentic behavior.

Honestly, for me, o3, I've been using it a ton in my day-to-day life.

I just find it so useful, especially the fact that it can now browse the web and do meaningful research on my behalf.

It's kind of like a mini Deep Research where you can just get a response in three minutes.

So yeah, I think it's just going to continue to become more and more useful and more powerful as time goes on and pretty quickly.

Yeah, and talking about Deep Research, you tweeted that if you need proof that we can do this in non-verifiable domains, Deep Research is kind of like a great example.

Can you maybe talk about if there's something that people are missing?

I feel like I hear that repeated a lot.

It's easy to do in coding and math, but not in these other domains.

I frequently get this question, including from pretty established AI researchers, that, OK, we're seeing these reasoning models excel in math and coding and these easily verifiable domains, but are they ever going to succeed in domains where success is less well-defined?

I'm surprised that this is such a common perception because we've released Deep Research, and people can try it out.

People do use it.

It's very popular.

And that is very clearly a domain where you don't have an easily verifiable metric for success.

It's very like, what is the best research report that you could generate?

And yet these models are doing extremely well at this domain.

So I think that's an existence proof that these models can succeed in tasks that don't have as easily verifiable rewards.

Is it because there's also not necessarily a wrong answer?

There's a spectrum of deep research quality.

You can have a report that looks good, but the information is kind of so-so, and then you have a great report.

Do you think people have a hard time understanding the difference when they get the result?

My impression is that people do understand the difference when they get a result.

And I think that they're surprised at how good the deep research results are.

They're certainly-- it's not 100%.

It could be better, and we're going to make it better.

But I think people can tell the difference between a good report and a bad report, and certainly a good report and a mediocre report.

And that's enough to feed the loop later to build the product and improve the model performance.

I think if you're in a situation where people can't tell the difference between the outputs, then it doesn't really matter if you're hill climbing on progress.

These models are going to get better at domains where there is a measure of success.

Now, I think this idea that it has to be easily verifiable or something like that, I don't think that's true.

I think that you can have these models do well, even in domains where success is a very difficult to define thing, could sometimes even be subjective.

One analogy people lean on a lot, and you have as well, is the Thinking, Fast and Slow analogy for thinking models.

And I think the idea that this is the next scaling paradigm is reasonably well diffused now.

All analogies are imperfect.

What is one way in which thinking fast and slow or system 1, system 2 doesn't transfer to how we actually scale these things?

One thing that I think is underappreciated is that the models, the pre-trained models, need a certain level of capability in order to really benefit from this extra thinking.

This is why you've seen the reasoning paradigm emerge around the time that it did.

I think it could have happened earlier.

But if you tried to do the reasoning paradigm on top of GPT-2, I don't think it would have gotten you almost anything.

Is this emergence?

Hard to say if it's emergence necessarily, but I haven't done the measurements to really define that clearly.

But I think it's pretty clear.

People tried chain of thought with really small models.

And they saw that it just didn't really do anything.

Then you go to bigger models, and it starts to give a lift.

I think there's a lot of debate about the extent to which this kind of behavior is emergent.

But clearly, there is a difference.

So it's not like there are these two independent paradigms.

I think that they are related in the sense that you need a certain level of system one capability in your models in order to be able to benefit from system two.

I have tried to play amateur neuroscientist before and tried to compare it to the evolution of the brain and how you have to evolve the cortex first before you evolve the other parts of the brain.

And perhaps that is what we're doing here.

Yeah.

You could argue that actually this is not that different from the system one, system two paradigm.

Because if you ask a pigeon to think really hard about playing chess, it's not going to get that far.

It doesn't matter if it thinks for 1,000 years.

It's not going to be able to be better at playing chess.

So maybe it's also true with animals and humans that you need a certain level of intellectual ability just in terms of system one in order to benefit from system two as well.

Yeah.

Just a side tangent, does this also apply to visual reasoning?

So let's say we have-- now we have the GPT-4o natively omni-modal type of thing.

Then that also makes o3 really good at GeoGuessr.

Does that apply to other modalities too?

I think the evidence is yes.

It depends on exactly the kinds of questions that you're asking.

There are some questions that I think don't really benefit from system two.

I think GeoGuessr is certainly one where you do benefit.

I think image recognition, if I had to guess, is one of those things that you probably benefit less from system two thinking.

Because you know it or you don't.

Yeah, exactly.

There's no way.

Yeah.

And the thing I typically point to is just information retrieval.

If somebody asks you, when was this person born, and you don't have access to the web, then you either know it or you don't.

And you can sit there and you can think about it for a long time.

Maybe you can make an educated guess.

So you can say, well, this person probably lived around this time.

And so this is a rough date.

But you're not going to be able to get the date unless you actually just know it.

But spatial reasoning like tic-tac-toe might be better because you have all the information there.

Yeah.

And I think it's true that with tic-tac-toe, we see that GPT-4.5 falls over.

It plays decently well.

I shouldn't say it falls over.

It does reasonably well.

You can draw the board.

It can make legal moves.

But it will make mistakes sometimes.

And you really need that system two to enable it to play perfectly.

Now, it's possible that if you got to GPT-6 and you just did system one, it would also play perfectly.

I guess we'll know one day.

But I think right now, you would need a system two to really do well.
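As a rough illustration of why explicit search is enough for perfect tic-tac-toe, here is a minimal sketch of exhaustive minimax over the whole game tree. It is only a stand-in for "system two" and does not describe how a reasoning model works internally. The board encoding (a 9-character string of "X", "O", and spaces) and the names `winner`, `value`, and `best_move` are assumptions made for this example, not anything from the episode.

```python
from functools import lru_cache

# The eight winning lines on a 3x3 board indexed 0..8, row by row.
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]


def winner(board):
    """Return 'X' or 'O' if that player has completed a line, else None."""
    for a, b, c in LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None


def other(player):
    return "O" if player == "X" else "X"


@lru_cache(maxsize=None)
def value(board, player):
    """Game value from X's point of view (+1 win, 0 draw, -1 loss), with `player` to move."""
    won = winner(board)
    if won is not None:
        return 1 if won == "X" else -1
    if " " not in board:
        return 0
    scores = [value(board[:i] + player + board[i + 1:], other(player))
              for i, cell in enumerate(board) if cell == " "]
    # X maximizes the value, O minimizes it.
    return max(scores) if player == "X" else min(scores)


def best_move(board, player):
    """Pick an optimal square for `player`; assumes at least one empty square."""
    moves = [i for i, cell in enumerate(board) if cell == " "]

    def score(i):
        return value(board[:i] + player + board[i + 1:], other(player))

    return max(moves, key=score) if player == "X" else min(moves, key=score)
```

Because the game tree is tiny, the search is exact: `value(" " * 9, "X")` evaluates to 0, reflecting that tic-tac-toe is a draw under perfect play.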

What do you think are the things that you need in system one?

So obviously, general understanding of game rules.

Do you also need to understand some sort of metagame, like how you usually value pieces in different games, so that you generalize in system one and then in system two you can get to the gameplay, so to speak?

I think the more that you have in the system one-- this is the same thing with humans.

Humans, when they're playing a game like chess for the first time, can apply a lot of system two thinking to it.

And if you apply a ton of system two thinking to it, if you just present a really smart person with a completely novel game and you tell them, OK, you're going to play this game against an AI or a human that's mastered this game.

And you tell them to sit there and think about it for three weeks about how to play this game.

My guess is they could actually do pretty well.

But it certainly helps to build up that system one thinking, like build up intuition about the game because it will just make you so much-- yeah, so much faster.

I think the Pokemon example is a good one, where system one kind of has maybe all this information about games.

And then once you put it in the game, it still needs a lot of harnesses to work.

And I'm trying to figure out how much of-- can we take things from the harness and have them in system one so that then system two is as harness-free as possible?

But I guess that's the question about generalizing games and AI.

Yeah, I guess I view that as a different question.

I think the question about harnesses, in my view, is that the ideal harness is no harness.

I think harnesses are like a crutch that eventually we're going to be able to move beyond.

So only tool calls.

And you could ask-- you could just ask o3.

And actually, it's interesting because when this playing Pokemon thing kind of emerged as this benchmark, I was actually pretty opposed to eval-ing this with our OpenAI models because my feeling is like, OK, if we're going to do this eval, let's just do it with o3.

How far does o3 get without any harness?

How far does it get playing Pokemon?

And the answer is not very far.

And that's fine.

I think it's fine to have an eval where the models do terribly.

And I don't think the answer to that should be like, well, let's build a really good harness so that now it can do well in this eval.

I think the answer is like, OK, well, let's just improve the capabilities of our models so they can do well at everything.

And then they also happen to make progress on this eval.

Would you consider things like checking for a valid move, a harness, or is this in the model?

You know, like chess.

It's like you can either have the model learn in system one what moves are valid and what it can and cannot do versus-- Search.

--in system two figuring out what are not.

I think there's like-- a lot of this is design questions.

Like for me, I think you should give the model the ability to check if a move is legal if you want.

Like that could be an option in the environment of like, OK, here's an action that you can-- like a tool call you can make to see if an action is legal.

If it wants to use that, it can.

And then there's like the design question of like, well, what do you do if the model makes an illegal move?

And I think it's totally reasonable to say like, well, if they make an illegal move, then they lose the game.

Like, I don't know.

What happens when a human makes an illegal move in a game of chess?

I actually don't know.

I haven't played chess that much.

Yeah, me neither.

Just not allowed to?

Yeah.

Like, do you just lose the game?

I don't know.

So if that's the case, then I think it's totally reasonable to say, yeah, we're going to have an eval where that's also the criteria for the AI models.

Yeah, but I think like maybe one way to interpret that in sort of researcher terms is, are you allowed to do search?

And one of the famous findings from DeepSeek is that MCTS wasn't that useful to them.

But I think like there are a lot of engineers trying out search and spending a lot of tokens doing that.

And maybe it's not worth it.

Well, I'm making a distinction here between like a tool call to check whether a move is legal or illegal is different from actually making that move and then seeing whether it ended up being legal or illegal.

Right.

So if that tool call is available, I think it's totally fine to make that tool call and check whether a move is legal or illegal.

I think it's different to have the model say, oh, I'm making this move.

Yeah.

And then it gets feedback that like, oh, you made an illegal move.

And so then it's like, oh, just kidding.

Like, I'm going to do something else now.

So that's the distinction I'm drawing.
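The distinction can be sketched in a few lines, under assumptions that are ours rather than Brown's: a toy tic-tac-toe environment in which `is_legal` is a read-only tool call, `play` commits a move with no undo, and an illegal committed move forfeits the game (the rule he floats as reasonable). The class and function names are hypothetical.

```python
class TicTacToeEnv:
    """Toy environment: committed moves are irreversible, illegal moves forfeit."""

    def __init__(self):
        self.board = [" "] * 9
        self.forfeited_by = None  # set to the player who made an illegal move

    def is_legal(self, square):
        """Read-only tool call: inspects the state but never changes it."""
        return (self.forfeited_by is None
                and 0 <= square < 9
                and self.board[square] == " ")

    def play(self, square, player):
        """Commit a move. There is no 'just kidding': an illegal move loses the game."""
        if not self.is_legal(square):
            self.forfeited_by = player
            return False
        self.board[square] = player
        return True


def careful_agent(env, preferences, player):
    """Check candidates with the tool first, then commit to the first legal one."""
    for square in preferences:
        if env.is_legal(square):
            return env.play(square, player)
    return False  # nothing legal among the candidates, so nothing is committed


env = TicTacToeEnv()
env.play(4, "X")                          # X takes the center
played = careful_agent(env, [4, 0], "O")  # O wants the center, checks, settles for 0
blunder = env.play(4, "X")                # X commits without checking and forfeits
```

The agent that queries `is_legal` never pays for a bad guess, while the one that acts first and waits for feedback has already lost by the time it learns the move was illegal.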

Some people have tried to classify that second type of playing things out as test time compute.

You would not classify that as test time compute.

There's a lot of reasons why you would not want to rely on that paradigm when you're going to-- imagine you have a robot, and your robot takes some action in the world, and it breaks something.

You're just like, oh, you can't say, oh, just kidding.

I didn't mean to do that.

I'm going to undo that action.

The thing is broken.

So if you want to simulate what would happen if I move the robot in this way, and then in the simulation, you saw that this thing broke, and then you decide not to do that action, that's totally fine.

But you can't just undo actions that you've taken in the world.

There's a couple more things I wanted to cover in this rough area.

I actually had an answer on the thinking fast and slow side, which maybe-- I'm curious what you think about.

A lot of people are trying to put in effectively model router layers, let's say between the fast response model and the long thinking model.

Anthropic is explicitly doing that.

And I think there is a question about always, do you need a smart judge to route, or do you need a dumb judge to route because it's fast?

So when you have a model router, let's say you're passing requests between system one side and system two side, does the router need to be as smart as the smart model or dumb to be fast?

I think it's possible for a dumb model to recognize that a problem is really hard and that it won't be able to solve it and then route it to a more capable model.

But it's also possible for a dumb model to be fooled or to be overconfident.

I don't know.

I think there's a real trade-off there.
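One way to make that trade-off concrete is a confidence-threshold router, sketched below. Everything here is a hypothetical illustration: the `Router` class, the idea that the fast model returns a self-reported confidence, and the 0.8 threshold are assumptions for the example, not a description of any lab's system.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class Router:
    # Cheap model: returns (answer, self-reported confidence in [0, 1]).
    fast_model: Callable[[str], Tuple[str, float]]
    # Expensive model: returns an answer only.
    slow_model: Callable[[str], str]
    threshold: float = 0.8

    def answer(self, prompt: str) -> Tuple[str, str]:
        """Return (answer, which model produced it)."""
        draft, confidence = self.fast_model(prompt)
        if confidence >= self.threshold:
            return draft, "fast"
        # The cheap model admits it is unsure, so escalate.
        return self.slow_model(prompt), "slow"


def calibrated_fast(prompt):
    """Toy fast model that knows what it does not know."""
    return ("4", 0.95) if prompt == "2+2" else ("not sure", 0.30)


def overconfident_fast(prompt):
    """Toy fast model that is always sure of itself, even when wrong."""
    return ("wrong guess", 0.99)


def slow(prompt):
    """Toy stand-in for a long-thinking model."""
    return "carefully reasoned answer"


good = Router(calibrated_fast, slow)
fooled = Router(overconfident_fast, slow)
```

With the calibrated toy model, hard prompts escalate to the slow model. With the overconfident one, the router never escalates, which is the failure mode described above: the routing decision is only as good as the cheap model's sense of its own limits.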

But I will say, I think there are a lot of things that people are building right now that will eventually be washed away by scale.

So I think harnesses are a good example, where I think eventually the models are going to be-- and I think this actually happened with the reasoning models.

Like before the reasoning models emerged, there was all of this work that went into engineering these agentic systems that made a lot of calls to GPT-4o or these non-reasoning models to get reasoning behavior.

And then it turns out, oh, we just created reasoning models.

And you don't need this complex behavior.

In fact, in many ways, it makes it worse.

You just give the reasoning model the same question without any sort of scaffolding.

And it just does it.

Now, you can still-- and so people are building scaffolding on top of the reasoning models right now.

But I think in many ways, those scaffolds will also just be replaced by the reasoning models and models in general becoming more capable.

And similarly, I think things like these routers, we've said pretty openly that we want to move to a world where there is a single unified model.

And in that world, you shouldn't need a router on top of the model.

So I think that the router issue will eventually be solved also.

Like you're building the router into the model weights itself.

I don't think there will be a benefit for-- I shouldn't say, because I could be wrong about this.

And certainly, maybe there's reasons to route to different model providers or whatever.

But I think that routers are going to eventually go away.

And I can understand why it's worth doing it in the short term, because the fact is, it is beneficial right now.

And if you're building a product and you're getting a lift from it, then it's worth doing right now.

One of the tricky things I'd imagine that a lot of developers are facing is that you have to plan for where these models are going to be in six months and 12 months, and that's very hard to do because things are progressing very quickly.

You don't want to spend six months building something and then just have it be totally washed away by scale.

But I think I would encourage developers, when they're building these kinds of things, like scaffolds and routers, keep in mind that the field is evolving very rapidly.

Things are going to change in three months, let alone six months.

And that might require radically changing these things around or tossing them out completely.

So don't spend six months building something that might get tossed out in six months.

It's so hard, though.

Everyone says this.

And then no one has concrete suggestions on how.

What about reinforcement fine tuning?

Is it something that, obviously, you just released it a month ago at OpenAI?

Is it something people should spend time on right now or maybe wait until the next jump up?

I think reinforcement fine tuning is pretty cool.

And I think it's worth looking into because it's really about specializing the models for the data that you have.

And I think that's something that's worth looking into for developers.

We're not suddenly going to have that data baked into the raw model a lot of times.

So I think that's kind of a separate question.

So creating the environment and the reward model is the best thing people can do right now.

I think the question that people have is, should I rush to fine tune the model using RFT?

Or should I build the harness to then RFT the models as they get better?

I think the difference is that for reinforcement fine tuning, you're collecting data that's going to be useful as the models improve as well.

So if we come out with future models that are even more capable, you could still fine tune them on your data.

That's, I think, actually a good example where you're building something that's going to complement the model's scaling and becoming more capable rather than necessarily getting washed away by the scale.
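The reason this kind of asset outlives any one model can be shown with a small sketch: a dataset of (prompt, reference) pairs and a grader that scores outputs, neither of which depends on the model being scored. This is a generic illustration, not OpenAI's reinforcement fine-tuning API; the names `grade` and `evaluate`, the exact-match scoring rule, and the toy models are all assumptions.

```python
def normalize(text):
    """Lowercase and collapse whitespace so trivial formatting differences do not matter."""
    return " ".join(text.lower().split())


def grade(reference, output):
    """Reward in [0, 1]; here a simple exact match after normalization."""
    return 1.0 if normalize(reference) == normalize(output) else 0.0


def evaluate(model, dataset):
    """Mean reward of any callable `model` over (prompt, reference) pairs."""
    return sum(grade(ref, model(prompt)) for prompt, ref in dataset) / len(dataset)


# Hypothetical domain data collected once by the developer.
dataset = [
    ("Capital of France?", "Paris"),
    ("2 + 2?", "4"),
]


def older_model(prompt):
    """Toy stand-in for today's model: gets one of two right."""
    return "paris" if "France" in prompt else "5"


def newer_model(prompt):
    """Toy stand-in for a later, more capable model."""
    return {"Capital of France?": "Paris", "2 + 2?": "4"}[prompt]


older_score = evaluate(older_model, dataset)
newer_score = evaluate(newer_model, dataset)
```

The same `dataset` and `grade` are reused unchanged when the model is swapped, which is the sense in which the data complements model scaling rather than being washed away by it.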

One last question on Ilya.

You mentioned on, I think, the Sarah and Elad podcast where you had this conversation with Ilya a few years ago about more RL and reasoning and language models.

Just any speculation or thoughts on why his attempt when he tried it, it didn't work or the timing wasn't right and why the time is right now?

I don't think I would frame it that way that his attempt didn't work.

In many ways it did.

So Ilya, for me, I saw that in all of these domains that I had worked on, in poker and Hanabi and diplomacy, having the models think before acting made a huge difference in performance.

Like orders of magnitude difference.

Like 10,000 times.

Yeah, like 1,000 to 100,000 times.

It's the equivalent of a model that's 1,000 to 100,000 times bigger.

And in language models, you weren't really seeing that.

The models would just respond instantly.

Some people in the LLM field were convinced that, OK, we just keep scaling, pretraining, we're going to get to super intelligence.

And I was kind of skeptical of that perspective.

In late 2021, I was having a meal with Ilya.

He asked me what my AGI timelines are, a very standard SF question.

And I told him, look, I think it's actually quite far away because we're going to need to figure out this reasoning paradigm in a very general way.

And with things like LLMs, LLMs are very general, but they don't have a reasoning paradigm that's very general.

And until they do, they're going to be limited in what they can do.

We're going to scale it.

We're going to scale these things up by a few more orders of magnitude.

They're going to become more capable.

But we're not going to see super intelligence from just that.

And yes, if we had a quadrillion dollars to train these models, then maybe we would.

But you're going to hit the limits of what's economically feasible before you get to super intelligence, unless you have a reasoning paradigm.

And I was convinced incorrectly that the reasoning paradigm would take a long time to figure out because it's like this big unanswered research question.

And Ilya agreed with me.

And he said, yeah, I think we need this additional paradigm.

But his take was that maybe it's not that hard.

I didn't know it at the time, but he and others at OpenAI had also been thinking about this.

They'd also been thinking about RL.

They had been working on it.

And I think they had some success.

But with most research, you have to iterate on things.

You have to try out different ideas.

You have to try out different things.

And then also, as the models become more capable, as they become faster, it becomes easier to iterate on experiments.

And I think that the work that they did, even though it didn't result in a reasoning paradigm, it all builds on top of previous work.

So they built a lot of things that over time led to this reasoning paradigm.

For listeners, Noam can talk about this.

But the rumor is that that thing was codenamed GPT 0, if you want to search for that line of work.

I think there was a time where basically, RL went through a dark age, when everyone went all in on it.

And then nothing happens.

And they gave up.

And now it's the Golden Age again.

So that's what I'm trying to identify.

Why?

What is it?

And it could just be that we have smarter base models and better data.

I don't think it's just that we have smarter base models.

I think it's that.

Yeah.

We did end up getting a big success with reasoning.

But I think it was, in many ways, a gradual thing.

To some extent, it was gradual.

There were signs of life.

And then we iterated and tried out some more things.

We got better signs of life.

I think it was around November 2023 or October 2023, when I think I was convinced that we had very conclusive signs of life that, oh, this is the paradigm.

And it's going to be a big deal.

That was, in many ways, a gradual thing.

I think what OpenAI did well is when we got those signs of life, they recognized it for what it was and invested heavily in scaling it up.

And I think that's ultimately what led to reasoning models arriving when they did.

Was there any disagreement internally, especially because OpenAI pioneered pre-training scaling and "compute is all you need"?

And then you're saying maybe that's not how we get there.

Was it clear to everybody that, OK, this is going to work?

Or was it controversial?

There's always different opinions about this stuff.

I think there were some people that felt that pre-training was all we need.

That we scale it up to infinity and we're there.

I think a lot of the leadership actually at OpenAI recognized that there was another paradigm that was needed.

And that was why they were investing all this research effort into this RL stuff.

And I think that's also to the credit of OpenAI, that, OK, yes, they figured out the pre-training paradigm.

And they were very focused on scaling it up.

In fact, the vast majority of resources were focused on scaling it up.

But they also recognized the value that something else was going to be needed.

And it was worth putting research or effort into other directions to figure out what that extra paradigm was going to be.

There was a lot of debate about, first of all, what is that extra paradigm?

So I think a lot of the researchers looked at reasoning and RL as not really being about scaling test-time compute.

It was more about data efficiency.

Because the feeling was that we have tons and tons of compute, but we actually are more limited by data.

So there's the data wall.

And we're going to hit that before we hit limits on the compute.

So how do we make these algorithms more data efficient?

They are more data efficient.

But I think they are also the equivalent of scaling up compute by a ton.

That was interesting.

There was a lot of debate around, OK, what exactly are we doing here?

And then I think also, even when we got the signs of life, I think there was a lot of debate about the significance of it.

That, OK, how much should we invest in scaling up this paradigm?

I think especially when you're in a small company, OpenAI in 2023 was not as big as it is today.

And compute was more constrained than it is today.

And if you're investing resources in a direction, that's coming at the expense of something else.

And so if you look at these signs of life on reasoning, and you're saying, OK, well, this looks promising.

We're going to scale this up by a ton and invest a lot more resources into it.

Where are those resources coming from?

You have to make that tough call about where to draw the resources from.

And that is a very controversial, very difficult call to make that makes some people unhappy.

And I think there was debate about whether we're focusing too much on this paradigm, whether it's really a big deal, whether we would see it generalize and do various things.

And I remember it was interesting that I talked to somebody who left OpenAI after we had discovered the reasoning paradigm, but before we announced o1.

And they ended up going to a competing lab.

I saw them afterwards, after we announced o1.

And they told me that at the time, they really didn't think this reasoning thing, these o-series, the Strawberry models, were that big of a deal.

It was like they thought we're making a bigger deal of it than it really deserved to be.

And then when we announced o1 and they saw the reaction of their coworkers at this competing lab, everybody was like, oh, crap, this is a big deal.

And they pivoted the whole research agenda-- Oh, my god.

--to focus on this.

Then they realized, oh, actually, this maybe is a big deal.

A lot of this seems obvious in retrospect.

But at the time, it's actually not so obvious and can be quite difficult to recognize something for what it is.

I mean, OpenAI has a great history of just making the right bet.

GPT models are kind of similar, where it started with games and RL.

And then it's like, maybe we can just scale these language models instead.

And I'm just impressed by the leadership and, obviously, the research team that keeps coming up with these insights.

Looking back on it today, it might seem obvious that, oh, of course, these models get better with scale.

So you should just scale them up a ton and it'll get better.

But it really is-- the best research is obvious in retrospect.

And at the time, it's not as obvious as it might seem today.

Follow-up question on data efficiency.

This is a pet topic of mine.

It seems that our current methods of learning are so inefficient still.

Compared to the existence proof of humans, we take five samples and we learn something.

Machines-- 200?

Maybe per whatever data point you might need.

Anyone doing anything interesting in data efficiency?

Or do you think there's just a fundamental inefficiency that machine learning has that will just always be there compared to humans?

I think it's a good point that if you look at the amount of data these models are trained on and you compare it to the amount of data that a human observes to get the same performance, I guess, pre-training, it's a little hard to make an apples-to-apples comparison.

Because I don't know, how many tokens does a baby actually absorb when they're developing?

But I think it's a fair statement to say these models are less data efficient than humans.

And I think that that's an unsolved research question and probably one of the most important unsolved research questions.

Maybe more important than algorithmic improvements.

Because we can increase the supply of data out of the existing set of the world and humans.

I guess that's a good-- so a couple thoughts on that.

One is that the answer might be an algorithmic improvement.

Maybe algorithmic improvements do lead to greater data efficiency.

And the second thing is that it's not like humans learn from just reading the internet.

So I think it's certainly easiest to learn from just data that's on the internet.

But I don't think that's the limit of what data you could collect.

The last follow-up before we change topics to coding, any other anecdotes or insights from Ilya just in general?

Because you've worked with them, so there's not that many people that we can talk to that have worked with them.

I think I've just been very impressed with his vision.

That I think, especially when I joined and I saw the internal documents at OpenAI of what he had been thinking about back in 2021, 2022, even earlier, I was very impressed that he had a clear vision of where this was all going and what was needed.

Some of his emails from 2016, '17, when they were founding OpenAI, were published.

And even then, he was talking about how he thinks one big experiment is much more valuable than 100 small ones.

That was a core insight that differentiated them from Google Brain, for example.

It just seems very insightful that he just sees things much more clearly than others.

And I just wonder what his production function is.

How do you make a human like that?

And how do you improve your own thinking to better model it?

I think it is true that-- one of OpenAI's big successes was betting on the scaling paradigm.

It is just kind of odd because they were not the biggest lab.

It was difficult for them to scale.

Back then, it was much more common to do a lot of small experiments, more academic style.

People were trying to figure out these various like, algorithmic improvements.

And OpenAI bet pretty early on like, large scale.

We had David Luan on, who I think was VP-Eng at the time of GPT-1 and 2.

And he talked about how the differences between Brain and OpenAI were basically the cause of Google's inability to come out with a scaled model.

Just structurally, everyone had allocated compute.

And you had to pool resources together to make bets.

And you just couldn't.

I think that's true that OpenAI was structured differently.

And I think that really helped them.

OpenAI functions a lot like a startup.

And other places tended to function more like universities or research labs as they traditionally existed.

The way that OpenAI operates more like a startup, with this mission of building AGI and superintelligence, helped them organize, collaborate, pool resources together, and make hard choices about how to allocate resources.

And I think a lot of the other labs like have now been trying to adopt paradigms more like that, like setups more like that.

Let's talk about maybe the killer use case, at least in my mind, of these models, which is coding.

You released Codex recently.

But I would love to talk through the Noam Brown coding stack.

What models you use, how you interact with them.

Cursor, Windsurf.

Lately, I've been using Windsurf and Codex.

Like actually a lot of Codex.

I've been having a lot of fun.

You just give it a task and it just goes off and does it.

And it comes back five minutes later with like a pull request.

And is it core research task or like side stuff that you don't super care about?

I wouldn't say it's like side stuff.

I would say basically anything that I would normally try to code up, I try to do it with Codex first.

Well, for you, it's free.

But yeah, for everybody, it's free right now.

And I think that's partly because it's the most effective way for me to do it.

And also, it's good for me to get experience working with this technology and then also seeing the shortcomings of it.

It just helps me better understand, OK, this is the limits of these models and what we need to push on next.

Have you felt the AGI?

I've felt the AGI multiple times, yes.

How should people push Codex in ways that you've done?

And I think you see it before others because obviously you were closer to it.

I think anybody can use Codex and feel the AGI.

It's kind of funny how you feel the AGI and then you get used to it very quickly.

So you're really dissatisfied with where it's lacking.

Yeah, it's magical.

I was actually looking back at the old Sora videos when they were announced.

Because remember when Sora came out, it was just like-- The biggest it was ever.

It was just magical.

You look at that and you're like, oh, it's really here.

This is AGI.

But if you look at it now, and it's kind of like, oh, the people don't move very organically.

And there's a lack of consistency in some ways.

And you see all these flaws in it now that you just didn't really notice when it first came out.

And yeah, you get used to this technology very quickly.

And but I think what's cool about it is that because it's developing so quickly, you get those feel the AGI moments every few months.

So something else is going to come out.

It's magical to you.

And then you get used to it very quickly.

What are your Windsurf pro tips?

Now that you've immersed in it.

I think one thing I'm surprised by is how few people-- I mean, maybe your audience is going to be more comfortable with reasoning models and use reasoning models more.

But I'm surprised at how many people don't even know that o3 exists.

I've been using it day to day.

It's basically replaced Google search for me.

I just use it all the time.

And also for things like coding, I tend to just use the reasoning models.

My suggestion is if people have not tried the reasoning models yet, because honestly, people love them.

People that use it love them.

Obviously, a lot more people use GPT-4o and just the default on ChatGPT and that kind of stuff.

I think it's worth trying the reasoning models.

I think people would be surprised at what they can do.

I use Windsurf daily.

And they still haven't actually enabled it as a default in Windsurf.

I always have to dig up, type in o3, and then it's like, oh, yeah, that exists.

It's weird.

I would say my struggle with it has been that it takes so long to reason that I actually break out of flow.

I think that is true.

Yes.

And I think this is one of the advantages of Codex, that you can give it a task that's self-contained, and it can go off and do its thing and come back 10 minutes later.

And I can see that if you're using this thing as more like a pair programmer kind of thing, then you want to use GPT-4.1 or something like that.

What do you think is the most broken part of the development cycle with AI?

In my mind, it's pull request review.

I use Codex all the time.

And then I got all these pull requests.

And then it's kind of hard to go through all of them.

What other thing would you like people to build to make this even more scalable?

I think it's really on us to build a lot more stuff.

These models are very limited in some ways.

I think I find it frustrating that you ask them to do something, and then they spend 10 minutes doing it.

And then you ask them to do something pretty similar.

And then they go spend 10 minutes doing it.

I think I describe them as like, they're geniuses, but it's their first day on the job.

And that's kind of annoying.

Even the smartest person on Earth, when it's their first day on the job, they're not going to be as useful as you would like them to be.

So I think being able to get more experience and act like somebody that's actually been on the job for six months instead of one day, I think would make them a lot more useful.

But that's really on us to build that capability.

Do you think a lot of it is GPU constrained for you?

If I think about Codex, why is it asking me to set up the environment myself when the model-- if I ask o3 to create an environment setup script for a repo, I'm sure it'll be able to do it.

But today in the product, I have to do it.

So I'm wondering, in your mind, could these be a lot more if we just, again, put more test time compute on them?

Or do you think there's a fundamental model capability limitation today that we still need a lot of human harnesses around it?

I think that we're in an awkward state right now.

Progress is very fast.

And there's things that are like, clearly, we could do this and the models would be better.

We're going to get to it.

You're just limited by how many hours there are in the day.

So progress can only proceed so quickly.

We're trying to get to everything as fast as we can.

And I think that o3 is not where the technology will be in six months.

I like that question overall in there's a software development lifecycle, not just generation of the code, like from issue to PR, basically, is the typical commentary of that.

And then there's the Windsurf side, which is inside your IDE.

What else?

Pull request review is something that people don't really-- there are startups that are built around it.

It's not something that Codex does.

And it could.

And so then there's like, what else is there?

That is sort of rate limiting the amount of software you could be iterating on.

It's an open question.

I don't know if there's an answer.

Anything else on SWE in general?

Where do you think this goes?

Just in form factors?

Or what will we be looking at this time next year in terms of how things are-- what models are able to do that they're not able to today?

I don't think it's going to be limited to SWE.

I don't think it's going to be limited to software engineering.

I think it's going to be able to do a lot of remote work kind of tasks.

Yeah, like SWE-Lancer-type Upwork.

Yeah, or just like even things that are not necessarily software engineering.

So the way I think about it is like anybody that's doing a remote work kind of job, I think it's valuable to become familiar with the technology and kind of get a sense of what it can do, what it can't do, what it's good at, what it's not good at.

Because I think the breadth of things that it's going to be able to do is going to expand over time as well.

I feel like virtual assistants might be the next thing after SWE then.

Because they're the most easily-- like virtual assistants, like hire someone in the Philippines, someone who just looks through your email and all that.

Because that is entirely-- you can intercept all the inputs and all the outputs and train on that.

And maybe OpenAI just buys a virtual assistant company.

Yeah, I think what I'm looking forward to is that for things like virtual assistants, the models, if they're aligned well, they could end up being really preferable for that kind of work.

There's always this principal-agent problem, where if you delegate a task to somebody, then are they really aligned with doing it as you would want it to be done?

And-- Which is as cheaply, as quickly as they can.

And so if you have an AI model that's actually really aligned to you and your preferences, then that could end up doing a way better job than a human could.

Well, not that it's doing a better job than a human could, but it's doing a better job than a human would.

That word alignment, by the way, I think there's an interesting overloading or homomorphism between safety alignment and instruction-following alignment.

And I wonder where they diverge.

OK, so I think where it diverges is, what do you want to align the models to?

That's, I think, a difficult question.

You could say you wanted to align it to the user.

Well, what happens if the user wants to build a novel virus that's going to wipe out half of humanity?

That's safety alignment.

So there's a question of, I think, alignment.

I think they're related.

And I think the big question is, what are you aligning towards?

Yeah, there's humanity goals, and then there's your personal goals, and everything in between.

So that's kind of, I guess, the individual agent.

And you announced you're leading the multi-agent team at OpenAI.

I haven't really seen many announcements.

Maybe I missed them on what you've been working on, but what can you share about interesting research directions or anything from there?

Yeah, there hasn't really been announcements on this.

I think we're working on cool stuff, and I think we'll get to announce some cool stuff at some point.

I think the team, in many ways, is actually a misnomer, because we're working on more than just multi-agent.

Multi-agent is one of the things we're working on.

Some other things we're working on is just being able to scale up test time compute by a ton.

So we get these models thinking for 15 minutes now.

How do we get them to think for hours?

How do we get them to think for days, even longer?

And be able to solve incredibly difficult problems.

So that's one direction that we're pursuing.

Multi-agent is another direction.

And here, I think there's a few different motivations.

We're interested in both the collaborative and the competitive aspect of multi-agent.

I think the way that I describe it is people often say in AI circles that humans occupy this very narrow band of intelligence, and AIs are just going to quickly catch up and then surpass this band of intelligence.

And I actually don't think that the band of human intelligence is that narrow.

I think it's actually quite broad.

Because if you compare anatomically identical humans from caveman times, they didn't get that far in terms of what we would consider intelligence today.

They're not putting a man on the moon.

They're not building semiconductors or nuclear reactors or anything like that.

And we have those today, even though we as humans are not anatomically different.

And so what's the difference?

Well, I think the difference is that you have thousands of years, a lot of humans, billions of humans, cooperating and competing with each other, building up civilization over time.

The technology that we're seeing is the product of this civilization.

And I think similarly, the AIs that we have today are kind of like the cavemen of AI.

And I think that if you're able to have them cooperate and compete with billions of AIs over a long period of time and build up a civilization, essentially, the things that they would be able to produce and answer would be far beyond what is possible today with the AIs that we have today.

Do you see that being similar to maybe like Jim Fan's Voyager skill library idea, of saving these things?

Or is it just the models then being retrained on this new knowledge?

Because the humans then have a lot of it in the brain.

I still grow.

I think I'm going to be evasive here and say that we're not going to-- yeah, we're not going to-- until we have something to announce, which I think that we will in the not too distant future, I think I'm going to be a bit vague about exactly what we're doing.

But I will say that the way that we are approaching multi-agent in the details and the way we're actually going about it is, I think, very different from how it's been done historically and how it's being done today by other places.

I've been in the multi-agent field for a long time.

I've kind of felt like the multi-agent field has been a bit misguided in some ways in the approaches that the field has taken and the way that's been approached.

And so I think we're trying to take a very principled approach to multi-agent.

Sorry, I got to add.

So you can't talk about what you're doing, but you can say what's misguided.

What's misguided?

I think that a lot of the approaches that have been taken have been very heuristic and haven't really been following the bitter lesson approach to scaling and research.

OK.

I think maybe this might be a good spot.

So obviously, you've done a lot of amazing work in poker.

And I think as the reasoning model got better, I was talking to one of my friends who used to be a hardcore poker grinder.

And I told them I was going to interview you.

And their question was, at the table, you can get a lot of information from a small sample size about how a person plays.

But today, GTO is so prevalent that sometimes people forget that you can play exploitably.

What do you think is the state?

As you think about multi-agent and kind of competition, is it always going to be trying to find the optimal thing?

Or is a lot of it trying to think more in the moment like how to exploit somebody?

I'm guessing your audience is probably not super familiar with poker terminology.

So I'll just explain this a bit.

A lot of people think that poker is just a luck game.

And that's not true.

It's actually-- there's a lot of strategy in poker.

So you can win consistently in poker if you're playing the right strategy.

So there's different approaches to poker.

One is game theory optimal.

This is like you're playing an unbeatable strategy in expectation.

You're just unexploitable.

It's kind of like in rock, paper, scissors.

You can be unbeatable in rock, paper, scissors if you just randomly choose between rock, paper, and scissors with equal probability.

Because no matter what the other guy does, they're not going to be able to exploit you.

You're going to not lose in expectation.
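
The rock, paper, scissors claim can be checked directly. The following is a minimal sketch, assuming the usual scoring of +1 for a win, 0 for a tie, and -1 for a loss; the names `PAYOFF`, `expected_payoff`, and `UNIFORM` are illustrative and not from the conversation. It computes the expected payoff of the uniform random strategy against several opponent strategies and shows it is zero every time.

```python
from fractions import Fraction

MOVES = ("rock", "paper", "scissors")

# PAYOFF[i][j] is the score for the row player when the row player
# throws MOVES[i] and the column player throws MOVES[j].
PAYOFF = (
    (0, -1, 1),   # rock:     ties rock, loses to paper, beats scissors
    (1, 0, -1),   # paper:    beats rock, ties paper, loses to scissors
    (-1, 1, 0),   # scissors: loses to rock, beats paper, ties scissors
)


def expected_payoff(row_mix, col_mix):
    """Expected score for the row player when both sides play mixed strategies."""
    return sum(
        row_mix[i] * col_mix[j] * PAYOFF[i][j]
        for i in range(3)
        for j in range(3)
    )


# The equilibrium strategy: each move with probability 1/3.
UNIFORM = (Fraction(1, 3),) * 3

# A few hypothetical opponents, from fully predictable to mixed.
OPPONENTS = {
    "always rock": (1, 0, 0),
    "always paper": (0, 1, 0),
    "mostly scissors": (Fraction(1, 10), Fraction(1, 10), Fraction(8, 10)),
    "uniform": UNIFORM,
}

for name, mix in OPPONENTS.items():
    print(name, expected_payoff(UNIFORM, mix))
```

Every line prints 0: no opponent strategy can push the uniform player below break-even, which is the "not lose in expectation" guarantee. It also shows the other half of the point, that in this game the uniform player does not gain anything from a predictable opponent either.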

Now, a lot of people hear that.

And they think, well, that also means that you're not going to win in expectation because you're just playing totally randomly.

But in poker, if you play the equilibrium strategy, it's actually really difficult for the opponents to figure out how to tie you.

And they're going to end up making mistakes that will lead you to win over the long run.

It might not be a massive win, but it is going to be a win.

If you play enough hands for a long enough period of time, you're going to win in expectation.

Now, there's also exploitative poker.

And the idea here is that you're trying to spot weaknesses in how the opponent plays.

Maybe they're not bluffing enough.

Or maybe they fold too easily to a bluff.

And so you start adapting from the game theory optimal balanced strategy of like you bluff sometimes, you don't bluff sometimes, to then playing a very unbalanced strategy that's like, oh, I'm just going to like bluff a ton against this person because they always fold whenever I bluff.

Now, the key is that there's a trade off here.

Because if you're taking this exploitative approach, then you're opening yourself up to exploitation as well.

And so you have to choose this balance between playing a defensive game theory optimal policy that guarantees you're not going to lose, but might not make you as much money as you potentially could, versus playing an exploitative strategy that can be much more profitable, but also it creates weaknesses that the opponents can take advantage of and trick you.

And there's no way to perfectly balance the two.

It's kind of like in rock, paper, scissors, if you notice somebody is like playing paper for five times in a row, you might think like, oh, they have a weakness in their strategy.

I should just be throwing scissors, and I'm going to take advantage of them.

And so on the sixth time, you throw scissors.

But actually, that's the time when they throw rock.

And you never really know.

So you always have this trade off.
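
The trade-off can be made concrete with the same rock, paper, scissors example. This is a rough sketch under stated assumptions: the opponent model is just the frequency of their past throws, scoring is +1/0/-1, and the helper names (`best_response`, `worst_case`) are hypothetical. It is not how any of the poker AIs discussed here work.

```python
MOVES = ("rock", "paper", "scissors")

# PAYOFF[i][j]: score for the row player throwing MOVES[i] against MOVES[j].
PAYOFF = ((0, -1, 1), (1, 0, -1), (-1, 1, 0))


def expected_payoff(row_mix, col_mix):
    """Expected score for the row player under two mixed strategies."""
    return sum(
        row_mix[i] * col_mix[j] * PAYOFF[i][j]
        for i in range(3)
        for j in range(3)
    )


def pure(index):
    """Mixed-strategy vector that always throws MOVES[index]."""
    return tuple(1.0 if k == index else 0.0 for k in range(3))


def best_response(col_mix):
    """Index of the pure move that scores best against the modeled opponent."""
    return max(range(3), key=lambda i: expected_payoff(pure(i), col_mix))


def worst_case(row_mix):
    """Row player's score if the opponent picks the most damaging reply."""
    return min(expected_payoff(row_mix, pure(j)) for j in range(3))


# The opponent has thrown paper five times in a row.
observed = ["paper"] * 5
model = tuple(observed.count(m) / len(observed) for m in MOVES)

exploit = pure(best_response(model))
uniform = (1 / 3, 1 / 3, 1 / 3)

print("exploitative move:", MOVES[best_response(model)])
print("gain if the model is right:", expected_payoff(exploit, model))
print("worst case for the exploiter:", worst_case(exploit))
print("worst case for uniform play:", worst_case(uniform))
```

Always throwing scissors wins a full point per round while the opponent keeps throwing paper, but its worst case is a full point lost as soon as they switch to rock. Uniform play gives up that gain in exchange for a worst case of zero.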

The poker AIs that have been extremely successful.

And like my background is like I worked on AI for poker for several years during grad school and made the first superhuman no limit poker AIs.

The approach that we took was this game theory optimal approach, where the AIs would play this unbeatable strategy and they would play against the world's best and beat them.

Now, that also means they beat the world's worst.

Like they would just beat anybody.

But if they were up against a weak opponent, they might not beat them as severely as a human expert might, because the human expert would know how to adapt from the game theory optimal policy to be able to exploit these weak players.

And so there's this kind of unanswered question of like, how do you make an exploitative poker AI?

And a lot of people had pursued this research direction.

I had dabbled in it a little bit during grad school.

And I think fundamentally, it just comes down to AIs not being as sample efficient as humans.

We discussed earlier.

If a human's playing poker, they're able to get a really good sense of the strengths and weaknesses of a player within a dozen hands.

It's honestly really impressive.

And back when we were working on AI for poker in the mid 2010s, these AIs would have to play like 10,000 hands of poker to get a good profile of who this player is, how they're playing, where their weaknesses are.

Now, I think with more recent technology, that has come down.

But still, the sample efficiency has been a big challenge.
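
One way to see why a naive statistical profile needs so many hands: a back-of-the-envelope sketch, assuming the profile is nothing more than a counted frequency (for example, how often a player folds to a bet) and each hand is an independent observation. The function name and the 50% fold rate are illustrative assumptions, not figures from the conversation.

```python
import math


def standard_error(true_rate, hands):
    """Standard error of a frequency estimated from independent observations."""
    return math.sqrt(true_rate * (1.0 - true_rate) / hands)


# Hypothetical player who folds to a bet half the time.
FOLD_RATE = 0.5

for hands in (12, 100, 10_000):
    se = standard_error(FOLD_RATE, hands)
    print(f"{hands:>6} hands: estimate is fold rate +/- {se:.3f} (one standard error)")
```

Under these assumptions, a dozen hands pins the fold rate down only to within roughly 14 percentage points, while 10,000 hands gets within about half a point. A human expert reading a player in a dozen hands is drawing on far more than raw counts, which is the sample-efficiency gap being described.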

Now, what's interesting is that after working on poker, I worked on diplomacy.

I think we talked about this earlier.

And diplomacy is this-- it's a seven player negotiation game.

And when we started working on it, I took a very game theory approach to the problem.

I felt like, OK, it's kind of like poker.

You have to compute this game theory optimal policy.

And you just play this.

You're going to not lose in expectation.

You're going to win in practice.

But that actually doesn't work in diplomacy.

And it doesn't work, again, for the question of how much of a rabbit hole do we want to go down on this.

But basically, when you're playing the zero sum games like poker, game theory optimal works really well.

When you're playing a game like Diplomacy, where you need to both collaborate and compete, and there's room for collaboration, then game theory optimal actually doesn't work that well.

And you have to understand the players and adapt to them much better.

So this ends up being very similar to the problem in poker, of how do you adapt to your opponents.

In poker, it's about adapting to their weaknesses and taking advantage of that.

In diplomacy, it's about adapting to their play styles.

It's kind of like if you're at a table and everybody's speaking French, you don't want to just keep talking in English.

You want to adapt to them and speak in French as well.

That's the realization that I had with Diplomacy: that we need to shift away from this game theory optimal paradigm towards modeling the other players, understanding who they are, and then responding accordingly.

And so in many ways, the techniques that we developed in diplomacy are exploitative.

They're not exploitative.

They're really just adapting to the opponents, to the other players at the table.

But I think the same techniques could be used in AI for poker to make exploitative poker AIs.

If I hadn't gotten AGI-pilled by the incredible progress that we were seeing with language models and shifted my whole research agenda to focusing on general reasoning, probably what I would have worked on next was making these exploitative poker AIs.

It would be a really fun research direction to go down.

I think it's still there for anybody that wants to do it.

And I think the key would be taking the techniques that we use in diplomacy and applying them to things like poker.

I think to me the core piece is when you play online, you have a HUD, which tells you all these stats about the other player and how much they participate, pre-flop, blah, blah, blah.

And to me, it's like a lot of these models, from my understanding, are not really leveraging the behavior of the other players at the table.

They're just looking at the board state and working from there.

That's correct.

The way the poker AIs work today, they're just sticking to their pre-computed-- GTO.

--GTO strategy.

And they're not adapting to the other players at the table.

And you can do various hacky things to get them to adapt.

But they're not very principled.

They don't work super well.

OK, any grad students listening?

If you want to work on that, I think that is a very, very reasonable research direction that will at least get in front of you and get some attention, at least.

The other thing that this conversation brings up for me is-- yeah, one of the hypotheses for what is the next step after test time compute is world models.

Is world modeling important or worthwhile research direction?

Yann LeCun has been talking about this nonstop.

But basically, no LLMs have-- they have internal world models, but not explicitly a world model.

I think it's pretty clear that as these models get bigger, they have a world model.

And that world model becomes better with scale.

So they are implicitly developing a world model.

And I don't think it's something that you need to explicitly model.

I could be wrong about that.

When dealing with people or multi-agents, it might be because you have entities that are not in the world, and you're resolving hypotheses of which of the many types of entities you could be dealing with.

There was this long debate in the multi-agent AI community for a long time about-- and it's still going on-- about whether you need to explicitly model other agents, like other people, or if they can be implicitly modeled as part of the environments.

For a long time, I was on the-- took the perspective of, of course, you have to explicitly model these other agents because they're behaving differently from the environment.

They take actions.

They're unpredictable.

They have agency.

But I think I've actually shifted over time to thinking that actually, if these models become smart enough, they develop things like theory of mind.

They develop an understanding that there are other agents that can take actions and have motives and all this stuff.

And these models develop that implicitly with scale and more capable behavior broadly.

So that's the perspective I take these days.

So what I just said was an example of a heuristic that is not bitter lesson-pilled.

And it just goes away.

Yeah.

It's really all-- come back to the bitter lesson.

Got to cite it every AI podcast.

So one of the interesting findings and most consistent findings-- you know, I think you were at ICLR.

And one of the hit talks there was about open-endedness.

And this guy, Tim, who gave that talk, has been doing a lot of research about multi-agent systems too.

One of the most consistent findings is always that it's better for AI to self-play and improve competitively as opposed to humans training and guiding them.

And you find that with AlphaZero and R1-Zero, whatever that was. Do you think this will hold for multi-agents, like self-play, to improve better than humans?

Yeah.

OK.

So this is a great question.

And I think this is worth expanding on.

So I think a lot of people today see self-play as the next step and maybe the last step that we need for superintelligence.

And I think if you're following-- you look at something like AlphaGo and AlphaZero, we seem to be following a very similar trend.

The first step in AlphaGo was you do large-scale pre-training.

In that case, it was on human Go games.

With LLMs, it's pre-training on tons of internet data.

And that gets you a strong model, but it doesn't get you an extremely strong model.

It doesn't get you a superhuman model.

And then the next step in the AlphaGo paradigm is you do large-scale test time compute or large-scale inference compute, in that case with MCTS.

And now we have reasoning models that also do this large-scale inference compute.

And again, that boosts the capabilities a ton.

Finally, with AlphaGo and AlphaZero, you have self-play, where the model plays against itself, learns from those games, gets better and better and better, and just goes from something that's around human-level performance to way beyond human capability.

These Go policies now are so strong that it's just incomprehensible.

What they're doing is incomprehensible to humans.

Same thing with chess.

And we don't have that right now with language models.

And so it's really tempting to look at that and say, oh, well, we just need these AI models to now interact with each other and learn from each other.

And then they're just going to get to superintelligence.

The challenge-- and I kind of mentioned this a little bit when I was talking about diplomacy-- the challenge is that Go is this two-player zero-sum game.

And two-player zero-sum games have this very nice property where when you do self-play, you are converging to a minimax equilibrium.

And I guess I should take a step back and say, in two-player zero-sum games-- two-player zero-sum games are chess, Go, even two-player poker, all two-player zero-sum-- what you typically want is what's called a minimax equilibrium.

This is that GTO policy, this policy that you play where you're guaranteeing that you're not going to lose to any opponent in expectation.

I think in chess and Go, that's pretty clearly what you want.

Interestingly, when you look at poker, it's not as obvious.

In a two-player zero-sum version of poker, you could play the GTO minimax policy, and that guarantees that you won't lose to any opponent on Earth.

But again, I mentioned, you're not going to beat a weak player.

You're not going to make as much money off of them as you could if you instead played an exploitative policy.
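
The trade-off between the minimax (GTO) policy and an exploitative policy can be sketched with rock-paper-scissors standing in for poker. This is only an illustration: the opponent frequencies and the function names are hypothetical, not anything from the bots discussed here.

```python
# Payoff to the row player in rock-paper-scissors.
# Rows and columns are ordered rock, paper, scissors.
PAYOFF = [
    [0, -1, 1],
    [1, 0, -1],
    [-1, 1, 0],
]


def expected_value(mine, theirs):
    """Expected payoff of mixed strategy `mine` against mixed strategy `theirs`."""
    return sum(
        mine[i] * theirs[j] * PAYOFF[i][j] for i in range(3) for j in range(3)
    )


def best_response(theirs):
    """Pure strategy that maximizes expected payoff against `theirs`."""
    values = [sum(theirs[j] * PAYOFF[i][j] for j in range(3)) for i in range(3)]
    best = max(range(3), key=lambda i: values[i])
    return [1.0 if i == best else 0.0 for i in range(3)]


minimax = [1 / 3, 1 / 3, 1 / 3]      # the equilibrium ("GTO") strategy
weak_opponent = [0.6, 0.2, 0.2]      # hypothetical player who overplays rock

# The equilibrium strategy cannot lose in expectation, but it also wins nothing here.
ev_minimax = expected_value(minimax, weak_opponent)

# The exploitative strategy wins more against this opponent...
exploit = best_response(weak_opponent)
ev_exploit = expected_value(exploit, weak_opponent)

# ...but it can itself be punished by an opponent who adapts (always scissors).
ev_exploit_punished = expected_value(exploit, [0.0, 0.0, 1.0])

print(ev_minimax, ev_exploit, ev_exploit_punished)
```

The equilibrium strategy gives up profit against the weak player in exchange for the guarantee; the best response collects that profit but gives up the guarantee.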

So there's this question of what do you want?

Do you want to make as much money as possible, or do you want to guarantee that you're not going to lose to any human alive?

What all the bots have decided is-- well, what all the AI developers in these games have decided is, well, we're going to choose the minimax policy.

And conveniently, that's exactly what self-play converges to.

If you have these AIs play against each other, learn from their mistakes, they converge over time to this minimax policy, guaranteed.
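
A minimal sketch of that convergence, using regret matching in self-play on rock-paper-scissors. Regret matching is swapped in here as one well-known no-regret learner; it is not claimed to be the training method of any particular bot, and the starting bias and iteration count are arbitrary assumptions.

```python
# Row player's payoff in rock-paper-scissors (rock, paper, scissors).
# The matrix is antisymmetric, so the same table gives either player's
# payoff for playing action i against the opponent's action j.
PAYOFF = [[0, -1, 1], [1, 0, -1], [-1, 1, 0]]


def strategy_from_regrets(regrets):
    """Play each action in proportion to its positive cumulative regret."""
    positive = [max(r, 0.0) for r in regrets]
    total = sum(positive)
    if total > 0:
        return [p / total for p in positive]
    return [1 / 3] * 3


def action_values(opponent_strategy):
    """Expected payoff of each pure action against a mixed strategy."""
    return [
        sum(opponent_strategy[j] * PAYOFF[i][j] for j in range(3))
        for i in range(3)
    ]


def self_play(iterations):
    # Assumption: player 0 starts biased toward rock so the dynamics are not trivial.
    regrets = [[1.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
    strategy_sums = [[0.0] * 3, [0.0] * 3]
    for _ in range(iterations):
        strategies = [strategy_from_regrets(r) for r in regrets]
        for p in (0, 1):
            values = action_values(strategies[1 - p])
            played = sum(strategies[p][i] * values[i] for i in range(3))
            for i in range(3):
                # Regret: how much better action i would have done than what was played.
                regrets[p][i] += values[i] - played
                strategy_sums[p][i] += strategies[p][i]
    # The average strategy, not the last one, is what approaches the equilibrium.
    return [[s / iterations for s in sums] for sums in strategy_sums]


def exploitability(strategy):
    """Best-response payoff against `strategy`; zero at the minimax equilibrium."""
    return max(action_values(strategy))


averages = self_play(20000)
print(averages, [exploitability(a) for a in averages])
```

The exploitability of both average strategies shrinks toward zero as the iteration count grows, which is the two-player zero-sum guarantee being described; no comparable guarantee exists once the game has more players or is not zero-sum.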

But once you go outside of two-player zero-sum games, in the case of diplomacy, that's actually not a useful policy anymore.

You don't want to just have this very defensive policy, and you're going to end up with really weird behavior if you start doing the same kind of self-play in things like math.

So for example, what does it mean to do self-play in math?

You could fall into this trap of, well, I just want one model to pose really difficult questions and the other model to solve those questions.

That's a two-player zero-sum game.

The problem is that, well, you could just pose really difficult questions that are not interesting.

You could just ask it to do 30-digit multiplication.

It's a very difficult problem for the AI models.

Is that really making progress in the dimension that we want?

Not really.

So self-play outside of these two-player zero-sum games becomes a much more difficult, nuanced question.

Tim basically said something similar in his talk, that there's a lot of challenges in really deciding what you're optimizing for when you start to talk about self-play outside of two-player zero-sum games.

My point is that this is where the AlphaGo analogy breaks down.

Not necessarily breaks down, but it's not going to be as easy as self-play was in AlphaGo.

What is the objective function then for that?

What is the new objective function?

Yeah, it's a good question.

And I think that that's something that a lot of people are thinking about.

I'm sure you are.

One of the last podcasts that you did, you mentioned that you were very impressed by Sora.

You don't work directly on Sora, but obviously it's part of OpenAI.

I think the most recent update in that generative media space is autoregressive image gen.

Is that interesting or surprising in any way that you want to comment about?

I don't work on image gen, so my ability to comment on this is kind of limited.

But I will say I love it.

I think it's super impressive.

It's one of those things where you work on these reasoning models and you think, wow, we're going to be able to do all sorts of crazy stuff, advanced science, and solve agentic tasks, and software engineering.

And then there's this whole other dimension of progress where you're like, oh, you're able to make images and videos now.

And it's so much fun.

And that's getting a lot more attention, to be honest, especially in the general public.

And it's probably driving a lot more of the subscription plans for ChatGPT, which is great.

But I think it's kind of funny that we're also, I promise, we're also working on superintelligence.

But you can make everything Ghibli.

I think the delta for me was I was actually harboring this thesis that diffusion was over because of autoregressive image gen.

There were rumors about this end of last year.

And obviously now it's come out.

Then Gemini comes out with text diffusion.

And diffusion is so back.

And this is two directions.

And it's very relevant for inference of autoregressive versus diffusion.

Do we have both?

Does one win?

The beauty of research is you've got to pursue different directions.

And it's not always going to be clear what is the promising path.

And I think it's great that people are looking into different directions and trying different things.

And I think that there's a lot of value in that exploration.

And I think we all benefit from seeing what works.

Any potential in diffusion reasoning?

Let's say your-- Probably can't answer that.

OK.

So you did a master's in robotics too.

We'd love to get your thoughts on one.

OpenAI kind of started with the pen spinning trick and the robotic arm they wanted to build.

Is it right to work on these humanoids?

Do you think that's kind of like the wrong embodiment of AI?

Outside of the usual, how long until we get robots, blah, blah, blah.

Is there something that you think is fundamentally not being explored right now that people should really be doing in robotics?

I did a master's in robotics years ago.

And my takeaway from that experience-- first of all, I didn't actually work with robots that much.

I was technically in a robotics program.

I played around with some LEGO robots my first week of the program.

But then honestly, I just pretty quickly shifted to just working on AI for poker.

And it was kind of nominally in the robotics master's.

But my takeaway from interacting with all these roboticists and seeing their research was that I did not want to work on robots because the research cycle is so much slower and so much more painful when you're dealing with physical hardware.

Software goes so much more quickly.

And I think that's why we're seeing so much progress with language models and all these virtual coworker kind of tasks but haven't seen as much progress in robotics: physical hardware just is much more painful to iterate on.

On the question of humanoids, I don't have very strong opinions here because this isn't what I'm working on.

But I think there is a lot of value in non-humanoid robotics as well.

I think drones are a perfect example where there's clearly a lot of value in that.

Is that a humanoid?

No.

But in many ways, that's great.

You don't want a humanoid for that kind of technology.

I think, weakly, that non-humanoids provide a lot of value.

I was reading Richard Hamming's The Art of Doing Science and Engineering.

And he talks about how when you have a new technological shift, people try and take the old workloads and replicate them just in the new technology versus you actually have to change the way you do it.

And when I see this video of your humanoid in the house, it's like, well, the human shape has a lot of limitations that could actually be improved.

But I think people-- what's familiar?

It's like, would you put a robot with 10 arms and five legs in your house?

Or would it be eerie?

And when you get up and you see that thing walking around and it's like, why would you see a humanoid?

So I think to me, there's almost like this local maximum of, we got to make it look like a human.

But I think what's the best shape in-house would be.

I am terrible at product design.

So I am not the person to ask on this.

I think there is a question of is it better to make humanoids because they're more familiar to us?

Or is it worse to make humanoids because they're more similar to us but not quite identical?

I don't know which one I would actually find creepier.

The thing that got me humanoid pilled a little bit was just the argument that most of the world is made for humans anyway.

So if you want to replace human labor, you have to make a humanoid.

I don't know if that's convincing.

Again, I don't have very strong opinions in this field because I don't work in it.

I was weakly in favor of humanoids.

And I think what really persuaded me to be weakly in favor of non-humanoids was listening to the Physical Intelligence CEO and some of his pitches about why they're pursuing non-humanoid robotics.

And conveniently, their office is actually very close to here.

So if you wanted to-- They're speaking at the conference I'm running.

OK, perfect.

Yeah, so I'd say listen to his pitch and maybe he can convince you that non-humanoids will go.

Awesome.

The other one I would refer people to is Jim Fan, who recently did a talk on the physical Turing test at the Sequoia conference, which was very, very good.

He's such a great educator and explainer of things.

It's very hard, especially in that field.

Cool.

We're asking you about things that you don't work on.

So these are just more rapid fires to explore some of your boundaries and get some quick hits.

How do you or top industry labs keep on top of research?

What are your tools and practices?

It's really hard.

I think that a lot of people have this perception that academic research is irrelevant.

And this is actually not the case.

I think that we do-- we look at academic research.

I think one of the challenges is a lot of academic research shows promise in their papers, but then actually doesn't work at scale or even doesn't replicate.

I think if we find interesting papers, we're going to try to reproduce that in-house and see if it still holds up and then also does it scale well.

But that is a big source of inspiration for us.

Whatever hits arXiv, literally, you do the same as the rest of us.

Or do you have a special process?

Especially if I get recommendations.

We have an internal channel where people will post interesting papers.

And I think that's a good source of, OK, well, this person, who is more familiar with this area, thinks that this paper is interesting.

So therefore, I should read it.

And similarly, I'll keep track of things that are happening in my space that I think are interesting.

And if I think it's really interesting, maybe I'll share it.

For me, it's like WhatsApp and Signal group chats with researchers.

And that's it.

I think it is like-- I mean, a lot of people look at things like Twitter.

And I think it's really unfortunate that we've reached this point where things need to get a lot of attention on social media for it to be paid attention to.

That's what the grad students are trained.

They're taking classes to do this.

I do recommend to-- I've worked with grad students.

I work with fewer now because we don't publish as much.

But when I was at FAIR publishing papers, I would tell the grad students I was working with that, you need to post it on Twitter.

And you need to-- we go over the Twitter thread about how to present the work and everything.

And there's a real art to it.

And it does matter.

And it's kind of the sad truth.

I know when you were doing the ACPC, the AI poker competition, you mentioned that people were not doing search because they were limited to two CPUs at inference.

Do you see similar things today that are keeping interesting research from being done?

That might be-- it's not as popular.

It doesn't get you into the top conferences.

Are there some environmental limiters?

Absolutely.

And I think one example is benchmarks, where you look at things like Humanity's Last Exam.

You have these incredibly difficult problems, but they are still very easily gradable.

And I think that actually limits the scope of what you can evaluate these models on if you stick to that paradigm.

It's very convenient because it's very easy to then score the models.

But actually, a lot of the things that we want to evaluate these models on are more fuzzy tasks that are not multiple choice questions.

And making benchmarks for those kinds of things is so much harder and probably also a lot more expensive to evaluate.

But I think that those are really valuable things to work on.

And that would fit the same element.

GPT-4.5 is a high-taste model in a way.

There's all these non-measurable things about a model that are really good that maybe people are not.

Well, I think there are things that are measurable, but they're just much more difficult to measure.

And I think that a lot of benchmarks have stuck to this paradigm of posing really difficult problems that are really easy to measure.

So let's say that the pre-training scaling paradigm took about five years from discovery of GPT to scaling it up to GPT-4.

And then we give test time compute five years as well.

So if test time compute hit a wall by 2030, what would be the probable cause?

It's very similar to pre-training.

Where, like, you can push pre-training a lot further.

It just becomes more expensive with each iteration.

I think we're going to see something similar with test time compute.

Where, like, OK, we're going to get them thinking, instead of three minutes, for three hours, and then three days, and then three weeks.

You run out of human life.

Well, there's two concerns.

One is that it becomes much more expensive to get the models to think for that long or scale up test time compute.

As you scale up test time compute, you're spending more on test time compute, which means that there's a limit to how much you could spend.

That's one potential ceiling.

Now, obviously-- well, not obviously, but I should say that we're also becoming more efficient.

These models are becoming more efficient in the way they're thinking.

And so they're able to do more with the same amount of test time compute.

And I think that's a very underappreciated point that it's not just that we're getting these models to think for longer.

In fact, if you look at o3, it's thinking for longer than o1-preview for some questions.

It's not a radical difference, but it's way better.

Why?

Because it's just becoming better at thinking.

Anyway, yeah, these models-- you're going to scale up test time compute.

You can only scale it up so much.

That becomes a soft barrier.

In the same way that pre-training, it's becoming more and more expensive to train better and better pre-trained models, or bigger pre-trained models.

The second point is that as you have these models think for longer, you kind of get bottlenecked by wall-clock time.

If you want to iterate on experiments, it is really easy to iterate on experiments when these models would respond instantly.

It's much harder when they take three hours to respond.

And what happens when they have three weeks?

It takes you at least three weeks to do those evaluations and to then iterate on that.

And in a lot of this, you can parallelize experiments to some extent, but a lot of it, you have to run the experiment, complete it, and then see the results in order to decide on the next set of experiments.

I think this is actually the strongest case for long timelines, that the models, because they just have to do so much in serial time, we can only iterate so quickly.

How would you overcome that wall?

It's a challenge.

And I think it depends on the domain.

So drug discovery, I think, is one domain where this could be a real bottleneck.

If you want to see if something extends human life, it's going to take you a long time to figure out if this new drug that you developed actually extends human life and doesn't have terrible side effects along the way.

Side note, do we not have perfect models of human chemistry and biology right now?

Well, so this is, I think, the thing.

And again, I want to be cautious here, because I'm not actually a biologist or a chemist.

I know very little about these fields.

Last time I took a biology class was 10th grade in high school.

I don't think that there is a perfect simulator of human biology right now.

And I think that's something that could potentially help address this problem.

That's the number one thing that we should all work on.

Well, that's one of the things that we're hoping that these recent models will help us with.

How would you classify mid-training versus post-training today?

All these definitions are so funny.

So I don't have a great answer there.

It's a question people have.

And OpenAI is now explicitly hiring for mid-training.

And everyone is like, what the hell is mid-training?

I think mid-training is between pre-training and post-training.

It's like it's not post-training.

It's not pre-training.

It's like adding more to the models.

But after pre-training-- In interesting ways.

Yeah.

OK.

All right.

Well, I was trying to get some clarity.

Is the pre-trained model now basically just an artifact that then spawns other models?

And it's almost like the core pre-training model is never really exposed anymore.

And is the mid-training the new pre-training?

And then there's the post-training once you have the models branched out.

You never interact with an actual raw pre-trained model.

If you're going to interact with the model, it's going to go through mid-training and post-training.

So you're seeing the final product.

Well, you don't let us do it, but we used to.

Well, yeah.

I mean, I guess there's open source models where you can just interact with the raw pre-trained model.

But for OpenAI models, they go through a mid-training step, and then they go through a post-training step, and then they're released.

And they're a lot more useful.

Frankly, if you interacted with only the pre-trained model, it would be super difficult to work with.

And it would seem kind of dumb.

Yeah.

It'd be useful in weird ways.

Because there's a mode collapse when you post-train it for chat.

Yeah.

And in some ways, you want that mode collapse.

You want that collapse of-- Yes, to be useful.

Yeah.

I get it.

We're interviewing Greg Brockman next.

You've talked to him a lot.

What would you ask him?

What would I ask Greg?

I mean, I get to ask Greg all the time.

What should you ask Greg?

To evoke an interesting response that he doesn't get asked enough about, but you know this is something that he's passionate about.

Or you just want his thoughts.

I think in general, it's worth asking where this goes.

What does the world actually look like in five years?

What does the world look like in 10 years?

What does that distribution of outcomes look like?

And what could the world or individuals do to help steer things towards the good outcomes instead of the negative outcomes?

OK, like an alignment question.

I think people get very focused on what's going to happen in one or two years.

And I think it's also worth spending some time thinking about what happens in five or 10 years, and what does that world look like?

I mean, he doesn't have a crystal ball.

But he certainly has thoughts.

Yeah, so I think that's worth exploring.

What are games that you recommend to people, especially socially?

What are games that I recommend to people?

I've been playing a lot of this game called Blood on the Clocktower lately.

What is it?

It's kind of like Mafia or Werewolf.

It's become very popular in San Francisco.

Oh, that's the one we played in your house.

Yeah.

OK, got it, Greg.

It's kind of funny, because I was talking to a couple people now that had told me that it used to be that poker was the way that the VCs and tech founders and stuff would socialize with each other.

And actually, now it's shifting more towards Blood on the Clocktower.

That's the thing that people use to connect in the Bay Area.

And I was actually told that a startup held a recruiting event that was a Blood on the Clocktower game.

Wow.

So I guess it's really catching on.

But it's a fun game.

And I guess you lose less money playing it than you do playing poker.

So it's better for people that are not very good at these things.

I think it's kind of a weird recruiting event, but certainly a fun game.

What qualities make a winner here that is interesting to hire for?

That's the thing.

I guess you get a good-- Ability to lie.

Deception and picking up on deception.

Is that the best employee?

I don't know.

So my slight final pet topic is Magic the Gathering.

So you have-- we talked about some of these games, like chess and Go, and they have perfect information.

Then you have poker, which is imperfect information in a pretty limited universe.

You only have a 52-card deck.

And then you have these other games that have imperfect information, like a huge pool of possible options.

Do you have any idea of how much harder that is?

How does the difficulty of this problem scale?

I love that you asked that, because I have this huge store of knowledge on AI for imperfect information games.

This is my area of research for so long.

And I know all these things, but I don't get to talk about it very often.

We've made superhuman poker AIs for No Limit Texas Hold'em.

One of the interesting things about that is that the amount of hidden information is actually pretty limited, because you have two hidden cards when you're playing Texas Hold'em.

And so the number of possible states that you could be in is 1,326 when you're playing heads-up, at least.

And that's multiplied by the number of other players that there are at the table, but it's still not a massive number.

And so the way these AI models work is they enumerate all the different states that you could be in.

So if you're playing like six-handed poker, there's five other players, five times 1,326, that's the number of states that you'd be in.

And then you assign a probability to each one.

And then you feed those probabilities into your neural net, and you get actions back for each of those states.
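
The enumerate-and-assign-probabilities step can be sketched roughly as follows. This covers only the belief bookkeeping, not the search or the neural net; the card encoding, the function names, and the raise likelihoods are hypothetical choices made for illustration.

```python
from itertools import combinations

RANKS = "23456789TJQKA"
SUITS = "cdhs"
DECK = [rank + suit for rank in RANKS for suit in SUITS]

# Every two-card hidden hand: C(52, 2) = 1,326 combinations.
ALL_HANDS = list(combinations(DECK, 2))


def uniform_belief(known_cards):
    """Uniform probability over hands that do not use any card we can already see."""
    known = set(known_cards)
    live = [hand for hand in ALL_HANDS if known.isdisjoint(hand)]
    p = 1.0 / len(live)
    return {hand: p for hand in live}


def bayes_update(belief, likelihood):
    """Reweight each hand by how likely it is to take the observed action."""
    weighted = {hand: p * likelihood(hand) for hand, p in belief.items()}
    total = sum(weighted.values())
    return {hand: w / total for hand, w in weighted.items()}


def raise_likelihood(hand):
    # Hypothetical model: pocket pairs raise twice as often as other hands.
    is_pair = hand[0][0] == hand[1][0]
    return 0.8 if is_pair else 0.4


belief = uniform_belief(["As", "Kd"])             # we hold the ace of spades, king of diamonds
posterior = bayes_update(belief, raise_likelihood)

print(len(ALL_HANDS), len(belief))
```

A vector like `posterior` is the kind of per-state probability input being described; it stays manageable only because the list of hidden hands is short.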

The problem is that as you scale the number of hidden possibilities, like the number of possible states you could be in, that approach breaks down.

And there's still this very interesting unanswered question of what do you do when the number of hidden states becomes extremely large?

So if you go to Omaha poker, where you have four hidden cards, there are kind of heuristic things you could do to reduce the number of states.

But actually, it's still a very difficult question.

And then if you go to a game like Stratego, where you have 40 pieces-- so there's close to 40 factorial different states you could be in-- then all these existing approaches that we used for poker break down.
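
To get a feel for how fast the hidden-state count grows, here is a quick back-of-the-envelope calculation. It takes the 40 factorial figure at face value as an upper-bound style estimate; the variable names are only illustrative.

```python
import math

holdem_hands = math.comb(52, 2)          # two hidden cards
omaha_hands = math.comb(52, 4)           # four hidden cards
stratego_orderings = math.factorial(40)  # orderings of 40 pieces

print(f"Hold'em: {holdem_hands:,}")
print(f"Omaha:   {omaha_hands:,}")
print(f"Stratego: about {stratego_orderings:.2e}")
```

Going from two hidden cards to four multiplies the count by roughly two hundred, while the Stratego figure is a 48-digit number, far beyond anything that can be enumerated one state at a time.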

And you need different approaches.

And there's a lot of active research going on about, how do you cope with that?

So for something like Magic the Gathering, the techniques that we used in poker would not work out of the box.

And it's still an interesting research question of what do you do?

Now, I should say this becomes a problem when you're doing the kinds of search techniques that we used in poker.

If you're just doing model-free RL, it's not a problem.

And my guess is that if somebody put in the effort, they could probably make a superhuman bot for Magic the Gathering now.

Yeah, there's still some unanswered research questions in that space.

Now, are they the most important unanswered research questions?

I'm inclined to say no.

I think there's-- the problem is that the techniques that we used in poker to do this kind of search stuff were pretty limited.

And if you expand those techniques, maybe you get them to work on things like Stratego and Magic the Gathering.

But they're still going to be limited.

They're not going to get you superhuman in Codeforces with language models.

So I think it's more valuable to just focus on the very general reasoning techniques.

And one day, as we improve those, I think we'll have a model that just out of the box one day plays Magic the Gathering at a superhuman level.

And I think that's the more important and more impressive research direction.

Cool.

Amazing.

Yeah.

Thanks very much for coming on, though.

Yeah.

Thanks for your time.

Yeah.

Thanks.

Thanks for having me.