The a16z Show · 2025-09-25

OpenAI's Mark Chen & Jakub Pachocki on GPT-5, RL, and the March Toward an Automated Researcher

Hosts: Anjney Midha, Sarah Wang

Guests: Mark Chen, Jakub Pachocki

GPT-5 · reasoning models · automated research · reinforcement learning · evals and benchmarks · long-horizon agency · GPT-5 Codex · coding agents · compute scaling · AI research culture

Why it matters

OpenAI research chiefs on GPT-5 unification and the automated researcher goal

Key claims

  • GPT-5 was designed to unify OpenAI's instant-response models with its reasoning models so users no longer choose between modes.
  • OpenAI's primary research target is an 'automated researcher' that can autonomously discover new ideas; current models reason reliably for ~1–5 hours and the team is working to extend that horizon.
  • Traditional evals (math contests, IOI, IMO, AtCoder) are approaching saturation; the lab is shifting to economically meaningful benchmarks that measure real discovery.
  • RL on natural language has been 'the gift that keeps giving'—Pachocki credits it as the most versatile method and says there are still many unexplored directions.

Episode summary

OpenAI's Chief Research Officer Mark Chen and Chief Scientist Jakub Pachocki joined a16z's Anjney Midha and Sarah Wang to discuss the GPT-5 launch, which they framed as the company's effort to bring reasoning into the mainstream by merging its instant-response GPT series with its longer-thinking o-series. They acknowledged that many traditional evals are now saturated and described a shift toward benchmarks that measure actual discovery and economically meaningful tasks, with math, programming competitions, and hard-science problem solving as the current frontier.

The central research target they articulated is building an "automated researcher" capable of autonomously discovering new ideas over long time horizons, with current models operating reliably at roughly one-to-five-hour reasoning windows. They credited reinforcement learning as the most versatile and productive lever in their stack, said RL on natural language is finally delivering on long-standing ambitions, and argued that compute—not data—remains the binding constraint on progress. They also discussed the just-released GPT-5 Codex, which rebalances compute spend between easy and hard coding tasks and is tuned for messy real-world software environments.

On culture, both emphasized protecting fundamental research from product pressures, maintaining conviction-driven long-term roadmaps rather than reacting to competitor launches, and portfolio-managing compute across bets. They described the shift from "vibe coding" to "vibe researching" as inevitable, and reflected personally on their own competitive-programming roots and what it feels like to watch models surpass their abilities.

  • Compute, not data, remains the binding constraint on frontier research; Chen and Pachocki said they would spend 10% more resources on compute if they had it.
  • GPT-5 Codex rebalances inference spend—spending less on easy problems and more on hard ones—and adds presets so users can trade latency for quality.
  • Both researchers said competitive-programming models now exceed their own abilities and predicted 'vibe researching' will become the default workflow as 'vibe coding' already is for high-schoolers.
  • They stressed protecting fundamental research from product gravity, hiring for persistence and taste over visibility, and portfolio-managing compute across diverse bets like reasoning, multimodal, and diffusion work.

Source material

Transcript

The big thing that we are targeting is producing an automated researcher.

So automating the discovery of new ideas, the next set of evals and milestones that we're looking at will involve actual movement on things that are economically relevant.

I was talking to some high schoolers and they were saying, "Oh, you know, actually the default way to code is vibe coding."

I do think, you know, the future hopefully will be vibe researching.

What does it take to build an automated researcher and can AI discover new ideas on its own?

OpenAI's chief scientist, Jakub Pachocki, and chief research officer Mark Chen join a16z general partners Anjney Midha and Sarah Wang to unpack GPT-5's reasoning closure.

Why evals must shift to economically meaningful benchmarks and the march towards an automated researcher.

We get into long-horizon agency, why RL keeps working, the new Codex for real-world coding, research culture versus product, and why, for now, compute is destiny.

Let's get into it.

Thanks for coming, Jakub and Mark.

Jakub, you're the chief scientist at OpenAI.

Mark, you are the chief research officer at OpenAI and you guys have both the privilege and the stress of running probably one of the most high-profile research teams in AI.

And so we're just really stoked to talk with you about a whole bunch of things we've been curious about including GPT-5, which was, you know, one of the most exciting updates to come out of OpenAI in recent times.

And then stepping back, how you build a research team that can do not just GPT-5 but Codex and ChatGPT and an API business and can weave all of the many different bets you guys have across modalities, across product form factors into one coherent research culture and story.

And so to kick things off, why don't we start with GPT-5?

Just tell us a little bit about the GPT-5 launch from your perspective, how did it go?

So I think GPT-5 was really our attempt to bring reasoning into the mainstream.

And prior to GPT-5, right, we had two different series of models.

You had the GPT kind of two, three, four series, which were kind of these instant response models.

And then we had the o-series, which essentially thought for a very long time and then gave you the best answer that it could give.

So tactically, we don't want our users to be puzzled by, you know, which mode should I use.

And it involves a lot of research and kind of identifying what the right amount of thinking for any particular prompt looks like and taking that pain away from the user.

So we think the future is about reasoning, more and more about reasoning, more and more about agents.

And we think GPT-5 is the step towards delivering reasoning and more agentic behavior by default.

There are also a number of improvements across the board in this model relative to o3 and our previous models.

But our primary reason for this launch was indeed bringing the reasoning out more and more.

Can you say more about how you guys think about evals?

I noticed even in that launch video, there were a number of evals where you're inching up from, you know, 98 to 99%.

And that's kind of how you know you've saturated the eval.

What approach do you guys take to measuring progress and how do you think about it?

One thing is that indeed for these evals that we've been using for the last few years, they're indeed pretty close to saturated.

And so yeah, like for a lot of them, like, you know, inching from like 96 to 98% is not necessarily the most important thing in the world.

I think another thing that's maybe even more important, but a little bit subtler.

When we were in this like GPT-2, GPT-3, GPT-4 era, you know, there was kind of one recipe, you just like pre-train a model on a lot of data and you kind of like use these evals as just kind of a yardstick of how this generalizes to like different tasks.

Now we have these different ways of training in particular, reinforcement learning on like serious reasoning where we can pick a domain and we can really train a model to like become an expert in this domain to reason very hard about it, which lets us, you know, target particular kinds of tasks, which will mean that like we can get like extremely good performance on some evals, but it doesn't indicate as great generalization to other things.

So the way we think about it in this world, we definitely think like we are in a little bit of a deficit, like of great evaluations.

And I think the big things that we look at are actual marks of the model being able to discover new things.

I think for me, the most exciting thread and the actual sign of progress this year has been our model's performance in math and programming competitions.

Although I think like they are also becoming saturated in a sense.

And the next set of evals and milestones that we're looking at will involve actual discovery and actual movement on things that are economically relevant.

Totally.

You guys already got number two in the AtCoder competition, so there's only number one left.

Yeah.

I mean, I think it is important to note that these evals like you know, IOI, AtCoder, IMO are actually real world markers for success in future research.

I think a lot of, you know, the best researchers in the world have gone through these competitions and have gotten very good results.

And yeah, I think we are kind of preparing for this frontier where we're trying to get our models to discover new things.

Yeah, very exciting.

Which capability from GPT-5 before the release surprised you the most when you were working through the eval bench or using it internally?

Were there any moments where you felt like this was starting to get good enough to release because it was useful in your daily usage?

I think one big thing for me was just how much it moved the frontier in very hard sciences.

You know, we would try the models with some of our friends who are, you know, professional physicists or professional mathematicians.

And you already saw kind of some instances of this on Twitter where, you know, you can take a problem and have it discover maybe not like very complicated new mathematics, but you know, some non-trivial new mathematics.

And we see physicists, mathematicians kind of repeating this experience over and over where they're trying GPT-5 Pro and saying, wow, this is something that previous versions of the models couldn't do.

And it is a little bit of a light bulb moment for them.

It's like able to automate maybe like what could take one of their students months of time.

Well, GPT-5 is a definite improvement on o3.

For me, o3 was definitely like that moment where the reasoning models became like actually very useful on a daily basis.

I think especially for, you know, working through a math formula or a derivation, like it actually got to a level where it is like fairly trustworthy.

I can actually use it as a tool for my work.

And yeah, I think it is very exciting to get to that moment.

But I expect that, well, now as we're seeing, you know, these models like actually able to automate.

Well, yes, like we're saying solving contest problems over longer time horizons.

I expect that that was quite small compared to what's coming over the next year.

What is coming in the next one to five years?

It would be just at whatever level you're comfortable sharing.

What does the research roadmap look like?

So the big thing that we are targeting with our research is producing an automated researcher.

So automating the discovery of new ideas.

And, you know, of course, like a particular thing we think about a lot is automating our own work, automating ML research, but that can get a little bit self-referential.

So we're also thinking about automating progress in other sciences.

And I think like one good way to measure progress there is looking at like what is the time horizon on which these models actually can reason and make progress.

And so now as we get to a level of near mastery of these high school competitions, let's say, I would say like we get to like maybe on the order of one to five hours of reasoning.

And so we are focused on extending that horizon, both in terms of like the model's capability to plan over very long horizons and its ability to retain memory.

And back to the evals question, that's why I think evals of the form of how long does this model autonomously operate for are of particular interest to us.

And actually, maybe on that topic, there's been this huge move toward agency in model development.

But I think at least the state that it's in currently users have sort of observed this trade off between too many tools or planning hops can result in quality regressions versus something that maybe has a little bit less agency, the quality is at least observed today to be a bit higher.

How do you guys think about the trade off between stability and depth?

The more steps that the model is undertaking, maybe the less likely the 10th step is to be accurate versus you ask it to do one thing, it can do it very, very well.

And to have it keep doing that one thing better and better, but more complex things, there's sort of that trade off.

But of course, to get to full autonomy, you are taking multiple steps, you're using multiple tools.

I think actually, like, well, the ability to maintain depth is a lot of it is being consistent over long horizons.

So I think there are very related problems.

And in fact, I think like with the reasoning models, we have seen the models like greatly extend the length for which they are able to reason and work reliably without going off track.

Yeah, I think this remains a big area of focus for us.

And I think reasoning is core to this ability to operate over a long horizon.

Because, you know, you imagine kind of yourself solving a math problem, or you try an approach, it doesn't work.

And you have to think about, you know, what's the next approach I'm going to take?

What are the mistakes in the first approach?

And you try another thing, and you know, the world gives you some hard feedback, right?

And then you keep trying different approaches.

And the ability to do that over a long period of time is reasoning and gives agents that robustness.

We talked a lot about math and science, I was curious to get your take on, do you think some of the progress that we've made can actually extend similarly to domains that are less verifiable, they're sort of less of an explicit right or wrong?

Oh, yeah, that's a question I really like.

I think if you actually truly want to extend to research and, you know, discovering ideas that meaningfully advance technology on, you know, the scale of like months and years, like, I think these questions like, stop being so different, right?

Like, it is one thing to solve like a very well posed constraint problem on the scale of an hour, right?

And there's like kind of a finite amount of ideas you need to look through.

And that might feel extremely different from solving something very open ended.

But you know, even if you want to solve like a very well defined problem that is on a much longer scale, right, you like, you know, prove this Millennium Prize Problem, well, that suddenly requires you to think about, okay, like, what are the fields of mathematics or other science that might possibly be relevant, you know, are there inspirations from physics that I must take, like, what is kind of the entire program that I want to develop around this?

And now these become very open ended questions.

And it's actually hard to, you know, for our own research, right, like, if all we cared about is, you know, reduce the modeling loss on a given data set, right, like measuring the progress on that, like, are we kind of actually asking the right questions in research, like, like actually becomes like a fairly open ended affair.

Yeah, and I think it also makes sense to think about what the limits of, you know, open ended means.

I think a while back, Sam tweeted about some of the improvements that we were making in having our models write more creatively.

And you know, we do consider the extremes here as well.

Right, right.

Let's talk about RL.

Because it seems like since o1 came out, RL has been the gift that keeps giving.

You know, every couple months, OpenAI puts out a release, and everyone goes, Oh, that's great.

But this RL thing is going to plateau.

We're going to saturate the evals, the models won't generalize or there's going to be mode collapse because of too much synthetic data, or whatever.

Everybody's got a laundry list of reasons to believe that the gains in performance from RL are going to tap out.

And somehow they just don't. You guys just keep coming out and putting out continuous improvements.

Why is RL working so well?

And what if anything has surprised you about how well it works?

RL is a very versatile method, right?

And there are a lot of ideas you can explore once you have an RL system working.

For a long time at OpenAI, we started from this, before language models, right?

Like we were thinking about like, okay, like RL is this extremely powerful thing, of course, like on top of deep learning, which is this like incredible general learning method.

But the thing that we struggled with for a very long time is like, what is the environment?

Like, how do we actually anchor these models to the real world?

Or like, should we, you know, simulate, you know, some island where they all learn to collaborate and compete?

And then, you know, of course, came the language modeling breakthrough, right?

And we saw that, oh, yeah, if we scale deep learning on modeling natural language, we can create models with this like incredibly nuanced understanding of human language.

And so since then, we've been, you know, seeking how to combine these paradigms and how to get RL to work on natural language.

And once you do, right, like, then you kind of have the, well, you have the ability to actually like execute on these different ideas and objectives in this like extremely robust, rich environment given by pre-training.

And so, yes, I think it's been perhaps the most exciting period in our research over the last few years where we've really like found so many new directions and promising ideas that all seem to be working out and we're trying to understand how to compare.

One of the hardest things about RL for folks who are not practitioners of RL is the idea of crafting the right reward model.

And so, especially if you're a business or an enterprise who wants to harness all this amazing progress you guys are putting out, but doesn't even know where to start.

What do the next few years look like for a company like that?

What is the right mindset for somebody who's trying to make sense of RL to craft the right reward model?

Is there anything you've learned about the best practices or an approach of thinking, of using this latest sort of family of reasoning techniques?

What is the right way I should think about even approaching reward modeling as a biologist or a physicist?

I expect this will evolve quite rapidly.

I expect it will become simpler, right?

Like, I think, you know, maybe like two years ago we would have been talking about like, what is the right way to craft my fine tuning data set?

And I don't think we are like at the end of that evolution yet.

And I think we will be inching towards more and more human-like learning, which, you know, RL is still not quite.

So I think maybe the most important part of the mindset is to like not assume that like, what is now will be forever.

So I want to bring the conversation back to coding.

We would be remiss not to say congrats on GPT-5 Codex, which just dropped today.

Can you guys say a little bit more about what's different about it?

How it's trained differently?

Maybe why you're excited about it?

Yeah, so I think one of the big focuses of the Codex team is to just take the raw intelligence that we have from our reasoning models and make it very useful for real world coding.

So a lot of the work they've done is kind of consistent with this.

They are working on kind of having the model be able to handle more difficult environments.

We know that real world coding is very messy.

So they're trying to handle all the intricacies there.

There's a lot of coding that has to do with style with just like kind of softer things like how proactive the model is, how lazy it is.

And just being able to define in some sense like a spec for how a coding model should behave.

They do a lot of very strong work there.

And as you see, they're also working on a lot better presets.

Coders, they have some kind of notion of, this is how long I'm willing to wait for a particular solution.

I think we've done a lot of work to dial that in: for easy problems, being a lot lower latency; for harder problems, actually, the right thing is to be even higher latency, getting the really best solution.

And just being able to find that preset is very important.

What's the sweet spot for if you were to say like easier problems versus harder?

What we found is the previous generation of the Codex models, they were spending too little time solving the hardest problems and too much time solving the easy problems.

And I think that is actually just probably out of the box what you might get out of o3.

Maybe just on the topic of coding, since you guys are both competitive coders in prior lives.

I know you've been at OpenAI for almost a decade now, but I was struck by the story of Lee Sedol, the Go player who famously quit Go after he lost to AlphaGo multiple times.

And I think in a recent interview, you guys were both saying that now the coding models are better than your capabilities.

And that gets you excited.

But say more about that.

And how much would you say you code now?

Well, if you're hands on keyboard, you can talk about OpenAI generally, but how much code is written by AI now?

In terms of coding models being better, I think it is extremely exciting to see this progress.

I think the programming competitions have a nice kind of encapsulated test of ability to come up with some new ideas in this boxed environment and timeframe.

I do think if you look at things like, well, I guess the IMO problem six, or maybe some very hardest programming competitions problems, like I think there's still a little bit of headway to go for the models, but I wouldn't expect that to last very long.

I do code a little bit.

Historically, I've been like being humble.

Historically, I've actually been extremely reluctant to use any sort of tools.

I just used Vim pretty much.

Eventually, I think, especially with these latest coding tools, like GPT-5, I really kind of felt like, okay, this is no longer the way.

You can do a 30-file refactor pretty much perfectly in 15 minutes.

You kind of have to use it.

So I've been kind of learning this new way of coding, which definitely feels a little bit different.

I think it is a little bit of an uncanny valley still right now, where you kind of have to use it, because it is just accelerating so many things, but it's still a little bit not quite as good as a co-worker.

So I think our priority is getting out of that uncanny valley.

But yeah, it's definitely an interesting time.

Definitely.

To kind of speak to the Lee Sedol moment, I think AlphaGo for both of us was a very formative milestone in AI development.

And at least for me, it was the reason I started working on this in the first place.

And maybe partly because of our backgrounds in competitive programming, I had this affinity to building these models, which could do very, very well in these forms of contests.

And going from solving eighth grade math problems to a year later, hitting our level of performance in these coding contests, it's crazy to see that progression.

And you kind of imagine or like to think that you feel some of the feelings that Lee Sedol felt too.

It's like, wow, this is really crazy.

And what are the possibilities?

And this is something that I took decades to do and took a lot of hard work to get to the forefront of.

So you really do feel an implication of that is these models, what can't they do?

And I do feel like already it's kind of transformed the default for coding.

This past weekend, I was talking to some high schoolers and they're saying, oh, actually the default way to code is vibe coding.

I think they would consider it, oh, it's like maybe sometimes for completeness, you would go and actually do all the mechanics of coding it from scratch yourself.

But that's just a strange concept to them.

Why would you do that?

You just vibe code by default.

It's like the default.

Yeah.

And so I do think the future hopefully will be vibe researching.

I have a question about that, which is what makes a great researcher?

When you say vibe researching, a big part of vibe coding is just having good taste in wanting to build something useful and interesting for the world.

And I think what's so awesome about tools like Codex is if you've got a good intuition for what people want, it helps you articulate that and then basically actualize a prototype very fast.

With research, what's the analog?

What makes a great researcher?

Persistence is a very key trait.

I think what is different about research when you're actually trying to-- I think the special thing about research is you're trying to create something or learn something that is just not known.

It's not known to work.

You don't know whether it will work.

And so always trying something that will most likely fail.

And I think getting to a place where you are in the mindset of being ready to fail and being ready to learn from these failures.

And of course, with that comes creating kind of clear hypotheses and being extremely honest with yourself about how you're doing on them.

I think a trap many people fall into is going out of the way to prove that it works, which is quite different from believing in your idea.

And conviction is extremely important.

And you want to persist that.

But you have to be honest with yourself about when it's working and when it's not so that you can learn and adjust.

Yeah, I think there are just very few shortcuts for experience.

I think through experience, you kind of learn what's the right horizon to be thinking of a problem.

You can't pick something that's too hard or it's not satisfying to do something that's too easy.

And I think a lot of research is managing your own emotions over a long period of time too.

There's just going to be a lot of things you try and they're not going to work.

And sometimes you know when to persevere through that or sometimes when to kind of switch to a different problem.

And I think interestingness is something you try to fit through reading good papers, talking to your colleagues and you kind of maybe distill their experience into your own process.

When I was in grad school, there's a big part.

I'm a failed machine learning researcher.

I was in grad school for bioinformatics.

But a big part of my research advisor's thrust was about picking the right problems to work on such that you could then sustain and persist through the hard times.

And you said something interesting, which was there's a difference between having conviction in an idea and then being maximally truth seeking about when it's not working, and those two things might sometimes be in tension because you kind of go native on a topic or a problem sometimes that you have deep conviction in.

Are there any sort of heuristics you've found useful at the taste step, at the problem picking step, that helped you arrive at the right set of problems where that conviction and truth seeking is not as much in zero-sum tension as other kinds of problems?

Yeah, to be clear, I don't think conviction and truth seeking are really in a zero-sum tension.

I think you can be convinced or you can have a lot of belief in an idea and you can be very persistent in it while it's not working.

I think it's just important that you're kind of honest with yourself, like how much progress you're making, and you're in the mindset where you're able to learn from the failures along the way.

I think it's important to look for problems that you really care about and you really believe are important.

I think one thing I've observed in many researchers that inspired me has been really going after the hard problems, like looking at the questions that are kind of like widely known, but not really considered tractable and just asking why are they not tractable or what about this approach?

Why does this approach fail?

You're always thinking about what is really the barrier for the next step.

If you're going after problems that you really truly believe are important, then that makes it so much easier to find the motivation to persist with them over years.

And in the development of GPT-5, during the training phase for example, were there any moments where there was a hard problem, the initial attempts that were being made to crack that problem weren't working, and yet you found somebody persisted through that?

And what was it about any of those stories that comes to mind that worked well, that you wish other people and other researchers did more of?

I think on the path there, right, like along the sequence of models, like both the pre-trained models and the reasoning models, I think one very common theme is bugs.

And both like, just like, yeah, silly bugs in software that can kind of stay in your software for like months and kind of invalidate all your experiments a little bit in a way that you don't know.

And identifying them can be a very meaningful breakthrough for your research program.

But also kind of bugs in the sense of like, well, you have a particular way of thinking about something and that way is a little bit skewed, which causes you to make the wrong assumptions and identifying those wrong assumptions, thinking, rethinking frames from scratch.

I think, you know, both for getting the first reasoning models working or getting the, you know, larger pre-trained models working, I think we've had like multiple issues like that we've had to work through.

As leaders of the research org, how do you think about what it takes to keep the best talent on your team and on the flip side, creating a very resilient org that doesn't crumble if a key person leaves?

The biggest, I think, things that OpenAI has going for it in terms of keeping the best people motivated and excited is that we are in the business of doing fundamental research, right?

We aren't the type of company that looks around and says, oh, what model did, for example, company X build or what model did company Y build?

You know, we have a fairly clear and crisp definition of what it is we're out to build.

We like innovating at the frontier.

We really don't like copying.

And I think people are inspired by that mission, right?

You are really in the business of discovering new things about the deep learning stack.

And I think we're kind of building something very exciting together.

I think beyond that, a lot of it's creating very good culture.

So we want a good pipeline for training up people to become very good researchers.

We, I think historically have hired the best talent and the most innovative talent.

So I just think we have a very deep bench as well.

And yeah, I think most of our leaders are very inspired by the mission.

And that's what's kept all of them there.

Like when I look at my direct reports, they haven't been affected by the talent wars.

I was chatting with a researcher recently and he was talking about wanting to find the cave dwellers.

And these are often the people who are not posting on social media about their work.

For whatever reason, they may not even be publishing.

They're sort of in the background doing the work.

I don't know if you would agree with this concept, but how do you guys hire for researchers?

And are there any non-obvious ways that you look for talent or attributes that you look for that are non-obvious?

So I think one thing that we look for is having solved hard problems in any field.

A lot of our most successful researchers have started their journey with deep learning at OpenAI and have worked in other fields like physics or computer science, theoretical computer science, or finance in the past.

Strong technical fundamentals coupled with the ability, the intent to work on very ambitious problems and actually stick with them.

We don't purely look for who did the most visible work or is the most visible on social media.

As you were talking, I was thinking back to when I was a founder and I was running my own company, we would recruit for great talent engineers.

Many of the attributes you described were ones that were on my mind then.

Elon recently tweeted that he thinks this whole researcher versus engineer distinction is silly.

Is that just a semantic?

Is it just being semantically nitpicky or do you think these two things are more similar than they actually look?

Researchers don't just fit one shape.

We have certain researchers who are very productive at OpenAI who are just so good at idea generation and they don't necessarily need to show great impact through implementing all of their ideas.

I think there's so much alpha they generate in just coming up with, "Let's try this or let's try this," or maybe we're thinking about that.

There's other researchers who are just very, very efficient at taking one idea, rigorously exploring the space of experiments around that idea.

I think researchers come in very different forms.

I think maybe that first type wouldn't necessarily map into the same bucket as a great engineer.

We do try to have a fairly diverse set of research tastes and styles.

Say a little bit about what it takes to create a frontier-winning culture that can attract all kinds and shapes of researchers and then actually grow them, help them thrive, make them win together at scale.

What do you think are the most critical ingredients of a winning culture?

I think actually the most important thing is just to make sure you protect fundamental research.

I think you can get into this world with so many different companies these days where you're just thinking about, "Oh, how do I compete on a chat product or some other kind of product surface?"

You need to make sure that you leave space and recognize the research for what it is and also give them the space to do that.

You can't have them being pulled in all of these different product directions.

I think that's one thing that we pay attention to within our culture.

Especially now that there's so much spotlight on OpenAI, so much spotlight on AI in general and the competition between different labs.

It would be easy to fall into a mindset of, "Oh, we're racing to beat this latest release or something."

There's definitely areas that people start looking over their shoulder and start thinking about, "Oh, what are these other things?"

I see it as a large part of our job to make sure that people have this comfort and space to think about what are things actually going to look like in a year or two.

What are the actually big research questions that we want to answer?

How do we actually get to models that vastly outperform what we see currently rather than just iteratively improving in the current paradigm?

Just to pull on that thread more on protecting fundamental research.

You guys are obviously one of the best research organizations in the world, but you're also one of the best product companies in the world.

How do you balance?

Especially with you've brought on some of the best product execs in the world as well, how do you balance that focus between the two and while protecting fundamental research also continue to move forward the great products that you have out?

I think it's about delineating a set of researchers who really do care about product and who really want to be accountable to the success of the product.

They should, of course, very closely coordinate with the research org at large.

But I think just kind of people understanding their mandates and what they are rewarded for.

That's a very important thing.

One thing that I think is also helpful is that our product team and broader company leadership is bought into this vision where we are going with research.

Nobody is assuming that the product we have now is the product we'll have forever and we'll just kind of wait for new versions from research.

We are able to think jointly about what the future looks like.

One of the things that you guys have done is let such a diversity of different ideas and bets flourish inside of OpenAI that you then have to figure out some way as research leaders to make it all make coherent sense as one part of a roadmap.

People over here are investigating the future of diffusion models and visual media.

Over here you've got folks investigating the future of reasoning when it comes to code.

How do you paint a coherent picture of all that?

How does that all come together when there might be, at least naively, some tension between giving researchers the independence to do fundamental research and then somehow making that all fit into one coherent research program?

Our set goal for our research program has been getting to an automated researcher for a couple years now.

We've been building most of our projects with this goal in mind.

This still leaves a lot of room for bottom up idea generation for fundamental research on various domains but we are always thinking about how do these ideas come together eventually.

We believe, for example, that reasoning models go much further, and we have a lot of explorations on things that are not directly reasoning models, but we are thinking a lot about how they eventually combine and what this innovation will look like once you have something that is out there and thinking for months about a very hard problem.

I think this clarity of our long-term objectives is important but it doesn't mean that we are prescriptive about like oh here are all the little pieces.

We definitely view this as a question of exploration and learning about these technologies.

I think you want to be opinionated and prescriptive at a very coarse level, but a lot of ideas can bubble up at a finer level.

Have there been any moments where those things have been in tension recently?

One provocative example could be this new image model that came out recently, Nano Banana from Google.

It's shown extraordinary value, in that lots of everyday people can unlock a lot of creativity when these models are good at understanding editing prompts.

I could see how that would create some tension for a research program that may not be prioritizing that as directly.

If somebody talented on your team came and said guys this thing is so clearly valuable in the world out there we should be spending more effort and more energy on this.

How do you reason about that question?

I think that's definitely a question that we've been thinking about for quite a while at OpenAI.

If you look at GPT-3, once we saw like oh this is where language models are going.

We definitely had a lot of discussions about clearly there are going to be so many magical things you can do with AI.

You will have these extremely smart models that are out there pushing the frontiers of science, but you will also have this incredible media generation and these incredibly transformative entertainment applications.

How do we prioritize among all these directions has definitely been something we've been thinking about for quite a while.

Yeah, absolutely.

The real answer is we don't discourage someone from being really excited by that.

If we're consistent in the prioritization and our product strategy then it just won't naturally fall in.

For us, we do encourage a lot of people to be excited about building this, building kind of like agentic products, whatever kind of products that they're excited by.

But I think it's important for us to also have a separate group of people who protect that.

Their goal is to create the algorithmic advances.

How does that translate, just to build on Anj's question, into a concrete framework around resourcing?

Do you think about x percent of compute resources will go to longer term, very important but maybe a bit more pie in the sky exploration versus there's also obviously current product inference but this thing in the middle where it's achievable in the short to medium term.

Yeah, so I think that's a big part of both of our jobs.

Just this portfolio management question of how much compute do you give to which project?

I think historically we've put a little bit more on just the core algorithmic advances versus the product research but it's something that you have to feel out over time.

It's dynamic.

I think month to month there could be different needs and so I think it's important to stay fairly flexible on that.

And if you had 10 percent more resources would you put it toward compute or is it data curation, people, where would you stick that from like a marginal?

Good question.

Honestly, yeah, I think compute today.

I mean honestly, I do think kind of to your question of prioritization, right?

It's like in a vacuum any of these things you would love to like go and excel and win at.

I think the danger is you end up like second place at everything and not like clearly leading at anything.

So I think prioritization is important, right?

And you need to make sure there's some things you're clear eyed on this is the thing that we need to win.

Yeah.

I think it makes sense to talk about it for just a little bit more, which is that compute sets so much of the destiny, in a way, at a research organization like OpenAI.

And so a couple of years ago I think it became very fashionable to say, "Oh, okay, we're not going to be compute constrained anytime soon because there's a bunch of CMs that people are discovering and we're going to get more efficient and all the algorithms are going to get better."

And then eventually we'd really just be in a data-constrained regime.

And it seems like a couple of years have come and gone and we're still like this is sort of a compute constrained environment.

Does that change anytime soon, you think?

I mean, I think like we've seen for long enough like how much we can do with compute.

Yeah, I haven't really bought that much into the "we'll be data constrained" claim, and yeah, I don't expect that to change.

Yeah, anyone who says that should just step into my job for a week.

There's no one who's like, "Oh, I have all the compute that I need."

Historically, the job of advancing fundamental research has largely been a mandate that universities have had.

Partly for the compute reasons you just described, that hasn't been the case for frontier AI.

You guys have done such an incredible job kind of channeling the arc of frontier AI progress to help the sciences out.

And I'm wondering when those worlds collide, the fundamental world of university research today and the world of frontier AI, what comes out?

So I guess I personally started as a resident at OpenAI and it's a program that we had for people in different fields to come in, learn quickly about AI and become productive as a researcher.

And I think there is a lot of really powerful elements in that program.

And the idea is just like, could we accelerate something that looks like a PhD in as little time as possible?

And I think a lot of that just looks like implementing a lot of very core results.

And through doing that, you're going to make mistakes.

You're going to build intuition for, "Oh, wow, if I set this wrong, that's going to blow up my network in this way."

And so you just need a lot of that hands-on experience.

I think over time, there have been curriculums developed at all of these large labs in optimization and architecture and RL.

And yeah, probably no better way than to just kind of try to implement a lot of those things and read about them and think critically about them.

Yeah, I think maybe one other nice thing that you get to experience at academia is persistence.

Of like, "Oh, you have a few years and you're kind of trying to solve a problem and it's a hard problem and you've never dealt with such a hard problem before."

And yeah, I do feel like this is a thing that's like, well, currently the pace of progress is very fast.

Maybe also ideas tend to work out a little bit more often than they did in the past because deep learning just wants to learn.

And getting your hands on a more challenging problem for a little bit, maybe being part of a team, attacking an ambitious challenge and getting that feeling of what it feels like to be stuck and what it feels like to finally be making progress, I think is also something that's very useful to learn.

How does external perception, reception of a particular product launch impact how you prioritize something?

Is it to the extent where perception and usage in the case where they're married, obviously there's probably a clear directive there, but in a case where maybe they're divorced a bit, does that impact how you think about roadmap or where you emphasize resources?

So we generally have some pretty strong convictions about the future and so we don't tie them that closely to the short-term reception of our products.

Of course, we learn based on what is going on.

We read other papers and we look at what other labs are working on.

But generally, we act from a place of fairly strong belief in what we're building.

And so that is for our long-term research program. Of course, when it comes to product, I think that the cycle of iteration is much, much faster.

With every launch, we are trying to aim it to be something that's wildly successful on the product side.

And I think from a fundamental research perspective, we're trying to create models with all of the core capabilities needed to build a very rich set of experiences and products.

And there are going to be people who have some vision of one particular thing they could build and we'll launch it and everything we launch, we really hope it goes wildly successful.

And we get that feedback and if it's not, we'll shape our product strategy a little bit.

But yeah, we are definitely also in the business of launching very useful, wildly successful products.

It feels like because of the completely unbridled pace of progress that we've just spent a lot of time talking about, a lot is going to change over the next few years.

It's really hard to predict.

I imagine 10 years out, let alone 10 months out.

And so my question, I guess, is through all that change that the frontier of AI is going to bring, what are some priors that you actually think should stay constant?

Is there anything?

Well, one clearly is that we don't have enough compute.

Is there anything else that you think doesn't change that you think would be strong, reasonably held priors as constants?

I think more broadly than compute, there's physical constraints of well, energy, but also like, at some point, not too far, like robotics will become a major focus.

And so I think thinking about the physical constraints is going to remain important.

But yeah, I do think on the intelligence front, I would not make too many assumptions.

Very few startups can get to the scale that you have both from an employee perspective, but also revenue count and maintain that breakneck speed that you probably had, I mean, seven, eight years ago when you both joined.

What's the secret sauce to doing that?

And how do you continue to maintain this pressure almost to ship as quickly as possible, even though you're kind of on top now?

I think one of the clearest markers that we have really good research culture, at least in my mind is, I've worked at different companies before and there is a real thing, which is a learning plateau, right?

You go to a company, you learn a lot for the first one or two years, and then you just find kind of like, I know how to be fairly efficient in this framework and my learning kind of stops.

And I've really never felt that at OpenAI, just like that experience you describe all these really cool results bubbling up.

You're just learning so much over a week, and it is a full-time job to kind of stay on top of all of it.

And that's just been very fulfilling.

So yeah, no, I think that's a very accurate description.

We just want to generate a lot of really high quality research and it's almost a good thing.

Like if you're generating enough that you're barely able to keep on top of it.

Yeah, exactly.

I think the development of the technology is a driving force here, where maybe we would kind of become comfortable after like a few years working in a given paradigm, but we are always on the cusp of that new thing and trying to reconfigure our thinking around the kind of new constraints and new possibilities that we're going to be faced with.

And so I think that kind of creates this feeling of constant change and the mindset of like always kind of learning the new thing.

Well, you know, one thing that came up in our research about things at OpenAI that have not changed through a lot of the change is the trust that the two of you guys have in each other.

Because I think there was an article or profile of you guys recently in the MIT Tech Review, and that was also one of the highlight themes that your chemistry, your trust with each other, your approach, something a lot of people at OpenAI have come to treat as a constant.

So what's the backstory?

How did you guys build trust there?

How did that happen?

It's like asking, have you ever seen that in When Harry Met Sally?

I feel like you're on the couch and now you gotta...

What's your meet-cute?

Yeah, exactly.

Well, I do think, you know, we started working together a little bit more closely when we kind of had the first seeds of working on reasoning.

I think, you know, we...

At the time, you know, that wasn't a very popular research direction to work on.

And I think both of us kind of saw glimmers of hope there.

And, you know, we were kind of pushing in this direction, kind of figuring out how to make it work.

And yeah, I think over time, kind of growing a very small effort into an increasingly larger effort.

And I think that's kind of where I...

Yeah, really got to kind of work with Jakub in depth.

I think he's just really a phenomenal researcher.

I think, you know, any of these rank lists, like he should be number one.

Just his ability to take any very difficult technical challenge and almost like personally just kind of think about it for two weeks and just crush it.

It's incredible that he has kind of the wide range that he does in terms of understanding, as well as the kind of depth where he can go and just personally solve a lot of these challenges.

Now you get to say some nice stuff about him.

I have to say anything nice about him.

Thanks, Mark.

Yeah, yeah, I think the first like big thing that we did together was like we started seeing like, okay, like we think this algorithm is going to work.

And so, you know, I was thinking like, okay, like how do we, you know, direct people at this?

And we're talking with Mark like, oh, we should establish a team that's actually going to make this work.

And then, you know, Mark went and actually did this, right?

Like actually kind of like got a group of like people working on very different things, like got them all together and created a team with like incredible chemistry out of like this whole disparate group.

And that was like such an impressive thing to me.

And yeah, I'm really grateful and inspired to like kind of get to, you know, work with Mark and kind of experience that.

Yeah, I think this incredible capacity to both, you know, understand and get engaged and think about the technical matter of the research itself.

But then coupled with this like great ability to lead and inspire teams and create an organizational structure that, you know, in this whole kind of mess of chaotic directions actually like is coherent and able to gel together.

Yeah, very, very inspiring.

Well, on that note, some of the greatest discoveries in science, especially in physics have often come from a pair of collaborators often across universities, across fields.

And it seems like you guys have now added to that tradition.

And so we're just super grateful that you guys made the time to chat today.

Thanks for coming by.

Thanks for being with us.

Thanks for listening to this episode of the a16z podcast.

If you liked this episode, be sure to like, comment, subscribe, leave us a rating or review and share it with your friends and family.

For more episodes, go to YouTube, Apple Podcasts and Spotify.

Follow us on X at @a16z and subscribe to our Substack at a16z.substack.com.

Thanks again for listening and I'll see you in the next episode.

As a reminder, the content here is for informational purposes only.

It should not be taken as legal, business, tax, or investment advice or be used to evaluate any investment or security, and is not directed at any investors or potential investors in any a16z fund.

Please note that a16z and its affiliates may also maintain investments in the companies discussed in this podcast.

For more details, including a link to our investments, please see a16z.com/disclosures.
