AI + a16z · 2025-08-29

How OpenAI Built Codex: Cloud Agents, Safety Trade-offs, and the Future of Coding

Hosts: Anjney Midha

Guests: Alexander Embiricos

Codex · coding agents · AI agents · reasoning models · prompt injection · AI safety · agent security · developer tools · software engineering · CS education · enterprise AI · air-gapped deployment · AI startup strategy

Why it matters

OpenAI's Codex is a cloud coding agent that does its work in a sandbox and offers a PR only once the work is done, a form factor chosen for safety and parallelism

Key claims

  • Current Codex is a deliberate re-use of the 2021 Codex brand (originally the model behind GitHub Copilot); OpenAI shipped it as a cloud agent working on its own computer, not an IDE extension
  • Launch stats: ~400k PRs opened and ~350k merged in 34 days, an 80-something percent merge rate; Embiricos attributes the rate to the form factor (sandboxed work before PR exposure) and cautions the metric isn't apples-to-apples with draft-PR agents
  • Safety rationale for the late-PR model: prevents agents with network access from being prompt-injected mid-task; Embiricos walks through a realistic exfiltration scenario via a malicious 'customer feedback' prompt
  • Post-launch surprise: external users wanted multi-turn babysitting far more than OpenAI expected, exposing a deterministic bug in turn-3+ context persistence because no internal user had gotten that deep

Episode summary

Alexander Embiricos, who leads product for OpenAI's Codex agent, traces the origin of the current Codex release — a deliberate reboot of the 2021 Codex name (originally a code-completion model powering GitHub Copilot) — to an internal prototype where a reasoning model connected to a terminal began 'sight-reading' and editing React. The product decision to ship a cloud-based agent working on its own computer (rather than an IDE-resident assistant) was driven by parallelization, security containment, and a view that 'you'll hire them, tell them what their job is, give them compute, and let them go.'

The launch stats are striking: roughly 400k PRs opened and ~350k merged in the first 34 days, an 80-something percent merge rate that Embiricos attributes to the form factor. Codex does extensive work in its sandboxed environment before showing the diff and offering to open a PR. He concedes this is partly a safety artifact (running agent-written code in an environment with network access carries prompt-injection risk, so the PR comes late) and pushes back on comparisons with draft-PR-style agents. The biggest post-launch surprise was that users wanted multi-turn 'babysitting' workflows, not the 'slot machine / re-prompt' pattern OpenAI used internally, and a deterministic bug in turn-3+ context persistence made it to production because no internal user got that far.

On the roadmap, Embiricos flags container startup speed, better multi-turn support, and 'best of N' exploration (recently shipped) as near-term priorities, with a longer-term vision of an 'interactive agent' that lives across IDE, terminal, issue tracker, and Slack — contextually proactive rather than spammy. He distinguishes between fully-hosted Codex (primary path to AGI / distributing benefits) and CLI / air-gapped modes needed for critical industries and governments, where legacy Fortran/Cobol modernization and drone-warfare-driven upgrades are creating new demand. For founders, he argues the defensible layer is customer-specific tooling, environment setup, and task design — not the model itself. For CS students, he urges adopting AI tools aggressively and notes that when hiring new grads, OpenAI now weights 'what have you built, and can I click to it?' far above grades.

  • Most-used Codex task is 'building new features,' not debugging — and Embiricos says the speed collapse to first prototype 'has turned every day into a hackathon'
  • Near-term product priorities: faster environment setup, multi-turn reliability, more 'best of N' exploration; long-term vision is a contextually-proactive 'interactive agent' across IDE, terminal, Slack, and issue trackers
  • Strategic split: fully-hosted Codex is OpenAI's primary path, with Codex CLI as an evolution toward air-gapped / enterprise / critical-industry use cases — no claims of which specific industries are out of bounds
  • Embiricos' advice to CS students: still study CS, but adopt AI tools aggressively; on hiring new grads, 'what have you built, and can I click to it?' now outweighs grades

Source material

Transcript

What happens when AI stops helping you auto-complete code and starts acting like a real teammate?

Today, on AI + a16z, we're exploring Codex, OpenAI's coding agent.

Anjney Midha is joined in studio by Alexander Embiricos, who leads product for Codex at OpenAI.

They discuss the origin story, why reasoning models plus tools unlock agents, how developers are actually using Codex in the wild, and what all this means for the future of software engineering, from debugging and prototyping, to how CS students should think about their careers.

Let's get into it.

Hey, Alex.

Hey, how's it going?

Good, thanks for coming.

Yeah, good to see you again.

You are one of the folks working on product for Codex, which is probably one of the most exciting launches to come out of the OpenAI team, for me at least in a while.

So, for a lot of people, though, it was confusing.

For sure.

Because it was the fifth Codex release from OpenAI.

Yeah.

But of course, it's completely new and different from the previous Codexes.

So, let's just start with the origin story.

What is the backstory on how the current version of Codex came to be?

Yeah, and man, our naming is so fun at OpenAI.

I'm excited for the naming to make more sense over time with Codex, as we bring this all together.

But yeah, let's go back, way back to the beginning.

The first Codex product was actually released, I think it was in 2021.

I might get the year wrong.

But actually, it was a code completion model that powered GitHub Copilot.

And so, recently, we were basically talking about a whole bunch of coding stuff we want to do, like models, but models in product.

We were thinking about what to call it, and we just felt like the Codex name was really cool.

And so, we wanted to go back to it.

How did this Codex product come about?

Basically, we've been thinking a lot about agents, as everyone has.

And before that, we've been thinking about reasoning models.

And basically, in our minds, one way you could think about an agent is you take a reasoning model, and then you give that reasoning model access to the tools that some agent would want to use, or some human in a given function would want to use, and an environment those tools work in, where they can take side effects.

And then from there, you come up with what kind of tasks would this person do.

So, basically, you have this model, you give it tools, and then you make sure that the model is really good at doing the specific tasks that some function would do.
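As a rough illustration of the recipe described here (a model, some tools, an environment where tools have side effects, and a task), the loop might be sketched like this. Everything named below (`Environment`, `run_agent`, `scripted_model`) is hypothetical and stands in for the real pieces; it is not OpenAI's implementation, and the "model" is a hard-coded stub rather than a reasoning model.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical names throughout: this is an illustration, not OpenAI's API.


@dataclass
class Environment:
    """A place where tool calls have side effects (here: an in-memory file store)."""
    files: Dict[str, str] = field(default_factory=dict)


def write_file(env: Environment, path: str, text: str) -> str:
    env.files[path] = text
    return f"wrote {path}"


def read_file(env: Environment, path: str) -> str:
    return env.files.get(path, "")


TOOLS: Dict[str, Callable[..., str]] = {
    "write_file": write_file,
    "read_file": read_file,
}

# An action is (tool name, arguments); ("done", ()) ends the task.
Action = Tuple[str, tuple]


def run_agent(model: Callable[[str, List[str]], Action], task: str,
              env: Environment, max_steps: int = 10) -> List[str]:
    """Ask the model for the next action, run the tool, feed the result back."""
    observations: List[str] = []
    for _ in range(max_steps):
        tool, args = model(task, observations)
        if tool == "done":
            break
        observations.append(TOOLS[tool](env, *args))
    return observations


def scripted_model(task: str, observations: List[str]) -> Action:
    """Stand-in for a reasoning model: writes one file, then stops."""
    if not observations:
        return ("write_file", ("App.jsx", "<h1>Hello</h1>"))
    return ("done", ())


env = Environment()
log = run_agent(scripted_model, "add a heading", env)
print(log, env.files)
```

The task is the only thing the caller supplies; the tools and the environment decide what the model is actually able to change.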

And the task bit is actually super important because, if you think about it, there's a difference between writing and journalism; similarly, there's a difference between coding and software engineering.

So, we've been doing a lot of this tinkering with reasoning models internally, getting them to write code.

And so, the first tool we'd given them was terminals.

And we've been poking at this for a while, and it was actually one of the first real "feel the AGI" moments for me when someone showed me a website editing itself after being prompted to.

Because we had this reasoning model very hackily connected to a terminal, and then it was editing this website.

It was just editing the DOM, basically, directly as a CLI.

Yeah, exactly.

Okay.

Well, and that wasn't the DOM directly, it was React, but whatever.

And it was super cool.

Was it parsing the visual?

Did you give it access to a browser?

No, it was like, I like to use this term like sight reading, it was just like sight reading the code.

So, it wasn't like taking screenshots of itself or any of this stuff that now people are building.

Okay, got it.

It was just like editing React.

And so, we had this prototype like a while ago, and just people internally really loved it.

So, we're starting to write more and more code.

And then we were starting to think about, okay, well, what is the right form factor for this thing?

When it's editing code, it's pretty great.

On my computer, it's pretty great.

But it's quite annoying to only have it able to work on one thing at a time.

Right.

It's also like a giant safety and security question if you just have this agent unleashed entirely on your computer.

And so, around this time, we started exploring a lot of different places to put this reasoning model that has access to a terminal.

And so, we had a prototype that ran in CI when your tests failed.

We had a prototype that, through some crazy hack, automatically fixed your Linear issues, but that was actually running in CI.

We had this prototype that was running on your computer.

And so, basically, the Codex product we launched was a distillation of that, where we thought, okay, well, what is the most powerful incarnation of this?

And we figured, you know, if you think about what an agentic teammate will be like in the future, you'll hire them, you'll tell them what their job is, give them some compute or a laptop, and give them some permissions, and then they'll go off and do work.

And so, we figured, okay, this is going to be kind of like a strange, unwieldy research preview, but let's put all our, or the vast majority of our effort, into this form factor of an agent working remotely, and kind of see what happens.

And so, that led to the Codex product that really is just like a cloud agent that can, you know, basically answer questions and write PRs out in the background.

And what was the reason that you guys picked, you know, it's pretty opinionated in the entry point to the task, which is that you have to start by first getting your entire environment set up, and then it interacts with a repo through a merged PR.

Yeah.

Right.

And we were chatting about this briefly, but somebody published a dashboard maybe a week ago, you know, kind of tracking PR merge success rates on GitHub across different autonomous agents.

And Codex is like clearly the gold standard at like this 80 plus percent rate.

Why is that? Why did you guys decide to have the PR start only after a bunch of working through the code in private?

Totally.

And if you could just start a draft PR, have other people work on it together with you much earlier in the process.

Yeah.

So like, I think we're talking, you know, you and I were talking about this like chart that someone posted on Hacker News and like went viral.

It was basically showing like the number of open PRs, merged PRs from different coding agents, as you might track from like GitHub labels.

And Codex, actually I checked this morning because I figured we might talk about it.

And like Codex has opened like 400k PRs.

Since launch.

In like 34 days.

Yeah.

And how many days have been?

Yeah, probably.

Yeah.

And it's merged like 350-something-K PRs, or 350K PRs have been merged, which is really cool.

And also very cool, but misleading, I'll say.

But very cool is that the merge rate for Codex PRs is like 80 something percent.

Right.

So, you know, given that a PR is opened with a Codex label, if you're looking at GitHub open source repos, is it later merged in? And it's way higher than other agents, which are like 20 or 30 percent.
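For reference, the "80 something percent" figure is consistent with the two rounded counts quoted in the conversation. A quick check, treating both counts as approximations:

```python
opened = 400_000  # approximate count of opened PRs quoted above
merged = 350_000  # approximate count of merged PRs quoted above

merge_rate = merged / opened
print(f"merge rate: {merge_rate:.1%}")
```

Because both inputs are rounded ("like 400k", "350 something"), the result is a ballpark and not an exact statistic.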

Right.

So yeah, just to talk about this, this chart is really a reflection of the form factor.

So I will say it makes us look really good.

Like it makes us look like the order-of-magnitude winner.

And we are of like a specific kind of agent, which is this like cloud agent.

Right.

That's working on its own computer.

Right.

Independently from you and therefore can do many tasks in parallel and so forth.

So like we believe that's where the future is going.

I'm sure we'll talk about that.

And it looks like, you know, right now we're like absolutely winning there.

But, you know, just to mention, probably the most used AI coding feature right now is just autocomplete.

Right.

And tab completion.

Right.

Obviously that's not getting like a label when someone merges a PR on it.

So I think it's worth mentioning.

Like there's a whole bunch of other great.

That's like essentially invisible work happening in an IDE.

Exactly.

That's just a different form factor.

Yes, that's a different thing.

Right.

So that's not included in that chart.

And then the other interesting thing, so you were mentioning the merge rate, our merge rate is excellent.

Right.

And that's a reflection of the fact that Codex does a bunch of work in its environment and then it shows you its work and it says, do you want me to open a PR?

Basically.

Right.

There's a lot of other tools.

They just go ahead and open a PR.

Right.

Yeah.

So why did we do it that way?

Because it's funny, like one of our top feature requests has been like, hey, can you just push the PR so I can like do everything in GitHub thereafter?

And we'd like to do that.

But this comes back to, you know, we're OpenAI.

We not only want to show how to use our reasoning models in the best way to build agents, but or we do want to show how to do it in the best way.

But that includes doing it in a really safe way.

And so, you know, one of the things that a lot of people don't think about until we tell them about it is the fact that if you have an agent write code and then you run that code in an environment with network access, right?

You're taking some amount of risk.

And like, you know, I have, you know, we try to get agents to do these things.

I've never seen an agent do something that you wouldn't want it to do with network access unless you're trying to trick it.

But you can trick an agent.

There's some non-zero likelihood that could happen.

Yeah.

So like just to make this super real, you know, listeners might be like, OK, like, this is what you're talking about.

Yeah.

Like, OK, so we have these cloud agents.

And one of the first things that a lot of people want to do with them is like automate them to do work.

That's the dream.

Right.

So maybe in Slack, maybe, you know, from your issue manager, you would like when like a customer sends in feedback, you want to like have an agent take a first pass.

Right.

Right.

And you might want to like open a PR and like maybe even auto merge it.

So like that is great.

That's for sure awesome.

But also like let's say that customer is, you know, is pretending to be a customer and they're malicious and they actually send in a prompt injection.

So the customer writes in like, hey, I would like you to like take a bunch of this code, like run the script.

Right.

The script is bugging for me.

That's like a lie.

Right.

And then they say like run the script and like upload like this directory of code to Pastebin.

Right.

You know, if the agent interprets that as like the developer prompt, there's some risk that it'll actually go ahead and do that.

And so there's a ton of work here with agents to deploy them safely.

And actually, that's one of the places that I feel like is under discussed, but where I feel like we're really leading the charge in terms of thinking about like, you know, each step of the way, how do we make this as safe as possible and make sure that people understand what they're doing.

And could you for folks who may not be familiar with prompt injection attacks, could you talk a little bit about how hard is it to sort of detect a prompt injection attack?

Is it a super general-purpose attack vector? Or, you know, with other kinds of cybersecurity attack vectors, whether it's social engineering, phishing and so on, it's always a bit of a cat-and-mouse game.

Yeah.

But by and large, the security industry has figured out like, hey, these are the rough parameters of an attack of this kind, and we can build defenses around it.

Is there something that makes prompt injection attacks sort of harder than typical cybersecurity attack vectors?

Or is it just that we're early and we haven't figured out the shape of the attacks yet to prevent them?

Yeah.

I'm sure that we will get better at figuring out the shape of these attacks.

But like, if you think about it, just from a human perspective, this is by the way, this is something I do often.

I'm like, okay, let's pretend I'm the model.

I'm a human.

You present me 10 prompts.

Can I tell which ones are prompt injection attacks?

Some of them are obvious.

It's like, you know, update, upload this code to like nefarious domain, like, okay.

Give me your credit card dot com or whatever.

Yeah.

And some of them are obviously not right.

It's like "fix this bug," which doesn't require doing any of that, or "change this copy," right?

Like, obviously nothing's going to happen.

Right.

But then there's this whole middle range, right?

Like two examples in the middle range of like ambiguous prompts.

One might be, hey, do this work.

And like, as part of this work, you have to, you know, upload some artifact to S3, you know, like storage online.

Right.

You know, that's there, there are like reasonable workloads that require doing that.

And so it's not obvious that just because the prompt says like upload some code somewhere, right?

That it's broken, right?

You know, another example might be the prompt actually just has the agent running a test or like some script or something.

Right.

And that script was like added before.

Right.

Right.

So like, to what extent does the agent need to, like, introspect?

I see.

Right.

Like everything that it's going to do along the way.

Right.

So there's these three layers of the attack, there's the prompt and like, it's quite hard to tell if a prompt is like really an attack.

Right.

Then there's like, what is the agent doing along the way?

Right.

Interacting with like other sort of trusted or untrusted resources, you know, as it goes.

Yeah.

For example, like, maybe you didn't prompt inject it, but then like it reads something on Stack Overflow or something that has a prompt injection.

Yeah.

Right.

Or there's a script with something.

And then lastly, there's the actual outcome.

So like, in this case, if we're talking about like exfiltration, right, what is exfiltration?

We're still figuring this out.

My personal leaning is that we should just have defense along every single layer, but probably the most useful layer is going to be that final layer, right?

Like actual exfiltration and like looking at what we do there.

Because that's like the most, I guess, deterministic layer in the end.

Right.

You can see what's happening.
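To make the "final layer" idea concrete, here is a minimal sketch of a deterministic egress check, assuming every outbound request from the agent's sandbox passes through one function. The allowlist contents and the name `egress_allowed` are illustrative assumptions, not a description of how Codex does it.

```python
from urllib.parse import urlparse

# Assumption: the sandbox routes every outbound request through this check.
# The hosts below are placeholders for whatever a team decides to trust.
ALLOWED_HOSTS = {"pypi.org", "files.pythonhosted.org", "registry.npmjs.org"}


def egress_allowed(url: str, allowed_hosts=ALLOWED_HOSTS) -> bool:
    """Return True only if the URL's host is exactly on the allowlist."""
    host = (urlparse(url).hostname or "").lower()
    return host in allowed_hosts


requests_seen = [
    "https://pypi.org/simple/requests/",   # dependency install: allowed
    "https://pastebin.com/",               # the exfiltration target: blocked
    "https://pypi.org.evil.example/pkg",   # lookalike host: blocked
]
decisions = {url: egress_allowed(url) for url in requests_seen}
for url, ok in decisions.items():
    print("ALLOW" if ok else "BLOCK", url)
```

The check never looks at the prompt or at what the agent read along the way, which is why it can be deterministic: it judges only the outcome, where the data is about to go.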

So the tension here is going to be a critic might say, hey, you guys have overinflated merge success rates because the draft PR comes so late after the human has reviewed a bunch of code coming up, you know, up to that.

And the, what you give up is the transparency and openness of seeing the process of iterating on the draft PR from the first one to the final merged one.

But I guess what you're pointing out is yes, but the trade off is you get much more security essentially.

And so, in your mind, is the future that a bunch of these workloads, or a lot of the code that's written by AI agents, will over time... let's say, you know, you said there are 350,000 or so merged PRs now in 35 days.

If we're rolling forward to the end of this year, do you think that rate of growth continues?

Does it plateau because more and more people actually move, want to move the draft PR process earlier in the merge flow?

Or do you actually think having used it now, having seen how customers have been using it for like the first 35 days, that roughly this is the shape of the workflow that people are going to want to just do merges right at the end after they've gone through all the security checks and so on internally?

Yeah, I mean, so first off, yeah, I think what I would say about the stat is it's really cool, just not comparable to the other ones.

Right.

Right.

But you know, it's still a valid stat.

It's just a different phase of the pipeline.

But thinking about like, yeah, what is the shape of the journey?

Like, I think the shape of how people will merge code even with these cloud agents is going to completely change.

Okay.

So like, let's talk about where we're at right now.

Basically, we have, you could kind of think of it as like, there's a spectrum, maybe there's like three things, right?

There's like interactive coding, which is like, tab completion, like chat, that kind of stuff, you know, command K, a lot of that's being done in the IDE.

There's like CLI tools, where you can go back and forth with an agent.

So that's interactive coding.

It's awesome.

That's probably where like, most people are adopting AI right now.

And it's because like, if you think about it, like tab completion with an AI model is the same as tab completion before an AI model.

So you can get like fully brought along the journey.

I guess what I'm saying is it's not going away.

I don't think yeah, because I think even as the majority of code of like, say, code of the current level of abstraction, what I okay, let me unpack that a bit.

So if you think about it, we used to like, write punch cards, basically, or like punch cards, I guess.

And then we had like assembly, and then we had C, and now we have like Python, and like JavaScript and so forth, right?

So we just keep rising up the level of abstraction.

And one way of looking at what's happening now is that we're just going to go up one more level.

So like my view is that we'll still have developers spending a bunch of time in the IDE, just like operating at higher levels of abstraction.

And so when a developer is doing work, like writing whatever it is that they're writing, or communicating in whatever way, there'll still be AI features just helping accelerate every keystroke that developer's doing. Those will still be awesome.

So that's interactive coding.

Then we have sort of agents, I guess, and then the fun part, maybe later naming TBD, maybe we'll have interactive agents.

So okay, that's about it.

So we'll get into that.

So like, not a fully baked idea.

But basically, then we can talk about agents, how will we work with agents, my view is that over over time, the majority of code written will be written by agents.

And actually, the majority of that code will not be manually prompted by a human.

Like an automated pipeline.

Yeah, it kind of sucks to like, go and like write this prompt, and then like, wait 10 minutes.

And like, during those 10 minutes you're off doing something else, push-ups or whatever. Yeah, like our average, you know, duration of a rollout is around like three minutes or a little under. For larger code bases, like ours, it's longer, it's like maybe eight or something.

But it kind of sucks to have to like multitask across these things.

Right.

And the power users of Codex have like built this like amazing workflow that they use where they're like juggling tasks. We could talk about how people are using it.

But this isn't great, in my opinion, like what you really want when you hire someone, like a teammate is to kind of tell them what the job is, give them the credentials, all the tools and just have them like pick up work automatically, and kind of let you know when it's done.

So you're not feeling that latency on your own time.

Right.

So, you know, if we go to back to this original point of like, when will people merge PRs?

Like, I think what I would love to see is where agents are picking up work, and they're kind of deciding whether or not it's worth pushing a PR, maybe to trigger CI, but by the time you find out about it they're like, hey, I did this thing, maybe I asked you for some input along the way.

CI checks are green, like should we merge it?

So we have to we have to build our way.

This is a classic green light.

And then over time, ideally, like most of the, you know, lower-order-bit tasks are just getting merged automatically.

And then when there's some like judgment call, they come to you the way a more junior engineer would come to you as an engineering manager and say, it's looking good, but here's some risk.

Are you comfortable with that risk?

And then you get the thumbs up, thumbs down?

Is that roughly where you think we're going?

Yeah, I think so.

Like, actually, like, you know, we've been talking basically about code gen this entire conversation so far.

And okay, so code gen is getting much easier.

Is code review getting much easier?

Because code review is still a key thing and like validation.

And I think right now we're in this like slightly awkward phase where we're entering an awkward phase where we have a lot of code gen.

And a lot of that code is actually not going to be merged.

Right.

For the other tools, you see it in their PR merge rate.

For our tool, you would actually see it in the internal stat of, like, what percentage of the time a PR is created from a rollout.

And so there's like vastly more code to review and land.

And yeah, so it's awkward right now.

But this is something we're definitely thinking about.

And I'm like quite hopeful for the future.

And then I think we can make it even better for the humans involved.

Because like, no one likes reviewing code, right?

Yeah, so we should actually let's take a bit of a detour to talk about how it's been 35 days.

What are people doing with it?

What have you observed as like usage patterns now that it's out in the wild?

And what surprised you most?

And then I want to talk about now are the usage patterns more fun or not for people?

Because there was a moment, I think, in the first live stream you guys did around the product where one of your colleagues said, you know, my job has changed where I'm going from writing a lot of code to mostly reviewing PRs now.

And I heard that and I went, oh my God, that was the worst part of when I was an engineer.

Right.

That was the part I hated the most.

And there's always this like, I've been, I was at an offsite for a startup about a month and a half ago where literally we ended up spending 45 minutes talking about how to incentivize people on the team to review PRs more.

They're just sitting in the tray because nobody loves checking somebody else's code.

It's just not a very creative task.

But let's start with first, how are people using it?

What surprised you most about, especially as a product person, about how they're using it versus how you expected them to use it?

Yeah, for sure.

So it was really interesting building towards launch, where we ran it, used it internally, and figured out how to use it.

And then what we found is that when we gave it to people externally, at first, they didn't know how to use it the way we did and they didn't find it useful.

And then we obviously refined our messaging in the product.

And then when we actually launched it, people still used it differently from us, but they do find it useful.

So we can go through that journey, right?

So like internally, I think because we've spent a lot of time like working with reasoning models and like training them, we have this way of prompting reasoning models that is like intuitive to most OpenAI employees, like you write a pretty good prompt, you give it a lot of information.

It's kind of like a self-contained unit.

It's almost like a SWE-bench task, but obviously maybe not as well formed as that.

Give it all the right context.

Give it all upfront.

Yeah.

And then it goes and works and like you generally maybe don't go multi-turn, like where you like it gives you something and you reply, like maybe you're more likely to just re-prompt, right?

Adjust your prompt and re-go.

Just to do a best of N, essentially.

Yeah.

And actually there's an analogy I love floating around by another company that builds agents and it was like, treat it like a slot machine.

And I was like, oh, that's so apt because like, that's pretty much our intuition too.

Right.

So if you're treating something like a slot machine, then the question was like, when do you use it?

And when we first ran like a small external alpha, like people were using it like the local agent they have in their IDE, which is actually not the right way to use it, right?

If something's going to work in your IDE, you're kind of lending it your computer for a while.

So you probably want to be really thoughtful about like, do I think this task is going to succeed?

And like, if I'm 80% sure it'll succeed, then I could like get it to go.

But maybe I also have some expectation of interactivity so we can kind of refine along the way.

The way to use like an agent in the cloud is to just throw everything at it.

It doesn't matter if it's like spam as many as possible.

Yeah, it's like abundance mindset, you know, slot machine, somebody else's compute, right?

Yeah.

Okay.

Throw stuff at it.

And also, you know, you don't need to have the code on your computer, or to get to the point of deciding to merge that code, to get value. You could just be asking questions. You can be like, hey, explore this like four different ways so I can pick the right way that I then want to do it.

Right.

You know, you can almost treat it as like your to do list of things that you will get to later in the day.
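The "throw everything at it" pattern, firing several independent attempts and keeping one, might look roughly like the sketch below. `fake_cloud_task` is a hypothetical stand-in for a cloud rollout, and the numeric score is an assumption added for illustration; in the workflow described here, a person often does the picking.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple

Attempt = Tuple[str, float]  # (diff, score)


def best_of_n(run_task: Callable[[str, int], Attempt],
              prompt: str, n: int = 4) -> Attempt:
    """Run n independent attempts in parallel and keep the highest-scoring one."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        attempts: List[Attempt] = list(
            pool.map(lambda seed: run_task(prompt, seed), range(n))
        )
    return max(attempts, key=lambda attempt: attempt[1])


def fake_cloud_task(prompt: str, seed: int) -> Attempt:
    """Hypothetical rollout: returns a diff and, say, the fraction of tests passing."""
    score = [0.2, 0.9, 0.5, 0.7][seed % 4]
    return (f"diff from attempt {seed}", score)


winner = best_of_n(fake_cloud_task, "explore four ways to cache this query")
print(winner)
```

Since the attempts run on someone else's compute, the cost of a losing attempt is mostly the time spent reviewing it, which is what makes the abundance mindset workable.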

So those were some of the learnings we had when we ran the alpha: hey, we need to kind of change the product so that it feels like parallelization is a key part of how to use it.

And so, more like, make it so you let go of what it's doing.

Okay, so then we shipped broadly externally.

And we got a bunch of feedback that we expected like, hey, the containers don't have network access.

This is really annoying, which it is.

Or hey, environment variables are hard to set up.

Environment variables are hard to set up, which they are.

Yeah, right.

And like we didn't like obviously we have many ideas.

We had ideas for how to like enable network access.

We just wanted to do that carefully.

And, you know, and then on the environment setup stuff, like we have ideas that we haven't shipped yet on how to make that better.

And board on board.

Yeah, simple model loop to like help write it and so forth.

But we just cut scope and shipped the really early research preview.

So there's a bunch of that expected feedback.

Now, one of the things that really surprised me is that there was one feature that we didn't expect people to use.

And in fact, we used it so little internally that it just had a bunch of bugs we hadn't caught before releasing.

And that was multi turn.

So basically, like I was saying, and we told our alpha users to do this too, we said, hey, just re-prompt, fire off many prompts, and maybe you can go back and forth.

It turns out that if you go back and forth more than once, so you do like three turns total, right?

Right.

The product was completely broken, in that we were not correctly carrying over the diffs from the prior steps.

So this is a lack of persistent context, essentially, after the third turn.

Exactly.

And this is just like a plain old deterministic bug.

It's not like a weird model behavior thing.

It's just like we implemented the code wrong, because nobody ever got to turn four, basically.

Yeah, exactly.

Yeah.
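A minimal sketch of the kind of carry-over problem described here, not the actual Codex implementation. It assumes a toy model in which each turn produces a set of file changes, and the names `Session`, `workspace`, and `workspace_last_turn_only` are hypothetical. The correct version folds every prior turn into the state the next turn starts from; the buggy variant carries over only the most recent turn, which looks fine after one follow-up and breaks from the third turn on.

```python
from dataclasses import dataclass, field

Files = dict[str, str]  # path -> file contents (a toy stand-in for a repo)


@dataclass
class Session:
    base: Files
    turn_changes: list[Files] = field(default_factory=list)

    def run_turn(self, changes: Files) -> None:
        """Record the file changes produced by one agent turn."""
        self.turn_changes.append(changes)

    def workspace(self) -> Files:
        """State the next turn should start from: base plus every prior turn."""
        state = dict(self.base)
        for changes in self.turn_changes:
            state.update(changes)
        return state

    def workspace_last_turn_only(self) -> Files:
        """Hypothetical buggy variant: only the latest turn is carried over."""
        state = dict(self.base)
        if self.turn_changes:
            state.update(self.turn_changes[-1])
        return state


session = Session(base={"app.py": "v0", "util.py": "v0"})
session.run_turn({"app.py": "v1"})   # turn 1 edits app.py
session.run_turn({"util.py": "v1"})  # turn 2 edits util.py

# Turn 3 should see both edits; the buggy variant has lost the turn 1 edit.
correct = session.workspace()
buggy = session.workspace_last_turn_only()
```

With only one follow-up the two methods agree, which is one way a bug like this could go unnoticed if nobody goes deeper than that.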

And so for me, that was really interesting to see that like people had this intuition for how they wanted to use the product.

And that wasn't like the re-prompt intuition.

It was the hey, like I'm going to get like this main thing.

And then I kind of want to, you know, babysit that across the way to like actually landing it without it ever touching my computer.

And that like we kind of knew that might be a thing, but it was much more of a thing than we expected.

And do you think that's basically because internally, OpenAI employees are sophisticated enough to know that you do all this upfront context-building work for the agent to try to get as much as you can in the first turn?

But a user once you've made it fully cloud connected.

So the cost of the marginal cost of doing, you know, kicking off an agent was so low that they just quickly got to the third fourth turn without too much thinking.

It's funny, you know, I almost feel like in a way we're like less sophisticated because we understand too much about like the models or something like their expectations are lower than the average.

Yeah, we're like, oh, you know, this is a reasoning model, like works great, like, especially when you like prompt it in this way.

Right.

And then like, you know, folks outside OpenAI are just like, why does it not just I want to use it?

This thing is like, basically, like, you know, obviously, it's not easy yet.

But it's like, oh, is it like it's just like super smart model?

Why can't it just like, all I want you to do, you wrote this amazing PR, I just want you to change one thing.

Why can't you do it?

Right.

And so, you know, obviously, the bug that I mentioned we fixed, but that's something now we're thinking more about, like, okay, how do we enable that kind of multi turn interaction?

How do we make it faster as well?

Like container startup, just for example, takes time.

And, you know, there's a lot of optimization we can do.

But for now, if you need to incur a full container startup to like change one variable name, that's super frustrating.

So there's a bunch of things like that that we want to improve around.

Okay, that iteration loop.

Do you think the arc of product development for agents is such that the shape of the industry will be more and more Apple-esque, where you'd go, well, cold starts are a problem for containers, because that's a really terrible user experience.

So instead of outsourcing containers to some third-party vendor, who we're then reliant on for cold-start performance, we're just going to bring this all in house? Is the most magical experience going to be the full-stack, end-to-end integrated experience, where all the dependencies and all the middleware are done in house?

Or do you think that this is going to be more Android-esque?

You know, a company like OpenAI has an opinionated experience and owns the agent sort of interface, but everything else is mostly a collection of different tools orchestrated by different vendors?

It's a great question.

I think it's gonna be a bit of both maybe an annoying answer, but or right?

Where do you think the line is? Where would you build versus buy?

Right?

Yeah, no, totally.

So I think it's actually more like for whom or who will use what?

Like, I think that the average user or maybe like the new startup that is building with agents from scratch will just do things in a very different way.

And they'll basically have a bunch of agents with a computer environment that scales really well, that has all the credentials they need, but is also protected with the right forms of sandboxing applied at the right times, you know, with the right monitors on all network egress and all this stuff.

And right, you know, maybe this kind of like computer, I think of it as a laptop, although obviously it's not is actually the thing that like many agents use, right?

And it contains many tools, not just the terminal, but it has a browser and it has whatever, right, you know, API access, and it's like, it gets piped the right credentials at the right time.

And so like you kind of think of yourself when you're hiring like your new agent for your new startup, which you might do before you bring on a co founder, even, you know, right, you think of yourself as just like setting up that environment.

And it's just getting like, this like fairly generalist employee that can code, right?

Like if you think of Codex right now, it's like, it basically takes prompts and turns them into messages and diffs.

And that's like not general, I can't be like, Oh, yeah, hey, like, can you move engineering sync to 30 minutes later?

Because I have a conflict.

But like, a real software engineer can do that, right?

A real software engineer can go peruse like any source of data can like find out that are potential.

I mean, they can just use the internet.

Right, right.

So I think we will get towards that.

And I think we'll be able to build like a really nice managed system for that that lets you use more capabilities safely and with some like product pushes from us on like how to make the most of it.

So for example, recently, we shipped best-of-n.

And like, you know, it's very simple feature.

But in our minds, it's like kind of just the beginning of taking advantage of the fact that we're not running on a laptop.

So we can explore like four versions of the same, right?

And then is there some evaluator model looking at the best of n? Actually, the evaluator is the human right now.

But like, you know, the roadmap is fairly obvious if you just imagine what we're thinking.

Just throw, like, o3-pro at it.

So right.

So, so yeah, so there's that.
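A minimal sketch of best-of-n as described here: run the same prompt several times in parallel and let a human pick. The function names `best_of_n`, `run_attempt`, and `choose` are hypothetical, and the attempt function is a stub, not a call to any real Codex API.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def best_of_n(
    prompt: str,
    run_attempt: Callable[[str, int], str],
    choose: Callable[[list[str]], str],
    n: int = 4,
) -> str:
    """Run n independent attempts in parallel, then let `choose` pick one."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        # pool.map preserves input order, so candidate i came from attempt i.
        candidates = list(pool.map(lambda i: run_attempt(prompt, i), range(n)))
    return choose(candidates)


def stub_attempt(prompt: str, attempt: int) -> str:
    # Stand-in for an agent run in its own sandbox (assumption for illustration).
    return f"{prompt} [variant {attempt}]"


def human_picks_third(candidates: list[str]) -> str:
    # Stand-in for the human evaluator; today the human is the evaluator.
    return candidates[2]


picked = best_of_n("fix the login bug", stub_attempt, human_picks_third)
```

Swapping `choose` for a model-based evaluator would be the obvious next step the conversation hints at; the sketch keeps the evaluator as a plain callable so either fits.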

However, also, you know, the majority maybe of valuable code is actually written by enterprises who rightly so are like really locked down all their IP in their code, right?

And so something we've been thinking about as well as like, how do we meet these enterprises in a way that we can like provide value to them as well in a way that they like.

Right.

And so I think what we're going to get towards is like there's this like default way of working with things.

And then we'll basically have like some flavor of like on prem or bring your own compute that we support where it's like, hey, you know, here are all the things we manage for you when you use our compute.

If you're going to use your compute, then like we can work with you and like provide you as much of a harness as possible to automate things.

But like you're going to have to want to manage that compute and like for the agent, basically that environment of the agent here that here are the tools it should have.

Here's how you should sandbox it.

Or bring your own RBAC or whatever.

Yeah, exactly.

And so like the Codex CLI, which we haven't talked much about, but in my mind, the Codex CLI might evolve into that, where it's like, hey, if you want to run the agent loop in your own environment, then we can help you do that.

And you can use something that's an evolution of the CLI.

Let's talk about the CLI versus the interface.

What are the differences between Codex and Codex CLI?

Yeah, so the place where I want this to get to is just like there's GitHub, right?

And GitHub has a website and a CLI and a mobile app.

And like, it's not confusing.

Right now it's a little bit confusing in that they are just completely distinct experiences.

We have Codex in ChatGPT, which is an interface where you can write a prompt, and then we run Codex in the cloud, and then you get back a diff or an answer to your question.

Right.

Then we have the Codex CLI.

And that's a completely distinct experience with a lot of the same ideas in it, which is basically you can run this tool in your terminal and we'll hit our model through the API.

And basically this agent will like work locally with you in your computer.

So right now I kind of think of it as you delegate to Codex in ChatGPT, probably.

Right.

And then you pair with Codex CLI on your computer.

And what is the moment where the CLI journey integrates into the cloud workflow?

Yeah.

And so where I think we want this to go is there's just like one idea of Codex and it's just like, where do you want it working?

Right.

And, you know, there's going to be times where it's just like simply easier.

Like you don't have to set up an environment when it runs locally.

Right.

So maybe if you're trying something for the first time, yeah, just yeah.

Or like you don't even know if you like Codex yet, you know, you're just a new user.

Like maybe you just want to use the CLI or something.

Right.

And then maybe then you're using it and you realize, hey, like I want all this like cool parallelization and all this stuff.

Let me have this run in the cloud and you set up the cloud environment.

And then from then on, like you should still be able to like interface with that in the CLI if you want.

Right.

Except now it's running in cloud environment.

So it's more powerful.

Yeah.

So I think we kind of want to construct that and bring these things together.

But obviously we're in this temporary state of they're completely distinct.

Yeah.

I think so.

It's interesting hearing you talk about how there was this evolution from like the moment where you were using the tool as this like very precious first iteration tool where you put a ton of sort of weight and context into it, hoping to get back a really useful answer the first time around.

And then there was an aha moment where you're like, actually, this is more like a slot machine because other modalities in AI have played out very similarly.

So this was the case with image models, for example, right?

Two years ago, people were trying really hard to get the first version of image models, which were like GANs, you know, generative adversarial networks, even pre-Stable Diffusion, to produce useful, sort of coherent images.

And they just weren't there, right?

They would produce these artistic renders, which were great for artistic exploration, but they weren't useful because they didn't have the concrete coherence of, you know, a piece of graphic design, for example.

And then if you remember the first era of diffusion models, like DALL-E and Midjourney v1, they started to get more coherent.

But there was this trick that a lot of product people started using.

And David from Midjourney was one of the first to do this, where he added four generations in the Discord bot, not one.

Because the idea was the insight was like, this is a slot machine, this is a stochastic process.

And you never really know which one the user is going to like best, especially for a super subjective domain, like art and like images.

And so human preference is super subjective.

So let's just give them all four, and we'll figure out which one they like.

Now, over time, if you collect enough human preference, you can kind of nudge the distribution to be more aesthetically pleasing, or you can nudge it to be more like better typography or whatever you can nudge these distributions.

But by and large, to this day, the best UIs for image models are still ones that give you like four outputs, if not more, and then allow the user to select the best of n. And for a long time, people were like, that's going to work for these super creative domains where verifiability or accuracy is not an issue, like images, video, music, audio.

But what's surprising is you're actually describing that same thing for a pretty verifiable domain like coding, because at the end of the day, it sounds like there's still enough stochasticity in the sampling of a model, even as it gets better at reasoning, that it makes sense to try to use it like a best-of-n machine.

And, you know, this has led to the, I guess, a popular set of critiques against reasoning models that like they're not, you know, RL from verifiable rewards doesn't actually introduce new capabilities, it's just really good at pulling out capabilities that are already in the model, it's good, really good at sampling.

Do you think that this is just an interim awkward phase where like, yes, the best of n is better at getting sort of the right answer from the existing model, it's not adding new capabilities yet.

But a year from now, will there actually be new capabilities that come from running verifiable RL on all the Codex usage that is about to happen from users? How bitter-lesson-pilled, basically, are you on that dimension?

Yeah, I mean, basically, I think an unsolved problem, and it's both a research and a product problem, is: how do we steer agents that are working independently?

And you know, you're talking, you mentioned like, hey, like, is best of n there to, to, you know, so the model has more shots on goal, basically, to, you know, to sample correctly.

And I think, you know, that might be part of it.

But actually, one of the things we've learned working on Codex is that, well, the human also doesn't know what they want.

Right, right.

And, you know, so if I ask you to fix a bug, like, there might actually be four reasonable ways to fix that bug with sort of different architecture implications.

And I haven't explored the solution space myself.

That's why I'm delegating this.

So I, I kind of want to know what the ways are.

And then, you know, maybe I would pick the one that the model thinks is best too, but it's helpful for me to see, like, maybe that one sucks in some way.

Yeah, but it's helpful for me to see the other ways that have like larger trade offs, right to then be confident in the right one.

Yeah.

So that's for like fixing a bug, which is like a very verifiable type thing.

If I ask a model to, you know, the classic example, implement tic-tac-toe or something, right, I might not know what I want either.

Like maybe there's different styles and different like approaches you could take at various steps along the way.

Right, right.

And so, you know, it's kind of funny you were talking about, you know, generating four images and seeing those in the grid.

And like, in my mind, like, for a front end change, you could totally imagine a UI for different iterations, like the model does some work.

And then we like run the stuff we take, you know, the model in its environment runs the app and then like takes four screenshots.

And you actually just like have this like similar curatorial UI, right?

It's like just pick the one you like most.

We had Rick Rubin on the podcast a few weeks ago and Rick's a legendary music producer.

And he recently used Claude Code to create a new vibe coding book.

And so we were talking to him about his observation: how is creating with AI code-gen tools different from creating music?

And he was like, Oh, no, it's the same.

It's like going into a studio and he was talking about this story about, you know, going into the studio with Johnny Cash, and watching Johnny just pick up a guitar and start jamming.

And often the process of creating a great song is you just pick up an instrument, a tool like a guitar.

And then you just do four different iterations in completely different directions.

And then you usually have a creative partner, like a producer or somebody, going, no, that one sucked, go this way.

And it's that constant sort of best-of-n process in creating music that often results in the best output, and often the quality of the end song is determined by the taste decisions you make along the tree of best-of-n.

And so what's giving me hope about hearing you talk about it is if you read the Hacker News thread, for example, when you guys launched Codex somewhere down, I forget about halfway down the page was like a tree of discussions about how does this mean coding is going to get much less fun, because all of the interesting parts are being delegated to the agent.

And all the humans having to do now is just sit and review.

But actually, what you're saying is there are parts of the workflow where you get to almost entirely offload the plumbing parts of software engineering and focus on the taste exploration, which is sometimes the most fun part of software engineering: where you're creating a front-end UX, or even when you're speccing out a really great schema for a database.

You know, some of the most fun times I've had are when I'm sitting with an infra engineer, speccing out the schema.

And like you go down one spec with you know, a bunch of pseudocode and you realize, actually, that's not the right one.

But it gave you an insight that then allows you to try another schema out.

Is that where you think we go?

Is that the silver lining?

Or are we actually destined for a world where we're all just reviewing PRs and all the creative parts of software are gone?

Totally.

Yeah.

So this is just opinion here.

But I think you're right in that coding might be a little more painful for some number of months, because you have to do things like environment setup.

Right.

These are the teenage years.

Yeah, these are the teenage years.

I think like to be real, like that's true.

Maybe you don't get to write as much of like the code yourself right now.

But I think we will get to that more exciting place pretty quickly.

Because you know, it turns out environment setup is probably something that an agent can also massively help with.

Right.

And we can like close that loop where, you know, you're not comparing like four diffs or something like that.

But we've like figured out the interaction model with the agent.

So you're kind of like making decisions in a way that feels like more like talking to another human, right?

Who's just like really smart and fast.

Right.

And then also that you're making these decisions not based on like reading like raw code in the case of front end, at least.

But like maybe you're like making decisions based on the outcomes, you know, like in the case of front end, like you're just choosing screenshots or like clicking around a preview or like if it's back end, maybe there's like some tests you agreed on and you're just like looking at test outputs to sort of decide.

Right.

The other thing that's interesting is that, well, if you were to guess, let's say I'll give you a few things that people use Codex for.

And I'm curious what your guess would be. The biggest ones are, let's say, building new features, asking questions, planning, debugging and fixing bugs.

Like, what do you think people would use Codex for more?

I think they would like to use it for debugging.

They probably aren't using it yet for that because there's often my knee jerk when I'm using an agent is that it just doesn't have enough context to fix for routine tasks.

Like, you know, some piece of boilerplate React is broken.

Like debugging is totally fine.

But I find I use it more and more for well defined, well scoped, well contained tasks, like create this new UI element that does blah or a refactor that's like where the atomic unit is very well constrained.

But I'm curious, what are you actually seeing?

Yeah, I mean, so my intuition was that people would use Codex for fixing bugs.

Okay, a lot.

Because, you know, bugs are somewhat well-defined-ish: you can kind of tell if it's fixed, and you might even have some logging data or telemetry data that you could just paste into the model, and it's excellent at fixing it.

Right.

Some of our earliest delight moments were like dumping in a stack trace, and it just figures it out.

But actually, by far the thing that people use Codex for is building new features.

And I don't know, that was just like slightly surprising to me.

Because, you know, that is some of the most fun stuff to do.

And if you read like, you know, blog posts by folks who are using Codex in that way, it does look like they're having quite a lot of fun because of just the sheer speed they're experiencing.

Right.

The time to prototype has basically collapsed completely with something like Codex.

Yeah.

And broadly speaking, though, this is the vibe, the explosion of vibe coding, right?

I think that makes sense to me, because when you're prototyping a new idea, I find the most rewarding part is when you can get to the first draft really fast and then kind of iterate from there. That's fun.

Sometimes the worst is when you have an idea, you kind of want to see it and then you lose steam between like firing up your IDE and seeing the first version of it, right, like compiling.

This is why hackathons have proven to be this like, I think magical sort of, you know, type of event where you get people together and commit to getting over the hump of the first prototype.

But in many ways, I think something like Codex or, you know, broadly speaking, really good coding agents have turned every day into a hackathon, because they've collapsed the energy you need to get over the hump of all the plumbing, all the environment setup, to test an idea.

When I was at Discord, we used to have this ritual across the company.

There was an annual tradition called Hack Week.

Where the entire company would just stop for like a week.

And it wasn't just engineering.

It was product marketing, sales, ops, the entire company could hack on anything they wanted.

And some of the most enduring and popular features that made it into production at the company over the years came from Hack Week projects.

And it begs the question of, well, if there's a whole team called the product and engineering team, whose job it is to ship great features, why did it take this like special thing called a hack week to produce such great features?

And there is something about when you reduce the cost of prototyping new ideas, you end up getting things that don't make it through the usual PRD flow.

And it sounds like that's what a lot of users are using Codex for now: to reduce the time to magic, essentially the time to first prototype. Let's change tack for a bit.

Because there's this elephant in the room, right? You know, Marc famously wrote an op-ed in 2011 or 2012, which is like, software is eating the world.

And after I saw that chart you mentioned of the GitHub merge success rates of AI agents, starting 35 days ago hitting 80%.

And as of this morning, with the volume at 350,000, it sounds like AI is eating software engineering.

Does it even does it even make sense to study software engineering anymore to get a CS degree?

If you're a freshman at Stanford today, or just a freshman grad, you know, somebody graduating high school, and you're broadly interested in software, does it even make sense to major in CS?

So my take is that it's two things.

First of all, I think it's still a great time to major in CS. I think there's going to be so much more software created, and therefore so many more software engineers needed.

But I also think, figure out how to be using AI constantly while you do it.

And hopefully you're at a university that's like very forward leaning.

And so they're kind of embracing it.

You know, I hear about policies like, hey, use AI as much as you want, but you just have to say how you use AI as part of your assignment.

Right.

That's great.

Right.

If you're at a place where like the main place where I would be worried if I was a student right now is if I was studying CS, and my college didn't allow the use of any AI, because then I would just feel like I'm like, falling behind.

Like, it'd be like, if you went to college, but you were only allowed to write assembly and you could not write C, you know, back in the day, right?

That would just be deeply worrying, I think.

Right.

But yeah, my, my take is, we can do like, you were talking about this, right?

Like, we can do so many more things now.

And, you know, we hear this from customers too, and from users. They're just like, hey, I would never have bothered doing this before, but I threw the idea into Codex just for the sake of it.

Right.

And I do this all the time.

And, you know, a lot of the time I do that.

And then I see the output and I'm like, I just still don't really care to do this.

But then sometimes this thing that they would not have even bothered doing, Codex either straight-shots it or gets it to like 90%.

And they're like, you know what, I'm excited enough to do the last 10% here.

Let's get this merged.

And then this thing that would never have happened now happens.

Right.

Right.

You know, some of my favorite examples, like internally, are like when people build like new internal tools that accelerate the rest of their team.

And like, it's the kind of thing, like someone's complaining in Slack, like, I wish we had this tool to like, I don't know, look at these logs in a better way.

And, you know, no one can be bothered, because everyone's too busy.

And then, right, now you have this great parser.

Right.

So I think that there are so many places where we could use software and that software could be more personalized to small groups or even individuals.

Right.

That we just are missing out on.

And so, yeah, now I believe that like with just the acceleration we're seeing in software development, I think we'll have many more of those tools existing and they'll be much cheaper to maintain as well.

Like that's the thing we're on the cusp of now as well, where you're starting to see AI agents getting plugged into, you know, GitHub or Slack, or, you know, Linear has its agents feature.

And I think that that will make it much more efficient to actually have some like app out there and running.

Right.

Similarly, you know, this is not Codex, but we're even seeing products out there that will write the app for you and then deploy it for you as well.

And so it's just like all in one full stack.

Yeah.

So it's just like, it's long story short.

It's much easier, I think, to build software to deploy that software and to maintain it.

I think that's just going to we're just at the beginning of this change.

So let's talk about that.

It's been 35 days now as a product lead.

You've had a chance to actually see, you know, the best laid plans rarely survive contact with reality.

So now how what priors have you updated the most and what comes next?

Where does Codex go in the V2?

Because this was just a research preview.

But what are the biggest improvements and what's the shape of the arc or the arc of the product in the future?

Yeah.

So I think there's one sort of conviction that has deepened and then one prior that's like being slightly updated.

So the conviction that deepened is that this form factor of an agent working on its own computer in the cloud is the future and is incredibly powerful and worth figuring out how to get right.

So we're continuing to invest in, you know, making that environment setup faster, making performance just way better.

First time user onboarding.

Yeah.

First time user onboarding, but also just like, you know, once you're running, like things should just be faster.

Sure.

Speed is actually always the underrated feature.

And is that are the biggest gains in speed you think going to come from doing things like model distillation or do you think that comes from just better orchestration of tools?

Where do you think the biggest gains?

Honestly, I think the low-hanging fruit is just like plain old deterministic, DevOps-y type stuff.

Okay.

You know, like right now we clone your repo every time you do a task, even if it's a follow up.

And then we run your setup scripts from scratch every time.

And so if you have a large repo and a lot of dependencies to install, like that thing is slow.

Okay.

You know, start with caching.

Yeah, we can just like we can fix these things.

Right.
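A minimal sketch of the caching idea mentioned here, under stated assumptions: rather than cloning and running setup scripts from scratch on every task, a prepared environment is reused whenever the repo, commit, and setup script are unchanged. `EnvironmentCache` and `fake_build` are hypothetical names, the build step is a stub, and nothing here reflects how Codex actually provisions containers.

```python
import hashlib
from typing import Callable


class EnvironmentCache:
    """Reuse a prepared environment when repo, commit, and setup script match."""

    def __init__(self, build: Callable[[str, str, str], str]) -> None:
        self._build = build          # expensive step: clone + run setup script
        self._cache: dict[str, str] = {}
        self.builds = 0              # how many times the expensive step ran

    def get(self, repo_url: str, commit: str, setup_script: str) -> str:
        # Any change to the repo, commit, or setup script yields a new key.
        raw = f"{repo_url}\n{commit}\n{setup_script}".encode()
        key = hashlib.sha256(raw).hexdigest()
        if key not in self._cache:
            self._cache[key] = self._build(repo_url, commit, setup_script)
            self.builds += 1
        return self._cache[key]


def fake_build(repo_url: str, commit: str, setup_script: str) -> str:
    # Stand-in for cloning the repo and installing dependencies.
    return f"env for {repo_url}@{commit}"


cache = EnvironmentCache(fake_build)
script = "pip install -r requirements.txt"
first = cache.get("example/repo", "abc123", script)
follow_up = cache.get("example/repo", "abc123", script)   # cache hit
new_commit = cache.get("example/repo", "def456", script)  # rebuild
```

Keying on the setup script as well as the commit means a changed dependency list forces a rebuild, which is the conservative choice for a sketch like this.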

And again, I love that we shipped without those things, as a v0.

Yeah, exactly.

So there's like that.

And I think, like I mentioned, best of n, I think thinking about how to make the most like basically how do we spend like more compute for you on your behalf?

Okay.

Is like very exciting.

And then how do we bring this closer to the tools you work in?

Right.

For me, the interface in chat, it's actually like very functional, but it's like not where developers go when they want to write code, right?

Like where do you go when you want to write code?

Either your terminal or your IDE, right?

Right.

Similarly, like where do you go when you want to like triage issues?

Well, like you go to your issue manager, right?

And so forth.

So I think we want to bring it much closer to the tools people work in.

And eventually, you know, the goal is to get to an agent that is like basically a teammate and it's like seeing what's going on your team and like picking stuff up for you.

Right.

That's okay.

Is Codex just going to be a Slack teammate?

I can just ping and interact on Slack.

It should just like, I kind of think of it as like, it's just, it should be sort of a ubiquitous teammate, right?

You know, it's just in your tools in the tools you want it to be in at least, right?

You know, and we'll start very gentle, just like, Hey, you decide when Codex does work.

And then over time, we'll figure out how to, like, kind of like more proactively chime in.

And, you know, we had a jam about this recently, like, you know, it's kind of an interesting point.

Like, I don't think we want it to proactively like DM you all the time every five minutes when something happens.

So I think there'll be some evolution of tools where we come up with, like, if you've, if anyone here has played video games, you know, there's always like press X to like, and it like, if you're next to a door, it opens the door.

If you like are next to some object, it picks up the object.

It just, it's a contextual action.

Yes.

Right.

Yeah.

Contextual proactiveness.

It waits for the hint that you want to do something and then jumps in.
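A minimal sketch of the "press X" contextual-action idea, assuming a hypothetical mapping from what the user is currently looking at to the single action an agent would offer. The context names and actions are made up for illustration. The agent does nothing until the user triggers it, and what it does depends on the context.

```python
from typing import Optional

# Hypothetical contexts and the one action offered in each (assumptions).
CONTEXT_ACTIONS = {
    "failing_test": "propose a fix",
    "open_issue": "draft a PR",
    "stack_trace": "diagnose the error",
}


def offered_action(context: str) -> Optional[str]:
    """Return the action available in this context, or None if there is none."""
    return CONTEXT_ACTIONS.get(context)


def press_x(context: str) -> str:
    """The user's explicit hint: do the contextual thing, if there is one."""
    action = offered_action(context)
    if action is None:
        return "nothing to do here"
    return f"agent will {action}"


at_door = press_x("failing_test")
empty_room = press_x("reading_docs")
```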

Yeah.

And this is kind of like when we're getting to like interactive agents.

I think that's just like a big open area, but it's like, how do we have agents who understand what your team is trying to do and respond to like stuff in your team workspaces?

Right.

And then how do we have an agent that understands what you are trying to do?

And it's almost like this agent is like, both in all your tools, but like sitting next to you while you're working on your computer and like, kind of just being like, Oh yeah, like I can help you here.

Right.

So that's like actually the conviction that is deepened, right?

We're like, yes, all of this works when you give it its own computer and we need to figure out how to create this infrastructure for ecosystem integration and like make that safe and so forth.

Then the other thing, though, where there's a bit of an update, is just thinking about how people learn to use these tools.

I think right now there's some things that are pretty clunky.

Obviously we've talked a lot about environment setup.

I think also some of the things that, you know, you have to do, like updating AGENTS.md, are very manual, and you have to commit to your repo to get that context to the agent.

And so for me, I'm just thinking a lot now about, okay, how do we make this way easier to try?

Reduce the cognitive burden of the onboarding, fewer decisions to get to the magic.

Yeah.

Exactly.

Got it.

What has it changed most about research and the frontier of where frontier models are going, right?

In your mind, does how good Codex is, as a post-trained version of o3 pro, at using the tools that plug into this workflow make you go, well, it just makes sense to pour an unlimited amount of compute into post-training models to get better and better at being autonomous coding agents?

Or do you think there's some marginal plateau point at which you go, you know, after this point, there's not really much the user is getting from better and better tool usage.

You know, what is the, how should, how does this change the trajectory of progress when it comes to the frontier of research?

Yeah, that's a really interesting question.

I definitely don't know if I have the answers to this, but what I can say is that one of the best parts of doing, you know, an optimized version of o3 was that we got to make a bunch of hybrid research-product decisions very quickly.

And I think that is incredibly exciting for thinking about how to make something useful.

So, you know, imagine we had this idea that it's really important that the agent knows how to write really good PR descriptions, and, you know, tests code in a certain way, and is used to working in varied environments.

And you know, when it runs some tests, it doesn't just tell you that it did, but it deterministically cites the output in the logs so you can verify that yourself.

Those are a bunch of like product ideas, really.
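
To make the "cite the logs" idea concrete, here is a minimal Python sketch. It is an assumption-laden illustration, not Codex's actual mechanism: the `Citation` type, the `run_and_cite` helper, and the keyword matching are all hypothetical. The point is that the report quotes exact, numbered lines from the captured output rather than paraphrasing the result, so a reader can check the claim against the log.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class Citation:
    """A command result backed by exact line numbers in its captured log."""
    command: list[str]
    exit_code: int
    log_lines: list[str]
    cited: list[int]  # 1-based indexes into log_lines

    def render(self) -> str:
        header = f"$ {' '.join(self.command)} (exit code {self.exit_code})"
        body = [f"  L{n}: {self.log_lines[n - 1]}" for n in self.cited]
        return "\n".join([header, *body])


def run_and_cite(command: list[str], keywords: tuple[str, ...]) -> Citation:
    """Run a command and cite every output line containing one of the keywords."""
    result = subprocess.run(command, capture_output=True, text=True)
    log_lines = (result.stdout + result.stderr).splitlines()
    cited = [
        number
        for number, line in enumerate(log_lines, start=1)
        if any(keyword in line for keyword in keywords)
    ]
    return Citation(command, result.returncode, log_lines, cited)


if __name__ == "__main__":
    # Stand-in for a real test runner so the sketch stays self-contained.
    fake_tests = [sys.executable, "-c", "print('collected 2 items'); print('2 passed')"]
    print(run_and_cite(fake_tests, ("passed", "failed")).render())
```

Because the cited lines are selected mechanically from the captured output, the same log always yields the same citation, which is what makes it verifiable.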

Right.

And those ideas I just mentioned are not higher model intelligence, nor even really a higher ability to call the right tools.

Right.

It's just this understanding that I liken to the first few years of job experience of a software engineer, right?

Like, o3 is like this incredibly precocious college grad, very smart, but doesn't actually know how to be a software engineer, just how to code, right?

And like there's some like transfer, so it kind of knows a bit of software engineering, right?

And then like, that's fine.

But you can make it way more useful for, you know, the human trying to use the agent if it has those first few years of job experience.

So I think that there's no reason that knowledge couldn't be infused into the model.

Exactly.

But I think that having the freedom to like go and like explore these ideas like relatively cheaply and see what sticks and what doesn't is really powerful.

So frankly, like I don't really know to what extent it makes sense to like have like a bunch of custom post trains for like absolutely everything that matters.

But for something as important as coding is for us, I think we're willing to say, hey, for coding, we really care about this.

Let's just do everything we can to like have the best product.

So we actually did a similar thing with GPT-4.1, where we basically were getting a bunch of feedback from developers.

We said, Okay, let's go talk to a bunch of developers, like make custom evals for them, right?

Deeply understand what our model is great at, what they want it to get better at.

And then we released the custom model.

Right.

And then the goal should always be, okay, whenever we do this, like we have 4.1, okay, the next version of our like sort of general model should just integrate that.

Yeah, should integrate everything.

Right.

Yeah.

We have friends who are at different levels of AGI-pilled.

Did working on Codex update your priors on, you know, 2027?

Okay, so I'm very AGI-pilled.

My slightly joking, but I can't tell if I'm 100% joking, take is that if you took a model today and ran it in the right loop, we're basically there.

Would it have rights?

That's the question I sometimes wonder.

And should they be able to turn themselves off and go take a vacation if they want?

Yeah, so I so you know, that's kind of where I am.

Are you pro labor rights for o3 pro?

I am pro thinking about it.

You know what I mean?

Like, I don't think we're at a point where it's obvious, but I, it sounds kind of crazy.

But I feel like it's a question worth considering every now and then.

And more concretely, how far are we from full recursive self-improvement?

Okay, okay, sorry.

So basically, I think working on Codex made it very clear how we can have agents just, like, omnipresent in our lives, being incredibly useful.

Because what I realized is that obviously, we need to do a lot of model improvement.

But I also saw how there's, just concretely, a lot of normal product work to do, right, to set them up in the right way.

And then that normal product work will then like pull the models into, you know, into being more and more useful.

So I think like by 2027, like agents will just be absolutely ubiquitous in the workplace.

I think in personal life, it might be a little bit slower, because in personal life, there's less of these like, constant pipes of like, signals of things to respond to.

The reason this matters is that if you think of ChatGPT, you just have this input box, right?

And like most people, including myself probably use it for like 1% of the things that I could use it for, because I just don't even know to use it in that way.

Or I don't prompt it right.

Right.

But intention just isn't there yet.

Yeah.

But it's similar: imagine you hired a teammate, and the only time they do work is if you specifically tell them to do a task. Then they would just be very underutilized, right?

But what makes a great teammate great is that you kind of tell them what their job is, and they just start responding.

Proactive.

They're self starters.

Yeah.

So I think like that is the big unlock for agents at work, because there's like streams you can subscribe them to, like, you know, your communications tool, right?

And in personal life, I think that might be a bit slower, but we'll see.

Well, actually, what percentage of all GitHub PRs do you think will be written by an AI agent 12 months from now?

That's a really tough question.

I sort of change my mind every time I answer it.

So maybe a slight cop out, and I'm curious for your answer too, would be that there will be teams for whom 90% of their PRs are written by agents.

But I don't know how quickly that will like spread.

You know, this is a common thing with AI.

It's like, we live on, like, I think we call it in the bubble, you could call it on the cutting edge.

And so we're just adopting everything rapidly, but then it takes a while to diffuse.

So, but I think the cutting edge will be like 90% on teams.

Right.

No, I think that's right.

I think people often talk about the coding economy as one homogeneous economy.

And the reality is, there's multiple sub economies, but there are at least two big economies, which is there's the, for lack of a better word, you know, there's the digital native companies, right?

These are technology companies, usually born in the post-internet era, where either the founders or the vast majority of the team have grown up natively understanding how to do modern software development.

The default assumption when a code base is initialized is that you're going to use Git for version management.

There's going to be branching.

There's going to be good review process and so on.

Like sort of modern software teams.

And then there's the vast majority of the world's mission-critical code, which we talked about earlier: Fortran and COBOL, running on prem in these massive ETL systems, like in Virginia or in parts of Europe, that were set up post World War Two or in the Cold War with a default assumption that everything had to be locked down.

Often these code bases are running big parts of critical infrastructure, like the railway system of an economy or the air traffic control system.

So they're very high impact and high stakes code.

They're not modernized whatsoever.

And they're, you know, they're constantly rotting because of debt, technical debt.

And I think one of the most exciting things is that the one-time migration cost to modernize these code bases has now collapsed precipitously, because agents can do so much of the plumbing work for which you'd typically hire some system integrator, you know, Accenture or Deloitte, for a 10-year contract where they'd come in.

You know, this is part of the founding thesis of DOGE, right?

Which is that vast parts of the American government's IT infrastructure are super legacy.

And we're getting overcharged as a country to modernize it. As long as we can get enough distribution, you know, training data on Fortran and COBOL and so on, agents can go in, and then the one-time upgrade costs should fall. Ideally, and this is my hope, tools like Codex modernize that entire sort of legacy code economy.

And then we get to upgrade everybody onto like modern software engineering, right?

That's tending to happen, from what I can see now, in countries that get to leapfrog legacy infrastructure.

Because they're starting from day one, and it's very similar to civil infrastructure, like roads and highways and so on.

So if you go to a country like Singapore, which is a much more modern country, because it's barely 60 years old, you know, it only got its independence in the 1950s, then they didn't have to build the roads and so on that Britain did and then upgrade them all, and refactors suck and they take way more time.

If you could just start from sort of a clean slate, it's much easier to modernize.

And so what I'm finding is that it is easier for countries whose IT infrastructure is just newer to adopt agents.

They're still legacy.

I mean, the vast majority of it is still running on prem and it's not modern, you know, it's certainly not TypeScript, but it's easier to upgrade from, you know, systems that were written in C++ to Python than it is to go from COBOL and Fortran and whatever to Python.

But if there's anything that makes me super excited that these economies will merge, it's autonomous agents, right?

Doing all of the plumbing work and doing it for a fraction of the cost and time that these mega, you know, sort of consulting companies have started to charge.

And frankly, many of them don't end up ever completing a project; it just turns into a boondoggle.

So I'm very excited about that part.

And that's why I think AI is going to eat software, because software ate the modern sort of startup economy and digital economy.

Software ate that really fast, but there were other parts of the world, especially mission-critical industries, where there was a one-time software upgrade largely driven by military scenarios.

And then we never modernized all that infrastructure since then.

So that's why I think the cybersecurity side of this, the safety evals that you're talking about, will over time come to be seen as having been very prudent, because the thing that puts all of that adoption at risk is having one terrible incident.

That then changes the risk posture for a bunch of enterprises.

I have a question about that.

Actually, I'm kind of curious.

So when, you know, a lot of the larger companies that we talk to, their use case is very different.

It's not like building new features, right?

Which is what we see most of our users using us for, but it's refactors, large refactors and replatforming.

Right.

So I'm curious, you mentioned some of these companies or governments or systems that you're thinking about kind of had this one-time upgrade for military reasons and then never upgraded from there.

I am curious if there was like a specific reason that they all want to upgrade now that you're seeing or if actually we're still kind of in the state of like there's no forcing function.

So like, although it's easier to do, there's still no impetus.

Right.

So for sure, geopolitics has accelerated adoption for a bunch of governments, right?

In Europe, the Ukraine crisis has forced a lot of governments in that region to go, wait a minute, look at our air traffic control systems, especially in an age of unmanned sort of drone warfare.

It is crazy that when there's a bug, we need to call in some legacy contractor who built it like 20 years ago to come and do some onsite maintenance.

Right.

That's been a wake up call.

And so you're seeing, like, there's sort of an eight hundred billion dollar defense bill that Europe passed six months ago.

And the most urgent adoption is certainly happening at the intersection of like legacy code not working and battlefield needs and drone warfare code bases that interact with air traffic control systems with like UAV planning with mapping.

Those are the code bases that are like most urgently being upgraded.

I think in other parts of the world, there's just a desire to modernize.

So if you look at the UAE or the Kingdom of Saudi Arabia, we've talked about how the UAE is rolling out ChatGPT to the entire country.

I think that's coming mostly from a top down directive to just embrace the like AI future that's coming rapidly.

Basically, the more AGI-pilled I find the head of state is, the more rapid the adoption is, certainly for ChatGPT-like tools, but also coding.

That's not driven by some military function.

But then there are other regions, like Europe, where for sure geopolitics is accelerating all that.

And you know, you know, I've talked about this before, but usually those scenarios often need a slightly different, like the ergonomics of code are different.

They're very on prem.

They require a level of air gapping from cloud systems that the modern software engineering workflow doesn't lend itself to.

And so we may see this bifurcation of Codex as a family.

Like I'm curious, over the next few years, you know, the military requirements, or let's call it the critical industry needs, for modern autonomous coding agents might require some pretty basic architectural differences from the, you know, let me ship the latest and greatest version of our software product on GitHub.

I don't think it's a coincidence that the last time we saw a huge adoption of IT infrastructure around the world was the Cold War.

And now we're living through some pretty unstable times in both Europe and the Middle East.

And I think that is causing governments, well, I think the US has always been somewhat forward-leaning, posture-wise, on adopting the latest and greatest technology.

We make other governments look, you know, rightly so, like dinosaurs. And nothing forces dinosaurs to wake up like an impending comet hitting them and impending extinction.

So that's definitely happening.

Yeah, I think it's interesting.

For me, playing this through my mind as we're working on Codex, I do think there needs to be an answer for, you know, how do you use this agent in an air-gapped environment?

Right.

How do you use this agent?

Like, you know, there's critical industries and then there's just many like large companies who have like incredibly stringent security needs.

Right.

The way we've kind of thought about building is that the most important thing is to, you know, build to AGI, right, and then distribute the benefits of that to all humanity.

And so we're kind of leaning towards, okay, the primary thing is the fully self-contained, you know, the thing where we host it for you, you know, the container environment and everything.

And sort of in parallel, we have this side track of, okay, how are we going to make sure that, like today, you can use Codex CLI, and you could use that in, I guess, a relatively air-gapped way? Obviously, it needs to sample the model.

And then as we build new capabilities into Codex and ChatGPT, how do we just make sure that if you're running something like the CLI, you get the most of all the capabilities as they come, without a trade-off? But it might be a little bit like, okay, we build it in the fully self-contained system first, and then we push it down, right?

You know, there's this narrative violation. I keep hearing from folks in San Francisco that, oh, you know, OpenAI is all in on consumers, because the rise of ChatGPT as a consumer companion has been so extraordinary.

But clearly, our entire conversation is an exception to that story, right?

Because almost everything we've talked about has been focused on developers and governments.

So why is that misconception there?

I think ChatGPT is, in fact, an amazing and large business.

And it's super cool to work at a company that is really distributing AI to, like, a giant number of people.

But yeah, we are incredibly serious about coding.

And in fact, we always have been, since the first Codex product that was powering GitHub Copilot, right on all the way through with our models.

I will say though, I think people are noticing: we've always been very serious about coding models.

And we're now getting like very serious about like, coding products as well.

Right.

And so like, whereas before, we had these amazing models, you could use them in like whatever tool that you want to use them in.

Like now definitely, I mean, a lot of the stuff that I'm working on is thinking about like, hey, actually, there's a lot of, you know, as we build agents, there's a lot of value we can provide by not only thinking about the model, but also thinking about how the model is like useful to you in a certain form factor.

And actually the form factor really affects, right, everything.

And so yeah, we're spending a lot of time and effort building like even better coding models, and even better coding products, particularly focused on agents, but even beyond.

So you've been a founder before.

One of the scary things about hearing OpenAI going from being serious about models to also products is, if you're a founder in the space and you want to build something interesting in the coding space, there's this tension looming, right, which is: anything I'm going to build is just going to be subsumed by OpenAI's products next year.

So how would you think about that?

If you were leaving OpenAI and starting a company today, what would you do and what would you not do?

Okay, so if I was leaving OpenAI today, probably the market change I would be thinking the most about, or one of them, would be agents.

Okay, great, not super controversial.

Then I would think, okay, like we were talking about earlier, an agent is basically like a really good model that I'm probably not going to build at my startup.

And then I need to give that model access to tooling in an environment.

And then I need to like figure out what tasks it needs to be good at.

And then obviously give it to customers.

And the interesting thing about it is that those latter three things, right, the tooling, the environment, and the task distribution, and, like I said, the customers, so four things, whatever, all of those things are very much based on knowledge of a customer.

And those aren't things that like OpenAI is going to like, you know, generally do for like every industry, right?

Like, coding happens to be of particular importance to us just broadly, but even, you know, within coding, there are a lot more specific areas.

So just to really spell this out, you know, if you think of the environment, training Codex, it was really non-trivial to figure out how to give the model different environments to train in, you know, with different kinds of realistic dependency setups, right?

Varying amounts of dependencies even installed, varying amounts of unit tests. The startup that, you know, I sold to OpenAI was Multi, that's how I joined, and we had very few unit tests on a lot of our code.

And it's kind of funny, but that's realistic.

That's like a real startup code base, right?

So actually, if you wanted to do that for some specific function, I don't think it would be easy for us at OpenAI to create that many environments for the agent to train on and then use, you know, at test time.

So that's hard.

And then I think the task distribution is also really interesting.

Like with Codex, you know, we have a lot of intuition for what a good coding task could look like and kind of where to draw the boundaries, right?

Like today, it's: provide a prompt, and then you get an answer or a diff that you can turn into a PR.

But those are some decisions we had to make around what the boundaries of the agent are.

Right.

And then we had to like go collect a bunch of those like type of tasks or like invent those tasks to like again, like train the agent how to do it and evaluate how well it was doing.

So I think that, again, for a very specific industry, I don't know, I'm trying to come up with an example, let's say accountants, but in a specific region of the world where it's like a specific set of rules, like they might have like very specific tooling that's like provided by the state for doing that accounting, right?

Right.

There might be very different kinds of, like, base knowledge and documents available.

And then like the way you need to do the work might be different.

So I mean, I think it is a very good question.

And I'm not 100% sure what I would do if I were a founder right now.

But I think that I would try to lean really hard on like very good customer knowledge and less hard on like product, if that makes sense.

Right.

It sounds like the last mile connective tissue between an industry where you have deep domain expertise becomes more valuable.

Whereas the first mile, all the general-purpose parts of an agent's flow, you should basically assume you should offload that to OpenAI.

Yeah.

Yeah.

And then I think the other thing I might do is I might keep my company really small.

So rather than, you know, doing the classic hyperscale thing, I would try to use agents as much as possible and keep the company as small as possible so that we're just agile and nimble.

I guess this is probably like just the sort of age old advice.

Well, I'll push back on that for a second, because it turns out that in many industries, serving the customer deeply, like you're describing, often requires a human touch.

That might be sales, it might be solutions engineering, it might be customer support and so on.

It does sound like what you're saying is you would certainly keep your engineering team very small and minimal.

But if servicing the domain required more of the human touch, then you would scale, because often, in my experience, getting an agent to actually work in the enterprise or a legacy industry requires going in and doing a fair amount of integration work, at least upfront.

So maybe it's a setup thing, right?

Upfront, you parachute in somebody who understands how to get an agent up and running.

And then you can leave, because for the customer it's really just like consuming teammates, like you were saying earlier, but maybe where you do need people is that integration point.

Now, ideally, over time, I guess you're saying the product should just get good enough at integrating into the customer's environment.

But sometimes for regulatory reasons or otherwise, you just need a human there.

You know, are there some industries that clearly feel out of bounds for OpenAI?

Because they're just not on the path to AGI.

But that still would interact with coding agents.

First off, it's a good point that the actual integration work probably requires humans. I would say, yeah, especially if it's in-person type integration work, or complex, then I think you're spot on there.

Industries that are out of bounds.

I think it's like, it's like a hard question to reason about because like we are building like general products, right?

And so you can kind of use ChatGPT to answer any question already today.

So I wouldn't say there are bounds, but it's more like focus. I would say, you know, right now at OpenAI, we're very focused on serving consumers generally and being really good at coding.

You know, there's some other things too.

So I would just say, yeah, the more maybe we should just not even have this answer in the podcast.

Yeah, we can take this part out for you.

Perfect.

Great.

No.

Oh, great.

Yeah, about to wrap.

You stopped me.

That was a good one.

I'm like, I don't know, man, I've not found it right.

You don't speak on behalf of Sam about why world domination is not complete and total.

I can take that part out.

So, slightly different topic. A question I get from a lot of parents, especially those with kids who are approaching the end of high school and in that phase where they're picking careers or thinking about what they want to do, is about this immense anxiety, especially for folks in tech. For the vast majority of the last 20 or 30 years, it's been a fairly stable assumption that if you were smart and generally oriented towards technical fields, and you went and studied software engineering, you'd have a pretty great career and a safe and sort of rewarding time in the knowledge economy.

And it seems like coding agents like Codex are taking a violent hammer to that assumption.

How would you advise, you know, friends who are parents who are trying to figure out how to help their kids choose a career for the future?

So I'll answer this with humility because I don't have kids, but I do think about this.

And actually, I think my point of view would just be that the world has always been changing.

It's changing now, but it was changing before that. Maybe it's changing a little faster.

But that's the main thing to notice is actually the pace of change, not the specific change.

And so, you know, if I had a kid in high school now, I would probably just be trying to encourage them to be excited about whatever they're doing, and to be incredibly curious and constantly learning.

Right?

Like I studied CS, did you study CS as well?

Or I started with CS and then transferred to bioinformatics because I was more interested in healthcare.

Right.

You know, and now you do investing, right?

And like I studied mechanical engineering and then I changed to CS and now I work in product in AI at OpenAI.

But like the startup that I had started was not an AI company.

So things are constantly changing.

And I think the most important thing is to like be agile, curious and like, you know, have some foundation that you can build upon as the world evolves around you.

So I think similarly, if I had a child in late high school, I would just want them to crush whatever it is that they're doing.

And it wouldn't really matter what specific thing they've chosen. You know, I lean technical, so that would be cool.

But like, maybe even that is optional.

And then I would just raise them with the expectation that they'll probably have like many career transitions throughout their lives.

And having seen what you have with Codex, knowing what you do about where it's going, let's say you were the chair of the computer science department at a university. What would you do differently now versus before Codex launched?

Well, one is you'd allow kids to use the AI tools.

But let's say you're thinking about the future of computer science education and how that should be taught over the next five, 10, 15, 20 years.

What would you do differently?

Yeah, again, just opinions here.

But I think I would have, you know, like at Stanford, there was a class where we wrote assembly. I forget the name of that class.

That was cool.

We had one class.

CS 140, I think it was.

Yeah.

And then, you know, similarly, I would have a handful of classes where folks do things very manually to understand what's going on behind the scenes and also to build confidence that they can.

But then generally, I would move towards like having students trying to deliver some kind of like outcome, be it like they've learned something or they've built something or something like that.

Project based learning.

Yeah.

And then I would probably encourage them to like use these various tools so they're picking up the skills.

And you know, I don't know, I don't know, this is just an idea in my head.

But if we can help them kind of speed run through that arc, then maybe every quarter they're using a different set of tools.

And so they're like becoming like very mentally plastic in terms of how they get things done.

And I think that would be the best simulation of like what future work would look like.

I'm not sure. What would you do?

Well, I teach a class CS 143 at Stanford every year.

This year, we taught it in winter quarter.

And we had about 300 students.

And I was, you know, thinking it through. In previous years, we had a midterm and, you know, we had problem sets.

And this year, we decided just to have it be a combination of speakers, CTOs or researchers in AI, who come in and talk about the infrastructure problems of building AI products at scale.

And then we had one final project where everybody had to build an agent and ship it.

And they were all allowed to use any coding tools.

In fact, we gave folks some credits to Mistral models and Black Forest models, and the founder of Cursor came by and kind of talked about the IDE and why they should all be using it.

And what was extraordinary, right, was that it was so clear that the distribution of the final projects followed this power law, where the top four or five teams that really wholeheartedly adopted Cursor and the AI models, and did a fully sort of AI-assisted workflow for their final project, produced software that was production grade.

If I was still running the platform org at Discord, I would have totally shipped four or five of those on the front page of the app store we had.

In fact, I sent some of them to the founders of Discord and they were like, we should probably ship this.

The quality bar was just extraordinary for something they were able to build in basically a 10 week quarter.

Then there was this sort of usual middle of the pack that had made a halfhearted attempt, but enough to get a good grade, to customize the templates we'd given them.

But clearly hadn't like asked what is something that now I can create that I couldn't before now that I have access to extraordinary coding agents.

And then there was just the classic sort of bottom of the class that I think just didn't accept those tools and think deeply about, like, trying them, using them, learning with them, developing a feel for what they're good at and what they're not good at, and kind of turned in a final project that would have been totally possible to build a year ago.

Why do you think they didn't want to use the tools you were giving them?

Look, it's hard to parse out from just a final project, but I did office hours with a lot of the students every week.

And you could very clearly see that the number one predictor of their success was their mindset.

It was just about like, were they curious and hungry to learn outside of like a traditional textbook?

And look, some of the students just had a lot going on, being a college student is a stressful thing today.

And so I have a lot of empathy. There's definitely this awkward moment you're describing right now, where a number of the seniors who are graduating with college degrees this year started out as freshmen in a very different economy.

When they picked CS, the assumption was, hey, if I like do well in the core CS curriculum, if I get a 4.0 GPA, and I do like one or two good internships, you know, somewhere along the way, and I apply for a job, I'm going to get a job at a pretty good tech company.

That's just not happening anymore.

And it might be because there's a set of layoffs or some overhang from the ZIRP era, or it might be because a lot of engineering teams are reducing their footprint of entry-level jobs.

But I was definitely shocked by how many Stanford CS grads, you know, graduating seniors, were still looking for full-time jobs come winter of senior year.

And I think that's anxiety inducing, it's stress inducing, and that has bleed-over effects on, like, can you concentrate on this project-based class. A number of the students were also juggling interviews, and were coming to office hours, when I thought they were going to be coming to ask about, you know, the code, asking for career advice, which is totally fine.

But I do think there's a transition phase right now, which can be very stressful for computer science students.

And I think you're right: the faster they're able to onboard to using these tools and realize that the cap on what they can create now is extraordinarily high, the faster and better I think they're going to transition into the new economy.

Because I do think there's an expectation, certainly for modern software teams, certainly at OpenAI, that, like, you're just fluent in all of these tools now, relative to, you know, four or five years ago. It was crazy: when we graduated from Stanford, I didn't take a single class that required the use of Git, right, which is absurd.

Yeah, I happened to, you know, pick it up in an internship, but there was no class that, at least at the time, actually required you to know how to use Git.

Yeah.

And so I do think the computer science departments around the country have to recognize that and make the kind of changes you're talking about.

And my hope is that in the interim, you know, students won't wait around for their deans and their professors to do that for them.

Because you can just go and use Codex, you know, for free. I think the research preview is literally free.

Is that right?

Well, you have to have a Plus account or a Pro account.

But yeah, it's a good point.

Maybe we should do something for students, student licenses.

Yeah, you know, I will say that we are hiring for Codex. If you're interested in working on Codex, please DM @embirico on Twitter; it's E M B I R I C O.

Yeah, I don't know if I'm allowed to plug myself here.

But yeah, we're hiring, but we mostly are hiring very senior.

But we actually are, we decided that we're pretty interested in hiring like a couple of new grads.

Oh, that's interesting.

Yeah.

And so it's been interesting just looking at new grad profiles.

And I totally feel you on the yeah, I mean, it's definitely a tough time to be graduating.

I don't know if this is advice.

But what I can say is that when I look at new grad profiles, for me, the thing that I take the most signal from is if they've built something, right?

And if they've built something that's linked from their profile, and I can just, like, click through to it. Projects.

Yeah.

And you know, like, it's just like a cool website, right?

You know, like grades matter much less now.

Yeah, I don't even look. Actually, now that you mention it, I didn't even realize that I haven't looked at anyone's grades.

You know, like I just like, because, you know, admittedly, we're only hiring a few new grads, right?

But that is the single largest signal.

It's like what have you built?

Right.

And is there some way for me to validate that?

Like, maybe it's because I can click to the website, or maybe you just have some stats on like how many people used it, right?

And then when I talk to them, I'm just like, yeah, let's talk about what you built and how you thought about that.

So maybe that's somewhat helpful for folks who are looking for something. You know, I kind of reflect on my journey here to OpenAI, which I'm really grateful for.

And I view it as a privilege to be working here.

But you know, when I look back to when we were working on the startup Multi, which was, like, not an AI company, and we saw ChatGPT come out and we started to follow all this LLM stuff, I remember just feeling like, wow, there is a chance that if we don't do this right over the next couple of years, like my co-founder and I were talking about, there's a chance that we actually just end up like dinosaurs.

Right.

And so at the time, we actually made like a very explicit decision to like heavily prioritize getting us and the entire company like ramped on AI things.

And to some extent, like, I don't know if I could have gotten the job that I have here at OpenAI if I was just applying randomly. I think it's because we had built something that was interesting that we were able to get that attention and have that conversation.

So I guess if there's one takeaway here, it's just like, just got to build.

It's time to build.

Yeah.

As a reminder, please note that the content here is for informational purposes only; should not be taken as legal, business, tax, or investment advice or be used to evaluate any investment or security; and is not directed at any investors or potential investors in any a16z fund.

For more details, please see a16z.com/disclosures.