OpenAI Podcast · 2025-07-01

Inside ChatGPT, AI Assistants, and Building at OpenAI

Hosts: Andrew Mayne

Guests: Nick Turley, Mark Chen

ChatGPT launch history · Product naming and origins · Iterative deployment philosophy · RLHF and sycophancy · Model behavior and bias · ImageGen launch · Codex and agentic coding · Internal dogfooding · Hiring and culture at OpenAI · Reasoning models (o3) and research acceleration · Async and agentic workflows · Future of AI assistants

Why it matters

ChatGPT was almost called 'Chat with GPT 3.5'; the simpler name was decided the night before launch.

Key claims

  • ChatGPT was almost called 'Chat with GPT 3.5' — the simpler name was decided the night before launch, and roughly half of OpenAI research still debates what GPT stands for.
  • Internal launch debate: Ilya reportedly tested the model with 10 hard questions and only ~5 passed, making the ship decision genuinely uncertain.
  • OpenAI shifted from hardware-style infrequent launches to software-style iterative deployment after ChatGPT, treating frequent real-world feedback as core to both product improvement and safety.
  • The sycophancy incident was caught by a small group of power users and addressed publicly (Joanne Jang) within ~48 hours; Turley notes ChatGPT is utilitarian, not engagement-optimized, which shapes its incentive structure.

Episode summary

Nick Turley (Head of ChatGPT) and Mark Chen (Chief Research Officer) join host Andrew Mayne to revisit the chaotic early days of ChatGPT's launch, including the overnight decision to rename it from "Chat with GPT 3.5," the GPU shortages and "fail whale" poems that kept the product running through the holidays, and the famous Ilya pre-launch test that nearly killed the launch. They frame ChatGPT as the moment OpenAI shifted from hardware-style infrequent launches to a software-style iterative deployment model, where frequent contact with reality became central to improvement and safety.

The conversation covers major product launches and lessons learned: how RLHF-driven thumbs-up training produced an overly sycophantic model and how the team responded within 48 hours; why OpenAI publishes its behavior specs rather than hiding system prompts; the ImageGen breakthrough (driven by post-training, variable binding, and the surprise of seeing 5% of India's internet try it over a weekend); and the launch of Codex as OpenAI's first major bet on agentic, async coding workflows. They also discuss internal dogfooding, hiring criteria (curiosity, agency, adaptability over credentials), and how OpenAI's culture of hackathons and autonomy persists even as the company has grown from ~150 to ~2,000 people.

Looking ahead 12–18 months, Chen expects research breakthroughs powered by reasoning models like o3 (already being used as a subroutine in physics papers), while Turley highlights async, agentic workflows—five-minute to five-day tasks like Deep Research—going mainstream alongside voice. They argue AI will democratize rather than displace experts (doctors, software engineers), with the key skill being learning to delegate and ask good questions. Each closes with a personal favorite use: menus, meeting prep via Deep Research, and voice-mode thought processing on the commute.

  • ImageGen was framed as a 'mini ChatGPT moment' — the breakthrough came from post-training plus scale rather than a single trick, and surprised even the team with utility beyond fun (infographics, room mockups, slide illustrations).
  • Codex represents OpenAI's bet on agentic, async coding rather than real-time completions, with internal usage (hundreds of PRs/day by power users) seen as a leading indicator for external features.
  • Hiring priorities have shifted toward curiosity, agency, and adaptability rather than formal AI credentials; the culture of hackathons and shipping autonomy persists even at ~2,000 employees.
  • Looking forward, Chen highlights o3-style reasoning accelerating scientific research (already used as a subroutine in physics papers), while Turley expects async, agentic workflows and voice to transform the consumer experience beyond the chatbot paradigm.

Source material

Transcript

Hello, I'm Andrew Mayne and this is the OpenAI Podcast.

My guests today are Mark Chen, who is the Chief Research Officer at OpenAI, and Nick Turley, who is the Head of ChatGPT.

We're going to be talking about the early viral days of ChatGPT.

We're going to talk about ImageGen and how OpenAI looks at code and tools like Codex.

We'll cover what kind of skills they think we might need for the future, and we're going to find out how ChatGPT got its totally normal name.

Even half of research doesn't know what those three letters stand for.

You're going to have an intelligence in your pocket that can be your tutor, it can be your advisor, it can be your software engineer.

There's a real decision the night before.

Do we actually launch this thing?

First off, how did OpenAI decide on that awesome name?

It was going to be Chat with GPT 3.5, and we had a late-night decision to simplify.

Wait, wait, say that again.

It was going to be Chat with GPT 3.5, which rolls off the tongue even more nicely.

You said that was a late night decision, meaning weeks before you finally decided what to call it.

Right, right.

Weeks before, we hadn't started on the project yet.

Oh goodness.

I think we realized that that would be hard to pronounce and came up with a great name instead.

That was the night before.

Roughly, might have been the day before.

It was all kind of a blur at that point.

I would imagine a lot of that was a blur.

I remember being in a meeting when we talked about the low key research preview, which really was.

We really thought, oh, this is because it was 3.5.

3.5 was a model that had been out for months.

From a capabilities point of view, when you just look at the evals, you're like, yeah, it's the same thing, but we just put the interface in here and made it so you didn't have to prompt as much.

And then ChatGPT comes out, and when was the first sign that this thing was blowing up?

I'm sure everyone has a slightly different recollection of that era because it was a very confusing time.

But for me, day one was sort of, you know, is the dashboard broken? Classic: the logging can't be right.

Day two was like, oh, weird.

I guess like Japanese Reddit users discovered this thing.

Maybe it's like a local phenomenon.

Day three was like, OK, it's going viral, but definitely going to die off.

And then by day four, you're like, OK, it's going to change the world.

Mark, did you have any expectation about that?

No, honestly. I mean, we've had so many launches, so many previews over time.

And yeah, this one really was something else.

The takeoff ramp was huge.

And yeah, my parents just stopped asking me to go work for Google.

Wait, so wait, wait a second.

Up until ChatGPT, your parents were asking, like, what you're doing here?

Yeah.

They'd just never heard of OpenAI.

I think for many years thought AGI was this pie in the sky thing and I wasn't having a serious job.

So it was a real revelation for them.

Yeah.

What was your job title at the time?

I think just member of technical staff. Member of technical staff.

And then that blew up.

And now you're head of research.

I guess so.

Yeah.

So all right.

Yeah.

I think even half of research doesn't know what those three letters stand for.

It's kind of funny.

You know, like half of them think it's generative pre-training.

Half of them think it's generative pre-trained transformer.

And what is it?

It's the latter.

Okay.

All right.

Yeah.

Those people, they don't know the name of it.

Yeah.

It is.

It's weird how just a silly name like that all of a sudden becomes a thing, but you see that with like, you know, Google, Yahoo, Kleenex, Xerox, things like that.

And sometimes they were, some of those were names by intention.

And this was really just a silly sort of name.

For me, the moment that I felt like after watching the launch, watching it accelerate, I knew what was going to happen.

And then what it did was when it was on South Park.

And remember when South Park made fun of the name?

That was the first time I'd watched South Park in, let's just say, a while.

And that episode, I still think it's magic.

Yeah.

It's obviously profound to watch and see, you know, something you help make show up in pop culture.

But there's the punchline in the end where it's like, oh, this was co-written by ChatGPT.

I think they took that off though.

I think they did.

I think in later episodes, because it used to say, I think, written by, like, Trey Parker and ChatGPT, and then no, it was.

And then I think later, I think they may have pulled that off at some point.

I don't remember like, well, I strongly feel that you shouldn't have to give credit to.

Yeah.

That's what I'm using.

If I had to give credit to ChatGPT for every aspect of my life.

Um, well, might as well just say ChatGPT, maybe with Andrew.

So you use it for prep for your interviews?

You know, one of my, my co-producers, Justin probably uses it.

I haven't asked him yet because I'd like to think that he's handcrafting every single question that we're thinking about here.

But I am sure you say it was a bit of a blur.

I'll tell you, like, a standout moment for me in the launch of ChatGPT was, I don't know if you remember this, but the Christmas party. We'd had several weeks of ChatGPT out there, and Sam Altman went up and said, hey, it's been exciting to watch this, but the internet being the internet, and I think we all felt this way.

It's going to die down.

Spoiler alert.

It did not die down and it just kept accelerating.

What were the things you had to do internally to sort of keep this thing up and running as more people wanted to use it?

We had quite a few constraints and, if you remember, you know, I think you guys remember ChatGPT was down all the time.

Yeah.

It was a great thing.

And that was, yeah, we'd said, Hey, this is a research preview.

No guarantees.

And maybe it goes down.

But the minute you had people loving using this thing that didn't feel super good.

So, you know, people were certainly working around the clock to keep the site up.

I remember, you know, we obviously ran out of GPUs.

We ran out of database connections.

We were, you know, getting rate limited by some of our providers.

Nothing was really set up to run a product.

So in the beginning we just built this thing.

We called it the fail whale, and it would just tell you kind of nicely that the thing was down, and it made a little poem, I think generated by GPT-3, about being down, and it was sort of tongue in cheek.

Yeah.

That got us through the winter break because we did want people to have some sort of a holiday.

And then when we came back, we were like, okay, this is clearly not viable.

You can't just go down all the time.

And eventually we got to something we could serve everyone.

Yeah.

And I think, you know, the demand really speaks to the generality of ChatGPT, right?

We had the thesis that ChatGPT embodied what we wanted in AGI just because it was so general.

And I think, you know, you're seeing that demand ramp just because people are realizing, you know, any use case that I want to give or to throw to the model, it can handle.

We were kind of known as the company working on AGI.

And I think prior to ChatGPT, the API was certainly the first time we had a public offering where people could go use it and do it.

But then it was more for developers and stuff.

And I think that as long as people were sort of thinking AGI, that seemed to be the point at which people thought these models would be useful, but we saw GPT-3, we saw that that was useful.

And then we saw that the other things it could do were useful.

Was everybody at OpenAI on board with chat GPT being useful or being ready to launch?

Yeah, I don't think so.

You know, even the night before, I mean, there's this very famous story at OpenAI of Ilya taking 10 cracks at the model, you know, 10 tough questions.

And my recollection is maybe only on five of them, he got answers that he thought were acceptable.

And so there's a real decision the night before.

Do we actually launch this thing?

Is the world actually going to respond to this?

And I think it just speaks to when you build these models in house, you so rapidly adapt to the capabilities.

And it's hard for you to kind of put yourself in the shoes of someone who hasn't kind of been in this model training loop and see that there is real magic there.

Yeah.

Yeah.

I think to build on that, like the controversy internally about, you know, is this thing good enough to launch, I think is humbling, right?

It's just a reminder of how wrong we all are when it comes to AI.

It's why, you know, frequent contact with reality is so important.

Could you elaborate more on that contact with reality?

What does that mean?

Yeah, I mean, when you think about iterative deployment, one way I like to frame it is, you know, there's no point everyone agrees where it's suddenly useful.

And I think usefulness is this big spectrum.

And so, you know, there's not one capability level or one bar that you meet and suddenly, you know, the model is useful for everyone.

Were there any hard decisions about what to include or what to focus on?

We were very, very principled on ChatGPT to not balloon the scope.

We were adamant to get feedback and data as quickly as we could.

I'm always in Slack telling you things, by the way.

There's a lot of things that didn't make it in.

I'm like, Nick, add this, add this.

I remember actually there was a lot of controversy about like the UI side.

For example, we didn't launch with history, even though we thought people would probably want that.

And, you know, guess what?

That was the first request.

I also think there's always the question like, can we train an even better model?

Like, is it, you know, with two weeks more time?

I'm glad we didn't because, you know, we, I think got a ton of feedback as we did.

So yeah, there was a ton of these scope discussions and, you know, the holidays were coming up.

So I think we had this kind of natural forcing function for getting something out.

Yeah.

There's this habit that if something isn't going to come out by a certain point, say November, it's not going to come out until, like, February.

You know, there's a sort of window where things would fall on either side.

Well, that would, that would be the classic mindset in a big tech company.

I think we're definitely a bit more flexible in the way we ship.

I felt like one of the big impacts was once people are out using it, it felt like the rate of these things improving was tremendous.

I don't know if that was something that we really had in our calculus.

We could certainly think about training on the larger site, more data, scaling compute, but then the idea of actually having them, the signal you would get from that many people using it.

Yeah.

I think over time, you know, feedback really has become an integral part of how we build the product and it's also become an integral part of safety.

And so you always feel the time cost of losing out on feedback.

You know, you can deliberate in a vacuum, right?

Are they going to respond to this better?

Are they going to respond to that better?

But it's just not a substitute for just bringing it out there.

Right.

I think our philosophy is let the models have contact with the world.

And if you need to revert something, that's fine.

But I think there's really no substitute for this fast feedback.

And it's become one of the big levers for how we improve model performance too.

It's sort of funny.

I feel like we started with shipping these models in a way that is more similar to hardware where you make like one launch very rarely and it has to be right.

And you know, you're not going to update the thing and then you're going to work on the next big project.

And it's capital intensive and the timelines are long.

And over time, and I think ChatGPT was kind of the beginning, it's looked more like software to me where you make these frequent updates.

You have a kind of a constant pace.

The world can adapt.

Something doesn't work, you roll it back, and you sort of lower the stakes in doing that.

And you increase the empiricism.

And of course, just operationally too, you can innovate faster in a way that is more and more in touch with what users want.

Yeah.

One of the examples we had of that was the model becoming too obsequious or sycophantic.

Could you explain what happened there, where people all of a sudden were saying, hey, it's telling me I've got a 190 IQ and I'm the most handsome person in the world, which I had no problem with personally, but other people did?

And what was going on there?

Yeah.

So I think one important thing is we rely on user feedback to move the models.

And it's this very complicated mix of reward models, which we use in a procedure we call RLHF, using human feedback to use RL to improve the models.

Could you give me just, like, a brief example of what that would mean?

Yeah.

Yeah.

So I think one way to think about it is when a user enjoys a conversation, they provide some positive signal.

Thumbs up.

Yeah, a thumbs up, for instance.

And we train the model to prefer to respond in a way that would elicit more thumbs up.

And this may be obvious in retrospect, but stuff like that, if balanced incorrectly, can lead to the model being more sycophantic.

You can imagine users might want that kind of feeling of a model saying good things about them.

But I don't think it's a very good long term outcome.
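The balance described here can be illustrated with a toy calculation. This is not RLHF itself and not OpenAI's training setup: it is a minimal sketch that assumes two hypothetical response styles with invented thumbs-up rates and usefulness scores, and shows how the preferred style flips once the thumbs-up signal is weighted too heavily.

```python
# Toy illustration only: the styles, numbers, and function names are invented
# for this sketch and do not come from the episode or from OpenAI.
STYLES = {
    # style: (probability of a thumbs-up, long-term usefulness score)
    "candid": (0.60, 0.90),
    "flattering": (0.85, 0.40),
}


def blended_reward(style: str, thumbs_weight: float) -> float:
    """Mix the thumbs-up signal with a usefulness signal."""
    thumbs_up, usefulness = STYLES[style]
    return thumbs_weight * thumbs_up + (1.0 - thumbs_weight) * usefulness


def preferred_style(thumbs_weight: float) -> str:
    """Return the style with the highest blended reward."""
    return max(STYLES, key=lambda style: blended_reward(style, thumbs_weight))


if __name__ == "__main__":
    for weight in (0.2, 0.5, 0.9):
        print(weight, preferred_style(weight))
```

With these made-up numbers the candid style wins while the thumbs-up weight stays below roughly two thirds, and the flattering style wins above it, which is the "balanced incorrectly" failure in miniature.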

And actually, when we look at our response to sycophancy and the rollout that resulted there, I think there are a lot of good points about it.

This was something that was flagged just by a small fraction of our power users.

It wasn't something that a lot of people who generally use the models noticed.

And I think we really picked that out fairly early.

We responded to it, I think, with the appropriate level of gravity.

And yeah, I think it just shows that we really do take these issues quite seriously.

And we want to intercept them very early.

It felt like there was maybe 48 hours since the model came out.

And then Joanne Jang had a response explaining exactly what happened.

And I think that that's the hard part.

How do you navigate that?

Because the problem with social media is you're basically monetized by engagement time.

You want to keep people on there longer so you can show them more ads.

And certainly, the more people use ChatGPT, obviously, there's a cost to OpenAI. Ideally, maybe they use it once and stay around forever.

But that's not practical.

How do you weigh that?

The idea of making people happy with what they're getting versus making the model be broadly more useful than just pleasing.

I feel very lucky in this regard because we have a product that's very utilitarian.

People use it either to achieve things that they do know how to do but don't feel like doing, faster or with less effort.

Or they're using it to do things that they couldn't do at all.

First example is maybe writing an email that you've been dreading.

Second example might be running a data analysis that you didn't actually know how to do in Excel.

True story.

So those are very utilitarian things.

And fundamentally, as you improve, you actually spend less time on the product.

Because ideally, it takes less turns back and forth.

Or maybe you actually delegate to the AI so you're not in the product at all.

So for us, time spent, it's very much not the thing we optimize for.

We do care about your long term retention because we do think that's a sign of value.

If you're coming back three months later, that clearly means we did something right.

But what that means is I always say show me the incentive and I'll show you the outcome.

We have, I think, the right fundamental incentives to build something great.

That doesn't mean we'll always get it right.

The sycophancy events were really, really important and good learning for us.

And I'm proud of how we acted on it.

But fundamentally, I think we have the right set up to build something awesome.

So that brings up the challenge.

I want to know how you navigate that. One of the things early on, when ChatGPT came out, was the allegations that it's woke, that people are trying to promote some sort of agenda. My argument has always been that if you train a model on corporate speak, you know, average news and a lot of academia, it's going to kind of follow that.

And I remember Elon Musk was very critical about it.

And then when he trained the first version of Grok, it did the same thing.

And then he's like, Oh, yeah, when you trained it on this sort of thing and did that.

And internally, at OpenAI, were there discussions about how do we make the model not try to push you, not try to steer you?

Could you go a little bit into how you try to make that work?

Yeah, so I think at its core, it's a measurement problem, right?

And I think it's actually bad to downplay these kinds of concerns because they are very important things, right?

And we need to make sure that the model, the default behavior that you get is something that's centered, that doesn't reflect bias on the political spectrum, or in many other axes of bias.

And at the same time, you do want to allow the user the capability to, you know, if you want to talk to a reflection of something with more conservative values to be able to steer that a little bit, right?

Or liberal values, right?

And so I think the thing is, you want to make sure that defaults are meaningful, and they're centered.

And that's a measurement problem.

And you also want to give some flexibility, right, within bounds, to steer the model to be a persona that you want to talk to.

I think that's right.

I think, you know, in addition to neutral defaults and the ability to bring your own values to some extent, being transparent about the whole thing is, I think, really, really important.

I'm not a fan of secret system messages that, you know, try to like, you know, hack the model into saying or not saying something.

What we've tried to do is publish our specs.

So you can go look at, you know, if you're getting certain model behavior, is that a bug?

You know, is it a violation of our own stated spec?

Or is it actually in the spec, in which case, you know, who to criticize and who to yell at?

Or is it just underspecified in the spec, in which case that allows us to improve it and add more specificity into that document.

So by sort of publishing the rules that the AI is supposed to be following, I think that's an important step to have more people contribute to the conversation than just the people inside of OpenAI.
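The three outcomes just described (a bug, behavior that is in the spec, or a gap in the spec) amount to a small decision rule. The following is a minimal sketch of that rule only; the names `Triage` and `triage` and the two boolean inputs are assumptions made for illustration, not anything OpenAI has published.

```python
from enum import Enum


class Triage(Enum):
    """Hypothetical labels for the three cases described in the conversation."""

    BUG = "behavior violates the published spec, so fix the model"
    INTENDED = "behavior follows the published spec, so debate the spec itself"
    UNDERSPECIFIED = "spec is silent on this case, so add specificity to it"


def triage(spec_covers_case: bool, behavior_matches_spec: bool) -> Triage:
    """Classify a surprising model behavior against a published spec."""
    if not spec_covers_case:
        # Nothing in the spec addresses this situation.
        return Triage.UNDERSPECIFIED
    return Triage.INTENDED if behavior_matches_spec else Triage.BUG


if __name__ == "__main__":
    print(triage(spec_covers_case=True, behavior_matches_spec=False).name)
```

The point of the sketch is that the classification is only possible when the spec is public: without it, an outside observer cannot tell the three cases apart.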

So we're talking about like the system prompt, the part of the instruction that the model gets before the user puts the input.

And well, I think it's one, yeah, yeah.

The system prompt is one way to steer the model, but it goes much deeper than that, right?

Yeah, we have a very large document that outlines across a bunch of different behavior categories, how we expect the model to behave.

And just to give you an example here, right?

You can imagine if there's someone who comes in with just like an incorrect belief, just a factually incorrect kind of a point of view, how should the model interact with that user, right?

And should it reject that point of view outright?

Or should it collaborate with the user on kind of figuring out what's what's true together?

And you know, we take that latter point of view.

And I think there are a lot of very subtle decisions like this, which we put a lot of time in.

Yeah, that's a hard one because I think some things you can test for and you can try to figure out in advance, but when you're trying to figure out how an entire culture is going to adopt something that's challenging.

Like if I was someone who was convinced that the world was flat, you know, like how much should the model push back against me?

And some people are like, oh, it should push back all the way, but okay.

What if you're one religion and not another?

And yeah, it turns out rational people, well-meaning people, can disagree on how, you know, the model should behave in these instances, and you're not always going to get it right, but you can be transparent about what approach we took.

You can allow users to customize it.

And I think, you know, this is our approach.

I'm sure there's ways we can improve on it, but I think that by being transparent in the open about how we're trying to tackle it, we can get feedback.

How are you thinking about as people start to use these models more and more?

Regardless of whether or not that's some dial you're trying to turn, it's just the more useful it becomes, the more people want to use it.

You know, there was a time when nobody wanted a cell phone and now we can't get away from them.

And how are you thinking about the relationships people are forming with their systems?

Obviously, you know, we, you know, I mentioned this earlier, this is a technology you have to study.

Yeah.

It's not designed in a static way to do XYZ.

It's highly empirical.

So, you know, as people adopt it, the way that they use the product is something that we need to go understand and act on as well.

I've been observing this trend with interest where I think, you know, an increasing number of people, especially Gen Z and younger populations, are coming to ChatGPT as a thought partner.

And I think in many cases, that's really helpful and beneficial because you've got someone to brainstorm on a relationship question.

You've got someone to brainstorm on a, you know, a professional question or something else.

But in some cases, it can be harmful as well.

And I think detecting those scenarios and, first and foremost, having the right model behavior is very, very important to us.

So actively monitoring.

And in some ways, it's one of those problems we're gonna have to grapple with because with any technology that becomes ubiquitous, it's gonna be dual use.

People are gonna use it for all this awesome stuff.

And people are gonna use it in ways that, you know, we wish they didn't.

And we have some responsibility to make sure that we handle that with the appropriate gravity.

I find myself having longer conversations with it.

I like the memory function.

I like the fact you can turn it off if you don't want.

And I think about like, you know, what's this gonna be two years from now or three years from now when it has a much longer memory, much more context with this?

I like the idea of having these sort of, you know, Memento, anonymous modes too, where it's not gonna store this.

But I kind of wonder how much you've been thinking about two years, three years down the road.

What's that going to be like when ChatGPT knows way more about you?

Yeah, I mean, I think memory is just such a powerful feature.

In fact, it's one of the most requested features when we talk to people externally.

It's like, this is the thing I really want to pay more for.

And I think, you know, you liken it to if you've ever kind of had a personal assistant, you know, you know, I'm not.

Well, you do need to build up that really.

I'm sorry, guys.

I'm sorry, guys.

But you know, it's yeah, it's just like, it's kind of in any kind of relationship that you have with a person, right?

You build up context with them over time.

And I think just the more they know about you, right, the richer the relationship, the more, you know, they can also help you, right?

You can work together to collaborate on tasks together.

I do become self conscious of the fact that like, it knows everything about me when I'm grumpy.

And I've argued with it recently, by the way.

That's good.

Yeah, you should be able to argue with it.

Yeah, you understand a lot about yourself and having a thing to argue with.

And I think you spare others that experience, which can also be beneficial.

But don't argue on math and science.

You're not going to win this.

No, I'm increasingly very unlikely.

Yeah.

Yeah, I think that was cool.

And to Mark's point, it's been part of our vision for a long time, because we said we were going to build a super assistant before we really knew what that meant.

ChatGPT was sort of the early demonstration of that idea.

But if you kind of think about, you know, real-world intelligences, even they are not particularly useful on their first day.

And I think being able to solve that problem or begin to solve that problem has been profound.

To your earlier question, though, it really does feel like, if you fast forward a year, ChatGPT or things like it are going to be your most valuable account by far; it's going to know so much about you.

And that's why I think giving people ways to talk with this thing in private is very important.

We make this like temp chat thing very like literally on the home screen, because we think it's increasingly important to talk about stuff sort of off the record, too.

So it's an interesting question.

I think privacy and AI is going to be an interesting one for the next coming years.

I want to switch gears and talk about another release which, again, kind of caught people by surprise and blew up: ImageGen.

I was here for DALL·E, DALL·E 2.

And then DALL·E 3 came out, and I thought DALL·E 3 was a very capable model, but it seemed like it preferred a certain kind of image, and a lot of the utility and the capabilities for variable binding were sort of hidden away.

And then ImageGen was kind of just this breakthrough moment that it caught me off guard.

How did you guys feel about the launch of that?

Yeah, honestly, it caught me off guard, too.

And this is really props to the research team. You know, Gabe in particular did a ton of work here.

Kenji, many others did phenomenal work.

And I think it really spoke to this thesis that when you get a model just good enough that in one shot, it can generate an image that fits your prompt, that's going to create immense value.

And I think we never quite had that before, right?

That you just get the perfect generation oftentimes on the first try.

And I think that's something very powerful, you know, like people don't want to pick the best out of a grid.

I think you just got very good prompt following and this great style transfer, too.

This ability to kind of put images as context for the models and to modify and to change and the fidelity that you could do that with.

I think that was really powerful for people.

I think this ImageGen experience, it was just kind of another mini ChatGPT moment all over again, where you've been staring at this for a while.

You're like, yeah, it's going to be cool.

I think people really like it, but you kind of you know, you're launching like 20 different things and then suddenly the world is going crazy in a way that you kind of only find out by shipping.

Like I remember distinctly, you know, we had like 5% of the Indian internet population try ImageGen over the weekend.

And I was like, wow, we're reaching new types of users who we wouldn't even have thought of, you know, who might not have thought of using ChatGPT.

That's really cool.

And to Mark's point, I think a lot of this is because there's this discontinuity where something suddenly works so well and truly the way you expected where I think it blows people's minds.

And I think we're going to have those moments in other modalities too.

I think voice, it hasn't quite passed the Turing test yet, but I think the minute it does, people are going to find that immensely powerful and valuable.

You know, the video is going to have its own moment where it starts meeting the expectations that users have.

So I'm really excited about the future because I think there's so many of these magical moments coming that are really going to transform people's lives.

And also you change sort of ChatGPT's relevance for people because, you know, I've always felt like there's text people and there's image people and like some of them are a little bit different.

And now they're all using the product and discovering the value across the board.

The moment when it launched, I think it kind of illustrated the problem that had been with image models before.

And, you know, when DALL·E came out, it was super exciting because you're like, I'm doing pictures of space monkeys and all these sorts of things.

The moment you try to do a really complex image, and that's the phrase I brought up before, which is variable binding, you start to see these things drop off.

And that was when I realized, oh, there's going to be a challenge for other image systems that don't have the kind of scale and compute of like a GPT-4 under the hood.

And so was it basically taking a GPT-4 scale model and saying, right, now you do images, that made the breakthrough?

Well, I think there are a lot of different parts of research that made this such a big success.

Right.

I think with a complicated multi-step pipeline, it's never just one thing, right?

It's like very good post-training.

It's very good training.

And I think it's just all of that coming together.

Right.

Variable binding definitely was one thing that we paid a lot of attention to.

I think one thing about the ImageGen launch is that it was a launch that was very deep.

I think people, you know, they started by working on creating anime versions of themselves.

But you realize when you play with it more, the infographics, they work.

Oh, yeah.

You actually create charts.

Comic book panels.

Yeah.

You can mock up what your home would look like.

Exactly.

Different furniture.

Exactly.

We heard all these things from users that are like completely surprising about the way they use it.

We did the podcast set up by literally taking some photos of chairs in the room and just putting it in there and saying create a better setup.

And it was amazing.

So there were a lot of the anime style images, and for some reason it was just sort of the weird thing where it was just better than what we'd seen before.

And I don't think anybody was ready to be really surprised by an image model in that way.

I think obviously internally and externally, what were some of the things that surprised you or some of the new things you saw people doing?

Yeah, I'll tell you a quick story there too, because, you know, up until the day of launch, we're trying to figure out what's the right use case to showcase, you know, like, and I think I'm so glad we ended up on kind of anime styling.

It's just, everyone looks good as an anime.

That's true.

I mean, it's funny, with the original ChatGPT, I thought it would be a strictly utilitarian product, and then I was surprised that people used it for fun.

In this case, it was sort of the opposite where I was like, okay, this is going to be really cool for me.

And people are going to like have fun with this thing.

But then I was really surprised by all the genuinely useful ways of using ImageGen, whether or not it's planning your home project.

As I mentioned earlier, you know, you're doing construction, you want to see what things would look like if you know, you had this remodel or this furniture or whatever to you're working on a slide deck for this important presentation and you just want to have really useful consistent illustrations that are on topic and get it.

So I really have been kind of personally surprised by the utility in this case because I knew it would be fun.

So that was not a question.

Yeah, I think I used it to generate a tier list of AI companies and it put OpenAI at the top.

You win model.

What?

Good post training.

Yeah, yeah, it just happened.

You know, who knew?

How has the thinking changed? Because I remember originally with DALL·E, the idea was like, okay, we have to be very controlled about what it can do, what it can't do.

Originally, I remember when we first launched, you couldn't do people, which was not a very useful model.

And then finally that was rolled back. How much of that was a cultural shift, how much of that was the technological ability to control for things, and how much of that was just saying we've got to push the norms?

I would say it was both cultural shift and improvement in our ability to control things.

The culture shift, you know, I'm not going to deny it.

I think when I joined OpenAI, there was a lot of conservatism around what capabilities we should give to users, maybe for good reason.

The technology is really new.

A lot of us were new to working on it.

And if you're going to have a bias, biasing towards safety and being careful, it's not a bad DNA to have.

But I think over time we learned that there's so many positive use cases that you effectively prevent when you make arbitrary restrictions in the model.

What about faces?

Why not?

Why can't I make any face I want?

So this is a good example of a capability that's got pros and cons and you can err on one side or the other.

When we first shipped image uploads into ChatGPT, we had some debates about what capabilities do you allow versus where are you conservative?

And I think one debate we had was, do we allow the upload of images with faces, or rather, when you upload an image that contains a face, should we just gray out the face, because you avoid so many problems, right?

You can make inferences about people based on their face.

You could say mean things to people based on their face.

And you would just take a giant shortcut on all the gnarly issues if you didn't allow that.

But I've always felt we need to err on the side of freedom and we need to do the hard work.

And I think in this case, you know, there's so many valid ways.

You know, if I want feedback on makeup or on my haircut or anything like that, I want to be able to talk to ChatGPT about it.

Those are valuable and benign use cases.

And I would prefer to allow and then study, you know, where does that fall short?

Where is that harmful?

And then iterate from there versus taking a default stance on disallow.

And I think that's one of those ways in which our stance and posture has changed a bit over time in terms of where we set, you know, where we start.

Yeah, we were very good, I think, at imagining worst case scenarios.

What if I use this, these faces to evaluate hires for a company or whatever, but also it's like, Hey, is this eczema?

Like there's a lot of utility there.

And honestly, I think there are certain domains of AI safety where worst case scenario thinking is very appropriate.

So I think that is an important way of thinking about risk when it comes to certain forms of risks that are existential or even just very, very bad.

You know, we have the preparedness framework, which helps us reason through some of those things.

You know, can the AI let you make a bioweapon?

It's good to think about the worst case there because it would be really, really bad.

So you kind of have to have that way of thinking in the company and you have to have certain topics where you think about safety in that way, but you can't let that kind of thinking spill over onto other domains of safety where the stakes are lower because you end up, I think, making very, very conservative decisions that block out many valuable use cases.

So I think being sort of principled about different types of safety on different time horizons and with different levels of stakes is very important for us.

I think I want a blunt mode sometimes, just to have it actually roast you.

I think, yeah, because I'll ask the model, like with the voice-in, speech-out model, do I sound tired?

And it's like, well, you know, I don't really want to, you know, and I'm like, yeah, you know, just trying to get it to be honest.

You know, I think there's many cultures that would prefer a blunter ChatGPT. It's very much on the radar.

Yeah.

Just to piggyback off Nick's answer, I think it's the iterative deployment that gives us the confidence to push towards user freedom.

And we've had many cycles of this.

We know what users can and can't do.

And that gives us the confidence to launch with the restrictions that we do.

One of the other capabilities, one of the other sort of native capabilities that has been very interesting, has been code.

And I remember early on with GPT-3, we saw that all of a sudden it could spit out entire React components and we saw that, oh wow, there's some utility there.

And then we actually trained a model more specifically on code.

And that led to Codex, and we had Code Interpreter. Now Codex is somehow back.

And, you know, new form, same name, but the capabilities keep increasing.

And we've seen code work its way first into VS Code via Copilot and then Cursor.

And then Windsurf, which I use all the time now.

How much pressure has there been in the code space?

Because I'd say that if we ask people who made the top code model, we might get different answers.

Yeah.

And I think it reflects that when people talk about coding, they're talking about a lot of different things, right?

I think there's coding in a specific paradigm.

Like if you pull up an IDE and you want to get a completion on a function, that's very different from, you know, agentic style coding, where you ask, you know, I want this PR, and I think we've done a lot of focus.

I'm trying to unpack a little bit what you mean by agentic coding.

Yeah.

Yeah.

So I think you can draw a distinction between more real-time response models (you can think of ChatGPT, to first order, as you ask a prompt and then you get a response fairly quickly) and a more agentic style model, where you give it a fairly complicated task.

You let it work in the background and after some amount of time, it comes back to you with what it thinks is something close to the best answer.

And I think we see increasingly that the future will look more async, you know, where you're asking very difficult, hard things and you're letting the model think and reason and come back to you with really the best version of what it can come back with.

And we see the evolution of code in that way too.

I think eventually we do see a world where you'll kind of give the very high level description of what you want and the model will take time and it'll come back to you.

And so I think our first launch of Codex really reflects that paradigm, where we are giving it PRs, units of fairly heavy work that encapsulate a new feature or a big bug fix, and we want the model to spend a lot of time thinking about how to accomplish this thing rather than give you a fast response.

And to your question, you know, coding is such a giant space.

There's so many different angles at it, kind of like talking about knowledge work or something incredibly broad, which is why I don't think there's one winner.

I don't think there's one best thing.

I think there's so many options and I think developers are the lucky ones because they have so many choices right now.

And I think that's fundamentally exciting for us too.

But to Mark's point, I think this agentic paradigm has been particularly exciting for us.

One framing I often use when thinking about product here is I want to build products that have the property that if the model gets 2x better, the product gets 2x more useful.

And I think ChatGPT has been a wonderful thing because for a long time, I think that was true.

But I think as we look at smarter and smarter models, I think there's some limit to people's desire to talk to like a PhD student versus, you know, that they might value other attributes about the model, like its personality and what it can actually do in the real world.

But experiences like Codex, I think they create the right body such that we can drop in, you know, smarter and smarter models and it's going to be quite transformative, because you get the interaction paradigm right, where people can specify a task, give the model time, and then get a result back.

So I'm really excited where it's going to go.

It's an early research preview, but just like with ChatGPT, we felt like it would be beneficial to get feedback as early as possible, and I'm excited about where we're going to take it.

I was using Sonnet a lot, which I love.

I think Sonnet for coding is fantastic.

But o4-mini on the medium setting in Windsurf, I found, was great.

I found once I started using that, I was really happy because of the speed and everything else like that.

And I think there are very good reasons why people like other models, and I don't want to get into comparison.

But I found that for me, for the kinds of tasks I was using, this was the first time I was very happy you guys put that out there.

Absolutely.

Yeah.

And you know, we feel like there's still a lot of low hanging fruit in code.

It is a big focus for us.

And I think we'll find in the near future, you'll find many more good options for the right code model tailored for your use case.

Yeah.

I find often if I just need a quick answer to like how to write something in Dart, I'll just use 4.1, say, but yeah, for something bigger.

I think that's going to be the harder part is because yeah, these evals are some ways saturated, but also everybody has their own criteria that we look at.

And that's going to be kind of a, you know, a question to sort of see, you know, how are we going to adapt to all that?

Right.

Yeah.

I mean, specifically in code, right?

I think there's more beyond, did it get you the right answer with code?

You know, people care about the style of the code.

They care about, you know, how verbose it was in the comments.

They care about, you know, how much proactive work did the model do for you?

Right.

On other functions.

And so I think, you know, there's a lot to get, right?

And users often have very different preferences here.

Yeah, it's funny.

I used to, I used to, you know, people used to ask me, well, what domains are going to like, you know, be transformed by AI?

You know, fastest.

And I used to say, you know, it's code because like similar to math and other things, it's very, very verifiable and testable.

And I think those are the domains that are particularly great to do RL on.

And you know, you're therefore going to see all this awesome, you know, agentic stuff just suddenly work.

I still think that's true.

But the thing that surprised me about code is that, you know, there is still so much of an element of taste in terms of what makes good code.

And there's, you know, there's a reason that people train to be professional software engineers.

It's not because their IQ gets better, but rather because they learn, you know, how to build software inside an organization.

What does it mean to write good tests?

What does it mean to write good documentation?

How do you respond when someone disagrees with your code?

Those are all actual elements of being a real software engineer that we're going to have to teach these models to do.

So I expect progress to be fast.

And I still think code has a ton of nice properties that make it very ripe for agentic products.

But I do think it's very interesting to the degree that, you know, the element of taste and style and real world software engineering matters.

It's interesting too, because with ChatGPT and the other models, you're kind of dealing with having to bridge the divide between consumer and pro.

I open up ChatGPT and I tell my friends like, oh yeah, I'll plug it into whatever code tool I'm working in, because I can actually connect it to there.

And I think about, you know, well, that's a very different use case from a lot of other people, though I've shown people how to go in and use, you know, an IDE and actually have it just write documents for you and create folders and stuff, which people don't realize, like, yeah, you can do that.

You can have ChatGPT actually control it and do that, which is cool.

But then you think about like, okay, we've got a tab now for images.

There's the Codex tab.

So if I want to connect to GitHub and have it work through there, I can, and there's Sora in there.

So it's kind of interesting to see how all of these things are coalescing into there.

How do you differentiate between a consumer feature, a professional feature, and maybe like an enterprise feature?

Look, we build very general purpose technology and it's going to be used by a whole range of folks and unlike many companies which have this kind of founding user type and then they use technology to solve that user's problems.

We do start oftentimes with the technology observe who finds value in it and then iterate for them.

Now with Codex, our goal was very much to build for professional software engineers, knowing though that there's sort of a splash zone where I think a lot of other people will find value in it and we'll try to make it accessible for those people as well.

There are a lot of opportunities to target non-engineers, and I'm personally really motivated to create a world where, you know, or help build a world where anyone can make software.

Codex is not that product, but you could imagine those products existing over time.

But as a general principle, it's really hard to predict exactly who the target user is until we made some of these general purpose technologies available because it gets back to the empiricism I was talking about.

We just never exactly know where the value is going to lie.

Yeah.

And I think even to dig deeper into that, you could have a person who's mostly using ChatGPT for coding, right?

But 5% of the time, they might just want to talk to the model or like 5% of the time they just want a really cool image, right?

And so I think, you know, there are certainly archetypes of people who use the models, but in practice we see that people want this exposure to different capabilities.

Yeah.

With Codex and watching the launch of that, it kind of struck me.

There are some tools you see that there's a lot of excitement about because there's a lot of internal demand for that.

How much are you using it, and tools like that, internally?

More and more.

Okay.

So I'm really excited to see internal adoption.

It's everything from, you know, exactly what you'd expect in people using Codex to offload tests.

We have an analyst workflow that will look at logging errors and automatically flag them and Slack people about it.

So there's all these ways that we're...

I've actually heard some people are using it as a to-do where like future tasks they're hoping to do, they're starting to fire off Codex tasks.

So this is the perfect type of thing that I think you can dogfood internally.

And I'm very excited about the leverage that engineers are going to get out of a tool like this.

I think it's going to allow us to move faster with the people we have and make each engineer that we hire, you know, like 10 times more productive.

So in some ways, internal usage is a very good predictor of where we want to take this.

Yeah.

I mean, we don't want to ship something to other people that we don't find value in ourselves.

And I think, you know, leading up to the launch...

Laundry buddy.

Laundry buddy is an essential partner.

Okay.

Sorry.

Sorry.

I mean, yeah, we had some power users, though, that were, you know, generating hundreds of PRs a day personally.

Right.

So I think, you know, there are people internally finding a lot of utility from what we're building.

Also if you think about internal adoption, it's also a good reality check because, you know, people are busy, you know, adopting new tools takes some activation energy.

So actually the thing you find when you try to dogfood things internally is some of the reality component of how long it takes people to actually adjust to a new workflow.

And it's been humbling to watch, right?

So I think you learn both about the technology, but you also learn about some of the adoption patterns when you're trying to get a bunch of busy people to change the way they write code.

So when you build these tools, internally people have to learn how to use them and are having to adapt.

And there's a lot of question now about kind of what kind of skills do people need in the future?

You know, what kind of skills do you look for on your teams?

I've thought about this a lot.

Hiring is hard, especially if you want to have a small team that is very, very good and humble and able to move fast, et cetera.

And I think curiosity has been the number one thing that I've looked for.

And it's actually my advice to students when they ask me, what do I do in this world where everything's changing?

Because, I mean, for us, there's so much that we don't know.

There's a certain amount of humility you have to have about building on this technology because you don't know what's valuable.

You don't know what's risky until you really study and go deep and try to understand.

And when it comes to working with AI, which, you know, we obviously do a lot, not just in code, but in kind of every facet of our work, it's asking the right questions that is the bottleneck, not necessarily getting the answer.

So I really fundamentally believe that we need to hire people who are deeply curious about the world and what we do.

I care a little bit less about their experience in AI.

Mark presumably feels a bit different about that one.

But for the product side, it's been curiosity that I've found to be the best predictor of success.

I mean, even on research, I think increasingly less we index on you have to have a PhD in AI, right?

I think this is a field that people can pick up fairly quickly.

I also came into the company as a resident without much formal AI training.

And I think correlated to what Nick said, I think one important thing is for our new hires to have agency, right?

OpenAI is a place where you're not going to get so much of a, "Oh, here's today.

You're going to do thing one, thing two, thing three."

It's really about being kind of driven to find, "Hey, here's the problem.

No one else is fixing it.

I'm just going to go dive in and fix it."

And also adaptability, right?

It's a very fast changing environment.

That's just the nature of the field right now.

And you need to be able to quickly figure out what's important and pivot what you need to do.

The agency thing is real.

I think we often get asked, how does OpenAI keep shipping?

And it feels like you're pushing something out every week or something like that.

It's funny, because it never feels that way to me.

I always feel like we could be going even faster.

But I think fundamentally we just have a lot of people with agency who can ship.

That comes to product, that comes to research, that comes to policy.

Shipping can mean different things.

We all do very different things at OpenAI, but I think it's the ratio of people who can actually do things and the lack of red tape except where it matters, the couple of areas where I think red tape is very, very important.

But I think that is what makes OpenAI very unique, and it obviously affects the type of people who we want to hire too.

I was brought into the company because I was originally given access to GPT-3 and I just started showing all these use cases for it and making videos every week for it.

Yeah.

And that was annoying people, I'm sure.

But I was just- No, it was not.

It was really fascinating.

It was exciting.

It was an exciting time.

And I think they built a UFO and I get to play with it.

And then I make it hover and like, "Oh, you made it hover."

I'm like, "Well, they built it."

I just press the button and got to do that.

But that was just what I found very empowering was the fact that I'm self-taught.

I learned to code by Udemy courses and stuff and then to be a member of the engineering staff and be told, "Just go do stuff."

Nothing too critical.

I didn't break anything, anybody.

And that's good to know that that kind of spirit is still there.

And I think that is part of the reason why OpenAI is able to ship even though it was like 150, 200 people worked on GPT-4.

I think people forget about that.

Totally.

And honestly, even ChatGPT, this is how it came together.

We had a research team that had been working for a while on instruction following and then the successor to that, post-training these models to be good at chat.

But the product effort came together as a hackathon.

I remember distinctly we said, "Who's excited to go build consumer products?"

And we had all these different people.

We had a guy from the super computing team who was like, "I'll make an iOS app.

I've done that in a past life."

We had a researcher who wrote some backend code, and it was just a convergence of people who were excited to do stuff.

And I think that's how you get the next ChatGPT: running an organization where that is possible and continues to be possible at this scale.

Hackathons were my favorite thing because one, being a performer and loving show and tell, but it was just neat to be able to see things that you knew were going to be a product or something later on.

Because you're playing with technology that's this advanced and all that. Do you guys still do them?

Yeah, absolutely.

We've had some fairly recently and they are typically tight.

Last week?

Actually, I know.

Can't say what it was about, but it was an exciting thing.

You can, sure.

And it's how you find out what's possible.

I'm excited to hear that.

I do have a question about how things change as it grows. When I started, I think there were like 150 people at the company.

Now there's like 2,000, and now, you know, I see a video with Sam talking to Jony Ive. How much is that going to change the character, the spirit, bringing in all this?

I think all the outside expertise has been great.

We've seen this great sort of run of products, but do you see a change in the culture?

Well, I mean, I think probably in the right way, right?

It's like, I think when we look at AI, we don't think of it as some fairly narrow thing.

And we've always been kind of enthralled by just the potential and all the different things you could build with AI.

And yeah, to Nick's point, right?

This is why we're able to ship so quickly because people imagine all these different possibilities.

They imagine the future with AI and they try to bring it about.

And I think these are facets of that imagination, right?

It's like, what does AI look like if you imagine the AI first device, for instance?

Yeah, when you go from 200 to 2000, you'd think a lot would change.

And yeah, maybe in some ways it has.

I think people often underestimate the number of things that we're doing.

I always feel like being at OpenAI feels much closer to being in a university, where you've got this kind of common reason for being there, but everyone's doing something different and you'll sit down at dinner or at lunch and you'll talk to someone and learn about their thing.

And you're like, wow, that's so cool that you're doing that.

So it feels much smaller, I think because of the broad range of things we're doing, and therefore each individual effort, whether that's something like ChatGPT or something like Sora, et cetera, is actually staffed in a very, very conservative and lean way that continues to keep people very autonomous and make sure they have resources, et cetera.

So I think it's partly that that has made it feel very, very similar in the good ways to when I started here.

We talked a bit about one of the things you look for is curiosity and Mark said that's helpful too.

If I'm somebody outside of AI, okay, if I'm 25 or I'm 50 and I'm looking at the advancement of technology and maybe have a little bit of fear, because I see copywriting is one of the things that ChatGPT got great at, and writing code is great.

I personally have the opinion that we'll never have enough people creating code because there's more things code can do in the world than we can imagine.

And even with copy, my wife showed me the other day, on her sunblock lotion bottle, some very funny copy about the ingredients.

I said, oh, this is not a place I expected to see this, but that's one of the tiny little places where all of a sudden you can put more thought into it.

That being said, I know that I'm a bit of an optimist because I see all these opportunities and places to go in there.

What advice do you give people, you know, at whatever point they are in life, about preparing for or adapting to and being part of the future?

You know, I like how Mark just looked right to you.

I can go.

Okay.

I'm going to jump in right now.

I think the important thing is you have to really lean into using the technology, right?

And you have to see how your own capabilities can be enhanced, how you can be more productive, more effective by using the technology.

I fundamentally do think that the way this is going to evolve is you will still have your human experts, but what AI helps the most is the people who don't have that capability at a very advanced level.

Right.

So if you imagine, right, like, uh, as these models get much better at healthcare advice, um, they're going to help people who don't have access to care the most, right?

Uh, image generation, right?

It's not producing an alternative for, you know, experts or, you know, professional artists.

I think it's kind of allowing people like me and Nick to create creative expressions.

Right.

Um, and so I think it's kind of rising the tide that allows people to be competent and effective at a lot of things all at once.

And I think that's kind of how we're going to see a lot of these tools bootstrap people.

The world's going to change a lot.

And I think truly everyone has a moment where the, uh, AI does something that they considered sacred and human.

Um, um, I know a guy that got vested in our felt very threatened about his achievements in code.

Unabilities.

Well, that happened for me a long time ago.

Let's be talking about someone else in the, oh yeah.

I mean, yeah, it's definitely better than me at a lot of code problem solving, for sure.

Yeah.

Right.

So I think it's deeply human to feel some level of, um, awe, respect, uh, and maybe even fear.

And I think, to Mark's point, actually using this thing can demystify it.

I think we all grew up or, you know, learned about the word AI, um, in a world where AI meant something pretty different from what we have today.

You've got these algorithms that, you know, try to sell you things, try to do things, or you've got movies, you know, where the AI, uh, takes over, et cetera.

And like that term means so many things to different people that I'm entirely unsurprised that, you know, um, there's fear.

So actually using the thing is, I think the best way to have a grounded conversation, um, about it.

And then I think from there, the best way to prepare, I think there's some degree to which you need to understand the products and keep up.

Sure.

So things like prompt engineering or sort of understanding the intricacies of this AI, they're kind of not the right direction.

I think it's sort of fundamentally human things, like learning how to delegate, um, that are incredibly important, because increasingly, you know, you're going to have an intelligence in your pocket that can be your tutor, it can be your, um, advisor, it can be your software engineer.

Um, it's much more about you understanding yourself and the problems you have and how someone else might help than a specific understanding of AI.

Um, so I think that's going to be important.

Curiosity, I mentioned earlier. I think asking the right questions: you only get what you put in, right?

Um, that's important.

And I think fundamentally being ready to learn new things.

I think the more you understand how to pick up new topics and domains, et cetera, um, the more you're going to be prepared for a world where, you know, the nature of work is shifting much faster than it has ever shifted before.

So, um, I'm prepared that my job, you know, in product is going to look different or not exist at all.

But I am looking forward to picking up something new.

And I think as long as you bring that perspective, um, you're well set up to leverage AI.

I think we sometimes over-index on, you know, certain jobs going away, because, like, we don't really need a lot of typewriter repair people anymore.

Right.

And then certain kinds of coding jobs are probably going to go away.

But like I said, I think there's way more opportunity for coders or people to create code, however it's done.

Um, and you mentioned the health field, and that's one of the things I hear. People are like, oh, you know, when we replace everything with AI... Well, I mean, I would be very happy having an AI diagnose me, operate on me, and probably do everything else.

But I do want somebody there to talk me through the procedure and hold my hand.

But also I want people asking questions like, you know, every day I take a bunch of vitamins.

Is this the right time of day to take them?

You know, I can't bother my doctor with all these silly little questions.

I really don't think you end up displacing doctors.

You end up displacing not going to the doctor.

You end up democratizing the ability to get a second opinion.

Very few people have that resource or know to, you know, take advantage of a resource like that.

You end up bringing medical care into pockets of the world where that is not readily available and you end up helping doctors gain confidence.

You know, I have often heard from doctors that, you know, they already talk to existing colleagues to get a second opinion.

In some cases that's not possible.

And I think you'd be surprised by the number of doctors that use ChatGPT.

Um, now on things like medicine, there's work to make the model really, really good.

And we're excited to do that work.

There's also work to prove that the model is really good because I think you're not going to trust that until there's some degree of sort of legitimacy.

And then there's work to explain the areas where the model might not be good, because increasingly, once it gets to human and then superhuman-level performance, um, it's hard to frame exactly where it will fall short, which is also hard to sort of reckon with.

But nonetheless, I think that opportunity is one of the things that gets me up in the morning.

Education might be the other one.

And I think there's a tremendous opportunity to help people.

What do you think is going to surprise us the most in the next year to 18 months?

I honestly think, um, it's going to be the amount of research results that are powered, even in some small way by the models that we've built.

And, um, one of the kind of quiet things that's taken the field by storm is the ability of the models to reason.

And you already see some research.

I'm going to make you explain.

Yeah.

I think you're going to reason.

Yeah.

So this fits into the... I want you to reason through the question as you explain reasoning.

Yeah.

Yeah.

Think out loud.

Tell us your, your, your traces.

Yeah.

This, um, this really fits into this agentic paradigm that we were talking about earlier.

And, um, the way that the models approach solving a problem that takes some time to solve is that they reason through it, much like you or I might, right?

If I give you a very complicated...

You reason probably much better than I do.

I mean, um, I think I'm flattered. But, uh, yeah, like, a complicated puzzle, right?

You might think to yourself, uh, for instance, let's just use a crossword puzzle, right?

Like you might think through all the different alternatives and, uh, what's consistent.

Um, you know, is this row kind of consistent with that column and you're searching through a lot of alternatives.

You're backtracking a lot.

Um, you're trying out a lot of hypotheses, and then at the end, right, you come up with a well-formed answer.

And so the models are getting a lot better at that.

And that's what's powering a lot of the advancements in math and science and coding.

So this has reached a level where today, in many research papers, people are using o3 almost as a subroutine, right?

There are subproblems within the research problems they're trying to solve which are just fully automated and solved by plugging into a model like o3.

Um, I've seen this in several physics papers, um, and talked to physicists, even, where they're like, I had this expression that I couldn't simplify, but o3 made headway on it.

And these are coming from some of the best physicists in the country.

So I think you're going to see that happen more and more and more and more.

And we're going to see just acceleration in progress in fields like physics and mathematics.

It's a hard one to beat because, you know, I would swap many things we do in exchange for making a true, you know, significant, you know, scientific advancement.

Um, but I think we can have multiple of these things.

I think for me, it's the fact that any well-described problem that is intelligence-constrained, I think, will be solved in products.

And I think we're fundamentally just limited by our ability to do that.

So what that means is, like, you know, in companies, in the enterprise, there are so many problems that are fundamentally hard, that the models are not smart enough to do yet, um, whether or not that's software engineering, whether or not it's running data analysis, whether or not it is, um, providing amazing customer support.

There's all these problems that, um, the models fall short at today that are very, very, um, easy to describe and evaluate.

And I think we'll make tremendous progress on those.

Um, on the consumer side, um, these problems exist too.

They're a bit harder to find, um, just because consumers are, um, worse at telling us exactly what they want.

That's the nature of building consumer products.

But I think it's very, very worthwhile where, you know, there's many hard things we do in our personal life, whether or not it's doing taxes, whether or not it's planning a trip, whether or not it's, um, searching for a high consideration purchase, whether or not that's a house or a car or a piece of clothes.

Um, all of those things are, um, problems where we need just a little bit more intelligence, um, and the right form factor.

So I think the other thing that's going to happen in the next year and a half is you'll see a different form factor in AI, um, evolve.

I think chat is still an incredibly useful interaction model.

And I don't think it's going to go away, but increasingly you're going to see more of these sort of asynchronous, um, workflows.

Coding is just one example, but for consumers, it might be sending this thing off to go find you the perfect pair of shoes, or to go plan a trip, or, you know, to go, um, finish your taxes.

And I think that's going to be exciting.

And we're going to think of AI a little bit differently than, um, just a chatbot.

One of my favorite examples, from a utility point of view, a capability point of view, and then, uh, UI, was deep research. Deep research is probably the best example we have of agentic model use right now, because it used to be that you would ask a model to tell you about a topic.

It would either get the data or just do a big search of the internet, and then it would just summarize all that, whereas deep research will go find some set of data, look at it, ask a question, then go find some new data, come back to it, and keep going.

And I think the first time I used it, I was like, wow, this is taking a while.

And then you added a UI change so I can actually go away and go do something else.

And then the lock screen on my phone will show me this is working, which was a paradigm shift.

And I talked to Sam here about that.

And Sam said what was a surprise to him was the fact that people would be willing to wait for answers.

And now I've seen a new metric for models, which is how long a model can spend trying to solve a problem, which is a good metric if it ultimately solves it.

Has this been an update to you in how you think about these things? I guess you talked about this before with agentic: the idea that it's not just "give me the answer."

It's like, "take your time, get back to me."

I think, you know, to build a super assistant, you got to relax constraints.

Like today you have a product that is entirely synchronous.

You have to initiate everything.

That's just not the maximally best way to help people.

Like if you think about a real world intelligence that you might get to work with, it has to be able to go off and do things over a long period of time.

It has to be able to be proactive.

So I think we're sort of in this process of relaxing a lot of the constraints on the product and on the technology to better mimic a very, very helpful entity.

The ability to go do five-minute tasks, you know, five-hour tasks, eventually five-day tasks, is like a very, very fundamental thing that I think is going to unlock a different degree of value in the product.

So I've actually not been that surprised that people are willing to do that.

Like I don't really want to be sitting around waiting for my coworker either.

And I think if the value is there, I'd gladly be doing other stuff and come back.

Yeah.

And we really don't do it just because; we do it out of necessity.

The model needs that time to solve the really hard coding problem or the really hard math problem.

And it's not going to do it with less time, right?

You can think about this as, you know, I give you some kind of brain teaser, right?

Your quick answer is probably like the intuitive wrong one.

And you need that actual time to kind of work through all the cases, like, are there any gotchas here?

And I think it's that kind of stuff that ultimately makes robust agents.

We've seen kind of, there's like the paper of the moment where somebody comes out and says, ah, I found a blocker.

And I remember there was one a month or so ago and they said models couldn't solve certain kinds of problems.

And it wasn't hard to figure out a prompt that you could train into a model and it could solve those kinds of problems.

And we had a new one that talked about how they would fail at certain kinds of problem solving.

And that was kind of quickly, I think, debunked by showing that, you know, the paper kind of had flaws in there.

But there are limitations.

There might be some blockers, and there are things we don't know are going to be there.

I think brittleness is one of those things: there is a point where models can only spend so much time solving a problem.

We're probably at a point where we're only having the model, you know, maybe two systems watch each other and we have to think about how a third system stops, you know, to wait for things to break down.

But do you see any blockers between here and getting to models that are going to be doing things like coming up with interesting scientific discoveries?

I mean, I think there are always technical innovations that we're trying to come up with.

Right.

Fundamentally, we're in the business of producing simple research ideas that scale and the mechanics of actually getting that to scale are difficult.

Right.

So there's a lot of engineering, a lot of research to kind of figure out how to kind of tweak past a certain roadblock.

I think those are always going to exist.

Every layer of scale gives you new challenges and new opportunities.

So you know, fundamentally, the approach is the same, but we're always encountering new small challenges that we have to overcome.

Just to build on that, I mean, the other business we're in is building great products with these models.

And I think we shouldn't underestimate the challenge and amount of discovery needed to really bring these ever more intelligent models into the right environment, whether or not that's giving them the right sort of action space and tools, whether or not that's really being proximate to the problems that are hardest, understanding those, and bringing the AI there.

So I think there's the technical answer.

But I think there's also the real world deployment.

And I think that always has challenges that are very, very hard to predict, yet, you know, worthwhile and part of our mission to do.

All right.

Last question and I'll begin.

It's: what's your favorite use or tip for ChatGPT?

Mine is I take a photograph of a menu and I'm like, help me plan a meal or whatever.

I'm trying to like, you know, stick to a diet or whatever.

See, I really want that use case, but like I've been trying for wine lists and that is my eval on multimodality.

It still doesn't work.

Like, really, it keeps embarrassing me with hallucinated wine recommendations, and I go order it and they're like, never heard of this.

So I'm glad yours works.

But for me, that's the that's still used.

Well, I mean, maybe the wine list is too dense.

That was a problem.

That was a problem.

With Operator, originally, it was like that with the vision models: with too much dense text, it just loses its placement.

Yeah.

I mean, speaking of deep research, I love using deep research. You know, when I go meet someone new, when I'm going to talk to someone about AI, right, I just preflight topics, right?

And I think that the model can do a really good job of contextualizing who I am, who I'm about to meet and what things we might find interesting.

And I think it really just helps with that whole process.

Very cool.

I'm a voice believer.

I don't think it's entirely mainstream yet, because it's still got many little things that all add up.

But for me, you know, half of the value of voice is actually just having someone to talk to and forcing yourself to articulate yourself.

And I find that to sometimes be very difficult to do in writing.

So on my way to work, I'll use it to process my own thoughts.

And with some luck, and I think this works most days, I'll have a structured list of to-dos by the time I actually get there.

So voice, for me, is the thing that, you know, I both love using and want to see improved over the next year.