Latent Space · 2025-02-01

Karina Nguyen on OpenAI's Reasoning Interfaces: Canvas, Tasks, and the Path to Agents

Hosts: Alessio, Swyx

Guests: Karina Nguyen

ChatGPT Canvas · ChatGPT Tasks · o1 / o3-mini reasoning models · DeepSeek R1 · Agent reasoning interfaces · Computer use agents · Operator · Synthetic data / post-training · AI product design · Generative UI / OS

Why it matters

Context: OpenAI slashed o1-mini from $12 to $4.40 per million tokens and released o3-mini at the same price in ChatGPT and the API.

Key claims

Karina's team at OpenAI focuses on "reasoning interfaces"—HCI research combined with novel synthetic data training and post-training, shipping Canvas and Tasks as separate dropdown models to iterate faster before merging improvements into the core model.
ChatGPT Canvas was prototyped on July 4th as a dedicated model distilled from o1-preview, co-developed by research, product, design, and engineering from day one; writing quality was improved via a human-eval rubric co-built with the Model Design team.
Canvas vs. Google Docs represents an inversion of priorities: ChatGPT is AI-first with a docs surface as output, and Karina envisions it evolving into a "generative OS" that morphs UI based on user intent.
Karina defines agents as a gradual progression: one-off actions → collaboration → fully trustworthy long-horizon delegation, with collaboration being the critical trust-building milestone.

Episode summary

Karina Nguyen, who led ChatGPT Canvas at OpenAI (after writing the first 50,000 lines of claude.ai at Anthropic), discusses her team's work on "reasoning interfaces"—the intersection of HCI and novel synthetic data training. She explains how Canvas was prototyped over July 4th as a dedicated post-trained model distilled from o1-preview, why shipping as a separate dropdown model enabled faster iteration than baking improvements into the core model, and the writing-quality rubric work done with OpenAI's Model Design team. She also details the ~2-month build of ChatGPT Tasks and frames both features as compositional modules in a "generative OS" that morphs UI based on user intent.

On agents, Karina offers a graduated definition: one-off actions → collaboration → fully trustworthy long-horizon delegation, arguing that the missing middle step (collaboration) is the key trust-building milestone. She sees computer use as a core agent capability but acknowledges latency and precision gaps, suggesting small fast models (o1-mini, o3-mini) may help, and that expense reports are a tractable benchmark for end-of-year viability. The episode is framed by OpenAI's mid-January pricing move—slashing o1-mini to $4.40/M tokens and releasing o3-mini in ChatGPT—and the broader DeepSeek R1-driven reasoning price war.

Karina also contrasts Anthropic's more enterprise-focused, prioritized culture with OpenAI's higher tolerance for product risk and creative freedom. She closes by noting that she is hiring product-minded research engineers and by encouraging designers to "play around with the models" to discover new interaction paradigms.

She views computer-use agents (Anthropic's computer use, upcoming OpenAI Operator) as a core agent capability but flags latency, pixel precision, and intent-understanding gaps; expects small fast reasoning models to help and expense reports to be a tractable 2025 benchmark.
Context: OpenAI slashed o1-mini from $12 to $4.40 per million tokens and released o3-mini at the same price in ChatGPT and API, framing the discussion amid the post-DeepSeek R1 reasoning model price war.
Cultural contrast: Anthropic is more enterprise-prioritized and focused; OpenAI takes more product risks and gives researchers more creative freedom to pivot directions based on empirical results.
Call to action: hiring product-minded research engineers/research scientists, and encouraging designers to spend more time playing with models to discover new interaction paradigms.

Source material

Transcript

Welcome back!

From Sam Altman to Satya Nadella, many people are saying that 2025 is the year of agents.

Since our podcast conversations about DeepSeek, the mainstream narrative has become obsessed with DeepSeek R1, and what it means to have a competitive open-weights reasoning model from China.

Swyx wrote a viral blog post about the reasoning price war of January 2025, and today OpenAI has responded by slashing the price of o1-mini from $12 per million tokens to $4.40, and also released o3-mini in ChatGPT and to tier 3 and above API users for the exact same price.

Given that o3-mini matches or exceeds o1, especially with medium or high reasoning effort, this is an enormous leap in performance per dollar.

In the meantime, the rest of OpenAI has been busy shipping.

ChatGPT has slowly accelerated from shipping Canvas during the 12 Days of Shipmas last month to shipping recurring Tasks, and most recently Operator, the hosted virtual agent response to Claude's computer use.

We are very proud to host today's guest, Karina Nguyen, who was at Anthropic for the launch of Claude 3, and wrote the first 50,000 lines of Claude.ai before joining OpenAI to work on the future of what she calls reasoning interfaces.

We are very proud to also announce that Karina will be the closing keynote speaker for the second AI Engineer Summit in New York City from February 20th to 22nd.

This is the last call for applications for the AI leadership track for CTOs and VPs of AI.

If you are building agents in 2025, this is the single best conference of the year.

Our new website now lists our speakers and talks from DeepMind, Anthropic, OpenAI, Meta, Jane Street, Bloomberg, BlackRock, LinkedIn, and more.

Look for more sponsor and attendee information at apply.ai.engineer and see you there.

Watch out and take care.

[MUSIC] Hey, everyone.

Welcome to the Latent Space podcast.

This is Alessio, partner and CTO at Decibel, and I'm joined by my usual co-host, Swyx.

Hey, and today we're very, very blessed to have Karina Nguyen in the studio.

Welcome.

Nice to meet you.

We finally made it happen.

First time we tried this, you were working in a different company, and now we're here.

Fortunately, you had some time, so thank you so much for joining us.

Yeah, thank you for inviting me.

Karina, your website says you lead a research team in OpenAI creating new interaction paradigms for reasoning interfaces and capabilities like ChatGPT Canvas and most recently ChatGPT Tasks.

I don't know.

Is that what we're calling it?

Streaming chain of thought for o1 models and more via novel synthetic model training.

What is this research team?

Yeah, I need to clarify this a little bit more.

I think it changed a lot since the last time we launched.

So we launched Canvas and it was like the first project that I was a tech lead, basically.

And then I think over time I was trying to refine what my team is.

And I feel like it's at an intersection of human-computer interaction, defining what the next interaction paradigms might look like with some of the most recent reasoning models, as well as actually trying to come up with novel methods, how to improve those models for certain tasks that you want to.

So for Canvas, for example, one of the most common use cases is basically writing and coding.

And we're continually working on how do we make Canvas coding go beyond what is possible right now.

And that requires us to actually do our own training and come up with new methods of synthetic data generation.

The way I'm thinking about it is that my team is very full stack, going from training models all the way up to deployment and making sure that we create novel product features that are coherent with what ChatGPT can become.

There are different types of features like Canvas, Tasks, but all those components, they compose together to evolve ChatGPT into something completely new, I think, in the new year.

It's evolving.

I like your tweet about that.

It's modular.

You can compose it with the stocks feature, the creative writing feature.

I forgot what else.

We have a list of other use cases, but we don't have to go into that yet.

Yeah.

Can we maybe go back to when you first started working with LLMs?

I know you have some early UX prototypes with GPT-3 as well and maybe how that has informed the way you build products.

I think my background was mostly working on computer vision applications for investigative journalism back when I was at school at Berkeley.

I was working a lot with Human Rights Center and investigative journalists from various media.

That's how I learned more about AI, vision transformers.

At that time, I was working with some of the professors at Berkeley AI research.

Do you have some Pulitzer Prize winning professors that teach there?

No.

It was mostly, like, reporting for teams like the New York Times, like the AP, the Associated Press.

It was all in the context of Human Rights Center.

That was in computer vision.

Then I saw Chris Olah's work around interpretability from Google.

That's how I found out about it.

At that time, I think it was the year when Ukraine's war happened and I was trying to find a full-time job.

It all got distracted.

It was like spring and I was very focused on figuring out what to do.

Then my best option at that time was to continue my internship at the New York Times and convert to full-time.

At the New York Times, it was just working on product engineering work around R&D prototypes, storytelling features on the mobile experience.

It was storytelling experiences.

At that time, we were thinking about how do we employ NLP techniques to scrape some of the archives from the New York Times or something.

Then I always wanted to get into AI.

I knew OpenAI for a while since I was in Berkeley.

I applied to Anthropic just on the website.

I was rejected the first time.

At that time, they were not hiring for anything like product engineering or front-end engineering, which was something that at that time I was interested in.

Then there was a new opening at Anthropic, which was like, "You are front-end engineer."

I applied and that's how my journey began.

The earlier prototypes were mostly like, I used CLIP for fashion recommendation search.

That was one of those successful projects.

Before even coming to Anthropic, I was thinking maybe I should just do my own startup.

I feel like I didn't have enough confidence and conviction in myself that I could do that.

But it was one of the early prototypes.

I think Twitter is a good platform for side projects.

That's fantastic.

Especially for something visual.

We'll briefly mention that the Ukrainian crisis actually hit home more for you than most people because you're from the Ukraine and you moved here for school, I guess.

We'll come back to that if it comes up.

But then you joined Anthropic.

Not just as a front-end engineer, you were the first.

Is that true?

Designer.

Yes.

I think I did both product design and front-end engineering together.

At that time, it was pre-ChatGPT.

It was I think August 2022.

That was the time when Anthropic really decided to do more product-y related things.

The vision was like, "We need to fund research."

Building product is the best way to fund safety research, which I found quite admirable.

The very first product that Anthropic built was Claude in Slack.

It was sunsetted not long after, but it was one of the first.

I think I still come back to that idea of Claude operating inside some other organizational workplace like Slack and something magical in there.

I remember we built ideas like summarize the thread.

But you can imagine having automated ways of maybe Claude should summarize multiple channels every week custom for what you like or for what you want.

Then we built some really cool features like tag Claude and then ask it to summarize what happened in the thread, suggest new ideas.

We didn't quite double down because you could imagine Claude having access to the files or Google Drive that you can upload in Slack, just connectors, connections in the Slack.

Also, the UX was constraining.

At that time, I was thinking, "We wanted to do this feature, but the Slack interface constrains us from doing that.

We didn't want to be dependent on the platform like Slack."

Then after ChatGPT came out, I remember the first two weeks, my manager gave me this challenge, "Can I reproduce a similar interface in two weeks?"

One of the early mistakes being in engineering is like I said, "Yes."

Instead, I should have said, like, 2X the time.

This is how Claude.ai was born.

You actually wrote Claude.ai as your first job.

Yeah, I think the first 50,000 lines of code without any reviews at that time because there's no one.

It was like a very small team, it was like six, seven people, and we were called the deployment team.

On my end, I actually interviewed at Anthropic around that time.

I was given Claude in Sheets.

That was my other form factor.

I was like, "Oh yeah, this needs to be in a table, so we can just copy paste and just span it out," which is kind of cool.

The other rumor, we might as well just mention this: Raza Habib from Humanloop often says that there was some version of ChatGPT in Anthropic.

You had the chat interface already.

Like you had Slack.

Why not launch a web UI?

Basically, how did OpenAI beat Anthropic to ChatGPT?

Well, at that time- I see it's kind of obvious to have it.

I think the ChatGPT model itself came out way before we decided to launch Claude 2 necessarily.

I think at that time, Claude 1.3 had a lot of hallucinations actually.

I think that was one of the concerns, like, I don't think the leadership had a conviction that this is the model that you want to deploy or something.

It was a lot of discussions around that time.

Claude 1.3 was like, I don't know if you played with that, but it's extremely creative.

And it was really cool.

Nice.

Still creative.

You had a tweet recently where you said things like Canvas and Tasks could have happened two years ago, but they did not.

Do you know why they did not?

Was it too many researchers at the labs not focused on UX?

Was it just not a priority for the labs?

Yeah, I come back to that question a lot.

I guess I was working on something similar, like Canvas-y, but for Claude at that time in like 2023. It was the same similar idea of like a Claude workspace where a human and Claude could have like a shared workspace.

That's Artifacts.

No, no, no.

This is Claude Projects.

I think it kind of evolved.

I think like at that time I was like in product engineering team, and then I switched to like research team and the product engineering team grew so much.

They had their own ideas of like Artifacts and like Projects.

Maybe they looked at my previous explorations, but like when I was exploring like Claude documents or like Claude workspace, I don't think anybody was thinking about UX as much or like not many researchers understood that.

And I think the inspiration, actually, I still have like all the sketches, but the inspiration was from Harry Potter, like Tom Riddle's diary.

That was inspirational, like having Claude writing into the document or something and communicating back.

So like in the movie you write a little bit and then it answers you.

Okay, interesting.

But that was like in the only in the context of like writing.

I think Canvas also serves like coding, one of the most common use cases.

But yeah, I think like those those ideas could have happened like two years ago.

Just like maybe I don't think it was like a priority at that time.

It was like very unclear.

I think like AI landscape at that time was very nascent.

Does that make sense?

Like nobody like even when I would talk to like some of the designers at that time, like product designers, they were not even thinking about that at all.

They did not have like AI in mind.

And like, it's kind of interesting, except for one of my designer friends, his name is Jason Yuan.

Yeah, who was thinking about that.

And Jason is now at New Computer.

Yes, we'll have them on at some point.

I had them speak at my first summit.

And you're speaking at the second one, which will be really fun.

Nice.

Let's stay on Anthropic for a bit.

And then we'll move on to more recent things.

I think the other big project that you were involved with was just Claude 3.

Just tell us the story.

Like, what was it like to launch one of the biggest launches of the year?

Yeah, I think like I was, so, Claude 3.

This is Haiku, Sonnet, Opus all at once, right?

Yes.

Yeah, it was the Claude 3 family.

I was a part of the post-training fine-tuning team.

We only had like what, like 10, 12 people involved.

And it was really, really fun to like work together as friends.

So yeah, I was mostly involved in like the Claude 3 Haiku post-training side and then evaluations, like developing new evaluations and like literally writing the entire like model card.

And I had a lot of fun.

I think like the way you train the models is like very different obviously, but I think what I've learned is that like you will end up with like, I don't know, like 70 models and every model will have its own like brain damage.

And like 70 is like like kind of debug.

Personality wise or performance benchmarks?

I think every model was very different.

And I think like it's like one of the interesting like research questions is like, how do you understand like the data interactions as you like train the model is like, if you train the model on like contradictory datasets, how can you make sure that there won't be like any like weird like side effects.

And sometimes you get like side effects.

And like the learning is that you have to like iterate very rapidly and like have to like debug and detect it and make like address it with like interventions and actually some of the techniques from like software engineering is very like useful here is like how do you do it for data?

Yeah, exactly.

So I really empathize with this because data sets if you put in the wrong one, you can basically kind of screw up like the past month of training.

The problem with this for me is the existence of YOLO runs, I cannot square this with YOLO runs.

If you're telling me like you're taking such care about data sets, then every day I'm going to check in, run evals and do that stuff.

But then we also know that YOLO runs exist.

Yes.

So how do you square that?

Well, I think it's like dependent on how much compute you have, right?

So it's like, actually a lot of questions in research are like, how do you most effectively use the compute that you have?

And maybe you can have like two to three runs that is only like YOLO runs.

But if you don't have a luxury of that, like you kind of need to like prioritize ruthlessly, like what are the experiments that are most important to like run?

I think this is what like research management is basically, is like how do you...

Funding efforts, yeah.

Yeah, like take like research bets and make sure that you build the conviction in those bets rapidly such that if they work out, you like double down on them.

Yeah, you almost have to like kind of ablate data sets too and like do it on the side channel and then merge it in.

Yeah, it's kind of super interesting.

Tell us more like what's your favorite.

So I have this in front of me, the model card, and you say constructing this table was slightly painful.

Just pick a benchmark and what's an interesting story behind one of them?

I would say GPQA was kind of interesting.

I think it was like the first, I think we were the first lab, like Anthropic was the first lab to like run.

Oh, because it was like relatively new after NeurIPS.

Yeah, yeah, published GPQA like numbers.

And I think one of the things that we learned was that, I personally learned about that, like any evals, like, some evals are like very like high variance and like GPQA, like, happens to be like a huge like high variance like evaluation.

So like one thing that we did is like run it like five times and take the average, but like the hardest thing about like the model card is like none of the numbers are like apples to apples.

So you need to like, go back to like, I don't know, like the GPT-4 model card, like read the appendix, just to like make sure that like, the settings are the same as the settings you're running too.

So it's like never an apple to apples.
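
[Editor's sketch] The "run five and take the average" practice described above can be illustrated roughly as follows. This is a minimal sketch, not the lab's actual harness: `run_eval_once` is a hypothetical stand-in that simulates a noisy eval with a seeded random number generator, and the question count and accuracy are made-up values chosen only to show run-to-run spread.

```python
import random
import statistics


def run_eval_once(seed: int, n_questions: int = 200, true_accuracy: float = 0.5) -> float:
    """Hypothetical stand-in for one eval run: simulates per-question correctness."""
    rng = random.Random(seed)
    correct = sum(rng.random() < true_accuracy for _ in range(n_questions))
    return correct / n_questions


def averaged_score(seeds):
    """Run the eval once per seed and report the mean, the spread, and raw scores."""
    scores = [run_eval_once(seed) for seed in seeds]
    return statistics.mean(scores), statistics.stdev(scores), scores


mean, spread, scores = averaged_score(range(5))
print(f"per-run scores: {scores}")
print(f"reported score: {mean:.3f} (sample std dev across runs: {spread:.3f})")
```

Reporting the spread next to the mean makes it easier to tell whether a gap between two models is larger than the run-to-run noise, provided both were run under the same settings.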

Yeah.

But it's interesting how like, you know, when you market models as products, like customers don't necessarily know, like they're just like my MMLU is 99.

What do you mean?

Why isn't there an industry standard harness, right?

There's this Eleuther thing, which it seems like none of the model labs use.

And then OpenAI put out simple-evals and nobody uses that.

Why isn't there just one standard way everyone runs this? Because the alternative approach is you rerun your evals on their models.

And obviously the numbers, your numbers, will be lower.

Yeah.

And they'll be unhappy.

So that's why you don't do that.

I think it operates on the assumption that like the models, the next generation of the model or the model that you produce next is going to behave the same.

So for example, like, I think the way you prompt o1 or like Claude 3 is going to be very different from each other.

I feel like there's a lot of like prompting that you need to do to get the evals to run correctly.

So sometimes the model will just like output, like new lines.

And the way you parse will be like incorrect or something. Like this has happened with like Stanford, I remember, like when Stanford had HELM also, like they were like running benchmarks.

Yeah.

And somehow like Claude was like always like not performing well.

And that's because like the way they prompted it was kind of wrong.

So it's like a lot of like techniques.

It's just like very hard because like nobody even knows.

Has that gone away with chat models instead of, you know, just raw completion models?

Yeah.

I guess like each eval also can be run in a very different way.

Sometimes you can like ask the model to output in like XML tags, but some models are not really good at XML tags.

And so it's like, do you change the formatting per model or like, do you run the same format across all models and then like the metrics themselves, right?

Like maybe, you know, accuracy is like one thing, but maybe you care about like some other metrics like F score, like some other like things, you know, it's not hard.
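
[Editor's sketch] To make the parsing and metric points above concrete, here is a small illustration of how a grader might tolerate stray newlines around an XML-style answer tag and then compute both accuracy and an F1 score. The `<answer>` tag name, the sample completions, and the helper names are all assumptions made for this example; they are not taken from any lab's harness.

```python
import re

# Tolerate whitespace and newlines around the answer, plus tag-case differences.
ANSWER_RE = re.compile(r"<answer>\s*(.*?)\s*</answer>", re.DOTALL | re.IGNORECASE)


def extract_answer(completion: str):
    """Return the normalized text inside the first <answer> tag, or None if absent."""
    match = ANSWER_RE.search(completion)
    return match.group(1).strip().upper() if match else None


def accuracy(preds, golds):
    """Fraction of predictions that exactly match the gold labels."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)


def f1(preds, golds, positive):
    """F1 for one label treated as the positive class."""
    tp = sum(p == positive and g == positive for p, g in zip(preds, golds))
    fp = sum(p == positive and g != positive for p, g in zip(preds, golds))
    fn = sum(p != positive and g == positive for p, g in zip(preds, golds))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up completions: leading newlines, a lowercase answer, and a missing tag.
completions = ["\n\n<answer>\nA\n</answer>", "<answer>b</answer>", "The answer is A."]
golds = ["A", "A", "A"]
preds = [extract_answer(c) for c in completions]

acc = accuracy(preds, golds)
f1_a = f1(preds, golds, positive="A")
print(preds, acc, f1_a)
```

The third completion arguably contains the right answer but is scored as wrong because it ignores the requested format, which is the kind of prompt-and-parser mismatch that can make a model look worse than it is.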

And talking about o1 prompting, we just had an o1 prompting post on the newsletter, which I think apparently went viral within OpenAI.

Yeah.

I got pinged by other OpenAI people out there.

Like, this is helpful to us.

I'm like, okay.

I think it's like maybe one of the top three most read posts now.

Yeah.

And I didn't write it.

Exactly.

What are your tips on o1 versus like Claude prompting or like, what are things that you took away from that experience?

And especially now I know that with 4o for Canvas, you've done RL after on the model.

So yeah, just general learning.

So how to think about prompting these models differently.

I actually think like o1, I did not even harness the magic of like o1 prompting, but like one thing that I found is that like, if you give o1 like hard, like constraints of like what you're looking for, basically the model will have a much easier time to like kind of select the candidates and match like the candidate that most like fulfills the criteria that you gave.

And I think there's a class of problems like this that o1 excels at.

For example, if you have a question, like a bio question on like some, or like in chemistry, right?

Like if you have like very specific criteria with the protein or like some of the chemical bindings or something like then the model would be really, it would be really good at like determining the exact candidate that will match the certain criteria.

I have often thought that we need a new IFEval for this because this is basically kind of instruction following, isn't it?

Yes.

But I don't think IFEval has like multi-step IFEval.

Yeah.

So that's what basically I use AI news for.

I have a lot of problems and a lot of steps and a lot of criteria and o1 just kind of checks through each kind of systematically.

And we don't have any evals like that.

Yeah.

Does OpenAI know how to prompt o1?

I think that's kind of like that.

Sam is always talking about incremental deployments and kind of like getting, having people getting used to it.

When you release a model, you obviously do all the safety testing, but do you feel like people internally know how to get a hundred percent out of the model?

Or like, are you also spending a lot of time learning from like the outside on how to better prompt o1 and like all these things?

Yeah, I certainly think that you learn so much from like external feedback too on how people use like o1.

I think like a lot of people use o1 for like really hardcore like coding questions.

I feel like I don't fully know how to best use o1 except for like, I use o1 to just like do some like synthetic data explorations, but that's it.

Do people inside of OpenAI, once the model is coming out, do you get like a company wide memo of like, Hey, this is how you should try and prompt this, especially for people that might not be close to it during development, you know, or I don't know if you can share anything, but I'm curious.

Well, internally, these things kind of get shared.

I feel like I'm like in my own little corner in like research.

I don't really like to look at some of the Slack channels.

It's very, very big.

So I actually don't know if something like this exists.

Probably it exists because we need to share with like customers or like, you know, like some of the guides and like how to use this model.

So probably there is.

I often say this, the reason that AI engineering can exist outside of the model labs is because the model labs release models with capabilities that they don't even fully know because you never trained specifically for it.

It's emergent and you can rely on basically crowdsourcing the search of that space or the behavior space to the rest of us.

Yeah.

So like you don't have to know.

That's what I'm saying.

Yeah.

I think like, um, an interesting thing about like o1 is that like it's really for like average human.

Sometimes I don't even know whether the model like produced the correct output or not.

Like it's really hard for me to like verify even like hard, like STEM questions.

I don't know if I'm not an expert, like I usually don't know.

So it's like the question of like alignment is actually more important for these like complex reasoning models, like, how do we help humans to like verify the outputs of these models?

It's quite important.

And I feel like, yeah, like learning from external feedback is kind of cool.

For sure.

Um, one last thing on Claude 3, you had a section on behavioral design.

Yes.

Anthropic is very famous for the HHH goals.

What were your insights there?

Or, you know, maybe just talk a little bit about what you explored.

Yeah.

I think like behavioral design is like a really cool, I'm glad that I, I made it like a section around this and it's like really cool.

I think like lucky one got to publish one and then you insisted on it or what?

I think I just like put the section inside it and like, yeah, Jared, my, like one of my most favorite researchers like, yeah, that's cool.

Let's, let's do that.

I guess.

Um, yeah.

Like nobody had this like term like behavioral design necessarily for the models.

It's kind of like a new little field of like extending like product design into like the model design, right?

Like, so how do you create a behavior for the model in certain contexts?

So for example, like in Canvas, right?

Like one of the things that we had to like think about is, like, okay, like now the model enters like a more collaborative environment, more collaborative context.

So like, what's the most appropriate behavior for the model to act like as a collaborator?

Should it ask like more follow up questions?

Should it like change what the tone should be, like, what is the collaborator's tone?

It's different from like a chat, like conversationalist versus like collaborator.

So how do you shape the persona and the personality around that?

It has like some philosophical questions too, like, yeah, behavioral.

I mean, like, I guess like I can talk more about like the methods of like creating the personality.

Please.

It's the same thing as like you would create like a character in a video game or something.

It's kind of like charisma, intelligence, wisdom, what are the core principles?

Helpful, harmless, honest.

And obviously for Claude it's much easier than I would say like for ChatGPT. For Claude, like it's like baked in, like, the mission, right?

It's like honest, harmless, helpful.

But the most complicated thing about like the model behavior or like the behavioral design is that like sometimes two values would contradict each other.

I think this happened in Claude 3.

One of the main things that we were thinking about is like, how do we balance this like honesty versus like harmlessness or like helpfulness?

It's like, we don't want the model to always like refuse even like innocuous queries, like some like creative writing prompts, but also you don't want the model to, like, be harmful or something.

So it's like there's always a balance between those two and it's more like art than the science necessarily.

And this is what dataset craft is, it's like more of an art than a literal science.

You can definitely do like empirical research on this, but it's actually like, like this is the idea like synthetic data.

Like if you look back to like the Constitutional AI paper, it's around like how do you create completions such that you would agree to certain like principles that you want your model to agree on.

So it's like if you create the core values of the models, how do you decompose those core values into like specific scenarios or like, so how does the model need to express its honesty in a variety kind of like scenarios?

And this is where like generalization happens when you craft a persona of the model.

Yeah.

It seems like what you describe, behavior modification or shaping, was a side job that was done. I mean, I think Anthropic has always focused on it the first and the most, but now it's like every lab has sort of a vibes officer: for you guys it's Amanda, for OpenAI it's Roon.

And then for Google, it's Steven Johnson and Raiza, who we had on the podcast.

Do you think this is like a job?

Like it's like a, like every, every company needs a tastemaker.

I think the models personality is actually the reflection of the company or the reflection of the people who create that model.

So like for Claude, I think Amanda was doing a lot of Claude character work and I was working with her at the time, but there's no team, like Claude character team.

Isn't that cool?

But before that, there was not.

I think like actually it was Claude 3.

It was like, we kind of doubled down on the feedback from Claude 2. We didn't even think about it, but people said Claude 2 is so much better at writing and has a certain personality, even though it was unintentional.

And we did not pay that much attention and didn't even know how to productionize this property of the model having a better personality until Claude 3, when we kind of had to double down because we knew that we would launch in chat.

We wanted, like, Claude's honesty is really good for enterprise customers.

So you kind of wanted to make sure the hallucinations went down, like factuality would go up or something.

You didn't have a team until or after Claude 3, I guess.

Yeah.

I mean, it's going now.

And I think anyway, everyone's taking it seriously.

I think at OpenAI there was a team called Model Design.

It's Joanne, the PM.

She's leading that team and I work very closely with those teams.

We were working on, actually, the writing improvements that we did with ChatGPT last year.

And then I was working on this collaboration, like how do you make ChatGPT act like a collaborator for Canvas?

And then yeah, we worked together on some of the projects.

I don't think it's publicly known, his actual name other than Roon, but he's mostly doxed.

And then people can guess.

Do we want to move on to OpenAI?

And some of the recent work, especially, you mentioned Canvas.

So the first thing about Canvas is like, it's not just a UX thing.

You have a different model in the backend, which you post-trained on o1-preview distilled data, which was pretty interesting.

Can you maybe just run people through, you come up with a feature idea, maybe then how do you decide what goes in the model, what goes in the product, and just that process?

Yeah, I think the most unique thing about ChatGPT Canvas was that it was also the team formed out of the air.

So it was like July 4th or something during the break.

Like Independence Day, they just like, okay, I think it was there some like company break or something.

And I remember I was just like, taking a break.

And then I was like pitching this idea to Barret Zoph, who was my manager at that time.

I was like, I just want to create this, like, Canvas or something.

And I really didn't know how to navigate OpenAI.

It was like my first, I don't know, first months at OpenAI.

And I really didn't know how to get product to work with me on some of the ideas, so I'm really grateful, actually, for Barret and Mira, who helped me to staff this project, basically.

And I think that was really cool.

And it was like this 4th of July, and Barret was like, yeah, actually, he was like an engineering manager, he's like, yeah, we should staff this project with like five, six engineers or something.

And then Karina can be a researcher on this project.

And I think like, this is how the team was formed.

This was kind of like out of the air.

And so like, I didn't know anyone at that time, except for Thomas Dimson, he did like the first like initial like engineering prototype of the canvas and it kind of like ripped off.

But I think we learned a lot along the way about how to work together as product and research.

And this is one of the first projects at OpenAI where research and product worked together from the very beginning.

And which has made it like a successful project, in my opinion, is because like designers, engineers, PM, and research team were all together.

And we would like push back on each other like if like it doesn't make sense to do it on a model side, like we had to like collaborate with like applied engineers to like make sure this is being handled on the applied side.

But the idea is you can go that far with a prompted baseline. Prompted ChatGPT was kind of the first thing that we tried, with canvas as a tool or something.

So how do we define the behavior of the canvas, but then like we've found a bunch of like different like edge cases that we wanted to like fix.

And the only way to like fix some of the edge cases is actually through post training.

So what we actually did was retrain the entire 4o plus our canvas stuff.

And there are two reasons why we did this. The first one is that we wanted to ship this as a beta model in the dropdown menu.

That way we could rapidly iterate on user feedback as we shipped it, without going through the entire integration process into this new one model or something, which took some time, right?

So from beta to GA, it took, I think, three months.

So we kind of wanted to ship our own model with that feature to learn from the user feedback very quickly.

So that was like one of the decisions that we made.

And then with canvas itself, we just had a lot of different behavioral questions. Again, it's behavioral engineering, so there was various behavioral craft around: when does canvas need to write comments?

When does it need to update or edit the document?

When does it need to rewrite the entire document versus edit the very specific section that the user asks about? And when does it need to trigger the canvas itself? Those were some of the behavioral engineering questions that we had.

At that time, I was also working on writing quality.

So that was the perfect way for us to both teach the model how to use canvas and also improve writing quality, since writing was one of the main use cases for ChatGPT.

So I think that was like the reasoning around that.

There's so many questions.

Oh my God.

Quick one.

What does improve writing quality mean?

What are the evals?

The way I'm thinking about it is that there are two directions.

The first direction is how do you improve the quality of the writing for the current use cases of ChatGPT, and most of those use cases are nonfiction writing.

It's like email writing, or maybe blog posts; cover letters are one of the main use cases.

But then the second one is like, how do we teach the model to literally think more creatively or like write in a more creative manner such that it will just create novel forms writing.

And I think the second one is like much of a longer term like research question.

While the first one is more like, okay, we just need to improve data quality for the writing use cases that the models are used for.

It is a more straightforward question, but the way we evaluated the writing quality, actually, I worked with Joanne's team on Model Design.

So they had a team of model writers and we would work together, and it's just like a human eval.

It's like internal human eval where we just always like that.

Yeah, on the prompt distribution that we cared about, like we want to make sure that the models that we like use that we trained were always like better or something.

Yeah.

So like some test set of like a hundred prompts that you want to make sure you're good on.

I don't know how big the prompt distribution needs to be because you are literally catering to everyone.

Right.

Yeah.

I think it was much more opinionated way of like improving writing quality because we worked together with like model designers to like come up with like core principles of what makes this particular writing good.

Like what does make email writing good?

And we had to craft, literally, a rubric on what makes it good, and then make sure during the eval we check the marks on this rubric.

Yeah, that's what I do.

Yeah.

That's what school teachers do.

Yeah.

It's really funny.

Like, yeah, that's exactly how we grade essays.

Yes.

Yeah.
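The rubric-based human eval described above can be sketched roughly as follows. This is an illustration under stated assumptions, not OpenAI's internal tooling: the rubric items, the `rubric_score` and `win_rate` helpers, and the idea that graders record one boolean mark per item are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class RubricItem:
    name: str
    description: str


# Hypothetical rubric for "what makes email writing good".
EMAIL_RUBRIC: List[RubricItem] = [
    RubricItem("clear_ask", "The request to the reader is stated explicitly."),
    RubricItem("concise", "No filler sentences or repeated points."),
    RubricItem("appropriate_tone", "Tone matches the recipient and context."),
]


def rubric_score(marks: Dict[str, bool], rubric: List[RubricItem]) -> float:
    """Fraction of rubric items a human grader checked off for one response."""
    return sum(bool(marks[item.name]) for item in rubric) / len(rubric)


def win_rate(candidate: List[float], baseline: List[float]) -> float:
    """Share of prompts where the candidate model strictly beats the baseline."""
    wins = sum(c > b for c, b in zip(candidate, baseline))
    return wins / len(candidate)


# One grader's marks for a single response (made-up values).
marks = {"clear_ask": True, "concise": False, "appropriate_tone": True}
print(round(rubric_score(marks, EMAIL_RUBRIC), 2))
```

Averaging `rubric_score` over the prompt distribution the team cares about, and comparing against the previous model with `win_rate`, gives one simple way to check that a newly trained model is "always better" on that distribution.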

I guess my question is when do you work the improvements back in the model?

So the canvas model is better at writing why not just make the core model better too.

So for example, I built this small podcast thing for our podcasts, and I have the 4o API, and I asked it to write a write-up about the episode based on the transcript, and then I've done the same in Canvas.

The canvas one is a lot better.

Like the one from the raw 4o starts, "The podcast delves...", and I was like, no, I don't want "delves" as the third word.

Why not put them back in 4o core? Or is there just like... You put it back in the core.

Yeah.

So like, so the 4o canvas now is the same.

Yes.

As 4o.

Yeah.

He must have missed that update.

Yeah.

What's the, what's the, what's the process?

That's a little bit different.

It's just like an A/B test almost, right, to me.

It feels, I mean, I've only tried it like three times, but it feels the canvas, the canvas output feels very different than the API output.

Yeah.

I think like there's always like a difference in the model quality.

I would say the original beta model that we released with canvas was actually much more creative than even right now when I use 4o with canvas.

I think it's just like the complexity of like the data and the complexity of the, it's kind of like versioning issues right here.

I was like, okay, like your version 11 will be very different from version eight.

Right.

It's like, even though like the stuff that you put in is like the same or something.

It's a good time to say that I have used it a lot more than three times.

I'm a huge fan of canvas.

I think it is, yeah.

Like it's weird when I talk to my other friends, they don't really get it yet, but they don't really use it yet.

I think because it's maybe sold as sort of writing help when really it's kind of, it's the scratch pad.

Yeah.

What are the core use cases?

Or like, yeah, I'm curious.

Literally drafting anything.

Like I want to draft copy for my conference that I'm running.

Like I'll put it there first.

And then I like, it'll just have the canvas up and I'll just say what I don't like about it.

And it changes.

I will maybe edit stuff here and paste in.

So, so for example, like I wanted to draft a brainstorm list of reasons of signs that you may be an NPC just for fun, just like a blog post for fun.

And I was like, okay, I'll do 10 of these.

And then I want you to generate the next 10.

So I wrote 10, I pasted it into ChatGPT, and it generated the next 10 and they all sucked, all horrible, but it also spun up the canvas with the blog post.

And I was like, okay, self critique, why your output sucks?

Yeah.

And then try again.

And it just kind of just iterates on the blog post with me as a writing partner.

And it is so much better than I don't know, like intermediate steps.

It's like that would be my primary use case.

It's like literally drafting anything.

I think the other way that I'll put it, I'm not putting words in your mouth.

This is how I view what canvas is and why it's so important.

It's basically an inversion of what Google Docs wants to do with Gemini.

It's like Google Docs on the main screen and then Gemini on the side and whatnot. What ChatGPT has done is do the chat thing first and then the docs on the side.

But it's kind of like a reversal of what is the main thing.

Like Google Docs starts with the canvas first that you can edit and whatever.

And then maybe sometimes you call in the AI assistants, but ChatGPT, what you are now is kind of AI-first, with the side output being Google Docs.

I think we definitely want to improve like writing use case in terms of like, how do we make it easier for people to format or do some of the editing?

I think there is still a lot of room for improvement to be honest.

I think another thing is coding, right?

I feel like one of the things that we are doubling down on is actually executing code inside the canvas.

And there is a lot of questions like how do you evolve this?

It's kind of like an IDE for both.

And I feel like this is where I'm coming from: ChatGPT evolves into this blank interface, which can morph itself into whatever you try.

Like the model should try to derive your true intent and then modify the interface based on your intent.

And then if you're writing, it should become the most powerful writing IDE possible.

If it's coding, it should become a coding IDE or something.

I think it's a little bit of a odd decision for me to call those two things the same product name, because they're basically two different UIs.

Like one is Code Interpreter plus plus.

Yeah, that was canvas.

Yes.

I don't know if you have other thoughts on canvas.

No, I'm just curious, maybe some of the harder things.

So when I was reading, for example, forcing the model to do targeted edits versus a full rewrite, it sounds like it was really hard. In the AI engineer mind,

Maybe sometimes it's like just pass one sentence in the prompt.

It's just going to rewrite that sentence, right?

But obviously it's harder than that.

What are maybe some of the like hard things that people don't understand from the outside and building products like this?

I think it's always hard with any new like product feature like canvas or tasks or like any other new features that you don't know how people would use this feature.

And so how do you even like build evaluations that would simulate how people would use this feature?

And it's always like really hard for us.

Therefore, we try to lean on iterative deployment in order to learn from user feedback as much as possible.

Again, it's like, we didn't know that like code diffs was very difficult for a model, for example.

Again, it's like, do we go back to like fundamentally improve like code diffs as a model capability?

Or do you do a workaround where the model will just rewrite the entire document, which yields higher accuracy?

And so those are some of the decisions that we had to make: how do you raise the bar on product quality, but also make sure the model quality is part of it?

And what kind of trade-offs are you okay to make?
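The trade-off between targeted edits and a full rewrite can be illustrated with a small sketch. This is not how Canvas is implemented; it is a hypothetical application-side fallback in which a targeted edit is applied only when the span to replace can be located unambiguously, and the full rewrite is used otherwise. The `apply_edit` function and its arguments are assumptions made up for this example.

```python
from typing import Tuple


def apply_edit(document: str, target: str, replacement: str,
               full_rewrite: str) -> Tuple[str, str]:
    """Apply a targeted edit when `target` occurs exactly once.

    Falls back to the model's full rewrite when the target span is
    missing or ambiguous, trading edit precision for reliability.
    Returns the new document and which strategy was used.
    """
    if target and document.count(target) == 1:
        return document.replace(target, replacement), "targeted"
    return full_rewrite, "rewrite"


doc = "Hello world. Goodbye world."

# Unambiguous span: only the selected sentence changes.
print(apply_edit(doc, "Hello world.", "Hi there.", "Hi there. Goodbye world."))

# Ambiguous span ("world" appears twice): fall back to the full rewrite.
print(apply_edit(doc, "world", "there", "Hello there. Goodbye there."))
```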

Again, I think I think this is like new way of product development is more like product research, model training and like product development goes like together hand in hand.

This is like one of the hardest things.

Like defining the entire like model behaviors, I think just like, is there so many edge cases that might happen, especially when you like do canvas with like other tools, right?

Like canvas plus Dali, canvas plus search.

If you select a certain section and then ask for search, how do you build such evals?

Like what kind of features or behaviors do you care the most about?

And this is how you build evals.

You tested against every feature of ChatGPT?

No.

I mean, I don't think there's that many that you can write that will take forever.

But it's the same: the decision boundary between Python, ADA (Advanced Data Analysis), versus canvas is one of the trickiest decision boundary behaviors that we had to figure out.

Like, how do you derive the intent from the human user query?

Yeah.

And how do I say this?

Deriving the intent, meaning does the user expect canvas or some other tool?

And then like make sure that it's like maximally like the intent was is like actually still one of the hardest problems, especially with like agents, right?

Like you don't want like agents to go for like five minutes and do something on the background and then come back with like some mid answer that you could have gotten from like a normal model or like the answer that you didn't even want because it didn't have enough context.

It didn't follow up correctly.
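One way to picture an eval for this tool decision boundary is a small labeled set of user queries with the tool a human would expect, scored against whatever router is under test. The sketch below is purely illustrative: the labeled prompts, the tool names, and the deliberately naive `keyword_router` are hypothetical stand-ins, not the real ChatGPT behavior.

```python
from collections import Counter

# Hypothetical labeled queries: (user prompt, tool a human would expect).
LABELED = [
    ("Draft a cover letter for a PM role", "canvas"),
    ("Plot a histogram of this CSV", "python"),
    ("What's the capital of France?", "none"),
    ("Write code to compute the mean of this CSV and run it", "python"),
]


def keyword_router(prompt: str) -> str:
    """A deliberately naive stand-in for the model's tool choice."""
    text = prompt.lower()
    if any(word in text for word in ("draft", "write", "essay")):
        return "canvas"
    if any(word in text for word in ("plot", "csv", "compute")):
        return "python"
    return "none"


def evaluate_router(router, labeled):
    """Return accuracy and a confusion count keyed by (expected, predicted)."""
    confusion = Counter()
    correct = 0
    for prompt, expected in labeled:
        predicted = router(prompt)
        confusion[(expected, predicted)] += 1
        correct += predicted == expected
    return correct / len(labeled), confusion


accuracy, confusion = evaluate_router(keyword_router, LABELED)
print(accuracy)
print(dict(confusion))
```

The last labeled query is the interesting one: it mentions writing, so the naive router sends it to canvas even though the user wants code executed, which is the kind of boundary case the conversation describes.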

You said the magic word.

We have to take a shot every time you say it.

You said agents.

So let's move to tasks.

You just launched tasks.

What was that like?

What was the story?

I mean, it's your it's your baby.

So now that I have a team, actually, Tasks was purely my resident's project.

I was mostly a supervisor.

So I kind of like delegated a lot of things to my resident.

His name is like Vivek.

And I think this is like one of the projects where I learned management.

I would say.

Yeah, but it was really cool.

I think it's very similar model.

I'm trying to replicate canvas operational model.

How do we operate with product people or like product applied orgs with research and the same happened.

I was trying to replicate like the methods and replicate the operational process of tasks and actually tasks was developed less than like two months.

So if canvas took like, I don't know, four months, then tasks took like two months.

And I think, again, like it's kind of very similar process of like, how do we build Evals?

You know, some people ask for reminders in actual ChatGPT, even though they know it doesn't work.

Yeah.

So like there is some like demand or like desire from users to like do this.

And actually, I feel like task is like simple feature in my opinion, is something that you would want from any model.

Right.

But then the magic is like when I actually because the model is so general, it knows how to use search or like canvas or like create sci-fi stories and create Python puzzles.

When coupled with this, tasks actually becomes really, really powerful.

It was like the same ideas of like, how do we shape the behavior of the model?

Again, we shipped it as a beta model in the model dropdown.

And then we are working towards like making that feature integrated in like the core model.

So I feel like the principles are like, everything should be like in one model.

But because of some of the operational difficulties, it's much easier to deploy as a separate model first to learn from the user feedback, and then iterate very quickly and then improve into the core model.

Basically, again, this project was also together from the very beginning: designers, engineers, researchers, we were all working together.

And together with model designers, we were trying to come up with evals, evaluations, and testing and bug bashing, and it's like a lot of cool synergy.

Evals, bug bashing.

I'm trying to distill.

I would love a canvas for this, to distill what the ideal product management or research management process is, right?

Start from like, do you have a PRD?

Do you have a doc that like these things?

Yes.

And then from PRD, you get funding, maybe, or like, you know, staffing resources, whatever.

Yes.

And then prototype, maybe.

Yeah, prototype, I would say like prototype was prompted baseline.

It's always a prompted baseline, and then we craft certain evaluations that you want to capture.

You want to measure progress, at least, with the model.

And then make sure that your evals are good, and make sure that the prompted baseline actually fails on those evals, because then you have something to hill climb on.

And then once you start iterating on the model training, it looks very iterative.

So every time you train the model, you look at the benchmarks or look at your evals, and if it goes up, it's good.

But then you also want to make sure it's not super overfitting.

Like, that's where you run on other evals, right?

Like intelligence evals or something.

And then like, you don't want regressions on the other stuff.

Yes.

Okay.

Is that your job?

Or is that like the rest of the company's job to do?

I think it's mainly my like, really job of the people who like because regressions are going to happen and you don't necessarily own the data for the other stuff.

What's happening right now is that you basically only ablate your datasets, right?

So it's like, you compare on the baseline, you compare the regressions on the baseline model, model training, and then bug bash.

And that's, that's about it.
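The loop described above (prompted baseline, targeted evals to hill climb on, then checking other evals for regressions) can be sketched as a simple comparison of eval scores. The eval names, the scores, and the `find_regressions` and `should_ship` helpers are hypothetical; real tolerances and ship criteria would be set by the team.

```python
from typing import Dict, Tuple


def find_regressions(baseline: Dict[str, float], candidate: Dict[str, float],
                     tolerance: float = 0.01) -> Dict[str, Tuple[float, float]]:
    """Evals where the candidate drops below the baseline by more than `tolerance`."""
    return {
        name: (baseline[name], candidate[name])
        for name in baseline
        if name in candidate and candidate[name] < baseline[name] - tolerance
    }


def should_ship(baseline: Dict[str, float], candidate: Dict[str, float],
                target_eval: str, tolerance: float = 0.01) -> bool:
    """Ship only if the targeted eval improved and nothing else regressed."""
    improved = candidate[target_eval] > baseline[target_eval]
    return improved and not find_regressions(baseline, candidate, tolerance)


# Made-up scores: the prompted baseline does poorly on the new feature eval.
baseline = {"canvas_trigger": 0.62, "writing_quality": 0.70, "general_reasoning": 0.81}
candidate = {"canvas_trigger": 0.91, "writing_quality": 0.74, "general_reasoning": 0.76}

print(find_regressions(baseline, candidate))
print(should_ship(baseline, candidate, target_eval="canvas_trigger"))
```

In this made-up run the feature eval goes up but a general eval drops, so the check flags the regression rather than treating the higher feature score as a win on its own.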

Actually, I did the course with Andrew Ng, where, yes, there's like one little lesson around this.

Okay, I haven't seen product research, you tweeted a picture with him.

And it wasn't clear if you were working on a course.

I mean, it looked like the standard course picture with Andrew Ng.

Yes.

Okay, it was a course with him.

Was that like working with him?

No, I'm not working with him.

I like, I just like did the course with him.

Yeah.

How do you think about the tasks?

So I started creating a bunch of them.

Like, do you see this as going back to the composability, like composable together later, like you're going to schedule one task that does multiple tasks chained together?

What's the vision?

I would say task is like a foundational module, obviously to generalize to all sorts of like behaviors that you want.

Like sometimes like I see like people have three tasks in one query.

And right now I don't think like the model handles this very well.

I think that ideally we learn from like the user behavior.

And ideally the model will just be more proactive in suggesting of like, Oh, I can either do this for you every day because I've observed that you do that every day or something.

So it's like more becomes like a proactive behavior.

I think right now you have to be more explicit, like, oh yeah, every day, remind me of this.

But I think ideally the model will always think about you in the background and kind of suggest, okay, I noticed you've been reading some of these particular Hacker News articles.

Maybe I can try to suggest you like every day or something.

So it's like, it's just like much more like of a natural like friend, I think.

Well, there's an actual startup called friend that is trying to do that.

Yes.

We'll have, we'll interview Avi at some point.

But like it sounds like the guiding principle is just what is useful to you.

It's a little bit B2C, you know, is there any B2B push at all?

Or you don't think about that?

I personally don't think about that as much, but I definitely feel like B2B is cool.

Again, I come back to Claude in Slack.

It's like one of the first interfaces where like the model was operating inside your organization, right?

It would be very cool for the model to like handle, to like become like a productive member of your organization.

And then either like even like even process, like I right now, like I'm thinking like processing like user feedback.

I think it'd be very cool if the model would just like start doing this for us and like we don't have to hire a new person on this just for this or something.

And like you have like very simple like data analysis, like data analytics or like how these features like.

Do you do this analysis yourself or do you have a data science team that tells you insights?

I think there are some data scientists.

Okay.

I've often wondered, I think there should be some startup or something that does automated data insights.

Like I just throw you my data.

You tell me.

Yeah, exactly.

Because that's what a data team at any company does, right?

Which is just give us your data.

We'll like make PowerPoints.

Yeah.

Yeah, that's, that'd be very cool.

That's, I think that's a really good vision.

You had thoughts on agents in general.

There's some more proactive stuff.

You actually had tweeted a definition, which is kind of interesting.

I did.

Or well, I'll read it out to you.

Tell me.

This is five days ago.

Agents are a gradual progression of tasks, starting off with one-off actions, moving to collaboration, ultimately fully trustworthy long horizon.

I know it's, I know it's not comfortable to have your tweets read to you.

I have had this done to me.

Ultimately, fully trustworthy long-horizon delegation in complex environments, like multiplayer, multi-agent. Tasks and Canvas fall within the first two.

What is the third one?

One of my weaknesses is like, I like writing long sentences.

I feel like I need to like.

No, that's fine.

That's fine.

Is that your definition of agents?

Like, what are you looking for?

I'm not sure if this is my definition of agents, but I feel like it's more like how I think.

It makes sense, right?

Like, I feel like for me to like trust an agent with my passwords or my credit card, I actually need to build trust with that agent that it will handle my tasks correctly and reliably.

And the way I would go about this is how I would naturally like collaborate with other people.

Is it like, we first, even with any project, right?

Like we first came, when we first come, like we don't even know each other.

Like we don't know how each other's like working style.

Like what I prefer, what do they prefer?

How do they prefer to communicate, etc, etc.

So like you spend like the first, like, I don't know, like two weeks to just like learn their style of working.

And then like over time you adapt to their working style.

And then this is how you create the collaboration.

And then like at the beginning, you don't have much trust.

So like how do you build more trust?

Especially like the same thing as like with a manager, right?

Like it's like, how do you build trust with your manager?

What does it need to know about you?

What do you need to know about them?

Over time, as you build trust and trust builds either through collaboration, which is why I feel like building Canvas was kind of like the first steps towards like more collaborative agents.

I think with humans, like you can, you should need to show a consistent effort to each other, like consistent effort that you care about each other is that you like work together very well or something.

So consistency and like collaborations, like what creates trust.

And then I will naturally will try to delegate tasks to a model because I know the model will not fail me or something.

So it's kind of like building out like the intuition for the form factor of like new agents.

Because sometimes I feel like a lot of researchers or like people in AI community are like so into like, yeah, agents delegate everything, like blah, blah.

But like on the way towards that, I think like collaboration is actually one of the main roadblocks, like milestones to get over.

Because then you will learn some of the implicit preferences that would help towards like this full delegation model.

Yeah, trust is very important.

I have an AGI working for me and we're still working on the trust issues.

We are recording this just before the launch of Operator.

The other side of agents that is very topical recently is computer use. Anthropic launched computer use recently.

You know, you're not saying this, but OpenAI is rumored to be working on things.

And like, there's a lot of labs are like exploring this like sort of drive a computer generally.

How important is that for agents?

I think it will be one of the core capabilities of agents.

Yeah, computer using agents using desktop or like your computer is like the delegation part, like when you might want to like delegate an agent to like order a book for me or like order flight or like search for a flight and then order things.

And I feel like this idea was flying around like for a long time since at least like 2022 or something.

And finally we are here.

It's just like there's a lot of lag between idea and full execution, on the order of two to three years.

The vision models had to get better.

Yeah, a lot better perception and something.

But I think like it's really cool.

I feel like it's it has like implications for like consumers definitely like delegation.

But again, I think leading is one of the most important factors here: you want to make sure that the model correctly understands what you want.

And then if it doesn't understand or if it doesn't know like full context, it should like ask for a follow up question and then like use that to perform the task.

Like the agent should know if it has enough information to complete the task at the maximal, if it was a maximal success or not.

And I think this is like still an open kind of like research question that you like.
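A toy version of the "does the agent have enough information, or should it ask a follow-up?" check might look like the sketch below. Treating the problem as a fixed list of required fields is a simplifying assumption for illustration; as noted in the conversation, the general version is an open research question. The task name, field names, and `next_step` helper are hypothetical.

```python
from typing import Dict, List

# Hypothetical required context per task type.
REQUIRED_FIELDS: Dict[str, List[str]] = {
    "book_flight": ["origin", "destination", "date"],
}


def next_step(task: str, known: Dict[str, str]) -> Dict[str, str]:
    """Ask a follow-up question if required context is missing, else proceed."""
    missing = [f for f in REQUIRED_FIELDS[task] if not known.get(f)]
    if missing:
        return {"action": "ask", "question": f"Could you tell me the {missing[0]}?"}
    return {"action": "execute"}


print(next_step("book_flight", {"origin": "SFO", "destination": "JFK"}))
print(next_step("book_flight", {"origin": "SFO", "destination": "JFK", "date": "2025-03-01"}))
```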

Yeah.

And the second idea is that like, I think it also enables new causal like research questions of like computer use agents, like can we use it in RL, right?

Like this is kind of like very cool, like nascent area of like research.

What's one thing that you think by the end of this year, people will be using computer use agents a lot for?

I don't know.

It's really hard to predict.

Maybe for coding.

I don't know, like for coding.

I think like right now, like with canvas, we are thinking about like this paradigm of like real time collaboration to like asynchronous collaboration.

So it's like, it would be cool if I can just delegate to a model, like, okay, can you figure out like how to do this feature or something?

And then the model can just like test out that feature in its own like virtual environment or something.

I don't know, like maybe this is a weird idea.

Obviously, there will be a lot of use cases around the consumers, consumer use cases like hey, like shop for me or something.

I was gonna say everyone goes to booking plane tickets.

That's like the worst example, because you only booked plane tickets what two or three times a year.

Concert tickets.

I don't know.

Yeah.

I want a Facebook marketplace bot that just scrolls Facebook marketplace for free stuff.

Yeah.

And then get it.

I don't know.

What do you think?

I have been very bearish on computer use because they're slow, they're expensive, they're imprecise, like the accuracy is horrible.

Still, even with Anthropic's new stuff.

I'm really waiting to see what OpenAI might do to change my opinions.

And really what I'm trying to do is like Jan last year versus December last year, I changed a lot of opinions.

What am I wrong about today?

And computer use is probably one of them where I'm like, I don't think I don't know if by end of the year we'll still be using them.

Will my ChatGPT, like every ChatGPT instance, will they have a virtual computer?

Maybe I don't know.

Coding.

Yes.

Because he invested in a company that does that, the code sandboxes.

There are a bunch of code sandbox companies; E2B is the name. But then, like, browsers, yes.

Computer use is like coding plus browsers plus everything else.

There's a whole operating system and it's very like you have to be pixel precise.

You have to OCR.

Well, I think OCR is basically solved, but like pixel precise and like understand the UI of what you're operating in.

Like I don't know if the models are there yet.

Yeah.

Two questions.

Like, do you think the progress of mini models like o3-mini or o1-mini... I guess it comes back to Claude, like Claude 3 Haiku, Claude Instant 1.2, this gradual progression of small models becoming really powerful, which are also very fast.

Like I'm sure like the computer use agents like would be able to like couple with like those like small models that will solve some of the latency issues in my opinion.

I think in terms of like the operating system, what I think a lot about these days is just like, if we're entering this like task-oriented operating system or something, or also a generative OS, like in my opinion, people in a few years will click on websites way less.

I want to see the plot of like website clicks over time, but then my prediction is like it will go down and like people's access to the internet will be through the models lens.

Either you see what the model's doing or you don't see what the model's doing on the internet.

Yeah.

I think my personal benchmark for computer use this year is expense reports.

So I have to do my expense report every month, but what you need to do.

So for example, I expense a lunch.

I have to go back on the calendar and see who I was having lunch with.

Then I need to upload the receipt of the lunch and I need to tag the person in the expense report, blah, blah, blah.

Yeah.

It's very simple on a task by task basis.

Yeah.

But like you have to go to every app that I use.

You have to go to like the Uber app.

You have to go to the camera roll to get the photo of the receipt and all these things.

It's not, you cannot actually do it today, but it feels like a tractable problem.

You know that probably by the end of the year, we should be able to do it.

Yeah.

This reminds me of like the idea of you kind of want to show to computer use agents how you would want, how you want, or how you like booking your flights.

It's kind of like few-shot.

Yeah.

Demonstration and learning.

Demonstration of like maybe there is more efficient way that you do things that the model should learn to do in that way.

And so it's kind of like, again, it comes back to personalized tasks too. Like right now Tasks is just, like, very rudimentary, but in the future tasks should become much more personalized for your preferences.

Okay.

Well, we mentioned that.

I'll also say that I think one takeaway I got from this conversation is that ChatGPT will have to integrate a lot more with my life.

Like you, you, you will need my calendar.

You will need my email.

Yes.

For sure.

And maybe use MCP.

I don't know if you've looked at MCP.

No, I haven't.

It's good.

It's got a lot of adoption.

Anything else that we're forgetting about or like maybe something that people should use more?

Yeah.

I don't know.

Before we wrap on like the OpenAI side of things.

I think like search product is kind of cool.

Like ChatGPT search.

I think this idea of like, you know, like right now I'm thinking a lot about, like, you know, the magic of ChatGPT when it first came out was like, you know, you ask something, you give an instruction, and then it would follow the instruction that you gave to the model, right?

Like, write me a poem, and it will give you a poem.

But I think like the magic of the next generation of ChatGPT is like, actually, and we're like, we're marching towards that.

It's like when you ask a question, it's not just going to be in the text output.

The ideal output might be like in some form of like a React app on the fly or something.

So like this is happening with like search, right?

Like give me like Apple stock and then it gives you the chart and gives you like this like generative UI.

And I feel like this is what I mean by like, the evolution of ChatGPT becomes like more of a generative OS with a task orientation or something.

So it's like, and then UI will adapt to what you like.

So like, if you really like 3D visualizations, I think the model should give you as much visualization as possible.

Like, you know, if you really like certain way of like the UIs, like maybe you like round corners, or I don't know, it's just like some color schemes that you're like, it's just like the UI becomes like more dynamic and like becomes like a custom, custom model, like personal model, right?

Like from personal computer to like a personal model, I think.

Yeah, takes overall.

You are one of the very few people, actually maybe not that rare, to work at both OpenAI and Anthropic.

Not anymore, yeah.

Cultural difference.

What are general takes that only people like you see?

I love both places.

I think I've learned so much at Anthropic and I'm really, really grateful to the people and I'm still like friends with a lot of people there.

And I was really sad when John left OpenAI, because I came to OpenAI because I wanted to work with him the most or something.

What are you doing now?

But I think it changed a lot.

So I think like when I first joined Anthropic, there were like, I don't know, 60, 70 people.

When I left, there were like 700 people.

So it's like a massive like growth.

OpenAI and Anthropic are different in terms of, like, maybe product mindset.

Maybe OpenAI is much more willing to take some of the product risks and explore different bets.

And I think Anthropic is much more focused and they have, I think it's fine.

Like they have to like prioritize, but they definitely doubled down on like enterprise, maybe more than like consumers or something.

I don't know.

It's just like some of the product mindsets might be different.

I would say, like, research-wise, I have enjoyed both research cultures, both at Anthropic and OpenAI.

I feel like they are more on a daily basis.

I feel like it's more similar than different.

I mean, no surprise.

Like how you run experiments is kind of like very similar.

In your mind.

It's just really good.

They're like connecting dots between like product and like kind of balancing like product research and like create this like comprehensive like coherent story because sometimes like there are like researchers who like really hate doing products and there are researchers who really love doing product and it's like kind of dichotomy between two and also like safety is like a part of this process.

So kind of you kind of want to like create this coherent like think from like systems perspective or like think about like bigger picture.

And I think I learned a lot from her on that.

I definitely feel like I have much more creative freedom at OpenAI and that's because the environment that the leaders set like enables me to do that.

So it's like if I have an idea, propose it.

Yeah, exactly.

There's like more like creative freedom and like resource reallocation, especially in the research is like being adaptable to like new technologies and like change your views based on like empirical results or kind of like changed research directions.

I've seen a lot of like sometimes I've seen researchers who would just like get stuck on the same directions for like two to two years and they would never like work out or something, but they would still be like stubborn.

So it's like adaptability to like new directions and like new paradigms is kind of like one of those things that this is a merit thing, what is the general culture in general kind of culture.

I think cool.

Yeah.

And just to wrap up, we usually have a call to action. Founders usually want people to work at their companies.

Do you want people to give you feedback?

Do you want people to join your team?

Oh yeah, of course.

I'm definitely hiring for like research engineers who are like more product minded people.

So it's like people who know how to change the models, but also like interested in like deploying into like the products and developing like new product features.

I'm just looking for those archetypes like research engineers or like research scientists.

So yeah, if you're like looking for a job, if you're like interested in joining my team, I'm really happy for you to just reach out, I guess.

And then just like generally what do you want people to do more often in the world, whether or not they work with you, like, you know, call to action as in like everyone should be doing this.

I think this is something that I tell a lot of designers: I think people should just spend more time just, like, playing around with the models.

And the more you play with the model, the more creative ideas you'll get around like what kind of like new potential features of the products or like new kind of interaction paradigms that you might want to create with those models.

I feel like we're bottlenecked by human creativity on completely changing the way we think about the internet or the way we think about software. Like, AI right now pushes us to rethink everything that we've done before, in my view.

And I feel like not enough people either double down on like those ideas, or I'm just like not seeing a lot of human creativity in this like interface design or like product design mindsets.

So I feel like it'd be really great for people to just like do that.

And especially right now is like research, some research becomes like much more product-oriented.

So it's like you actually can train the models for the things that you want to do in the products or something.

Yeah.

And you define the process now.

Now this is my go-to for how to manage a process.

I think it's pretty common sense, but it's nice to hear from you that because you actually did it.

That's nice.

Thank you for driving innovation in interface design and the new models at OpenAI and Anthropic.

And we're looking forward to what you're going to talk about in New York.

Yeah.

Thank you so much for inviting me here.

I hope my job will not be automated by the time.

Well, I hope you automate yourself.

Yeah, I hope so too.

We'll do whatever else you want to do.

That's it.

Thank you.

Thanks.