Training Data · 2025-07-22

OpenAI's ChatGPT Agent: Unifying Deep Research and Operator

Hosts: Unknown

Guests: Isa Fulford, Casey Chu, Edward Sun

ChatGPT Agent · Deep Research · Operator · Agentic AI · Reinforcement Learning · Tool Use · Multi-modal Agents · AI Safety · Product Development

Why it matters

ChatGPT Agent unifies Deep Research and Operator into one virtual computer

Key claims

ChatGPT Agent unifies Deep Research and Operator into a single agent with text browser, GUI browser, terminal, and image generation—all sharing state on a virtual computer.
Trained via reinforcement learning on diverse hard tasks across thousands of VMs; the model was not told when to use which tool and learned strategies on its own.
Agents can run for extended periods (examples cited at 28 minutes and one hour) with context-extension mechanisms to avoid losing track.
Designed for collaborative multi-turn use: clarifying questions, mid-task corrections, human takeover of the browser, and persistent computer state across sessions.

Episode summary

OpenAI launched ChatGPT Agent, its most powerful agent to date, built by merging the teams behind Deep Research and Operator. The agent runs on a virtual computer with shared state across multiple tools: a fast text browser, a full GUI browser, terminal with code execution and API access, and image generation. This unified environment enables fluid transitions between research, visual interaction, and artifact creation (spreadsheets, slide decks), allowing the agent to tackle complex multi-step tasks like estimating a company's valuation, building a financial model, and producing a presentation.

Safety stack includes a persistent monitor ('antivirus-like'), extensive red teaming (internal and external), bio-risk evaluation, and rapid-response protocols for new attacks.
Performance note: data-science benchmark reportedly outperformed human baseline; date picking and other basic UI tasks remain surprisingly difficult.
Small combined team (~10-12 research + applied) achieved the breakthrough in a few months, highlighting product insights and data curation as key drivers alongside compute.
Future direction: more sources and tools, personalization and memory, proactive agent-initiated tasks, and new interaction paradigms—preferring one generalist agent over specialized sub-agents.

Source material

Transcript

I think this model is actually very good at multi-turn conversations, and it's very nice to continue working on a task with.

I think that's one of the deficiencies of Deep Research.

A lot of people will do multiple Deep Research requests in a single conversation, but it doesn't always work so well.

So I think we're really happy with this model's multi-turn ability, and we just want to, you know, improve even further.

And then I also think, like, personalization and memory for agents will also be very important.

And right now, every agent task is initiated by the user, but in future it should also be doing things for you without you having to even ask in the first place.

Today we're exploring the evolution of AI agents with Isa Fulford, Casey Chu, and Edward Sun, the OpenAI team behind the new ChatGPT Agent.

You'll learn how they got to a huge leap forward in capability by unifying the architecture across Deep Research and Operator, allowing for multiple tools to share state, giving users fluid transitions between visual browsing, text analysis, and code execution all within a single environment.

We'll discuss their training approach.

Rather than programming specific tool usage patterns, they let the models discover the optimal strategies through reinforcement learning across thousands of virtual machines.

They've created an agent that can work alongside you for hours, asking clarifying questions and accepting mid-task corrections, expanding the ways that we can interact with AI agents.

The team shares fascinating challenges around safety, guardrails around agent activities, and why things like date picking still remain mysteriously difficult for AI systems.

They've revealed how small, focused teams are achieving breakthrough capabilities through careful data curation, suggesting that we're now entering a new phase of AI development where product insights matter just as much as compute power.

Enjoy the show.

Isa, Casey, Edward, thank you for joining us today.

Thank you so much for having us.

So you're the team behind the ChatGPT Agent, or agent mode.

What is it?

Yeah, so this has been a collaboration between the former Deep Research and Operator teams, and we've created a new agent in ChatGPT that's able to carry out tasks that would take humans a long time.

And we gave the agent access to a virtual computer.

And through that, it has a few different, two different ways to access the internet, or actually more ways, but we'll get to that.

It has a text browser, which is similar to the Deep Research tool.

So it's able to efficiently access information online and search through things with this very fast text browsing tool.

And then it also has a virtual browser, which is similar to the Operator tool.

So it actually has full access to the graphical user interface.

And it's able to click and type things into forms and scroll and drag and all this kind of, you know, all these kinds of things.

So together, it's much more powerful than either of those two tools, because one's more efficient and one's like much more flexible.

And then we also gave access to a terminal.

So it's able to run code and analyze files and create artifacts for you like spreadsheets or slides.

Also, through the terminal, it's able to call APIs.

So either public APIs or private APIs.

If you allow it, if you sign in, it could access your GitHub or Google Drive, SharePoint, many other things.

And the cool thing about this tool is all of the tools have shared state.

So it's similar to if you're using a computer, like all of your different applications have access to the same file system and things like that.

It's the same for the tool.

So the model can do quite flexible things.

And we'll talk more about this later, but I think it's just a very flexible way for the model to do very complex tasks on behalf of users.

Tell us a little about the origin story.

How did this get started?

Well, our team worked on Operator and our team worked on Deep Research.

And so back in January, we released our first agent, Operator.

This is a product that can do internet tasks for you, like buy things on the internet, shop for you, this kind of thing.

And then two weeks later, we released Deep Research, which is a different model, a different product, that's able to extensively browse the internet and synthesize information.

And it creates a long research report with citations for you.

And we were kind of thinking through our roadmap and we were kind of like, hey, this is kind of a match made in heaven here.

Like so, you know, Operator is really good at visual, you know, interacting with a web page, but it's less good at kind of the text browser, like reading long articles.

Whereas Deep Research is really good at reading long articles, but it has a tougher time with like interactive elements or like highly visual things.

Because the tools are different.

So Deep Research has a text browser.

So it's able to really efficiently read information and search and synthesize information.

But it's not able to like scroll and click in the same way or fill out forms in the same way that Operator is, because Operator actually has full access to the GUI browser.

And as Casey was mentioning, like Deep Research has some things that Operator doesn't have.

And then similarly, one of the biggest requests for Deep Research is for the model to be able to access like paywalled sources or things that you have to pay a subscription for.

And Operator is able to do that.

And also one of the members of our team, Eric, he was running an analysis on the types of prompts that people were trying on Operator.

And we realized that it was a lot of Deep Research type tasks, like research this trip for me, then book it.

So it really is a natural combination.

In what way is one plus one equals three?

So in Deep Research, we always wanted to figure out how to let Deep Research have access to a real browser that can, you know, load all the real content that the previous Deep Research could not access.

It's funny that you bring up the one plus one equals three, because not only did we combine Deep Research and Operator, but we also threw in like a bunch of other tools, basically like everything we could think of.

Yeah, so like the terminal tool is there so it can, like, run commands to do calculations.

The image gen tool is a fun one.

Like if it wants to, you know, spruce up its slides by making an image, it can do that.

It can call APIs, it can produce PowerPoints.

Yes, yeah, it can do a lot of different things.

Yeah, tell us a little bit how people are using it knowing it's still early days.

So I think the cool thing about it is we have some ideas of how we think people are going to use it, but I think we intentionally kept it quite open ended.

I mean, it's called Agent, which is so vague, partially because we are excited to see how people end up using it.

So I think some of the things that we specifically trained it for are, of course, Deep Research type tasks, where you want a long report on a topic; Operator type tasks, where you want it to do something for you, like book a flight or buy something for you; and then also tasks to make slide decks.

We also, you know, spent a lot of effort on making spreadsheets and doing data analysis, but I think there are also just so many other things the model can do.

So we're just excited to see how people use it, kind of similarly to how, when we launched Deep Research, we saw a lot of people using it for code search, which was really surprising to us.

We're hoping to see a lot of new use cases that we didn't even think of ourselves.

Would you guess it'd be more consumer or kind of B2B type use cases?

Or is that a false?

Hopefully both.

Okay.

I think we're kind of aiming for like the prosumer, like someone who's willing to wait like 30 minutes for like a detailed report, but that can be in the, like, yeah, in the consumer case or the, you know, at your job.

I think it could be good for both.

Any favorite things you've used it for?

For me, it's more like pulling data from our spreadsheets or Google Docs, like the ones documenting our experiment log, and then making some slides to present the data or organize the data.

It's pretty useful.

I've been doing a deep dive into ancient DNA.

It's one of my, you know, one of my interests.

And there's actually a lot of exciting work going on, like these past like five years.

They're like sequencing all this DNA and like discovering all these facts about like, oh, where did like this group of people come from and like, you know, historical stuff.

The problem is that everything is so new that, you know, there isn't like a reference like source material for like to summarize like a survey of these materials.

But, you know, agent can go out and pull together all these sources and synthesize it into a report that I can read or slides that I can read.

And I think it's kind of made for this topic.

Yeah, I like it for consumer use cases.

Like I've used it for online shopping.

I think especially because a lot of websites require using the visual browser, because they will have a search filter or something that it needs to go through, or the model like actually needs to be able to see what the item looks like.

And then also for planning events, it's been pretty useful.

What's your favorite shopping query?

I think I was using it for clothes shopping.

Love it.

And you guys also showed us a really cool use case right before we filmed this episode.

Do you want to share that one?

Yeah, sure.

So that was actually something that one of our co-workers, Tejal, shared with us.

She asked the agent to estimate OpenAI's valuation based on things that it found online, create a financial model with projections in a spreadsheet, also create a summary analysis, and then also create a slide deck presenting the results.

And so hopefully the model is correct, because it had quite an ambitious projection for us.

It was an impressive slide deck.

What I want to point out about this trajectory was that it reasoned for, I think, 28 minutes.

And yeah, I think this is kind of opening up a new paradigm where you ask the agent for tasks and then you step away and it comes back with a report.

And yeah, I think as agents become more agentic, it'll be like longer and longer tasks.

And this is a good example of one.

Are these the longest running tasks you guys have launched so far?

I would say so.

Like I just did one that was an hour long and I don't think I've ever seen that.

I don't know how long Codex can run for.

That's true.

Yeah.

Is there anything special that goes into making an agent run for so long without kind of flying off the rails?

We have some tools to enable the model to extend its context length beyond what's originally, like, the hard limit.

So that the model is able to perform tasks by, like, documenting what it's doing step by step, and kind of increase the length of the task it can do without the human's interruption.

Yeah, it's also the flow to go back and forth between the model and the human that's very nice, so I can correct it as it's going, right?
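
Editor's illustration: the speakers do not describe the mechanism in detail, so the following is only a minimal Python sketch of one way "documenting what it's doing" could keep a long task within a fixed context budget. The `WorkingContext` class, the character budget, and the truncation used as a stand-in for summarization are all assumptions made for this example, not OpenAI's implementation.

```python
from dataclasses import dataclass, field


@dataclass
class WorkingContext:
    """Hypothetical rolling context: recent steps verbatim, older steps as short notes."""
    max_chars: int = 200                              # stand-in for a real token limit
    notes: list[str] = field(default_factory=list)    # compact record of finished work
    recent: list[str] = field(default_factory=list)   # full text of recent steps

    def size(self) -> int:
        return sum(len(s) for s in self.notes + self.recent)

    def add_step(self, step: str) -> None:
        self.recent.append(step)
        # When over budget, fold the oldest verbatim step into a short note.
        # The newest step is always kept in full.
        while self.size() > self.max_chars and len(self.recent) > 1:
            oldest = self.recent.pop(0)
            self.notes.append(oldest[:30])  # crude placeholder for a real summary


ctx = WorkingContext()
for i in range(10):
    ctx.add_step(f"step {i}: " + "x" * 60)
```

In this sketch the notes themselves are never dropped, so a real system would also need some way to compress them; the point is only that every step leaves a trace the agent can consult later.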

Yeah, so this model is very flexible and collaborative and that was very important to us.

So it's modeled after how you would interact with someone if you ask them to do a task for you.

So imagine you're asking someone on Slack to do something for you.

You'd probably give them instructions and then they'd ask you some questions and then maybe start doing the task and then maybe in the middle of the task they'll say, oh, actually, can you clarify this for me or can you sign into this thing for me?

Or, am I allowed to do this for you?

And similarly, you might remember something that you forgot to say when you first gave them the task, you might want to interrupt them and just say, oh, hey, please also do this.

Or you might want a status update if they're taking a long time to do it, or you might want to redirect them if they're going down the wrong path.

So that's what we modeled it after.

And I think it's very important that the user and agent are both able to initiate communication with each other.

So I think what we have now is probably the most basic version of what this could be, but it's better than anything we've released before in this area, because at first the agent can ask you clarifying questions, similar to Deep Research, but it's more flexible.

So it doesn't always ask you clarifying questions.

And then you can interrupt the model so you can say, oh, can you summarize what you've done so far or oh, I forgot to say I actually only want blue sneakers.

And then if the model is going to take some kind of destructive action or if it needs you to log into something, it will also ask the user if it's allowed to do that before doing anything.
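
Editor's illustration: a minimal Python sketch of the ask-before-acting behavior described here. The action names in `NEEDS_CONFIRMATION` and the `run_action` function are hypothetical and chosen only for this example; the transcript does not say how the real check is implemented.

```python
from typing import Callable

# Hypothetical set of action kinds that need explicit user approval.
NEEDS_CONFIRMATION = {"purchase", "delete", "login", "send_email"}


def run_action(kind: str, detail: str, ask_user: Callable[[str], bool]) -> str:
    """Run an action, pausing for user approval when it has side effects."""
    if kind in NEEDS_CONFIRMATION:
        if not ask_user(f"Allow {kind}: {detail}?"):
            return f"skipped {kind}"
    return f"did {kind}"


# Read-only actions proceed without asking; side-effecting ones wait for a yes.
read_result = run_action("read", "product page", lambda question: False)
denied_result = run_action("purchase", "blue sneakers", lambda question: False)
approved_result = run_action("purchase", "blue sneakers", lambda question: True)
```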

On this topic, we kind of built this kind of computer interface, you guys saw, where you can kind of watch along with what the agent is doing.

And that actually persists beyond the conversation.

So once it's done with the task, you can actually go back and ask it follow up questions and ask it to fix something or do another task.

And you can also take over that computer so you can click in and then now you have access to its environment and you can click for it or log in for it or insert your credit card information or things like that.

And so, yeah, I like to think of it as like looking over your coworker's shoulder and being able to take over if necessary.

Thank you for enabling the micromanager in me.

Just kidding.

So we'd love to talk a little about how this works, to the extent that you can share.

Yeah, so this agent is trained with the same technique, reinforcement learning.

So we give this agent model all the tools we have implemented in the same virtual machine, like a text browser, a GUI browser, a terminal, and the image gen tool.

And then the model will try to solve the tasks that we curated, which are pretty hard tasks that the model has to complete using these tools.

And then, kind of, we reward the model if it completes the task efficiently and correctly.

And for example, after this training, the model actually learned to switch between these tools fluently.

Like for example, if you ask the model to research some restaurants and maybe book a spot for you, it will first do like a Deep Research style text-based browsing.

And then it will probably also use the GUI browser to view the images of the food and also view the availability, which is usually, you know, rendered with JavaScript, so you have to use a real GUI browser.

And then, for example, if you ask it to create an artifact, it usually, you know, can pull sources from a website and then use them in the terminal.
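
Editor's illustration: the transcript says only that the model is rewarded for completing tasks "efficiently and correctly". A minimal Python sketch of a reward with that shape might look like the following; the `reward` function, the step budget, and the 0.5 efficiency weight are assumptions for illustration, not the actual training signal.

```python
def reward(correct: bool, steps_used: int, step_budget: int = 100) -> float:
    """Hypothetical reward: correctness is required, efficiency adds a bonus."""
    if not correct:
        return 0.0  # an incorrect result earns nothing, however fast it was
    # Fraction of the step budget left unused, clamped at zero.
    efficiency = max(0.0, 1.0 - steps_used / step_budget)
    return 1.0 + 0.5 * efficiency


fast = reward(correct=True, steps_used=20)
slow = reward(correct=True, steps_used=80)
wrong = reward(correct=False, steps_used=5)
```

Under a signal like this, nothing tells the model which tool to use; a trajectory that picks the faster tool for a given step simply scores higher.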

Yeah, I think the cool thing about this tool compared to tool use implementation in the past is that all of the tools have shared state.

So it's like when you're using your computer and you have many different applications, you know, like if you download something, it's going to be accessible to other applications.

It's very similar.

So the model can open a page in the text browser, which is more efficient, but then maybe it realizes it needs the visual browser, so it can just seamlessly switch, or it could download something using the browser and then manipulate it in the terminal, or something like that.

It can run something in terminal and then open it in the browser.

It's very flexible.

And so it's just giving the model a more powerful way of interacting with the Internet and files and its file system and code and things like that.
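
Editor's illustration: a minimal Python sketch of what "shared state" means in practice, using a temporary directory as the shared file system. The `browser_download` and `terminal_sum` functions are hypothetical stand-ins for the browser and terminal tools, and the CSV contents are made up for the example.

```python
import csv
import tempfile
from pathlib import Path


def browser_download(workspace: Path) -> Path:
    """Stand-in for a browser tool saving a file into the shared workspace."""
    path = workspace / "prices.csv"
    path.write_text("item,price\nsneakers,80\nsocks,10\n")
    return path


def terminal_sum(workspace: Path) -> int:
    """Stand-in for a terminal tool reading the same file from the same place."""
    with (workspace / "prices.csv").open(newline="") as f:
        return sum(int(row["price"]) for row in csv.DictReader(f))


with tempfile.TemporaryDirectory() as tmp:
    workspace = Path(tmp)
    browser_download(workspace)       # one tool writes...
    total = terminal_sum(workspace)   # ...another tool picks up where it left off
```

Because both functions see the same directory, no data has to be passed between them explicitly, which is the property the speakers are describing.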

Yeah.

One interesting thing to emphasize is that, like, we essentially give the model all these tools and then, like, lock it in a room and then it experiments; you know, we don't really tell it when to use what tool.

It kind of figures that out by itself.

It's kind of almost magic.

The technique, it sounds very similar to, you know, Deep Research.

We had you on the podcast before.

Should we think of this as the standard technique for how OpenAI thinks that agents will be trained going forward?

I think we can take this really far.

You know, our teams haven't been collaborating for that long.

We even framed this model run as kind of minimum shippable de-risk.

That was mostly for PR reasons internally.

But this is really like the most basic version we could make together.

And I think we have so much further we could push this with these methods.

For example, the slides capability is a new capability.

It's, you know, already very impressive.

It's great work from Aidan Paloma Martin, a bunch of other people.

But, you know, there's so much further we can push that and improve using the same techniques.

But I think we can take it further.

But we probably need other things too.

Yeah, I feel so far it's pretty magical.

Like the same algorithm just works.

Yeah.

Like o1 reasoning, like Deep Research with tool calls.

And then now like more advanced, you know, computer use, browser use agents.

Where does it run into the limits with this strategy and with this model specifically as well?

I think the interesting thing with this model is that, because it's able to take actions with external side effects, there's a lot more risk.

So for Deep Research, it was read only.

So there's kind of a limit to what the model could do in terms of like data exfiltration and other things.

But with this, in theory, the model could successfully complete a task, but take a lot of harmful actions along the way.

Like you could ask it to buy you something and it decides to buy just like a hundred different options to make sure that you're satisfied.

Exactly.

Or you can think of many examples like that.

So I think that safety and safety training and mitigations was kind of one of the really cool parts of our process with this model.

And yeah, maybe you can talk more about it.

I was going to mention that kind of along the same lines, it's like this contact with the real world that makes things difficult.

You know, we have to train this on like a bunch of VMs, like it's like thousands of VMs maybe and, you know, things break.

And, you know, as soon as you're hitting a real website, like the website's down or like you're hitting like all these capacity limits and like load testing and this kind of thing.

Yeah, it's really the very beginning.

And, you know, we're going to iron out all these details and continue, but that's a major limitation.

How do you think about from a safety perspective building in the right guardrails and like how do I make sure the model is not, you know, logging into my bank account and sending that all off to a Nigerian prince?

Yeah, that's a very good question.

Yeah, this is definitely an emerging risk where, you know, the Internet's a scary place, you know, there are a lot of like attackers and like scammers and this kind of thing, phishing attacks, like this goes on and on.

And yeah, our model is a bit like, it can reason about these things, like if you tell it to be careful; you know, we've done some safety training to make this like more robust.

But sometimes it can get fooled and sometimes it, you know, is a bit too overeager to complete your task.

We have a long list of mitigations and the team has worked really hard to like stack together a bunch of techniques to really try to make the model as safe as possible.

So, you know, one example that I'll call out is that we have a monitor that kind of looks over its shoulder and just sees if anything looks funny, like whether it's going on a weird website or anything like this, kind of like antivirus for your computer.

It's like just kind of persistently watching.

And then if it looks like there's anything suspicious, then it'll stop the trajectory and stop there.

Of course, we can't catch everything.

And this is a major area that we'll continue to iterate on.

We do have like a protocol for if there are new attacks in the wild that we discover or we encounter; then we can rapidly respond and update these monitors, kind of like you would update your antivirus software, so it would kind of pick up on these new attacks and hopefully keep you safe.
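
Editor's illustration: a minimal Python sketch of the antivirus analogy, where a monitor watches each step, halts the trajectory when something looks suspicious, and can be updated when new attacks are found. The `Monitor` class, the host blocklist, and the `.example` URLs are assumptions for this example; the real monitor is not described in this level of detail.

```python
from urllib.parse import urlparse


class Monitor:
    """Hypothetical monitor that flags visits to known-bad hosts."""

    def __init__(self, blocked_hosts):
        self.blocked_hosts = set(blocked_hosts)

    def update(self, new_hosts):
        # Analogous to an antivirus definitions update after a new attack is found.
        self.blocked_hosts |= set(new_hosts)

    def is_suspicious(self, url: str) -> bool:
        return urlparse(url).hostname in self.blocked_hosts


def run_trajectory(urls, monitor):
    """Visit URLs in order, stopping the whole trajectory at the first flagged one."""
    visited = []
    for url in urls:
        if monitor.is_suspicious(url):
            break  # halt rather than skip: nothing after the flag is executed
        visited.append(url)
    return visited


monitor = Monitor({"evil.example"})
plan = ["https://shop.example/a", "https://evil.example/x", "https://shop.example/b"]
before_update = run_trajectory(plan, monitor)

monitor.update({"shop.example"})  # pretend a new attack was reported on this host
after_update = run_trajectory(plan, monitor)
```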

Yeah, I think the cool thing about the safety training is that it's been a really cross org effort from the safety team, governance team, legal team, research team, engineering team, like so many others.

And we have so many mitigations at every single level.

We did a lot of external red teaming, internal red teaming.

But yeah, as Casey mentioned, there's more.

Surely when we release the model, there will be new things we uncover.

So we just need to make sure we also have robust ways of detecting those and then mitigating those.

For some of these models, there's a risk of what you can do with the models, whether it's creating bio hazards or otherwise.

How do you guys manage some of that?

Yeah, actually bio has been heavily on our mind.

Yeah, the team has been really thoughtful about, you know, yeah, this agent, you know, we think it's very powerful.

It can do research.

It can like really speed up your work.

But that also means that it could speed up harm.

And kind of one of the top things that our team has been looking into is bio risk.

So like creating bio weapons, this kind of thing.

And yeah, the team has been really thoughtful about how to mitigate against this and generally being very cautious.

We did like many weeks of red teaming to make sure that this model cannot be used for those harms and a bunch of other mitigations in place.

Shout out Karen, who spearheaded this effort.

And yeah, in general, I think we're very aware of this and just trying to be very cautious.

Yeah, makes sense.

Tell us a little about the team that came together to build this.

So as Casey mentioned earlier, we had the Deep Research research team and then the Deep Research applied team, and the Operator research team, or computer-using agent research team, and the Operator applied team.

And we effectively merged everybody.

We all work really closely together, both the research team and the applied team.

And the vibes have been great.

It's been so fun.

Easton and I have been friends for a long time.

Yeah.

So like it was a natural like, it was great.

How many of you are there?

On Deep Research, for the majority of the time, three or four.

Now we have some new people, which is very exciting.

And then on CUA.

On CUA, I think around six to eight, somewhere around there.

On the research side.

And then we have an amazing applied team, so like engineering, product, design, led by Yash Kumar.

And then he has just a really cracked engineering team.

So it's been very fun to work really closely.

I think that's one thing that has made this collaboration really special is that the research and applied teams work so closely.

And even from the beginning, we're defining what the product should be able to do.

It's very much a collaboration between research and product and design.

So we go backwards from the use cases we want to be able to solve to training the model and building the product.

And obviously it's not able to do all of those things fully yet.

And it can do some things that we didn't plan, but I think it's a good framework for us when we're starting a project.

It's very grounded in how we want people to use it in the real world.

It's a way smaller team than I was expecting.

Small teams can do amazing things.

Yeah, so you've built a lot.

Yeah, and we haven't been working together for very long.

It's been a few months.

Yeah.

And actually the boundary between the research team and the applied team is not very clear-cut, because during the model training, like lots of applied engineers, they are helping us train the model.

And also after we trained the model, some research team members are also working on the setup of the model to deploy the model to the real users.

What was the hardest part about training this agent?

Yeah, I think one of the biggest challenges we have is how to make training stable, especially since, when we trained Deep Research, it was only using browsing and Python.

Those are pretty mature tools.

Like we've been using them for a while, but when training the agent model, it has some new tools, like a computer and also the terminal, bundled in the same container, the same virtual machine as the computer.

So it's actually quite hard to train because we are literally setting up hundreds of thousands of virtual machines at the same time.

And then they all like, you know, visit the Internet.

So it's one of the biggest challenges.

Like we see that actually the training sometimes will fail, but finally, we are very happy that we got this model.

The VMs.

All back to the engineering.

Tell us what's next.

More sources, more tools, better model.

How do you think about it?

Well, I think one thing I like about our agent framing is that you can ask it to do whatever you want and, you know, you can ask it to do like every possible task you can imagine.

It just might not do it well.

As I said, you can tell it, like, go make me money on the Internet.

You can tell it that.

Should we try that right after this?

But yeah, I think it's really a matter of like improving the accuracy, like the performance of tasks of like the whole distribution of tasks that anyone does on a computer.

Right.

Which is a lot of tasks.

And yeah, and it's like an iterative deployment.

We are very excited to see, like, what new capabilities our users will find in our agent, like, you know, the coding ability in Deep Research or the Deep Research ability in Operator.

You were using the agent mode for coding.

Yeah, I use it for coding a lot because I feel it, you know, doesn't always try to rewrite my whole code base.

It actually just makes some small edits, and it also reads the original docs of different functions pretty well.

So I feel it hallucinates less on the function coding.

Oh, interesting.

How do you choose when to go to Codex versus when to go to agent for that?

For the agent, it's more similar to how I use o3.

So it's more like an interactive experience.

For Codex, it's more like, you know, you have some well-defined problem that you want a coworker to solve, and then it will make a PR for you.

But for the agent, it's more like just to give you a function or give you a suggestion.

Cool.

And it can do code search because it can access GitHub through the API connector.

So code search kind of things.

Yeah, it almost feels like, you know, the agent roadmap up until now, you've built the different, almost like appendages of what it would take to have an agent.

And by combining them all, this really is like the first like fully embodied agent on the computer.

I think it's very exciting.

Yeah, I think another area that we're excited to push on is the experience of collaborating with the agent.

I think this model is actually very good at multi-turn conversations and it's very nice to continue working on a task with.

I think that's one of the deficiencies of Deep Research.

A lot of people will do multiple Deep Research requests in a single conversation, but it doesn't always work so well.

So I think we're really happy with this model's multi-turn ability and we just want to improve even further.

And then I also think personalization and memory for agents will also be very important.

And also right now, every agent task is initiated by the user, but in future it should also be doing things for you without you having to even ask in the first place.

Yeah, I'm also pretty excited about the UI and UX surrounding the agent because right now, I think, you know, obviously we're working in a ChatGPT world, like, you start a conversation and it goes.

But you can imagine a lot of different modes of interacting with an agent.

And I'm excited to explore different ways of interacting with the agent.

Do you see this as always being a kind of single omniscient super agent or will there be the financial analyst sub-agent and the personal party planner sub-agents?

Like what's your vision for how that kind of plays out?

I think people have different opinions on this.

I think in the limit, if you could just ask one thing and it can figure out what it needs to do to finish the thing that you want it to do for you, that seems like it would be easiest.

Like if you just had a really amazing chief of staff who knows how to route things correctly and basically can do anything you need, that seems like it would be pretty easy.

I think I agree with that take and like even in some of our trajectories where like, I don't know, you're asking about, I don't know, maybe like a shopping task.

Like sometimes it'll go into terminal and like do some like calculations, you know, budget and like I think the model should be free to use all the tools it wants.

It doesn't need to be a financial analyst to like have the financial analyst like tool set.

Yeah, I feel like when you launch the product, it sometimes makes sense to have some GPTs, like a customized model or customized instructions, to put the model into a specific role.

But in general, when training the model, there is a lot of positive transfer between deep research, computer operation, and also, you know, slide generation; all of these skills are transferable.

So it makes much more sense to just have a single agent as the underlying base model.

Totally.

I guess even though, you know, people do different types of work, fundamentally we're all sending emails, we're making slide decks, we're doing a lot of the same work in front of a computer.

I'd love to understand some of the learnings from the reinforcement learning perspective.

Like it seems like that's the method that seems to really be working for you guys with agents.

Was it like very data intensive to kind of get to this point of having an agent that's so good at, you know, such a wide variety of tasks?

Or like what were some of the learnings from an RL perspective?

Yes.

So, you know, we actually create a very diverse set of tasks, like some tasks to find, you know, a very niche topic or a very niche answer on the internet.

Or some tasks, you know, very similar to deep research, where you need to write a whole full-length article, and also, you know, just all the tasks that we want the model to be good at.

And yeah, so far we think that as long as, you know, you can grade the task, that is, given a result you can judge whether the model's performance is good or not, you can kind of reliably train the model to be even better on that task.

Was there anything special you need to do to make sure it had good turn by turn interaction with users when doing that training?

Or was it just about the type of trajectories you collected?

Yeah, so I think most of the time we focus on end-to-end performance, like, you know, from the way you specify the prompt to how the task gets completed.

And somehow it's very good at working with users.

To your question, the reinforcement learning is very data efficient.

So that means that we're able to curate a much smaller set of very high quality data.

The scale of the data is just so minuscule compared to the scale of pre-training data.

So we're able to teach the model new capabilities by just curating these much smaller high quality data sets.

I will say, to get the Operator piece to work well, you know, before we do RL, the model has to be good enough to have, like, a basic completion of tasks.

And our team has spent a lot of time over the past two, maybe three years getting the model to that point where it's able to actually reason about a page and kind of understand the visual elements really well.

So this model is built on all that as well.

Actually, could you say a little bit more about that?

Because I remember in the early days of OpenAI, this was always part of, like, the World of Bits stuff.

And you're trying to RL the mouse paths and it was just like way too unbounded of a problem.

What's changed now for that to be kind of solvable?

Yeah, that's great that you point out World of Bits.

This project does have a very long lineage dating back to like 2017 or so.

Actually, our codename is, like, World of Bits 2, for the computer use part.

That's awesome.

And yeah, what's changed?

I think essentially the scale of the training has changed.

Like, I don't know the multiplier, but it must be like 100,000x or something in terms of compute and the amount of training data we've used, both in pre-training and RL.

So yeah, I really think it's just scale and the scale catching up to our ambition, I guess.

Wow, scale is all you need.

I believe it.

We had some good data.

Are there particular capabilities or functionality that you're especially excited about in agent mode?

Yeah, so this model is actually pretty good at doing some real research, like data science, and also, you know, summarizing the reports or the findings in a spreadsheet.

So we have some evaluation like, you know, data science bench.

We evaluated the model and it actually outperformed the human baseline.

So in some sense, it's actually superhuman on some research tasks, so we can rely on the model to perform some basic analysis for us.

And this is an area that John Blackman on our team was really pushing on like spreadsheets and data science.

So shout out, John.

Spreadsheets and data science.

You are elevating us out of a job over here.

Elevating us out of a job.

Enhancing.

Another thing I'm excited about is, you know, we released Operator in January and, you know, it was decent at clicking around, but I think we've substantially improved that capability, where it's much more accurate, and just kind of getting the basic things right is what I'm actually excited about.

Where it can like reliably fill out a form and, you know, do those kind of things.

Date picking.

Date picking still needs a bit of work.

For some reason, date picking is just the hardest task.

It's hard for humans too.

Like picking a date in the calendar drop down?

Yes.

Okay, last question.

It seems like you guys have the overall framework and structure in place for something really interesting here.

What's ahead?

Where do you go from here?

I think the thing that we're really excited about is that this tool that we've given the model access to is very general.

It's basically most of what you could do on a computer.

And if you think about all of the tasks that a human can do on a computer, it's very extensive.

And so now we kind of feel like it's a matter of us making the model good at all those tasks too and figuring out a way of training on as diverse a set of tasks as possible with this very general tool.

So I think there's a lot of hard work ahead of us, but we're very excited about it.

I think we're also excited about pushing different forms, ways of interacting with the agent.

I think there'll be a lot of new interaction paradigms between users and these virtual assistants or agents.

So a lot of exciting times ahead.

I can't wait to see it.

Thank you.

Thanks for joining us.

Congratulations on the launch.

Thank you so much.

Thank you for having us.

[MUSIC PLAYING]