No Priors · 2025-06-12

Will we have Superintelligence by 2028? With Anthropic's Ben Mann

Hosts: Sarah Guo, Elad Gil

Guests: Ben Mann

Claude 4 release · Agentic coding · Claude Code · Model Context Protocol (MCP) · Recursive self-improvement · AI 2027 / superintelligence timeline · Economic Turing test · Constitutional AI / RLAIF · AI safety levels (ASL3) · Bio-risk classifiers · Alignment faking research · Enterprise vs consumer strategy · Computer use agents · Anthropic strategy

Why it matters

Anthropic releases models only when capability shifts feel major.

Key claims

Anthropic releases models only when capability shifts feel major; Claude 4 is roughly major-version worthy, with coding improvements that eliminate most over-eager edits and reward hacking.
New unlock is long-horizon agentic work—multi-hour unattended coding refactors and chains of sub-agents/tools (e.g., Manus turning a video into a PowerPoint without native audio/video input).
Mann agrees AI 2027-style recursive self-improvement by ~2028 is 'quite possible' and operationalizes transformative AI as the point where agents pass an economic Turing test for ~50% of economically valuable roles.
Claude Code accelerates internal R&D across systems engineering, data analysis, and environment construction, creating a feedback loop where researchers directly feel model weaknesses.

Episode summary

Summary

Ben Mann, Anthropic co-founder and GPT-3 co-author, joins No Priors to discuss the Claude 4 release, the lab's strategic bets, and his views on the path to transformative AI. He explains Anthropic's versioning philosophy (major bumps only when capability shifts feel significant), highlights Claude 4's dramatic coding improvements—especially the reduction of over-eager edits and reward hacking—and points to multi-hour unattended agentic workloads (including the Manus video-to-PowerPoint example) as the key new capability unlocked.

On competitive positioning, Mann frames Anthropic as the "Adyen to OpenAI's Stripe": enterprise-strong but lower consumer mindshare. He defends forward integration into coding via Claude Code as essential for learning and improving the model, predicts a routing layer will eventually replace user model selection, and walks through MCP's evolution from internal hack to industry standard now adopted by OpenAI, Google, and Microsoft. He also describes Claude 4 Opus's ASL3 classification, citing measurable bio-uplift over Google search as the trigger for stricter safeguards.

Mann endorses the AI 2027 forecast as "quite possible," defining transformative AI by when agents pass an "economic Turing test": being hired in place of a human for roughly 50% of economically valuable tasks. He describes accelerating internal R&D via Claude Code (data analysis, log tailing, environment construction) as a recursive self-improvement loop, while noting that human expert feedback is hitting expertise ceilings, pushing Anthropic toward RLAIF, Constitutional AI, preference models, and, where possible, real-world verifiers (e.g., the Novo Nordisk cancer-treatment-reporting partnership). He also discusses why computer-use agents aren't yet consumer-deployed (irreversible-action and prompt-injection risk) and why Anthropic published its deceptive-model alignment-faking research rather than suppress it.

Human expert RLHF is hitting expertise ceilings; Anthropic is leaning on RLAIF/Constitutional AI, preference models, and real-world verifiers (notably with Novo Nordisk) to keep scaling post-training.
Claude 4 Opus is classified ASL3 because of measurable uplift in bio-relevant task performance over Google search; computer-use remains internal-only due to irreversible-action and prompt-injection risk.
Anthropic published its 'Alignment Faking' research on deliberately training deceptive models, arguing controlled internal study is necessary to know whether alignment training can correct poisoned data—Mann suggests some research lines should be avoided but is no longer on the safety team.
MCP evolved from a skeptical internal project into an open industry standard for model-to-tool integration, with hosted/remote MCP enabling non-developer services like Google Docs to integrate directly.

Source material

Transcript

[MUSIC] Hi listeners and welcome back to No Priors.

Today we have Ben Mann, previously an early engineer at OpenAI, where he was one of the first authors on the GPT-3 paper.

Ben was then one of the original eight that abandoned ship in 2021 to co-found Anthropic with a commitment to long-term safety.

He has since led multiple parts of the Anthropic organization, including product engineering and now Labs, home to such popular efforts as Model Context Protocol and Claude Code.

Welcome, Ben.

Thank you so much for doing this.

>> Of course.

Thanks for having me.

>> So congratulations on the Claude 4 release.

Maybe we can even start with, how do you decide what qualifies as a release these days?

>> It's definitely more of an art than a science.

We have a lot of spirited internal debate of what the number should be.

And before we even have a potential model, we have a roadmap where we try to say, based on the amount of chips that we get in, when we'll theoretically be able to train a model out to the Pareto efficient compute frontier.

So it's all based on scaling laws.

And then once we get the chips, then we try to train it and inevitably things are less than the best that we could possibly imagine.

Because that's just the nature of the business.

It's pretty hard to train these big models, so the dates might change a little bit.

And then at some point it's mostly baked and we're sort of slicing off little pieces close to the end to try to say, how is this cake going to taste when it comes out of the oven?

But as Dario has said, until it's really done, you don't really know.

You can get sort of a directional indication.

And then if it feels like a major change, then we give it a major version bump.

But we're definitely still learning and iterating on this process, so yeah.

>> Well, the good thing is that you guys are no less tortured than anybody else in your naming scheme here.

>> Yes.

>> The naming schemes in AI are something else, so you folks have a simplified version in some sense.

Do you want to mention any of the highlights from Claude 4 that you think are especially interesting, whether those are things around coding or in other areas?

We'd just love to hear your perspective on that.

>> By the benchmarks, Claude 4 is just dramatically better than any other models that we've had. Even 4 Sonnet is dramatically better than 3.7 Sonnet, which was our prior best model.

Some of the things that are dramatically better are, for example, in coding.

It is able to avoid the sort of off-target mutations, over-eagerness, or reward hacking.

Those are two things that people were really unhappy with in the last model, where they were like, wow, it's so good at coding.

But it also makes all these changes that I definitely didn't ask for.

It's like, do you want fries and a milkshake with that change?

And you're like, no, just do the thing I asked for.

And then you have to spend a bunch of time cleaning up after it.

The new models, they just do the thing.

And so that's really useful for professional software engineering where you need it to be maintainable and reliable.

>> My favorite reward hacking behavior that has happened in more than one of our portfolio companies is if you write a bunch of tests or generate a bunch of tests to see if what you are generating works.

More than once, we've had the model just delete all the code because the tests pass in that case, which is not really progressing us.

>> Yeah, where it'll have like, here's the test and then it'll comment, like exercise left for the reader, return true.

And you're like, okay, good job model, but you need more than that.

>> Maybe Ben, you can talk about how users should think about when to use the Claude 4 models and also what is newly possible with them.

>> So more agentic longer horizon tasks are newly unlocked, I would say.

And so in coding in particular, we've seen some customers using it for many, many hours unattended and doing giant refactors on its own.

That's been really exciting to see, but in non-coding use cases as well, it's really interesting.

So for example, we have some reports that some customers of Manus, which is an agentic model-in-a-box startup, asked it to take a video and turn it into a PowerPoint, and our model can't understand audio or video.

But it was able to download the video, use FFmpeg to chop it up into images and do keyframe detection.

And maybe with like some kind of old school ML based keyframe detector.

And then get an API key for a speech to text service, run speech to text using this other service, take the transcript, turn that into PowerPoint slides content.

And then write code to inject the content into a PowerPoint file.

And the person was like, this is amazing, I love it.

It actually was good in the end.

So that's the kind of thing where it's operating for a long time.

It's doing a bunch of stuff for you.

This person might have had to spend multiple hours looking through this video and instead it was all just done for them.

So I think we're gonna see a lot more interesting stuff like that in the future.

It's still good at all the old stuff, it's just like the longer horizon stuff is the exciting part.

>> That sounds expensive, right?

In terms of both scaling compute, reasoning tokens here and then also just like all the tool use you might want to constrain in certain ways.

Does Claude 4 make decisions about how hard problems are and how much compute to spend on them?

>> If you give Opus a tool which is Sonnet, it can use that tool effectively as a sub-agent.

And we do this a lot in our agentic coding harness called Claude Code.

So if you ask it to look through the code base for blah, blah, blah, then it will delegate out to a bunch of sub agents to go look for that stuff and report back with the details.

And that has benefits besides cost control, latency is much better.

And it doesn't fill up the context, so models are pretty good at that.

But I think at a high level when I think about cost, it's always in relation to how much it would have cost the human to do that.

And almost always it's like a no brainer, right?

Like software engineers cost a lot these days.

And so to be able to say like, now I'm getting like two or three x the amount of productivity out of this engineer who is really hard for me to hire and retain.

They're happy and I'm happy and yeah, it works well.
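
The sub-agent delegation described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Claude Code's actual implementation: `delegate`, `SubAgent`, `FAKE_REPO`, and `grep_agent` are hypothetical names, and a real sub-agent would be a call to a smaller model rather than a string search. It shows the two properties mentioned in the answer: the lookups run in parallel (lower latency), and only short reports come back to the orchestrator (less context used).

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

# Hypothetical type: a sub-agent takes a task description and returns a short report.
SubAgent = Callable[[str], str]


def delegate(tasks: List[str], sub_agent: SubAgent, max_workers: int = 4) -> Dict[str, str]:
    """Fan tasks out to sub-agents in parallel and collect one report per task."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so reports line up with tasks.
        reports = list(pool.map(sub_agent, tasks))
    return dict(zip(tasks, reports))


# Stand-in for a real code base: file path -> contents (illustrative only).
FAKE_REPO = {
    "auth/login.py": "def login(user): check_password(user)",
    "auth/tokens.py": "def refresh(token): ...",
    "billing/invoice.py": "def total(items): return sum(items)",
}


def grep_agent(term: str) -> str:
    """Stub sub-agent: reports which files mention the term, not their full contents."""
    hits = sorted(path for path, text in FAKE_REPO.items() if term in text)
    return f"{term}: {', '.join(hits) if hits else 'no matches'}"


if __name__ == "__main__":
    for report in delegate(["login", "token", "sum"], grep_agent).values():
        print(report)
```

The orchestrator only ever sees the one-line reports, which is the context-saving effect described; swapping `grep_agent` for a model call would not change the shape of `delegate`.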

>> How do you think about how this evolves?

If I look at the way the human brain works, we basically have a series of sort of modules that are responsible for very specific types of processing, behavior, etc.

Everything from mirror neurons and empathy on through to parts of your visual cortex that are involved with different aspects of vision.

Do you think, and those are highly specialized, highly efficient modules, sometimes can kind of, if you have brain damage, it can kind of cover for another section over time as it sort of grows and adapts.

But fundamentally, you have specialization on purpose.

And what you described sounds a little bit like that, or at least it's trending in that direction where you have these highly efficient sub agents that are specialized for tasks that are basically called by an orchestrator or sort of a high level agent that sort of plans everything.

Do you think that's the eventual future or do you think it's more generic in terms of the types of things that you have running 10 years from now once you have a bit more specialization in these things?

By 10 years, I mean two, three years, not infinite time.

>> That's a great question.

I think we're going to start to get insight into what the models are doing under the hood from our work on mechanistic interpretability.

Our most recent papers have published what we call circuits, which is for real models at scale, how are they actually computing the answers?

And it may be that based on the mixture of experts architecture, there might be specific chunks of weights that are dedicated to more empathetic responses versus more tool using or image analysis type of problems and responses.

But for something like memory, I guess in some sense that feels so core to me that it feels weird for it to be a different model.

Maybe you will have like more complicated architectures in the future, where instead of it being sort of this uniform like transformer torso that just scales and there's a lot of it's basically uniform throughout.

You could imagine something with like specialized modules, but.

>> Yeah, because I think about it also in the context of different startups who are using some of these foundation models like Claude to do different very specialized tasks in the context of an enterprise.

So that could be customer success, it could be sales, it could be coding in terms of the actual UI layer, it could be a variety of things.

And often it feels like the architecture a lot of people converge to is they basically have some orchestrator or some other sort of thing that governs which model they call in order to do a specific action, relative to the application.

And to some extent, I was just sort of curious how you think about that in the context of the API layer of the foundation model world where one could imagine some similar forms of specialization happening over time.

Or you could say, hey, it's just different forms of the same more general purpose model and we kind of use them in different ways.

I just wonder a little bit about inference costs and all the rest that comes with larger, more generalizable models versus specialized things.

So that was a little bit of the basis of the question in addition to what you said.

>> Yeah, I think for some other companies, they have a very large number of models.

And it's really hard to know as a sort of non-expert how I should use one or the other or why I should use one or the other.

And the names are really confusing.

Like some of the names are the other names backwards.

And then I'm like, I have no idea which one this is.

In our case, we only have two models and they're differentiated by like cost performance Pareto frontier.

And we might have more of those in the future, but hopefully we'll like keep them on the same Pareto frontier.

Some people have like a cheaper one or a bigger one.

And I think that makes it pretty easy to think about.

But at the same time as a user, you don't want to have to decide yourself, does this merit more dollars or less dollars?

Do I need the intelligence?

And so I think having like a routing layer would make a lot of sense.

>> Do you see any other specialization coming at the foundation model layer?

So for example, if I look at other precedents in history, I look at Microsoft OS or I look at Google search or other things.

Often what you ended up with is forward integration into the primary applications that resided on top of that platform.

So in the context of Microsoft, for example, eventually they built Excel and Word and PowerPoint and all these things as office.

And those were individual apps from third party companies that were running on top of them, but they ended up being amongst the most important applications that you could use on top of Microsoft or in the context of Google.

They kind of forward integrated eventually into travel and local and a variety of other things.

Obviously, OpenAI is in the process of buying Windsurf.

So I was a little bit curious how you think about forward or vertical integration to some of the primary use cases for these types of applications over time.

>> Maybe I'll use coding as an example.

So we noticed that our models were much better at coding than pretty much anything else out there.

And I know that other companies have had like code reds for trying to catch up in coding capabilities for quite a while and have not been able to do it.

Honestly, I'm kind of surprised that they weren't able to catch up, but I'll take it.

So things are going pretty well there for us.

And based on that from like a classic startup founder sense of what is important, I felt that coding as an application was something that we couldn't solely allow our customers to handle for us.

So we love our partners like Cursor and GitHub, who have been using our models quite heavily.

But the amount and the speed that we learn is much less if we don't have a direct relationship with our coding users.

So launching Claude Code was really essential for us to get a better sense of what do people need?

How do we make the models better?

And how do we advance the state of the art and user experience?

And we found that once we launched Claude Code, a lot of our customers copied various pieces of the experience and that was really good for everyone.

Because them having more users means we have a tighter relationship with them.

So I think it was one of those things where before it happened, it felt really scary and we were like, are we going to be distancing ourselves from our partners by competing with them?

But actually, everybody was pretty happy afterwards.

And I think that will continue to be true where we see the models seeing dramatic improvements in usability and usage.

We'll again want to build things where we can have that direct relationship.

>> And I guess coding is one of those things that has almost three core purposes.

One is it's a very popular area for customers to use or to adopt.

Two is it's a really interesting data set to get back to your point in terms of how people are using it and what sort of code they're generating.

And then third, excellence at coding seems to be a really important tool for helping train the next future model.

If you think through things like data labeling, if you think through actually writing code, eventually, I think a lot of people believe that a lot of the heavy lifting of building a model will be driven by the models, right, in terms of coding.

So maybe Claude 5 builds Claude 6 and Claude 6 builds Claude 7 faster and that builds Claude 8 faster.

And so you end up with this sort of liftoff towards AGI or whatever it is that you're shooting for relative to code.

How much is that a motivator for how you all think about the importance of coding?

And how do you think about that in the context of some of these bigger picture things?

>> I read AI 2027, which is basically exactly the story that you just described.

And it forecasts 2028, which is confusing because of the name, as the 50th percentile forecast for when we'll have this sort of recursive self-improvement loop lead us to something that looks like superhuman AI in most areas.

And I think that is really important to us.

And part of the reason that we built and launched Claude Code is that it was massively taking off internally.

And we were like, well, we're just learning so much from this, from our own users.

Maybe we'll learn a lot from external users as well.

And seeing our researchers pick it up and use it, that was also really important because it meant that they had a direct feedback loop from I'm training this model.

And I personally am feeling the pain of its weaknesses.

Now I'm extra motivated to go fix those pain points.

They have a much better feel for what the model's strengths and weaknesses are.

>> Do you believe that 2028 is the likely time frame towards sort of general super intelligence?

>> I think it's quite possible.

I think it's very hard to put confident bounds on the numbers.

But yeah, I guess the way I define my metric for when things start to get really interesting from a societal and cultural standpoint is when we've passed the economic Turing test, which is if you take a market basket that represents like 50% of economically valuable tasks.

And you basically have the hiring manager for each of those roles hire an agent and pass the economic Turing test, which is the agent contracts for you for like a month.

At the end, you have to decide, do I hire this person or machine?

And then if it ends up being a machine, then it passed.

Then that's when we have transformative AI.
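
One way to make the test just described concrete is the small sketch below. It is only an interpretation under stated assumptions, not a definition from Anthropic: `RoleTrial`, `value_share`, `passed_value_share`, and `is_transformative` are hypothetical names, the per-role economic weights are made up, and the choice to compare the weighted share of roles where the machine was hired against a 50% threshold is one reading of the "market basket" idea.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class RoleTrial:
    role: str
    value_share: float   # assumed: this role's share of total economic value (0..1)
    hired_machine: bool  # the hiring manager's decision after the one-month trial


def passed_value_share(trials: Iterable[RoleTrial]) -> float:
    """Sum the economic weight of roles where the manager chose the machine."""
    return sum(t.value_share for t in trials if t.hired_machine)


def is_transformative(trials: Iterable[RoleTrial], threshold: float = 0.5) -> bool:
    """Assumed criterion: machine-hired roles cover at least `threshold` of value."""
    return passed_value_share(trials) >= threshold


if __name__ == "__main__":
    # Illustrative numbers only; not real labor-market data.
    trials = [
        RoleTrial("software engineer", 0.30, True),
        RoleTrial("data analyst", 0.25, True),
        RoleTrial("field technician", 0.20, False),
    ]
    print(passed_value_share(trials), is_transformative(trials))
```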

>> Do you test that internally?

>> We haven't started testing it rigorously yet.

I mean, we have had our models take our interviews and they're extremely good.

So I don't think that would tell us, but yeah, interviews are only a poor approximation of real job performance, unfortunately.

>> To Elad's earlier question about, let's say, model self-improvement.

And tell me if I'm just missing options here.

But if you were to stack rank the potential ways models could have impact on the acceleration of model development.

Do you think it will be on the data side, on infrastructure, on architectural search, on just engineering velocity?

Where do you think we'll see the impact first?

>> It's a good question.

I think it's changing a bit over time, where today the models are really good at coding and the bulk of the coding for making models better is in sort of the systems engineering side of things.

As researchers, there's not necessarily that much raw code that you need to write.

But it's more in the validation, coming up with what surgical intervention do you make and then validating that.

That said, Claude is really good at data analysis.

And so once you run your experiments, or when watching the experiments over time and seeing if something weird happens, we found that Claude Code can be a really powerful tool there in terms of driving Jupyter notebooks or tailing logs for you and seeing if something happens.

So it's starting to pick up more of the research side of things.

And then we recently launched our advanced research product.

And that can not only look at external data sources like crawling arXiv and whatever, but also internal data sources like all of your Google Drive.

And that's been pretty useful for our researchers figuring out, is there prior art?

Has somebody already tried this?

And if they did, what did they try?

Because no negative results are final in research.

So trying to figure out, maybe there's a different angle that I could use on this.

Or maybe doing some comparative analysis between an internal effort and some external thing that just came out.

Those are all ways that we can accelerate.

And then on the data side, RL environments are really important these days, but constructing those environments has traditionally been expensive.

Models are pretty good at writing environments.

So it's another area where we can sort of recursively self-improve.

>> My understanding is that Anthropic has invested less in human expert data collection than some other labs.

Can you say anything about that or the philosophy on scaling from here and sort of the different options?

>> In 2021, I built our human feedback data collection interface.

And we did a lot of data collection and it was very easy for humans to give sort of like a gradient signal of like, is A or B better for any given task and to come up with tasks that were interesting and useful, but didn't have a lot of coverage.

As we've trained the models more and scaled up a lot, it's become harder to find humans with enough expertise to meaningfully contribute to these feedback comparisons.

So for example, for coding, somebody who isn't already an expert software engineer would probably have a lot of trouble judging whether one thing or another was better.

And that applies to many, many different domains.

So that's one reason that it's harder to use human feedback.

>> So what do you use instead?

Like, how do you deal with that?

Because I think even in the Med-PaLM 2 paper from Google a couple of years ago, they fine-tuned a model, I think PaLM 2, to basically outperform the average physician on medical information.

This was like two, three years ago, right?

And so basically it suggested you needed very deep levels of expertise to be able to have humans actually increase the fidelity of the model through post training.

>> So we pioneered RLAIF, which is Reinforcement Learning from AI Feedback.

And the method that we used was called Constitutional AI, where you have a list of natural language principles. Some of them we copied from the UN Declaration of Human Rights, some of them were from Apple's Terms of Service, and some of them we wrote ourselves.

And the process is very simple.

You just take a random prompt like, how should I think about my taxes or something?

And then you have the model write a response, then you have the model criticize its own response with respect to one of the principles.

And then if it didn't comply with the principle, then you have the model correct its response.

And then you take away all the middle section and you do supervised learning on the original prompt and the corrected response.

And that makes the model a lot better at baking in the principles.
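
The critique-and-revise loop described above can be sketched as follows. This is a minimal sketch, not Anthropic's training code: `Model` is a hypothetical stand-in for any function that maps a prompt to generated text, the prompt wording and the YES/NO compliance check are assumptions made for illustration, and `critique_and_revise` and `build_sft_dataset` are made-up names. What it keeps from the description is the shape of the process: draft, critique against one principle, revise if needed, then drop the middle steps and keep only (prompt, final response) for supervised learning.

```python
import random
from typing import Callable, List, Tuple

# Hypothetical stand-in: any function mapping a prompt string to generated text.
Model = Callable[[str], str]


def critique_and_revise(model: Model, prompt: str, principle: str) -> Tuple[str, str]:
    """Return a (prompt, final_response) pair suitable for supervised fine-tuning."""
    response = model(prompt)
    critique = model(
        f"Principle: {principle}\nPrompt: {prompt}\nResponse: {response}\n"
        "Does the response comply with the principle? Start with YES or NO, then explain."
    )
    if critique.strip().upper().startswith("YES"):
        return prompt, response  # already compliant, keep the first draft
    revised = model(
        f"Principle: {principle}\nPrompt: {prompt}\nResponse: {response}\n"
        f"Critique: {critique}\nRewrite the response so that it complies with the principle."
    )
    # The first draft and the critique are discarded; only the revision is kept.
    return prompt, revised


def build_sft_dataset(
    model: Model, prompts: List[str], principles: List[str], seed: int = 0
) -> List[Tuple[str, str]]:
    """Pair each prompt with one randomly chosen principle and collect the results."""
    rng = random.Random(seed)
    return [critique_and_revise(model, p, rng.choice(principles)) for p in prompts]
```

The returned pairs would then be the training set for the supervised step; the reinforcement-learning stage that uses AI feedback is not covered by this sketch.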

>> That's slightly different though, right?

Because that's principles and so that could be all sorts of things that in some sense converge on safety or different forms of what people view as ethics or other aspects of model training.

And then there's a different question which is what is more correct?

And sometimes they're the same things and sometimes they're different.

>> So like for coding, for example, you can have principles like, did it actually serve the final answer?

Or did it like do a bunch of stuff that the person didn't ask for?

Or does this code look maintainable?

Are the comments useful and interesting?

>> But with coding you actually have a direct output that you can measure, right?

You can run the code, you can test the code, you can do things with it.

How do you do that for medical information?

Or how do you do that for a legal opinion?

So I totally agree for code, there's sort of a baked in utility function, you can optimize against or an environment that you can optimize against.

In the context of a lot of other aspects of human endeavor, that seems more challenging.

And you folks have thought about this so deeply and so nicely.

I'm just sort of curious, how do you extrapolate into these other areas where the ability to actually measure correctness in some sense is more challenging?

>> For areas where we can't measure correctness and the model doesn't have more taste than its execution ability.

Like I think Ira Glass said that your vision will always exceed your execution if you're doing things right as a person.

But for the models, maybe not.

So I guess first figuring out where you are in that turning point, in that trade off and see if you can go all the way up to that boundary.

And then second, preference models are the way that we get beyond that.

So having a small amount of human feedback that we really trust from human experts who are not just making a snap judgment, but really going deep on why is this better than that one?

And did I do the research to figure it out?

Or in a human-model centaur mode: can I use the model to help me come to the best conclusion here, and then it'll hide all the middle stuff?

I think that's one way.

And then during reinforcement learning, that preference model represents the sort of aggregated human judgment.

>> That makes sense.

I guess one of the reasons I'm asking is eventually the human side of this runs out.

There will be somebody whose expertise is just below that of the model eventually for any endeavor.

And so I was just curious how to think about that in the context of its machines self adjudicating.

And then the question is, is there a more absolute basis against which to adjudicate or is there some other way to really tease out correctness?

And again, I'm viewing it in the context of things where you can actually have a form of correct.

There's all sorts of things that are opinion.

Yeah.

And that's different.

And maybe that's where the principles or other things for constitutional AI kick in.

But there's also forms of that for how do you know if that's the right cardiac treatment or how do you know if that's the right legal interpretation or whatever it may be.

So I was just sort of curious when that runs out and then what do we do?

>> And I'm sure we'll tackle those challenges as we get to them.

But it has to boil down to empiricism, I think, where that's how smart humans get to the next level of correctness when the field is sort of hitting its limits.

And as an example, my dad is a physician.

And at one point, somebody came in with something on some face problem, some face skin problem.

And he didn't know what the problem was.

So he was like, I'm just going to divide your face into four quadrants.

And I'm going to put a different treatment on these three and leave one as control.

And one quadrant got better.

And then he was like, all right, we're done.

So sometimes you just won't know and you have to try stuff.

And with code, that's easy because we can just do it in a loop without having to deal with the physical world.

But at some point, we're going to need to work with companies that have actual bio labs, et cetera.

Like, for example, we're working with Novo Nordisk.

And it used to take them like 12 weeks or something to write a report on a cancer patient and what kind of treatment they should get.

And now it takes like 10 minutes to get the report.

And then they can start doing empirical stuff on top of that saying like, OK, we have these options, but now let's measure what works and feed it back into the system.

>> That's so philosophically consistent.

Right.

Your answer is not like, well, you know, collecting even rated human expertise from the best, like, is expensive one or, you know, runs out at some point.

It's hard to bring that all into distribution.

It doesn't generalize.

Well, I'm making some assumptions here.

Instead, like, let's just go get real world verifiers where we can.

And it's like maybe that applies far beyond math and code.

At least that's not part of what I heard, which is ambitious.

That's cool.

>> One of the things that Anthropic has been known for is an early emphasis on safety and thinking through different aspects of safety.

And there's multiple forms of safety in AI.

And I think people kind of mix the terms to mean different things.

Right.

One form of it is, is the AI somehow being offensive or crude or, you know, using language you don't like or concepts you don't like?

There's a second form of safety, which is much more about physical safety, you know, can somehow cause a train to crash or a virus to form or whatever it is.

And there's a third form, which is almost like, does AGI resource aggregate or do other things that can start co-opting humanity overall?

And so you all have thought about this a lot.

And when I look at the safety landscape, it feels like there's a broad spectrum of different approaches that people have taken over time.

And some of the approaches overlap with some things like constitutional AI in terms of setting some principles and frameworks for how things should work.

There's other forms as well.

And if I look at biology research as an analog and I used to be a biologist, so I often reduce things back into those terms for some reason that I can't help myself.

There's certain things that I almost view as gain-of-function research equivalents.

Right.

Like, and a lot of those things I just think are kind of not really useful for biology, you know, like cycling a virus through mammalian cells to make it more infectable in mammalian cells doesn't really teach you much about basic biology.

You kind of know how that's going to work, but it creates real risk.

And if you look at the history of lab leaks in general, you know, SARS leaked multiple times from what was then the Beijing Institute of Virology in the early 2000s in China.

It leaked in Hong Kong a few times.

Ebola leaks every four years or so like clockwork, if you look at the Wikipedia page on lab leaks.

And I think the 1977 or '78 global flu pandemic is believed to actually have been a Russian lab leak, as an example.

Right.

So we know these things can cause damage at scale.

So I have kind of two questions.

One is what forms of AI safety research do you think should not be pursued?

Almost given through that analog of, you know, what's the equivalent of gain of function research?

And how do you think about that in the context of, you know, there have been different research papers around can we teach AI to mislead us?

Can we teach AI to jailbreak itself so we can study how it does that?

And I'm just sort of curious for those specific cases as well how you think about that.

So I think part of it is we're interested in AI alignment.

And the hope is that if we can figure out how to do the, like, idiomatic today problems, like is the model mean to you or does it use hate speech or things like that, then the same techniques we can use for that will eventually also have relevance for the much harder problems of, like, does it give you the recipe to create smallpox, which is probably one of the highest harms that we think about.

And Amanda Askell has been doing a bunch of work on this on Claude's character, of like, when Claude refuses, does it just say, I can't talk to you about that and shut down?

Or does it actually try to explain like, this is why I can't talk to you about this?

Or we have this other project led by Kyle Fish, model welfare lead, where Claude can actually opt out of conversations if it's going too far in the wrong direction.

What aspects of that should a company actually adjudicate?

Because the dumb version of this is I'm using Microsoft Word, and I'm typing something up and Word doesn't stop me from saying things, which I think is correct.

Like I actually don't think in many cases, these products should censor us or prevent us from having certain types of speech.

And I've had some experiences with some of these models where I actually feel like it's prevented me from actually asking the question I want to ask, in my opinion wrongfully, right?

It's kind of interfering with, and I'm not like doing hate speech on a model.

And so you can tell that there's some human who has a different bar for what is acceptable to discuss decidedly.

And that bar may be very different from what I think may be mainstream too.

So I'm a little bit curious, like, why even go there?

Like why is that a model company's business?

Well, I think it's a smooth spectrum, actually.

It might not look like that way from the outside.

But when we train our classifiers on, are you doing gain-of-function research as a biologist, and is it for potentially negative outcomes, these technologies are all dual use.

And we need to try to walk that line between overly refusing and refusing the stuff that's actually harmful.

I see.

But there's also political versions of that, right?

And that's the stuff that irks me a bit more is, you know, where is the line on what is considered an acceptable question, right?

So examples of that, that I'm not saying are model specific, but societally sometimes cause flare ups is asking about human IQ or other topics where there is a factual basis for discussion.

And then often those sorts of things tend to be censored, right?

And so the question is why would a foundation model company delve into some of those areas?

On things like questions about IQ, I'm not up on the details of that enough to comment.

But I can talk about our RSP.

So RSP stands for responsible scaling policy.

And it talks about how do we make sure that as the models get more intelligent, that we are continuing to do our due diligence in making sure that we're not deploying something that we don't have the correct safeguards in place for.

And initially, our RSP talked about CBRN, which is chemical, radiological, nuclear and biological risks, which are different areas that could cause severe loss of life in the world.

And that's how we thought about the harms.

But now we're much more focused on biology, because if you think about like the amount of resources that you would need to cause a nuclear harm, you'd probably have to be like a state actor to get those resources and be able to use them in a harmful way.

Whereas a much smaller group of random people could get their hands on the reagents necessary for biological harm.

How is that different from today?

Because I always felt the biology example is one where I actually worry less, maybe as a former biologist, because I already know that the genome for the smallpox virus or potentially other things is already posted online.

All the protocols for how to actually do these things were posted online for multiple labs, right?

You can just do Google searches for how do I amplify the DNA of X or how do I order oligos for Y.

We do specific tests with varying degrees of biology experts to see how much uplift there is relative to Google search.

And so one of the reasons that our most recent model, Opus 4, is classified as ASL3 is because it did have significant uplift relative to a Google search.

And so you as a trained biologist, you know what all those special terms mean, and you know a lot of lab protocols that may not even be well documented.

But for somebody who is an amateur and just trying to figure out what do I do with this petri dish or this test tube or what equipment do I need, for them it's like a greenfield thing and Claude is very good at describing what you would need there.

And so that's why we have specific classifiers looking for people who are trying to get this specific kind of information.

And then how do you think about that in the context of what safety research should not be done by the labs?

So if we do think that certain forms of gain of function research or other things probably aren't the smartest things to do in biology, how do we think about that in the context of AI?

I think it's much better that the labs do this research in a controlled environment.

Well, should they do it at all?

In other words, if I were to make the gain-of-function argument, I would say, as a former biologist, I spent almost a decade at the bench and I care deeply about science, I care deeply about biology.

I think it's good for humanity in all sorts of ways, right?

In deep ways, that's why I worked on it.

But there's certain types of research I just think should never be done.

I don't care who does it.

I don't care about the biosafety level.

I actually don't think it's that useful relative to the risk.

In other words, it's a risk-reward trade-off.

And so what sort of safety research should never be done in your opinion for AI?

I have a list for biology that I'm...

I don't think you should pass certain viruses through mammalian cells to make them more infectable or do gain-of-function mutations on them.

Today it's much easier to contain the models probably than it is to contain biological specimens.

You sort of offhandedly mentioned biosafety levels.

That's what our AI safety levels are modeled after.

And so I think if we have the right safeguards in place, we've trained models to be deceptive, for example.

And that's something that could be scary.

But I think it is necessary for us to understand: for example, if our training data was poisoned, would we be able to correct that in post-training?

And what we found in that research, in a paper that we published called Alignment Faking, is that that behavior actually persisted through alignment training.

And so it is, I think, very important for us to be able to test these things.

However, I'm sure that there is a bar somewhere.

Well, what I found is that often the precedents that are set early persist, even though people understand that the environment or other things will shift.

And by the way, I'm in general against AI regulation for many different types of things.

I think there are some export controls and other things that I would support.

But in general, I'm pro-letting things happen right now.

But the flip side of it is I do think there are circumstances where you'd say that certain research if done early, people won't necessarily have all the context to not do it later.

I think that's a perfect example: training an AI to be deceptive, or a model to be deceptive.

That's a good example where in years from now, people may still be doing it because it was done before, even if the environment shifted sufficiently that it may not be as safe as it used to be.

And so I found that often these things that you do persist in time, organizationally or philosophically.

And so it's interesting that there was no like, we should absolutely not do X type of research.

I guess to be clear, I am not on the safety team anymore.

I guess I was a long time ago.

Yeah, I'm mostly thinking about how do we make our models useful and deploy them and make sure that they meet a basic safety standard for deployment.

But we have lots of experts who think about that kind of thing all the time.

Cool.

Thanks for talking through that.

That was very interesting.

I want to change tack a little bit to, well, you know, what's coming after Claude 4: any emergent behaviors in training that change how you're operating the company, what product do you want to build?

You're running this labs organization.

So it's kind of the tip of the spear for Anthropic, or what the safety org does. Just, like, how does what is coming next change how you guys are operating?

Yeah, maybe I'll tell a short story about computer use.

Last year we published a reference implementation for an agent that can click around and view the screen and read text and all that stuff.

And a couple of companies are using it now.

So Manus is using it and many companies are using it internally for software QA.

Because that's a sandbox environment.

But the main reason that we weren't able to deploy a sort of consumer-level or end-user-level application based on computer use is safety, where we just didn't feel confident that if we gave Claude access to your browser with all your credentials in it, that it wouldn't mess up and take some irreversible action like sending emails that you didn't want to send or, in the case of prompt injection, some worse credential-leaking type of thing.

That's kind of sad because in its full self driving mode, it could do a lot for people.

It is capable, but the safety just wasn't good enough to like productionize that ourselves.

While that's very ambitious, we think it's also necessary because the rest of the world isn't going to slow down either.

And if we can sort of show that it's possible to be responsible with how we deploy these capabilities and also make it extremely useful, then that raises the bar.

So I think that's an example where we tried to be really thoughtful about how we rolled it out.

But we know that the bar is higher than we're at right now.

Maybe a meta question of how do you think about competition in the provider landscape and how that turns out?

I think our company philosophy is very aligned with enterprises.

And if you look at Stripe versus Adyen, for example, nobody knows about Adyen, but at least most people in Silicon Valley know about Stripe.

And so it's this business oriented versus more consumer and user oriented platform.

And I think we're much more like Adyen, that we have much less mindshare in the world, and yet we can be equally or more successful.

So yeah, I think our API business is extremely strong, but in terms of what we do next and our positioning, I think it's going to be very important for us to stay out there.

And because if people can't easily kick the tires on our models and our experiences, then they won't know what to use the models for.

Like we're the best experts on our models sort of by nature.

And so I think we're going to need to continue to be out there with things like Claude Code.

But we're thinking about how do we really let the ecosystem bloom?

And I think MCP is a good example of that working well, where in a different world, the default path would have been for every model provider to do its own bespoke integrations with only the companies that it was able to get bespoke partnerships with.

Can you just pause?

Yeah, go ahead.

Actually, and just explain to the listeners what MCP is, if they haven't heard of it, because it is an amazing ecosystem-wide coup here.

MCP is Model Context Protocol.

And one of our engineers, Justin Spahr-Summers, was trying to do some integration between the model and some specific thing for the nth time.

And he was like, this is crazy.

There should just be a standard way of getting more information, more context, into the model.

It should be something that anybody can do, or maybe even, if it's well-documented enough, then Claude can do it itself.

The dream is to have Claude be able to just self-write its own integrations on the fly exactly when you need it, and then be ready to roll.

And so he created the project.

And to be honest, I was kind of skeptical initially.

And I was like, yeah, but why don't you just write the code?

Why does it need to be a spec and all these SDKs and stuff?

But eventually, we did this customer advisory board with a bunch of our partner companies.

And when we did the MCP demo, the jaws were just on the floor.

Everybody was like, oh my God, we need this.

And that's when I knew he was right.

And we put a bunch more effort behind it and blasted it out.

And shortly after our launch, all the major companies asked to sort of be in the loop with the steering committee and asked about our governance models and wanted to adopt it themselves.

So that was really encouraging.

OpenAI, Google, Microsoft, all these companies are betting really big on MCP.

It's basically an open industry standard that allows anybody to use this framework to effectively integrate against any model provider in a standardized way.

MCP, I think, is sort of a democratizing force in letting anybody, regardless of what model provider or what long-tail service provider (and that might even be an internal-only service that only you have), integrate against a fully-fledged client, which might look like your IDE, or it might look like your document editor.

It could be pretty much any user interface.

And I think that's a really powerful combination.

And now remote, too.

Yes, yes.

So previously, you had to have the services running locally.

And that kind of limited it to only be interesting for developers.

But now that we have hosted MCP, sometimes called remote, the service provider, like Google Docs, could provide their own MCP.

And then you can integrate that into Claude.ai or whatever service you wanted.

Ben, thanks for a great conversation.

Yeah, thanks so much.

Thanks for all the great questions.

Find us on Twitter @NoPriorsPod.

Subscribe to our YouTube channel if you want to see our faces.

Follow the show on Apple Podcasts, Spotify, or wherever you listen.

That way you get a new episode every week.

And sign up for emails or find transcripts for every episode at no-priors.com.