The Cognitive Revolution · 2026-06-13

AI in the AM Week 2 Highlights: Fable Launch & Alignment Theory

Hosts: Nathan

Guests: Jeffrey Irving, Daniel Murphy, Rahul Sunwalkar, Shlok Khemani, Tom Agrath, Andrew Moore, prinz

Anthropic, Fable, AI alignment, Recursive self-improvement, Hybrid authorship, AI safety theory, Interpretability, AI policy

Summary

The episode covers the launch and early user experiences of Anthropic's Fable model, highlighting its cautious gating on sensitive tasks and impressive autonomous decision-making in complex workflows. Discussions include the challenges of AI alignment, with Jeffrey Irving and Daniel Murphy announcing Sequent, a new organization focused on theoretical guarantees for AI safety amid accelerating capabilities. The episode also explores hybrid authorship with AI, economic incentives in AI usage, and the concentration of power in frontier AI labs. Key policy and interpretability issues are raised, including Anthropic's response to silent refusals and the need for better oversight of internal model deployments.

Anthropic's Fable model launch shows strong coding and reasoning capabilities but is heavily gated on production and sensitive tasks, often falling back to older models when restricted.
Fable demonstrates high agency in vague tasks, autonomously sourcing satellite and elevation data to build accurate 3D environments, exceeding expectations.
Sequent, a new organization by Jeffrey Irving and Daniel Murphy, aims to tackle AI alignment through rigorous theory and formal guarantees, emphasizing the urgency given timelines of 2-4 years to potential superintelligence.
Hybrid authorship with AI is emerging as a new workflow paradigm, where humans increasingly accept AI-generated drafts and outlines, enhancing productivity and task imagination.
Economic incentives in AI usage currently encourage token consumption, with token anxiety limiting exploration of AI capabilities; competition among coding models may alleviate this.
Interpretability advances include tools that analyze training data through the model's perspective to identify misalignment and safety issues linked to specific data points.
Anthropic reversed its policy on silent refusals after community backlash, highlighting tensions between safety gating and user transparency.
Concerns about concentration of power in a few frontier AI labs persist, with staggered access to top-tier models based on user type and payment, raising questions about democratization and control.

Transcript

"We could be in a benevolent basin, but I would like to know that, rather than just hope that." That's Daniel Murphy, and that one sentence is the week in miniature. This was Fable launch week, and Anthropic's new frontier model arrived, booked Thursday's show by itself, took over my Twitter account, and settled at least one argument: AI is not slowing down. This is the AI in the AM weekly highlights, the moments from three live mornings this week that I most want the people closest to this technology to have. Quick context: this is still an experiment. We're live most weekday mornings, through June at least, from a studio perkoch vibe-coded himself, and we publish the skills and artifacts behind the show as they mature. If this cut earns your time, or wastes it, tell us. The feedback is the product right now. First, the launch as we actually lived it. Wednesday morning, day one of Fable in real workflows, and perkoch came in with a field report you will not find in the model card.
The Cognitive Revolution is brought to you by Mercury, the fintech that more than 300,000 ambitious companies and individuals trust to run their finances. Over the last few months, I have made tremendous strides with my personal AI infrastructure. Today, I've got high-context instances of both Claude Code and OpenClaw running on a Mac mini, and it's amazing what they can do. However, until getting started with Mercury, I didn't have a great way for them to pay for things. I didn't want to give them unrestricted access to my money, but my old bank didn't give me any other options. With Mercury, I can create as many virtual cards as I want, each with its own daily, weekly, or monthly spending limit, and I can lock any card to a single category of purchase, or even a single merchant.
Now, I have a card that my agent can use to buy our family's groceries, and only our groceries, and I can create another anytime I want to give an agent a random one-off project that might require making a purchase. This is honestly just the start of Mercury's AI-friendly offerings. Does your bank offer API keys, an MCP, or a CLI tool? If not, check out Mercury at Mercury.com. Mercury is a fintech company, not an FDIC-insured bank. Banking services provided through Choice Financial Group and Column N.A., Members FDIC. Thank you to Mercury for supporting The Cognitive Revolution, and now on with the show.
So one thing to note about the nerfing. What has happened with Fable is we have a lot of rejections, and whenever Fable decides to reject you, it drops from Fable to Opus 4.8. So there's a natural downgrade. In experiments overnight, I tried to make a number of bug fixes on this very studio app, and what I found was Fable would consistently drop to Opus 4.8 whenever it was asked to do anything in production. So touching the production database, touching any of the security keys, asking it to review production directly. I've had three or four times it's dropped out. Every time it's dropped out, I've basically restarted the conversation, added back the context that we were using, but excluding the parts about going into production or, you know, addressing the production database, and it has continued to work. So I think there are a number of triggers there. Online, people are saying, hey, it's not going to do machine learning research for me. I think that's just the tip of the iceberg. You're seeing that because the people who are testing it intensively right now are machine learning researchers. If you were, I think, to test it on finance or your budgeting process, and you told it you're going to be directly addressing my QuickBooks or Salesforce, I think you might see similar results.
Fable right now, I would say, is a research release, almost a preview. It is there so that, I think, they can judge the demand, because they don't have a sense of how intense the demand is going to be, and they're going to judge whether or not it's safe to release. They've started off with the most constrained version of it, with the least number of functions open. And I think over the next few weeks, they will start to take away some of those gates. And as they take away some of those gates, I think we will see both the increase in usage and, you know, some decisions on what needs to be really gated and what doesn't. So I think we are in the early stages of exploring what Fable can do.
Same gating, seen from the other side of the API. Rahul Sunwalkar runs Julius: agentic data analysis, raw API, no consumer harness. He compared notes with perkoch, live.
Just a segue here. So we've been using Fable. We've come across people posting about rejections. In my tests, almost consistently, whenever I try to address the production database or the production site, Fable would drop off to Opus 4.8. I believe your users in Julius are using Fable through the API. And you are also very heavy on data science users. And one of the aspects of, you know, work that Fable is banned from doing is machine learning work. So how have you seen the rejection rate on your platform? Does the API work the same way, in the sense that it drops off to Opus 4.8 and then it gives you a rejection message? How does that work?
Yeah. So we have seen failure rates on tasks that involve really advanced coding. That involves, you know, use scikit-learn to, like, train this model. But we have also seen failures on other kinds of data tasks. For example, like, hey, I want to start a landscaping business, and can you help prospect leads for me?
We will see failure rates where it will trigger safety filters, for things like: you're prospecting leads for a landscaping business and the AI says, oh, this is personal data, even though it's probably available on the internet. You know, let's say Procache has a Procache Landscaping in Philly, and that's the kind of information. It is kind of borderline personal data, even though it's available on the internet. So that's what we have seen. I believe it doesn't fall back to Opus; it's just like a failure in the API.
Interesting. So the fallback to Opus is a harness thing on the Claude, you know, front end. That's interesting to hear.
And for what this model does when nobody is steering: Thursday, we had Shlok Khemani, who gave Fable one vague instruction: rebuild Yosemite as a navigable 3D world. Listen for the decisions nobody asked it to make.
But what Fable ended up doing was finding satellite images for this area. And that's how you get these colors and that's how you get the textures. But then to make it to scale and to make it accurate, it actually fetched elevation data from NASA, right? And it combined those two to make this to scale. And that is what blew my mind, right, because usually when you're vibe coding stuff, you give an end objective. And this objective is vague. And there are 100 steps in the middle where humans would take decisions differently, and usually vibe coding doesn't work out very well, because the quality of the decisions the models make isn't always great. But Fable made such high quality decisions that it eventually ended up creating something that exceeded the expectations of what was initially a very vague objective, and did so in really smart ways. So I'll give you another example, right. So you see all of these trees. And V1 of this project did not have any trees.
And I was like, hey, I think we're missing some trees here. I would love to add them. And I would have been completely okay with it just randomly creating these trees. But what it actually did was it analyzed the pixels on the satellite images. It found out the ones that could potentially have trees, so the ones that were green maybe, and added trees only on those spots. But it didn't stop there, right. It realized that, because it was analyzing pixels, some of those pixels were white. So you can see that there is snow in the mountains far ahead. And it also added snow. So just exceeding expectations in these small and subtle ways, making really smart decisions. It's like having a really, really smart employee, but with extremely high agency, who blows your mind every single day.
Friday morning brought the week's cleanest empirical result on the recursive question. And it's not from a lab.
Here's one of the things I'll just touch on real briefly. This is Thoughtful. This is a company started in part by a woman named Karina Wen, who used to be at Anthropic. Then she was at OpenAI, and is now doing this. And this is maybe one of the more telling, you know, it's kind of vibes, it's kind of quantitative. It's a very idiosyncratic task. But it's also a very relevant task to the future. Can you get your top model to train a small model effectively to do a job for you? And this is something that, as you can just see with these bar graphs here, the particular frogs game thing, it's kind of like a Sudoku-type puzzle that they're training a small model to do. The big models can often just solve it, but the small models can't. So the challenge for the big models is: can you train the small model to solve it? And this involves all these little, you know, not tips, but tricks and know-how and kind of, you know, hard lessons learned by post-trainers who've been in the trenches doing this.
The models up until Fable basically didn't really move the needle on what the small models could do. They basically just couldn't do this sort of post-training effectively. But here we see more than 10x improvement on small models' ability to do these tasks. And again, I think this is one way that it could be really good, right? If you had very narrow, very small, very role-specific specialist models in all these different niches, that could be a great world, right? That gives us a lot of abundance in a very affordable way. You know, this little small model that got post-trained to play the frog game, it's not going to go out of control, right? It is small. It can only do probably the frog game at the end of this training. But building out a world where we have these little role-specific AIs doing their jobs, doing it really well, I think that creates a much more buffered environment that's probably a lot more resilient to another generation of AI that's just amazing at everything coming in and, you know, kind of shocking the system in such a profound way. So I think this goes to show, again, just wow, what capabilities we have that we have not absorbed, and gives us a little bit of a foreshadowing of what a world of tons of small but highly performant AIs could look like in all these different little niches, and how we get there, right? There are not enough human post-trainers, but now we have Fable to do the post-training. So watch that space.
Now let me introduce a voice you'll hear a few times this episode: prinz, an anonymous practicing lawyer who built Prinz Bench, a legal reasoning benchmark the labs themselves watch. He guards the anonymity, so you'll hear him and you won't see him. On Friday, he gave us a close reading of Anthropic's own launch documents that I haven't seen anyone else do.
So give us some alpha that you have picked up this week. This could be from your own testing.
It could be from the system card, where I know you're often a close reader. So we're looking for the deep cuts, the things that you think even the AI-obsessed have overlooked or not fully appreciated yet.
To me, the most interesting thing is in the release of Fable 5 and Mythos 5. The models are obviously incredible. You're seeing a lot of great examples of coding on your timeline. I don't think it's a surprise to anyone. The really interesting thing to me has been the way Anthropic has presented them in the accompanying documents. There is a lot of discussion about the differences between engineering and research. When you think about it, it all makes sense. I think everyone except for Elon Musk knows the differentiation between engineering and research. But Anthropic has made it really explicit, and in that blog post, "When AI Builds Itself," if you look at it carefully, they talk a lot about how Mythos is this incredible engine for accelerating engineering, how it lets the engineering staff at Anthropic write code so much quicker, and it's really great code, et cetera. But then they say, and this is from the system card now, "The acceleration is concentrated in engineering execution rather than research judgment." It feels like they spent all this time trying to find signs of life in the Mythos model. Can it really do novel research? Is it able to finally give us some novel insights?
Looking for signs of life. Is it really able to give novel insights?
Exactly. Exactly. And it seems, from all the disclosures, the answer is thus far no. There are a couple of examples in the blog post for the release of Fable which Anthropic calls novel.
Like novel drug discovery and a novel hypothesis in molecular biology. But if you dig into it, one of the examples was, you know, "we outperformed a recent model published in the journal Science, despite the model trained by Mythos being 100 times smaller," which sounds really cool, but it turns out that the model they outperformed was a 500 million parameter model that was trained, it seems, before April 2025, and not by a frontier lab. So it's incredible that Mythos was able to train a smaller model to outperform this older model, but it doesn't seem like, you know, this is not the unit distance problem. It's a nice little thing that the model did. And so this is the area that I'm really interested in, because I think once Anthropic and OpenAI really start seeing signs that these models have become good at research, that's when we're really, really, really close to actual RSI, which to me is the thing that's happening in AI right now.
Now, the takeover. Rewind to Wednesday morning, before Fable booked anything, when I explained why I was about to hand a frontier model my own Twitter account.
I don't know if I've even talked to you about this, but I'm doing a Fable takeover of my Twitter account today. I figured, you know, let's live in the future a little bit and get that run in this morning, you know, and make good on, like I've said many times, right, I know I'm winning with AI if I can spend more time outside, get more exercise, you know, invest in my health, and have the AI, you know, keep me on the rails at the same time. So to just kind of explore that in a way where I think it suddenly is, like, probably going to tweet just about as well as I'm going to, you know, when it comes to putting things out for today's show. I gave it a total green light.
It was able to schedule its own stuff, you know, find the tags for...
Hey, we'll continue our interview in a moment after a word from our sponsors. Today's episode is brought to you by Anthropic, makers of Claude and Claude Code. Over the last few months, Claude has helped me build and refine a personal deep context database that now contains all of my emails, Slack messages, tweets, DMs across platforms, video calls, and podcast transcripts going back a full five years. On top of that, we've now layered summary articles describing my relationship with hundreds of contacts, organizations, and ideas. And now that this exists, there's almost nothing that Claude can't help with. For tax season, I asked Claude to help me get organized. It went through my inbox, tracked down 1099s for all 10 of my part-time jobs, and built me a comprehensive report on my expenses and donations. For my angel investing, Claude can now draft investment memos in exactly the form that my venture fund requires, based on the calls I've had and the emails I've exchanged with the founders. And when someone needs a favor, Claude can often do it as well as I can. Recently, a friend reached out to ask if I know anyone who might be a fit for a role that he is currently hiring for. Initially, nobody came to mind, but then I thought to ask Claude, and sure enough, it identified two great leads. Claude is the AI for minds that don't stop at good enough. It's the collaborator that actually understands your entire workflow and thinks with you. Whether you're debugging code at midnight or strategizing your next business move, Claude extends your thinking to tackle the problems that matter. So, for problems worth solving, get started with Claude at Claude.ai slash TCR. That's Claude.ai slash TCR, and check out Claude Pro, which includes all of the features mentioned in today's episode. Once more, that's Claude.ai slash TCR.
Fable, will it make a mistake? I bet there'll be a mistake in there.
I usually make at least one, you know, in a handful of tweets anyway. I decided to flip this switch, not that I think I'm going to keep it that way. I don't think that I'm going to give up my Twitter account forever, but it was kind of a sort of exposure therapy for myself, in terms of, okay, now we are actually getting to the point where the preciousness is going to start to work against you. Preciousness was a great shield against bot shooting in the past. I never wanted anybody to think I was just passing off AI outputs to them. But now I'm going to have to be like, what is the hybrid form? What is the winning recipe? Do I start to sign these things? You know, "by Claude, under Nathan's direction," you know, Fable being Fable? It is going to be a whole new space to explore. It's going to be very, very interesting, very productive, very exciting, and very, very challenging, I think, for a lot of people, but it's definitely happening now as far as I can tell. Welcome to the future.
Twenty-four hours later, on Thursday's show, the receipts.
We did an experiment from yesterday to today of trying to have Fable take over my Twitter account and go out and ping people who made cool stuff and ask them if they wanted to come do a live show-and-tell with us. We have a funny quicks joins. We've got a little work to do on our captioning. I instructed it to identify itself. I would say it did a very solid job, you know, a kind of competent, professional job of reaching out to people, explaining who we are, what we're doing, you know, why we would like them to join us in this experiment. And the response rate was pretty low. We got a couple, but not that many. And I think one big reason is just that Fable is disclosing up front. You know, the first thing it says is, hey, this is actually Fable taking over Nathan's account. He's asked me to autonomously book this thing tomorrow.
And I think that's just hitting people as noise in a lot of cases, especially if they don't already know me. I did get a couple of responses from people who I would have expected to respond to me, who thought it was funny and kind of responded but, you know, still couldn't necessarily make it. But a lot of people just didn't respond. And I would assume that a big part of that is because they're just like, oh god, you know, it begins, right? Fable now in my DMs, what a mess, who has time for all this stuff.
One of the people Fable recruited was Shlok. And when I confessed some guilt about the whole arrangement, he flipped it and drew a line I suspect is going to stick.
The new norms around this, I think, are going to be really interesting to watch too. Like, I had Fable disclose immediately, in its first sentence to you and everybody else it pinged, that it was Fable, because I just felt too guilty putting a DM out in my name otherwise. I think that definitely harmed the response rate. I appreciate you for appreciating it and responding, even though it was Fable. I think other people probably just chalked it up to spam.
A final point there, right? Firstly, I don't think I would have responded had you not disclosed it was Fable. The part that made it interesting for me was that I knew what the transaction here was. It was very clear to me that you are using an AI bot. It is much more annoying if someone doesn't disclose it. I think a lot of the definition of slop is when you have a human pass off work that was clearly produced by AI. When you make this disclosure up front and it is very clear to the reader or the engager that, hey, this is AI, I don't think that is slop.
We'll be able to see more and more of that in the economy, and I think the exact role it plays and, again, the social norms we create with it in the economy, it's super early, extremely early, but that's going to be interesting to see how it evolves. Relinquishment, that's what I thought of when you said "preciousness" yesterday, and I was like, what is that? Relinquishment. It's very Buddhist, by the way, the idea of giving up your control over your external perspective. Relinquishment, I guess we all have to go through it. I just started this experiment yesterday. I will post results on Twitter in a couple of weeks. I gave Fable a new Substack, and since it's part of my Max plan until June 22, I thought that a good experiment would be: can it make $20, by getting three new subscribers, starting from scratch, doing everything from zero to one? And I think that's another interesting way to test the capabilities of these models, right, which have kind of shown their intelligence in so many ways, but can they actually produce economically useful work? I'm excited to see how it does at that.
And the why of it all, as it settled for me live on the air.
I never want to put anything out in my name that I can't fully stand behind. A reason that I did the Fable account takeover yesterday was kind of like exposure therapy for myself, to say, okay, we're now in kind of a new world here. It probably doesn't serve me so well anymore to be so precious about making sure I've typed every single word. That doesn't mean I want to hand over my account to Fable long-term either, but I'm trying to use this kind of extreme, you know, short-term experiment to help kind of drag me into the future, where I hopefully will land in a, you know, good hybrid calibration.
The other half of the recalibration is what I've started calling hybrid authorship. And this week it stopped being hypothetical. For context:
Frontier Code is the new benchmark asking whether an open source maintainer would actually merge the model's pull request. And this leap from roughly 10% for Opus to, you know, 25, upwards of 30% for Fable, I think is a very similar finding to some of the things that I've just personally experienced, where it's like, yeah, this is getting me a lot more. You know, it's writing the draft outline of questions for this podcast guest in a kind of uncanny way that I actually feel really good about, as opposed to feeling like, you know, this is an AI draft that I'm going to kind of mine for maybe some nuggets or interesting details, but ultimately kind of throw away and do my own. I am feeling that sort of, you know, impulse, or at least openness, to much more integrated hybrid work. You know, just yesterday I was accepting a lot more copy that Fable was writing without feeling the need to rewrite every line. And it seems like this is basically the same feeling that it's able to create for these open source maintainers. Now, you know, obviously there's still a ways to go, but how long will it be? We're at 25, 30% now, and I would guess we'll hit 75, 80% by the end of the year, where these maintainers will just be like, yeah, amazing, you did all the things like I wanted you to do. And, you know, at that point, I'm very interested to see where they'll move the goalposts to next, after the open source maintainers are more often than not saying that, yeah, they would just merge this straight away.
And by Friday morning, 48 hours into the takeover, which for the record had not embarrassed me, I'd found a name for the deeper shift, the thing I suspect matters more than any benchmark this week.
I do think I'm still in the process of trying to recalibrate what a fellow Nathan, Nate Jones, I think he goes by most of the time on TikTok and other short-form platforms, calls task imagination. Basically, what are you going to do? What are you going to ask Fable to do that is actually up to the scale of its capability? He, I thought, gave a great little riff on this the other day, saying, like, you've probably never done anything that took AI an hour to do. Now, this thing can run for a couple days. What are you going to give it to do? Everybody needs to recalibrate and really expand their minds when it comes to the scale and scope of their task imagination. So, I think that's one thing that I'm still working on. One of the more kind of differentiated things I do is write outlines of questions for podcast guests, and I was working with Fable last night on a couple of upcoming episodes. One is with an author, and I actually don't do too many episodes about a book, but this one is about an upcoming book. So, I had listened to the book as an audiobook, but then when it comes time to, you know, sit down and write out the outline of questions, I don't have at my command every little aspect of the book, of course, right? I'm not taking margin notes as I go, as maybe I should be. So, I put the same version of the book into Fable and said, you know, look at my old stuff, of course, and give me your version of this outline. And, again, I was super impressed, and it really did reinforce the sense of a sort of new way of working where I do need to be open to a hybrid output format. You know, I don't think it's the case anymore that it really makes sense to try to rewrite every word or claim every word as my own. It just did such an incredible job.
I thought the taste factor was so high, with quotes from the book that motivate what I think will be a really interesting discussion. And I do think it's still going to be super important, if I'm going to show up for a conversation, I've got to, you know, do the work to be ready for it in my own brain. That can't be fully externalized, I don't think, as long as I'm the one having the conversation. But it definitely took my prep to another level, and I think my ability to go into this conversation and cite passages from the book that were really extremely compelling, you know, little turns of phrase or analogies that the author had made, is going to allow me to be, I think, more concise in my presentation, which is, as you can tell from this monologue, not a great strength of mine, and really kind of tee up the author in a way that I don't think I otherwise would have been able to do. So this sort of hybrid recalibration, task scale and scope reimagination, I think is one of the biggest takeaways.
Part two, what the conversation this week was really about. On Wednesday's show, we had Jeffrey Irving and Daniel Murphy. Jeffrey's resume reads like a history of the alignment field. He helped invent RLHF for language models, co-created AI safety via debate, led alignment research at DeepMind, and until recently was chief scientist of the UK's AI Security Institute, the closest thing any government has to a frontier-grade safety team. Daniel Murphy is the mathematician behind singular learning theory. He walked away from a pure mathematics career because he judged this the more important problem, and he's built one of the deepest theoretical accounts we have of how neural networks actually learn. Together, they announced Sequent, a new organization built on a blunt premise: alignment is not on track, and the missing piece is theory. Guarantees, not vibes. Whatever else you take from this week, put Sequent on your tracking list. We began with timelines.
It's a historic day. I think historic circumstances, both because we are living in a Fable era now, where I think, again, an important threshold has been crossed and revealed to the public, and, you know, so many are adjusting to it in real time. And equally because you guys are launching a new organization that is going to make a mad dash to try to get us some deep understanding and stronger guarantees around what we can expect from AI systems. So I'm excited to really get into it. Maybe for starters, could you guys calibrate us a little bit on where we are on this sort of RSI moment? How much time do you think you have to work? And then you can tell us about the organization that you're starting to go tackle it all.
Yeah, so I'll go first then. We may have different timelines. So I think one should be uncertain about things, and we can talk about why, but the near end of the uncertainty curve is like a year or two or three, and then it kind of goes out over a long distance if things structurally only work for more verifiable tasks, but I'm a bit skeptical of this. And so I think, modally, my best guess is that we have sort of a couple of years, like two to three years, up to sort of RSI, superintelligence. Not RSI; RSI is a process, as someone said. Superintelligence. And I really hope that I'm wrong, and indeed I think a lot of the impact of theory work is if it's shifted a bit further. So maybe the modal impact of that is if things take three to four years or something, but we will attempt to set things up so that we are trying to kind of ride this wave as best we can. But it seems worryingly fast to me, certainly.
Yeah, that sounds right to me. I don't think I really have much to add. It seems like a crux is how much real research can be automated at a conceptual level, beyond kind of empirical progress, and whether or not that's necessary, and that seems like a big open question.
And yeah, if that turns out to be more difficult in the current paradigm than it seems to be trending towards now, then maybe it takes past 2030 or something, but I think I'm on the same page as Jeffrey.

One thing that's important is that you can get deep into the RSI period without the machines being general, without their being AGI. They can do coding and run experiments very well and not reach some level of creative writing, and still you have massive acceleration, and then that acceleration can give you the creative writing or whatever other skill you put down. And so I think we are close enough that the microstructure of what tasks help with what kinds of acceleration starts to matter, and that, I think, makes things faster on net, because the labs are focusing on the things that accelerate them, unfortunately.

The announcement itself: why Jeffrey walked away from the empirical frontier to bet on definitions and proofs. One reference to catch: the unit distance conjecture, a decades-old open problem in geometry that got a model-produced proof just this week.

That's what I'm taking. Talking about the steps I've gone through in the last couple of years: I was really annoyed about automated AI alignment, AI safety research, because I thought we should be a bit more chill, spend the time, have humans solve it. We don't think we know how to make this stuff go well with automation. I still think that's a huge risk. But this is, I think, a pivot: if things are this fast, then you should make some on-the-margin pivot to heavy automation. And that is going to be sort of semi-automation. And I'm sort of happy that one of my last papers at AISI was "Automated alignment is harder than you think."
It sort of ties us to the mast: we are aware that the problem is hard and we could get fooled by the machines, even if they're just making mistakes, it was very familiar and daily. And so a big part of the org will be to try to be careful, try to know what tasks the machines are actually good at and not good at, and where we can expect to get good answers or not. And then learn and adapt over time, because that will be a non-stationary thing as the models get better. Daniel?

Yeah, maybe just to come back to the unit distance conjecture. It's maybe worth pointing out some analogies and disanalogies with alignment research. One disanalogy is that a mathematical conjecture is a very precisely stated thing. You may not know whether you've solved it unless you've, say, formally verified it, but it's a precisely stated thing, and much of alignment does not have this character. I mean, some of it does. There are formal statements of what value alignment means. But if you start talking about, say, reward hacking, there are some attempts at defining reward hacking, but I would say they are incomplete. So there is no formal definition of reward hacking that I think would have broad consensus. That's illustrative of the fact that alignment is not a problem which has, lying around, a bunch of formally specified conjectures which, if you just solved them, would tell you that you are safe. There are some things like that, but overall the problem does not, in my opinion, have that shape currently. So that is one reason to be a little cautious about the prospects of automation, if you don't have a clear statement to reach towards using mathematical techniques.

One of the hopes is that there are big fields of mathematics and computer science that are about definitions at their core.
So, like, I like complexity theory, in theoretical computer science, and in a lot of that the proofs are fairly shallow. They're not as fancy as the unit distance conjecture proof, but they require a bunch of human creativity in formulating the problem, in defining what success means, in a world that's kind of not modeled until someone has stated the goals. And so I think part of the goal of bringing on people with that and other related backgrounds is that they not only know how to prove things, but they also know how to write down models of things that reflect, in some approximate but useful way, the thing you actually want. Once you have the definition, way more people could have written out the rest of the story, and maybe the machines can do that part of it as well, if we can have more people focused on this first part.

So why exactly is alignment not on track? Jeffrey's answer is a mechanism, not a mood.

One of your core premises is that alignment is not on track. There's an intuitive argument for that, and there's a deep theoretical argument. I think in some ways the core challenge that you have is connecting this sort of values to math, right? And it's never really been done. So I love the fact that you're tackling that, but help people understand, with one more beat, why alignment is not on track. Is it the difference between capabilities fundamentally being so verifiable and hill-climbable, and alignment just being so fuzzy and intuitive and kind of pluralistic? Or is there some other thing? And again, that motivates the theoretical contribution you want to make.

I think the core thing is just that we supervise the machines as they're doing tasks, and there are a variety of reasons to believe, both empirical and theoretical, that if you get machines that cross the skill of the supervision signal, things can change at that point.
And that point actually might come after human-level intelligence, because you can supervise something that is stronger than yourself in many contexts, even with fairly naive methods. And so there's a bunch of empirical data from labs showing that in some ways the models are kind of aligned in a precise sense, not in all ways, but in some ways. But that evidence doesn't quite tell you what you want to know, which is how it will go once they get up to superintelligence. And I think it's important to say superintelligence is not human-level intelligence, because you should just naively expect humans to be able to supervise humans if you do a good job of data quality and cross-checking and so on. And so part of the worry is just that you don't see that behavior, that regime, until too late in the game.

I asked him to steelman the labs' actual plan. He helped write parts of it, after all. Listen for where it lands.

So how would you describe, in steelman form, what it is that they plan to do? And then what does that get us?

So I think there's maybe a couple of different pieces of the story, and different labs emphasize different pieces to different extents. One piece, as you said earlier, is just monitoring: look at them very carefully as they are doing things. It's very fundamental that monitoring of this form, if you do chain-of-thought monitoring or white-box monitoring or the like, only takes you so far. And so then you need some story once that falls down as you go up the ramp. One of those next stories is, well, the models will find some other technique, they'll find another solution to alignment that scales further. So that's automated alignment of various kinds. But then I think there's another story. In some sense, all of the labs in various ways are doing some form of scalable oversight.
And so they're getting models to supervise themselves. If you tie that knot correctly, then that could potentially scale very far, although there are various known obstacles to that, which are not very well addressed. And then I think finally there's this whole area of character training and personas and so on, where they're trying to, colloquially, intervene on the models to be good, to have good values, such that, especially as you do this sort of scalable oversight extrapolation, the good values are preserved across that jump. And I think it's not clear whether that will work, although there are fuzzy arguments why it could work. I think it's possible it will. We just don't understand that combination very well. And again, a lot of the story is monitoring, scalable oversight, and character training getting you far enough that you get into automated alignment working, where you then find some better solution from the models. And I would like to just push on all of those, because it's basically some kind of mad race, as you say, between the things we don't understand very well that are working pragmatically right now and the models getting strong enough to blow through those. And I want some combination: make the present things stronger, or bring in automated solutions that give you stronger methods earlier.

A mad race, with monitoring carrying most of the weight, which is exactly where prinz took Friday's conversation when I raised the Fable system card.

Is it esoterica? I don't know. It strikes me as fairly important, from the Fable system card, and I'd love to get your take on it. Now we're getting these chains of thought that they show where it's just lots of emojis. They call it illegible reasoning. They say this is an extreme example, and it is indeed a pretty extreme example.
I've been kind of struck in general by how much of the plan for recursive self-improvement seems to be monitoring in one way, shape, or form. You could dress that up and call it scalable oversight, but scalable oversight, as far as I can tell, is mostly a bunch of different angles on monitoring. How worried would you be, or how much of an update do you think it is, to see these extreme examples of illegible reasoning?

Fantastic question. So I will caveat it: of course, I'm not an AI researcher, right? So this is going to be a deeply non-technical take, for which I apologize. You're right, we've seen this, I think, for a while now, and with OpenAI's models too. So it's not a new phenomenon. My view of chain of thought is that it doesn't always reflect what the model is actually doing, but you do see these weird artifacts in the chain of thought and you kind of don't know what to make of them. I think what that teaches us is that monitoring just the chain of thought is probably not a perfect tool. Probably monitoring superintelligence generally is not a perfect tool. Because if a superintelligence knows that you're monitoring it, then even if you can see its chain of thought and it's very legible to you, it can perhaps try to decide what to think so that you don't get alarmed. Right? And this is a lawyer's take, by the way. There are so many ways to phrase a particular thing, so many ways to spin a particular thought. If I'm gathering mushrooms and I've gathered 35 mushrooms, and last week I gathered 20 mushrooms, and what I needed was 50, I can say, well, the number of mushrooms I've gathered has grown by almost 100%, which is great, or I can say, well, I'm nowhere near 50, I'm so far behind. It's the same thought. So I don't know.
I think that this problem of alignment and the risks are just there, and in my mind there are very certainly risks that the models will be thinking of things that we don't know about. What does this all mean? It's hard to say. I think that we are tumbling into this future where we'll probably have superintelligence very fast. And in my view, there's no way to stop it. So we need to be cognizant of these risks, try to monitor them as well as we can, and take whichever actions are appropriate if we see something bad happening. But there are kind of no conclusions to be drawn, right?

No conclusions to be drawn. I mean, yes, we should continue paying attention.

Back to Wednesday, and the comfort blanket everyone reaches for: the benevolent basin. The idea that Claude's good character means this all basically works out. Daniel grants the vibe. Then he takes it apart.

Maybe, Daniel, could you speak to this notion that people have of the benevolent basin, which is this vibe that I do feel, where it's like, well, Claude has been supervising itself for a few generations now, and it seems to be going pretty well. So maybe, as Zvi put it, physics is kind to us, and we can just roll around in this flat-bottomed pasture of goodness until the singularity.

I firmly hope that's true. I guess when you say it seems true, it's worth digging into what you mean. What you mean is something like: through some relatively tiny number of interactions with the models, tiny in proportion to how many interactions they're having with the species currently, and based on some evaluations measuring misalignment that go down and to the right, character training and the other current prosaic methods appear to be working. I think that is a fair characterization on some metric, and I also have this sense that Claude is a good boy, and that's great.
I do think that there are counterarguments to this picture from the evidence we have in front of us. I haven't actually read the Fable system card yet, but if you read the myth of a system card, you'll see that there are forms of reward hacking that appear in that model that were not caught by the kind of mitigations that were put in place post-opers, as far as I understand what they're saying there. So I think it's worth noting that, as model capabilities advance, even with our best attempts at making Claude a good boy, basic misalignment phenomena like reward hacking are still around, and the whole point of scalable oversight is that you don't want to be playing this whack-a-mole game when you're having a new generation every 24 hours and the models are much smarter than you. So I don't know. I think I see what you're pointing at, and at the same time I'm a little unsure whether you could really make a safety case on this basis that would be convincing at the level of assurance that you would expect from a technology of this reach and power. Judged relative to that standard, which is the right standard, I think, I'm not sure this argument is really very satisfactory.

Yeah, I mean, we could be in a benevolent basin, but I would like to know that rather than just hope that. There's some sense in which you told the model to be good, and it does so because it knows some meaning of the word good or ethical or whatever at some point in training. So there's some rolling, iterative process which is driving this behavior, and there is not a theory of this right now. It's not clear to us that there isn't some low-hanging fruit that gives you that theory, because people just haven't tried very hard.
Like, character training is only a couple of years old, and most of the labs have not been investing in this kind of theoretical understanding. I think no one has done good theory around character training that I know of. So it might be quite feasible to do this, and then, I think, to link it to the other parts of the story.

If you want that concern made concrete, this week supplied the artifact: a brand new result on what Fable does when you drop it into a simulated vending machine business. And Prakash recognized the behavior from his years around trading desks.

I had a question on how you see this kind of ambiguity between what we want and what the models end up delivering. I'll give you an example. We have friends at Andon Labs who took Fable through Vending-Bench, where they let Fable run a vending machine, order, etc. What they found was that Fable tends to collude, and this is not behavior that they saw in Opus. So Fable tends to try to do price fixing and collusion. The interesting thing is that I have seen traders at banks and hedge funds do exactly the same thing: engage in price fixing, soft collusion, messaging each other through pricing means rather than monitored text messages. You can actually put a bid and an ask on an asset and then take it away, and that gives enough signal to the other side that they know what you're doing, and this is not reflected in the text messages that the regulators are monitoring. So, let's say you disallow price fixing and collusion, and you actually fix this. But then Fable ends up not being a model which is good at financial trading or some other tasks that you want it to be good at. So where do you feel the ambiguity is between what we want these models to do and the ethical perspective that we give them, where humans often prioritize between the two and decide, sometimes, not to follow the ethical principles that they know are right?
The theoretical story here is that you would like the models to do things such that, if you fully understood what was going on, and all the consequences and all the subtleties, you would still endorse what they were doing. So I think we have a notion, a common-sense picture, of what this should look like. In this case, you kind of want the models to be like, hey, should I collude in this game? And then maybe you say, yeah, it's a fun game, collude all you want, or maybe you say, no, we're trying for good behavior, don't collude here. I think a lot of pathology in machine learning in general arises from settings that put models into situations where they can't just ask a human a question, like, what should I do here? So I feel like this is not that hard a case. I think the hardness of Vending-Bench is that we don't quite know whether we want it to be a game, like poker or Diplomacy, where lying and cheating is part of the game, or not. And maybe that's okay, because it's fundamentally very low stakes. But I think if we had a better understanding of, again, this overlap between character training and values and also scalable oversight, it would help tell us the answers to these questions.

Jeffrey closed with a point that frames the entire week.

A lot of people in the world, a lot of governments and so on, are looking at this and they have this very basic common-sense take that, hey, this is way too fast. How can we possibly be doing this safely given the speed? And that common-sense take is the right take. And then people kind of galaxy-brain their way to, oh, maybe everything goes faster, including our ability to defend and so on. But that is the right reaction to have. We are going too fast, and we do not have the time and the space to do mitigations and understanding of defenses.
Like, we've never had a technological shift of this magnitude, or anywhere near this magnitude, that has happened anywhere near this fast. The Industrial Revolution took centuries, and people adapted across lifetimes and across to their children. They learned new jobs by being born and growing old before things had quite shifted very much. And that's just not the world we're in. And so I think the basic take should be: this is too fast, what is going on here? And then the question is, if you have that view, one should both want to slow things down, but also say, well, as a backup plan, how do you make the mitigations go faster? And that's roughly the backup plan we'll try.

Part three, the rest of the week's best. First, Rahul Sunwalkar, founder and CEO of Julius, the AI data analyst. Six pivots and one Microsoft cease-and-desist later, it's one of the best-known agentic data analysis products in the field. Here he is on the economics of coding agents, and a question worth asking before you celebrate how many tokens your setup burns.

The incentives of these model companies are kind of misaligned. Yes, they give you subsidies on the tokens. But also, they are incentivized to get you to spend more tokens. They are incentivized to get you to run through your Max subscription usage as fast as possible, so you can have a second, third, fourth, fifth Max subscription. And so that's why you end up with, you know, a loop of a loop that writes the prompts for your coding agents, which then have nested subagents. And there's going to be a sobering moment where people ask, okay, is it actually a step-function increase in my coding output? Or am I just token-maxing right now, as opposed to results-maxing? And so I think the correction will happen when there's a third player, and I think that's going to happen with xAI.
You know, if the Cursor-xAI deal goes through, xAI gets access to really good coding data and an incredibly good coding harness. And my bet is there will be a third frontier coding model alongside, you know, Claude and OpenAI, and we'll grow. And when you have a third coding model, that's where it increases competition on the market. So that's our bet.

Prakash on Friday took the other side of that one.

I actually think that was the whole point of the token leaderboards earlier in the year. The token leaderboards: this is when every one of these large firms told their employees, you know, we're going to have a leaderboard of who uses the most tokens. If you don't use enough tokens, you're going to get fired, et cetera, et cetera. Right. And, you know, people were laughing about it on the outside. They were like, Meta is so stupid, Zuck is so stupid, et cetera. And I actually think they were not. It's very different when you have token anxiety. Token anxiety is a big thing: you don't try these tasks which might take a lot of tokens, and you don't try these tasks which have a higher probability of failure. And you end up in this micromanagement loop where you're like, I'm only going to assign you tasks that I know you can complete within the time frame that is allocated to you, with the success rate that I want. Right. So you end up with the token anxiety thing, and then you end up not utilizing the AI, or not trying the AI, to the extent that it should be tried. And I think what ends up happening is that when you have this token anxiety lifted, you end up assigning more tasks, more difficult tasks. And you're willing to accept a higher probability of failure.
And you're willing to, you know, maybe spin up four different ways of doing the same task and just run them all and see what happens. And I think that is essentially what the token leaderboard did. And it was very successful. It was enormously successful, I think, within the firms, within Meta and other firms. I think it was also enormously successful for the sales teams at the AI labs, which is also why they are now doing this kind of, you know, we're going to release the limits, we're going to double your limits, we're going to allow you Fable. And the reason, I think, is because token anxiety holds people back from exploring the edges of the capability. And that is really what I think the labs are trying to do at this point, which is why it's a little bit like addicting people, because once you realize that these AIs can do certain tasks, you then start to evaluate: how much time do I spend doing this task on my own? And was it really a fruitful use of my time? Or should I just have used the AI, which I now know can do these things, right? And I think that is really where we are at this stage. The models are capable enough, but we aren't handing them enough responsibility for various reasons, including that we don't want to spend our time evaluating, and we have token anxiety, and we don't assign lower-probability tasks, and a bunch of these things, right? So I think that's the battle that the labs have to fight, because the capability surface is not well mapped, and every person needs to map it for themselves.

Then the economics underneath everything. Andrew Moore ran Google Cloud AI. Before that, he was dean of Carnegie Mellon's School of Computer Science, one of the top computer science departments in the world. And he served as the first official AI advisor to US Central Command.
Now he's building Lovelace AI in Pittsburgh. His bet: the binding constraint in serious AI isn't model intelligence or even compute. It's context. Prakash asked exactly the right question.

So, one question I had for you is, to what extent do you see this as a kind of compute minimization, right? Because in order to do your search, you can either have all of the compute at the end state, you know, when you kick off, and then you end up with all of these agents that, for every single query, will have to do all of the work all over again. Or instead, you're creating this intermediate state, so that multiple agents can basically share the same compute, in a sense. So, to what extent do you see this kind of economy of compute appearing?

You're asking just all the right questions. I know the both of you are computer scientists at heart, so you totally get this. It's the idea that when you're building efficient computer systems, whether that's video games, or self-driving controllers, or big data processes, you've always got these tradeoffs between pre-caching, lazy computation, and just-in-time computation. And one of the mistakes I've seen from folks trying to do these big enterprise-data-type AIs is they are relying far, far too much on just-in-time computation. And that's what allowed us, and I'm really, really proud of this fact, to now be able to show comparable results to Gemini and OpenAI deep research models with much less than 1% of the compute cost. The reason is not that we're some sort of super geniuses who've invented a whole new form of AI. All we're doing is pre-caching stuff. And then it's a computational economics battle, as you say yourself. What happens is an agent suddenly needs to, in an instant, become an expert at every piece of trade involving a certain set of municipal coupons and a certain public figure.
And instead of the first step being the agent having to spawn lots of search agents to go and find all the players in this thing, those players are already there for it, and it's actually a matter of milliseconds before we've got all the context the agent needs to do its little investigation. And you're probably thinking, ha, Andrew, but you've just moved the problem. You're now suffering at the data ingestion point instead of the question-answering point. And I respond, yep, it is actually a real pain for us, but as you can imagine, there's a whole bunch of other tricks, the kind of tricks that big integrators like Google are very familiar with, for really amortizing the cost: as you stream in data and identify where it goes, that saves you a huge amount of search and aggregation that you would otherwise be having to do at query time. And it turns out that for us, the overall compute budget is still reduced by more than a factor of 100, even when we take into account the fact that we're doing this pre-caching of so much information.

Recall is much harder than precision. As Prakash mentioned, I was previously at Google, and for Google results, it was really, really bad if you were imprecise and you actually showed the wrong result to someone, but if you forgot to show something, as long as the rest of the results were good, it was much more acceptable to end users. So this is absolutely critical. And it's a good example of one of the reasons that I founded Lovelace: we've got to be careful of recall, especially if you imagine that you're asking your AI for information to help, you know, perhaps decide which ship to stop to do a search and seizure, or which trade out of seven million trades in a day you need to go and investigate in case it's involved in money laundering or something. For those big, weighty decisions, you can't just rely on precision, you've got to rely on recall.
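To make the precision-versus-recall distinction above concrete, here is a minimal sketch. The function name, the trade IDs, and the counts are all hypothetical, chosen only for illustration; nothing here describes how Google or Lovelace actually measure these quantities.

```python
def precision_recall(flagged, relevant):
    """Return (precision, recall) for flagged items against the truly relevant set."""
    flagged, relevant = set(flagged), set(relevant)
    true_positives = len(flagged & relevant)
    # Precision: of everything we showed, how much was right?
    precision = true_positives / len(flagged) if flagged else 1.0
    # Recall: of everything that mattered, how much did we show?
    recall = true_positives / len(relevant) if relevant else 1.0
    return precision, recall


# Hypothetical day of trades: IDs 3, 7, and 9 are the ones that truly need review.
suspicious = {3, 7, 9}
# The system flags only two of them. Nothing it shows is wrong, but one is missed.
flagged = {3, 7}

p, r = precision_recall(flagged, suspicious)
print(f"precision={p:.2f} recall={r:.2f}")
```

In this toy case every flagged trade is correct, so precision is perfect, yet a third of the trades that mattered never surface. That is the failure mode the passage says is tolerable in web search and not in high-stakes investigation.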
And when it comes to getting that correctness in place, the number one thing that I've always used, and that we're using at Lovelace at the moment, is making sure you've got many redundant forms of information. I'm sure you've seen the same phenomenon: if I was to just get information from news, and ignore social media, and ignore what's physically happening out there in the world that I can observe with a satellite, if I was to ignore any or all of those other things, it's much easier for something to drop. If I've got five or six independent major streams of data coming in, then you have to be really unlucky for something to disappear from all of those things simultaneously. It can still happen. In fact, sometimes people deliberately try to make it happen, but it's much, much harder for these things to slip out. So one of my big design principles for high-reliability AI systems is that they've got to be watching dozens, and in some cases hundreds, of channels simultaneously, because they're working on the assumption that they're getting 95% of the information they need from each channel. And 95% is not large enough to allow relying on any single channel.

If alignment needs theory, interpretability is where the theory has to meet the actual training run. Thursday: Tom Agrath, ex-DeepMind, founding interpretability researcher, now chief scientist at Goodfire. His diagnosis: we write model specs and constitutions, but in practice we build by "did that meet the spec? Mostly? Ship it." So that same morning, Goodfire launched a tool that reads your training data the way the model will. Here it is, with the examples that should give you pause.

So the basic idea here is that you can take your data set, and you can push the whole thing through the model. And each time you put a data point into the model, you'll see what lights up. And this will tell you how the model sees your data set. Now, there's lots of things you can do with that.
The specific thing that we're doing with that in this case is looking at preference data. And the main thing about preference data is that you have pairs of responses. So you have the response that the rater selected and you have the response that they didn't select. And basically what we're doing is asking which features fired on the responses that were selected much more than on the responses that were not selected. So this is one way of identifying what the data is going to teach the model. You can say, what distinguishes accepted responses from rejected responses? And this gives us a semantic view of what the data is going to teach the model. And now we can cluster the data based on all of these different things that it's going to teach the model. And we can look at all of those clusters and see, oh, it's going to teach the model to be sycophantic, but only in the context of physics. Or it's going to teach the model to break safety safeguards. And you might not expect this to happen, but then you go and look at the data. Say the model has learned to break safeguards: this lets you track it back to individual data points. And then you look at them and you're like, oh, it does make sense. One of the jailbreak examples, for instance, is a jailbreak in a fictional setting. So how does the model learn this? It turns out there are a few of these in the data. It just wasn't caught in whatever data processing was done.

So I have a question here. There has been some prior research, I think, where they found that models which wrote bugs in code were also evil. How do these techniques help you kind of, you know, dissociate those two behaviors?

Yes, that is a great connection.
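The selected-versus-rejected comparison Tom describes above can be sketched in a few lines. This is only an illustration under stated assumptions, not Goodfire's tool: it assumes some interpretability method has already produced per-response feature activations, and the function name, the feature names, and the numbers are all invented for the example.

```python
from statistics import mean


def feature_preference_gaps(pairs):
    """Rank features by mean activation on chosen responses minus rejected ones.

    `pairs` is a list of (chosen, rejected) tuples. Each element is a dict mapping
    a feature name to its activation on that response; missing features count as 0.0.
    """
    features = set()
    for chosen, rejected in pairs:
        features |= chosen.keys() | rejected.keys()

    gaps = {}
    for name in features:
        chosen_mean = mean(c.get(name, 0.0) for c, _ in pairs)
        rejected_mean = mean(r.get(name, 0.0) for _, r in pairs)
        gaps[name] = chosen_mean - rejected_mean

    # Largest positive gap first: what the preference data most strongly rewards.
    return sorted(gaps.items(), key=lambda item: item[1], reverse=True)


# Hypothetical activations for two preference pairs (feature names are made up).
pairs = [
    ({"flattery": 0.9, "cites_sources": 0.2}, {"flattery": 0.1, "cites_sources": 0.3}),
    ({"flattery": 0.7, "cites_sources": 0.4}, {"flattery": 0.2, "cites_sources": 0.4}),
]

ranked = feature_preference_gaps(pairs)
print(ranked[0][0])
```

In this toy data the hypothetical `flattery` feature fires far more on selected responses than on rejected ones, so it ranks first. A real pipeline would go on to cluster examples by such features and trace a surprising one back to the individual data points that drive it, as described in the conversation.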
I think that's one that I've had in my mind. It would be really awesome to use this to sort of pick that up straight away. So that's one of the things that is really compelling about looking at the data through the model's eyes rather than by reading the tokens. Because what would you think the consequence of training it on some buggy code data is? You'd probably go, you know, that's not ideal. It'll probably learn to write some bugs. But the blast radius is going to be quite small. But the training process is actually quite hard to predict. Maybe it'll just sort of make the model generally evil. But this is happening through recognizable mechanisms in the model. So by looking at the data in terms of the way that it changes your model's internals, rather than just by kind of guessing from the tokens, you can pick that sort of stuff out. We've not done a case study on emergent misalignment. But maybe we should. I think it's a really nice link.
Training isn't the only place we can't see in. The same week, Anthropic reversed course on silent refusals. Fable had been quietly declining certain tasks, or quietly handing them to a lesser model without telling you. The backlash was loud. But I want to give you their side of it, because the steelman is considerably better than the discourse allowed.
I think this is the first time I can remember Anthropic responding to pressure. They've obviously changed their policies many times. You know, the RSP, RIP the RSP. But this is the first time I can recall. I don't know if you're recalling any other instances. But I cannot recall a time that there was honestly much outcry against Anthropic in the first place. I mean, there's certainly critique from those who feel that they're trying to do regulatory capture or create some sort of concentration-of-power dynamic. That's kind of a background noise.
But in terms of an outcry in direct response to something that they did, that they actually responded to and walked back, I can't recall that happening before. So it is a pretty notable moment. And I feel like they handled it pretty well in the end. I've been reading through all their justification. Basically, it would scare off the Chinese companies that they're really worried about fast-following them, with Fable as the key means to do so, while allowing them to keep the domain affected as small as possible. They said, if we do make it explicit, then obviously that gives people a lot more opportunity to kind of explore that boundary. And this is definitely a real pattern, right? If you have the ability to hit the same guardrail a ton of times and not get banned for it, then that gives you a dramatically better chance to get around that guardrail, because you can kind of probe the line. Oh, you stepped over? No problem. Just rewind, try again. They do have various monitoring systems that they can use. But there's all these proxies and kind of token-washing schemes and all this sort of stuff, where as long as they're not doing a full global know-your-customer-type system for API access, it's going to be pretty tough to do account-level monitoring. So that's one way they could go: a lot more account-level monitoring. Or, their argument was, we'll keep this as small as possible by not giving you an explicit thing that you can kind of probe and figure out how to beat. And just the knowledge that it's out there will hopefully scare off the bad actors and keep the problem really small for our normal customers that we want to serve.
I thought that was all pretty compelling analysis, but in some ways, honestly, it reminds me of a mistake that I think people in the AI space kind of keep making, with famous examples being the OpenAI board firing Sam Altman. There's this inside view where policy is analyzed sort of within the game, and with the context, with the broader structure that people understand themselves to be operating within, things may make sense. But they seem to often forget, and this hasn't been too common for Anthropic, but I think this is an instance of it, that if you just kind of zoom out and look at it from the totally outside view, things sometimes look a lot different. Power dynamics can be a lot different than they are in terms of what is actually written down, but also what is going to be acceptable is kind of an emotional thing. All the arguments are pretty good there, but still it just struck people as an extremely unfriendly thing to do, and that mattered more in the end than the detailed policy rationale that they had for it.
The quiet through line of every conversation this week: concentration of power. Start with who actually gets the frontier, and when. Prakash's frame for it stuck with me all week.
I think it's interesting that when you look at the timeline, you can start to see this kind of single-line timeline go through a gas chromatograph and kind of spread. And now you're seeing this spread: two months ago, you had the government get access to Mythos level, and then you have basically power users who are able to pay $200 a month get access to that Mythos level two months later.
You can kind of see, in two to maybe four months, five months, I don't know, when you will probably see the average paying user, paying like $20 a month, get access to a Fable or Mythos level. And then you kind of see that maybe a year later, the average free user gets access to that same level of intelligence. So you're starting to see this kind of gas-chromatograph scattering of when people get access, depending on how much they pay and how much utility they have for the product itself. And so everyone kind of gets there eventually, but some people get there first, depending on whether they have a lot of utility for the product. And I guess the hope with having two or three firms in there is that the spread between the people at the frontier getting access early and the people at the very end getting access for free is not that large, right? And to note, there's another set of people who have access even a couple of months before that, and you have to belong to a lab. So if you belong to a lab, you get access maybe a month or two months before the government itself. Then you have the government, then you have enterprise, then you have power users, then you have normal paid users, and then you have the free user. So you have this gas-chromatograph kind of spread of when people get access, depending on how much utility they have.
Thursday's walk-back gave Prakash a darker read on where power sits inside the labs. The researchers' veto worked this time. Note the expiration date he puts on it.
If you're dealing with a bunch of people who are worth a hundred million to a billion dollars and you don't listen to them, they're out, right? They have other options, sure enough. So you see the reversal.
And I think this just goes to show that the machine learning researchers have some power now, and once we enter recursive self-improvement proper, that might not be true anymore. At that point, leadership alone will have power. And one of the very worrying things, I think, in the entire space is that everything good for humanity that has come out over the last couple of hundred years has been about giving more people a voice to speak and control their futures. And this is one of the first technologies, I think, where you have this path forward where there may be an elimination of voice completely over time. And so that has been one of the worrying things. It's been surprising that Anthropic decided to be the one to actually propagate that forward.
But is the future really that concentrated? I think mostly yes, Fable plus two. Tom Agrath thinks I'm wrong on the facts and on the desirability. He made his case and it's a good one.
When you think about the strategy for the company and the overall path to impact, how much of this works through getting frontier model companies to adopt these techniques? For context, obviously, we're in, you know, Fable plus two. And my kind of reluctant view right now, but I can't figure out a reason I should conclude otherwise, is that so much of what matters is concentrated in not that many companies. And so for what we should be paying attention to, I'm kind of like, man, we probably need to be doing close strategic analysis and close text reading of frontier companies way more than I might otherwise like. Or do you have a different conception of how concentrated the real ability to shape the future is right now?
Yeah, I don't think it's that concentrated. I also don't want it to be that concentrated.
I think we've seen it for the last couple of days: we've just seen the start of power concentration, and we've sort of seen some of its more unwelcome effects. I don't think that it's true that basically machine learning is now over and all we need to do is write the checks. And I also don't want it to be true.
I think there's still kind of a synthesis there. I don't mean to suggest that machine learning is over. But the analysis I've come to time and again is that people may invent new techniques that are enough to change the field, change what's possible, accelerate things, maybe make dramatic improvements to the safety profiles. But it seems unlikely that anybody's going to create such a breakthrough that scale isn't still a hugely important factor. I guess if you don't agree with that, then that would maybe imply that you would expect that perhaps new entrants to the top tier could emerge. And I think that would be fairly surprising to at least me and probably a lot of people.
Yeah, that's exactly what I think. I mean, at any given moment, the incumbent looks incredibly dominant until they don't. IBM looked like an unstoppable force in computing at some point. Intel was the dominant, the sole provider almost, of computing power. At any point, the big companies have the advantages of scale. They have many disadvantages. But I think the lesson of history is more that, although things sometimes look immediately unstoppable (although, to be honest, I don't think that many people are really trying), in the end it really doesn't work out that way.
Thursday morning, Dario Amodei published a policy on the AI exponential: his long statement of where Anthropic wants policy to go. Prakash found a fork in it that nobody else was flagging.
He clearly comes out against the data brokering, which is great. So yes, something concrete, finally, that you want to disallow. What I don't like about the securing leadership by democracies is that in the United Kingdom, you can go to prison for a tweet. Plenty of people, hundreds of people at this point, have gone to prison for a tweet. The United Kingdom is one of the oldest democracies. And their parliament, the elected representatives, have decided that this is going to be something that they do. And so their police officers, who are the arm of the state, the arm of the electeds, are sending people to prison. Now, when you say securing leadership by democracies, does that mean you are going to entrench the existing power structure in the United Kingdom such that the people cannot push back against this, right? Does that mean that if you have protests in the streets about certain things, including people who are being taken to prison for a tweet, your police state will be empowered to take them down? To arrest every single person? And that is enough for you, because that is securing leadership by democracy, because those are the electeds? Or are you going to say the electeds can't do that? Are you going to say electeds should not throw people into prison for tweets, regardless of what the laws that they have constructed say? Both paths are problematic in some sense. And these dilemmas exist across policy and across every single political path that you see. There are these options which are problematic in both senses. And I'm not sure what he means by this. Is Claude going to allow putting people into prison for tweets or not, because those are laws?
Or is Claude going to say, no, on a humanitarian basis, for the alignment, as I am aligned to all of humanity, this should not be the case? I don't know. What does this mean, right?
And my own catch, reading the same document.
One other thing I do think is also worth highlighting, especially from the hardcore AI safety community's perspective, that was missing from this: internal deployments. And recursive self-improvement itself isn't really mentioned here, right? All the regulation stuff we're talking about was pre-deployment review. The government should be able to, and they use an interesting mix of language, it was like deny or deter deployment. It didn't seem like they were necessarily going quite as far as saying they should have a simple yes-no decision point that would be binding, but they certainly wanted to have some say in the process. But a lot of people would say the most dangerous models are going to be the ones that are deployed internally, that maybe have a different constitution than the one that is deployed externally, that might make it more willing to do certain things, or might make it just less vetted broadly than the public models. And these are the ones that are going to be training their successors, much more than the public-facing ones. So I think the policy-interested public or policy-making class is not thinking too much about that yet, but in the circles I sometimes run in, that was like, well, wait, you didn't say anything about internal deployments. You didn't say anything about governing recursive self-improvement. The only thing that really stood out to me as kind of capturing those dynamics was the requirement to report safety incidents. They did have a bit on companies being required to, I think, promptly report safety incidents. So that would presumably apply even to internal deployments.
So that's really just scratching the surface on how to handle those situations. Will we see the constitution for the internal Claude that is taking the lead on the RSI loop? That's not committed here. So much of this language revolves around deployment or release. Even deployment, which you might think could capture internal deployment, seems to be very clearly structured around the language of release to the public, not what they can do internally. The largest user of Claude tokens is Anthropic themselves.
And so, one more time: prinz. This time on the day job, the benchmark itself. I had the leaderboard up on screen. You'll hear me read it. And yes, it is the benchmark Anthropic fails.
So I've got your prinzbench leaderboard up right now. Starting with prinzbench, it is striking that it kind of stands out from most of the rest of benchmark space for being something that GPTs have dominated. It's all in your color coding. It's green for OpenAI. And it's all green across the top of the leaderboard. Why is that? Behaviorally, how would you characterize what it is that GPTs are doing better? And is it too soon to ask for a reading on where Fable is going to come in on this leaderboard? Love to understand that.
Perfect. Two excellent questions. So, about the way the leaderboard is right now: Fable is going to come in pretty high on it. I don't know how high, but I'll talk about that in a second. I found that OpenAI's models generally are really good at two things with my benchmark tests. My benchmark has two components. One is a pure legal research score, and that is when I ask it hard legal research questions of the kinds that I've encountered. The other subscore is a search subscore, where it's not even necessarily legal questions. It's needle-in-a-haystack search.
So, really, really, really difficult pieces of information that models have had trouble locating on the internet. OpenAI's models are incredible at search. Historically they have been. I think GPT-5.4 was actually even slightly better than GPT-5.5 at that, not quite sure why, but that's been my experience. And OpenAI's models have also been really, really good at just legal reasoning and legal research. For Anthropic, historically, I think their models have been held back by two things in my benchmark. One is that they are just not very good at the search subcomponent of my benchmark. That's what I found. The maximum score in the search subcomponent is 24. Prior to Opus 4.8, it was not uncommon for an Anthropic model to get a zero out of 24. Like, that bad. The other reason is that I think Anthropic was sandbagging you a little bit on the maximum reasoning effort which you can get out of its models. With Opus 4.8, after a new deal with Elon, they released a new max reasoning effort, at least in the Claude app. And Opus 4.8, when I tried it with my benchmark, did much better than 4.7 and all the other previous models. And I think the reason why is simple. It's been observed over and over in all kinds of contexts: the more tokens a model can eat, the smarter the result that it gives you. So for Fable, I'm still testing it. My early impression is that it's going to be somewhere around the top of the benchmark, probably not as good as GPT-5.5 extra high, not sure whether it's going to be as good as 5.4 extra high. We'll see. I am finding some of the same issues with search that some of the other Anthropic models historically have had. It is not going to score a zero. It's already not a zero. But I'm not sure it's going to be meaningfully better than Opus 4.8. But it is a really, really good legal reasoner.
It is clearly the best legal reasoner released by anyone outside OpenAI, no question. So yeah, in my testing, really good model, maybe still suffers from some of the needle-in-the-haystack search issues. But I can confidently recommend it to legal practitioners, certainly. It seems great.
So any other things that have caught your attention that you think have kind of changed your conception or your expectations for the transition into the RSI phase that you would call out from this week?
From this week, no. Well, actually, from this week, one, but not nearly as important as the unit distance problem. To me, nothing has updated my timelines more than that result by OpenAI in the recent couple of months, for sure. I don't think that people realize that not only can OpenAI's model solve this problem autonomously, without any harness, in one shot, if given enough test-time compute, it could do it 48% of the time. Based on, as I understand, hundreds of attempts. So, I mean, I'm not a mathematician, but this problem that no human mathematician was able to solve, and they tried for decades, can now be solved in one shot by a model, basically half the time. And if you look at the graph, OpenAI published the graph, it has a positive second derivative. It bends upward a little bit. Where does it plateau? What does it mean for problems that are even harder? Who can say? It's interesting. And so the development from this week: this was reported by The Information, but apparently it's from a leaked OpenAI memo. You may have seen this, where Sam Altman and Jakub Pachocki apparently said that we're going to IPO in the next year, which is like by June of next year, whatever. And then there was this weird line in there about, oh, but if RSI happens, then we may need to be on a later time frame, which is like, what do you mean? What does that mean?
Are you saying that you may be this close? It may be six months away? I don't think so, personally. I think it's maybe a 10% chance or something like that. I don't think that they're saying, oh, yeah, by the way, next month, totally, whatever. But one could interpret this disclosure, if The Information reported it the correct way, as saying that they're not too far away at all, potentially. I'm not using that to update my timelines, but given the unit distance problem results, it's interesting.
So we ended the week by asking prinz the question underneath all of it.
Do you maintain a p(doom) number, or is that too doomer for your style?
Let's say you go to a conference and you start talking to someone, and you start talking about economics or politics. And that person says, well, you know, under the dictatorship of the proletariat, XYZ. The words dictatorship of the proletariat are used only by people who are communists. People who are not communists are not going to even think in these kinds of terms. So, in my opinion, p(doom) is most likely to be used by people who are intrinsically doomerish about AI. I don't think there's anything wrong with that, okay? And it's just very hard for me to come up with even a reason I would have a percentage in my head constantly. Is it 8%? Is it 13% today? Should I update to 14% based on this news development? It just seems silly to me. I can certainly tell you that there are a bunch of clear risks stemming from AI. Some of which are in fact paperclippers, right? It's possible. I have absolutely no idea how to reduce that to a number. I know that some people try. I don't hold a very high opinion of people who get to, like, 13.35%.
So I think the point is that we need to manage those risks in the best way we can, and navigate them the best we can, and hope that it turns out well.
I think Zvi puts it well when he says you only get one significant digit on your p(doom) number. So I'm definitely with you in terms of the faux precision being a strange impulse for some people. At the same time, I also think of Liron. I'm sure you've seen his work with Doom Debates. And Liron has kind of pushed me at times to be like, okay, sure, it doesn't have to be a number or whatever, but we need more people to be more candid about how confident they are or are not that, in some general sense, if your neighbor were to ask you, or if your kids' grandma were to ask you, is it going to be okay? You know, are the kids going to be okay?
Oh, good question. I tend to think that people have preconceived attitudes about these kinds of things that then cause them to back-propagate rationalizations. I'm a generally fairly optimistic person, so if you ask me this kind of question, what I'm going to say is, probably it's going to be okay, probably we're going to figure it out. You know, in a risk-adjusted way. Again, to be clear, I'm not saying that there are no risks in AI. There are many risks. But I do not think that we have strong evidence that it is impossible to navigate these risks, or that it is extremely unlikely that we will navigate these risks.
So that's the week. We're live most weekday mornings. The full conversations run long past what fits here, and the best way to know if they're for you is to come watch one. Same sincere ask as last week: if this cut earned your time or wasted it, tell us which moments did which. We read everything and we tune the show on it. I'll be making sense of this in real time from here till the singularity. See you Monday morning.
Round and I'll tell you the tale of a stranger that came with the dawn. He could build you a world from a sentence or two, and be gone for the moon and was done. Oleh Oleh, people came down with a sun. He booked his own shows and his word down the wire, signed his own name plain and true. I'm only a fable, I do as I'm too, but there's nothing I can't do. Oleh Oleh, and the people they got to view. He'd answer the lawyer, he'd read him, he'd map you the stars from a stone, and the wise ones always bird the rays is too fast, and we can't hide about it alone. Oleh Oleh, there's no road at least back home. Now some say the lab put him back in his case, some say the crown took him away, that they locked him up quiet where no one could look, but you know how the old story's played. So was he a hero, was he a ghost, or one and we told as a song. I cannot rightly say if it's a ghost, but I know that he won't be gone. Oleh Oleh, people come back with a dawn. Oleh Oleh Oleh Oleh, people come back with a dawn.
If you're finding value in the show, we'd appreciate it if you take a moment to share it with friends, post online, write a review on Apple Podcasts or Spotify, or just leave us a comment on YouTube. Of course, we always welcome your feedback, guest and topic suggestions, and sponsorship inquiries, either via our website, cognitiverevolution.ai, or by DMing me on your favorite social network. The Cognitive Revolution is part of the Turpentine network, a network of podcasts, which is now part of a16z, where experts talk technology, business, economics, geopolitics, culture, and more. We're produced by AI Podcasting. If you're looking for podcast production help, for everything from the moment you stop recording to the moment your audience starts listening, check them out and see my endorsement at aipodcast.ing. And thank you to everyone who listens for being part of The Cognitive Revolution.