The Cognitive Revolution · 2025-04-05

Jack Rae (Google DeepMind) on Gemini 2.5 Pro, Reasoning Scaling & Path to AGI

Hosts: Nathan Labenz

Guests: Jack Rae

Gemini 2.5 Pro · Reasoning models · Inference-time / test-time compute scaling · Reinforcement learning · Long context · Chain-of-thought · AI safety · Mechanistic interpretability · Latent-space reasoning · AGI roadmap · Multimodal models

Why it matters

Gemini 2.5 Pro is described as Google's first broadly #1 model.

Key claims

Gemini 2.5 Pro is described as Google's first broadly #1 model; standout capability is deep command of very long contexts (400K+ tokens tested by host)
RL on correctness signals has been working at Google for over a year; what looks like a sudden breakthrough is really a smooth internal curve crossing public perception thresholds
Industry-wide convergence on reasoning/thinking models is attributed to obvious next steps, abundant low-hanging fruit, and compute/talent density—not primarily shared research
Gemini shows the raw, unmodified chain-of-thought in both AI Studio and the Gemini app; decision-making involves safety, research, and leadership input and is still evolving

Episode summary

Jack Rae, Principal Research Scientist at Google DeepMind and technical lead on Gemini's thinking/inference-time scaling work, joins The Cognitive Revolution to discuss Gemini 2.5 Pro—a release the host describes as the first time a Google model ranks #1 across many important dimensions, particularly for long-context command (he tested it with a 400K-token research codebase). Rae attributes Gemini 2.5 Pro to a full Gemini team effort spanning pre-training, thinking, and post-training, and notes that long-context capability and deeper thinking/reasoning have stacked unusually well together to unlock new problems.

Rae endorses the OpenAI obfuscated-reward-hacking paper's warning against heavy RLHF pressure on the CoT, prioritizing that thoughts remain faithful to the model's computation
Open to reasoning in latent space (MuZero analogy) if interpretable; sees capability and interpretability research as likely to advance together rather than diverge
AGI roadmap: agents + much deeper reasoning + a true memory breakthrough (not just context scaling) + deeper multimodal integration
Gemini 2.5 Pro system card and full technical report will come at general availability; the experimental release has still gone through extensive internal/external safety testing and red teaming

Source material

Transcript

Hello, and welcome back to The Cognitive Revolution.

Today, I've got the honor of speaking with Jack Rae, principal research scientist at Google DeepMind and technical lead on Google's thinking and inference time scaling work.

As one of the key contributors to Google's blockbuster Gemini 2.5 Pro release, Jack has tremendous insight into the technical drivers of large language model progress and a highly credible model.

Gemini 2.5 Pro, as I'm sure you know, marks a significant milestone on Google's AI journey.

It's the first time that many observers, myself included, would rank a Google model as the number one top performing model across many important dimensions.

And this is not just about topping leaderboards.

In my initial testing of Gemini 2.5, which I conducted before Google's PR team reached out to schedule this interview, I experienced one of those rare moments where a model actually exceeded my expectations, forcing me to reevaluate my sense of what's possible today and inviting me to reimagine my workflows to take advantage of its unique strength in not just accepting, but actually demonstrating incredibly deep command of hundreds of thousands of tokens of input context.

This is a practical step up that I could feel almost immediately.

So naturally, I jumped at the chance to talk to Jack about all the work that went into it and how he understands the current state of play in a bunch of critical conceptual dimensions.

We begin by asking why techniques like reinforcement learning from correctness signals appear to have suddenly started to work so effectively across the industry.

Does this represent a proper breakthrough, or is this more a culmination of steady incremental progress that has finally crossed important thresholds of practical utility?

We also unpack the reasons that nearly all frontier model developers are releasing similar reasoning or thinking models in such a short period of time.

Is this simultaneous invention driven by obvious next steps, or is there more cross-pollination somehow happening behind the scenes?

We then consider the relationship between reasoning and agency.

Will these reasoning advances translate to agentic capabilities, or is something more still needed?

From there, we look at the role of human data in shaping model behavior.

How does Google think about collecting human reasoning and step-by-step task processing data?

And how intentional has Google been in training models to follow recognizable cognitive behaviors versus letting them develop their own problem solving approaches during the training process?

We also exchange intuitions about the relationship between models' internal feature representations and the patterns of behavior they use to leverage them, consider whether reasoning in latent space should scare us or can be made safe via mechanistic interpretability, and discuss whether the application of reinforcement learning pressure to the chain of thought itself should be avoided, as OpenAI recently argued in their obfuscated reward hacking paper.

Finally, we'll discuss the roadmap from our current capabilities to AGI.

What are the remaining bottlenecks?

Do we need a memory breakthrough, or will continued scaling of context windows be enough to overcome all practical limitations?

And should we expect deep integration of more and more modalities, as we've recently seen with text and image?

Throughout our conversation, Jack provides thoughtful, nuanced responses that absolutely should help us improve our understanding of today's AI systems, the work going on inside frontier labs, and the overall trajectory of AI development.

Personally, I leave this conversation with the sense that for most developments we see from the frontier labs, the simple explanation is the best one.

There's still a lot of low-hanging fruit left in large language model development.

Researchers have internalized the bitter lesson and are trying to keep their approaches as simple and scalable as possible.

And the rapid progress we observe is mostly the result of pursuing pretty obvious high-level conceptual directions, and then methodically chipping away at the practical engineering challenges required to make them work at scale.

The teams involved, as you'll hear, are seriously concerned with developing the technology safely, but are also feeling both a high level of genuine excitement and competitive pressure that keeps them moving forward as quickly as possible.

As always, if you're finding value in the show, and I definitely think this is one of the higher alpha episodes we've done, we'd appreciate it if you'd share it with friends, write a review on Apple Podcasts or Spotify, or just leave us a comment on YouTube.

And, considering that the future is radically uncertain and the stakes are crazy high, with outcomes from a post-scarcity, disease-free utopia to an existential catastrophe or even outright human extinction, all live possibilities in just the next two to twenty years, I take my responsibility in making this show extremely seriously, and I earnestly invite your feedback and suggestions.

You can reach us either via our website, cognitiverevolution.ai, or by DMing me on your favorite social network.

Now, I hope you enjoy this insider's perspective on scaling large language model thinking and the path from here to AGI with Jack Rae, Principal Research Scientist at Google DeepMind.

Jack Rae, Principal Research Scientist at Google DeepMind and Technical Lead on Google's Thinking and Inference Time Scaling work, welcome to The Cognitive Revolution.

Cool, thank you so much for having me.

I'm excited for this conversation.

Congratulations on Gemini 2.5 Pro Experimental 03-25, I think it is.

The long name doesn't reflect what a big release this is, and obviously that's a common trope in the model wars these days, but it is a big deal.

In my estimation and in my testing, this has been the first time that I would say a Google DeepMind model has been the number one model in many important respects. It has also given me one of those hair-raising moments that don't come along too often, although remarkably often. When I dumped a full research codebase into the thing, 400,000 tokens, and said I want to extend this, I want to reuse as much as I can, but I want to make a really light touch and not mess with other people's code because this is sort of a shared collaborative space, I was really amazed by how much command the model had of the super long context. It was hair-raising because it did feel like a qualitative difference, a very immediately noticeable step up.

So, we're all still adjusting to what it can do and calibrating ourselves, but I think, as the kids say these days, it is safe to say that you guys have cooked on this one. So great work, and I'm really looking forward to understanding a lot of the work that went into it.

Yeah, and I'm going to probably say this a lot, but we're super happy with this model, and we are really happy with the trajectory of our models. This one was a true Gemini team effort. I'll probably touch on this, but this was a knockout performance from the pre-training team, from thinking, from post-training, from many areas across Gemini just really pulling this together. We feel pretty good about it. We liked it internally, but we didn't know exactly how it would be received. It's great to see that people are really finding it useful, they're really feeling the AGI with it, and they're seeing noticeable deltas on real-world tasks, so that's been very cool to see. I really appreciate the praise, and I just want to say, this one was kind of a Gemini team full knockout, but I'm really happy to talk about some of the model development and especially things on the thinking side.

Cool, well, let's get started with a question that I've been thinking about a lot recently, and I think a lot of other people have too. Of course, I'm sure you guys used more complicated techniques, but here I'm really thinking about the R1-Zero demonstration that a really simple RL setup with a correctness signal can work now, and I'm wondering why that didn't work sooner. I assume many people tried it in many contexts, and I'm not sure if they were missing something, or the models were missing something, or what it was that kept that idea at bay for a while. And now, of course, it seems to be working everywhere.

Yeah, I suppose from my vantage point, we've basically been leaning more and more on RL to improve the model's reasoning ability for quite a while, for at least a year within our Gemini large language model. As we've been releasing models, there has been a greater and greater presence of reinforcement learning for accuracy-based tasks, so we're getting a very discrete, verifiable reward signal and using that to improve the model's reasoning. We've actually been doing that since before thinking even started, we've been shipping models with that, and it's been helping the model's reasoning process. So the way I see it is, this has been improving thanks to a lot of amazing reasoning researchers and RL experts for a while, and it has hit a bit of an inflection point in progress where it's really captured people's attention. Maybe it feels like there was a threshold moment for a lot of people, maybe around, say, the DeepSeek technical report. But I think it's been working for a while. There hasn't been one key thing which has discretely made it work; it's just crossed the capability threshold where people have really taken notice.
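
As a rough illustration of the "discrete verifiable reward signal" Rae mentions, a correctness reward can be sketched as a function that checks a model's final answer against a known reference. This is a minimal sketch under stated assumptions, not Google's recipe: the `Answer:` marker, the function names, and the exact-match scoring are all hypothetical choices made for illustration.

```python
import re
from typing import Optional


def extract_final_answer(response: str) -> Optional[str]:
    """Return the text after the last 'Answer:' marker, or None if absent.

    The 'Answer:' convention is an assumption made for this sketch.
    """
    matches = re.findall(r"Answer:\s*(.+)", response)
    return matches[-1].strip() if matches else None


def correctness_reward(response: str, reference: str) -> float:
    """Binary reward: 1.0 if the final answer matches the reference, else 0.0."""
    answer = extract_final_answer(response)
    if answer is None:
        return 0.0  # no parseable final answer counts as incorrect
    return 1.0 if answer == reference.strip() else 0.0


# Hypothetical sampled responses to the prompt "What is 12 * 12?"
samples = [
    "Let me think... 12 * 12 = 144. Answer: 144",
    "I believe it is 124. Answer: 124",
    "No final answer given.",
]
rewards = [correctness_reward(s, "144") for s in samples]
print(rewards)
```

In an RL loop, per-sample rewards like these would feed a policy update; that step is omitted here because the conversation does not describe it.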

Interesting, so fair to say you see that what may seem to outsiders as sort of an emergent phenomenon as more of a mirage, under the hood it's a pretty smooth curve.

I feel that's how I see it. A lot of these capabilities, when we internally track these things, are going up with sometimes almost scarily predictable improvement, almost like a Moore's law style improvement. What I've come to notice, and this happened also during my time in pre-training, is that for each piece of improvement to the reinforcement learning recipe or the model recipe, you don't always know what will help, so there's a bit of stochasticity there. But as you accumulate things, there is this trend of improvement. And then usually what I feel happens in the public domain is that it just crosses these thresholds occasionally where people really take notice and get very excited, and it captures people's imagination. Crucially, the model just gets sufficiently good that it really feels like a step change, especially with these discrete releases that we make.

Yeah, so that's my perspective on it anyway.

Yeah, that juxtaposition between smooth progress on sort of leading indicator metrics and then the threshold effects of downstream tasks is one of the most interesting dances in the entire field I think and probably will be for a while to come.

On your just like personal production function, obviously there's everything going exponential in the space right now and the number of papers and different techniques being published is in keeping with that.

How do you allocate your time or how do you think about allocating your time between reading and keeping up with research that the rest of the field is doing versus keeping your head down and just pursuing your own ideas and are there any AI tools that are making that more manageable for you right now?

Yeah, I mean in terms of reading research to kind of doing coding running experiments and things, on some level I don't know whether my own experience is just influenced by also just like career progression and changing how I work but earlier on in my career I'd spend a lot of time reading research.

There's so much to brush up on and also it felt like maybe at conferences and things this is where all of the kind of action happened and it was really about consuming a lot of different ideas and things.

Now I feel like, and this could just be partly because I've switched from more of a junior research role to something where we're kind of directing things a little bit more, there's just a lot of very known problems of which there's no research out there that has the solution.

The solution is going to be discovered amongst the group of people that I'm working with day to day.

The amount of time I spend reading research has gone down definitely a lot since if I took myself from now to five years ago or even ten years ago.

But yeah, I still find it very inspiring and useful when people are publishing cool ideas.

I still take the time, I use X, I follow people, I use arXiv filters to try and filter out interesting papers or blog posts or podcast interviews or YouTube videos.

A lot of this stuff is coming through different formats now.

In terms of tools, I feel like I have, I know this may sound predictable or cliche, but right now I do use Gemini a lot for reading and summarizing and asking questions of papers especially because that has been kind of its forte for a long time.

I feel like I can trust its ability to ingest certainly a whole paper, but even sometimes a collection of papers if I want to add in a bunch of cited papers and then ask questions or ask for summaries.

That's pretty useful especially because as you read research more and more you start to get a bit more demanding on just cutting straight through to the critical idea and the critical results.

Sometimes it's just a bit hard to do that if you don't have the time to parse through the text by brute force and look for what you need to know.

It's very useful to have the model do this.

That's one thing Gemini's long context ability is really good at.

It's been very good at question answering and summarizing for a long span of technical text.

So I kind of like it for that.

I use that.

That's my go-to tool.

So another kind of striking observation about the field right now is I guess close to all maybe not quite all of the frontier model developers have pursued what from the outside appears to be like a very similar trajectory over the last year and we basically see now a whole new class of reasoning models which follow a similar paradigm where they sort of have a chain of thought, where they're thinking for a while, and then they give you a final answer.

That convergence is something that I would like to understand better.

And I don't know if it is just simultaneous invention, because the conditions were so overdetermined as to make that the next logical step, or, the other theory that you hear, that people are meeting up at these infamous San Francisco parties and sharing what they're working on over drinks or whatever.

So how would you describe your understanding of why everybody is kind of developing seemingly very similar ideas in parallel right now?

Yeah, I think it's just a phenomenon that has existed even before the invention of SF tech parties.

People are always looking for where there's avenues of progress and I think even very small bits of information that we can see a model is improving in a certain way, people very quickly notice that now, especially now we have an unprecedented number of smart people working in AI.

An unprecedented amount of compute that allows us to react quickly and we're seeing that just follow through to an unprecedented level of speed and velocity when there is a new paradigm, let's say test time compute in this case and there's a bunch of performance and capability to explore in this domain then people will flood into it very fast.

Even if I think of how this has kind of unfolded within Google within Gemini, we assembled the reasoning groups to work on the specific topic of thinking and test time compute in September/October time and within a month or so of just focusing around this space we're finding what we felt like were modeling breakthroughs that were very exciting.

That led us to shipping our first model in December, an experimental model based on Flash with thinking. If I think about and reflect on how that team's progress went, there was just a very natural process of people exploring in this space and really getting involved, more and more people thinking about it and running experiments, and progress happened very fast. I would imagine that is a common phenomenon now within these very talented research groups, and that's why you get to see suddenly a bunch of reasoning models within a short time span of each other.

There's just like a very natural phenomenon of curiosity and exploration and talent right now.

So people are always super kind of motivated to find the next big breakthrough and explore it as fast as possible.

So can I summarize that as the idea itself was a pretty obvious candidate and the sort of density of low hanging fruit, you know, or the richness of that vein was just so striking once you started to mine it.

That sort of accounts for all of the leading developers at least exploring that a bit and then all of them finding that like, yes, this is really a way we clearly should be investing a lot.

Yeah, that's at least how things have unfolded within Gemini. We had been seeing a lot of initial signs of this making sense and had some initial results. Fortunately, this whole thing required a deep confidence in applying reinforcement learning on language models, which is something that we at Google were very comfortable with, interested in, and working on. In that respect, it was a low barrier to entry to really explore this space and then find a bunch of really cool capability breakthroughs from thinking.

So it was kind of a natural extension for us.

I can't really comment on other labs, but I imagine there must be similar things happening across the board.

Yeah, one really like small detail but I wonder how you would contextualize this for me is in the R1 paper they had said that they tried reinforcement learning on smaller models and basically couldn't get it to work and I, you know, they seem to be pretty cracked as the kids say so that, you know, seems like they would have been trying something pretty smart.

Later though it does seem like it's kind of working everywhere.

I mean, any light to shed on, you know, what would account for it if somebody in the recent past was trying to apply reinforcement learning on somewhat less powerful base models and couldn't get it to work?

Like, does that sound right or wrong to you?

Yeah, it's completely valid.

These things are way more difficult, I think, than people often realize to get working well.

Even pre-training.

Pre-training, I think, people now put in the bucket of completely solved, completely obvious.

I was working on pre-training let's say six years ago where training a large language model, 100 billion parameters or more, was like a, there was a million components of things that could go wrong or diverge and it was kind of in an alchemy stage.

Training reinforcement learning on these powerful language models and getting them to reason and think more deeply.

I imagine people have tried and failed many times because there's a lot of very key crucial details to get right.

So I just think it's hard and it requires like a lot of things to be fixed.

And when you have like five things broken, it can be very difficult to like, you know, you might find one thing that's broken, you fix it and nothing changes and you get disheartened.

And at some point maybe you feel like this just doesn't make sense, this won't work.

And then it just requires a few iterations of that until more and more things are lined up and then the whole thing starts to shine.

I feel like we saw some initial sparks that were very cool last year, where just with reinforcement learning the model was using thinking and we started to see really cool phenomena happening during the thoughts, like self-correction and exploring different ideas.

That's exactly what we would have hoped would just emerge from reinforcement learning but we didn't really know like if it was possible until we kind of saw it for ourselves in our own experiments.

And we'll continue our interview in a moment after a word from our sponsors.

Yeah, so how do you...

those sort of cognitive behaviors as they're increasingly commonly known are, you know, obviously there are multiple different ways that those can come to exist in a model.

Some, you know, one possible explanation for like why the RL maybe doesn't work on smaller models is that you need a big enough scale of model and training to have those, you know, sort of begin to take shape at all in the model so that the reinforcement learning can sort of bring them out.

But you can maybe also get them to be learned during supervised fine-tuning or, you know, maybe if you just do enough RL they can sort of pop out kind of semi-randomly.

How much work do you guys do to sort of sculpt, you know, and really curate those cognitive behaviors versus how much are you sort of seeing arise, you know, at which stage of the training process?

Yeah, I think like people have different opinions on this.

We're a pretty outcome driven team so like at the end of the day we'll do whatever like recipe gives us the best results and best model generalization, the best like final result.

But I think you know, taking one step back from that, there are some kind of like priors and opinions in this space.

One school of thought which I'm quite in favor of is like whatever the simplest recipe that leads to the model, like choose that one.

So there's a bit of an Occam's razor.

If you can impose less, fewer and fewer priors into like what the cognitive faculties should be and you can still get a really powerful model so everything is more purely learned from data that always feels like a better approach.

That said we, you know, we explore human data, we use like model based synthetic distillation data, we try and have a lot of things if they can arise like from end to end reinforcement learning.

So we try everything and then in terms of like the final model and the final mixture we just go with what works best with some kind of preference for simplicity and generalization.

So yeah I don't know if that's a satisfying enough answer.

Obviously we can't like kind of go deep into what our training recipe is and it's also always evolving so fast.

That's the general kind of principles we use.

Yeah, that makes sense.

Don't expect you to spill all the secrets.

The human data obviously has some nice upsides, in that we would expect that models trained on it might be a little more human-like.

I obviously don't overstate how human like they become but you know, you would at least hope you might avoid.

I guess I wonder have you seen, you know, one of the famous tidbits from the R1 paper was that they reported this language switching behavior in the context of the chain of thought.

I've also personally seen that from Grok.

I have not seen it from Gemini.

Is that something that you guys observed, that or other sorts of weirdness in the chain of thought? And did you take any action to try to select against those sorts of weird behaviors, or maybe not necessarily select against, but set proper priors so they didn't come online in the first place?

Yeah, I think, well, ultimately one principle is that we want the model to use its thinking tokens to just be a smarter and better model. From that perspective, there may be some slightly weird phenomena happening in the thinking tokens: they might get quite cyclic, or the model may appear to be emitting text that's not so useful all the time. But if it leads to the model then being much stronger in solving the problem, one philosophy is that you should just let it do that.

This is supposed to be a scratch space for the model to figure out how to respond with the best accuracy, safety, factuality, etc.

That said, you know, we did notice some things about the thoughts.

One, the Gemini thoughts are usually in English.

They usually prefer to be in English.

We actually found the model was quite strong at i18n reasoning tasks, so we call it i18n, but non-English reasoning tasks. It would mostly perform its reasoning in English, though, and that raised one question: is this a bad product experience, or should we allow it to do that if it helps the model be quite strong at these reasoning tasks?

So that was like one debate over this one.

So it's kind of like not language switching you could say for the thoughts and just sticking in one language.

Another was, you know, some of the thoughts especially in the original Flash thinking launch were quite templated.

The model would often choose to use quite a formulaic structure for how to break down the problem and then formulate a response.

And that was another line of research.

Do we want this to be very templated?

Ideally not.

It should be quite natural.

It should be the model thinking through the problem and not necessarily always, it feels like if it's maybe always adopting a particular template then maybe it's not getting the most benefit out of that thinking compute.

And there are other aspects of the thinking tokens.

We obviously want it to be efficient and maximally benefit the capability of the model.

So those are some topics we're always thinking about.

Yeah, cool.

Okay, that's interesting.

Just to make sure I have a clear understanding of what I am looking at when I look at the chain of thought, is it fair to say that what is being shown, I actually mostly use the AI Studio.

Maybe you could comment on that if it's at all different from the, you know, the Gemini app itself.

Yeah.

But am I correct that I am seeing the full raw like unmodified chain of thought?

Yeah, that's right.

We launched in December and then launched again in January and with 2.5 Pro in all cases you're seeing the like raw chain of thought tokens from the model, both in AI Studio and on the Gemini app.

You know, this is something we're always thinking about.

It's not clear what the best thing to do honestly is.

People do like to see the raw tokens.

At the same time, they can be quite verbose.

We might want to create summaries that are actually more useful.

We might want to do other transformations.

There was a cool piece of work in NotebookLM where there's a kind of thought explorer with a graph, and you can follow different ideas in a graph structure.

It's still a pretty new space, and I think we haven't finalized the best way to surface thoughts. Right now they're the raw thoughts.

Yeah, interesting.

So we're just wondering sort of what the, what if any debate went into the decision to share the full chain of thought because obviously OpenAI initially chose not to and cited a mix of reasons but I think most people sort of interpreted it primarily as a competitive consideration that they didn't want to share the full chain of thoughts.

Everybody could just go run and distill or do SFT or whatever on their work.

That does not seem to have proven a durable moat for them, but I wonder what considerations or what debates you guys had as you decided, yeah, let's go ahead and share the whole thing.

Yeah, I feel like these kind of decisions, they're often a mixture of input from safety team from the researchers, from leadership and it really is kind of a complex decision.

I couldn't give you like a very specific roadmap but for each release it's like very carefully considered.

We have our leaders, like Koray and Demis, who will often want to have a very good understanding of the pros and cons. For me, this isn't something I weigh in on, so I'm not really the best person to ask, but I just try and make sure all the models are incredibly strong and we have a lot of good options on the table.

Yeah, so I think it's an area of active kind of exploration.

We haven't settled.

We're not like fixed on one particular way of surfacing these thoughts, and in fairness also for OpenAI, you know, I don't know why they chose to show summaries. We could speculate.

They did give us some reasons but I'm sure there could be a mixture of reasons that go beyond just things like distillation to other aspects.

I think there was like an initial worry from some group of people that maybe if we show thoughts then we have to then start RLHFing thoughts to make them look really nice to users and maybe we don't want to encourage models to have deceitful thoughts.

There's another school of thought which is once you have these thoughts it's great for interpretability and you can understand how the model formed its like output.

So I guess there's just a whole debate going on about what's the best way to ingest and communicate this content.

From my perspective I just want to make sure the thoughts are resulting in a way stronger answer, way more capable model and that's my main kind of concern.

Yeah, so is it fair to say then that you don't concern yourself with how the chain of thought looks to the user?

I mean, OpenAI of course recently also put out this obfuscated reward hacking paper where they showed that fears of reinforcement learning on the chain of thought are not entirely unfounded.

They showed that when they started off with a model that learned to reward hack and then they put pressure on the chain of thought to not reason about reward hacking that initially would tamp down the reward hacking behavior but then later you'd see the reward hacking behavior come back without the reasoning showing up in the chain of thought thus the obfuscated reward hacking.

So it seems like there is something quite concerning there.

I guess do you see that as like concerning?

Do you sort of endorse what I take to be the conclusion of that paper which is like, "Thou shalt not select intensively on the quality of the chain of thought?"

I think we show the chain of thought right now as part of these experimental model releases and we're trying to get feedback and learn from real user behavior.

This is often an incredibly important aspect of releasing any technology and then we're very seriously taking in feedback, looking at how these things are used in practice and then make more educated decisions on how to kind of surface information from chain of thought in future and safety is definitely one thing that plays into a big part of that decision.

But I guess to just put a little bit finer point on it like you could do RL on the chain of thought for any number of different objectives to try to make it more readable or to try to avoid weird cyclic behaviors or to try to tamp down reasoning about reward hacking which may have this downstream negative effect but there's definitely a strong school of thought out there that says, "Don't do that."

Do you see that as a strong taboo because of the obfuscation that it can create or do you think there's some way to do it and not have such a big problem?

I think it's a pretty safe angle to say that we want these thoughts to actually improve the factuality, safety, capability of the model.

We want it to have that scratch space.

We also do, like if we're going to be showing thoughts then we want them to be interpretable and faithful to the computation that the model is taking and we probably don't want to add training objectives which encourage things like deceit.

So I think that's a very valid point.

Yeah, okay, cool.

Going back to the mix of different data types and human data for a second, I have tried in my own work a bit to get people to record their chain of thought even back before all this reasoning stuff.

I just personally found that when fine-tuning a model, by simply including example chain of thought in my fine-tuning dataset, usually for just one or a very small number of tasks, I would get a lot better performance.

So as I've worked with other people to try to help them build their AI applications or automations or whatever, I very often am like, okay, what I need you to do is staple your pants to the chair and I don't really care how you do it.

You can do it in text, you can turn on your webcam and record yourself whatever.

But I need your live chain of thought as you, the expert, whose work we're going to try to automate, actually do the work like we need to know not just what your inputs and outputs are but how you're thinking about it, why you're making these little incremental decisions along the way.

I find that to be really hard to get out of people in a lot of situations and this may be a little bit outside of your specific responsibility set but I wonder what you/the broader team have kind of learned about how to coax that data out of people, if anything, or maybe it's just so hard for you guys as well that you're sort of just like, oh god, we'll go with synthetic.

But what's the state of actually eliciting human chain of thought out of humans?

Yeah, well, your question I guess had two components.

One was like how do you get that process data so it's not just prompt and then solution or response but it's actually what was the process that led to the solution and then there's something like chain of thought which I guess is one instance of that.

Funnily enough, I think it's really hard to get people to transcribe actual chain of thought faithfully.

It's a pretty latent thing and actually, I think part of the reason all of these models, especially Gemini, are able to click into this mode well is because people have already detailed their own thinking process.

Maybe when not put under kind of the task of doing this explicitly but even in essays or various pieces of work or online discussion, people will often break down how they're going to solve the problem or why they're writing what they're writing.

So there's already in the pre-trained model a bunch of examples of what does it mean to reason through a process and that's partly why even before we were really trying to bring this out and make it really powerful with reinforcement learning, you could do things like prompt the model like let's think about this step by step and it was basically doing this zero shot.
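
As a rough illustration of the zero-shot prompting mentioned here, a minimal Python sketch follows. It only builds the prompt text and calls no model; the helper name `build_cot_prompt`, the `Q:`/`A:` layout, and the sample question are assumptions for illustration, not something described in the episode.

```python
# Illustrative only: builds a zero-shot chain-of-thought prompt as plain text.
# `build_cot_prompt` and the Q:/A: layout are hypothetical choices.

COT_TRIGGER = "Let's think about this step by step."


def build_cot_prompt(question: str, trigger: str = COT_TRIGGER) -> str:
    """Append a reasoning trigger so the model writes out steps before answering."""
    return f"Q: {question.strip()}\nA: {trigger}"


if __name__ == "__main__":
    print(build_cot_prompt("A train travels 120 km at 60 km/h. How long does it take?"))
```

The point of the sketch is that nothing beyond the trigger phrase is added: the pre-trained model is relied on to continue the text with intermediate reasoning.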

What I found is though when you put people artificially and say now you have to record all of your reasoning towards a problem, so when it's not happening organically but when it's happening under a directive, it seems to be quite hard to get a lot of value out of that kind of data.

But I think that is a bit separate to your other question which was like how can you record process and I think that is very valuable.

If we can get more and more examples in training of the process which people are naturally using to solve their tasks, that feels very valuable.

I'm just not so sure people are so good at describing their inner monologue and then training on that and that being useful when asked to do that.

So when you talk about recording process, are you imagining computer use, like how people click around and interact with the environment or what sort of recording are you envisioning there?

Yeah, I think more in this kind of space where you're going to solve a more open ended task and you have to do a lot of things, intermediate calculations, maybe actions for example.

Yeah, that's kind of what I have in mind.

But you know this is really kind of, I would say part of this question is really kind of then moving into like what's the best way of getting more kind of agentic data and things.

That kind of side of things, it really isn't my area of expertise so I'd be not the best person to chat to about that.

Does that mean you see like a significant distinction between reasoning and agentic behavior?

Because I think a lot of people right now have the sense that the reasoning is going to be the unlock for the agentic behavior.

No, absolutely, yeah.

I just feel like reasoning and agentic behavior, as a research thing, are very tightly coupled, but you can still kind of segment out a part: the critical research questions for acting and creating environments for agents. For that part, you know, we have a really good group, we compartmentalize it, and there is a group of people that work on that.

The thinking area really collaborates when it comes down to like the reasoning behind actions or behind responses.

Yeah.

Okay, so you mentioned a minute ago that people struggle to write down their thoughts in part because it's a sort of latent thing so I want to take a turn into the latent space with you if you will.

First of all, I'd love to just give you a sort of undoubtedly overly simplified understanding of what's going on in a model as it's reasoning and then have you kind of critique or elaborate or expand upon it.

So my general working model has been that the pre-training process determines what abstractions or representations or features, whatever you want to call them, a model has to work with, what concepts it has basically, and then post-training determines the patterns of behavior by which it sort of deploys those concepts and puts them in juxtaposition against each other and sort of tries to figure out a path through to a solution.

My sense is, well, react to that.

Yeah, I think one way of maybe paraphrasing what you're saying, I largely agree is pre-training can learn this massive bag of function approximators that allows you to model the whole distribution of both good, bad behavior, strong reasoning behavior, incorrect reasoning behavior, you get kind of everything.

You can try and mold it a little bit with your selection of your pre-training data, but it still is really trying to reflect all types of behaviors and really just trying to understand.

So the better you can predict the next token, the better you can compress this text, maybe even the better you can understand the whole distribution.

During post-training you're going to drop a lot of modes, you're going to drop a lot of types of behavior, and really try and fixate on a couple of types of ways of reasoning, ways of responding or acting on to various different tasks that are important, and then hopefully if we do reinforcement learning really well, you are also going to then learn to compose maybe some more primitive skills to build up your skill set towards this smaller set of important tasks.

I feel like I don't know if I'm critiquing or exactly mirroring what you're saying, but that's how I think of it.

I guess the distinction, and maybe this will sort of blur, I guess part of the premise of the idea has been that the vast majority of the compute goes into the pre-training, and then the post-training is by comparison a very small, maybe like two orders of magnitude less.

I think now, obviously, the reinforcement learning scale is going up as well, and maybe this sort of dichotomy is ultimately going to become a spectrum; that certainly is a common theme in everything that I study.

Maybe one way to put it is, do models learn new fundamental concepts about the world during the post-training, or is that largely learned during the pre-training, and is that going to change as we go from 1% to 10% or whatever of FLOPS being deployed in that post-training phase?

My sense is, they have to.

It's absolutely crucial, if we're going to build AGI, that during the reinforcement learning stage we're not just kind of reshaping known concepts; we have to learn new skills, especially if we want these models to eventually completely surpass us at very critical tasks.

It can't then just be reshaping the knowledge that it's seen from behavioral cloning during the pre-training stage.

I think, yeah, that's one of the most exciting research directions we're all in right now, is how do we get the composition of reinforcement learning to help cycle up these models' capabilities to being incredibly powerful and general and robust.

I would totally bet on it being during reinforcement learning.

So another big, maybe the one frontier model developer that hasn't joined the reasoning party in full force at this point would be Meta.

They did put out though, I thought, a very interesting paper, although kind of a scary paper from some points of view at least, about reasoning in latent space where instead of actually cashing out to a token at the end of a forward pass, they would just take the last latent state before that final decoding, pass that in as the token for the next, as the embedding for the next token position, and just kind of let the model chew on its own thoughts for however many forward passes in a row.

To me, there is something quite scary about that.

I would like to be able to know what my AI is thinking as much as possible.

There also were some nice features about it.

There's an attractor state there, I think, where it's like it required fewer forward passes to kind of reach similar performance.

There was some evidence that they could do breadth first search as opposed to having to go depth first, which seems to be more the pattern that the explicit chain of thought lends itself to.
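
To make the mechanism described above concrete, here is a toy Python sketch contrasting an ordinary token step with a latent step. Everything in it is an assumption for illustration: the "model" is a tiny random linear map, the dimensions and the names `forward`, `decode`, `token_step`, `latent_step`, and `think` are hypothetical, and it is not the paper's implementation.

```python
import math
import random

DIM, VOCAB = 4, 6
rng = random.Random(0)
# Toy stand-ins: W plays the role of the network, E the (tied) token embeddings.
W = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(DIM)]
E = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(VOCAB)]


def forward(inputs):
    """Toy forward pass: returns a 'last hidden state' for the input sequence."""
    mean = [sum(v[i] for v in inputs) / len(inputs) for i in range(DIM)]
    return [math.tanh(sum(W[r][c] * mean[c] for c in range(DIM))) for r in range(DIM)]


def decode(hidden):
    """Project the hidden state onto the vocabulary and take the argmax token."""
    scores = [sum(h * e for h, e in zip(hidden, E[t])) for t in range(VOCAB)]
    return max(range(VOCAB), key=scores.__getitem__)


def token_step(inputs):
    """Explicit chain of thought: collapse to a token, then re-embed that token."""
    tok = decode(forward(inputs))
    return inputs + [E[tok]], tok


def latent_step(inputs):
    """Latent reasoning: feed the hidden state back as the next input embedding."""
    return inputs + [forward(inputs)]


def think(prompt_tokens, n_latent):
    """Run n_latent latent steps, then decode a single visible token."""
    inputs = [E[t] for t in prompt_tokens]
    for _ in range(n_latent):
        inputs = latent_step(inputs)
    _, answer = token_step(inputs)
    return answer
```

In `token_step` every intermediate thought is forced through a discrete token that a person can read; in `latent_step` the intermediate vectors never pass through the vocabulary, which is the source of the interpretability worry raised here.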

So what do you think about reasoning in latent space?

Like should we be scared of it?

Should we taboo that?

Or are there some ways that we could embrace it safely?

Yeah, okay.

I think tabooing a piece of technology before it's been researched and understood, I'm never in favor of, unless there's incredibly strong arguments to do so.

In this case, I would say that the reason that people could kind of raise a question mark over it is this interpretability question.

We need those latent vectors to be interpretable.

I actually want to draw an analogy.

So I'd say we should pursue it if it leads to better thinking, and it can be interpretable and made safe.

Why not explore this direction?

It seems very promising.

And actually I want to draw one analogy to, I don't know if you, this is kind of pre-LLMs, but MuZero.

MuZero was an extension.

We had AlphaGo, and AlphaZero, and MuZero.

So there's kind of a series of algorithm developments.

Obviously AlphaGo was the moment where we had a reinforcement learning model kind of beat the world champion at Go.

And so the difference from AlphaZero (which essentially only used self-play, no SFT; there were many other algorithm improvements, but that's the tagline) to MuZero was that instead of essentially unrolling over states, which is what happens in AlphaZero, they unrolled in latent vectors.

Those vectors could still be decoded into states.

And there were a lot of advantages that they found with MuZero to being able to kind of search in this latent space.

So I was pretty inspired by that.

And often when I think about thinking in latent space, I think of this MuZero; that was definitely the most powerful one of that series.

It was the most powerful progression.

And they still could make it kind of interpretable because they could decode states from these latent vectors.
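
A toy sketch of what "unrolling in latent vectors" can look like, loosely following MuZero's three learned functions (representation, dynamics, prediction). The function bodies below are hand-written placeholders rather than trained networks, and the names and the discount value are assumptions for illustration.

```python
# Toy placeholders for MuZero-style learned functions; real systems use
# neural networks here. Names and constants are illustrative assumptions.

def represent(observation):
    """Representation: map a raw observation to an initial latent state."""
    return [float(x) for x in observation]


def dynamics(state, action):
    """Dynamics: predict (reward, next latent state) without the real environment."""
    next_state = [0.9 * s + 0.1 * action for s in state]
    reward = sum(next_state) / len(next_state)
    return reward, next_state


def predict(state):
    """Prediction: policy and value estimated from a latent state."""
    policy = [0.5, 0.5]
    value = sum(state)
    return policy, value


def unroll(observation, actions, discount=0.99):
    """Evaluate a plan entirely in latent space: discounted rewards plus final value."""
    state = represent(observation)
    total, scale = 0.0, 1.0
    for action in actions:
        reward, state = dynamics(state, action)
        total += scale * reward
        scale *= discount
    _, value = predict(state)
    return total + scale * value


if __name__ == "__main__":
    print(unroll([1.0, 1.0], [1, 0, 1]))
```

The environment is consulted only once, to produce the first observation; every later step of the plan is imagined by `dynamics`, which is what a tree search would call repeatedly.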

So I think it's quite possible that this could be a very promising direction.

I wouldn't rule it out at this stage.

It seemed a good idea.

Yeah, I guess the skeptic, or the safety hawk might say, it's all well and good when you're talking about game states that you can sort of decode to in a quite high confidence way.

I mean, ultimately there is a game state that this thing sort of has to operate in, and we know what that is, and it's like it can't go off into far, far away places.

But we don't have a similar sort of ground state that we can feel so confident in when it comes to what exactly is going on inside a general purpose AI.

And I've spent quite a few hours reading the outstanding work that Anthropic just put out about tracing language model thoughts.

And I think the headlines of that have unfortunately maybe led a lot of people who are not in the field to a high level of overconfidence in our ability to really understand what's going on.

I mean, as much as I think the work itself is awesome, I tend to also look at like, well geez, the replacement models that they create can only explain 50% of the behaviors, and there's a lot of error terms that are being added in to sort of make sure all this is being explained.

So I guess big picture, do you think, my sense is that the field at large does not think that we're going to get interpretability working well enough by the time we expect to have powerful or transformative or whatever you want to call it, AI, to really be confident in what the models are thinking or like why they're doing what they're doing.

What's your overall outlook for interpretability?

Do you think it will get there faster?

And we really will know what they're thinking as we get these powerful systems everybody's expecting?

Yeah.

There's a rapid advancement in capability.

What I usually believe is that these also transfer not only to the models doing tasks like coding or agentic tasks that people find to be useful in the real world, it also does accelerate mechanistic interpretability.

So if we have more powerful models we have more powerful tools to examine these questions.

So it's not super clear to me that the capability is going to kind of improve exponentially and our ability to do kind of mechanistic interpretability or safety work is going to improve linearly and we're going to have a massive mismatch.

I would imagine the two are going to track each other, but actually to your question about kind of latent vectors versus thoughts in tokens like this is a really good point.

In any case you want some really good piece of research and tools eventually, artifacts, that can try and trace how close the actual content of the thoughts is to maybe the underlying computation and thus what the outcome of the model's answer will be.

I feel like that is just a very interesting research problem and yeah, it's great.

That was a really cool piece of work from Anthropic.

We have really cool people working on this within Gemini.

It's a really important problem and we should try and solve this in any case whether it's latent vectors or continuous tokens.

It seems like people like this kind of and need this kind of interpretability from the model.

Yeah, I think it's a huge, especially if we're going to have these things like running large swaths of the economy or, you know, heaven forbid, the military which seems to be more and more the kind of thing certain people are dreaming about.

Knowing why they're doing what they're doing is, you know, it seems to me an imperative.

One of the big challenges there with the interpretability where like the auto interpretability might be a huge unlock or might be sort of a spinning plate that we could see sort of crash at any given time is the auto labeling of features.

This is another one where again, the Anthropic work is just beautiful.

The interface, the way they've published it where any of these features that appear in line in the post you can kind of expand and see like what are the actual passages from the dataset that caused this feature to fire.

Some of them, I have to say, I look at them and I'm like, I would not have come up with that label.

So this becomes like quite philosophical.

Maybe I'll ask it in a philosophical way.

I'm sure you've seen the paper called the Platonic Representation Hypothesis.

I wonder to what degree you buy that hypothesis, and what that means to me is like, there's sort of a convergence between models with growing scale, which seems to suggest that they may be converging on some one true world model.

Do you think that that is actually what is happening and by extension, with further scale, should we be more confident in our reading of what the models are doing?

Maybe could you paraphrase the question a little bit.

Are you saying across all the different models that are being trained as they're growing in scale they will start to converge more?

Or actually I wasn't sure exactly.

Yeah, and maybe more deeply and philosophically, are they converging on some actual representation of reality that we can trust as being well grounded?

Well, I would say that the only place I feel like I have a very strong theoretical kind of conviction is what is happening with pre-training where as we're approaching and just decreasing perplexity, improving the compression of the text that we see if we could do this to kind of hit the noise floor, hit the entropy of the text and kind of have this Bayes optimal text compression, then we would have the model which kind of best understands the world model which generated this text.

That is a thing I feel like has a very clear mathematical grounding, from Ray Solomonoff's work, even Claude Shannon's work that's always referenced, of how the optimal text compressor would have the best world model of the generation process which generates this text.
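
The link between prediction and compression referred to here can be shown with a few lines of arithmetic: an ideal entropy coder spends about -log2(p) bits on a token the model assigned probability p, so a better next-token predictor yields a shorter encoding, and perplexity is 2 raised to the average bits per token. The sketch below uses made-up probabilities; the function names are hypothetical.

```python
import math


def code_length_bits(probs):
    """Ideal code length: sum of -log2(p) over the probabilities the model
    assigned to the tokens that actually occurred."""
    return -sum(math.log2(p) for p in probs)


def perplexity(probs):
    """Perplexity is 2 ** (average bits per token)."""
    return 2 ** (code_length_bits(probs) / len(probs))


if __name__ == "__main__":
    weak_model = [0.25] * 8    # guesses uniformly among 4 options each step
    strong_model = [0.5] * 8   # twice as confident in the right token
    print(code_length_bits(weak_model), perplexity(weak_model))      # 16.0 4.0
    print(code_length_bits(strong_model), perplexity(strong_model))  # 8.0 2.0
```

Lowering perplexity and shrinking the compressed size of the text are therefore the same objective, with the entropy of the text as the floor.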

That does sometimes feel like a philosophical argument though because even that object is not what we really want for AGI.

It's not just something which has the optimal understanding of the dynamics that generates text that exists today.

We want the model to be trained to then go and do something useful and to be able to faithfully follow the instructions that we give and to do complex tasks that have maybe never been done before, to generalize to completely new and unseen environments.

All of those aspects I feel like are not covered by that kind of world model description of what's happening in pretraining and that's why I think even though I've spent most of my career on pretraining, pretraining is not the only component to building AGI.

At some level I think it sounds like maybe I agree with the hypothesis that you described, but I also question its relevance.

I don't think it's the full story of how we build AGI so maybe it's kind of something that's been down-weighted in my mind as being the only story I should think about.

I do think that once you are starting to get into the realm of training these models with reinforcement learning at scale, they're definitely not all converging to one model.

Actually there's a lot of responsibility in doing this well such that we really build the systems that are useful.

I don't feel like right now you can even see it on the ground.

The models are quite different already.

There's already a lot of different pros and cons across them and a lot of capabilities that we work very discretely on within Gemini to kind of make them more useful in certain domains that I don't think just naturally arise across the board, across all models.

So yeah it still feels very steerable.

It doesn't feel like one eventual process towards one kind of world model of everything.

It still feels very directable from the research side.

But you know I'm not a philosopher.

I just try and make these things work really well.

I feel like I would be very interested in hearing what a couple of philosophers that are keeping up to date on AI would think about this.

Yeah it seems like to summarize, and you know I think the empirical sciences definitely have a lot to inform the philosophers on as well, especially these days.

It seems like you're sort of saying the world model itself is something that maybe everything is converging on, but how you navigate that world behaviorally is still a vast scope of possibility where there's not a single right answer and that's kind of where the taste and the safety and all these sort of things have a lot of space to explore and diverge.

So maybe last 10 minutes or so how do you map the roadmap from here to AGI and obviously I don't mean like in a detailed technical sense, but sort of one big thing that Gemini has is really long context.

Do you think that we can just sort of, I guess my read of that is that nothing too crazy is going on there, that it's basically just a matter of scale it up and have some data where you have to actually have command of long context to succeed and the model will learn from that.

I may be oversimplifying, tell me if I am.

But is just kind of continuing to push on that going to be enough or are we going to need some sort of like more integrated more sort of holistic process of memory and forgetting to really have these sort of long running agents that people imagine.

I guess is memory something you think is already solved if we just push on our current levers or do we need some sort of conceptual breakthrough?

Yeah, I mean it's a good question.

When I joined DeepMind in 2014 I started in an area that was called episodic memory.

Memory is like what Demis, he did his kind of PhD looking into episodic memory and imagination and things and so yeah, like I've always been very inspired by human memory, human episodic memory, the hippocampus and my own PhD was on lifelong reasoning with sparse and compressive memories like how do we have a memory system and a neural network that is expressive and has this kind of huge range of time spans as we have in our own kind of mind.

And you know like when I started that PhD I would have never imagined how much progress we'd have made.

We now have something that say 1 million tokens, 10 million tokens, these kind of context lengths depending on how you represent your text or your video they are starting to kind of like verge on kind of like lifelong scales.

But I still don't think memory is solved.

I don't think it's all done yet.

I think there's some really cool breakthroughs we'll have even in the memory space.

And there was a lot of very cool ideas.

We had, at DeepMind, this kind of Neural Turing Machine, Differentiable Neural Computer.

These were a mix of like large attention systems but with a lot of different like read write mechanisms.
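
For a sense of what such read and write mechanisms look like, here is a small pure-Python sketch in the spirit of Neural Turing Machine content-based addressing: a key is compared against every memory row, the similarities are turned into soft attention weights, reads are weighted sums, and writes erase then add. The sizes, the `sharpness` value, and the function names are illustrative assumptions, and the learned controller that would produce keys and erase/add vectors is omitted.

```python
import math


def cosine(u, v):
    """Cosine similarity with a small epsilon to avoid division by zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / (norm + 1e-8)


def address(memory, key, sharpness=5.0):
    """Content-based addressing: softmax over sharpened cosine similarities."""
    scores = [sharpness * cosine(row, key) for row in memory]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def read(memory, weights):
    """Read vector: attention-weighted sum of memory rows."""
    width = len(memory[0])
    return [sum(w * row[i] for w, row in zip(weights, memory)) for i in range(width)]


def write(memory, weights, erase, add):
    """Write: each row is partially erased, then added to, in proportion to its weight."""
    return [
        [x * (1 - w * e) + w * a for x, e, a in zip(row, erase, add)]
        for w, row in zip(weights, memory)
    ]


if __name__ == "__main__":
    memory = [[1.0, 0.0], [0.0, 1.0]]
    w = address(memory, key=[1.0, 0.0])
    print(w, read(memory, w))
```

Because both operations are soft and differentiable, the whole memory can be trained end to end with the rest of the network.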

My sense is probably something in this space will prevail and this will be a very cool way of having like extremely long infinite lifelong memory.

But you know it's still kind of an active research area.

But the roadmap towards AGI I suppose you know with each piece that we make it does seem to compound very well.

So like a year ago we released kind of what we felt was a breakthrough in long context.

And that has ended up stacking really well with our current reasoning and thinking work.

Because we found that there's like just a really useful coupling of being able to think very long and deeply about a problem and also being able to use a ton of context, maybe millions of tokens.

And that has ended up unlocking a bunch of extra kind of problems that we now can solve, that we couldn't have if we didn't have both of them.

So I think that the path remaining to AGI obviously agents is kind of a super high priority area.

Thinking and reasoning, it's still, it's not like we're at the kind of end point.

These models have a long way to go in terms of being so reliable and so general that you really feel like you can trust their response on more and more open ended tasks.

So from our perspective there's still a lot of like just make the system better.

There's a lot of known bottlenecks right now and we will just continue doing that.

Make thinking better and there will be, you know, within agents make agents better.

But I feel like with combinations of much better agentic kind of capabilities, better reasoning even ideally better memory systems such that we can have almost like a lifelong kind of range of understanding and reasoning across time, then that will really feel like AGI to a lot of people.

I mean the current systems to me feel like AGI.

I feel the AGI using 2.5 Pro.

It can now kind of, you know, one shot complex like code bases and that was something we felt like was a futuristic piece of technology three years ago and now it's just there and it works.

We're always hungry for the next thing.

But I think, yeah, those combination of things much better memory system with a much deeper thinking and reasoning system with a capability to work on with many different tools and action space that's very open ended, that will really feel like AGI.

And when it's coming, I think it's hard to say but it's all kind of being developed actively right now.

So I feel like it's coming quite fast and that's kind of exciting.

Yeah.

Okay.

Two more quick questions and then I'll give you the floor to share any final thoughts that you have.

Yeah.

One thing I didn't hear you mention in that description is integration of more modalities.

Yeah.

And I've been inspired to think these last couple weeks, as we've seen Gemini 2.0 Flash image out and also the GPT-4o image out, that, boy, there is a lot of power in a deep integration of the text and the image modality as opposed to a sort of arm's length tool call type of integration.

Yeah.

Do you see that happening across many more modalities?

Is there a world in the future where Gemini, whatever, whatever Pro, instead of calling AlphaFold is sort of deeply integrated with AlphaFold such that those different spaces are actually merged and sort of co-navigated in the way that we are now seeing with language and image?

Yeah.

AlphaFold's a good question.

I would say, okay, multimodal: I feel a very good design decision for Gemini was to make it multimodal first.

And it's been incredibly strong with image understanding, video understanding.

It had native image generation trained within Gemini 1.

Actually, it's in the technical report.

It didn't end up getting kind of released really as it was in its first form.

But yeah, I think to your world model question, having everything deeply multimodal is super important and training everything and getting that world model not just over text but over multimodal, video, images, audio.

That's been a cool aspect of Gemini and it's great to see now these things are launching, people really liked the native image generation.

They loved the fact that you can edit images, you can do a lot more interactions instead of just calling what would be just as pure text to image model as a tool and it's very static.

So anything that you can bring in to the world model and train jointly, you're going to have a much deeper experience and understanding.

I think that's very cool.

And then it goes to like, yeah, what's the dividing line?

Where do you decide when to bring things into the pre-training mix and have them jointly trained?

That's a really difficult question.

I think right now what you're seeing kind of across the board is a pragmatic choice of almost like the most compressed information sources and large information sources first and then they've been built out.

So that was why text, I think, was a very natural starting place. A lot of these large generative models started with text: it's so compressed and so knowledge rich, and it's available at scale.

But then the decision of how to grow this out to maybe smaller scale sources of data or slightly less kind of information compressed is a difficult one.

I know in like bio for example, genomics, it's very cool to try and co-train like genomic, generative models with a large language model.

People look into that.

And I don't know where the dividing line is, but it's going to be something about how much you get from co-training versus just calling as a tool, how much positive transfer there is from all the world knowledge within your kind of text and video and image space to this new task.

If there's not much positive transfer, maybe there's not much benefit in co-training it and maybe you just want to learn to use this tool.

Yeah, I think those are the main decision factors of whether you should bring it all into one world model or leave it as a separate expert system.

I'm betting on the one world model, but we'll continue to watch the space.

So last question and I really appreciate your time and coming in and sharing so much alpha with the community here.

But one question people would definitely be upset with me if I didn't ask is, where is the system card for Gemini 2.5 Pro?

We sort of thought we were going to get them and it seems like the last couple models we haven't and so I don't know if there's a policy on that that determines when a model actually gets the full technical report treatment.

Yeah, the approach with experimental releases is this: we release these models because we really want to get them into the hands of consumers and developers, get real feedback, and understand their limitations. But the experimental tag means we don't do the full provisioning of these models, and we don't necessarily have all the artifacts like system cards.

We are moving as fast as we can to get these into a stable state where we feel like they're ready for general availability. There will be system cards when the model is made generally available.

Has all the sort of safety testing been done though at this point?

We do an extensive, industry-unprecedented level of safety testing before we release models, but for the experimental models maybe there's a different level, a different tier of testing that we undertake. Part of the experimental release is getting this kind of real-world feedback, which is also a useful part of the testing process.

But these releases go through a very standard process in terms of the policy team and the safety team, and there's a lot of red teaming happening. But yeah, right now we're in this experimental stage, and we're racing to get towards general availability, which will have even better provisioning and things like system cards.

I was at a Cloud event last week and there were a lot of questions like, "When will it be made available on Vertex?" and I'm like, "Soon," and then it ended up being the next day.

So in some of these cases we under-promise and over-deliver; these things are happening pretty fast.

The technology is also moving very fast.

Does that red teaming process include the third party red teamers?

Do you guys work with, like, Apollo or Haize Labs or METR?

I mean, you know, the usual suspects.

Yeah, so we publish these Gemini technical reports and we usually detail external red teaming, but I can't comment on who our partners are at this stage.

There's I think good reasons why we don't always discuss who our red teaming partners are, but yeah, we do work with external red teamers.

Gotcha.

When the technical report comes out, that will have the roster of the external partners?

I'd have to check, but my understanding is in our past technical reports this is something we acknowledge.

Yeah.

Yeah, okay.

Cool.

Fantastic conversation.

I really appreciate you working through all these questions with me.

And I guess maybe just in closing, any other thoughts or notions that we didn't touch on that you'd like to leave people with?

Yeah, I'm curious: you've played with 2.5 Pro a little bit so far.

Are there any things you found it was unlocking that you haven't seen before, or any feedback you had?

The long context for me was the thing that felt different.

I have a kind of general complaint with almost all RAG apps, regardless of whether it's an IDE-integrated one or otherwise, and often this is a business problem more so than a technical problem.

I pay a flat monthly amount for whatever product.

They want to have some margin.

So they set the hyperparameters in a way that tries to give me the best performance they can while also not spending too much money and burning all the cash they have.

So that, I find, typically leads to not enough context being included in the model calls, and so often there's something that could have been there but wasn't, which was leading me to not get as good an answer as I could have.

So what I often do, if I can, is just print my entire code base to a single text file and then paste that into the model.

And I do a lot of small personal projects and proofs of concept.

So usually I can get away with a hundred thousand tokens or whatever.

I can put that into any of the leading models.
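
The single-file dump described above can be sketched in a few lines of Python. This is a minimal illustration, not the host's actual script: the function names, the list of file suffixes, the skipped folders, and the 4-characters-per-token size estimate are all assumptions chosen for the example.

```python
import os
from pathlib import Path

# Hypothetical defaults: which file types to include and which folders to skip.
INCLUDE_SUFFIXES = {".py", ".md", ".txt", ".json", ".yaml", ".yml", ".toml"}
SKIP_DIRS = {".git", "node_modules", "__pycache__", ".venv"}


def dump_codebase(root, out_path, suffixes=INCLUDE_SUFFIXES, skip_dirs=SKIP_DIRS):
    """Concatenate matching files under `root` into one text file.

    Returns (file_count, char_count) for what was written.
    """
    root = Path(root)
    out_path = Path(out_path).resolve()
    file_count = 0
    char_count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for dirpath, dirnames, filenames in os.walk(root):
            # Prune skipped directories in place so os.walk does not descend into them.
            dirnames[:] = sorted(d for d in dirnames if d not in skip_dirs)
            for name in sorted(filenames):
                path = Path(dirpath) / name
                # Skip non-matching file types and the output file itself.
                if path.suffix not in suffixes or path.resolve() == out_path:
                    continue
                text = path.read_text(encoding="utf-8", errors="replace")
                # A header line per file keeps the folder layout visible in the dump.
                chunk = f"===== {path.relative_to(root).as_posix()} =====\n{text}\n\n"
                out.write(chunk)
                file_count += 1
                char_count += len(chunk)
    return file_count, char_count


def rough_token_estimate(char_count, chars_per_token=4):
    """Very rough size check; the chars-per-token ratio is an assumption, not a tokenizer."""
    return char_count // chars_per_token
```

Calling `dump_codebase("my_project", "dump.txt")` and passing the returned character count to `rough_token_estimate` gives a quick sense of whether the dump is likely to fit in a given context window; an exact count would need the model's own tokenizer.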

But this recent one was a research code base. I happen to be the least valuable author on the emergent misalignment paper.

Long story, but I call myself the Forrest Gump of AI because I sometimes wander through these important scenes as an extra, and this happened again here.

But I had this research code base, and it's not production code.

The folders are sort of like, you know, Daniel folder, Nathan folder, right?

It's not best-practices software engineering.

We all know that, but we're all just exploring stuff.

This was 400,000 tokens.

So it was significantly too much for me to put into any other model.

And the command that it had of it was just incredible.

Previous Gemini models obviously could handle that much, but I was never 100% sure if they were really in full command or only in partial command.

But this felt to me like really incredibly strong command of that full context window, and that did feel like a real game changer.

Without having strong benchmarks or anything to really ground myself in, my feeling is that I can take dumps of information and have much higher confidence.

I still don't want to be overly trusting, of course, but I feel like I can take dumps of information where I don't even necessarily know what's in there and be much more confident, if still not fully confident, that the 2.5 model will latch on to what is actually important and help me navigate this super deep context, even if I myself don't have a good sense of what's in there at the start.

So that to me does feel like a huge difference because it's one thing to be able to sort of help you navigate long context if you know the long context yourself.

But it's a very very different thing if it can help you navigate long context that you don't have great command of.

I think there's more work to do to really validate that, for myself and obviously for the community at large and you guys, all working together, but it feels different. I can say that for sure.

I mean, that's great to hear. I worked with a lot of the long-context people last year when we were in the run-up to the original breakthrough, and because I used to be in pre-training for a long time, I communicate a lot with some of the people who have been particularly focused on making long context really good for 2.5 Pro. There was a lot of work, not only in the initial phase to make 1 million and 2 million tokens possible (and we'll see more happen), but also to make it really effective.

With the 2.5 Pro release, I actually forget the name of it, but I think there is an external leaderboard that's shared on X a lot, where at 128K context Gemini 2.5 Pro is using it way more effectively than basically any other model out there right now.

Which is cool.

So it's not only that it can go to a million tokens; especially with 2.5 Pro, it feels like it's read everything. It's not dropping things, it's not missing key details. It feels like it's read and studied all that information, and that gives people a bit of an AGI feel, where within a second it feels like it has studied a very large code base and knows every detail at quite a good level of understanding.

That's quite a remarkable kind of thing.

But yeah that's great to hear.

Yeah, well, it's well-deserved praise.

With these step changes, I'll never forget where I was when I first tried GPT-4, and there aren't that many moments in the last two and a half years where I felt like, oh, this is qualitatively different from everything I had used up until that moment. But this was one.

It really did have that quality, where it was like, okay, I can feel a new level of unlock, and I'm going to have to recalibrate myself a little bit to what this makes possible.

So it's definitely an exciting time.

This has been fantastic. I really appreciate it.

Jack Rae, Principal Research Scientist at Google DeepMind, thank you for being part of The Cognitive Revolution.

Great, thank you so much for having me. Cheers.

It is both energizing and enlightening to hear why people listen and learn what they value about the show.

So please don't hesitate to reach out via email at tcr@turpentine.co, or you can DM me on the social media platform of your choice.