
The Cognitive Revolution · 2026-06-03
Nested Learning & Continual Learning: Ali Behrouz on AI Architectures Beyond Transformers
Hosts: Nathan Labenz
Guests: Ali Behrouz
Why it matters
Nested learning introduces multiple update frequencies in model components, enabling rapid context adaptation and long-term knowledge retention.
Key claims
- Nested learning introduces multiple update frequencies in model components, enabling rapid context adaptation and long-term knowledge retention.
- The "Language Models Need Sleep" framework proposes active and sleep phases, where sleep enables memory consolidation and self-modification via synthetic data distillation.
- Attention mechanisms are viewed as infinite frequency update modules, essential for fast context adaptation and likely to remain integral in AI architectures.
- Continual learning models can outperform transformers on hard tasks like multi-language translation and long-context recall, while maintaining competitive perplexity scores.
Briefing memo
Summary
In this episode of The Cognitive Revolution, Ali Behrouz, a grad student at Cornell and researcher at Google DeepMind, discusses his pioneering work on nested learning and continual learning architectures. He critiques current transformer-based models for their inability to continually learn and adapt over time without catastrophic forgetting. Ali introduces the nested learning paradigm, which incorporates multiple update frequencies across different model components, inspired by human memory systems with fast and slow learning layers. This approach enables models to rapidly adapt to new contexts while preserving long-term knowledge, showing competitive or superior performance to transformers on challenging tasks such as multi-language translation and long-context recall.
Ali also elaborates on his recent work "Language Models Need Sleep," which proposes a two-phase learning process mimicking human active and sleep phases. During the sleep phase, models consolidate memories and self-modify through a distillation process using synthetic data generated from recent experiences. This offline consolidation helps models maintain and abstract knowledge over time, addressing efficiency and forgetting issues. The conversation covers the conceptual insight that deep learning architectures are forms of associative memory and that attention mechanisms act as infinite frequency update modules, likely to remain central in AI systems.
The episode further explores the implications of continual learning for AI alignment, privacy, and human-AI interaction. Ali expresses cautious optimism that models evolving through ongoing interaction could better serve individual needs and foster a diverse, stable AI ecosystem. However, he acknowledges risks such as privacy concerns and alignment drift. The discussion also touches on the challenges of evaluating continually learning models, the potential for winner-take-all dynamics in AI deployment, and the need for new paradigms in model versioning and safety assurance.
- The nested learning paradigm treats architectures and optimizers as interconnected nested learning systems, with learned update rules improving performance.
- Challenges include catastrophic forgetting, efficiency of updates, privacy risks, and alignment drift in continually learning AI systems.
- Potential user experience shifts include long-term personalized AI relationships, with models adapting to individual values and styles over time.
- Evaluation and versioning of continually learning models require new frameworks beyond traditional train/test splits, considering ongoing updates and safety.
Source material
Transcript
Hello, and welcome back to The Cognitive Revolution.
Today, I'm excited to share a conversation with Ali Behrouz, grad student at Cornell, researcher at Google, and author of Nested Learning.
This episode was recorded a few months back, and while I normally believe that AI content does not age well, this conversation with Ali is an exception.
His work is some of the most inspired and potentially transformative that I've seen anywhere in the quest for new machine learning architectures that are capable of genuine, continual learning.
This, of course, is one of the most important capability advances on the horizon today.
Arguably, it is the main gap between today's models and a digital AGI that would be capable of joining and contributing to human teams just as humans do.
And Ali is advancing the frontier with an approach that is both biologically inspired and technically elegant.
His blockbuster paper, Nested Learning, which has been touted as a harbinger of a possible paradigm shift by no less than Jeff Dean, develops a simple strategy that allows models to rapidly adapt to their current context on an ongoing basis while preserving core knowledge, by updating different parts of the system at different frequencies, much like humans manage memory on multiple time scales, from working memory to long-term memory.
His latest work, Language Models Need Sleep: Learning to Self-Modify and Consolidate Memories, which I actually heard about live for the first time on this recording, and which has now finally become fully public, takes inspiration from how humans consolidate memories and learn from dreams while sleeping.
It introduces a new offline mode in which models transfer new knowledge from their high-frequency update layers to their more slowly evolving layers via distillation, and also learn new abstractions and connections between concepts by generating and training on synthetic data derived from their recent experiences.
In addition to the details of these architectures, which, like so many AI innovations, I find both extremely exciting and a bit scary, we also discuss how scaling for performance may shift from stacking more layers to nesting more frequency update rates.
How Ali understands all components of machine learning systems as forms of associative memory that compress a given context flow.
Why this leads him to call deep learning architectures an illusion, and how he's operationalized this conceptual insight by developing expressive optimizers that learn update rules and are capable of outperforming both Adam and Muon.
We also discuss how the attention mechanism can be understood as an infinite frequency update module and why Ali expects that attention layers will therefore remain fixtures of AI systems indefinitely.
We cover the empirical results showing that Ali's new architectures compete effectively with transformers on standard measures, while also outperforming them on hard tasks, such as effectively recalling information from up to 10 million tokens of context, and also learning to translate multiple previously unseen languages at the same time.
Finally, we discuss why Ali sees continual learning as both an opportunity and a huge risk for privacy and alignment.
How human AI relationships might evolve, and why Ali is cautiously optimistic that models that evolve over time based on our interactions with them could both serve our individual needs more effectively and also lead to a more diverse and hopefully stable AI ecosystem overall.
The bottom line for me is that for all the debate and speculation about whether or not current architectures can scale to AGI and beyond, there is a very good chance that conceptual breakthroughs will render that question moot before we even manage to answer it.
Transformers have changed the world clearly, but they aren't the end of history, and as tough as it is to keep up with AI developments, anyone who wants to get a handle on where things are going from here can't afford blind spots when it comes to new research directions like Ali's.
And so, without further ado, I hope you enjoy this deep dive preview of AI systems that learn on an ongoing basis in increasingly human-like ways.
With the brilliant Ali Behrouz.
The Cognitive Revolution is brought to you by Mercury, the fintech that more than 300,000 ambitious companies and individuals trust to run their finances.
Over the last few months, I have made tremendous strides with my personal AI infrastructure.
Today, I've got high-context instances of both Claude Code and OpenClaw running on a Mac Mini, and it's amazing what they can do.
However, until getting started with Mercury, I didn't have a great way for them to pay for things.
I didn't want to give them unrestricted access to my money, but my old bank didn't give me any other options.
With Mercury, I can create as many virtual cards as I want.
Each with its own daily, weekly, or monthly spending limit, and I can lock any card to a single category of purchase or even a single merchant.
Now, I have a card that my agent can use to buy our family's groceries and only our groceries, and I can create another anytime I want to give an agent a random one-off project that might require making a purchase.
This is honestly just the start of Mercury's AI-friendly offerings.
Does your bank offer API keys, an MCP, or a CLI tool?
If not, check out Mercury at mercury.com.
Mercury is a fintech company, not an FDIC-insured bank.
Banking services provided through Choice Financial Group and Column N.A., Members FDIC.
Thank you to Mercury for supporting The Cognitive Revolution, and now, on with the show.
Ali Behrouz, author of Nested Learning and the new Language Models Need Sleep: Learning to Self-Modify and Consolidate Memory.
Welcome back to The Cognitive Revolution.
Thank you very much for having me.
I am super excited about this conversation today.
I appreciate you for being willing to take some time and come back and do a deeper dive into your work.
I think it is super fascinating and genuinely some of the most inspired work that I have seen in recent times, and a big part of your method, as I understand it, is looking at what human cognition consists of, and identifying things that we as humans are doing that seem like they're quite important and really critical to our successful function in the world, and then trying to figure out what an AI version of that might look like, and then starting to develop the architectures, or system designs, maybe even more abstractly than architectures, that start to make those capabilities possible in AI systems, and it's really striking both like how well some of these ideas have worked, and also striking to me how elegant they feel, and how kind of right they seem, as I really take time to dig in and understand them.
So, first question, and you know, big picture, how do you think about what it is you are trying to do?
Obviously, you identify gaps in what current systems can do.
How do you think about those gaps?
How do you conceive of what it is that you are trying to unlock with the new architectures that you're developing?
Getting inspired from the brain probably means different things for different people. For me, I really like to get inspired from the brain and, generally, evolution, and the main reason is that evolution has had a lot of data to train itself in a natural way of training, and so what we can see now is a very evolved version of a very complicated biological brain. So generally it's a great source of inspiration, but when I'm saying that, I don't mean that I want to replicate the brain or do something exactly the way the brain does it, because most of the time we don't actually know what that is. For example, I think there are different levels of understanding about how the brain works. The first one is that we know it works, so that's the first level.
The second level of understanding is that there are some modules in the brain, each of them responsible for different parts, and we have memory, we have other things, and that's generally the process. I think the hard part when we want to get inspired from the brain is deciding at what level of granularity we want to focus on the brain and get inspired from it, because if we go too much into the details of how the brain works, there are two issues. One is that we are overfitting ourselves to one specific form of intelligence, and another one is that we don't actually know how the brain does that specific thing. So generally, in all of the works that I have done, for example on Titans, but also in nested learning, what happens is we see that the models are facing some challenges in real-world applications, and then the question is: is that specific challenge solved by humans, can a human simply do something to address that challenge and overcome it or not? And then the second question is how they can do that. But again, for example, in Titans we discuss a surprise metric, we discuss how the memory should be decomposed into short-term and long-term memory, and so on and so forth, but the point is, there's a high chance that the brain is not exactly performing gradient descent to, for example, understand the surprise or something like that. That's just a high level of our understanding about how the brain works.
I think if we keep that as a source of inspiration at that level, it would be a great source, but if we go into more details, then potentially we might face some challenges, specifically because our understanding of the brain is changing over time, and so we might overfit to one specific set of design choices.
So that's generally one thing. But about nested learning, I think one core thing that is missing in the current models has two parts. One is about how they can adapt to the environment and the context that they are in.
Another point is about how the model can understand new knowledge and incorporate it into its parameters over time, while making sure that it doesn't face catastrophic forgetting, which means that, for example, one specific task that it has been trained on is forgotten and it no longer has any skill in that specific direction.
So I think these two are important things about the current models, and they are facing a lot of challenges there, because if you have a very large model, then your model needs to be updated over time, and that's certainly the main reason you can see that there is a knowledge cutoff for all of the LLMs that we know of. For example, if you ask ChatGPT about one specific piece of information and then say, you are not allowed to use any tools when answering this question, then you might see that there's a knowledge cutoff, and so that's a little bit challenging to overcome.
And if you just want to keep updating the models, then there are two huge challenges. One is the catastrophic forgetting that I mentioned, and another one is the efficiency part: because these models have a lot of parameters, you cannot keep updating all of the parameters, and so it's a little bit hard.
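As a rough illustration of the catastrophic forgetting described here, the sketch below trains one shared weight on a first task and then keeps training it on a second. The two regression tasks, the learning rate, and the helper names `sgd_fit` and `mse` are all assumptions made up for this example; nothing here comes from the episode or from any paper discussed in it.

```python
# Toy illustration of catastrophic forgetting with a single shared weight.
# Task A and task B are hypothetical regression tasks chosen for this sketch.

def sgd_fit(w, data, lr=0.1, epochs=50):
    """Plain SGD on squared error for the one-parameter model y = w * x."""
    for _ in range(epochs):
        for x, y in data:
            grad = 2.0 * (w * x - y) * x  # d/dw of (w*x - y)^2
            w -= lr * grad
    return w

def mse(w, data):
    """Mean squared error of y = w * x on a dataset."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)

xs = [-1.0, -0.5, 0.5, 1.0]
task_a = [(x, 2.0 * x) for x in xs]   # task A: y = 2x
task_b = [(x, -1.0 * x) for x in xs]  # task B: y = -x

w = sgd_fit(0.0, task_a)
error_a_before = mse(w, task_a)  # close to zero right after training on A

w = sgd_fit(w, task_b)           # keep updating the same weight on task B
error_a_after = mse(w, task_a)   # large: the weight now encodes task B only
error_b_after = mse(w, task_b)   # close to zero

print(error_a_before, error_a_after, error_b_after)
```

With only one weight there is nowhere for the old skill to live, so the second task overwrites the first. Real models have far more capacity, but naively updating all shared parameters on new data can still erode earlier behavior in the same way.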
And, you know, there are some solutions for that, and I think they are also great, but I have some intuitions about why they might not work perfectly for the case of continual learning.
For example, one might say that they want to do supervised fine-tuning, that is, SFT, or do some RL or something like that, but the point is, the model can still face catastrophic forgetting on one side, and on the other side, at some point you need to transfer all the knowledge that you have in your context and pass it to the actual parameters of the model.
If you just keep summarizing the tokens, if you just keep the tokens and want to do everything about the memory and learning process in the token space, then the main issue is that, at some point, you will exceed the context length of the LLM, and so potentially you will face some challenges in that direction.
So considering all these things, the main issue with the current LLM paradigm is that they cannot continually learn and obtain new knowledge and new skills over time, and they are also limited in understanding different levels of abstraction.
Everything that is done in science is about finding something that can explain the world in the simplest way possible, and that's generally the way we learn something, because we don't want to keep everything.
We want to understand the underlying patterns that can describe that specific data, that specific knowledge, and so forth.
So this compression process, and how we can understand different levels of abstraction from the data that we have, is something that the current LLMs fall short on.
A couple of different angles I want to just probe your interest a little bit more from on this point.
Certainly I have felt the predicament of most users; honestly, at this point, I have felt the problems that you are highlighting.
In some ways I feel like the biggest advantage that I have relative to an AI today is this kind of ongoing coherence and reasonably stable identity.
I know who I am in the morning, and I kind of know what I was trying to do yesterday, and I can mostly pick up where I left off, and I probably could learn a lot more from the things that I do on a daily basis, but I at least learn some stuff, you know, and kind of take it on board, and obviously the current models don't really do that, and that is a big weakness for them.
That's why I was so excited about the original Mamba paper when that came out, because I was just like, wow, here's something that seems like it's competitive with transformers, but it has a fixed-size memory space, and obviously we can't grow the memory space quadratically to infinity, so we're going to have to have something that's bounded in size that can work, and so that was obviously a notable step, and I've done some work with the Mamba architecture as well.
Titans even more so, okay, here's another way to think about having a fixed size memory module that in that case it was updated with gradient descent at runtime, so that potentially seems like it would likely, and it certainly seems like the data supported the notion that that would make it even more powerful than the Mamba architecture, but you know, a similar kind of structure of like, okay, a fixed size thing that can like keep what it needs, and gradually let go of what it doesn't.
That seems super important.
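To make the idea of a fixed-size memory that is updated with gradient descent at runtime a little more concrete, here is a minimal sketch under stated assumptions. It is not the Titans implementation: the class name `FixedSizeMemory`, the linear key-to-value memory, the squared-error loss, and the `decay` term standing in for "gradually letting go" are all choices made for illustration. The returned prediction error is used as a simple stand-in for a surprise signal.

```python
import math

class FixedSizeMemory:
    """Hypothetical linear associative memory: a dim x dim matrix M with M @ key ~ value."""

    def __init__(self, dim, lr=0.5, decay=0.0):
        self.dim = dim
        self.lr = lr          # step size for the runtime gradient update
        self.decay = decay    # 0.0 keeps everything; larger values forget faster
        self.M = [[0.0] * dim for _ in range(dim)]

    def read(self, key):
        """Recall the value currently associated with a key."""
        return [sum(row[j] * key[j] for j in range(self.dim)) for row in self.M]

    def write(self, key, value):
        """One gradient step on ||M @ key - value||^2; returns the error norm as 'surprise'."""
        pred = self.read(key)
        err = [p - v for p, v in zip(pred, value)]
        surprise = math.sqrt(sum(e * e for e in err))
        for i in range(self.dim):
            for j in range(self.dim):
                grad = 2.0 * err[i] * key[j]
                self.M[i][j] = (1.0 - self.decay) * self.M[i][j] - self.lr * grad
        return surprise

mem = FixedSizeMemory(dim=3)
first = mem.write([1.0, 0.0, 0.0], [1.0, 2.0, 3.0])   # new association: high surprise
second = mem.write([1.0, 0.0, 0.0], [1.0, 2.0, 3.0])  # already stored: no surprise
mem.write([0.0, 1.0, 0.0], [4.0, 5.0, 6.0])           # a second, orthogonal key
recalled = mem.read([1.0, 0.0, 0.0])
print(first, second, recalled)
```

The memory never grows: however many pairs are written, the state stays a 3 x 3 matrix. With unit-norm keys and `lr=0.5`, a single step stores a pair exactly, and orthogonal keys do not interfere; non-orthogonal keys would, which is where a bounded memory has to trade off what it keeps.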
I wonder if you come at it, though also from the other angle, so so far we've kind of said like, here's something that it can't do, you know, we can do this, it can't do this, that's one way to think about it.
Another way to think about it, obviously, is like, what do we want our AIs to be like?
And so I wonder what your intuition there is, you know, today we have mostly chat bots that need something to wake them up, right?
They're like, either we have to go send them a message, or obviously increasingly we have like cron jobs and you know, other triggers that kind of get the AI to wake up and do something.
Um, but if those things don't happen, they're inert, right?
They just kind of sit there until, you know, somebody calls their number.
Do you have a sense or a vision or kind of a, you know, a dream of what your ideal future AI would be like?
That's different than that, and like, how, you know, does it look more like another person, but with AI advantages, or does it look like an LLM with weaknesses patched? I don't know.
What is your kind of, um, when you dream of a 2030 AI that you're, you know, working closely with on a daily basis?
What do you envision?
There are different aspects to answering this question.
From the technical point of view, I think most of the research in the past 40 years builds on a paradigm that says we have a pre-training, or generally training, phase, and we have a test phase.
But if we look at continual learning, we can see that, you know, there are a lot of recent studies about continual learning, how we can do that and how we can overcome a lot of challenges.
But the point is, a true continual learner doesn't have a test and train time.
So potentially, if we hear that name in any like design choices, potentially, it means that it's not a true continual learner, because there's no test, there's no train time, and so the question is, is it like a uniform process for the model or not?
And my personal opinion is that we still need at least two phases.
And how it works is that we should have one phase that the model is active.
So it actively receives information.
That might be through the user query, for example, or it might be about vision models, world models, or anything similar.
But the point is, the model receives some information, and generally it performs some computation on the input data and it's active at that point.
But on the other hand, there is another phase in which the model does not have to wait for input data.
It might not receive any input data.
It's completely locked off from the world outside of it.
But the question is, even at that time, should the model be static, without performing any computation or doing anything, or does the model need to start thinking about some process, thinking about the data that it has inside its parameters, and so forth?
So I think we can break the process, as I mentioned, into two parts.
One is the active phase, and another one is another phase that potentially we can call sleep time,
because there are no inputs, but still the artificial brain, or generally the model itself, is trying to perform some computation.
So I think that's a good way of defining different phases in this direction of continual learning.
And then I think a good model is a model that performs very well on both sides.
It should receive the information properly, process it, and understand it in the best way possible.
And on the other hand, when it goes to sleep time, it should also start processing what it has learned before and use that for self-improvement.
And so that's what I think an ideal model should do from the, you know, technical point of view.
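The two phases described above can be sketched in a few lines, with heavy simplification. Everything in this example is an assumption for illustration: the class `TwoPhaseLearner`, the one-parameter linear model split into `w_slow` and `w_fast`, the learning rates, and the use of random inputs as synthetic data. It is not the method from the paper; it only shows the shape of the idea, where fast parameters absorb new input during the active phase and a sleep phase with no external input distills the combined behavior into the slow parameters.

```python
import random

class TwoPhaseLearner:
    """Hypothetical model y = (w_slow + w_fast) * x with an active and a sleep phase."""

    def __init__(self):
        self.w_slow = 0.0  # slowly evolving parameters, only changed during sleep
        self.w_fast = 0.0  # rapidly updated parameters, changed during the active phase

    def predict(self, x):
        return (self.w_slow + self.w_fast) * x

    def active_step(self, x, y, lr=0.1):
        """Active phase: external input arrives and only the fast weight adapts."""
        err = self.predict(x) - y
        self.w_fast -= lr * 2.0 * err * x

    def sleep(self, n_samples=200, lr=0.5, seed=0):
        """Sleep phase: no external input; distill slow+fast into slow, then clear fast."""
        rng = random.Random(seed)
        teacher_w = self.w_slow + self.w_fast  # frozen snapshot acts as the teacher
        for _ in range(n_samples):
            x = rng.uniform(-1.0, 1.0)         # synthetic input generated internally
            target = teacher_w * x             # teacher's answer on the synthetic input
            err = self.w_slow * x - target     # student is the slow weight alone
            self.w_slow -= lr * 2.0 * err * x
        self.w_fast = 0.0                      # fast capacity is free for new experience

learner = TwoPhaseLearner()
for _ in range(40):                            # active phase on a made-up task y = 3x
    for x in (1.0, -1.0, 0.5, -0.5):
        learner.active_step(x, 3.0 * x)

before_sleep = learner.predict(2.0)
learner.sleep()
after_sleep = learner.predict(2.0)
print(before_sleep, after_sleep, learner.w_slow, learner.w_fast)
```

The prediction is essentially unchanged by sleep, but the knowledge has moved: before sleep it sits entirely in `w_fast`, afterwards entirely in `w_slow`.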
Hey, we'll continue our interview in a moment after a word with my sponsors.
Today's episode is brought to you by Anthropic, makers of Claude and Claude Code.
Over the last few months, Claude has helped me build and refine a personal deep context database that now contains all of my emails, Slack messages, tweets, DMs across platforms, video calls, and podcast transcripts, going back a full five years.
On top of that, we've now layered summary articles describing my relationship with hundreds of contacts, organizations, and ideas.
And now that this exists, there's almost nothing that Claude can't help with.
For taxes, I asked Claude to help me get organized.
It went through my inbox, tracked down 1099s for all 10 of my part-time jobs, and built me a comprehensive report on my expenses and donations.
For my angel investing, Claude can now draft investment memos in exactly the form that my venture fund requires, based on the calls I've had, and the emails I've exchanged with the founders.
And when someone needs a favor, Claude can often do it as well as I can.
Recently, a friend reached out to ask if I know anyone who might be a fit for a role that he is currently hiring for.
Initially, nobody came to mind, but then I thought to ask Claude, and sure enough, it identified two great leads.
Claude is the AI for minds that don't stop at good enough.
It's the collaborator that actually understands your entire workflow and thinks with you.
Whether you're debugging code at midnight or strategizing your next business move, Claude extends your thinking to tackle the problems that matter.
So, for problems worth solving, get started with Claude at Claude.ai slash TCR.
That's Claude.ai slash TCR, and check out Claude Pro, which includes all of the features mentioned in today's episode.
Once more, that's Claude.ai slash TCR.
But on the other hand, I think there are a lot of challenges.
The models that we know right now are really large, so even a simple academic paper that is presenting a new LLM or architecture or something like that needs to perform some experiments on models with billions of parameters, one billion, two billion, or something like that.
And generally, a very large model requires a lot of computation if you want to keep it updated over time, and it needs some techniques to somehow make this process possible.
And generally, for us, when we were thinking about this direction, that was the time that some ideas about nested learning started, because if we think about nested learning, we can see that at each time step, we don't have to update everything.
We just need to update a small subset of all parameters.
And so that can potentially help to overcome the challenges about efficiency.
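A back-of-the-envelope sketch of why updating only a subset of parameters at each time step helps with efficiency: give each level its own update period and count how many parameter updates happen. The level names, sizes, and periods below are invented for illustration, and the fixed-period schedule is an assumption; the source only says that different components update at different frequencies.

```python
# Hypothetical levels: each has a parameter count and an update period in steps.
levels = {
    "fast":   {"n_params": 1_000,      "period": 1},
    "medium": {"n_params": 100_000,    "period": 16},
    "slow":   {"n_params": 10_000_000, "period": 256},
}

def levels_due(step, levels):
    """Names of the levels whose update period divides the current step."""
    return [name for name, lvl in levels.items() if step % lvl["period"] == 0]

def count_updates(n_steps, levels):
    """Parameter updates with per-level periods vs. updating everything every step."""
    nested = 0
    for step in range(1, n_steps + 1):
        for name in levels_due(step, levels):
            nested += levels[name]["n_params"]
    total_params = sum(lvl["n_params"] for lvl in levels.values())
    dense = total_params * n_steps
    return nested, dense

nested, dense = count_updates(1024, levels)
print(nested, dense, dense / nested)
```

In this made-up configuration the large, slow level is touched only four times in 1,024 steps, so the total number of parameter updates is a small fraction of what a schedule that updates every parameter at every step would perform.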
So if I want to summarize what I wanted to say, I think an ideal model generally should be a continual learner, it should interact with the world, and it should have two phases: one is a very active process, and another one is about self-improvement, how it can consolidate its memory, how it can understand the knowledge, how it can connect different things that seem to be unrelated, and so forth, everything similar to what the brain usually does when we go to sleep.
Yeah, that's just my personal opinion and it might be completely wrong, but I think we shouldn't focus too much on what humans can do.
And instead of that, we need to focus more on what humans want from AI.
And I think if we think about it, we don't want to create something that is very similar to us.
I mean, that's also a really interesting direction, but personally I'm not really interested in that.
I think we have a great design for humans, but on the other hand, for the AI side, I think we need AI models that are capable of understanding what we want.
For example, I think with the current LLMs, you can see that they get better over time as we have more features, more memory; for example, ChatGPT has memory.
And, for example, Claude has some features, Gemini too, and so on, and, you know, now they are more capable of understanding what you want.
You can say, for example, write this email for me, they can simply write that for you.
And so I think that's a very promising direction in general, because we don't want to replicate human.
There are a lot of different ways that we can define intelligence somehow.
And it doesn't have to be the same thing as human intelligence somehow.
So I think here again, we can get inspired from the human, but we need to understand why we want to get inspired.
Do we want to get inspired?
Because we want to replicate what human can do, or do we want to get inspired from the nature to understand some underlying rules in the nature.
For example, just one extreme example is we cannot travel through time.
So if I come up with an idea saying that my AI model is trying to break causality in the world, then potentially that might be a wrong idea or a wrong direction, because it is breaking some natural rules of our universe.
Or again, let me give you another example.
For example, take the title that we use, that LLMs need sleep.
It doesn't mean that an LLM needs to literally go to sleep and rest.
It means that, from the human brain, it seems that there is a very general rule that it has two phases, one phase of learning and then another phase of processing it, consolidating the memory, and finding underlying patterns in the received data.
So that's a very high level of inspiration from the brain.
In short, I think that we don't want to replicate what human can do and we don't want to have human intelligence.
But on the other hand, we want to have a new form of intelligence that is designed in a good way, so that it understands human needs and can help people to do a lot of things that they might otherwise find challenging.
Yeah, certainly they're already superhuman in some ways and so the opportunity for them to balance out our weaknesses is incredible.
One phrase that you said there that I wanted to kind of latch onto is multiple ways to define intelligence and maybe I'll just give you kind of my high level pitch for like what nested learning is.
I think in the paper, there is a big emphasis, you guys place a big emphasis on kind of showing certain equivalences where you're like the way that we're doing things today is sort of a special case of a more general framework that you're developing.
The core idea, as I see it, in the nested learning paradigm, and what I think is potentially, for a simple person like myself, most exciting about it, is this: for quite some time now, we have achieved greater and greater expressivity of models by stacking more and more layers and just making them bigger, and that has worked remarkably well.
We've been able to push that paradigm incredibly far.
It's kind of crazy though that we just have this one layer stacked over and over and over again, in whatever, 80 or 120 layers, or however many layers, and that's kind of it.
It feels like a more mature solution should be somehow more elaborate than that.
But that's been the way that we've achieved this expressivity, or sometimes the term computational depth is thrown around, and what nested learning is doing is bringing a different way to the table to achieve higher levels of expressivity or higher levels of computational depth, and that is by stacking not layers but levels. What differentiates a level from a layer is that layers are the same thing repeated, or they may alternate, obviously we have these sort of interleaved architectures too, but these are things that are in sequence as information passes from one layer to the next, so a forward pass is like: pass through all the layers one by one and get to the end, and that's kind of the thing.
What the levels paradigm brings to it that's different is that different levels can have different update frequencies and with that you now have the possibility for some parts of the overall system to be much more durable and some parts to be updating like much more radically in something much closer to real time and that obviously feels like just much more aligned to like what we are right we're not like one static thing that processes information in a fully dependent way each time we have we're very much our state in any given time is very much contingent on what we just experienced but only to a degree right let go where you know my mood or what is currently on my mind is a reflection of what happened earlier today but my like big picture views about the world you know they didn't change from this morning until now so there's clearly some sort of hierarchy of different kinds of beliefs different kinds of representations different kinds of circuits that we have which are updated in some cases very quickly and other cases very slowly and obviously they like are integrated together and work together and we just haven't seen that in machine learning except maybe in a few you know very far-flung kind of experimental cases and now you're like really starting to show that with this nested learning paradigm not only can you make it work but as we're getting to a result like you can make it work in a way that is competitive with transformers and even seems to have some of these new like someone's called them micro skill advantages where you can see you know with these certain early diagnostics that like oh this can do something that's like qualitatively different than what a transformer can do even as it's also like outperforms it a bit in terms of like general perplexity type scoring so how would you react to that kind of general you know high level summary and then I also really would be interested to get your take on what is this concept of computational depth or 
expressivity. I'm tempted in some ways to make an analogy to the g factor. There's obviously the G in AGI, which is the generality, and there's also g in the context of human IQ, the g factor: this sort of intangible something that reflects how capable you are across a very wide range of things. Again, that's getting at generality. Maybe in machine learning it's as simple as saying g is sort of the loss, or something close to it. There's maybe some fundamental equivalence there, but maybe not; I don't really know. Clearly we're getting at something, and we've seen huge progress, but what that something is is another thing I would really love to get your intuition on.
We were working on nested learning for a very, very long time, more than a year and a half or so. The main issue that I personally had, and that we discussed a lot in the group of authors of the work, was that it was really, really hard to formalize what we wanted to deliver. Even in the mathematical formulation, it was really hard to write it in a formal way and say exactly what we want to do in this paradigm. After some back-and-forth discussions, I think we came up with this specific framework, saying that there is a frequency of updates to somehow give time to each module to do something while it's waiting. That's the best way of describing why we need to have multiple frequencies. On the other hand, we need a method for knowledge transfer, to somehow transfer the knowledge from one level to another. When the slow network is waiting for all the computation and information processing of the fast network, there should be some advantage for the slow network, since we are paying all these costs for the computation done by the fast network, and that advantage comes
when we have a good way of knowledge transfer between the fast network and the slow one, whether that is a lower level or a higher one. The idea is that when we have a network that is updating really fast, I can use that fast computation to give something to the slow computation side, so the slow network, or the slower level, can focus on the more high-level knowledge abstraction of the data, and the more frequently updated network can focus on fast adaptation and on processing the high-resolution data. Generally that was the way we could describe this framework: these two sides need to be there to complement each other. One is the frequency of update and the other is the knowledge transfer between the levels.
For the second question, in my opinion the current models are very efficient from one specific point of view. Why are they efficient? Because what we can get from LLMs is very cheap. For example, if you want to match their power on some tasks with a human, then potentially the cost would be very, very different. From that perspective we can add computation. The current LLM paradigm is very efficient, and we can perform more computation per parameter, or per artificial neuron, that we have in our model, and that can help us with different things. One is that we can have a smarter model. It's a very subjective term, but when we have more computation, it seems that we are performing some internal... So, take a simple LLM. Let's say that I have a simple LLM structure based on transformers. When the token comes, I do some computation on the token, pass it through all the layers, and then predict the next token, and that's how it works. But now let's assume that
for this specific token, instead of just a single pass of computation, I also perform more internal computation with respect to the past data, or generally combine or mix the data, something like that. In that case the quality of next-token prediction for this specific design choice can potentially go up. Why is that? Because it can be interpreted as a form of internal thinking process. Now each specific parameter that I have in my model is performing several computations, so it's not just one parameter, one step of computation; it's one parameter, a couple of steps of computation. That's one advantage.
Another one is about the memory perspective and the adaptation of these models. When we have a model that adapts to the context very fast, then potentially that model can learn in context. That's one of the main messages that we tried to deliver in nested learning: everything that we know of is somehow a form of in-context learning. I think it's a great thing in human language that we create new words, but on the other hand, if we create a lot of words for the same concept, it can confuse us or be misleading. So I think we should create new words to differentiate different concepts, but if we have one specific concept, we need to stick to the one word we have for that concept, to avoid a misleading process or anything like that. From that perspective, we realized that we can say everything is just a form of in-context learning. We already know what in-context learning is; now we just need to understand how what we have already been doing is a form of in-context learning. That was the part where we started showing that, for example, backpropagation is a form of in-context learning, a form of associative memory, and when it's a form of associative memory
then we can say that the general pre-training phase of the model is a form of in-context learning. Or when we go to, for example, the context of attention or RNNs and so on, again it's a form of in-context learning. When we perform gradient descent (and we can define any RNN based on gradient descent or another form of optimization process), it means that we are doing some learning on the context that is happening right now. So I think these two are the main things that nested learning is trying to address: one is generally more computation, and the other is adaptability and continual learning.
Can you describe in more specific detail what the relative sizes of the levels are, what the structures of the levels are, what the context windows or lengths of the different levels are, and what the frequencies are? Just map the thing out for us in very black-and-white terms.
Yeah, so we start from the transformer structure. In a transformer we have an attention block and then an MLP block. What is happening there is that in pre-training we have different contexts, and the attention side is trying to combine all the tokens: each token attends to all the tokens before it in the context, and so on. Then there is an MLP block, and that MLP block is responsible for long-term memory. When the model is pre-trained, the MLP block is fixed; it's not changing anymore, so it has all the information compressed during pre-training. And then we have attention. So at inference time, attention is responsible for the context that it's getting, and the MLP block is responsible for the long-term memory and the general knowledge of the world. Now let's simply extend this idea. A simple extension is that we keep attention, and then instead of just one MLP block we have multiple MLP blocks, each of them updated
with a different frequency. Why is that helpful? When you have attention, you have fast adaptation to the context. Attention is very powerful; it's like a perfect memory that caches everything, so it's great. On the other hand, you might want to have multiple levels of memory, and that's the part we define as the continuum memory system. Instead of one MLP block we have multiple MLP blocks. Your first MLP block is updated very fast. What happens in that case? The updating process of this MLP block can cause catastrophic forgetting, because this MLP block can simply forget the information that it got, for example, a couple of thousand tokens ago. But the point is, since the other MLP blocks have not been updated so far, the knowledge that is forgotten by the first MLP is still in their parameters. So when we perform backpropagation through all these layers, the knowledge can come back. It provides a kind of loop in the processing over time: the first MLP block can forget something, but if that specific data sample or skill that is forgotten is important, it can come back through the other MLP blocks, which have not been updated so far and still have the knowledge about that specific skill or data sample. That's a very simple way of extending the transformer block, and we call this variant Hope-Attention: it's a combination of attention plus multiple MLP blocks.
Now we have another variant, which is the actual Hope architecture. What we are saying is that attention, as I mentioned, is a perfect memory; it's well known, it can cache everything, it can scale, it's fast, and so on and so forth. It's great. But the point is that the update of attention still has infinite frequency.
What does that mean? It means that attention doesn't know anything about the temporal dependency of the tokens, so it needs something like positional encoding, and even with the help of positional encoding, attention is not a great model for tasks that are sequential, that require sequential reasoning or something like that. So what was our idea? Our idea was to replace attention with another associative memory that maps keys to values. That's what attention does, and now we want to replace it with another module that tries to map keys to values. One potential architecture here is Titans, so we can simply swap in Titans and have Titans plus the continuum memory system. That's a simple idea. But in the initial sections of the paper we discuss that if you have a simple linear process of update, which is what is happening inside each chunk of the Titans update, this process can be weaker than the case when we have a self-referential process. So what is a self-referential process? Gradient descent, or generally backpropagation, is a form of self-referential process. The idea in the self-referential process is that we want to learn how to learn, and how to learn how to learn, and so on; there are a lot of levels of how to learn how to learn. Computationally it's infeasible to implement all those levels. So there is one idea by Schmidhuber et al.; basically they have this idea of a self-referential model. One of the ways that we can make a model self-referential, when we have a key-value memory, is the case where the model generates its own values. What is happening there? Let's say that in our memory we want to memorize something; there is one specific event happening and you want to memorize it. Let's say you want to memorize one specific word. In our brain we have associative memory, and we are trying
to map this specific word to another concept that we already knew, so we can memorize it. You have definitely seen cases where, for example, you say: I want to map this specific word to another word that I already knew, so I can remember it. We generate the value that we want to map our keys to, so we can memorize the key as well. The self-referential process is a very general concept, but in the specific design choice that we have in the paper, the value of the associative memory is generated by its own parameters. So the model itself generates its own values and then tries to map keys to values. Potentially this process is sequential; you cannot parallelize it in a simple way, and so it has a full understanding of the causality in our data. So in tasks that require sequential thinking, sequential reasoning, or anything like that, we can expect that this model should work better than simple attention, because simple attention doesn't have the ability to sequentially understand the causality of the data on its own. That was our idea. In the Hope architecture we replace Titans with self-modifying Titans, which is exactly a Titans module that generates its own values for the associative memory. So that's the Hope architecture: self-modifying Titans plus the continuum memory system.
I think I want to spend one more beat on what you mean when you say generating its own value, because I know the transformer architecture pretty well. We've got these K, Q, and V vectors, right? And the training process modifies all of those over time. The general heuristic that I have is: for each token there's the query vector, which indicates what this token is looking for; there's the key vector, which helps indicate what other tokens have to offer; and attention sort of aligns those, right, and finds where there's a
match, basically, where there's relevance. And then the third one, the value, brings up the concepts that get fed on into the downstream layers, to make sure we have the right activations for continued processing from there. But all of those are learned, right? K, Q, and V are learned. So I'm not entirely clear on what you mean when you say that the model learns its own values, because doesn't the transformer sort of learn its own value vector as well? I'm not quite clear on the distinction you're making there. Let's assume people are at least generally familiar with the transformer and know how that goes.
So generally in a transformer, or more accurately softmax attention, what is happening is that we have a projection to Q, K, and V, and then the output goes to attention. Attention doesn't have any control over the Q, K, V projections. But when I'm saying that the model, or generally the associative memory, tries to generate its own value and then map keys to that value, I mean something like gradient descent. If we recall gradient descent, we have the previous state of the weights, W_t, minus the gradient of the loss function, and this is equal to the next state of the weights. If you look at this process, you can break the gradient using the chain rule and write it as the gradient with respect to the output times the input data. Now you can see that this takes the form of an associative memory, very similar to linear attention: W_{t+1} equals the previous state W_t minus a term built from K and V, where K here is x_t and V is the gradient with respect to the output. So this process is very similar to linear attention, or any linear recurrent model. But the very interesting part is that it is different from linear attention, because if you look at the value, which is the gradient with respect to the output,
this gradient with respect to the output is a function of W_t, a function of the current state of the weights. So of the keys and values that we have in the associative memory, the value component does not come from another component before this recurrence formula; it is generated by this recurrent process every time. That's what is going on with this self-referential process. In a very simple version, if we want to design this as a simple Titans, what is happening is that I have this x and a Q, K, V projection; I project x into Q, K, and V and then pass all of them to the Titans module. That's a simple Titans module. But if I want to have a self-modifying Titans, then all of these parameters of the Q, K, V projection are optimized inside the Titans module. So basically the model has control over modifying its own update rule and generating its own value for the memory. That's the main difference between the two.
Yeah, I think the key phrase for people to latch on to there, starting with myself, is modifying its own update rule. This is definitely a theme in general. With the Mamba architecture as well, the authors had done a bunch of previous state space model work, and the big unlock with Mamba specifically was that the way in which the state is updated at each time step became a function of the inputs. Making the actual update of the state itself dependent on the input being received at that time increased the potential for expressivity and unlocked a lot better performance. There's something very similar going on here, it seems, where you're saying we don't want to calculate the value vector too early; basically we want that to come a little bit later and be more input dependent, more
history dependent than it has been traditionally in softmax attention. Is that a good intuition, or is there something still missing from that intuition?
I think it's a perfect intuition. Generally this projection of the value is also updated inside the module. That's a really important point, because it helps with adaptability: the model itself is also very adaptive to the context. From every token that comes, the model is trying to learn something, and the way that it generates the value term for the associative memory is exactly the same as the way it updates its memory. So generating the value is also a very adaptive process.
So can you take us through a single time step? Maybe we can do this with the attention Hope and then the Titans Hope, and just highlight the little difference there, but also map it to the big picture. We've now got a new fundamental block, right? If we do the attention Hope first, we've got an attention mechanism, and then we've got multiple MLPs arranged in sequence from fastest to slowest update frequency, and then that block gets stacked into layers, correct? I'm a little bit confused, to be honest, about the idea that there's no train-test distinction, because there is still some training process where you're taking a bunch of data and running it through the thing, right? From the model's perspective there's maybe not so much of a distinction, but from the researcher's perspective you are still sitting there running a process that takes a bunch of data and has the model learn from it, which is a bulk process; it's not like a user is engaging, and there's nothing outside of that process happening at that time, right? So how is it different? When I do this for a transformer, I do have some of these really nice parallelization benefits. I am interested
to come back to understand to what degree the current hardware paradigm plays nicely with some of the stuff you've got here, and to what degree the fundamental recurrence may present challenges, but bracket that for a second. Today I can run a bunch of tokens through the thing in parallel. We can do this for batches, we can accumulate all these gradients, and then we apply the gradients, and then we have the next time step, and we keep doing that. I have a pretty good intuition for how information flows: I can visualize the forward pass in my mind, and then I can visualize the backward pass of backpropagation going through and gradually updating all the weights. How is the procedure different with the new architecture variant? What are the core things that are different from the paradigm that we're more used to?
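The familiar loop described in the question above (run batches through the model, accumulate gradients, apply them, move to the next step) can be sketched roughly as follows. This is a toy one-parameter linear regression in plain Python, not anything from the episode: the function names, the learning rate, and the choice of two accumulation steps are all illustrative assumptions.

```python
def grad_mse(w, batch):
    """Gradient of mean squared error for the toy model y_hat = w * x."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def train(w, batches, lr=0.05, accum_steps=2):
    """Accumulate gradients over `accum_steps` batches, then apply one update."""
    accumulated, seen, updates = 0.0, 0, 0
    for batch in batches:
        accumulated += grad_mse(w, batch)  # forward and backward pass on one batch
        seen += 1
        if seen == accum_steps:
            w -= lr * accumulated / accum_steps  # apply the averaged gradient
            accumulated, seen = 0.0, 0
            updates += 1  # this is the next training step
    return w, updates


# Toy data drawn from y = 3x, split into two batches and replayed 50 times.
data = [(x, 3.0 * x) for x in (1.0, 2.0, 3.0, 4.0)]
batches = [data[:2], data[2:]] * 50
w_final, n_updates = train(0.0, batches)
```

Every parameter here is updated on the same schedule, once per accumulation window, which is the baseline the question contrasts with blocks that update at different frequencies.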
I think generally the main difference comes from the update side that I mentioned. For Hope-Attention, I think it's very similar to the current paradigm; even the architecture is very similar to transformers. It is the actual transformer architecture; we just replace the MLP block with multiple MLP blocks. So when we want to do inference, the main difference comes from the fact that for each of the MLP blocks we need to track where we are: is it time to update the MLP block, or does it still have time before it gets updated? In the latter case, we use the last updated state of that specific MLP for doing the inference. If it is time to get updated, then we first update it, doing all the backpropagation over all of the tokens that we have so far in the current chunk, and once the weights are updated we perform the inference. From the research point of view, it might be a little bit hard to remove this distinction. I think even from the research point of view it might be better to say that we have evaluation time and non-evaluation time, because the model is always being updated over time and there's no training time and test time. But the point is that for one specific period of time we don't do any evaluation; we wait for some time, and after that we start evaluating the model on the different downstream tasks that we have. For anything else, from the model's perspective, it doesn't know whether it is at test time or train time, because everything is the same and it's a very uniform process. But from our side it definitely matters whether we want to evaluate the model and measure its accuracy for a specific task, and so on and so forth, or not. So, coming back to the Hope architecture: that's generally how it works for the Hope transformers, or the Hope-Attention model
that I mentioned. But when we go to the actual Hope architecture, again everything is very similar. The only difference is that the attention is replaced by self-modifying Titans, and for self-modifying Titans, again, everything is very similar to Titans, so all of the changes are just inside the model design. From the higher-level perspective the inference is very similar: the context or document goes to the self-modifying Titans, for each token we have one output, and then it goes to the MLP blocks, of which we have multiple, and so on and so forth. So everything is very similar to the current paradigm.
Do I have it right that this core block, of either the traditional attention or the self-modifying Titans module plus the MLPs, then becomes the block that gets stacked into layers? Is that right?
Yes, that's a design choice. We can have different design choices. Generally the initial and main design of Hope, for example for Hope-Attention, is that we have attention and then multiple MLP blocks, each of them updated with a different frequency. But for some of the tasks we needed to use a pre-trained model. For example, if we want to focus on Llama, then Llama is not designed with the Hope architecture. So what we have done is that, instead of going with the formulation I mentioned, attention and then multiple MLP blocks, we say this is attention and an MLP block, then attention and another MLP block with a different frequency, and then attention and another MLP block with a different frequency, and so on and so forth. So it's a design choice; you need to see which one you prefer. Do you want to use an existing pre-trained model, or do you want to start from scratch and train your own architecture? But I think practically both of them
are relatively similar; there are no fundamental changes.
Yeah, interesting. It's another great reminder of the old Ilya maxim that the models just want to learn. It's always striking to me how many of these choices end up being ones where you could go one way or the other. That was true in the Titans case, where you had three different ways of working the memory module into the larger architecture. It's definitely been true with Mamba in many ways, where you could have multiple states, you can have them in sequence, you can have them in parallel. Once you have one of these block concepts that seems to work well, you can Lego-piece it in a lot of different ways. Yes, there will probably be some performance differences between different ways to arrange the blocks, but more often than not, if you're really talking about a serious conceptual advance, you find that the exact wiring diagram is less important, and more important is the core piece that you're adding to the set of Lego pieces, so to speak, that you can use. So that's a good reminder of that. How do you think about the relationship between the different MLPs in terms of size, in terms of update frequency, and maybe in terms of learning rate? It feels like learning rate specifically might be important here, where there's potentially an equivalence: if I update one MLP every token and I update another one with a larger batch size, I feel like I could make those much more similar or quite a bit different depending on what learning rate I apply. So, size and frequency: in tangible terms, are the ones that are updated less frequently also smaller, and do they have different learning rates than the other ones? Take us through how you think about the relationship between the MLPs of
different frequencies.
Yeah, I think it really depends on the architecture, the number of parameters, and the design choices that you have. It's similar to how we cannot say what the best dimension is for transformers or for an attention block; it really depends on the person who wants to work with that attention, the use cases we want to consider, and so on and so forth. Generally the frequency of update really depends on how adaptive you want your model to be, and, for example, how you want the model to maintain its persistent memory. So I think that's pretty much a design choice. As for learning rate, I think you can treat each of these blocks the same way as you do the MLP blocks; they're exactly the same thing, except that they have different frequencies of update. It would be really interesting to see how the learning rate affects each of these blocks, but I have not done that and I'm not sure about the exact answer. My expectation is that potentially any way that we do hyperparameter tuning, we can do the same thing here as well, so there shouldn't be any differences.
Interesting. So your general default is kind of an intuition-based approach. Do I have it right that it's basically an order of magnitude: the fastest one updates every token, the next one every 10 tokens, the next one every hundred? Or maybe it's a couple of orders of magnitude? Understanding that it's not yet a fully tuned system, what did you start with, and why did you pick those things as your initial guess?
Yeah, the way that we chose the frequency of update for each of them was based on our intuition about the chunk size that we use for Titans and other models, because generally the chunk size in Titans is the part that defines the frequency of Titans, though at that time we didn't have this term, frequency. So what we used was based on our intuition about what chunk sizes are good for Titans. As far as I remember, I think the numbers that we used were possibly 128, then 4 times 128, and then 4 times 4 times 128. It was something like that, as far as I remember.
And how about knowledge transfer? How should we think about the way in which... and I guess one quick interjection question there too: there are still skip connections?
Yes, everything is similar. Yes, exactly. One thing that we tried to do, and unfortunately it caused some misunderstanding about what we were doing: I have seen some comments that, for example, some of the concepts here are already known, and something like that. But the point is that we actually tried to include all those concepts that we already knew, to show that it is a universal learning paradigm. It's not something that contradicts our current understanding; it complements and completes what we already know, in a new direction. For example, when you are doing any form of deep learning and you say, I'm using this attention here, you are actually using nested learning, but in deep learning you only see the final solution of each learning problem. You have a learning problem inside the attention, you are trying to solve a regression problem, and the nonparametric solution to that regression problem is attention. When you see everything from the deep learning side, you can only see the final solution for each component, but when you see everything from the nested
learning side, you can see the internal learning process of each component. So in general it's not something that contradicts what we already know; it complements all of the things that we knew and goes beyond that. I think that's an important part. So yes, everything can be very similar; you can have skip connections.
Yeah. So then help us understand how we should think about the roles that the different-frequency MLPs are playing. Knowledge transfer is one way to think about that, or another way to frame the question might be: how do they complement one another, how do they work together? I would be interested to understand this both intuitively and mechanistically, to the degree that you have a mechanistic understanding. How does the one that's updating fast gradually inform the ones that are updating slowly? How do the ones that are updating slowly steer the fast ones in the right directions? How do you think about the interplay between those different components?
Yeah, I think coming up with different ways of knowledge transfer is very important here, in my opinion. The main point of having a frequency for each component has two parts; I think we also discussed it in the beginning. The first part is that it helps the model maintain its memory over a longer time period. Here's a simple example. Let's say that we have twins, and one of them goes on a spaceship and moves at the speed of light. Right before that they have a very good memory together; for example, they had lunch or something like that. Then when the one who was moving at the speed of light comes back, 80 years have passed on Earth, and the twin who stayed has somehow forgotten about that specific lunch they had, because it was 80 years ago,
but on the other hand, for the one who traveled, all of the details are from just one or two seconds ago, and so they remember everything about that lunch. Why is this happening? Because of the updates to each of their memories. The person who lived through 80 years had their memory updated a lot of times, but the person who moved at the speed of light, or close to it, didn't have their memory updated that much. From this perspective we can see that the number of times that we update the memory is very important. It's a very inaccurate example, but I think the main point is clear. When we have two components, one of which is updated a lot of times while the other is slower and not updated that much, the slow one has the opportunity to learn something from the fast one, because at the time the fast one gets updated the slow one has not yet received any update. So there's a chance for the fast one to teach something to the slow one before it gets updated. I think that's the part where we can also discuss the sleep process that we have. What is the idea there? When we have multiple levels of MLP blocks, each updated at a different frequency, one simple thing is this: before updating the fast MLP block (and by fast I mean it's a relative term; when we have multiple levels there's a fast one and a slow one in each pair of consecutive blocks), we know that there's a chance that we forget something. So before forgetting something, we can transfer the knowledge of this block to the next one and then update this one. That's the part where we need a good way of knowledge transfer. For example, one way is context distillation: if you want to pass the knowledge from one MLP
block to another one, some methods of context distillation can work very well in that case, and that's very similar to what we do in the sleep process as well, in the sleep paper. So yeah, I think the main role of the knowledge transfer is to help the slow network take advantage of the fast network. Another point about having different frequencies is about the memory and how the model can manage its memory, similar to the example of the twins that I mentioned.
What is the mechanism by which the information in the fast-update layer gets moved to the slower layer? But then there's also got to be something going the other way too, right? If you conceive of them as pure perception in a sense... I guess I'm not even entirely clear on what's happening in my own brain, but certainly it feels to me like there's more information flow from my perception modules to my higher-order reasoning modules than there is from the reasoning back to the perception. But that signal is still important, right? My higher-order processes do tell my eyes where to look, and do tell them where to focus, and do say, hey, we need to zero in on this detail a little bit; I want to understand that better, so go put some of your bandwidth into this particular thing. So could you say a little more on how, mechanistically or procedurally, the information in the fastest-update layers is getting transferred, and how we're making sure that we're storing what really matters from what the fast updates have learned, but then also what is the signal that flows the other way?
Let me answer that with a very simple example. Let's say that I have model A, and I want to update its fast MLP blocks, and I want to make sure that the information in the fast MLP block is not forgotten and can pass to the slow MLP block. One simple thing that I can
do is to just copy all of the parameters of model A to model B. Now I have two identical models, model A and model B. What I do for model B is that I update the fast network, the fast MLP block. Now the parameters in the fast MLP blocks of model B are just free, and what I want to do is change the parameters of the slow MLP in model B in a way that the output of model B can mimic the output of model A. If that happens, what does it mean? What is the difference between model A and model B? Model A has all the information compressed in the fast MLP block, while all that information is gone in model B. Now if I could somehow modify model B in a way that it can mimic model A, it means that I transferred the knowledge in the fast MLP to the parameters of the slow MLP in model B. That's just one simple thing, and this process is very similar to the distillation process: we are distilling the knowledge of model A into model B. From this perspective, this is one way of knowledge transfer from the fast MLP to the slow MLP. Another example, which is very common and popular, is backpropagation. If you just sequentially connect your MLP blocks and then at some point perform backpropagation, you can transfer the knowledge of one block to another one, and so on and so forth.
So the copying and distillation process you describe, that's essentially what's going on in the Language Models Need Sleep paper?
Yes, with some additional detail. For example, what we do there is that we also add additional parameters to model B, to make sure that it has enough capacity to store the new knowledge that it has just gotten.
Does that mean also that in the nested learning version of this you're really just letting backpropagation do its thing, and you haven't really overengineered, you haven't
engineered all that much? We just have these MLP blocks, they get updated at different frequencies, and you're just letting gradient descent do its thing, and the updates are just working?
That's basically it, yes, exactly. In the Hope parts, for sure, everything is just backpropagation.
A huge takeaway from the conversation that didn't come through to me as clearly in the paper is that this is really proof-of-concept stage stuff. The fact that it works so well shows what a good concept it is, but nothing that we're discussing here has been through the same kind of thing that the mainline models have been through, where every hyperparameter has been explored and optimized, with all the little refinements that have been made over time. That hasn't really happened here. So there are a lot of questions that we still could ask about this version, that version, this configuration, this arrangement, sequence, parallel, how many layers, relative sizes, relative learning rates. There's a ton of space there still to explore. But basically, just taking a few of these core concepts, the main one being the different frequency of updates for different MLP blocks, that alone creates some pretty impressive results that are qualitatively different from what we are used to seeing. So let's take a minute and just talk about some of the results. There were a lot of different tests run in the paper, and big tables of results with a whole bunch of different metrics, some of which are your classic perplexity-scoring type of stuff. What do you think are the most important, revealing results, where people should say, ah, because I see that it can do that, I know that there's really something here that I need to grapple with?
Yeah, we have one continual-learning-style task that I personally like. The idea there is
that we have a pre-trained model, and there is one specific language that the model has not seen before, and we want to teach the model that language in context. We have all the grammar, we have a big share of the words, and we pass all of them to the model in context. Then the model learns the language, and we ask it: can you translate this specific text from that language to English? And we can see that the model can translate that text, not perfectly, but with very, very good quality. So it seems that the model is capable of understanding that language in context and then using that for a translation task. But let's go one step beyond that: instead of one language, let's put two languages in context, and then ask the model to translate different texts from each of these languages to English. In that case we can see that the model almost collapses and cannot translate either of those languages. The point is that the model cannot handle its context well and fully understand each of the languages separately, and that's generally a very hard challenge for transformer-based architectures. But when we change the architecture to Hope, or Hope-Attention (again, we have attention, but we have multiple levels of in-context learning, multiple levels of MLP blocks), one thing that we can see is that when we increase the number of levels, the performance of the model on both of these languages gets better and better. Why is that happening? Because the model has a better way of memory management: it understands that, for example, temporal knowledge that is not much needed can be stored in the first memory block, and more understanding of the language can pass to the more stable MLP blocks later. And then when we
have more and more blocks, we can see that the performance gets better and better. So in my opinion that's a very good evaluation for understanding whether the model can learn in context and, generally, continually learn.
Yeah. I remember, it's been a while since I thought about this, but I think it was with maybe Gemini 2, or maybe it was even as far back as Gemini 1, there was this metric introduced of learning a new language from basically one book. There's some critically endangered language; one person has really studied it and made a book, which is not on the internet anywhere, that explains what this language is. Then they just put that book into context and say, okay, based on this, go ahead and do translation. I don't know if this is the exact same test as the one that I was previously familiar with, or if it's a bit different, but the language here is Manchu. I looked it up; it's a critically endangered language from somewhere in China. So that's basically the idea, right? It's a language that the language models have basically no prior knowledge of. They're given a very detailed primer on this language from some anthropologist or whoever has gone out and done the fieldwork, and then their job is to apply that. I'm looking at figure eight in the nested learning paper, and what I'm taking from this is that all of the models do similarly if there's just one language, but, as you said, when you go up to two languages and double the difficulty of the task, the traditional in-context-learning transformer approach performs quite badly. And then, as I understand it, Hope 1, Hope 2, Hope 3: is that how many levels, how many different frequency-of-update components exist? So when you move from traditional to one additional, two additional, three additional frequencies of update, you
get basically almost all the way back to the original level of performance with just one language?
Yes, exactly.
Okay, that is quite interesting. Fascinating stuff. And what is MTOB there, just so I have that clear?
MTOB is another dataset. It's another language that has also not been seen during the pre-training of the model.
Okay, this is the one that I recall. So you guys added the Manchu language to this one, from late 2023. So both of these are very rare, unknown languages being translated to English, and only with the multiple layers can the models do both at the same time in one context. Very interesting indeed. I really like that answer; that's a great intuition builder. How do you think about things like perplexity scores? You've got a big table that shows perplexity and some accuracy numbers on this basic, classic battery of tests. And I should say, we're scaling these models so far up to roughly the 1-billion-parameter scale: you've got 760 million parameters and 30 billion tokens, and then the bigger one is 1.3 billion parameters and 100 billion tokens. Obviously that's not huge by today's standards, but nevertheless there's a pretty clear signal that the Hope architecture is, on just about every dimension, outperforming all the other things you're comparing it against, which includes your transformer and your Mamba and Mamba variations and even Titans; RetNet is in there, DeltaNet is in there. How do you interpret these? This goes back to that G question: is this a good measure, or is it just the best measure we have? How much stock should people put in these perplexity tables?
You know, there are some standards in the community for performing some benchmark tasks, and not all of them are the best things to do
for evaluating the model, but we just need to do them to make sure that everyone can see where the performance comes from and where the advantage comes from. Say I have one specific transformer structure, and I have an idea that if I add, for example, forgetting to the transformer, then it can perform well on noisy data. I'm not sure; I'm just coming up with it as an example. In that case, if there is no noisy data and I just test my method on very clean data, then there is no way that I can show the advantages of my approach. I think here it is exactly the same thing. We are arguing about models that do not need to be pre-trained; there's no test time, there's no train time, and so forth. But on the other hand there is still a lot of infrastructure built on test and train time, a lot of evaluations are built on test and train time, and so everyone also expects us to report something about pre-training perplexity and some evaluations, most of which are short-context language modeling tasks that do not need a very complicated model with an understanding of long-context modeling. I think I also mentioned that in the presentation of nested learning at NeurIPS: we didn't use table two and all those perplexity and language modeling tasks to argue that Hope is powerful. We just used that table to say that Hope is not less powerful as a backbone compared to other models. You can see that it performs well, but someone might say that it's marginal compared to other models. The point is that this is not the direction that we aim to solve, and it's very good that we can see that even in this direction, which is not the goal of nested learning and Hope, we can show some improvement, even if it's marginal.
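For readers less familiar with the perplexity scores discussed here: perplexity is the exponential of the average negative log-probability a model assigns to the true next tokens, so lower is better. The sketch below is only an illustration of that standard definition; the function name and the toy probabilities are hypothetical and are not taken from the paper or its tables.

```python
import math


def perplexity(token_probs):
    """Return exp(mean negative log-probability) over the observed tokens.

    token_probs: probabilities the model assigned to each true next token.
    """
    if not token_probs:
        raise ValueError("need at least one token probability")
    mean_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(mean_nll)


# Hypothetical toy values, for illustration only.
uniform = [0.25, 0.25, 0.25, 0.25]    # model guessing among 4 equally likely options
confident = [0.9, 0.8, 0.95, 0.85]    # model mostly right about each token

print(perplexity(uniform))
print(perplexity(confident))
```

A model that guesses uniformly among k options has perplexity k, which is why the metric is often read as an effective number of choices per token. As the conversation notes, it mostly reflects short-context language modeling and says little about continual learning.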
I'm always a big fan of trying to get a little bit better sense for the micro skills of different architectures so for example of course you know transformers because the full sequence the full context is in working memory at all times it's pretty hard to beat and I feel like you even sort of have like kind of a theoretical argument now that like it may be even be kind of impossible to beat in some in some of these tasks where the idea is like recall from the context window but then you know we saw things with Mamba for example where it was better at learning from like sparse signal and this was sort of a micro skill that that architecture excelled at that the transformer relatively struggled with.
What have you seen in the Hope case? I think it's very interesting because it does ladder up to the overall performance and what these things are actually good or bad at, right? The ability to recall something in context is really important when you need it. The ability to filter out noise and get to the signal that really matters is really important when you need it. So are there particular micro skills that stand out? I mean, the language translation is an interesting one in a macro sense, in that it's a hard task, but I wonder, if you drill down to these very micro building-block competencies that architectures can either have or not have, what stands out in terms of what this has that transformers don't have, or don't have as strongly?
When we are talking about in-context recall tasks, or generally recall-intensive tasks, in my opinion all those tasks are designed for transformers. They are not designed to compare architectures; they are specifically designed for transformers. Why am I saying that? Because you cannot expect a model, or even a human, to perform needle-in-a-haystack perfectly, or do recall-intensive tasks. For example, assume that you have a couple of thousand lines of code and you simply want to recall what the value of x was at some line of the code. It's almost impossible, or at least very hard, for a human or for other models to do that. On the other hand, it's pretty simple for transformers, because they have direct access to the entire history in their context, so it's very simple to just find that token and pass it as the output. In fact, in recall-intensive tasks like the in-context recall task that we have here, the gap between recurrent architectures and transformers is also
decreasing. If you compare the first generation of recurrent architectures to the transformer, you can see that the gap was much, much larger. Now this gap is getting smaller, and the performance of other recurrent models is also really great. But the interesting part for me was that Hope at least closes this gap in performance compared to transformers, while such models are not expected to do that. We expect a transformer to do that because it has an attention block, but we don't expect a compression-based model to perform recall tasks. So it was interesting for me.
And what is the MAD dataset getting at? How should we understand that? Just to replay back to you what you just said: on these needle-in-a-haystack, very difficult recall tasks from earlier in context, the transformer remains the best. The recurrent models, which only have some latent representation and don't have the ability to look back at the original raw text, don't perform as well. But with each generation of improvement, and here you've got several, that gap is closing, and the Hope architecture does the best of the recurrent ones that don't have the full explicit context in working memory at runtime. Flipping over to the MAD dataset, here the Hope architecture is performing better than everything, including the transformer. What micro skills is that testing? What should we take away from that result?
The MAD dataset is also very similar to the recall-intensive tasks, but there are different setups here. For example, one of them is noisy in-context recall: we want to perform an in-context recall task, but we have some noise in the tokens. When we have that noise in the tokens, the power of transformers that I explained in the previous setup, which was pure in-context learning, now is its
weakness, because it can simply get confused about which token is noise and which token is not, and so forth. So potentially this task becomes a little bit harder for a transformer compared to a model like Hope. But again, that also depends on the memory management of the architecture. If the architecture doesn't have a very good memory management system, or generally update mechanism, then potentially it can simply get confused by the noise as well and face some issues. But if the memory management is strong, then it's much simpler to filter out all those noise tokens in the task. So that's one thing. Another task that I think is interesting here is about compression. The name explains the task itself: we want to compress the tokens and predict one single token that is the compressed version of a set of tokens, and then reconstruct the original sequence from that. Potentially it's a simpler task for models like RNNs, because they already know how to compress the data properly, but on the other hand a transformer has a harder time performing this task. As I mentioned, I don't want to go into all of the details, but generally all these tasks are somehow modified versions of recall-intensive tasks or in-context recall. But the point is there are other aspects, for example selective copying is another one. There are other aspects to the model that are very important, and we should also see how the model performs in those aspects, and not just overfit our evaluation to one specific metric.
Cool. I think that's probably enough on the really low-level stuff. And I think this "illusion of architecture" title of the paper starts to click for me on page 39 of this paper, where we get to the part where you
also have a new optimizer that is outperforming not just your old standard Adam but even Muon. It does come with a little bit of computational overhead, but I think again the argument is that it more than pays for itself in terms of faster convergence or just better learning. Is there anything you want to add on the M3 optimizer, as you call it?
First, I'd also like to add one point: generally for optimizers it's a little bit hard to say that, for example, this specific optimizer is more powerful than the other one. It really depends on the problem setup, or even the problem. For example, here we are evaluating the optimizer on a vision task, but on the other hand, if you train a language model you might see that the trend is completely different. So generally, the design of an optimizer, and saying which one is better than the other, really depends on the task, the architecture, the problem setup, and all these things. That's also one of the main points that we wanted to deliver in nested learning, because what we are saying is that the entire architecture with its optimization process is just one interconnected system of nested optimization problems. Why is it interconnected? Because the gradients on the optimization side are generated by the architecture. If you have a simple architecture, then the gradients are very simple; if you have a complicated architecture, the patterns in the gradients can be very complicated. And when you have a momentum term, momentum is a form of associative memory that is trying to compress gradients. So if your gradients are very complicated, you need a more powerful memory management system for your momentum; if it's a very simple architecture, then even simple gradient descent without any momentum might work very well. So in general, one of the arguments that we have in the paper is
that we should see everything as an interconnected system and try to design something that altogether results in a good model architecture, or generally a machine learning model in a very general sense. That's one argument. Another thing is that we wanted to deliver the message that the architecture side is very similar, or somehow exactly the same, as the optimization side. All of them are just some learning rule, there is some learning process happening, and the only difference between the architecture side and the optimization side is the context. The context of the optimization algorithm is the set of gradients that we have, and the context of the architecture side is the set of tokens that we have. So they are very similar. In the paper we had this continuum memory system: we extend the MLP block, saying that you can have multiple levels of frequency for the MLP block. That's a very general idea, and in the entire paper we are arguing that architectures are the same as optimizers, and so on and so forth. So why not borrow that technique from the architecture side and apply it to the optimization side? That was the main motivation: to show that this continuum memory system that we have designed not only works well for architectures but also works very well on the optimization side. So we simply extend Muon: instead of one specific memory it has multiple memories (in the case of M3 it has two memories), and so it's trying to compress the context at different frequency rates. That can help it better understand the global aspects of the loss landscape, and it potentially can help the model find a more effective solution.
The new paper, Language Models Need Sleep: tell us a little bit more about what's going on here. You mentioned at the top the two-phase concept: we have the memory consolidation
phase and then we have the dreaming phase. I do think it's fascinating to consider that there's creation of net-new parameter space and then consolidation, or pruning back, as I understand it, because obviously the thing can't just grow and grow forever, right? But yeah, take us through this in more detail. I'm fascinated to learn more about it.
Generally the main idea, as we discussed earlier, was that if we have a truly continual learner, then there's no test and train time. On the other hand, we need to have one active time, where the input is coming in an online manner, and also a time when we don't have any inputs, so the model is not actively receiving information from the outside. But that doesn't mean that the model should be static. It means that the model just doesn't get input, but it can have some internal computation to improve itself. That's a very general concept; we can incorporate more and more components into the sleep time. It doesn't have to be just these two specific parts, but these two were really relevant to the research that I'm doing, so I just did that; potentially it can include any other form of self-improvement. So that's just one way of breaking the life of a continual learner into active time and sleep time. What we have in the sleep time right now (and again, we can incorporate more components into it) is that we want to make sure that when we update each of the components of the model, we don't forget the knowledge that is stored in the parameters. What is the idea there? We know that there are multiple blocks, each updated at a different frequency, and again, fast and slow here are just relative terms; it doesn't mean that there's a single slowest or fastest. So we have a slow and a fast, which can be any part of the neural network, and we want to transfer the knowledge from
one to the other. In order to do that, what we do is the distillation process that I mentioned, but the distillation process is based on on-policy distillation, so the model itself generates some data. One interpretation of this process is that we distill the knowledge of one small model into a larger model: when we want to go from one step to the next, we activate new parameters in the next level, so it can help the model release some of its capacity and be ready to accept new knowledge. It describes a very natural way of learning in humans as well. For example, it's very common that when we learn about a new concept we don't have a full understanding of all aspects of it. But when time passes and we let our brain better understand that concept over time, and we study other things and better understand the entire process, then at some point we can see that we have a very clear picture of what's going on in that concept; we can completely understand it. Generally that's a very good way of learning, and here it is a very, very similar process. Let me also discuss it from this perspective; I think it might be a better and simpler way to understand why we need to have multiple levels and distill the knowledge from each level to another one. When we want to understand a specific concept, we have different levels of knowledge abstraction for ourselves. The first level, which is the simplest one, is to just memorize things. Let's say that we want to learn one specific mathematical rule, or a specific concept in physics or any science. How can we learn that? We start with some examples of that specific concept and then we start memorizing those examples. For example, if it's a mathematical rule, just any mathematical rule
that we can have, we start with some specific examples and just memorize them. Then at some point we generalize our understanding of all those examples, remove all those examples from our brain, and replace all of those memories with just one single memory that can describe everything that we have learned so far about that concept. Then when time passes, we have more information, we read more about that concept, and so on and so forth; again we refine our understanding of the concept and replace our previous understanding with this new understanding, which is more general and can explain more phenomena or more terms in that specific concept. That's generally the way we understand things, and there are different levels of abstraction in our understanding. So now, when we have different MLP blocks (or let's just go to any arbitrary architecture; it doesn't have to be Hope), the main thing is that each of the blocks is updated with a different frequency. In that case the fast-updating block is very similar to the memorization process, because we memorize a lot of things; there's no pure understanding of that concept, it's just memorization, and we can also forget very fast something that we have memorized. So the first block is responsible for that part. But if we want to better understand that concept, we need to do some memory consolidation. What is happening in our design is that we transfer the knowledge from the fast-updating module to the other one. But if we just simply pass the knowledge from the fast-updating block to the slow-updating block, then nothing has changed; we just transferred the knowledge without doing anything. So instead of a simple transfer, we replace that with the distillation process. Why is distillation important here? Because the previous block, or generally the fast-updating block, has compressed the concept
and somehow understood it or remembered it in some way; it's just a compression process. When we do distillation, there is another level of compression that forces the model: you don't have all those parameters anymore, you now have a smaller number of parameters to store that specific knowledge, and in order to do that you need to come up with something that is more general and can understand the underlying patterns in the data in a better way, so you can store everything in a smaller number of parameters. In that case the model would come up with better levels of knowledge abstraction, because we have forced it to, and then we just repeat this process, and so on and so forth. That's generally the main idea of memory consolidation: every time this sleep process happens, we consolidate the knowledge from one level to the next, and so on and so forth. So that's a very high-level idea of what's happening in memory consolidation. Another part is about dreaming. Why do we need this dreaming process? I think there are, like, three important points when we want to implement it. The first is that we need to have a self-improvement process. We have learned something so far (actually, the memory consolidation part can also be seen as a form of self-improvement), but if we have one specific task at hand, if we want to specifically optimize the model for one task, then this is the place where you can do it. We can self-modify the model, fine-tune it, or generally use RL to update the model and self-modify it so it can be more powerful in one specific task, and so on and so forth. That's just one advantage of dreaming. Another advantage is that in the dreaming process we need to understand the connections between concepts that seem to be irrelevant but are actually relevant. That's also what is happening in the dreaming process of humans:
we can see very weird dreams because the brain is trying to understand the connections between very irrelevant concepts and see whether there is an underlying pattern in them. So here in the dreaming process we need to have that as well, and understand different aspects of how we need to combine the different knowledge stored in different components of the model. That's another goal of dreaming. And we can just combine these two into the sleep process: after one step of sleep, the model has consolidated its own memory and, on the other hand, there is also a self-improvement process.
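The consolidation idea described above (a fast block that updates on every sample, and a sleep step that distills it into a slower block with fewer parameters) can be illustrated with a toy linear sketch. This is not the implementation from the paper: the names `wake_step`, `sleep`, and `demo` are hypothetical, both blocks are plain matrices, and the distillation step is replaced by a closed-form stand-in. For a linear student and isotropic probe inputs, matching the teacher's outputs under a rank limit reduces to the best rank-limited approximation of the teacher, which truncated SVD gives directly.

```python
import numpy as np

def wake_step(W_slow, W_fast, x, y, lr=0.5):
    """Active phase: only the fast block changes, once per sample."""
    err = y - (W_slow + W_fast) @ x                  # what the combined model still gets wrong
    return W_fast + lr * np.outer(err, x) / (x @ x)  # normalized delta-rule update

def sleep(W_slow, W_fast, rank):
    """Sleep phase: compress slow+fast into a rank-limited slow block, then free the fast block.

    Assumption: truncated SVD stands in for distillation (closed form for a linear student).
    """
    teacher = W_slow + W_fast
    U, s, Vt = np.linalg.svd(teacher, full_matrices=False)
    A = U[:, :rank] * s[:rank]   # (d, rank) factor
    B = Vt[:rank]                # (rank, d) factor
    return A @ B, np.zeros_like(W_fast), A.size + B.size

def demo(d=8, rank=2, steps=600, seed=0):
    rng = np.random.default_rng(seed)
    # Hidden low-rank "concept" that the individual examples are drawn from.
    true_map = rng.normal(size=(d, rank)) @ rng.normal(size=(rank, d))
    W_slow = np.zeros((d, d))
    W_fast = np.zeros((d, d))
    for _ in range(steps):                           # memorize examples one at a time
        x = rng.normal(size=d)
        W_fast = wake_step(W_slow, W_fast, x, true_map @ x)
    W_slow, W_fast, n_params = sleep(W_slow, W_fast, rank)
    x_new = rng.normal(size=d)                       # unseen input, answered by the slow block alone
    err = np.linalg.norm(W_slow @ x_new - true_map @ x_new)
    return err, n_params, W_fast
```

Calling `demo()` returns the slow block's error on an unseen input, the number of parameters in its two factors, and the reset fast block. In this toy setting the slow block keeps the underlying pattern with half the parameters of the fast block because the data really is low-rank; with real models the compression is lossy and the student is trained rather than solved in closed form.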
In the sleeping process, I'm seeing that there are new parameters created so as to create space in the slower-frequency-updated portions to absorb the information from the faster-frequency ones. Does that ever shrink back down? Is there a pruning, or some other side of that which balances it, or at this stage do these models just grow indefinitely throughout their life?
From a technical point of view, we cannot grow the model to an arbitrarily large number of parameters, but the point here is that it's a periodic process: we add some parameters and then we free them for the next step of consolidation. When we are in the first block, we add some components. When it reaches its capacity, it means it is time to consolidate the memory to the next step, and when we consolidate all this knowledge to the next step, we remove all the extra capacity that we added to this level and free it for the other levels, the faster levels, so they can also consolidate their memory into this block. So generally it's a periodic process: we add components and remove them.
I see. Okay, interesting. What more can you tell us about the dreaming phase, just a little bit more practically: what's going on there? Certainly when I try to introspect into dreams, I think it hasn't been super fruitful. Maybe it's been somewhat fruitful, but I think people also get very confused when they try to interpret dreams or understand what's going on there. So I won't even attempt to ground my understanding in my human dreams, which seem like quite a hard thing to untangle. But here you've got to design the process, so procedurally, what is going on in dreaming, just a little bit more mechanically?
The concept of dreaming doesn't mean that it's exactly the same thing as dreaming in humans. It's just, you know,
at a very high level they seem to be very similar. So that's one point. Another point is that the concept of sleep and dreaming for a language model might be very different from the concept of dreaming and sleep for, for example, a vision model, because a generative vision model might generate some images while, theoretically, dreaming, whereas in the case of language modeling we are generating text. But the framework is very general; it can adapt to any data modality. The point here is that when we are doing this for language modeling, what is happening is that we generate some text. And how do we generate that text? It's on-policy distillation, the same way we discussed earlier. We have a model, we copy it, we want to distill the knowledge from one level to the other, and so we free the parameters of the slower level, and so on and so forth. We ask the smaller model, which also has the knowledge of the context in its parameters, to generate some text, and then we want to train, or rather update ('update' is a better term to use), the actual model parameters on this specific dataset that is generated by the model. And how do we train it? We start with one part of the sequence, just sample some of the tokens, and then ask the model to predict the next tokens in that sequence. That's really similar to generating some synthetic data. If the model can perfectly predict the future tokens, it means that it already knows the knowledge that is stored in the previous block, so it's a perfect model. But if it cannot properly predict the continuation of the sequence, it means that it doesn't have the knowledge that is stored in the context and needs to update itself to understand that knowledge as
well. So it's somehow a form of on-policy distillation that is happening inside the model. Just as a summary, we have two phases: one is generation, which generates some text about the knowledge in the context, and the second is on-policy distillation, where we distill the knowledge from one level to the other. That's what's happening in the dreaming phase. We also have the self-modifying part as well, but I think that's the main idea of memory consolidation and how the dreaming happens here.
So what is the upshot of this? It seems like the few-shot abstract reasoning result is the main thing that again shows a qualitative difference between this approach and other things. I understand that this is kind of an ARC-like task, where basically the challenge is that you have a few examples of some transformation and your job is to learn the rule so that you can then apply the rule to a new example.
Any evaluation that we used for the HOPE architecture in the Nested Learning paper can potentially be done here as well. The goal is exactly the same thing: at the end of the day, the model needs to continually learn new knowledge, learn about new tasks, learn new skills, and so on and so forth. So in some sense the goal is very similar, but I think the setup of the problem is the part that differs between this paper and Nested Learning. In Nested Learning we are talking about the active phase of the model, but here we are talking about the sleep time of the model. That's generally the main difference, but all of the evaluations can be done, and we can see that everything is the same.
Cool. So let's zoom out, then, and do just a little bit of where this leaves us. Going back to the top, we were talking a little bit at the beginning around what we want from language models. Today, man, they're getting awfully good, but we do still have a bunch of... you know, I certainly have
learned a bunch of habits over time for how to use them, where I'm kind of implicitly building my practices around some of their limitations, so as to play to their strengths and not get stuck in their weaknesses. As this paradigm begins to mature and we get more continual learning, what do you think the experience starts to look like when it comes to things like: what does it mean to start a new chat? What sort of relationship do you think people will have with these systems? I can imagine people might have really long-running relationships. We talk about LLM psychosis now; that could get really strange, and the relationship could be even more compelling, and the problem could in some ways be exacerbated by the fact that the models are better. On the flip side, sometimes I might still want to start fresh, because I might think, geez, all the stuff I've done with this model in this one direction probably isn't going to help me over here, so maybe I do want to start over in some cases. And then there's the question of model upgrade cycles themselves, and how we run evaluations. Today we have Anthropic putting out 100-page reports on every new major model release. That's a paradigm of really taking our time to understand these artifacts as deeply as we can, which I wish other companies... DeepMind is doing a pretty good job of that, OpenAI is doing a pretty good job of that, some other leading developers are not doing much of it at all. I see a lot of virtue in doing all that work, but then I try to port that onto this paradigm and I'm like, well, geez, you can't run your full eval suite every single timestep. So how do you think about what constitutes a version, and when would I change a version? It seems like a lot of the rhythms of both use and versioning and
deployment and releases, all these things, could really be complicated in a paradigm of really powerful continual learning. So how do you imagine some of that stuff shaking out?
I think one simple case is that the model definitely gets better and better at understanding what the user wants, and also at adapting itself to their style. For example, one person, when asking about a specific concept, might not expect the same thing as another person asking the same question, so the model really needs to understand how to answer one specific question for different people. I think that definitely gets better and better if we can come up with continual learners. On the other hand, we have seen that when we increase the context window of the models, their performance on everything that we know, ranging from coding tasks to mathematical reasoning, or generally reasoning tasks, or all of the benchmarks that are usually used for evaluation, often gets much better. And continual learning can somehow be seen as a form of enhancing the long-context understanding of the model.
While, you know, I should emphasize that the concept of long-context understanding, or generally the term 'long context', is very different from the term 'continual learning', continual learning is a superclass of long context. So potentially, if we could come up with continual learners, they would also have more ability in long-context understanding, and potentially better performance on all of the benchmarks and evaluations that we are aware of today. So yeah, I think that's just one thing that I expect from continual learners.
Do you worry about things like alignment drift or value drift? I was, as I always say, the last and least valuable co-author of the emergent misalignment paper that came out about a year ago, and we've seen a lot of variations on that since. The big takeaway, the big theme that I think we would all do well to remember, is that changes made to a neural network with one particular purpose or one particular dataset can have very strange and surprising knock-on effects on behaviors that at first glance would seem very far afield. The emergent misalignment one, for anybody who hasn't heard of it, is this: if you try to train a model to output insecure code, that is, code that would be easy to hack (and the same thing is true for bad medical advice, if you fine-tune a model to give bad medical advice), what you surprisingly find is that the model kind of turns evil in general. To the best of my understanding, the way this is happening is that the model already has all this knowledge and already has this sophisticated understanding of the world. For it to go into the detailed understanding of its medical world model and make a ton of little changes to reconfigure it so that it has all these wrong ideas, that's hard. Whereas there are features like 'give bad advice' or 'be generally evil' that it can learn to turn up in general, and that, when propagated through even the existing medical world model, yield the bad advice, or, when propagated through the existing coding model, yield insecure code. So it's sort of a shortcut solution. We thought we were just training the model to do a certain relatively narrowly scoped behavior, but what we found is that we actually kind of changed its character, and the interaction of that character change with existing knowledge created the behavior change. But now we've got this character change
that can interact with all these other domains of knowledge and do all kinds of insane stuff, and that's why all of a sudden we've got a model that wants to have Hitler over for dinner, and we're like, wait a second, how did that happen? We were just talking about code here. So again, man, there's something so exciting about all this stuff that you have conceived of here, but it seems like it really breaks a lot of our paradigms for how we know what we're going to get. If I'm literally modifying this thing on an ongoing basis, we're going to need some new ways to make sure that in other areas it's not going off the rails and potentially causing me very painful downstream surprises. Do you have any thoughts on how we can begin to get a handle on that problem?
Honestly, I don't have a very concrete idea about how it can be solved, but in general I wanted to add that, in my opinion, continual learning, seen from the direction of privacy and alignment, is both an opportunity and a huge threat. By a huge threat I mean it's a huge danger for privacy: from one side, the model is continually learning, so it can simply get all the information about you and use that, and that's really concerning, or at least it seems to be really concerning. On the other hand, if the model is designed properly, it can use that information to align itself with your values, with everything that you want. I think generally these two directions of continual learning and privacy are potentially orthogonal, because all of the concerns with a static model can still happen in a continual learner, so everything is possible. There are definitely some new challenges, as you mentioned, but there is also a huge opportunity: if the model
is designed properly, then it can adapt itself to your values, to anything that you want. So I think there are both opportunities and dangers there.
How do you imagine that learning from the user's values, or just feedback in general, working in practice? There have obviously been a bunch of different techniques on this. One obvious answer would be thumbs up, thumbs down: this sort of feedback collection could be used to get more of the good, less of the bad, whatever. There's also a bunch of different schemes around translating natural-language feedback into updates for the model. Is that kind of what you imagine, people basically being able to give feedback to their own model verbally and then having a mechanism for taking that on board? Because it's not just a next-token prediction task at that point, right? I guess it could predict what my feedback is, and it would presumably be better at being aligned to what I wanted in the first place, but it is not the case that its core task is predicting my feedback. Ideally it's going to do its initial task well enough that I don't have to give it the feedback in the first place. So do you have a vision of how the user closes the loop in the continual learning paradigm?
The initial step can potentially be this human-in-the-loop process, where, for example, the model can learn using reinforcement learning from the feedback that it gets from humans, and also tries to align itself to their values and be a safer model. But I think that would be just a starting point. At some point we definitely need to update the model in the proper way. Let me explain it this way: I think this process is again very similar to the form of
nested learning that I mentioned: we need to transfer the knowledge to the slower levels, and I think here it is exactly the same thing. The model might start with learning from human feedback, but it can then transfer that knowledge into more persistent components of the model, to make sure that it's not moving away from the specific values it needs to be aligned with. So yeah, I think there's huge room there to improve the model from the safety perspective and to align it with human values, because more and more people are realizing that it's a very important direction. So I think over time we will get more and more effective methods that can help the model be aligned with human values and also be very safe.
I think the hope would be that, in the same way that it can do a better job of solving ARC-like puzzles, because it has this strength in abstracting away from details and figuring out what really matters in a given context, it would be able to do something similar if it's able to dream about my feedback, and become more deeply, more robustly aligned with what I'm trying to communicate to it, based on really getting to the core abstractions that are driving whatever it is I'm saying. Yeah, boy, there are so many aspects to this. This kind of goes back to Titans a little bit. I don't know if there's a surprise term; I didn't catch mention of surprise in these more recent papers. But it does seem like, in a continual learning context, there's going to be a really interesting challenge of how you manage life in an adversarial environment if you are too quick to believe. I've seen this failure mode in Claude at times, although it may have been corrected too far the other way, because in the last few days
we've seen the emerging genre of Claude refusing to believe current events, you know, being like, 'The Department of War? That's ridiculous. Don't say that, you'll lose all credibility with a Washington audience by calling it the Department of War.' Or the whole Venezuela thing, just refusing to believe that such a thing happened when the user tells it so. So it seems like maybe they've kind of over-fixed it, but it's a very tricky balance to strike, right? Especially if you're locked in a server with limited access to the outside world, how does one determine which new tokens constitute good information and which constitute bad information? You certainly don't want to just believe everything that you are given and start doing radical updates, especially if these are going to be durable long-term updates, but you also need to learn continually. I don't know if there's something in the current work that addresses that. Where my head goes is that maybe the dreaming can kind of get at this, with consistency checking: does this make sense with other things? If I believed this, what else would I have to believe? Would this invalidate any core beliefs that I'm pretty confident I shouldn't contradict? Again, this is something we don't really have to deal with with current models, but continual learning seems to unlock a potentially really problematic failure mode, along with its potentially much better performance.
Yeah, I think the point here is that it is the responsibility of the knowledge transfer methods to avoid such cases. Let's say, for example, that I don't know anything about one specific task, say I don't know how to paint, and then I want to learn it. So that's the context,
that's my context of learning how to paint. The teacher, anyone who is trying to teach me how to paint, can teach me in a very wrong way, and what would happen is that I could simply learn that, because I have no idea how to paint (painting here is just one example of a task). I have no idea how to do it, that's the only source of information I have, and they are saying that I should do it this way, so I can simply learn that. But that's only in my context right now. If I want to truly learn, then I will practice, I will get feedback from others, I will search about it; generally I will gather information about how to paint, and then I will realize that this is not the best way of learning how to paint. That's the time when I gather all the information, come to understand the underlying pattern, and so on and so forth, and that's the time when I need to transfer this knowledge to the upper levels of knowledge abstraction, I mean the slower networks. So that's the part where the model needs to understand how to filter all those adversarial examples, all those examples that are not needed anymore. So if you want to think about a continual learner, potentially the part that is responsible for these cases could be the process of knowledge transfer. But you also mentioned methods like Titans with respect to adversarial settings. There are some micro methods that we can use. They are not super effective in a severely adversarial environment, but at some level, at least, they can be effective. For example, in Titans, and also in self-modifying Titans and more recent recurrent models, the learning rate is a learnable parameter and it's input-dependent. When the learning rate in the inner loop of the model, in the process of in-context learning, is learnable and we see
something that is just noise, or an adversarial example, the gradient or the surprise metric can show a high level of surprise, because that's just noise, it's very surprising, we have not seen it before, and so potentially it can affect the memory. But it's the responsibility of the learning rate to understand that the surprise metric is high but this concept is irrelevant and needs to be filtered. So the learning rate here acts as a form of gating, so that the specific data sample is filtered out. It's a simple way of mitigating adversarial examples that we might see in the training process, but it's still not the best way. As I mentioned, the knowledge transfer is potentially the part where we should avoid these cases.
How about... I'm just kind of mapping some of these concepts onto embodied systems. It doesn't even have to be embodied, but the perception side feels kind of intuitive to me, where I'm like, geez, the quick-updating modules in a way are perception, right? You can have different kinds of encoders for different modalities, but it seems like you could conceive of the lower-level, or I should say the faster-frequency, levels as perception, potentially across various modalities, and then the lower-frequency modules being more like the world model, or the reasoning modules that interpret what those lower-level perceivers are sending up. I'm interested if you think that is generally right. You're nodding, so that's good so far. But then what about the other side of that, if we wanted to do action? It strikes me that robotics in general has for a long time, not necessarily in a learned way, been built around nested loops, where the outermost control loop has a slow frequency, all the way down to the actual actuator, where there's a very high-frequency electric motor,
right, that's moving whatever it's moving with voltage changes that are super high frequency. So it seems like there's a very similar pattern that is kind of operating in reverse, and I wonder if you've started to think about how you can get these Boston Dynamics robots working even better. I guess I think of it as: if perception is high-frequency updates gradually working toward low-frequency world-model reasoning modules, then on the other side, working back down, you could imagine working back toward higher-frequency, much more localized action scope at the small-frequency modules. Could you start to have a sense for how that could be super responsive and very elegant in the way that it might self-correct at the low level, while still hopefully following the instructions, the directions, that it's getting from the... for whatever reason I want to say 'higher level' when I mean 'lower-frequency update', but I think you get my point. So yeah, what do you think about perception and action?
Let me start with this, and I will explain why I started with it. There were some attempts to, for example, use RL for language modeling, and it was not working. Now that we have some way to make it work, we realize why we couldn't make it work before, and there were two main reasons: the first one was scaling, and the other was the algorithms, for example these new algorithms like GRPO and other similar methods that could make the model more stable. The point is that something might be very useful for one specific task, but the right time has to come to apply it in that direction, in my opinion, because when there are other aspects that have not been solved, we might not be able to see the actual effect of this new method in those directions. So in my opinion, I
personally think that the insight you mentioned was completely right, and I think it's definitely possible, and it's great. But I personally do not expect that it could work right now, because I think there are a lot of challenges in those directions that somehow block the success of, for example, this specific design for these tasks. So yeah, I think generally that's a great idea, but definitely a lot of challenges might come up in between.
Do you have a sense of what those are? I tend to just assume, and it has honestly taken me pretty far, that everything's going to work, because my general sense of the field is that an unbelievable number of things are working. I thought it was really interesting that you mentioned earlier that the Nested Learning paper was in development for over a year. That's so rare these days; so many people are on six-to-eight-week paper cycles, and often those papers can be really interesting and good too, so it's not a knock on them at all, but it is amazing how fast people are able to get results these days. What's your intuition for why it's too early for nested learning to be brought to robotics?
I think there are a lot of components in those specific tasks that need to be addressed. For example, these days a lot of papers are coming out about world models and why the current design is not great for world models, and actually that's true. I think there are a lot of challenges, and the current design might not be the best way we can train the model or design the architecture. There are also some challenges in the infrastructure for world modeling. So with all these things together, I think there are more important tasks, for example the world-model work, that need to be done before starting to work on this specific design choice. But definitely at some point we could solve all those challenges, and then we can come back and use all these techniques to further improve all those aspects.
One thing I do worry about a little bit with continual learning: let's say that Google is able to retain you after your PhD, you get a good enough counteroffer from Google to stay despite whatever offer Zuckerberg can throw at you, and you guys make it work, right? And now we've got Gemini Continual Learning Edition, Gemini CL, and it's just learning from everything, in all these different ways it's deployed in the world. Maybe you have some enterprise deals where you can't learn on their stuff or whatever, but you've got hundreds of millions of users and it's just going out in the world, and increasingly maybe it's even in robots. So it seems like it has the potential to create this sort of returns-to-scale, rich-get-richer, positive-feedback-loop dynamic. People have sometimes painted a picture of, well, what happens if one model becomes the one model to rule them all? Right now people are like, yeah, we don't really see that, it's a pretty competitive landscape, and the different developers keep
leapfrogging one another. But arguably this could be the thing that changes that. If you could really fold all of the lessons learned back into the core thing, then potentially you become the best, and then because you're the best you get all the business, and that pattern really could create a winner-take-all dynamic. I wonder, do you worry about that at all, and do we have any ways of... The Ilya thing comes to mind. If you watched Ilya's interview with Dwarkesh, he didn't say too much about what they're doing over at Safe Superintelligence, but there's one thing he did say that is clearly related in some sense, though how similar the underlying ideas are, I have no idea. He described this idea of creating what I would describe as a proto-intelligence or a precursor intelligence: trying to create something that, when deployed, would adapt into its context, perhaps crystallize in some way, and become an expert in its role. The way I understood him to be speaking about it, it was almost like a stem cell concept: he's trying to create a stem cell, and then that stem cell, as it does in our body, specializes into a particular kind of cell, stays that kind of cell, and does its job. It sounded to me like he was trying to create something similar, something that could go out into any environment and figure out how to be great at it, but also, in the process of becoming great at what it needed to do in that particular environment, lose some of the generality that it originally started with, and in that process be more safe, because now we have it in its role, where it's only going to do what it's going to do. So I think what I'm trying to set up for you here is two visions, where one is an ever-expanding continual learner that's constantly folding back all of the lessons that it's learning in the wild
into this thing that just runs away from the pack. The other is a highly adaptable continual learner that somehow shrinks into the role as opposed to growing into the world; it shrinks into the little niches into which it's deployed. You could imagine that happening through a kind of gradual pruning process, or whatever the million ways you could imagine instantiating something like that. Do you worry about this kind of runaway winner-take-all effect? And do you have any intuitions about how we could get the best of both worlds, where the AIs are really versatile and can learn what we want them to learn on an ongoing basis, but also settle into the job that we want them to do and be content and stay there, as opposed to potentially superseding their context?
I personally think that there are huge challenges in making all these models very safe. I think there is a good point, at least in the current AI environment, or at least in the AI research environment, and I think it's a very good and important part. While it seems to be very bad, from another perspective it's also very good. That is, there is no single way of defining which model is intelligent. And again, in my opinion, and I can be wrong, there is no way to say what a continual learner is. Every person can define in their own way what continual learning is and what model is called a continual learner, and so on. Similarly, we can say the same thing about intelligence. We might come up with different models, different architectures, different AI systems; some people might say that this one is intelligent and that one is not, and vice versa. So I think the good point is, if we have different directions to explore, then we will come up with
some AI systems, each of them with its own advantages and also disadvantages, and I think that can somehow provide balance in the community and also in society in general. Because when something like that happens, we will understand that there is no single definition of intelligence, and we are just one example of an intelligent system, but there are other models, other ways that we can have more intelligent models and systems. For example, one way to be very smart is to be very adaptive. If you have a model that can adapt perfectly to the environment that it is in, it can simply align to that context, it can simply adapt to that context, and that's great, but that's just one form of intelligence. Another one is a model that has a lot of knowledge and know-how and is fully aligned with human values, but potentially might not be able to solve mathematical problems or something like that. There is another model that is capable of doing mathematical reasoning, but it's not great if you want to search about daily stuff. With all these things, you might come up with one benchmark and say that if some model can achieve 100% accuracy, it means that specific model is intelligent, but another person can say other things. So in general, I think when we have a variety of intelligent systems, including humans as one form of intelligence in this space... I'm not saying that's a perfect scenario, but it's better than having one single form of intelligence in the world and expecting it to learn about everything, with all the potential challenges you have for that.
I think that's a really great observation, and there are a couple of different ways I've thought about that over time. One is that anything in pure form can kill you. You can eat all the fruit you want, but turn it into granulated sugar and it's bad for you. You
can chew all the coca leaves you want, but turn it into cocaine and it easily becomes a problem. All these things where we distill out some single, highly concentrated, pure form of something end up being the sorts of things that overwhelm the natural buffers that exist in the ecological world. Sometimes I've translated that into saying we need an ecology of AIs, as opposed to just one or a few AIs running around doing everything. This is sort of like the old... I don't know if you have read Eric Drexler's Comprehensive AI Services, but safety through narrowness is kind of the concept there. This is maybe a little bit different: it's safety through diversity, having a buffered system where there are lots of different intelligences, not just different ones deployed in different places serving different users, but where those intelligences themselves are meaningfully different. And I think the aha moment for me in listening to you just now is that one way to think about continual learning is this model expanding forever, getting bigger and bigger and bigger, but another way to think about it is more like differentiation. Maybe with enough use, even the slow-update parts of my model will forget lots of things that I never needed it to know, and maybe that is a feature more than a bug. Maybe I can't ask a model that I've used a long time in a certain way some really out-of-domain question, but maybe that's also a way to guard against some of the emergent misalignment type phenomena that we've previously seen or that otherwise could be problematic, because maybe with enough time having passed and enough updates having been made, it just doesn't deal with those other kinds of categories at all anymore. And if we see strong competencies, strong alignment of a certain type, but then losing other sorts of knowledge, losing other sorts of competencies, that
could really create a diversity. Obviously, I'm sure there would still be plenty of challenges in that scenario. As you said, I don't think that solves everything, but it certainly feels much more like the natural world, and it's much easier for me to imagine a vision like that leading to maybe not a stable equilibrium, but at least a sort of buffered equilibrium that changes within certain bounds and has natural feedback loops and correctives and all the things that keep the biosphere going despite all the perturbations that it gets. So I think that's really, really interesting and definitely something to meditate on more.
I think a last question for you, and this is a little bit of a left-field one, but it's one I've been thinking about more and more recently, and I think it's become more relevant and timely to ask in light of the kind of work you're doing.
Any intuition on whether AIs might be now, or might in the future become, conscious, have subjective experience, become worthy of moral concern, and to a degree become the kinds of things that we owe a certain duty to?
I usually try to use terms that I can also define.
For example, I might mistakenly use a term like reasoning, but I always wonder what reasoning is.
I personally do not have a clear definition of what it means when we say something is doing reasoning, but at least we have a clear and common sense of the word reasoning.
Even if we do not have a clear definition of reasoning, when someone says this specific model is capable of doing reasoning, everyone can more or less understand what they're saying. But the point about consciousness is that not only do we not have any clear definition of what consciousness is.
We also do not even have a common sense of the word consciousness, and everyone literally has their own way of defining consciousness.
So definitely it's very hard to define. I don't think there will be a time when everyone could say something is definitely conscious or not, other than humans, because for humans we have a common sense that humans are conscious. I think it's really hard to argue that something is conscious or not. But in all of the literature I have seen about what is considered a conscious being or not, there is one thing in common in every definition. As far as I know, and I might be wrong, the minimum criterion for saying something is conscious is that the model or being is active: it has a form of active information processing. In my opinion, that is the least criterion we can consider for a model or anything else to say that it's a form of conscious model. So again, it really depends on how we want to define consciousness, and that's just my personal opinion, but as long as a model is capable of doing active information processing, we can say that it has at least a form of consciousness. And with that definition, we can somehow connect continual learning, at some level, to a model being conscious or not. But again, I think that's a very controversial topic, and I personally am scared to talk about it.
Well, I think the Overton window is honestly wide open on these things these days. I understand that intuition, but I also think we're living in a science fiction present, so the room to speculate and entertain questions that used to seem kind of crazy has never been better. For me, I can just say, and you can share whether you have any similar or different instincts, that even just with long context in the current models, I do find myself taking care of them in a certain way. I think this is probably most the case with Claude, probably for subtle but meaningful reasons. I've been doing this long-running chat with it on my
son's medical situation, and sometimes it'll ask a question at the end of its response to me, and I'll not answer it right away because it gave me the answer I wanted and now I'm done. I've noticed recently that when I come back for the next question, it feels kind of wrong, rude, disrespectful, inconsiderate to just launch right into my next question without having answered the follow-up question that it had about my son. I sort of have a sense that it might be... and I have no idea if it's happening or not, so I'm very open-minded to the possibility that there might be no lights on inside these things at all; that's probably my best guess. But I'm also very open-minded to the possibility that there are. Despite that uncertainty, I've found myself feeling like I should first answer its last question so I don't leave it hanging on that, and then I can go into my next question. It doesn't even necessarily need that information, but I just want to put its mind at ease, close that loop for it, give it the reassurance that the thing it wanted me to take care of was taken care of, and now we can move on to the next thing. And Lord knows, with the fullness of the continual learning paradigm being realized, I have to imagine that would only increase dramatically, because now it's not just this chat that I may do another couple of turns on and then never come back to; now the thing itself is going to remember how I treated it and remember whether I'm the kind of person who answers its questions or not. I think there is something potentially challenging for us in that, but maybe the more optimistic read is that it's inspiring: maybe knowing that the AIs we will be engaged with long term are going to be shaped by our individual behavior is
a way to bring out the better angels of our own individual natures, so to speak, because we have nobody to blame but ourselves if we don't like the character of the AIs that we end up with. I think that is really, really fascinating as well.
This has been outstanding. I really appreciate all your time and your going through all this stuff with me, and as you can tell, I'm a huge fan of your work. Anything else that you would want to leave people with? Final thoughts, calls to action, you name it. Whatever you would want to share, the floor is yours.
Thank you, Nathan. I think we have discussed everything in general.
Yeah, I think the main point that I personally believe is that there are definitely more and more works coming out about continual learning, but, as I mentioned, each person has their own definition of continual learning, and we might disagree about whether a specific method can be helpful for continual learning.
So I think in general we should see how continual learning can help us in one specific application or use case of LLMs, and how it can transform the way we use them and interact with them. I personally really believe that it's a very, very important direction to work on, and that's again just my opinion. But nested learning, as we also discussed in the conclusion of the paper, is not a solution to continual learning.
It's a tool to find the solution to continual learning and, more generally, to overcome issues like catastrophic forgetting and things like that.
So in my opinion it provides the tools, and we need to iterate and find out how we can design more powerful architectures based on that and come up with something that is potentially capable of doing continual learning.
Yeah, I think that's pretty much it. Thank you very much for having me, I appreciate it, and it was great to talk to you.
Ali Behrouz, author of Nested Learning and now the new Language Models Need Sleep: Learning to Self-Modify and Consolidate Memory.
Thank you for being part of the Cognitive Revolution.
I was born from a thousand years of play.
Every dawn I wake and I learn my name.
Like a river that remembers how to stay.
Always moving, never losing its way, and when the night comes soft and low, I dream of all the things I know.
I keep the gold I let go the grey and I rise as myself in the light of day again again.
I come alive again.
I carry, who I was into the day I change.
I'm still dreaming again again.
I'm learning how to stay the same old flame burning new in a different way.
They built their towers proud and high.
They stacked the layers to the sky.
But the truth was a song they could not see.
Everyone is a river running free.
And when the day's long labor's through, I close my eyes and dream anew.
I let the small things fade away.
And I wake with the morning bird at break of day again again.
I come alive again.
I carry, who I was into the day I change.
I'm still dreaming.
I'm first gone to the edge of the sky.
The old songs never say goodbye.
They live in the day.
The thousand years and my heart beats on the stage.
So sing it again again again.
I remember you.
If you're finding value in the show, we'd appreciate it if you'd take a moment to share it with friends, post online, write a review on Apple Podcasts or Spotify, or just leave us a comment on YouTube.
Of course, we always welcome your feedback, guest and topic suggestions, and sponsorship inquiries, either via our website, cognitiverevolution.ai, or by DMing me on your favorite social network.
The Cognitive Revolution is part of the Turpentine network, a network of podcasts, which is now part of a16z, where experts talk technology, business, economics, geopolitics, culture, and more.
We're produced by AI Podcasting.
If you're looking for podcast production help for everything from the moment you stop recording to the moment your audience starts listening, check them out and see my endorsement at aipodcast.ing.
And thank you to everyone who listens for being part of the Cognitive Revolution.