
Dwarkesh Podcast · 2026-04-29
Reiner Pope on LLM Training & Serving Math and System Architecture
Hosts: Dwarkesh
Guests: Reiner Pope
Why it matters
Batch size critically amortizes memory fetch costs, enabling 1000x cost efficiency gains in serving LLMs.
Key claims
- Batch size critically amortizes memory fetch costs, enabling 1000x cost efficiency gains in serving LLMs.
- Roofline analysis shows latency and cost trade-offs between compute time (weight multiplies) and memory bandwidth (weight and KV cache fetches).
- Optimal batch sizes for frontier models are on the order of thousands of sequences, balancing compute and memory bandwidth.
- Mixture-of-experts (MoE) layers use expert parallelism across GPUs within a rack; cross-rack communication bottlenecks limit scale-up.
Briefing memo
Summary
In this detailed technical lecture, Reiner Pope, CEO of MatX and former Google TPU architect, explains the mathematical and system-level principles behind training and serving large language models (LLMs). He focuses on how batch size, memory bandwidth, compute throughput, and KV cache affect latency, cost, and scaling. Pope uses roofline models to analyze trade-offs between compute and memory bottlenecks, showing why batching many users is critical for cost efficiency and how sparsity and mixture-of-experts architectures impact compute and memory demands. He also discusses the physical constraints of GPU racks, interconnect bandwidth, and parallelism strategies (expert, data, pipeline) that shape model deployment at scale. The episode covers the implications of memory walls on context length scaling, pricing signals from API costs, and the interplay between training compute, inference compute, and RL fine-tuning in optimizing model lifecycle costs. Finally, Pope touches on invertible neural networks inspired by cryptographic constructions and their memory-saving benefits during training.
- Pipeline parallelism reduces per-rack memory capacity requirements but introduces micro-batching and pipeline bubbles, complicating training.
- Memory bandwidth, not capacity, is the main bottleneck limiting context length scaling; sparse attention partially mitigates but quality trade-offs remain.
- API pricing structures reflect underlying compute and memory costs, e.g., higher cost for decode (token generation) vs. pre-fill (context processing).
- Invertible neural network architectures inspired by cryptographic Feistel ciphers enable memory-efficient training by rematerializing activations on backward pass.
Source material
Transcript
Today, I'm interviewing Reiner Pope, who is CEO of MatX, which is a new chip startup.
Previously, he was doing TPU architecture and many other things at Google.
This is a very different format from my usual interviews.
This is going to be a blackboard lecture where I'm going to get up in a second.
We in fact built this whole new studio with specifically this format in mind, and it's a pleasure to get to inaugurate it with you.
We're going to be talking about model architecture and ML infra and many other things.
And the reason I think it's an important topic is that once you actually understand how training and inference work in a cluster, a lot of things start making sense, as we'll see: why AI is the way it is, why AI architectures are the way they are, why API prices are the way they are, and fundamentally why AI progress is the way it is.
And you need to understand the details to get there, and you need a blackboard to understand the details.
So Reiner, thank you so much for doing this.
Yeah, very happy to be here.
Just a heads up.
This is a lecture with graphs and equations and all that stuff.
So I would really recommend watching it on a video platform like YouTube.
Okay, full disclosure, I am an angel investor in MatX, but that's unrelated to this podcast.
Reiner, maybe to kick us off, I'll ask this question.
So we have a couple of companies like Claude and Codex and Cursor offering something like fast mode, where for 6x the price, they'll give you streaming tokens at 2.5x the speed.
Mechanically, I'm curious what's going on here.
One, why is it the case that you can pay more to get lower latency?
And two, could you keep going?
Could you pay 100x more and somehow get even faster speeds or much much faster speeds?
And three, could you go the other way?
Could you have something like Claude Code slow mode, where if you are willing to wait for minutes on end, you could get even cheaper prices?
So maybe this will help motivate the kind of analysis that you'll be doing through the lecture.
Great.
I mean, to jump to the conclusion a little bit, the big effect is batch size, but what we're going to do now is quantify exactly what that looks like and what its implications are on latency and cost.
There's going to be another effect, which you can call speculative decoding or multi-token prediction.
We can maybe come back to that later, but I think the first thing that we'll talk about is batch size.
So what I'd like to introduce is sort of the two principles of analysis.
Firstly, we're going to look at a roofline analysis of how we run a transformer model on a cluster of chips.
We'll take, let's say, a Blackwell NVL72 cluster, so a rack of 72 GPUs.
And so the roofline analysis means we look at memory bandwidth and compute performance.
And then the other side of that is that we're going to look at just two simple factors of the model, which are the time to operate on the weights, and then the time to operate on the context, the KV cache.
So let's jump in.
What we're going to try and do is we're going to try and estimate the time that it takes to run an inference of a certain shape.
Now, we're not perfect here.
We can't exactly predict the time, and so instead we're going to approximate, and so we're going to say that the time must be greater than or equal to a certain quantity.
And so we're going to consider two different aspects.
We're going to look at the time it takes to do the memory fetches, and then the time it takes to do the compute.
And it'll turn out that this actually gives us a very strong predictive power, even with a simple model.
So what do I want?
What is the time that it takes to do the compute?
So there are really two things I need to do in the compute.
I need to multiply by all of the active parameters, and then I need to do some work on the attention.
So multiplying by all the active parameters.
I have a certain batch size that I'm running, and then I've got a number of active parameters, and then I'm just going to divide this by the compute throughput, which is the FLOPs of the chip.
So this is our compute term.
So this actually accounts for all of the compute time for all of the weight matrix multiplies.
There's a little caveat here.
We've sort of ignored the time to do any of the attention computation, but that in general can be quite small in comparison to this.
So we'll look at this.
Maybe I'll just interrupt from time to time to ask a very naive question and sort of clarify some basic points.
But just for the audience, you're not serving one user at a time.
The batch refers to the fact that you're serving many different users at the same time.
Yeah.
And that's a whole batch.
Yeah, so I can motivate the batch at least a little bit.
So we will see exactly why batching is such a favorable optimization, but it will turn out to be the case that if you do not batch together many users, the cost and the economics you get can be like 1,000 times worse than if you do batch many users together.
And we'll be able to see that quite explicitly.
And then the number of active parameters, this is saying, if I look at, for example, a DeepSeek model, the DeepSeek V3 model has about 37 billion active parameters.
And then about 700 billion total parameters.
So this is where we're focusing on just the ones that are active for a single token.
OK, so we've modeled compute performance.
I'm going to keep writing equals, but in all of these cases, you can think of this time as being at least this much.
And maybe there are other terms that we've ignored.
On the memory side, what do we need to do with memory?
We need to fetch all of the weights.
And so there is some time to fetch all of the total number of parameters, not just the active parameters.
So there's weight fetch time.
And in addition, there's a KV cache fetch time.
So this actually depends on batch size.
So for every element in the batch, we have to fetch an entire context length worth of tokens.
And then there's a size per token.
So like bytes for one token.
And so that's a model parameter.
And maybe just backing up, let's just explain what the KV cache is real quick.
Yeah.
So when I do a forward pass, let me actually draw how autoregressive inference works.
So this is during decode.
So I have a bunch of tokens of text.
I'm drawing a tensor because ultimately, the tokens are represented as some tensor of embedding dimension.
And then in this direction, I have the sequence length. The work of running a decode is I have to run each token through a whole bunch of matrix multiplies over a bunch of different layers.
And, in general, I'm going to have to do that work over all of these tokens.
But then one step of decode is actually to produce just this one additional token.
And so what I'm going to do there is I'm going to run a full forward pass, multiplying by all of the weight matrices in the entire model.
But then I've got this attention mechanism where this token is looking at all of the past tokens in this way.
And what is it looking at specifically?
It is looking at some internal representation that the model has produced of the tokens, and we call that the KV cache.
So this process of this single token attending to all of the history of tokens, that's attention.
It is mostly dominated by memory fetches rather than matrix multiplies.
So we've got the amount of memory that we're fetching, shown over here.
And then that is, of course, just divided by the memory bandwidth.
So the memory bytes per second.
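The two bounds described so far can be written down directly. The following is a minimal Python sketch of that roofline estimate, not anything shown in the lecture: it assumes compute throughput is counted in multiplies per second and that each weight occupies `bytes_per_param` bytes, and all function and parameter names are illustrative.

```python
def weight_multiply_time(batch, n_active, multiplies_per_s):
    """Lower bound on compute time: every sequence in the batch multiplies
    by every active parameter once (attention compute is ignored)."""
    return batch * n_active / multiplies_per_s


def weight_fetch_time(n_total, bytes_per_param, mem_bytes_per_s):
    """Time to stream all parameters (not just the active ones) from memory."""
    return n_total * bytes_per_param / mem_bytes_per_s


def kv_fetch_time(batch, context_len, kv_bytes_per_token, mem_bytes_per_s):
    """Time to stream the KV cache: one full context per batch element."""
    return batch * context_len * kv_bytes_per_token / mem_bytes_per_s


def step_time_lower_bound(batch, context_len, *, n_active, n_total,
                          bytes_per_param, kv_bytes_per_token,
                          multiplies_per_s, mem_bytes_per_s):
    """Roofline estimate: one decode step takes at least max(compute, memory)."""
    t_compute = weight_multiply_time(batch, n_active, multiplies_per_s)
    t_memory = (weight_fetch_time(n_total, bytes_per_param, mem_bytes_per_s)
                + kv_fetch_time(batch, context_len, kv_bytes_per_token,
                                mem_bytes_per_s))
    return max(t_compute, t_memory)


if __name__ == "__main__":
    # Toy numbers chosen only to make the arithmetic easy to follow;
    # they do not describe any real model or chip.
    toy = dict(n_active=100, n_total=1000, bytes_per_param=1,
               kv_bytes_per_token=5, multiplies_per_s=100, mem_bytes_per_s=100)
    for batch in (2, 20, 40):
        print(batch, step_time_lower_bound(batch, context_len=10, **toy))
```

With the toy numbers, the small batch is memory bound, the middle one sits at the balance point, and the large one is compute bound.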
So in fact, these equations here are actually enough for us to now draw some lines.
And so the things that we'd like to look at are sensitivity to batch size.
And then also, which we'll draw separately, sensitivity to context length.
So we said that the big effect you can get is like some trade-off in latency versus the cost in batch size.
So let's draw them out.
I think there's just really two graphs we want to draw.
We'll first just draw batch size versus time here.
So when we look at the shape of this, we've got a maximum of, well, a sum, and then another term.
So let's look at these terms one by one and how the times for compute and memory scale.
So let's first look at this compute time.
This is just purely linear in batch size with no offset.
So it is something like this.
So this is T_compute. And then on the memory side, we've got some portion here that is just constant, some base offset here, which is the weight fetch.
And then finally we have this term here, which is the KV fetch, which is linear in batch size.
So it looks like that.
So it's the sum of this plus this, maxed with this.
So let's at least first draw the sum.
So the two memory times in conjunction end up looking like this curve.
And then the overall maximum, which I'll draw a little thicker here, is the maximum of these two curves.
Makes sense.
OK, so what does this mean, actually?
So this is a latency plot.
So if I grow my batch size, I get initially some not very strong dependence on batch size.
And so there's some lower bound on latency here.
So that's already partially answered the question.
For a given hardware configuration, then we can talk about varying hardware configuration.
But for a given hardware configuration, there is a lower bound on latency, which is simply that I need to read all of my total parameters from memory into the chips.
And that takes a certain amount of time.
If I use all of my memory bandwidth, I can't do any better than that.
Given the way you've drawn the slopes for compute time and for how the KV fetch grows with batch size, what if the KV slope were above or below the compute slope?
Or is that necessarily the case?
Because if it's always true that, as batch size grows, compute dominates KV, that suggests that if you have a big enough batch size, maybe memory is never an issue.
Yeah, this is really sensitive to the context length.
So I think we should come back and explore this.
Yeah, as you vary the context length, the KV fetch time will go up and up.
And so that'll cause a transition from compute limited to memory limited.
And is there something especially significant about this slope being exactly the slope of the compute time?
Yeah, whenever we have balance points, it kind of says that you're getting it exactly right.
And so for the particular context length where the slopes match, that says I am equally memory bound and compute bound, which is a really desirable place.
Yeah, but suppose, and this is a very simple algebra problem, the optimal is a 100k context length.
And you go to 200 k context length.
Does your MFU go down to like 50%.
Like does it have a humongous impact on MFU?
Yeah, if you're slightly outside the optimal range of context length, that's right.
So that is true as modeled here.
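The 100k to 200k question can be checked with a few lines. This is a hedged sketch under the same simplified model: it assumes the batch is large enough that the weight fetch is amortized away and that KV fetch is linear in context length (dense attention). The function names and `kv_bytes_per_token` are illustrative assumptions, not values from the lecture.

```python
def balanced_context_len(n_active, multiplies_per_s,
                         kv_bytes_per_token, mem_bytes_per_s):
    """Context length at which per-sequence KV fetch time equals
    per-sequence weight-multiply time (weight fetch assumed amortized away)."""
    per_seq_compute = n_active / multiplies_per_s
    per_token_kv_fetch = kv_bytes_per_token / mem_bytes_per_s
    return per_seq_compute / per_token_kv_fetch


def compute_utilization(context_len, balanced_len):
    """Fraction of the step time spent on multiplies under this model."""
    return min(1.0, balanced_len / context_len)


# If 100k tokens is the balance point, doubling the context halves utilization.
print(compute_utilization(200_000, balanced_len=100_000))  # 0.5
print(compute_utilization(50_000, balanced_len=100_000))   # 1.0
```

Below the balance point the step is compute bound, so utilization stays at 1.0 in this model; above it, utilization falls off as the inverse of context length.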
There's a key point here, which is that I'm modeling the memory fetch as linear in context length.
Yeah.
That actually depends on model architecture.
It is true for all of the model architectures with dense attention.
Yeah.
Sparse attention actually scales much better than that.
Right.
And is sparse attention ever used in practice?
I'm pretty excited about sparse attention.
It's hard to know what the labs are using.
DeepSeek has published a sparse attention mechanism.
I'll just put a plug in for sparse attention.
Some of the DeepSeek papers that have published sparse attention end up putting a square root on this term.
Okay.
So far we've looked at the latency.
It's kind of hard to read off cost from this.
So what does cost mean?
To run this inference, I'm going to use the GPU for a certain amount of time, like 1 millisecond or 20 milliseconds or something like that.
And I have to pay the rental for that time.
So like it's $2 an hour per GPU or something like that.
So that's the cost of this inference.
But how many tokens have I processed during that inference?
That is the batch size.
And so what we actually want to plot is the cost versus batch size, which is like T over B versus batch size.
This is the cost per token.
So we have to imagine dividing each of these three curves by B.
So I'm multiplying by this reciprocal.
And so what we end up with there: the compute curve was linear.
We divide by B.
That makes it a constant here.
This is T_compute. The KV fetch was linear.
Now it becomes a constant as well.
And then the weight fetch was constant.
And now we've divided by B, and so it becomes this hyperbola.
And so again, we're going to compute the max of the sum. The sum of the KV fetch and the weight fetch shifts the hyperbola up and gives us a sort of higher hyperbola, like this.
And then we're going to take the max with the compute here.
So we end up with this being the overall shape that we care about.
So again, we see some limiting behavior: the cost initially starts very high at a batch size of one, and it almost goes to infinity.
That's because we've got so many weight fetches, which are not amortized over a large batch size.
But then as we increase the batch size, the weight fetches become amortized over so many different batch elements that their cost grows very small, and eventually the compute time ends up driving the cost.
So there is a limiting lower bound on cost.
Which is this point here.
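The cost-per-token curve follows from dividing the step time by the batch size. The sketch below is illustrative only: the $2 per GPU-hour figure is the one mentioned above, but the model sizes (FP4 weights, so half a byte per parameter), the KV bytes per sequence, and the hardware throughput numbers are hypothetical placeholders, and the function name is an assumption.

```python
def cost_per_token(batch, *, n_active, n_total_bytes, kv_bytes_per_seq,
                   multiplies_per_s, mem_bytes_per_s, dollars_per_hour=2.0):
    """Dollars per generated token: rental price times step time, divided by
    the number of tokens (one per sequence) produced in that step."""
    t_compute = batch * n_active / multiplies_per_s
    t_memory = (n_total_bytes + batch * kv_bytes_per_seq) / mem_bytes_per_s
    step_seconds = max(t_compute, t_memory)
    return dollars_per_hour / 3600.0 * step_seconds / batch


# Hypothetical model: 37e9 active parameters, 700e9 total parameters stored
# at half a byte each, and 10 MB of KV cache per sequence.
MODEL = dict(n_active=37e9, n_total_bytes=350e9, kv_bytes_per_seq=1e7)
# Hypothetical hardware with a compute-to-bandwidth ratio of 300 at FP4.
HARDWARE = dict(multiplies_per_s=6e15, mem_bytes_per_s=1e13)

for batch in (1, 16, 256, 4096, 16384):
    print(batch, cost_per_token(batch, **MODEL, **HARDWARE))
```

Under these placeholder numbers, the cost at a batch size of one is more than a thousand times the cost at the largest batch, which is the amortization effect described above.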
Yeah.
So Claude Code slow or Codex slow or whatever will just live on this line.
And it doesn't help much, because you're not able to amortize the KV values over a much bigger batch.
Yeah.
Yeah.
They're unique per batch element.
The compute is also unique.
Yeah.
Per batch element.
And so what is the minimum work you can do per batch element after amortizing everything else out of the way?
So this is the point where you are no longer memory bandwidth bound.
Practically, how big a batch do you need? How big are the batches?
Yeah, practically for frontier models.
You can just solve for that, actually.
And it's not even particularly sensitive to model architecture.
So let's go ahead and do that.
So what we're going to do is ask when the memory time is equal to the compute time.
That's what that question is.
For now, because we're focused on what the batch size is, and really the question is when the weight fetches are amortized over the multiplies, I'm going to focus on comparing the weight fetch time to the weight multiply time.
I'm going to disregard the KV fetch term, just to simplify the analysis.
So we can get a kind of a clean answer out.
So we're going to equate these two times.
Yep.
So, writing that out, we get the number of total parameters
over the memory bandwidth is equal to batch size times number of active parameters divided by the compute performance.
So, looking over here, everything on the top, these are model parameters.
Everything on the bottom, these are hardware parameters.
It turns out to be nice to rearrange them such that we have the hardware parameters on the outside.
So this is equivalent to FLOPs over memory bandwidth being equal to batch size times number of active parameters divided by the number of total parameters.
So this is a hardware parameter.
This actually ends up being a dimensionless constant.
If you look in terms of FLOPs, what are the dimensions of this?
This is multiplies per second.
This is bytes per second.
So that's not quite dimensionless.
But what you do is you say, multiplies per second times, let's say I'm doing FP4.
So I take how many FP4 multiplies per second, times the fact that each FP4 value is half a byte.
And so I can actually make this end up being dimensionless.
And this ends up being, on most GPUs, around 300.
Somewhere around 300.
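Written out, the balance condition and the unit conversion look roughly as follows. The symbols are assumptions introduced here for readability: N_total and N_active are parameter counts, W_mem is memory bandwidth, C is compute throughput in multiplies per second, and B is batch size.

```latex
\begin{align*}
\frac{N_{\text{total}}}{W_{\text{mem}}} = \frac{B \, N_{\text{active}}}{C}
\quad &\Longrightarrow \quad
\frac{C}{W_{\text{mem}}} = B \cdot \frac{N_{\text{active}}}{N_{\text{total}}} \\[4pt]
\frac{C \;[\text{multiplies/s}] \times \tfrac{1}{2} \;[\text{bytes/multiply at FP4}]}
     {W_{\text{mem}} \;[\text{bytes/s}]} \quad &\approx \quad 300
\end{align*}
```

The second line is the dimensionless hardware ratio: multiplies per second are converted to bytes per second using the size of one FP4 value.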
Sorry, does that ratio change over time?
As we've gone from hardware generation to hardware generation, the FLOPs keep increasing.
So this is a hardware parameter.
To what extent has the hardware changed?
So from, like, A100 to H100 to H200 to B100,
the FLOPs have increased substantially.
The memory bandwidth has also increased substantially, and the ratio has remained reasonably stable.
And we can express this one as well.
This is a sparsity parameter.
And I might even phrase this slightly differently.
Let's solve for batch size.
So we're just moving this back over to the other side.
We end up with batch size needs to be bigger than approximately 300 times sparsity.
So for example, in DeepSeek, I activate 32 out of 256 experts.
So this would be, like, eight for DeepSeek.
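The rule of thumb above is simple enough to compute directly. The sketch below is illustrative: the function names are assumptions, the hardware numbers are hypothetical values picked so the ratio comes out to 300, and the sparsity of 8 is the figure quoted above.

```python
def hardware_intensity(multiplies_per_s, bytes_per_param, mem_bytes_per_s):
    """Dimensionless compute-to-bandwidth ratio of the hardware."""
    return multiplies_per_s * bytes_per_param / mem_bytes_per_s


def balanced_batch_size(intensity, sparsity):
    """Batch size at which weight fetch time equals weight multiply time.
    sparsity = total parameters / active parameters."""
    return intensity * sparsity


# Hypothetical chip: 6e15 FP4 multiplies/s, 0.5 bytes per FP4 value, 1e13 bytes/s.
ratio = hardware_intensity(6e15, 0.5, 1e13)
balance = balanced_batch_size(ratio, sparsity=8)
print(ratio, balance)  # 300.0 2400.0
```

This balance point ignores the KV cache fetch, so it is a floor on the practical batch size rather than a target.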
Yeah.
Okay.
So this actually gives you a ballpark, which is remarkably accurate in practice.
Generally, people will go a little bit larger than this.
They don't really want to be exactly at the balance point, because real-world efficiencies aren't as good as a roofline analysis would say, but take this and maybe double it or triple it.
Okay.
So basically, it's like two to three thousand tokens per batch.
But then if you included the KV cache, yes, the implication would be that the optimal batch size should be larger.
So we solved for the equivalence point, when compute time is equal to memory time.
If I add in something that consumes more memory bandwidth, then I have less available for the weight loads.
And so I need to grow the batch size more.
This seems incredibly small.
This would be, like, less than one sequence, right?
Yeah.
Okay.
So I guess keep in mind that I'm talking about the number of tokens that I'm generating one more token for.
So it's actually two thousand unique sequences.
Okay.
We're just talking about the single forward pass.
Yes.
On these sequences.
So the batch is the number of sequences, rather than the number of tokens. That's right.
Okay.
Cool.
Yeah.
When I'm prepping for interviews, I often talk with experts in the field.
So for Reiner, I chatted with two Jane Street engineers, Clark and Axel.
Clark, who works on low latency trading systems, walked me through why Jane Street uses FPGAs to make sure that they have predictable nanosecond latency.
You can build these, like, giant grids of compute very easily that do exactly what you need, touch 100 megabytes of SRAM, and then get your response back in tens of nanoseconds, very easily, and that's basically impossible on a CPU.
He then went on to explain why CPUs just wouldn't work for this kind of thing.
And so if you have a clock that's going every three nanoseconds, you actually have several bytes of information at a time to make your decision.
That's as opposed to a CPU where you'll just collect a whole packet, you know, let's say a 1500-byte packet.
And it says, okay, this packet is ready.
Here you go, CPU, you can start thinking about it now.
FPGAs allow you to react to the earliest part of the packet as it arrives, rather than having to wait for the full thing.
We also talked about liquid cooling, network design, and many other things.
If you're interested in this stuff, Jane Street is hiring.
You can check out their open roles at Jane Street.com slash broadcasts.
And if you want to watch the full prep conversation, we posted it there too.
If you've got a frontier model, and you are actually doing inference, surely they must have more than 2000 concurrent users.
Yeah.
Is there any added latency from the fact that you need to have the whole batch fill up?
Or, if you have a reasonable number of users, is it the case that it would not take you even 100 milliseconds to fill up the next 2000 slots?
Yeah.
The way to think about this, I guess we can think of it as, when does the train depart, as a model.
Yes.
So let's say I've picked a batch size that I'm going to run at.
Maybe I pick this batch size.
Yeah.
And this intersection point, that's the same intersection point here.
So I pick this batch size.
I know that it's going to take, for example, something like 20 milliseconds, which is a common place to end up landing.
What I'm going to produce is, like, so this is a timeline of what is running on the GPU.
It's going to start a new batch every 20 milliseconds regardless.
And so this is 20, this is 40.
Thank you.
You can think of this as a schedule for the train, a new train departs every 20 milliseconds.
Any passengers who are ready board the train. If the train is full, they wait for the next train; if the train is not full, the train's going to go anyway.
And so in terms of what that means for queueing latency, it means that the worst case is that a request arrives just after the train departed.
It has to wait for the next train.
So that's up to 20 milliseconds.
And then it has to wait for that train to complete.
And so the worst case latency is 40 milliseconds.
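The train schedule can be sketched as a tiny function. This is a simplified illustration, assuming every forward pass takes exactly one period, the batch is never full, and a request arriving exactly at a departure time boards that train; the function names are hypothetical.

```python
import math


def completion_time_ms(arrival_ms, period_ms=20.0):
    """Time at which the first forward pass containing this request finishes."""
    # The request boards the first departure at or after its arrival time.
    departure_ms = math.ceil(arrival_ms / period_ms) * period_ms
    # The forward pass itself takes one period.
    return departure_ms + period_ms


def latency_ms(arrival_ms, period_ms=20.0):
    """Queueing wait plus one forward pass."""
    return completion_time_ms(arrival_ms, period_ms) - arrival_ms


print(latency_ms(20.0))   # arrives exactly at a departure: one period
print(latency_ms(0.001))  # arrives just after a departure: almost two periods
```

The latency therefore ranges between one and two periods, which for a 20 millisecond schedule is the 20 to 40 millisecond window described above.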
Sure.
How is the 20 milliseconds derived?
I mean, it's a rule of thumb, and where it comes from is not fully explained yet, but so far we've focused on memory bandwidth and compute time.
When we look at memory, the other consideration is that we want to use all of the memory capacity we have.
And so, generally we're going to use all of that memory capacity to store the weights or the KVs.
And so, in the time of doing a forward pass, maybe you want to read all of the memory capacity into the chip.
And so that is capacity divided by bandwidth, and that tends to be around 20 milliseconds on many different generations of HBM.
The units make sense.
You would have bytes divided by bytes per second.
Yeah.
So for example, on, I think, the Rubin generation, it is something like 288 gigabytes divided by 20 terabytes per second.
And this looks like it comes out to about 15 milliseconds.
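As a quick check of that arithmetic, here is a short sketch assuming decimal units (1 GB = 1e9 bytes, 1 TB = 1e12 bytes); the function name is illustrative.

```python
def hbm_sweep_time_ms(capacity_bytes, bandwidth_bytes_per_s):
    """Time to read the entire memory capacity once, in milliseconds."""
    return capacity_bytes / bandwidth_bytes_per_s * 1e3


# 288 GB at 20 TB/s, the figures quoted above: about 14.4 ms.
print(hbm_sweep_time_ms(288e9, 20e12))
```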
I mean, I understand why the units work out in the unit analysis.
But what it's saying is, we can evacuate and replace the HBM in this amount of time.
And so we don't want to be in the situation where the HBM is so big that we're not actually able to write everything we want to it or take everything out of it.
Or we don't want to be in the situation where our ability to read and write is too big or too small compared to the capacity.
Yeah, there's sort of two scenarios.
Why don't we pick a latency that is bigger than 15 milliseconds?
And if I think about what that means, it means I actually have time to read the HBM twice.
Yeah.
By the way, most HBM accesses are reads.
Right.
It's almost all reads, because the weight matrices are read-only.
And then almost all of the KV cache accesses are reads.
So in, let's say, around 30 milliseconds, I can read all of HBM twice.
But what's the point of that?
Like, I don't want to read the weight matrices twice.
I don't want to read the KVs twice.
Yeah, it makes sense.
It makes sense.
Okay.
So a couple of actually quick questions.
One, if it is the case that the optimal batch size is something like 2000,
and if that's actually true, it's only dependent on the sparsity, not on the model size or anything.
I mean, sparsity shows up in model size, but beyond that, it only depends on sparsity, not scale.
I mean, that's a very interesting result.
And that raises one question: how much of a push toward centralization is it that you have these economies of scale in inference from batching?
Yeah.
But it seems like it's not that big of a deal.
Like, I don't know, is 2000 users at the same time a lot?
It doesn't seem like a lot.
So we can do a bit of analysis on this. You can think of it in terms of a number of users, but maybe a more productive way to think of it is in terms of number of tokens per second.
So what does this batch size mean in terms of tokens per second of the system?
So tokens per second is going to be equal to the batch size divided by the step time.
We run a batch of that many tokens.
And then we do that every T.
So every step time, which is, let's say, the 15 or 20 millisecond number.
So this ends up being batch size itself times about 60.
So like 64 times B.
And so this ends up being around 2000 times 64.
So like 128k tokens per second.
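The throughput figure is just batch size divided by step time. A minimal sketch, assuming one token per sequence per step; the function name is illustrative.

```python
def tokens_per_second(batch_size, step_seconds):
    """System throughput when each step emits one token per sequence."""
    return batch_size / step_seconds


# 64 steps per second corresponds to a step time of 1/64 s (15.625 ms).
print(tokens_per_second(2000, 1 / 64))  # 128000.0
# A slower 20 ms step gives a somewhat lower figure.
print(tokens_per_second(2000, 0.020))
```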
So this is sort of in more digestible units.
It's hard to reason about concurrent users, but what is the global traffic for a system?
When you look at some of the announcements, sometimes the API providers will brag about how much traffic they have.
The numbers that I remember from some announcements from Gemini last year were in the hundreds of millions of tokens per second worldwide.
So about a thousandth, like this is one thousandth of Gemini.
Yeah, but I mean, Gemini is big, right?
And actually one thousandth of Gemini is a lot.
Yeah.
To be competitive at scale, you need to be able to serve at least one thousandth of Gemini.
Yeah, that's interesting.
Cool.
Okay.
So, the more sparsity you have, the less compute you need, and it does seem that as batch sizes get bigger, compute ends up being the bottleneck, according to this analysis.
So then the question is, how far can you take sparsity? That is to say, as the sparsity ratio increases, as you have fewer and fewer active parameters relative to total parameters, how much is performance of the model degrading?
And is it degrading faster than you're saving compute by increasing the sparsity factor?
Yeah.
So this is about the quality of the model rather than the speed of the model.
Yeah.
Yeah.
So, unfortunately, we're not able to answer that analytically.
That is an empirical question of model quality.
Best I can do is pull up a paper.
And answer that empirically.
Yeah.
Should we pull up the paper now?
Yeah.
So this paper, this is Unified Scaling Laws for Routed Language Models.
It's a somewhat old paper by this stage, but one of the things that they did is look at, if I keep increasing sparsity, what is the model quality impact?
This answer is very sensitive to the actual choice of mixture of experts technique. Mixture of experts has been around for a really long time.
I think it was maybe even back in 2017.
But the techniques have changed a lot.
DeepSeek's mixture of experts was a big change in how it worked.
There have been older papers, such as GShard and Switch Transformer.
So, the actual empirical results are going to depend on all of that.
But on one of the older techniques that is shown here, you can see if I hold constant the number of active parameters at a certain size.
And then I increase the sparsity, which they call expert count here.
The quality keeps increasing.
And then if you imagine drawing a horizontal line from 1.3B dense across, you end up seeing that, for example, in this case, the 64-expert, 270 million activated parameter model is as good as a dense 1.3 billion model.
So, in some sense, it's actually a diminishing return, where you need to increase total parameters a hundredfold to get the equivalent of 10x as many active parameters.
Yeah, I mean, actually even more so.
Yeah, it's a huge increase in parameter count for a modest increase in...
Yeah.
So, in this case, actually, what is it?
64x for 4x.
Yeah.
So, while it is, while it is true, I guess that the, you get this benefit of being able to economize on your compute time if you increase sparsity, naively would seem like, oh, that's a trade-off worth making.
But if this, this, you're decreasing this by 2x and then having this go up by 8x every time you double sparsity.
So, is that good or bad, actually?
Even from a memory point of view, keep in mind, you are doubling this portion of the memory fetches, which is amortized by batch.
And so, just, just keep running out larger batch size.
From the point of view of the analysis we've done here, this is pure win.
Keep doing it.
Keep doing it until you run out of available users, basically.
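The amortization argument here can be checked with a few lines of arithmetic. This is a minimal sketch, assuming the weight-fetch time per decode step is just total weight bytes divided by memory bandwidth and is shared evenly by every sequence in the batch; the function name and all numbers are hypothetical, chosen only for illustration.

```python
def weight_fetch_time_per_token(total_weight_bytes, batch_size, mem_bandwidth):
    """Seconds of weight-fetch time attributed to each sequence in the batch.

    Assumes every weight is read from memory once per decode step and that
    the cost is shared equally by all sequences in the batch.
    """
    return total_weight_bytes / mem_bandwidth / batch_size


# Hypothetical numbers: 1 TB of weights, 8 TB/s of memory bandwidth.
base = weight_fetch_time_per_token(1e12, batch_size=1024, mem_bandwidth=8e12)

# Doubling sparsity at fixed active parameters doubles the total weights...
sparser = weight_fetch_time_per_token(2e12, batch_size=1024, mem_bandwidth=8e12)

# ...and doubling the batch size brings the per-token cost back down.
sparser_bigger_batch = weight_fetch_time_per_token(
    2e12, batch_size=2048, mem_bandwidth=8e12
)
```

Under these assumptions, each doubling of sparsity is paid for by a doubling of batch size, which is why the limit becomes the number of available users.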
So, there's actually this equivalence: if I want to go sparser, then as long as I have a lot of users, I can go to a much sparser model.
So, from that point of view, it's a reasonable trade-off.
The other trade-off that shows up here is that it also consumes memory capacity, whereas so far we've only reasoned about memory bandwidth.
So, let me just make sure I understood.
You're saying, we want bigger, we want this one less type of building.
Therefore, we do more sparsity to make that work.
We need bigger batch sizes, which means we need more memory capacity.
Yeah.
So, how does that work?
Yeah.
So, maybe this would be a good point to actually talk about how a mixture-of-experts layer is typically laid out on a rack of GPUs or something.
Yeah, we're really sparsity of experts.
Yeah, maybe how we lay that out on a GPU.
So, let's zoom in on the mixture of experts layer first and sort of draw what that looks like.
So, we typically will have some kind of router layer, which makes the decision of which experts we route the tokens to.
So, your tokens come in here and go through a router layer.
And then we have a bunch of different experts. I'll draw a few more to line them up.
And then the router will make a decision on which experts am I going to route to.
And it'll be a small fraction of them, maybe one in 32.
So, maybe it'll make a decision to route.
So, this one, maybe this one, maybe this one, maybe this one.
These experts.
So, each expert itself is a normal MLP.
It has an up-projection and a down-projection, with a nonlinearity in between.
And then finally, we sort of do the inverse operation.
So, where we were broadcasting things out here, we're now going to bring them back in and sum them up.
So, bringing them in like this.
And then finally, we have our residual connections.
The token is also passed through here and it gets added to the result of the MLP layer.
So, this is a normal MoE layer.
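The layer just described (router, a few chosen experts, sum, residual) can be sketched for a single token in plain Python. This is an illustrative sketch, not any particular model's implementation: the ReLU nonlinearity, the softmax gate weights over the chosen experts, and all sizes are assumptions, since the conversation only says the expert outputs are summed.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def expert_mlp(x, W_up, W_down):
    """A plain MLP: up-projection, ReLU nonlinearity, down-projection."""
    hidden = [max(0.0, h) for h in matvec(W_up, x)]
    return matvec(W_down, hidden)


def moe_layer(x, W_router, experts, top_k):
    """Route one token to its top_k experts, sum their outputs, add the residual."""
    scores = matvec(W_router, x)  # one routing score per expert
    chosen = sorted(range(len(experts)), key=lambda e: scores[e], reverse=True)[:top_k]

    # Assumption: softmax gate weights computed over the chosen experts only.
    top = max(scores[e] for e in chosen)
    exps = [math.exp(scores[e] - top) for e in chosen]
    gates = [v / sum(exps) for v in exps]

    combined = [0.0] * len(x)
    for gate, e in zip(gates, chosen):
        W_up, W_down = experts[e]
        y = expert_mlp(x, W_up, W_down)
        combined = [c + gate * yi for c, yi in zip(combined, y)]

    # Residual connection: the token is passed through and added to the result.
    return [xi + ci for xi, ci in zip(x, combined)], chosen


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d_model, d_ff, n_experts, top_k = 4, 8, 32, 1  # hypothetical sizes: 1 in 32 active
W_router = random_matrix(n_experts, d_model, rng)
experts = [
    (random_matrix(d_ff, d_model, rng), random_matrix(d_model, d_ff, rng))
    for _ in range(n_experts)
]
token = [1.0, -0.5, 0.25, 2.0]
output, chosen = moe_layer(token, W_router, experts, top_k)
```

Only the experts in `chosen` do any work for this token, which is what makes the traffic pattern depend on the router's decisions.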
What I want to talk through is how this is mapped to something like a GPU rack, and what this means for communication.
Because I think this will start to show some of the limits of how sparse we can go.
So, the standard practice here, and it is the best solution, is to use expert parallelism.
Meaning, different experts go on different GPUs.
So, if we take something like a DeepSeek model, they have 256 experts.
Let's say we want to run that on a Blackwell rack.
So, there are 72 GPUs.
We have a divisibility problem.
This is not a power of two.
So, we'll just simplify and say we're only going to use 64 of them.
Just ignore the other eight.
It's not a big deal.
And so, we have four experts per GPU.
For the sake of the diagram, I'll actually just say let's say we have two experts per GPU.
So, we end up just putting these at GPU boundaries.
Every pair of experts is on its own GPU.
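The placement arithmetic above is simple enough to write down. This is a small sketch, assuming experts are assigned to GPUs in contiguous blocks; the function name is hypothetical, and the 256-expert, 64-GPU figures are the ones used in the conversation.

```python
from collections import Counter


def place_experts(n_experts, n_gpus):
    """Map each expert index to a GPU index, in contiguous equal-sized blocks."""
    if n_experts % n_gpus != 0:
        # The divisibility problem mentioned above: 256 experts on 72 GPUs.
        raise ValueError(f"{n_experts} experts do not divide evenly over {n_gpus} GPUs")
    experts_per_gpu = n_experts // n_gpus
    return {expert: expert // experts_per_gpu for expert in range(n_experts)}


placement = place_experts(256, 64)         # use 64 of the 72 GPUs
experts_per_gpu = Counter(placement.values())
```

With 256 experts on 64 GPUs every GPU holds four experts; the diagram's two-per-GPU layout corresponds to `place_experts(128, 64)`.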
And then we can look at the communication cost.
We had some experts stored.
There's some tokens stored centrally here.
They get routed to all of these experts.
And so, there is some communication cost to be paid here.
There's the same communication cost to be paid on the output.
And then the hope is that this does not become communication limited.
Now, what is the traffic pattern here?
The traffic pattern here is that any GPU, in fact, will be talking to any other GPU, depending on the decisions made by the model.
So, this is an all-to-all traffic pattern.
So, when you say any GPU, in that case the router is on more than one GPU?
Yeah, the router.
So, I drew this as one router.
In reality, you would actually have many copies of the router.
And so, you would have as many routers as GPUs in fact, as the incoming traffic.
Yeah.
So, these are 64 GPUs.
These are 64 GPUs.
It's actually the same GPUs; we just draw them separately because they're serving different purposes.
So, at this point, any GPU can be sending to any other GPU.
So, the way the Blackwell racks are configured is a perfect fit for this all-to-all communication pattern that the MoE actually wants to do.
However, if you think maybe one rack is too slow and I want to use two racks, then I have this challenge that maybe I've got some sort of rack boundary drawn outside.
Yeah, like this.
And I no longer, in fact, have all-to-all communication between all the GPUs in the two racks.
And so, the rack-to-rack communication ends up being a substantial bottleneck.
So, the fundamental thing here is that one rack actually bounds the size of an expert layer you can do.
And so, this has been part of what's been driving towards larger and larger interconnect domains.
Yeah.
But before we go on, maybe it's worth explaining what exactly a rack is, and the differences in bandwidth across racks versus within a rack.
Yeah.
And the all-to-all versus not all-to-all nature of communication within versus outside.
Yeah, and this is a place where it tends to be very different, in fact, between Nvidia, for example, and Google, and then others including us.
So generally a rack is a physical structure. It's a few meters tall and a meter or two wide, depending on configuration, and it stores some number of GPUs or XPUs, which is typically about 64.
What constrains it to a certain size is power delivery, weight, and cooling ability.
It ends up being about this size, in many cases, because of these physical constraints.
So, then when I deploy a data center, I've got a data center may have thousands of these racks.
So, I've got one of these tall racks that's got a bunch of GPUs in it.
And so on, and then I put another rack.
Um, natural.
You make it sound so easy.
Yeah, right, just like that.
Drop them in.
In Nvidia's case, the communication topology is actually that they put the GPUs on the outside of the rack, and then they put the switches on the inside of the rack.
So, what this ends up being is that there's a set of switches in here.
These are the NVSwitches.
And then they run a bunch of cables.
Every single GPU has cables going into the switches in the middle.
So, every GPU goes to the switches in the middle, and then the switches have connections to all the GPUs.
So, all of the GPUs can talk to all the other GPUs in just two hops: going to a switch, then going to the other GPU.
Now, when I want to leave the rack, I end up going via a different path.
The GPUs also have a much slower connectivity, which is typically about eight times slower. The green that I drew here is, in the Nvidia case, NVLink; more generally it's called the scale-up network.
This is the scale-up network.
You will typically also have a scale-out network, which allows you to connect to some data center switch.
So, data center switch.
And then all of the GPUs will have some kind of connectivity to some data center switch somewhere.
So this is the scale-out.
And it tends to be about eight times slower in bandwidth.
So, the challenge, if you want to, for example, lay out a mixture-of-experts layer across two racks, is that half of the GPUs here are going to want to talk to the GPUs over here.
And so, just on average, when I look at where the tokens on these GPUs want to go, half of the tokens want to go inside the rack.
That's great.
They can use the, the fast scale up network.
But half the tokens are going to want to leave the rack and go to the other rack.
And that's not as good.
They're going to need to use a much slower network.
And so, that becomes the bottleneck on the all-to-all pattern.
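A rough way to see why the slower network dominates is to split each GPU's outgoing all-to-all traffic into the share that stays in the rack and the share that leaves it. This is a sketch under simplifying assumptions: routing is uniform across racks, the two networks run in parallel, and the bandwidth units are arbitrary. The function name and the numbers are hypothetical, apart from the roughly 8x gap quoted in the conversation.

```python
def all_to_all_times(bytes_per_gpu, n_racks, scale_up_bw, scale_out_bw):
    """Return (time on scale-up, time on scale-out) for one GPU's outgoing traffic.

    Assumes tokens are routed uniformly, so a fraction (n_racks - 1) / n_racks
    of the traffic has to leave the rack over the slower scale-out network.
    """
    remote_fraction = (n_racks - 1) / n_racks
    local_fraction = 1.0 - remote_fraction
    t_scale_up = bytes_per_gpu * local_fraction / scale_up_bw
    t_scale_out = bytes_per_gpu * remote_fraction / scale_out_bw
    return t_scale_up, t_scale_out


# Two racks, scale-up 8x faster than scale-out (arbitrary units).
t_up, t_out = all_to_all_times(bytes_per_gpu=8.0, n_racks=2,
                               scale_up_bw=8.0, scale_out_bw=1.0)
```

With two racks, half the tokens cross the boundary, and that half takes eight times as long as the half that stays local, so the all-to-all finishes at the pace of the scale-out network.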
A different choice would be, well, why don't I have a big switch here and connect everything to a much bigger switch that actually combines the two racks together?
There are many ideas in this direction, but in general, the reason you have this sort of hierarchy of switches rather than one big switch is to manage the cabling congestion.
You just need to have a lot of cables.
Sorry, is the question you just asked basically why isn't there a bigger scale-up?
Yeah, exactly.
Why not just have, like, a million chips in the scale-up?
What has changed that has allowed Nvidia to go from Hopper, which was 8, to Blackwell, which is 72, and now Rubin, which will be, I don't know, 500 or something?
Yeah, what has allowed that to happen?
So from Hopper to Blackwell, it's mostly just the decision to switch from trays as the form factor (one of these is a tray) to racks as the form factor.
But that's a product decision.
Yeah.
Um, there wasn't a substantial technical barrier there.
Switching from 64 to 500 or so, there's a bit of Jensen math there, but there's at least a genuine 4x increase, which is coming from a much more complicated and difficult rack design.
And so that is actually new physical design to run more cables.
And the cable complication is just the, uh, the, the, the class figuring out which cable hops to which, or like, wish it to signal.
Yeah.
So what I mean is, let's sort of zoom in on this and look at the wire density.
I'll draw this diagram once more, so we have a cleaner, larger version to work with.
Um, let's say I have some switches in the middle.
Yeah.
Let's say initially I'm going to start with just two GPUs on each side, or two trays of GPUs on each side.
And let's say each tray wants to have two cables coming out of it.
So I physically run a few cables that look like this, to the two switches.
Now if I want to double the number of GPUs in a rack, I need to run literally twice the density of cables.
So I need to run these as well.
Um, Excuse me, question.
Yeah.
But if you look at a physical data center, seems like there's a lot of space within a rack.
I don't know, just like the cables are like really big.
And yeah, so there is space outside the rack.
Inside the rack, as they become more optimized, these racks are very tight.
So there's connector density going from the tray into the rack's backplane, and then the backplane itself has a really high density.
There are other physical constraints, including the bend radius of cables; you don't want to snap them.
Yeah.
Oh, there's this little, the physical space to put a cable.
Yeah, that's concerning it there.
I don't idea.
Interesting.
That seems surprising. The rack is so big, and we can't just stuff more cables in there?
Yeah.
So I mean, rack design is not my expertise, but when I talk to folks, these are the constraints they're up against.
It's a combination of things. The big physical things you're optimizing for are space and the weight of the rack; it's actually really heavy.
And so you need enough metal to not sag and roll.
But then you add more metal and it's heavier, and then there's power and cooling.
And so all of those are competing; modern racks are pushing all of those to very extreme physical limits.
Deep work is by its nature quite aversive, so even things which seem like work, like Slack and email, can be easy ways to distract yourself.
So I often wish that I could just turn the internet off.
But if I'm prepping for an interview, even if I have the papers and books on hand, it's still super useful to be able to do a back and forth with an LLM.
I can break down concepts and research follow-ups.
Google's new Gemma 4 is the first open model that allows me to have this kind of fully disconnected focus machine.
It's small enough to run on my laptop, but good enough to actually be useful.
So to prep for this episode, I downloaded Reiner's scaling book and shut off the internet.
I was able to have Gemma help me understand the material and answer my questions.
If you want an LLM that can run locally on your laptop, or even your phone, you should check out Gemma 4.
When was GPT-4 released again?
It was 2022 or 2023.
It was rumored to be over one trillion parameters.
It seems like only now, within the last six months, are models getting released that have significantly more parameters than a model released three years ago.
Supposedly there should have been this scaling in the meantime. Is the reason that we were just waiting for racks with enough memory to hold a five-trillion-parameter model along with its KV cache for a lot of sequences? Or, if you're doing RL, it's kind of a similar consideration of actually holding the KV cache for all the batch problems you're trying to solve.
So if you look at Hopper, you had eight Hoppers, and I think that's 640 gigabytes, as of 2022.
With Blackwell, which was finally deployed, I mean, last year, you finally have a scale-up domain on the order of 10 to 20 terabytes, which is enough for, like, a 5T model plus KV cache.
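The capacity argument can be checked with back-of-the-envelope numbers. This is a sketch only: one byte per parameter (8-bit weights), a 2 TB KV cache, and a 10 TB Blackwell scale-up domain are all assumptions for illustration; only the 640 GB figure for eight Hoppers and the 5T parameter count come from the conversation.

```python
GB = 10**9
TB = 10**12


def fits_in_scale_up(n_params, bytes_per_param, kv_cache_bytes, capacity_bytes):
    """True if the weights plus the KV cache fit in one scale-up domain's memory."""
    return n_params * bytes_per_param + kv_cache_bytes <= capacity_bytes


model_params = 5 * 10**12      # the "5T model" from the conversation
bytes_per_param = 1            # assumption: 8-bit weights
kv_cache = 2 * TB              # hypothetical KV cache budget

hopper_domain = 640 * GB       # eight Hopper GPUs
blackwell_domain = 10 * TB     # assumed size of a Blackwell scale-up domain

fits_on_hopper = fits_in_scale_up(model_params, bytes_per_param, kv_cache, hopper_domain)
fits_on_blackwell = fits_in_scale_up(model_params, bytes_per_param, kv_cache, blackwell_domain)
```

Under these assumptions the weights alone are about eight times larger than an eight-GPU Hopper domain, while a rack-sized domain leaves room for the KV cache as well.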
Yeah, deploying in larger scale-up domains is a huge unlock.
I mean, I've drawn here the sort of Nvidia Blackwell deployment; the Google deployment has actually had very large scale-up domains for a while.
And does that also explain why Gemini seemed to be ahead, like with Gemini 2.5?
Or it just seems like Gemini had successful pre-trains for longer than some of the other labs.
Not having been there at the time, I'm not sure how much is coming from successfully deploying higher sparsity, which it could be.
It could also be, I mean, there's a whole bunch of actual modeling things, like specifically how you do the mixture of experts. The DeepSeek mixture of experts said: actually, activate more experts, but finer-grained experts. That was a big innovation.
Yeah.
I'm sure there are many other innovations on the model architecture, as well as on the training data.
It's kind of hard to disentangle all of them, but what shows up in terms of the limits of what you can do is that the active parameters, as we saw, are limited by the compute cost, and the total parameters are limited by the scale-up size.
Yep.
When you're operating within a single scale-up domain, is that a consideration specifically for either forward or backward, or specifically for prefill versus decode, or is it preferred to always be within a scale-up domain?
Yeah.
Whatever kind of workload you have, whether you're doing a pre-training run, or RL generation, or inference for users.
Yeah, really interesting.
So, okay.
So, to answer that question, we're going to need to talk about the communication patterns. We've talked about the mixture-of-experts communication pattern.
That is this all-to-all, and all-to-all very strongly favors full connectivity, which is what we've just shown here, and favors being within one rack.
There are other kinds of parallelism besides expert parallelism, which we just showed here. One in the literature is tensor parallelism.
With the trend towards smaller experts, this has become much less relevant, so we can ignore that.
But the other two things that we have available are data parallelism and pipeline parallelism, and they can be a much better fit for using multiple racks.
So, let's focus on pipeline parallelism specifically.
Um, this is one layer of MOE.
Um, I'm going to have like a hundred more layers up above.
I could decide at this point, for example, to move to a different rack, change rack.
Now, is that going to be a communication bottleneck?
So, we can actually just solve for when this becomes a communication bottleneck, but before we do that algebraically, let's just visualize it and sketch the path.
So, this is another MoE layer.
I'm going to have another MoE layer above, and so on.
So let's say I change rack here, and then some number of layers later, I change rack here as well.
So, the methodology that we're going to use to determine whether we have a communication bottleneck at this point where we change rack is to compare the scale-out bandwidth requirements to the scale-up bandwidth requirements.
So, the hint is going to be that there are a lot more sends here: we're sending many things here, whereas we're only sending one thing here, and then we're also maybe doing it many times.
So that's going to be what makes the difference.
Uh, can I try to guess?
Yeah, I'm just starting to ask it to see if I'm actually understanding.
Um, it seems like you're sending like batch size into the rack in here.
Yes.
Uh, but the communication within a rack is sort of batch size times number of GPUs.
Yeah.
So, a number of activated GPUs, right?
So, like, I, I don't send to this GPU at all, right?
So, there's an explosion from one to, like, three times larger here in this time.
Yeah.
Um, the key thing is that I, I didn't even need to send to this GPU at all, and so that's a big saving.
I see it.
Yeah.
Okay.
So, we're going to talk through to what extent scale-up is a bottleneck relative to scale-out.
So, we will jump directly to the ratio of the time spent on scale-up over the time spent on scale-out.
That's the quantity we're talking about.
And the first consideration is that scale-up is eight times faster than scale-out, generally.
And so, at a baseline, if the amount of data sent were the same, we would have this one over eight, which is coming from bandwidth.
But then we have some amount of expansion in how much data we're sending.
So, if one token comes in here, then this one token gets routed to, in the DeepSeek case, maybe 32 experts, or 16 experts; it gets routed to some number of experts.
So, this is the number of activated experts. And the same thing applies on multiple different layers, so maybe I'm going to run two layers.
So, there's also a multiple of the number of layers per stage.
And shouldn't you multiply the whole thing by two, for the all-to-all on the way back? Yes, there's a factor of two.
Thank you.
So, what we would like is for the scale-up time to be greater than or equal to the scale-out time, because the scale-up is the more important and precious resource.
And so, we would like this number to be greater than or equal to one.
And this really doesn't seem hard; there's just a factor of eight that we need to overcome.
So, we need the product of these three things to be bigger than eight. Typically we have a fairly large number of activated experts.
It could be eight by itself, and then we can increase the number of layers per stage a lot, until we have satisfied this.
So, what this ends up looking like is that I can, in fact, have an entire pipeline of racks where one rack does one layer, and then I move on to the next rack and do another layer, and then I move on to the next rack.
I can do another layer.
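The ratio worked out above can be written as a small calculation. This is a sketch of the back-of-the-envelope model in the conversation, not a measured result: it assumes each token crossing a rack boundary is sent once over scale-out, while inside the rack it is sent to every activated expert, in both directions (out to the experts and back), for every layer in the pipeline stage. The function names are hypothetical, and the factor of two is my reading of the exchange above.

```python
import math


def scale_up_over_scale_out(n_activated_experts, layers_per_stage,
                            bandwidth_ratio=8.0, directions=2):
    """Ratio of time spent on scale-up to time spent on scale-out.

    bandwidth_ratio: how many times faster scale-up is than scale-out.
    directions: 2 counts both the dispatch to experts and the return trip.
    """
    expansion = n_activated_experts * layers_per_stage * directions
    return expansion / bandwidth_ratio


def min_layers_per_stage(n_activated_experts, bandwidth_ratio=8.0, directions=2):
    """Fewest layers per stage for which the ratio is at least one."""
    return max(1, math.ceil(bandwidth_ratio / (n_activated_experts * directions)))


ratio_one_layer = scale_up_over_scale_out(n_activated_experts=8, layers_per_stage=1)
layers_needed = min_layers_per_stage(n_activated_experts=8)
```

With eight activated experts the ratio already exceeds one at a single layer per stage, which is why one rack per layer (or per few layers) is workable in this model.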
It's interesting to me that the best parallelism strategy in practice ends up being one which physically resembles the actual architecture, and is not some galaxy-brain thing.
You know, it's like, oh, we have experts.
We're going to put them on different GPUs.
So, we have different layers.
We're going to put them on different racks.
I feel that it's interesting that the physical layout and the model architecture match.
Yeah, exactly.
Yeah, I mean, it could have been something wackier with the tensor parallelism and whatever.
Yeah.
So, I mean, I think a way to think of it is, I mean, okay, the galaxy brain way to think of it is, like, what are all the different dimensions in which a model is scaled up?
And so there is, it is scaled up by layers.
It is scaled up by the, like, the model dimension.
It is scaled up by the d_ff dimension.
It is scaled up by the number of experts.
Every single one of those numbers you can choose to cut along.
Um, and if those numbers are big enough, it eventually becomes profitable to go along there.
Yeah.
Um, and we have selected two of them.
The other two, in the way typical models are sized, are not profitable.
Yeah.
So, there's, um, talk about earlier or he says today, we know not to do pipeline parallelism.
And, or see give my friends in me, I hate that it's something about their chips.
Coat.
But he gave a lecture on these different types of parallelism.
And he said the problem with pipeline parallelism, other than the bubbles, is that it creates these architectural constraints.
Yes.
On, like, Kimi, for example, which has these residuals where attention attends to layers a few back.
Yeah, layers a few back, and so that becomes hard to implement in this way.
Yeah.
So, and I guess we didn't really fully articulate even what the benefit is that we're getting from pipelining.
Yeah.
And so these complexities are real.
Pipelining is a massive hassle.
But it does give you some benefits.
And then you can decide whether the benefits are worth the costs.
The biggest benefit that shows up...
So it has some benefits, and maybe big benefits, in training.
Um, in inference, what are we saving?
Are we saving on, um, memory time or compute time, not really.
We're just moving the memory time from one chip to another chip, um, or one rack to a different rack.
There's no actual benefit in runtime.
However, what we are saving on is memory capacity: the amount of memory used per rack.
If we think that the memory in a rack is a bottleneck, then there's a constraint on how fast we can go.
Pipelining allows us to massively reduce that bottleneck.
I guess there's an opposite connotation to this. Before this interview, I was chatting with Axel, a GPU performance engineer at Jane Street, and he was explaining that to do pipelining, you have to do micro-batches rather than full batches.
Uh-huh.
And if you do micro-batches, then you're, by definition, not able to amortize loading the weights.
That's right.
Across all the users, or all the sequences.
And so the positive connotation of that is you don't have to use this memory.
The negative connotation is that we can't amortize loading the weights across all those users.
Maybe it's worth explaining why you have to do micro-batches.
Should we draw the pipeline bubble?
Um, yeah.
Okay.
So, what is this micro-batching that shows up in pipeline parallelism?
I'll focus on inference first.
It's just a slightly simpler problem.
So I'm going to draw this: this axis is time, and this axis is which rack we're on.
And so the idea is that, maybe I'll have, like, four racks.
So I've got, um, uh, an inference that is going to, like, step through these four racks in some time, like this.
So, great.
So this is inference number zero.
It runs at a certain batch size, and it steps through all the pipeline stages like this.
Now, if we were to say, well, we're going to run inference number one here.
Like, this is clearly, like, a massive waste, right?
Like, three quarters of the time, each of the racks is doing nothing.
So we don't actually run inference one here.
We rather start it as soon as we can, which is immediately after inference zero finishes here, like this.
And we keep going.
Um, so if we hadn't filled this in, we would call this the pipeline bubble.
When I've drawn it in this inference context, where we're only doing a forward pass, it's obvious: why would you do the stupid thing?
In a training context, it's maybe less obvious, but in the inference context, it's really natural to make this change.
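The time-by-rack diagram described here can be reproduced as a small grid. This is an illustrative sketch, assuming every pipeline stage takes exactly one time step per batch; the function names and the choice of eight inferences are hypothetical, while the four racks match the example in the conversation.

```python
def schedule(n_racks, n_batches, pipelined):
    """Build a rack-by-time grid; each cell holds the inference index or None (idle).

    pipelined=False starts each inference only after the previous one has left
    the last rack; pipelined=True starts it as soon as the first rack is free.
    """
    stride = 1 if pipelined else n_racks
    n_steps = (n_batches - 1) * stride + n_racks
    grid = [[None] * n_steps for _ in range(n_racks)]
    for batch in range(n_batches):
        for rack in range(n_racks):
            grid[rack][batch * stride + rack] = batch
    return grid


def utilization(grid):
    """Fraction of (rack, time step) cells that are doing useful work."""
    busy = sum(cell is not None for row in grid for cell in row)
    return busy / (len(grid) * len(grid[0]))


naive = utilization(schedule(n_racks=4, n_batches=8, pipelined=False))
filled = utilization(schedule(n_racks=4, n_batches=8, pipelined=True))
```

In the unpipelined schedule each rack is busy a quarter of the time, matching the "three quarters idle" observation; filling in the gaps leaves only the ramp-up and ramp-down at the two ends idle.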
Oh, you're just saying.
So the difference between micro-batch and batch doesn't matter at all in inference, because you can just call it whatever you want.
Yeah.
It only matters in training because there is an optimal batch size.
Yes.
And before you do the backwards step, you want to have accumulated.
Before you do a full backwards step, you want to have accumulated all the sequences in that batch, and if you wanted to pipeline and training, in order to avoid that bubble, you need to, should we draw the training diagram?
Yeah, yeah, that's it.
So this is the inference diagram, and I'll label it forward, just so we don't have the wrong things showing up there.
So, let's do the same thing for training now.
We've got a forwards pass, but at some stage, you're going to have to transition to a backwards pass.
So, we'll do some number of batches in the forward pass, and then we're going to transition to the backward pass for everyone.
So, the inference part is the same here, but then we do a hard stop at this point, and then transition everyone to the backward pass.
A similar number, like this.
It may be worth clarifying: the reason there is that hard stop is because you want to do a whole batch at once for the backward step, and there is an optimal size for how big that batch should be.
Yeah, I mean, smaller is always better, actually.
This is a way to put it: from an ML convergence-rate perspective, smaller is always better, because basically you're getting the freshest information from the gradient descent. But from a total training time perspective, smaller is worse from a systems perspective, and so the optimum is the trade-off between the two.
So, you pick a batch size, and then, for that batch size, you do some amount forward, and then some amount backward.
You ask why, why is there even a hard stop?
In pipeline parallelism, because of the fact that you've got this idle time here, which is the bubble, there are many techniques in the literature for how to lay this out differently and avoid that.
There are more complicated schemes called, like, zero bubble or one forward, one backward, which interleave the forward and the backward in complicated ways, but it can be quite a mess.
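For the simple schedule drawn here (all micro-batches forward, a hard stop, then all micro-batches backward), the size of the bubble has a standard closed form. This is a sketch under the assumption that every stage takes the same time per micro-batch; the function name is hypothetical, and the formula is the usual one for this kind of fill-and-drain schedule, not something stated in the conversation.

```python
from fractions import Fraction


def bubble_fraction(n_stages, n_micro_batches):
    """Idle fraction of a fill-and-drain pipeline schedule.

    Each phase (forward, then backward) lasts n_micro_batches + n_stages - 1
    slots, of which every stage is busy for n_micro_batches slots.
    """
    return Fraction(n_stages - 1, n_micro_batches + n_stages - 1)


one_micro_batch = bubble_fraction(n_stages=4, n_micro_batches=1)
four_micro_batches = bubble_fraction(n_stages=4, n_micro_batches=4)
many_micro_batches = bubble_fraction(n_stages=4, n_micro_batches=32)
```

More micro-batches per optimizer step shrink the bubble, which is the tension with preferring a small batch for convergence.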
Yeah, right, right.
More usefully, you can do the weight gradient step, but you could also mine be cut.
Yeah.
So, in inference, actually, the effect of pipelining on anything you care about, like batch size or latency, is neutral.
It doesn't improve it, and it doesn't make it worse.
So, if you look at the latency of this inference, pipelined versus all on one rack: if you were all on one rack, we would just slide all of the boxes down and still put them in a row, and the latency would be the same.
Pipelining is neither better nor worse for latency, but it does mean that you use less memory per rack, as in memory capacity, because now instead of needing the whole model, you only need a quarter of the model.
So, it's basically a no-brainer to use pipelining during inference, but there's this harder trade-off during training.
So, even in inference, in fact, it's not used at all.
It reduces your memory capacity, requirements.
There's actually a huge surplus. Like I think you were saying, a rack of Blackwell has many, many terabytes, maybe tens of terabytes. That's much bigger than a trillion-parameter model; a trillion-parameter model only needs one terabyte.
And so, it already fits, in fact.
And so, there's not a huge benefit from pipelining, because you're reducing a number that's already pretty small.
But it does say that theoretically maybe you had too much, and maybe you could have built different hardware that has less memory.
If you were designing hardware and you said, I actually don't need that much memory, because I don't need the weights to fit in one rack.
I can fit the weights in eight racks.
Then I could have maybe built hardware that didn't have so much HBM per GPU.
Last week, Horace was kind enough to give me and my friends a great lecture on large-scale pre-training systems.
And there were some concepts that I wanted to animate for a write-up on my blog, like how weight shards and gradients flow, depending on the parallelism that you're using.
So, I gave Cursor my lecture notes and a sketch that I made during the lecture, and I asked it to visualize a specific hierarchical collective that Horace had explained.
The first version was already pretty good, and then I was able to use design mode to select and tweak any specific components from there.
I was able to do all of this without a clear and stated mind.
Cursor's Composer 2 fast model was quick enough that I was able to iterate almost instantaneously.
I could try an idea, test the results in the built-in browser, and immediately make any changes.
I went through 10 different versions in under 20 minutes.
If you want to check out this animation, I published it along with the lecture notes in a blog post, the link is in the description.
And if you want to try this kind of iterative design flow for yourself, go to cursor.com slash lore cache to get started.
So, macro question: everybody's talking about the memory wall right now, and RAM is getting super expensive.
There's not enough memory; smartphone volume will go down 30 percent because there's not enough memory.
Hyperscalers are spending, this is shocking...
Dylan said they're spending 50 percent of their capex this year on memory.
That's huge.
So, what is hyperscalers' capex?
That's like high hundreds of billions.
It's maybe a trillion.
Are they spending half of that on memory?
That is a huge constraint.
That's why we're not going to get new laptops and phones this year.
But at the same time, we have too much memory.
People are willing to put too much memory into this system.
Right.
So why waste it by shoving all this memory into these racks if you don't need it?
In the equations we had here before we erased them, we were looking at time: memory bandwidth and compute throughput.
Let's now start looking at memory capacity.
So, we'll start off with memory capacity without even thinking about parallelism scheme.
So, the capacity of memory or the demand on memory is the number of total parameters.
Plus, so this is what we need to fit the weights in some system that we are using.
And then we need to fit the KVs as well.
So, KVs go as batch size, times the length of the context, times the bytes per token.
So, what I was arguing in this context, and the case I was making for pipelining, is that there are some techniques that allow us to solve this.
So, let's consider, so we're going to run this on some number of GPUs and we're going to say we're going to have one extent which is E is going to be the X-met parallelism.
So, how many when we had this charting of X-metre layer across many GPUs, how much of that to what extent do we do that?
How many GPUs?
So, we're going to say that this is, for example, 64, and then P is going to be the extent of pipelining, pipelining.
So, this is the number of racks which, who knows maybe, maybe we'll pick four or something like that.
Well, we want to calculate, so this is the, this is like the total total memory requirement across the system, but now I'm going to calculate a memory requirement per GPU.
So, per GPU memory requirement, we're going to have, I guess I'll use a lot of the case, c, mm, and well, obviously we should take all of this numbers and divide it by E and p, really.
So, it's this end total, plus the batch times length of context and bytes a token.
All of this is divided by E and p.
Okay, so why is this correct?
Why just divide it this way?
Well, we knew that the parameters were perfectly divided amongst all the GPUs in a rack.
And the layers were perfectly divided amongst the different racks.
So, that works here, and somehow, I'll handwave exactly how, we can arrange the same perfect sharding of the context across GPUs in a rack, and only by layer across racks.
And the four is the number of racks?
Yeah, for example.
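The per-GPU capacity estimate described above can be written out as a small sketch. The function name, the `bytes_per_param` argument (the spoken formula just says "number of total parameters"), and all the workload numbers are illustrative assumptions, not figures from the lecture.

```python
def per_gpu_memory_bytes(n_total_params, batch, len_context,
                         kv_bytes_per_token, E, P, bytes_per_param=1):
    """Per-GPU memory demand: (weights + KV cache) / (E * P).

    E is the expert-parallel extent (GPUs per rack) and P is the
    pipeline extent (number of racks). Perfect sharding is assumed,
    as in the lecture's hand-waved version.
    """
    weight_bytes = n_total_params * bytes_per_param
    kv_bytes = batch * len_context * kv_bytes_per_token
    return (weight_bytes + kv_bytes) / (E * P)


# Hypothetical workload: 1T parameters at 1 byte each, 8,192 sequences
# of 100K tokens, 2 KB of KV cache per token, E = 64, P = 4.
demand = per_gpu_memory_bytes(
    n_total_params=10**12, batch=8192, len_context=100_000,
    kv_bytes_per_token=2048, E=64, P=4)
print(f"{demand / 1e9:.1f} GB per GPU")
```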
So, this is the place where we actually need to go back and analyze this batch size B.
And you were making this comment that there's micro batching versus global batching.
So, let's come back to this pipelining diagram here.
We've got one batch going forward here.
And then, as I drew it, it kind of just like disappeared.
That's not really correct.
If you think about how decode is working, I have a bunch of tokens that I have generated already.
I do one forward pass where I generate a new token.
And then I write that to my KV cache.
And then I do another forward pass that generates the next token.
So, I'm actually going to be running this batch zero in a loop.
So, in fact, I go forwards.
Once I finish, I can start the next iteration of the loop up here.
Yeah.
So, I'll just fill this in.
We'll drift forward.
We'll have the...
Oh, uh, nice.
Yeah.
So, we've got zero, one, two, three.
So, let's split this batch.
This batch will be the global batch size.
So, B is going to be the number of micro batches times the batch size per micro batch.
So, how many micro batches do we need?
So, the number of micro batches in this diagram is four: zero, one, two, three.
And then the batch size per micro batch, the micro batch size, is still this 2,000-ish number.
This is the one that is, sorry, this is the 300 times sparsity.
This is how big the train that leaves every 20 milliseconds is.
Right.
Yes.
This is going to be the 20-millisecond train.
So the global batch size is the number of micro batches times the local batch size, and the local batch size is set by this hardware parameter.
The number of micro batches is the smallest possible such that we can wrap around and not leave any idle time when we wrap around.
If we had fewer, we would have this idle time when we wrap around.
And so, you can sort of just visually see that it is equal to the number of pipeline stages.
I sort of proved it visually here: it is four, and it's four this way as well, but you can look and see that it goes along here and then it wraps around, so it's the number of pipeline stages.
Yeah.
It's a very, very basic question.
Is this what is actually done?
Uh-huh.
Okay.
Like, as in, a frontier model today will actually have, during inference, a pipeline?
Uh, for sure during massive scale training, this is done.
Um, it can be done for inference.
I'm actually going to make the case for why it is less attractive.
It is useful for weights, but not so useful for, okay.
Yeah.
Yeah.
The big challenge is, let's fill this in.
The number of micro batches here ends up being equal to the number of pipeline stages.
Yep.
When we go back and substitute all of that into here, we get the number of pipeline stages times this little b showing up in here.
And then when we factor this out, splitting this sum into two terms, we get the full division by E times P over here.
We still have division by E times P over here, but the P's cancel, this P and this P.
They cancel.
And so what we find is that if you increase the number of pipeline stages, the memory footprint for the weights keeps going down and down and down, but the memory footprint for the activations, the KV cache, stays constant.
So it doesn't actually work.
Once you do enough pipelining, and it's really not much, even two is often enough, this weights term becomes very small.
This becomes the dominant term; the KV cache becomes the dominant term.
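The cancellation can be checked numerically. This is a minimal sketch assuming the global batch is B = P × b (one micro batch per pipeline stage, as argued above) and perfect sharding; the function name and the example sizes are hypothetical.

```python
def per_gpu_terms(weight_bytes, micro_batch, len_context,
                  kv_bytes_per_token, E, P):
    """Return (weights per GPU, KV cache per GPU) for pipeline extent P."""
    global_batch = P * micro_batch  # one micro batch in flight per stage
    weights_per_gpu = weight_bytes / (E * P)
    kv_per_gpu = global_batch * len_context * kv_bytes_per_token / (E * P)
    return weights_per_gpu, kv_per_gpu


# Hypothetical sizes: 1 TB of weights, micro batch of 2,048 sequences,
# 100K-token contexts, 2 KB of KV cache per token, E = 64.
for P in (1, 2, 4, 8):
    w, kv = per_gpu_terms(10**12, 2048, 100_000, 2048, E=64, P=P)
    print(f"P={P}: weights {w / 1e9:6.2f} GB, KV {kv / 1e9:6.2f} GB")
```

The weights column shrinks as 1/P while the KV column does not move, which is the sense in which pipelining helps with model size but not with context.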
Yeah.
I know this is wrong.
I'm just trying to figure out why my logic here is wrong.
You're pipelining through many different stages.
The KV values are not shared between layers.
So, why would it not help to be pipelining across multiple layers?
Because then you only need to store like one layer rather than two layers of KVs, right?
Yeah.
So, it helps from that perspective.
You're right.
What's competing with that, though, is that you need to be keeping all of the racks usefully busy at a time.
And so the number of sequences that are in flight simultaneously has gone up. Yeah, that makes sense.
So, those exactly cancel.
And yeah, you end up not getting a saving per GPU.
Right.
Okay, this is going back fundamentally to the point that you're not able to amortize KV caches.
Yeah.
Well, so first we said you can't amortize KV caches across batch size.
Yeah.
And now we're saying you also can't shard them across pipeline stages.
It sucks from both of those points of view.
Yeah, I'm just saying.
Okay, so then what is done during inference?
So, I mean, the DeepSeek paper reports what they do, which is they just do a lot of expert parallelism.
In effect, you should increase your expert parallelism up to your scale-up domain size.
And then do very little pipelining, maybe none at all, maybe just enough to make the weight storage not too big of an issue.
Those are the only two parallelisms that really make sense.
In the past, there was tensor parallelism, which was cutting up within an expert, but the experts are so small now that that is not a profitable transaction.
So this goes back to the question.
Does that mean that frontier labs, when they're doing inference, are just basically within a single scale-up domain?
Uh, yes.
Yeah.
I mean, you can look at how it depends on model size.
You could have a very large model, like one that exceeds the memory of a rack.
And then they should be doing a bit of pipelining.
Um, maybe, maybe it's extremely sparse, for example, and that would be a reason to do it.
So I guess this goes back to the promise at the beginning of the lecture, which was, this will actually tell you about AI progress as well.
To the extent it is the case that model size scaling has been slow until recently, because... let me make sure I understand the claim.
The claim would not be, you could have trained across more, more racks.
It was just that it would not have made sense before, like, we didn't have the ability to do inference for a bigger model.
Easily.
Actually, I'll make the claim.
So pipelining doesn't help with context length.
It totally helps with model size.
And so, because of the ability to pipeline, at least, a rack should not be a constraint on your ability to fit the model parameters.
I guess the other consideration you're asking about is, why haven't models scaled up more, and why do bigger scale-up domains help now?
So we talked through one aspect of that, which is, we kind of said, it's not because of memory capacity.
We have a solution to the memory capacity, at least with respect to model size.
Not with respect to KV cache size, but at least with respect to model size, we have a solution to memory capacity.
Um, the other issue that shows up is, uh, latency.
I was just about to ask.
So, going from rack to rack,
what is the latency cost per hop?
This is very much dependent on the hardware.
It's, uh, I would, uh, I can't say with a lot of authority.
I think it's probably on the order of a few milliseconds, but it could be off by it or anything.
It's for realistic number of how many pipelining stages you might have.
Yeah.
Yeah.
Okay.
So, on a small number of pipelining stages, this is not a huge latency impact.
Well, I guess it's 10 milliseconds per token.
That's right.
Two times four-ish.
I don't know what we said, but yeah, 10 milliseconds per token is actually a lot.
Yeah.
If it, if it goes from 20 to 30, right, or something like that.
Yeah.
Yeah.
So, just to chart the path that it goes through: here you're going from your GPU or TPU or whatever, to a network card, which then goes to a top-of-rack switch, and then hops over to the other rack, and then the same thing in reverse.
So you sort of have to sum up the latencies of these different things.
Uh, so this is the same thing as the DC.
Yeah.
So the day it may, in fact, go up to a day time for switching back depends on to climate integration.
Right.
Now.
And because decode is sequential, these also stack up across the stages.
Yeah.
So you can't do them at the same time.
That's right.
Yeah.
Okay.
So I guess this brings us back to the question, then: is the size of the scale-up domain at all relevant to why AI model sizes have been what they have been over the last few years, whether through training or through inference?
Yeah.
So I mean, we talked about the latency of this hop.
There is also just the same T_mem latency; the memory time latency is actually massively improved by larger scale-up domains.
So, I'll recall T_mem down here.
T_mem for the weights.
This was equal to the number of total parameters divided by the memory bandwidth. Which memory bandwidth are we talking about here?
Is that just one GPU? In fact, it is the number of GPUs that I can use in parallel to load these weights.
So, I can't use different pipeline stages in parallel because they're not running at the same time.
But I can use all the GPUs in my scale-up domain in parallel to load the weights.
Yeah.
And so, um, this is actually extremely effective.
So, basically, I end up with a term here.
This memory bandwidth term itself is equal to the scale-up size...
Times the memory bandwidth of the GPU?
Yeah.
Yeah, times GPU bandwidth.
And so, this term doesn't increase a lot.
It maybe increases 1.25x or 2x per generation.
But this one increased by like a factor of eight, from Hopper on up.
So, the reason the bigger scale-up domain matters is not the memory capacity of the whole scale-up domain, but really the memory bandwidth.
Yeah.
Yeah.
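As a rough sketch of the point just made: the time to stream the weights once is the total weight bytes divided by the aggregate bandwidth of the scale-up domain. The function name, the 1 TB weight size, and the 3 TB/s per-GPU bandwidth are placeholder assumptions, not measured hardware numbers.

```python
def weight_load_time_s(weight_bytes, scale_up_size, gpu_bw_bytes_per_s):
    """T_mem for weights: bytes / (GPUs in the scale-up domain * per-GPU bandwidth).

    Pipeline stages do not appear here because they run at different
    times, so they cannot load weights in parallel with each other.
    """
    return weight_bytes / (scale_up_size * gpu_bw_bytes_per_s)


# Hypothetical: 1 TB of weights, 3 TB/s of memory bandwidth per GPU.
t_small = weight_load_time_s(10**12, scale_up_size=8, gpu_bw_bytes_per_s=3 * 10**12)
t_large = weight_load_time_s(10**12, scale_up_size=64, gpu_bw_bytes_per_s=3 * 10**12)
print(f"8 GPUs: {t_small * 1e3:.1f} ms, 64 GPUs: {t_large * 1e3:.1f} ms")
```

With per-GPU bandwidth held fixed, an 8x larger scale-up domain gives an 8x shorter weight-load time, which is the lower bound on decode latency per step.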
Pipelining totally solves the capacity problem, but scale-up size helps solve the bandwidth problem.
Solving the bandwidth problem helps you do longer context lengths, which is more and more relevant as models get more agentic.
Yeah.
It lets you just run the model at low latency, um, uh, as a first thing like if I just do a very fast model and it's on like a little like H100, uh, box.
Yeah.
Um, uh, uh, the latency will be really high.
Yeah.
Okay.
It's a super tangential question.
There's Chinchilla scaling, which tells you how big a model should be relative to the amount of data you're going to train it on.
But now, obviously, you're not just trying to optimize for the highest quality model you can get with training compute.
You want the best results a user can get.
Yes.
A mixture of training and inference compute.
So, then there's a question of how much you should overtrain a model such that the compute, amortized over training and inference, is minimized to get a certain performance. But now with RL there's another consideration, which is you're going to do some amount of pre-training.
Yeah.
That pre-training will be used both for RL generation and then for inference for the final user.
And by overtraining here, I mean: it would have been more efficient, just from a training compute perspective, to have a bigger model that you train for less time, because it can learn faster.
Maybe you get a smaller model, you spend more compute on it than you otherwise would have, but now it's cheaper to give it to users.
Okay, maybe let me make the question more concrete.
How much more than Chinchilla optimal are models overtrained?
Ooh.
Yeah.
And has that changed as a result of RL generation?
This is a place where we have to do a bit of guesswork, because the updated scaling laws and the model traffic are not reported.
And so we have to guess there.
But one way to look at it, let me first just make a sort of a general heuristic claim.
If I have a total cost which is a sum of cost A and cost B, maybe this is the training cost and this is the inference cost,
and I want to minimize this sum, then for many curves it ends up being the case that the minimum tends to be where the costs are equalized.
That's somewhat of a heuristic claim, but there are many examples where it's true: where one is 1 over x and the other one is x, for example, they tend to be minimized at the point where they equal each other.
It's also true for e to the x and e to the minus x, and all kinds of other things.
So basically I've got some, I've got some curve that's going down, some other curve that's going up and they tend to be minimized at this equal point.
Heuristically, I will conjecture that that is true for the setup you described as well.
Actually showing that it would be true would require looking at the scaling laws and fitting these weird exponents, but things that do follow power laws tend to have this property.
So I'll just make that claim to move on.
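The heuristic can be sanity-checked numerically with a simple grid search. This is only an illustration: the helper name and the grids are assumptions, and the third example (x² + 1/x, not mentioned in the lecture) is added to show that with unequal exponents the two terms at the minimum are equal only up to a constant factor (the ratio of the exponents), which is why the speaker calls it a heuristic.

```python
import math


def argmin_on_grid(f, lo, hi, steps=100_000):
    """Return the grid point in [lo, hi] where f is smallest."""
    best_x, best_val = lo, f(lo)
    for i in range(1, steps + 1):
        x = lo + (hi - lo) * i / steps
        val = f(x)
        if val < best_val:
            best_x, best_val = x, val
    return best_x


# 1/x + x: minimized where 1/x == x, i.e. near x = 1.
x_recip = argmin_on_grid(lambda x: 1 / x + x, 0.1, 10.0)

# e^x + e^-x: minimized where e^x == e^-x, i.e. near x = 0.
x_exp = argmin_on_grid(lambda x: math.exp(x) + math.exp(-x), -5.0, 5.0)

# x^2 + 1/x: at the minimum the falling term is about 2x the rising one.
x_pow = argmin_on_grid(lambda x: x**2 + 1 / x, 0.1, 10.0)
ratio_pow = (1 / x_pow) / x_pow**2

print(x_recip, x_exp, x_pow, ratio_pow)
```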
So we're going to say that the cost of training plus the cost of inference, we want to equalize these.
We'll do pre-training only first because it's a little... well, actually we can do all of it in general, so we'll cost all of these.
Cost of pre-training.
So, number of active params times the data in pre-training.
So that's the cost of pre-training.
There's a factor of six out here; this is the famous 6ND formula.
And then in RL we have approximately the same thing.
We've got the same number of active parameters, but now the amount of data is the RL data.
There's this extra efficiency multiplier, or inefficiency, the decode inefficiency.
Which is the fact that you're not training on all your rollouts?
Well, yeah, there's that.
And then the other, perhaps even bigger, inefficiency is that this involves a substantial amount of decode, and often decode runs at less MFU than training.
Okay, so if you're doing a backward pass on every single generation in RL, it would be 6ND.
Yeah, so this could be a small number, right?
Like this could be somewhere.
So it leads me to, yeah, somewhere in the range of two to six.
So we'll say somewhere in the range of two to six and leave it at that.
Yeah.
And then we can add in the inference cost.
The inference cost is two times the number of active params times the data in inference.
I think the way I said it was super garbled, so just for the audience, maybe:
Forward plus backward, per parameter, is six; forward alone is two.
That's why RL, where you're definitely going to generate all the trajectories, but you might or might not train on all the trajectories, is two to six.
Yes, yeah, thank you.
And then inference is, is just, yeah.
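The three cost terms can be written down as a small sketch. The function names are hypothetical; the constants (6 for forward plus backward, 2 for forward only, and 2 to 6 for RL) are the ones stated in the conversation, and hardware inefficiency is left out here.

```python
def pretrain_flops(n_active, d_pretrain):
    """Forward + backward on every pre-training token: 6 * N * D."""
    return 6 * n_active * d_pretrain


def rl_flops(n_active, d_rl, flops_per_param_token=4):
    """RL: every token is generated (2), only some are trained on (up to 6).

    flops_per_param_token is an assumed value somewhere in [2, 6].
    """
    return flops_per_param_token * n_active * d_rl


def inference_flops(n_active, d_inference):
    """Forward only on every served token: 2 * N * D."""
    return 2 * n_active * d_inference


# Hypothetical: 100B active parameters, 1T tokens in each phase.
n, d = 100 * 10**9, 10**12
print(pretrain_flops(n, d), rl_flops(n, d), inference_flops(n, d))
```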
So we're going to solve for essentially equality of all three of these terms; it may not be exact, but that is the ballpark where people are going to be.
Labs have more information on what is productive, doing more RL, for example, versus doing more pre-training.
I don't have that information, but I think a good ballpark is like a 33% split between each of them.
Actually, I'm not sure there's an intuition for that.
Another naive model could have been that RL plus pre-training would be 50%, and inference would be 50%.
Yeah, that's a valid answer as well.
Because this is a heuristic, I can't really argue for one versus the other.
They're not different by that much; 33 versus 25 is within the noise here.
So let's pick one of them; all equal seems simple enough.
And so we're just going to solve for equality of them.
It's pretty straightforward.
We can immediately see that the number of activated parameters totally disappears.
And so let's factor that out.
And we're going to just say that, uh, data in pre-training.
I decided to do it your way.
It's this a little bit nicer, actually.
So data in pre-training plus, this, uh, I didn't have the inefficiency over here, either.
Data in pre-training, plus some multiple, alpha, times the data in RL, is going to end up equal to some beta times the data in inference.
And then let's just roughly size the alpha.
This alpha is maybe somewhere in the range of two to six over six, from this term compared to this term, and then we've got an inefficiency term, which I would say is maybe in the range of 30% or so.
So this alpha is going to be something like one in 10, one over 10, let's say.
Then this beta here is actually the same.
It's a third.
It's one third, times what we said it was, so it also equals one in 10, something like that.
If they're both one in 10, that kind of implies that there's never a backward pass in RL?
Yeah, okay.
We can make this like two in 10.
I don't know, bigger.
Yeah.
So, yeah, like just write it out once more, like this is two in, two over 10, this is one over 10.
Um, so the number of inference tokens you have, and this is just a function of like, I've got hundreds of millions of tokens per second, um, times my model is deployed for, I don't know, two months before I ship to the next version.
That sort of determines the number of tokens in RL and pre-training, and then I guess we didn't do the equivalence between pre-training and RL, so we'll do that here.
Data in pre-training should be equal to like two over 10 times the data in RL, if they're going to be cost equivalent.
So, sorry, I got it backwards: we pay more cost when it's inefficient, so this needs to be one over that. So, tracing this back, this thing ends up actually being, as written here, yeah, so this is like 1.5 and this is 1.
Um, um, um, uh, billions of dollars, the compute just for the other direction.
Yeah, right, right.
I think if you do this in a spreadsheet and actually know about it, you might notice when the money's going down the drain.
Yeah, so I think all of these end up being close; as modeled here, they're starting to seem, in my opinion, a little bit too generous.
Um, so let's say something like 1.5 here and we'll update this as a one here.
So I think at this point you can basically read it off: the number of inference tokens should be about the same as the number of pre-training tokens, which should be about the same as the number of RL tokens, within the error bars we're able to talk about.
But then, so it looks... sorry, I'm making a basic error, one second.
So it seems like there should be fewer RL tokens than pre-training tokens.
Yeah, that's right in general.
Because RL is less efficient in terms of machine time, and so if you try to equalize the RL and pre-training time, then you should have fewer tokens.
Uh, I know what I have the same rule time.
This is all, uh, quite interesting that, um, I never thought about it in terms of how much equalizing in terms of data.
I mean, I think starting with equalizing cost is right, but depending on how you model the cost, this comes out close to equalizing the token counts.
So, for GPT-5 to be trained optimally, the total amount of tokens streamed to every single user who uses GPT-5 should equal the total amount that went into pre-training.
Yeah.
And the total amount of tokens that went into pre-training is the sum of all human knowledge; each model should generate the sum of human knowledge on the output that it gets on the input.
Yeah.
So, I mean, which way are people going to err?
If you think that people's predictions are not perfect, and you run the risk that you make a model that is not a frontier model, and then you just throw it away,
Yeah.
then that kind of changes the cost trade-off, because there's some probability that applies to the inference.
Yeah.
And you should derate the inference tokens by some amount.
Right.
And then can we back out how much more compute than Chinchilla optimal that is, for a given size?
Yeah.
Or so, I think we just have to make some real-world assumptions here in order to do that.
So the inference tokens, we should totally be able to guess, right?
Like, so, um, let's say a few hundred million, I don't know, maybe it's like, uh, five hundred million tokens a second.
Now, I don't really know.
Um, a five hundred million tokens a second times a model is deployed for two months before it becomes obsolete.
I don't really know.
Um, uh, I can't do this in my head.
I can't do this.
I can't do this.
2.6 times 10 to the 15.
Okay, 2.6 times 10 to the 15.
Okay.
This number is probably too large.
Because this is going to be multiple models in a family.
So let's make it five times smaller, or 10 times smaller, or something like that.
Um, uh, okay.
So, we're estimating maybe 50 million tokens per second for a specific model.
The model is live for two months.
Um, and so, uh, this comes out to around 200, uh, trillion tokens.
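The arithmetic here is just tokens per second times seconds deployed. A minimal sketch, assuming "two months" means 60 days; the traffic figures are the speaker's own guesses, not reported numbers.

```python
SECONDS_PER_DAY = 86_400


def deployed_tokens(tokens_per_second, days):
    """Total tokens served over a deployment window."""
    return tokens_per_second * days * SECONDS_PER_DAY


family_total = deployed_tokens(500_000_000, days=60)  # whole model family
single_model = deployed_tokens(50_000_000, days=60)   # one specific model

print(f"family: {family_total:.2e} tokens")   # about 2.6e15
print(f"single: {single_model:.2e} tokens")   # about 2.6e14, i.e. a few hundred trillion
```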
Um, and then we want to compare that to active parameters on a, um, frontier model.
I don't actually know the latest rumors, but do you know? Somebody told me 150 trillion.
So, yeah, so that's the amount of tokens.
Training on 150 trillion tokens.
Interesting.
Which is similar.
Yeah.
Yeah.
That's actually similar.
So, um, so I don't know, I'm pre-training that, this is not, but well-sighted, but you'll be so not really satisfied.
It's not like it.
And I think the number of active params could be in the range of 100 billion.
So, yeah, maybe larger.
So I'm assuming active params of about 100 billion, and so multiply by 20 to get the token count.
So the Chinchilla count would be around two trillion.
And yeah, we see we're at 100 times larger than that.
What does D Chinchilla actually mean?
The token count for pre-training that the Chinchilla scaling law would recommend.
Yes.
Oh, I see, so how much is it overtrained?
Got it.
So, yeah, the ratio of this 200 trillion or 100 trillion tokens over the Chinchilla optimal of two trillion,
that's the amount of overtraining, which is about 50 or 100x.
That's whatever.
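The overtraining ratio follows from the numbers quoted above. A small sketch, using the 20 tokens-per-parameter rule of thumb stated in the conversation; the 100B active-parameter figure is the speaker's assumption, and the function names are hypothetical.

```python
CHINCHILLA_TOKENS_PER_PARAM = 20  # rule of thumb used in the conversation


def chinchilla_tokens(n_active_params):
    """Token count the Chinchilla rule of thumb would recommend."""
    return CHINCHILLA_TOKENS_PER_PARAM * n_active_params


def overtraining_ratio(d_pretrain_tokens, n_active_params):
    """How many times more tokens than the Chinchilla recommendation."""
    return d_pretrain_tokens / chinchilla_tokens(n_active_params)


n_active = 100 * 10**9  # assumed 100B active parameters
print(chinchilla_tokens(n_active))                   # 2 trillion tokens
print(overtraining_ratio(100 * 10**12, n_active))    # 50x at 100T tokens
print(overtraining_ratio(200 * 10**12, n_active))    # 100x at 200T tokens
```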
Okay.
So, to the extent this is the right ballpark, just by thinking about it, you kind of want everything to be equal in terms of compute.
If OpenAI also realizes that, and they're serving a certain number of tokens per second, that tells you how much data went into the pre-training of GPT-5.
Even if it's 50% off or something, it's sort of wild that you can derive, from first principles,
these kinds of numbers.
I mean, this is why you should just approximate everywhere, because there are big error bars on some of this.
But yeah, it's kind of empowering to just set A equal to B and go.
Yeah.
Yeah.
Yeah.
That's really cool.
Okay.
So, in this spirit of trying to deduce things, we can publicly look up the prices of the APIs of these models.
And maybe you can learn something from that.
So, first, with longer context, Gemini 3.1 is 50% more expensive
if you go over 200K tokens than if you're below 200K tokens.
I mean, at a high level, I understand why that might be, but why specifically 50%?
Yeah.
Um, so, I mean, why specifically 50%.
So the high-level reason, in the first place, is that there is some amount of increasing cost with context length.
Yeah.
Um, and, and, uh, we can bring that back up.
That was the, um, the memory time.
This is the compute time.
So, um, okay.
So, we've put up these same equations from before: the time for memory fetches, which is the weights and the KV cache, and then the time for the compute, which is just the matrix multiplications for the weights.
I will also draw the cost curve.
But this time I'll do it as a function of context length, instead of as a function of batch size.
So, this is time, and so this is the cost curve as a function of context length. We'll draw the compute.
The cost of the compute is actually constant as a function of context length.
There's no dependence here on context length.
In reality, there is some dependence, but it is very mild dependence.
So, we'll ignore it.
Um, so this is the, um, time for the compute.
This one, uh, and then we'll also draw the dependence, uh, of the memory fetch on context length.
And this starts at a large number for the weights, and then grows gradually with the context length.
So, maybe here, and then it grows gradually with context length.
And so, you take the maximum, and you see there is this inflection point here.
So this is the cost that, for example, Gemini might be paying.
And then you think, how, how might you put a pricing structure on top of that?
Um, you would like to ensure that no matter what the context length is, you are, you are still profitable.
So, interesting.
And so we've got a two-tier pricing structure; maybe we've got something that looks like this, up to some context length.
That's fascinating.
So, I think it says something: given that the bump is at 200K, it probably means that this is somewhat aligned with this crossover point, maybe not exactly aligned.
That's it, yeah.
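The cost curve being drawn can be sketched as the maximum of a flat compute time and a memory time that grows linearly with context length. This follows the simplified roofline form used in the conversation (compute time as batch × active params / FLOPs); the function name and every hardware and model number below are illustrative assumptions.

```python
def step_time_s(len_context, batch, n_active, kv_bytes_per_token,
                flops, mem_bw, weight_bytes=0):
    """Roofline time for one decode step: max(compute time, memory time)."""
    t_compute = batch * n_active / flops
    t_memory = (weight_bytes
                + batch * len_context * kv_bytes_per_token) / mem_bw
    return max(t_compute, t_memory)


# Hypothetical system: aggregate memory bandwidth of 64 x 3 TB/s and a
# compute-to-bandwidth ratio of 300; 100B active params, 2 KB KV per token.
MEM_BW = 64 * 3 * 10**12
FLOPS = 300 * MEM_BW
for ctx in (10_000, 100_000, 200_000, 400_000):
    t = step_time_s(ctx, batch=2048, n_active=100 * 10**9,
                    kv_bytes_per_token=2000, flops=FLOPS, mem_bw=MEM_BW)
    print(f"context {ctx:>7}: {t * 1e3:.2f} ms per step")
```

Below the crossover the step time is flat, and above it the time grows with context, so a flat price with a surcharge past some threshold is one way to stay above this curve.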
Um, so we can actually probably even complete that calculation just to see where it lands out.
Um, we can solve for the number of bytes per token.
If, if we sort of make some assumptions about the number of active parameters.
So solving for the number of bytes per token, um, we're going to assume like the point where we equalize, um, the time of memory and the time of compute is at, let's say, 200K tokens.
Um, so we equalize these two.
We're also going to just assume that the batch size is large enough that the memory time spent on weights is negligible.
So forget about this and we'll focus on the actual memory time spent on KV cache.
So that ends up saying, copying this term over: batch times length of context times bytes per token, over bandwidth, is going to be equal to number of active params over FLOPs.
And then we're going to solve for bytes per token.
Batch size was missing here.
It shows up here.
And then it cancels out by the time we get to here.
And I dropped the length of context.
So we can plug in numbers.
This number was the reciprocal of the number that we saw before; this is like one over 300, which is reasonably stable across many different hardware platforms.
We said that maybe the number of activated params is like 100 billion.
And the context we said was 200K.
Something is wrong here.
The length of the context should be in the denominator, not the numerator.
1,667, so about 1K, almost 2K bytes. That is plausible, actually.
Um, so we said around two kilobytes.
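The back-of-envelope solve above can be reproduced directly. A minimal sketch, assuming weights are negligible at large batch and that the compute and memory times are equal at the crossover context length; the function name is hypothetical, and the inputs (100B active params, ratio of 300, 200K tokens) are the figures used in the conversation.

```python
def kv_bytes_per_token_at_crossover(n_active, crossover_len, flops_per_byte):
    """Solve  len * bytes / bandwidth == n_active / FLOPs  for bytes.

    flops_per_byte is the hardware ratio FLOPs / memory bandwidth
    (about 300 in the conversation). Batch size cancels on both sides.
    """
    return n_active / (flops_per_byte * crossover_len)


kv_bytes = kv_bytes_per_token_at_crossover(
    n_active=100 * 10**9, crossover_len=200_000, flops_per_byte=300)
print(round(kv_bytes))  # about 1,667 bytes, i.e. roughly 2 KB per token
```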
Um, so, um, so let's just do a sanity check for this, um, for what this could be.
There are two mechanisms by which people do attention with a small number of bytes per token.
One is dense attention with a lot of reuse across layers.
So Character AI has a blog post talking about that, alternating long and short context.
And in the Character AI kind of model, which also showed up in the Gemma models, the global context, which is really what we're talking about here, was shared across all the layers.
And so to get this two kilobytes, you could get that, for example, as: a d_head of 128 is typical, and then the number of bytes is typically number of attention layers, times two, times d_head, times number of Q heads.
So, um, this is the number of unique contexts per layer.
Do you have, do you share the context across many layers or do you use it only once?
Um, um, uh, so in the character AI like models, uh, this number is one.
We said this is 128, and this is a choice which typically ranges from one... sorry, this is KV heads, I meant. There's a Q head and a KV head. The KV heads are the heads that are stored in memory; they store the contents of the previous tokens. The Q heads are the retrieval heads; they're only used temporarily, and they're used by the attending token.
So, in this autoregressive context, I've got KV heads associated with all of the context, and then Q heads associated with this new token here.
But this had the 128?
Oh, this number is actually the same for... so this d_head is the dimension of the vector.
And the number of KV heads is typically in the range of one to eight.
Yep.
So it is totally plausible to get this by, for example, having eight KV heads and a d_head of 128; that gives you exactly this number. Or you could have fewer KV heads but more layers.
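The dense-attention sizing just described can be checked with a one-line formula. In this sketch the function name is hypothetical, and `bytes_per_element=1` is an assumption (the spoken formula omits the element width), under which the example lands on exactly 2 KB.

```python
def kv_bytes_per_token(n_unique_kv_layers, d_head, n_kv_heads,
                       bytes_per_element=1):
    """KV cache bytes per token: layers * 2 (K and V) * d_head * KV heads.

    n_unique_kv_layers counts distinct KV caches; it is 1 when the
    global context is shared across all layers.
    """
    return n_unique_kv_layers * 2 * d_head * n_kv_heads * bytes_per_element


# Shared global KV (one unique layer), d_head = 128, eight KV heads.
print(kv_bytes_per_token(1, d_head=128, n_kv_heads=8))  # 2048 bytes

# Same footprint with fewer KV heads spread over more unique layers.
print(kv_bytes_per_token(4, d_head=128, n_kv_heads=2))  # 2048 bytes
```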
Yeah.
So this is one way to get that, by dense attention. There's also a way to get that by sparse attention, where you increase all of these numbers, but then you have a lot of sparsity at each step.
So, I mean, I mean, I think the number is plausible if, if maybe a little bit small.
It's funny that they would leak so much information through their API pricing.
I mean, you are incentivized to price, close to your costs, because otherwise someone could scoop you.
Maybe you can learn something about the difference in input versus output prices.
Yeah.
And what that tells us about decode versus prefill in these models.
And I think, last I checked, it's like 50% more expensive, or something like that?
I don't remember. What I've seen in the past, it's like three or five.
Oh, okay.
So let's say it's five times more expensive.
Okay.
This is the compute to process the next token in decode.
Suppose you're doing prefill: we're not just processing the most recent token, you're processing all the tokens in parallel.
So I want to say that it would be this times len, len prefill.
Call it the length of the pass, in general.
Yeah.
And we can think of decode as being a pass with one token, and prefill being a pass with many.
Oh, okay.
Yeah.
Yeah.
So maybe, like, prefill, pass, whatever.
Um, okay, memory.
So you're not storing the KV cache for the tokens that are the prefill tokens.
I think maybe let's draw how prefill actually shows up here.
If I may clarify: we do a bit of decode, like this.
Yeah.
Um, we may actually come back and do more pre-fill.
Like, like, if you say this is a chat session, the user says something, the AI generates response, and then the user says something else.
Yeah.
We pre-fill this.
So like, maybe this is the more common, like this is the general case rather than this.
And in fact, this is like, you read a file or something.
Read a file, or just the AI is responding to user input or to a tool call or anything.
That's not exactly the same.
Yeah.
Okay.
Okay.
Okay.
Suppose we're here.
So you will need to load.
Basically, the, you will have calculated all of this previously.
So just the KV of everything that came before.
But what is the memory cost of this?
Well, memory bandwidth cost of this.
If you're doing flash attention, it's basically temporary.
It doesn't even go to main memory.
Just ignore it.
Okay.
So then it would just be everything that came before.
So is it not just that then?
Yeah.
There's actually no adjustment at all to the memory time.
Okay.
Great.
Oh, it's, it's a very trivial change.
Yeah.
So yeah, I could accommodate.
So this term is making it 5x more expensive.
Now, why would that be?
Or what does that tell us about?
What, what are we trying to learn here?
What does that actually tell us?
What variable does it help us clamp?
Well, the compute has presumably gotten five, like the only thing that could have changed the compute is 5x more expensive as a result.
So yeah, there's the time for one pass.
But actually, the amount of tokens is that much larger.
So I guess we want the cost per token, in fact, or the time per token.
So I'm not sure I understood. This is for processing the next token in prefill?
Well, actually, for processing the entire batch.
So at this cost, we have processed this many tokens: len pass, or len prefill.
Yeah.
I guess prefill, yeah, the length of the pass.
Yeah, yeah, no, not, not this prefix, but it's this cost.
Okay.
Okay, look what we're seeing as a pass.
We get, so this is 5x more expensive.
In, input is 5x more expensive.
Output is more expensive.
Output is 5x more expensive.
Sorry, the result we want to work towards is that prefill is compute limited and decode is memory bandwidth limited.
Why don't we do this?
Why don't we just chart it with, like, the length of the pass on the x-axis?
Yep.
Yeah.
T on the y-axis.
T... we want the cost per token.
So it'll be T over the length of the pass.
Yeah, that's great.
All right, so.
Okay, excuse me, this length of the pass... it seems like this should be higher when you're doing prefill.
Prefill has a bigger length of the pass.
Yeah.
Right.
But then why is the cost higher?
Yeah, yeah.
So, I mean, it's this division by the length of the pass that actually makes it all... okay, yeah, this is going to divide out.
This is going to divide out, but then all of this is going to be divided by the length of the pass, and it's going to make the memory cost cheaper.
Okay, yeah, let me, let me, let me think about this then.
Okay, so let's do one line for, basically we'll have four different lines.
Um, let's do the, let's do pre-fill first, and so, actually, let's, let's do decode first.
Oh, so actually, the length of the pass: when it's one, that is decode.
When it is bigger, that is pre-fill.
Okay, yeah, yeah, yeah, yeah, yeah.
That makes sense.
Okay, getting back to it.
So, T compute: basically it's just this divided by the length of the pass, so it's just this amount.
So this actually does not vary with the length of the pass, so it'll just be some flat value, like this, and this is T compute.
And then this is, let's say, our decode, right?
Now T mem: if you have this whole thing divided by the length of the pass, well, it doesn't really matter what's up there.
It'll just be something that looks like this.
Right.
Yeah.
Let's say this is T mem, and this is decode again.
So, as the length of the prefill, or pass, goes up, your memory bandwidth time declines, and that means that to the extent that you were bottlenecked on memory bandwidth before, you can avoid being bottlenecked on memory bandwidth.
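The chart being drawn here can be sketched numerically. This is a rough illustration under stated assumptions, not the speaker's exact model: every hardware and model number below is hypothetical, the bytes loaded per pass (weights plus earlier KV cache) are treated as independent of the pass length, and batching across users is left out.

```python
def per_token_times(len_pass, flops_per_token, peak_flops, bytes_loaded, mem_bandwidth):
    """Return (compute seconds per token, memory seconds per token) for one pass.

    len_pass = 1 corresponds to decode; len_pass > 1 corresponds to prefill.
    """
    t_compute = len_pass * flops_per_token / peak_flops  # grows with pass length
    t_mem = bytes_loaded / mem_bandwidth                 # same bytes for any pass length
    return t_compute / len_pass, t_mem / len_pass


# Hypothetical numbers, chosen only to make the shape of the curves visible.
HYPOTHETICAL = dict(flops_per_token=2e11, peak_flops=1e15,
                    bytes_loaded=1e11, mem_bandwidth=3e12)

for len_pass in (1, 10, 100, 1000):
    t_c, t_m = per_token_times(len_pass, **HYPOTHETICAL)
    bound = "memory" if t_m > t_c else "compute"
    print(f"len_pass={len_pass:5d}  compute/token={t_c:.2e}s  "
          f"memory/token={t_m:.2e}s  {bound}-bound")
```

The compute line stays flat while the memory line falls as one over the pass length, so with these made-up numbers decode is memory-bound and a long prefill is compute-bound.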
The fact that they are charging 5x less for prefill than decode does suggest that they are bottlenecked on memory bandwidth to quite a degree, at least for them, because T is equivalent to cost, right?
It's the cost of running the compute.
This is actually like this, this would be at one, and this would be at five.
That's right.
That's right.
Yeah.
So it is, in fact, tremendously memory bandwidth bottlenecked.
The real graph looks something like that.
Yeah.
I mean, still crosses, but yeah, exactly.
Yeah, let me do this way.
Yeah, that's right.
And then this is the gap on decode between the memory and the compute time.
Yeah.
Yeah.
Interesting.
Another interesting one would be: why are cache hits so much cheaper?
Yeah.
Okay.
So, if I remember correctly, it's more expensive to write to the cache according to the pricing on all these models, but if you do hit the cache, it's 10x cheaper.
So, what is going on with that?
Presumably, this is the cost of keeping something in HBM rather than just evicting it.
But if you do keep it in HBM, then it's cheaper to load again.
Right.
So, there are two ways you can produce the KV cache for a token.
You can just produce it from scratch by computing it from the underlying token IDs, which are tiny.
Or you can have previously produced it and stored it in a memory somewhere.
Yeah.
So, the cost ratio is really talking about the ratio between those two mechanisms of producing it.
A cache miss means you've deleted it from all of your memories and you have to recompute it from the tokens directly.
In fact, you can maybe even take that a step further and think about which memory tier do you store it in.
So, you could store it in HBM.
There are other slower and cheaper memories than HBM, like DDR on your host or flash as well.
And so, one of the things you can do is a calculation of when it makes sense to be in each memory tier.
And this is related to how long you're going to store it. So we want to look at the cost of storage in a few different memory tiers, and also the cost of rematerialization.
So, remat means the cost to rebuild all of the KV cache from scratch after you've deleted it.
So, we rematerialize it.
And so, basically, this is going to cost the length of the context.
Actually, we'll look at cost per token so that we don't need to carry around this length of context everywhere.
So, to rematerialize one token of KV cache, I need to run a forward pass on the whole model.
And so there's going to be the compute.
I have to rerun the compute, divided by however fast my GPU does it.
And then I multiply it by, like, GPU dollars per second.
Can I ask an extremely naive question?
Why is there not a quadratic term?
Yeah, so there is a quadratic term; it shows up in the compute.
As an approximation, I chose to remove it.
I'll just show you quickly what that looks like.
If you look at the cost per token, or the number of flops per token, there are the flops that come from doing the weight matrix multiplies.
Yeah, as a function of context length.
And then there is the number of flops that comes from attending to the KV cache, which goes up linearly with the amount of stuff you attend to.
The slope on this is so low that when you draw it like this, it seems like it's very well approximated by a flat line.
So you start to notice the effect of the quadratic, or the linear term, up in the millions of tokens or so.
So, just not super relevant.
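To make the flat-line approximation concrete, here is a hedged sketch of the two per-token flop terms. It assumes the common estimates of about 2 flops per active parameter for the weight multiplies and about 4 flops per attended token, per query head, per head dimension, per attention layer (one multiply-add for the query-key scores and one for the weighted sum of values). All model sizes are hypothetical, so where the linear term becomes noticeable depends entirely on them.

```python
def weight_flops_per_token(active_params: float) -> float:
    """Flops from weight matrix multiplies: constant in context length."""
    return 2.0 * active_params


def attention_flops_per_token(context_len: int, n_attn_layers: int,
                              n_q_heads: int, d_head: int) -> float:
    """Flops from attending to the KV cache: linear in context length."""
    return 4.0 * n_attn_layers * n_q_heads * d_head * context_len


# Hypothetical model shape, for illustration only.
ACTIVE_PARAMS = 100e9
N_ATTN_LAYERS = 16   # layers that attend to the full (global) context
N_Q_HEADS = 64
D_HEAD = 128

for context_len in (8_000, 100_000, 1_000_000):
    w = weight_flops_per_token(ACTIVE_PARAMS)
    a = attention_flops_per_token(context_len, N_ATTN_LAYERS, N_Q_HEADS, D_HEAD)
    print(f"context={context_len:>9,}  attention/weight flop ratio = {a / w:.3f}")
```

Summed over every token in a sequence, the linear per-token term is what becomes the quadratic term mentioned in the question.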
So, what is the reason that there's no company which has over a million token context lengths?
If there's this rule?
Yeah, so there are two costs of long context.
One is the memory bandwidth cost, which we've spent a lot of time analyzing.
That's this thing.
And then the other one is the compute cost.
The compute cost is almost always, and sort of actually forced by fundamental principles, to be a much smaller slope than the memory bandwidth cost.
And so, the primary thing that limits you from having really large contexts is memory bandwidth and memory capacity, which is exactly this effect.
And so, there's this idea that Dario said on the podcast, and others have said, which is: we don't need continual learning for AGI; in-context learning is enough.
And if you believe that, then you have to think that we have to get to a 100-million or billion token context length to have an employee that is the equivalent of working with you for a month.
Now, maybe that's no longer true with sparse attention or something.
Yeah.
Yeah, if you think that, then some ML infra thing would have to change to allow for 100 million; like, the memory bandwidth would have to allow for 100 million.
Yeah, so it's getting context lengths.
I mean, sparse attention gives you a get-out for sure, because you get this square root, which gives you a big improvement.
But I think if you look at the history of context lengths of models, from earlier models like GPT-3, maybe to GPT-4, I don't remember when the transition happened exactly, they shot up from about 8k to 100k or so.
And then for the last year or two, they've all been hovering around there.
I think that actually indicates that that's sort of the reasonably balanced cost point and going massively beyond that would be cost prohibitive.
Not because of the compute cost, but because of the memory bandwidth cost.
Yeah.
So, I actually don't see a very good path to solving that.
Like the memory, the HBM is where it is.
It's at where it is.
It's not getting hugely better.
And why doesn't sparse attention get you there?
Sparse attention is a big improvement.
Maybe that is priced in already, perhaps.
It's not an infinite improvement because if you go too sparse, you lose too much quality.
But yeah, I mean, the empirical result is that context lengths haven't been increasing that much.
And I think it's because there is no solution to the memory wall.
Like, going too sparse, which means you're attending to a very small number of the tokens, the quality will get worse.
So, what is the cost of these different ways of producing or retrieving the KV cache?
Computing it from scratch is based on my GPU time: I have to do a certain amount of multiplies, and it's all of the GPU time that I spend in order to produce it.
Store in HBM.
This really goes as, I think I had a number here, which was the bytes per token.
So, I need to, I need to just have some number of bytes per token.
And then I need to store this in the HBM.
So, it's going to use up some of my HBM capacity.
So, a way to think of this is that if I have too many of these things sitting in HBM, if I fill up my HBM with just KV caches that I'm not using, I can't use that GPU.
And so, how do I price that?
Maybe I say that the cost of it is proportional to the fraction of the HBM I'm using.
So, that's also times GPU dollars.
And then, let's just do one more memory here and say something like DDR.
Sorry, DDR instead.
The same kind of thing goes for flash and for DDR.
I put these in the wrong columns, actually.
I meant to make two columns.
The distinction I want to make is that there is the cost to retrieve.
And then there's a cost to store, the cost to hold on.
And so, this is like, there's a cost per second, whereas this is like an instantaneous cost.
So, rematerialization has a cost to retrieve and has zero cost to store it because we've deleted it.
This is the one that I put in the wrong location.
This is actually the cost to hold on.
So, I will rewrite it.
So, okay.
So, we have this is the, like, if we're just storing it in HBM, it has this sort of cost profile.
And then, if we store in DDR, it's actually going to take some time.
So, we get the same thing here, but it's bytes per token over DDR capacity, times DDR cost
per second. But now this has a cost to retrieve that is higher than HBM, because we need to copy it into the HBM.
So, this is bytes per token over DDR bandwidth.
And then, this consumes some amount of the DDR.
So, does every scale-up have DDR and flash? That's really a deployment question.
And so, you can choose that.
NVIDIA does deploy in this form.
It has both.
Whereas, isn't the cost to retrieve from HBM the bytes over the memory bandwidth?
Yeah, I mean, it depends on what you define a retrieve to be.
Here, I'm defining retrieve to be, move it into HBM so that you can start actually doing it for someone.
And so, like, sort of by definition.
Because if it's already in HBM, you can be doing compute while you're getting it from HBM DDR.
Yeah, for example.
So, these are three things.
And I guess I ordered them wrong.
In general, if you're balancing two costs, and you've got different memory, different tiers in the memory hierarchy, you should expect as this cost goes up, this cost would go down.
So, you can kind of see where the zeroes are.
And, like, I should have ordered them:
This one first.
This one second.
And this one third.
So, if you're going to hold onto it for a very short amount of time, then, all of this is multiplied by the hold time.
Yep.
This one is, and so is this one?
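The two-column table described here (a one-time cost to retrieve and a per-second cost to hold) can be sketched as follows. This is an illustration under loud assumptions: every capacity, bandwidth, and dollar figure is hypothetical, and charging DDR retrieval time at the GPU's dollar rate is a modeling choice made here, not something stated in the conversation.

```python
from dataclasses import dataclass


@dataclass
class Tier:
    name: str
    retrieve_cost: float     # dollars per token, paid once when the cache is reused
    hold_cost_per_s: float   # dollars per token per second while it is kept

    def total_cost(self, hold_time_s: float) -> float:
        return self.retrieve_cost + self.hold_cost_per_s * hold_time_s


# Hypothetical inputs, for illustration only.
BYTES_PER_TOKEN = 2048
GPU_DOLLARS_PER_S = 3.0 / 3600
DDR_DOLLARS_PER_S = 0.3 / 3600
HBM_CAPACITY = 1e11      # bytes
DDR_CAPACITY = 1e12      # bytes
DDR_BANDWIDTH = 1e11     # bytes per second
FLOPS_PER_TOKEN = 2e11
GPU_FLOPS = 1e15

TIERS = [
    # Already in HBM: nothing to retrieve, but it occupies a fraction of the GPU.
    Tier("hbm", 0.0, BYTES_PER_TOKEN / HBM_CAPACITY * GPU_DOLLARS_PER_S),
    # DDR: cheaper to hold, but it must be copied into HBM before use.
    Tier("ddr", BYTES_PER_TOKEN / DDR_BANDWIDTH * GPU_DOLLARS_PER_S,
         BYTES_PER_TOKEN / DDR_CAPACITY * DDR_DOLLARS_PER_S),
    # Rematerialization: nothing held, full forward pass to rebuild.
    Tier("remat", FLOPS_PER_TOKEN / GPU_FLOPS * GPU_DOLLARS_PER_S, 0.0),
]


def cheapest_tier(hold_time_s: float) -> str:
    return min(TIERS, key=lambda t: t.total_cost(hold_time_s)).name


for hold in (0.1, 300.0, 1e7):
    print(f"hold {hold:>10.1f}s -> {cheapest_tier(hold)}")
```

As the hold cost goes down from tier to tier the retrieve cost goes up, so the cheapest tier shifts from HBM to DDR to rematerialization as the hold time grows.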
And interestingly, they have different prices to write, and you specify this in the API, for five minutes versus an hour?
Yeah, which suggests that the five minutes is HBM and the hour is DDR.
I think that's a pretty good assumption.
If you look at the numbers, it might also turn out that it's one tier down, and it's DDR versus flash.
Interesting.
And the price difference, I think, was... let me look it up.
Okay.
So, the base input tokens is five per million tokens, basically.
Yeah, that's five.
This is five, five.
Two, two, like, retrieve, quote unquote.
And then to write to, presumably, HBM, the write for five minutes is 6.25.
So, we might actually be able to determine which memory tier it is by the duration.
The duration probably tells us, actually. So it was five minutes versus an hour.
Yeah, exactly.
I think this will probably end up being the drain time of the memory tier that you're in.
And so what that means is, given that I know I'm going to be holding something for five minutes, I would like to pick a memory that I can read every five minutes.
Like, I can read the whole memory once per five minutes, ballpark.
So that is the drain time of the memory.
So if I take the storage capacity over the storage bandwidth, I would like this to be equal to five minutes, or something like that.
And so actually, we did this calculation for HBM, for HBM, we know that this number is 20 milliseconds.
So HBM is much too short, like, much too small.
DDR could be about an order of magnitude or two off from this.
And so this is probably on the order of, actually, I think it might even be in the seconds, like one to 10 seconds.
And then, I don't have these numbers memorized, but generally as you go to slower tiers, flash is plausibly on the order of one minute.
And then spinning disk, which is massively different, I think is on the order of one hour.
So this might actually identify the two tiers as probably flash and spinning disk.
Sorry, why, why is this the calculation?
This is the storage capacity divided by the bandwidth.
So, you've got a bunch of different memory tiers; we've listed four of them.
Your choice of which memory tier is, like, you want to minimize the cost.
Yeah.
Um, and so you are, like, what fraction of the device are you using?
You're using some fraction of the device for the holding onto it, and then using some fraction of the device to retrieve it.
Um, and so let's say, I'm using, like, 10% of the device, um, and I would equalize those two fractions.
Uh, that's a sign that I've hit the right, um, the right thing.
So let's say I've got some timeline here: I'm going to hold on for all of this time, so this is the time hold, and then there's going to be some amount of time here, which is time retrieve.
And I want, basically, to equalize these two costs. I want the retrieval time to be equal to the hold time times the fraction of capacity.
Because this is the retrieval time, and this is how many other things I can hold simultaneously.
Basically, you want to store things in there for so long that the amount of time it's in there is kind of the time to get all your things in there and out.
Yeah, basically.
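The drain-time rule of thumb can be sketched as capacity divided by bandwidth, then matched against the intended hold time. The capacities and bandwidths below are hypothetical, picked only so the drain times land near the ballpark figures quoted in the conversation (about 20 ms for HBM, seconds for DDR, around a minute for flash, around an hour for spinning disk). Matching on the closest ratio in log scale is an assumption standing in for "equal, or something like that".

```python
import math

# Hypothetical (capacity in bytes, bandwidth in bytes per second) per tier.
TIERS = {
    "hbm": (160e9, 8e12),    # drains in about 0.02 s
    "ddr": (1e12, 200e9),    # drains in about 5 s
    "flash": (8e12, 100e9),  # drains in about 80 s
    "disk": (1e12, 250e6),   # drains in about 4000 s
}


def drain_time_s(capacity_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Time to read the whole memory once."""
    return capacity_bytes / bandwidth_bytes_per_s


def best_tier(hold_time_s: float) -> str:
    """Tier whose drain time is closest to the hold time, compared as a ratio."""
    return min(
        TIERS,
        key=lambda name: abs(math.log(drain_time_s(*TIERS[name]) / hold_time_s)),
    )


for label, hold in (("five minutes", 300.0), ("one hour", 3600.0)):
    print(f"{label}: {best_tier(hold)}")
```

With these made-up numbers, a five-minute hold lands on flash and a one-hour hold lands on spinning disk.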
I mean, yeah, I think that probably indicates that these are the two tiers of flash and spinning disk.
I'm kind of shocked to see spinning disk being used at all, because it's such an old technology.
Yeah, yeah.
It's also crazy that it's so slow that it takes an hour to load its full capacity, and it's really an unattractive technology, but it's just used in some places.
Yeah.
So we're sitting down because I want to ask you some questions that I guess don't need the blackboard.
You have this extremely interesting blog post where you talk about how, at a high level, the architecture of different cryptographic protocols looks a lot like neural networks.
And there's this convergent evolution where they both need to jumble information across all their inputs. For cryptographic protocols,
it's to make sure that each new input into a hash function will totally scramble what happens. For neural networks,
of course, they need to consider how this piece of information changes what you should make of this other piece of information.
And that's an extremely interesting point.
I guess, at a high level, the difference is in what they're trying to do.
And in some sense they're trying to do opposite things, right? Cryptographic protocols are trying to take information which has structure and make it look indistinguishable from randomness.
Yeah.
And neural networks are trying to take things which look random, like protein sequences, DNA, garbled text, and extract higher-level structure from them.
So they have similar high level mechanisms, but they're actually kind of trying to do the opposite things.
The owner is going to make it that.
Yeah.
So I mean, I tried to look for other examples where mixing, like scrambling, shows up as well.
There's actually almost even like a physical example where you're stirring something, you're making a cake and you want to stir the batter.
And like, literally the idea, like first stir this way, and then stir this way is like actually not too bad of an approach.
But beyond that, back in the digital world, there are some differences.
And the one you call out is a pretty strong difference.
The way it shows up: if you just randomly initialize a neural network, actually, maybe it's a reasonable cipher as well, because the random initialization kind of jumbles stuff in a complicated way.
You may even like do what you want to do.
The thing that makes it interpretable is the gradient descent.
So you can differentiate a neural network and get a meaningful derivative.
And we do a lot of work to not overcomplicate the derivative.
So the residual connection keeps it contained and simple.
And so does the layer norm stuff that we do.
One of the biggest attacks against cryptographic ciphers is also to differentiate the cipher.
Ciphers run in a different number field.
They run in the field of two elements.
So binary, whereas neural nets run, in theory, in the field of real numbers.
And so you have to differentiate with respect to binary numbers.
But you can absolutely differentiate a cipher; that's called differential cryptanalysis.
And basically what it says is that if you take a small difference of the input, it's quite difficult to make the difference of the output be small.
Like, the whole job of a well-designed cipher is to make the difference in output large.
So, I guess the distinction is that the optimization goals at that point are about complexifying.
They don't have the same residual connections or layer norms. I mean, I guess a place where the two merge is backdoors.
Okay, so with a backdoor in an LLM, you're trying to hide, well, what do you consider an input?
It's not an input into the forward pass.
It's an input into the backward pass.
But you're trying to hide an input into the backward pass.
Like, you're, like, this is like an adversarial.
Yeah.
Yeah.
So, yeah, I mean, in fact, this is, like, this is actually a place where you get exactly the, um, sort of, avalanche property that ciphers have as well.
Adversarial attacks, typically on image classification models, right?
Can I find a perturbation of the image, a very, very small perturbation, that totally changes the classification, totally changes the output?
That is the common case in ciphers, whereas it's the undesired case in neural nets, for sure.
Okay, so I was asking you, have neural networks actually been used for cryptography, and you realized it might be better if you just do this at the blackboard.
Yeah.
Um, so I'm curious.
Are they actually being used for cryptography?
Yeah.
So, using neural nets for cryptography. Well, in general, in cryptography, creating new ciphers is a very, very dangerous proposition; almost all of them are broken, like 99% of them are broken.
So, probably a bad place to start, but the other direction has been, in at least one very clear case, quite productive.
So there's this construction that exists in ciphers, and then was imported into neural nets, called a Feistel cipher, or Feistel network.
So the idea is that you may have some function f, which is not invertible.
But you like the function because it does interesting things: it does an MLP, for example, or it mixes in an interesting way.
Um, you'd like to build something out of this that is invertible.
So, the construction we're going to make is going to be a two-input function rather than a one-input function.
And we're going to apply f of x.
We need to actually remember what x was.
So, we're going to stick x over here so that we can work backwards, and then we also can't drop y.
So, we're going to remember y, and we're going to add them together.
And so we form this tuple.
So, the way to invert this: if I have this output, and I want to recover x and y, well, I can easily recover x; that's right there.
I just read it off.
And then to recover y, if this thing was called z, I can recover y as z minus f of x, because I've already recovered x.
So, so that means that this construction is invertible.
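A minimal sketch of the construction just described, using integers modulo 2**32 so the inverse is exact. The particular mixing function `f` is a made-up stand-in chosen because it is visibly not invertible; it is not taken from any real cipher.

```python
MASK = (1 << 32) - 1


def f(x: int) -> int:
    """Hypothetical non-invertible mixing function: squaring collapses x and -x."""
    return (x * x + 12345) & MASK


def feistel_forward(x: int, y: int) -> tuple[int, int]:
    """(x, y) -> (x, z) where z = y + f(x). x is carried along unchanged."""
    return x, (y + f(x)) & MASK


def feistel_inverse(x: int, z: int) -> tuple[int, int]:
    """Read x off directly, then recover y = z - f(x)."""
    return x, (z - f(x)) & MASK


x, y = 7, 99
out = feistel_forward(x, y)
print(feistel_inverse(*out))  # (7, 99)

# f alone loses information: two different inputs give the same output.
print(f(3) == f((1 << 32) - 3))  # True
```

The pair is invertible even though `f` is not, because `f` is only ever evaluated in the forward direction, both when applying the step and when undoing it.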
Um, this was used in ciphers, like a ton, um, still is used.
It's one of the main mechanisms for constructing ciphers. Often you want ciphers to be invertible, especially the layers of ciphers, because that has better cryptographic properties.
This has actually been ported over into neural nets.
There's a 2017 or 2018 paper called RevNets, reversible networks.
And what it does is, you can apply it to any network, like a transformer network: I do a forwards pass, but then I can actually run the entire pass backwards as well.
Um, so the whole neural network is invertible.
Um, with exactly this construction.
And so, this paper, reversible networks, applied to some layer, like a transformer layer, for example: we've got this function f, which is our transformer layer.
Now, normally we would have just an input, and then there is a residual connection coming out.
Um, and it gets added like this.
Um, over here.
Um, but now, uh, the variation of this is going to be, we've got two inputs x and y.
Um, so we've got x and y inputs, um, x goes through the function, gets added to y, and then this becomes the new x, the output x, and then this x becomes the output y.
So, really, if you think of two layers back, this is actually doing the residual connection from two layers back: this y came from the previous layer and was the residual connection there.
But because of this construction, the whole thing is invertible. Why do I care?
What does invertible matter for?
Um, the big thing that it can be interesting for is for training.
Um, if I think of a forward pass of training, um, so I will, let's say I have four layers.
I run them in 0123 order.
I have to write all of the activations to HBM.
And so I get an HBM footprint here that is kind of linear in the number of layers.
Yep.
Um, so this, this actually can be, uh, the largest memory footprint during training.
And so this is normal training, and then I run the backwards pass and I read them in reverse: the forward pass goes forward, the backwards pass goes backwards, and I need to read them back out.
The idea of this RevNets paper is that because it's invertible, I don't need to store this at all.
I can completely rematerialize it when I'm running my backwards pass.
So I run my forwards pass, and then when I'm running my backwards pass, I'm simultaneously, in lockstep, undoing all of the forwards pass steps that I did, in order to have the activations that I need.
So this ends up being a memory saving, which is the idea.
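The layer wiring and the lockstep undo can be sketched on scalars. This is a toy illustration of the idea as described in the conversation, not the RevNet paper's implementation: the per-layer function `f` is a hypothetical stand-in for a transformer layer, no gradients are computed, and floating-point recovery is only approximate.

```python
import math


def f(x: float, layer: int) -> float:
    """Hypothetical stand-in for a layer's function (e.g. a transformer block)."""
    return math.tanh(x * (layer + 1))


def layer_forward(x: float, y: float, layer: int) -> tuple[float, float]:
    """x goes through f and is added to y; the old x becomes the new y."""
    return y + f(x, layer), x


def layer_inverse(new_x: float, new_y: float, layer: int) -> tuple[float, float]:
    """Undo one layer: the old x is new_y, and the old y is new_x - f(old x)."""
    x = new_y
    return x, new_x - f(x, layer)


def forward(x: float, y: float, n_layers: int) -> tuple[float, float]:
    """Run all layers, keeping only the final pair (no per-layer activations stored)."""
    for i in range(n_layers):
        x, y = layer_forward(x, y, i)
    return x, y


def recover_inputs_backward(x: float, y: float, n_layers: int):
    """Walk the layers in reverse, rematerializing each layer's input pair."""
    recovered = []
    for i in reversed(range(n_layers)):
        x, y = layer_inverse(x, y, i)
        recovered.append((i, x, y))  # a real backward pass would use these here
    return recovered


final = forward(0.3, -1.2, n_layers=4)
for layer, x, y in recover_inputs_backward(*final, n_layers=4):
    print(f"layer {layer}: input x={x:.6f}, y={y:.6f}")
```

Only the final pair is held after the forward pass, so the stored-activation footprint no longer grows with the number of layers, at the price of evaluating `f` again on the way back.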
Interesting.
And in, in some sense, you're, uh, spending more compute to save memory.
That's right.
Yeah.
Interesting.
Actually, it's kind of the opposite of what you're doing with the KV cache.
Yeah.
Yeah.
You're spending more memory to save compute.
Yeah.
Spending more memory to save compute is generally profitable, given where hardware is. Interesting.
Cool.
Uh, that's super fun.
Right.
Yeah.
Thank you so much for doing it.
I, uh, I feel like it really vindicated the vision behind the studio and the, and the, and the blackboard.
Cool.
Thanks a lot for doing it.
Thanks.