Dwarkesh Podcast · 2026-05-22

Reiner Pope on AI Chip Design and Systolic Arrays

Hosts: Dwarkesh

Guests: Reiner Pope

AI chip design · multiply-accumulate · systolic arrays · FPGA architecture · clock cycle optimization · GPU vs TPU · data movement · AI accelerator architecture


Why it matters

Multiply-accumulate is the fundamental primitive in AI chips for matrix multiplication, with 4-bit multiplication and 8-bit accumulation to balance precision and error accumulation.

Key claims

Multiply-accumulate is the fundamental primitive in AI chips for matrix multiplication, with 4-bit multiplication and 8-bit accumulation to balance precision and error accumulation.
Data movement between register files and ALUs dominates chip area and power costs, motivating the use of systolic arrays to amortize communication overhead.
Systolic arrays store weight matrices locally and stream input vectors through, enabling high compute-to-communication ratios essential for efficient AI workloads.
Clock cycle speed is limited by the longest logic path; pipeline registers are inserted to split logic and increase clock frequency at the cost of area.

Briefing memo

Summary

In this episode of the Dwarkesh Podcast, Reiner Pope, CEO of MatX, provides an in-depth explanation of AI chip design from the ground up. He starts with the fundamental logic gates and builds up to the architecture of AI chips, focusing on the multiply-accumulate operation as the core primitive for matrix multiplication in AI workloads. Pope explains the trade-offs in bit precision, the quadratic scaling of circuit size with bit width, and the importance of optimizing compute relative to data movement within the chip.

The discussion then moves to the evolution from traditional CUDA cores to systolic arrays, which are specialized hardware units designed to maximize compute throughput while minimizing data movement costs. Pope details how systolic arrays store weight matrices locally to reduce bandwidth demands and how this design principle is central to modern AI accelerators like TPUs. He also contrasts GPU and TPU architectures, highlighting the differences in granularity, data movement, and parallelism.

Further, Pope explains clock cycle synchronization, pipeline register insertion, and the challenges of balancing clock speed with area and throughput. The episode also covers FPGA architecture and why FPGAs offer deterministic latency at the cost of efficiency compared to ASICs. Finally, Pope touches on CPU design elements such as branch prediction and cache systems, and how these differ from AI accelerators. The conversation closes with reflections on the trade-offs in chip design sizing decisions and the future direction of systolic array architectures.

FPGAs provide deterministic latency and reconfigurability by using lookup tables and multiplexers but are less area- and power-efficient than ASICs.
CPU architectures rely heavily on caches and branch predictors, which introduce non-deterministic latency, unlike AI accelerators that use scratchpad memories for predictable performance.
GPUs consist of many small cores (SMs) with tensor cores, while TPUs have fewer but larger matrix units, trading off flexibility for efficiency in matrix multiplication.
Chip design involves sizing trade-offs between systolic array size and register file size to optimize area, power, and performance.

Source material

Transcript

I'm back with Reiner Pope, who is the CEO of MatX, which is a new AI chip company.

Last time we were talking about what happens inside a data center; now I want to understand what happens inside an AI chip.

How does a chip actually work?

Full disclosure, by the way: I am an individual investor in MatX, so hopefully you have designed a good chip.

Also, if you're listening to this on an audio platform, it's much preferable to watch this blackboard lecture on a platform where you can see what's happening, so just head over to YouTube or Spotify.

So I'll start with the very smallest fundamental unit of chip design, and then we'll build up to an actual production chip and what its components are.

At the very bottom level of a chip, the primitives that we work with are logic gates, which are very simple things like AND, OR, and NOT.

And then these are connected together by wires that have to be laid out physically as metal traces on a chip.

The main function that AI chips want to compute is multiplication of matrices, and inside that the fundamental primitive is multiply-accumulate of pairs of numbers.

So we're going to demonstrate what that calculation looks like by hand, and then infer what a circuit would look like for that.

It'll turn out to be easiest if I do the multiplication of a four-bit number with another four-bit number, and the actual clearest primitive is multiply-accumulate.

So we multiply these two terms, and then to the product of these two terms we're going to add in an 8-bit number.

Can I ask a clarifying question?

Why is this the natural primitive for whatever computation happens inside a computer?

Yeah, so there's a few reasons for this.

It's a little bit more efficient, but the reason it's natural for AI chips is what happens during a matrix multiply. In very short, a matrix multiply is a for loop over i, over j, and over k of output[i][k] += input[i][j] * other_input[j][k].

And so multiply-accumulate happens at every single step of a matrix multiply.
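The loop described here can be written out directly. This is a minimal sketch in plain Python; the function name `matmul_mac` and the list-of-lists layout are illustrative choices, not anything from the episode.

```python
def matmul_mac(a, b):
    """Multiply matrix a (I x J) by matrix b (J x K) with explicit multiply-accumulates."""
    rows, inner, cols = len(a), len(b), len(b[0])
    out = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(inner):
            for k in range(cols):
                # One multiply-accumulate per innermost step.
                out[i][k] += a[i][j] * b[j][k]
    return out


product = matmul_mac([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(product)
```

Every pass through the innermost line is one multiply-accumulate, which is why that operation is the natural hardware primitive.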

And then the other observation is that the precision will almost always be higher in the accumulation step than in the multiplication step.

This is maybe specific to AI chips: you're multiplying low-precision numbers, but when you accumulate, errors accumulate quickly, and so you need more precision here.

So that's why we've chosen to do a four-bit multiplication and an 8-bit addition.

Let me make sure I understood that. There are two ways to understand it: one is that the value will be larger than the inputs.

And the other, if it were a floating-point number, is maybe less intuitive to me, but maybe it's the same principle.

It is really the same principle.

I guess the separate principle is that as you are summing up this number, you are summing up a whole bunch of numbers, and so you've got a lot of rounding as we're accumulating.

Whereas in this case, there's only one multiplication in that chain, and so there aren't a lot of rounding errors accumulating in the multiplication.

Why are you summing up a whole bunch of numbers?

It's just two numbers.

I mean, this summation happens repeatedly.

I see, it's chained many times.

Yeah, and errors accumulate.

I see.

So how would we perform this calculation by hand?

I mean, as a human, we would probably separate it into two steps, but we can do it all in one using long multiplication.

For the multiplication term first, we're going to multiply this four-bit number here by every single bit position in the other four-bit number.

So we write that out.

First, one zero zero one multiplied by this bit position.

That is this number itself.

Then, shifted across by one, we multiply by zero.

That gives us an all-zeros number. Shifted across one more, we multiply by this one.

We get one zero zero one.

And then finally, for this last bit position, we get an all-zeros number again.

So this gives us a bunch of terms that we're going to have to add for the multiplication.

And then while we're doing that summation, we might as well add in the actual accumulator term as well.

So we just copy that directly across.

So this is the sum.

It's a five-way sum that we're going to want to compute.

So firstly, what logic gates did it take us to even get to this intermediate step?

We needed to produce all 16 of these partial products.

How do I produce one of these partial products?

So let's take this number one, for example, here.

It is one.

So how do we produce this number by multiplying this number by this one over here?

We can actually produce that with an AND gate.

This number is one if both this bit is one and this bit is one.

If either of them is zero, then the product is zero, since one times zero, or zero times anything, is zero.

So to produce all of this stuff, we ended up consuming 16 AND gates.

Or in the general case, if I were doing a P-bit by Q-bit multiply, then this will be P times Q many ANDs.
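The partial-product step can be sketched in a few lines. This is an illustrative model, not a circuit description: the helper name `partial_products` is made up, and the second operand `0101` is inferred from the spoken walk-through (rows of 1001, zeros, 1001, zeros) rather than stated explicitly.

```python
def partial_products(x, y, p=4, q=4):
    """Return q rows of p bits (least significant first).

    Row j is every bit of x AND-ed with bit j of y, so each entry
    corresponds to exactly one AND gate.
    """
    rows = []
    for j in range(q):
        y_bit = (y >> j) & 1
        rows.append([((x >> i) & 1) & y_bit for i in range(p)])
    return rows


rows = partial_products(0b1001, 0b0101)
and_gates = sum(len(row) for row in rows)

# Summing the rows at their shifted positions reconstructs the ordinary product.
total = sum(bit << (i + j) for j, row in enumerate(rows) for i, bit in enumerate(row))
print(and_gates, total)
```

With P = Q = 4 this uses 16 AND-ed bits, matching the P times Q count given in the transcript.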

Finally, I sum them.

Actually, most of the work is going to happen in the summing.

And so let me describe the other logic gate that we use here.

AND is almost the simplest logic gate that exists on a chip.

It's almost the smallest.

At the other extreme, typically the very largest logic gate that you'll use is something called a full adder.

Coming from software, you might think that a full adder adds 32-bit numbers together.

In this case, it just adds three single-bit numbers together.

And so you can think of it as adding zero, one, and one together.

Now when I add these together, the result can be zero, one, two, or three.

So I can express that in binary using just two bits.

So as input it has three bits and as output it has two bits, which in this case are one and zero: the number two in binary is one zero.

So this is also known as a three-to-two compressor, because it takes three bits of input and produces two bits of output.

The three inputs, an x and a y value and then some carry that came in, are all bits in the same bit position, like three bits that are in a column here.

Yeah.

And then the two outputs I have drawn vertically here and horizontally here, to match this vertical versus horizontal layout, which expresses that things in the same column are in the same bit position, whereas the other output goes to the adjacent column: this is the carry-out, whereas this was the sum.

So if the inputs to the full adder were, say, one zero one, then the output would still be one zero.

If it was one one one, it'd be one one; if it was zero zero zero, it'd be zero zero; if it was zero one zero, it'd be zero one.

So yeah, it's essentially just counting the number of ones and expressing that in binary.
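A full adder is small enough to write as two Boolean expressions. The sketch below uses the textbook XOR and majority formulation; the function name and the (carry, sum) output order are illustrative choices.

```python
from itertools import product


def full_adder(a, b, c):
    """3:2 compressor: three 1-bit inputs in, (carry, sum) out."""
    s = a ^ b ^ c                          # sum bit stays in the same column
    carry = (a & b) | (a & c) | (b & c)    # carry bit moves one column left
    return carry, s


# Truth table over all eight input combinations.
table = {bits: full_adder(*bits) for bits in product((0, 1), repeat=3)}
for bits, (carry, s) in table.items():
    print(bits, "->", carry, s)
```

Reading the two outputs as a binary number gives the count of ones among the inputs, which is the counting behavior described above.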

So this circuit actually can sort of capture what we as humans naturally do when we're doing something along a column.

So I'll show one iteration of using the full adder to sum.

The way we sum here is going to be a little bit unnatural for humans.

As humans we would sum along the column and then remember the carry, but instead of remembering the carry we'll actually just explicitly write it out.

So in this, we proceed from the rightmost column towards the left.

On the rightmost column we sum the one and the one, and that produces a zero here and a carry of one.

So we used this full adder circuit on this pair of bits and produced a pair of bits as output.

Now we can do the same thing with this column.

We've got a column of one, two, three, four numbers, so maybe we'll take the first three of them and run a full adder on them, and that gives us a zero and a zero as output.

So the sum of these is zero zero.

So that's the full adder, the compressor, applied to all of these bits.

As I've used up bits, I'm going to sort of just cross them out to indicate that I've handled them.

Let's just keep going a little bit more.

So we'll go here.

I take these three numbers.

I add them.

That gives me a one and a zero.

I've dealt with these three numbers, and now I can take, for example, these three numbers and add them, and that gives me a one and a zero, and I've dealt with these numbers.

So the way I should view this is that I have this whole grid of numbers that need to be added.

I'm going to just keep applying full adders to all the bits that are here, constantly removing three numbers from a column and then writing out two numbers as output.

I keep going with this over and over again until I eventually get just one single number coming out here.

Something like that.

This is probably the wrong sum.

So the approach that I've described here is called a Dadda multiplier.

And this is sort of the standard for how you do area-efficient multiplies using full adders.
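The column-by-column compression can be simulated in software. The sketch below is a simplified stand-in, not a real Dadda schedule: it reduces columns from right to left, pads with a zero when only two bits remain in a column, and drops carries out of the top column so the result wraps at P + Q bits. All names are hypothetical, and its adder count is specific to this naive schedule, so it can exceed the idealized P times Q figure.

```python
def full_adder(a, b, c):
    """3:2 compressor returning (carry, sum)."""
    total = a + b + c
    return total >> 1, total & 1


def mac_by_column_compression(x, y, acc, p=4, q=4):
    """Compute (x * y + acc) mod 2**(p+q) using only AND gates and full adders."""
    width = p + q
    columns = [[] for _ in range(width)]   # columns[k] holds bits of weight 2**k
    for i in range(p):
        for j in range(q):
            columns[i + j].append(((x >> i) & 1) & ((y >> j) & 1))  # partial product
    for k in range(width):
        columns[k].append((acc >> k) & 1)  # accumulator bit joins its column

    adders = 0
    result = 0
    for k in range(width):
        col = columns[k]
        while len(col) > 1:
            a, b = col.pop(), col.pop()
            c = col.pop() if col else 0    # pad with zero if only two bits were left
            carry, s = full_adder(a, b, c)
            adders += 1
            col.append(s)                  # sum stays in this column
            if k + 1 < width:
                columns[k + 1].append(carry)  # carry moves one column left
        result |= col[0] << k
    return result, adders


value, adders_used = mac_by_column_compression(0b1001, 0b0101, 0b00010001)
print(value, adders_used)
```

The point of the sketch is only that repeatedly replacing three bits in a column with a sum bit and a carry bit eventually leaves one bit per column, which is the final answer.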

Let's try to quantify the circuit size of this, just so we have a sense of how big things are and can compare against it later.

How many full adders did I use?

I started with the 16 partial products, which is the product of all of these terms with all of these terms, plus the 8 terms that I'm adding here.

So I started off with 24 bits and eventually produced 8 bits on the output, and in every step I was crossing off three numbers and writing two numbers out as a result.

And so every single use of a full adder eliminates one of the bits here. So how many full adders? It must be 24 minus 8, so there were 16 full adders in this circuit.

This is true in the general case as well: there will be P times Q many full adders in this circuit.

Let me make sure I understand the logic of that. The input bits, 24, is P times Q plus P plus Q. (That's right.) And the output bits is just P plus Q, and so (P × Q + P + Q) − (P + Q) = P × Q.

That's right.

So I think this explains, or at least hints at, the second reason why we chose to do a multiply-accumulate.

The first reason being that's actually what shows up in matrix multiplication, but the second reason being that it gives us this very slick P times Q, very simple algebra.

So we've described this whole procedure. Every single atomic step that I took here becomes a logic gate, and the wires connect them together: when I had these three inputs that I summed to produce these two outputs, if I think of mapping this to a physical device, there would be wires connecting all three of these things into a logic gate that produced this output.

Okay, so this is the main primitive, at different bit widths, that is inside an AI chip.

We're going to build up from here to how you would use that to run all of the other operations you might want.

This might be the wrong time to ask this question, but whenever Nvidia reports that a chip can do X many FP4 or half as many FP8, it seems to imply that those circuits are fungible, that there isn't dedicated FP4 versus FP8. But the way you're mapping it out here, if it has to be mapped out in the logic, it seems like you would need a dedicated FP4 multiply-accumulate and then a dedicated FP8 multiply-accumulate.

Basically, are they fungible?

As drawn, they're actually not particularly fungible.

This is actually one of the main choices you have to make when designing a chip: how much FP4 and how much FP8 do I have? Sometimes I'll make that consideration from the point of view of what I think the customer requirement is. Another angle on that is the power budget: equalize the power budget between FP4 and FP8.

So then when they report those numbers, and it just happens to be the case that it has 2x as many FP4 as FP8, did they just happen to give equivalent die area to each of the floating-point formats, and as a result that ends up being the ratio?

Yeah, exactly.

Yeah. So part of it is that, but rather than it being exactly equivalent die area, there's a data movement reason actually, and maybe we'll come back to this when we look at how data goes into and out of memories.

There's something really nice, just from a software level, about the fact that I can pack two 4-bit numbers into the same storage as an 8-bit number. And so when I store that to a memory or something like that, the sizing of the buses that I wire within the chip makes that work out really nicely.

Now that I come to think about it, it's not just 2x: the amount of area it takes, it sounds like, is quadratic.

It's quadratic in terms of, yeah, the bit length. So that's why a smaller precision is even more favorable than you'd think.
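The quadratic claim follows from the P times Q full-adder count derived earlier. A tiny sketch, assuming a plain integer multiplier with equal operand widths and ignoring floating-point exponent handling entirely (the function name is made up):

```python
def multiplier_full_adders(bits):
    """Full-adder count for a bits x bits integer multiply-accumulate (the P*Q rule)."""
    return bits * bits


cost_8 = multiplier_full_adders(8)
cost_4 = multiplier_full_adders(4)
ratio = cost_8 / cost_4
print(cost_8, cost_4, ratio)
```

Under this simplified model, halving the bit width cuts multiplier area by about four rather than two.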

This is a really big reason. In fact, Nvidia made a change. Historically, up until B100 or B200, every time you halve the bit precision you double the FLOP count. For exactly the reason you said, because of this quadratic scaling, that ratio is actually slightly wrong: you should get an even bigger speedup than you might otherwise think. Nvidia's product specs have started to acknowledge that in B300 and beyond, where FP4 is three times faster than FP8.

Though it should be four times?

Yeah. What I've shown here is the simplest case, integer multiply. When you're dealing with floating point, as you do in FP4 and FP8, there's this other term, the exponent, that complicates this calculation.

So what can we see already from this? I think the big observation you've made is that there's this quadratic scaling with bit width, which is very effective and is the single reason why low-precision arithmetic has worked so well for neural nets.

But the other thing we're going to do now is compare the area spent on the multiplication itself with all of the circuitry that is around it. So we'll walk back in time a little bit and see how GPUs prior to tensor cores worked, which is in fact the same way that CPUs work. That is, where do we stick this multiply-accumulate?

Generically, I'll describe a CUDA core or a CPU. You'll have some register file, which stores some number of entries; maybe it's eight entries of, in this case, four-bit numbers, but typically 32-bit numbers or something like that. So inside the CUDA core I'll have some register file of some depth, and then I will have my multiply-accumulate circuit. What it's going to do is take three arbitrary registers from this register file, perform the multiply-accumulate, and then write back to the register file. So it's going to maybe write to this one, but it was able to read from this one, this one, and another random one. So it'll take three inputs like this.

So this is the core data path of many processors; most processors look like this. You've got some set of registers and then some set of logic units, or ALUs. We want to analyze the cost of the data movement from the register file to the ALU and back.

Ultimately there's going to be some circuit that says: I don't always have to select this guy, I might select any of the registers at any point in time. So a first question is how I can build such a circuit. The circuit I'm looking for is a mux. In this case it's going to have eight inputs, one from each entry of the register file, and it's going to have one output. And what is the cost of this thing? All we have to build it out of is AND and OR. So how do we build it? We do the dumbest thing possible: we form a mask. When we want to read, say, the third entry, we're going to AND every single entry with either one or zero based on whether that's the one we want to read, and then we're going to OR all of them together.

Okay, just to make sure I understand the basics: what the mux is doing is just selecting an input?

Yeah.

So, invisible to software, you say "I want input number three," and that means there's a mux.

Yeah. And so what is the cost of this mux? Take an n-input mux operating on p bits. I have n rows, that's these eight rows, and each row is p bits. I have to AND every single bit, so I get n times p many AND gates: for every single input I have to say whether I'm going to mask it out or not. And then I'm going to OR them all together, so there's going to be n minus one times p many OR gates. I've got all of these different things, almost all of them zeros, but I need to collapse them down from my eight options into one option, and so at every step I need to OR one row into an existing row.

Yeah. It's actually kind of funny: you don't think at the level of hardware, you just think "oh, I'll just select element three," and something as simple as that is in and of itself quite a complicated circuit.

Yeah, I mean this is the first step of all of the hidden data movement costs that show up. And so we're just going to compare. I have to pay this cost, and I've got one mux here, and then in fact I have two more copies of that, one for each of the three inputs to my multiply-accumulate operation. So I have this cost, which is three times n times p AND gates over here, compared to this p times q gates in the actual circuit that is doing the thing I care about. And if we plug in actual numbers, with n being 8, I get 24 times p gates just in the data movement, compared to, if q is 4, 4 times p gates in the multiply-accumulate itself.

And sorry, where is the three coming from?

The three different inputs here. Go ahead.

So really what I'm hinting at here is that all of this work, which scales with the size of the register file (and this is a very small register file), all of this work just moving the data from the register file to the logic unit is many, many times more expensive than the logic unit.

In the most recent ClusterMAX report, SemiAnalysis ranked almost 100 different GPU clouds. Crusoe was one of only five that made the gold tier.

SemiAnalysis found that gold tier providers like Crusoe had a total cost of ownership that was 5 to 15% lower than silver tier ones, even when they had identical GPU pricing.

This makes sense because total cost of ownership is downstream of a bunch of different things that don't necessarily show up in the sticker price but that Crusoe has optimized, things like how well you detect faults and how quickly you replace failed nodes.

For example, Crusoe was one of the first clouds to adopt NVSentinel, Nvidia's own GPU monitoring and self-healing software, for enhanced GPU uptime, utilization, and reliability.

This lets Crusoe make use of everything that Nvidia has learned about why chips fail across all their different fleets and deployments, so that Crusoe can catch faults earlier in the process.

And once they identify a failure, Crusoe can swap in a healthy node in less than 10 minutes.

Because these are not running bare metal, Crusoe doesn't have to spend time installing an operating system or configuring drivers.

They can just spin up a new VM on an already running and pre-qualified host.

If you want to learn more about this or the other reasons that Crusoe made gold tier, go to crusoe.ai/dwarkesh.

It may be helpful to just see what a mux looks like, maybe a two-bit or four-bit mux.

Yeah, great.

So we'll take some inputs.

We'll just do a two-way mux.

So we've got two different numbers.

We've got these two inputs; these are the inputs being selected between, and then we have a selector,

which can either be "I want this one" or "I want the other one."

So this is a one-hot encoding.

So this is what we start with.

And then the output we want to produce: let's focus on this case.

So this is the actual input we've got.

We just want to produce this guy as the output.

And so, very laboriously, what we do is we AND this bit with all of these.

And so that's ANDing this bit with this row, and likewise we AND this bit with this row, which produces all zeros.

So there's four ANDs here.

There's four ANDs here.

And then finally we just OR these two together, and this gives a one.

We OR these two together.

This gives a one.

We OR these two together.

Gives a zero.

We OR these two together.

It gives a one.

And so this is the four ORs.

So this actually ends up looking a little bit like addition.

In fact, we did exactly the same set of ANDs here.

So it's as if we've added all of these things together, but instead of collapsing with these full adder circuits we just get a very simple collapsing with OR gates.
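The AND-then-OR construction can be mirrored in code while counting gates. This is a sketch under assumptions: the function name is hypothetical, and the second input row is made up, since only the selected output pattern (one, one, zero, one) is recoverable from the walk-through.

```python
def one_hot_mux(rows, select):
    """Select one row of bits using only AND and OR; select has exactly one 1."""
    width = len(rows[0])
    and_count = 0
    or_count = 0

    masked = []
    for row, sel in zip(rows, select):
        masked.append([bit & sel for bit in row])    # p AND gates per input row
        and_count += width

    out = masked[0]
    for row in masked[1:]:
        out = [a | b for a, b in zip(out, row)]      # p OR gates per additional row
        or_count += width
    return out, and_count, or_count


out, ands, ors = one_hot_mux([[1, 1, 0, 1], [0, 1, 1, 0]], select=[1, 0])
print(out, ands, ors)
```

For n rows of p bits this uses n times p ANDs and (n minus 1) times p ORs, so the two-way, four-bit case needs 8 ANDs and 4 ORs.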

And that doesn't look like n times p.

So yeah, this was with n equals two inputs.

In the general case we will have n rows, and then we'll have p bits per row.

So that gives us the n times p many AND gates.

So as I've described it here, almost all of the cost, like seven-eighths of the cost, is in reading and writing the register file, and only a tiny fraction of the cost is in the logic unit itself.
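The comparison can be made concrete with the gate counts from the discussion. This sketch counts only the AND gates of the three read muxes (3 × n × p) against the p × q multiplier term, so it is a rough model with hypothetical function names; it leaves out OR gates and write-back, which is one reason it lands near, but not exactly on, the seven-eighths figure quoted.

```python
def read_mux_and_gates(n_registers, p_bits, read_ports=3):
    """AND gates spent selecting operands: one n-way mux per read port."""
    return read_ports * n_registers * p_bits


def multiplier_gates(p_bits, q_bits):
    """Gate count of the multiply-accumulate itself, using the P*Q rule."""
    return p_bits * q_bits


p = 4
movement = read_mux_and_gates(n_registers=8, p_bits=p)   # 24 * p
compute = multiplier_gates(p_bits=p, q_bits=4)           # 4 * p
movement_share = movement / (movement + compute)
print(movement, compute, round(movement_share, 3))
```

Even with only eight registers, operand selection costs about six times as many gates as the arithmetic in this model, and the gap widens as the register file grows.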

So this is the problem to solve.

This essentially was the state of play prior to the Volta generation of Nvidia GPUs.

This kind of thing is what was inside the CUDA cores, and this problem statement is what motivated the introduction of tensor cores, which are more generally called systolic arrays.

So think about how we are going to solve this problem: we were spending almost all of our circuit area on something that we just really don't care about and that is hidden from software, and the thing that we actually care about is not much of the area.

We'll make this one bigger somehow while keeping this at the same size.

That's the goal.

So the evolution was: we had baked this much into hardware at this stage.

This single line is a multiply-accumulate, and this single thing was baked into hardware.

The idea of a systolic array is to go two levels of loop up and bake this entire loop out here into hardware.

And so the idea is that if we have a much bigger granularity of fixed-function logic, then maybe the taxes we pay on the input and output are much smaller.

It sounds like you're suggesting that if you go up one step in the matrix multiply loop, you can tilt the balance more towards compute than communication.

That's right.

So there's two effects that we're going to take advantage of here.

One is just that we can do more stuff before we pay for every trip through a register file.

And the other thing we're going to take advantage of is that, in some of this loop, some things stay fixed.

So visually, we're going to look at this matrix multiplication.

This portion of the loop corresponds to a matrix-vector multiplication, in fact.

So we'll take a matrix and multiply it by a vector.

So how do we do this?

Every column gets multiplied by the vector and then summed, so we're going to sum along columns.

So this 0 and 3 get multiplied by the 3 and 7 and get summed, and then the 1 and 2 get multiplied by the 3 and 7 and get summed.

So there is a multiply-accumulate associated with every single one of these entries in the matrix.

So we'll just draw out these four multiply-accumulates.

Let me just make sure I understand why there are four multiply-accumulates.

So each entry in the output vector, which corresponds to a column, is a dot product, and in this case it will be two multiplications and then the addition of those two multiplications.

So you're accumulating.

Yeah. So really there's only one addition per dot product, but we like to start with the initialization of 0.

Yeah. So what we're going to aim for is this: we want to have quadratically more compute. We've got x times y as much compute as we had before, but we want to somehow aim for having only x times as much communication. That's the intention, so that we get this advantage term going as y.

So we've laid down the multiplications. We're going to want to bring in a vector of size two, and that is already in line with our communication target, so that's fine.

However, we need to somehow manage the communication of this matrix, which exceeds our budget of x. The idea is that in an AI context this matrix is actually going to stay fixed for a long period of time. We've got some register file sitting over here, and the amount of stuff coming out of this register file is the term that we want to go as x in some sense.

We don't want to bring this full matrix in from the register file every cycle, because that would cost too much in terms of wiring from the register file. So instead, our key trick is that this matrix can be stored locally in the systolic array. We'll store these numbers 0, 1, 2, and 3 in a gate called a register that physically stores these numbers, and we're going to reuse these numbers over and over again for a large number of different inputs.

And so the optimization here is that, by the nature of matrix multiplication, you can store this square, quadratic thing directly where the logic is happening, and it has an extra dimension compared to the inputs, which you keep swapping in and out.

That's right.

I mean, this is the nature of what a matrix multiplication is: you do a lot of multiplications to get one value out. A dot product is the result of a lot of multiplications, and so that optimization means that you can stuff a lot of multiplication in before you get some value out of it.

That's right.

So just to complete the picture here of concretely how that looks.

I swapped the 3 and the 2 here: 3 and 2.

So this 0 and 3 is going to multiply by the 3 and 7, and so we're going to form a dot product along columns here.

So somehow we're going to feed a 3 and a 7 in here.

This feeds into this multiplication and also into this multiplication; likewise the 3 feeds into here and also into here. And then we're going to sum along here: starting at the top of a column we feed in 0s, and then coming out the bottom we get results.

So sort of just to visually see what we've got.

There's a dot product that is performed along columns in a matrix and that sort of maps exactly to what is done spatially in the systolic array here.

So this is one dot product summed vertically and this is a second dot product also summed vertically.

And then what is the data that needs to go into and out of the register file?

We have x amount of data coming out here on the output, and we also have x amount of data coming in from the input.

And so with respect to the input and output vectors, at least, we've met our goal of having only x as much data going in and out of the register file.
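The column-wise dot product described above can be sketched in a few lines of Python. This is a minimal functional model, not a cycle-accurate one: the function name `systolic_matvec` and the particular layout of the weights 0, 1, 2, 3 are assumptions made for illustration and may not match the drawing in the episode.

```python
# Toy model of a weight-stationary systolic array (illustrative assumption,
# not a cycle-accurate simulation).
# weights[r][c] stays fixed inside the array. Each input value is shared by
# every cell in its row, and partial sums flow down each column.

def systolic_matvec(weights, inputs):
    rows = len(weights)
    cols = len(weights[0])
    outputs = []
    for c in range(cols):
        partial = 0  # a zero is fed in at the top of the column
        for r in range(rows):
            # cell (r, c) multiplies its stored weight by the row's input
            # and adds the partial sum arriving from the cell above
            partial += inputs[r] * weights[r][c]
        outputs.append(partial)  # the dot product leaves at the bottom
    return outputs


weights = [[0, 1],
           [2, 3]]  # hypothetical placement of the stored numbers
print(systolic_matvec(weights, [3, 7]))  # [14, 24]
```

Only the input vector and the output vector cross the boundary of the array per step; the weights never move.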

This leaves open a question. I said that the weight matrix is stored locally in the systolic array, but how did it get there in the first place? At some point you need to boot your chip and populate this data, so where did that come from?

The trick is that we just do it very slowly.

We very slowly trickle-feed it into the systolic array.

The simplest strategy is that we run this daisy chain that says: feed a number into here, and then on the next clock cycle it'll move down to the next entry of the systolic array.

We can do that in every column in parallel. This is also going to come from here, and it is going to be another factor of approximately x units of bandwidth coming in.

Would you mind explaining that again?

So we know that we're only going to be bringing numbers into the matrix rarely.

And so we just want to come up with any construction at all that limits the amount of wiring that actually crosses the boundary of this systolic array, this boundary right here.

We just want to keep that bounded to x and not going to xy.

And so a particularly simple strategy is that we bring a number into the top row of the systolic array.

That's what we can do in one clock cycle.

And then for y consecutive clock cycles we're going to be bringing in the top row every time and shifting all the other ones down by one.

And that keeps the wiring that needs to come from this expensive register file down to a factor of x.
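The shift-in loading strategy above can be sketched as follows. This is a simplified model under stated assumptions: the function name `load_weights` is hypothetical, x is the number of columns, y is the number of rows, and one full row of x values crosses the array boundary per clock cycle.

```python
# Toy model of trickle-loading weights into a systolic array.
# Assumption: each clock cycle, one row of x values enters at the top and
# every row already in the array shifts down by one position.

def load_weights(weights):
    rows, cols = len(weights), len(weights[0])
    array = [[None] * cols for _ in range(rows)]
    cycles = 0
    # Feed the bottom row first so it ends up at the bottom after all shifts.
    for new_row in reversed(weights):
        for r in range(rows - 1, 0, -1):
            array[r] = array[r - 1]  # shift stored rows down by one
        array[0] = list(new_row)     # only `cols` values cross the boundary
        cycles += 1
    return array, cycles


array, cycles = load_weights([[0, 1], [2, 3]])
print(array, cycles)  # [[0, 1], [2, 3]] 2
```

Loading takes y cycles instead of one, but the wiring into the array stays proportional to x rather than x times y.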

I see, okay.

So there's two questions in terms of communication and there's like communication time and then there's communication bandwidth.

Yes.

And you're saying that since we're only going to be loading this in once, let's minimize bandwidth.

That's right.

Because bandwidth equals die area.

And let's just load it in slowly over smaller lanes.

Because we're just going to keep this value in there for a while.

Interesting.

So it's interesting to me that when we were talking last time about inference across many chips, the big high-level thing we were trying to optimize for was to increase the amount of compute per memory bandwidth, that is to say per communication.

And here also we're trying to increase the number of actual multiplies or actual additions relative to transporting information from registers to the logic.

So in both cases, you're trying to maximize compute relative to communication.

Yeah.

This shows up all the way up and down the stack.

This is sort of close to the bottom, close to the gates.

There's a version that's maybe even closer to the gates: even the precision of the number format that you choose to use.

We saw that same effect.

There's like a square-cube law, or a square versus linear term, going on both purely in the precision of this ALU and also in the size of the matrix.

Yeah.

Interesting.

So this unit is sort of the next bigger unit.

We had like the multiplication circuit.

And then on top of that we have a pretty large systolic array.

I drew it as 2 by 2, but in, for example, older TPUs, they would describe it as 128 by 128 of this circuit shown here.

And this circuit ends up being the most efficient known circuit for implementing a matrix multiply.
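To make the square versus linear effect concrete, here is a small worked calculation. It assumes a square n by n array in steady state, with one multiply-accumulate per cell per cycle and n input values plus n output values crossing the boundary per cycle; the function name and these simplifications are illustrative assumptions, not figures from the episode.

```python
# Compute-to-communication ratio of an n x n systolic array (toy estimate).
# Assumptions: every cell does one multiply-accumulate per cycle, and per
# cycle n inputs enter and n outputs leave the array.

def macs_per_boundary_value(n):
    macs = n * n      # compute grows with the area of the array
    boundary = n + n  # communication grows with its edge
    return macs / boundary


for n in (2, 128):
    print(n, macs_per_boundary_value(n))
# 2 1.0
# 128 64.0
```

Under these assumptions, growing the array from 2 by 2 to 128 by 128 raises the work done per value moved by a factor of 64.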

I see.

So we've talked about how it seems obvious that you'd try to maximize compute relative to communication.

What are the non-obvious trade-offs that are actually keeping you up at night, about "should we do X, or should we do Y?"

And it's not obvious what the answer is.

Yeah.

So I mean, I think most of the decisions in chip design are sizing decisions.

And so already in what we've drawn so far: AI chips all have this circuit in them.

They have a systolic array and then, somewhere near it, a register file providing inputs and outputs.

Even within this scope, the two sizing questions that you have are: how big should I make my systolic array, and how big should I make the register file?

These two questions are actually coupled. One way to think about the trade-off is to say, I'm going to have a budget for what percentage of my chip area I want to spend on data movement.

So maybe I just say that I want this to be 10% and the systolic array to be 90%.

And then I can size my register file. Bigger register files are more flexible.

They allow me to get more application-level performance out, but then they take away from the area spent on the systolic array.

I recently ran an essay contest where I asked people to write about what I consider to be some of the biggest open questions about AI.

The submission window closed last week, so I used Cursor to create a couple of different interfaces to help me review the entries.

One interface anonymizes the submissions and hides unnecessary information.

It lets me group responses by question, add notes, and record my scores.

The other interface helps me review entrants who also want to be considered for the researcher role that I'm hiring for.

The UI puts the applicant's essay right next to the resume and the personal website so that I can see everything at once.

Cursor's harness is really good at helping these models see and improve their UI.

I watched it render these interfaces in the built-in browser, take screenshots, click through sections, and keep iterating.

At this point, Cursor is where I do most of my work.

Whether I'm reading and visualizing a bunch of research papers, or coding up an interface to review applications, or making flashcards from my blackboard lectures.

Cursor just makes it very easy for an AI to look at whatever I'm looking at and help me understand it and work with me on it.

So whatever you're working on, you should do it in Cursor.

Go to cursor.com/dwarkesh.

Where does the clock cycle of a chip come from, and what determines what that is?

Yeah, and what is a clock cycle of a chip?

So I guess at baseline, it's worth observing that chips are incredibly parallel, right?

You've got 100 billion transistors in a chip.

A key thing that you need to do whenever you have massive parallelism is synchronize between the different parallel units.

In software, typically you have these very expensive synchronization methods like a mutex.

So one thread will finish what it's doing.

It will grab a lock somewhere stored in memory and then notify the other thread that it's done.

On chips, we take a very different approach and say that every nanosecond or so, all circuitry in the chip will kind of pause for a moment and then synchronize.

So that is the clock cycle.

The entire chip, typically all in one fell swoop, goes in lock step to the next operation that happens.

So what this looks like in circuitry is typically drawn like this.

The clock is mediated by registers, which are these storage devices that we've drawn elsewhere.

The way to think of it is that I have some storage, which I'm drawing as a bit, which might be 0 or 1.

And then I have some cloud of logic, which maybe is this systolic array or this multiplier or something like that.

And that's going to produce some output.

So I've got a bunch of inputs feeding into this cloud of logic, and then eventually there's going to be some output register that this writes to.

There is a global clock signal, which drives all of these registers, and it says that at a certain instant in time, when the clock strikes, whatever value happens to be on this wire at that instant is what's going to get stored in there.

And so the challenge here is that I would like to have my clock speed run as fast as possible, because if I can run at two gigahertz, I can get twice as many operations done per second than if I run at one gigahertz.

But what that ends up meaning is that I'm very sensitive to the delay through this cloud of logic, because any computation that is going to happen in here needs to finish before the next clock cycle hits.

So a major point of optimization on any chip is to make this delay as short as possible.

Interesting.

The constraint here seems to be that if you add too much logic, then you might risk missing the clock cycle.

But if you don't add enough, then you're leaving potential compute on the table.

Is there ever a situation where you'd take a probabilistic chance that a computation finishes, or is it just: no, either it's going to finish by the clock cycle or not?

Yeah, in standard chip design, you margin it such that, I mean, there is a probability, but it's many, many standard deviations out, such that for all intents and purposes it is a reliable part.

It will always meet the clock.

There are some weird exceptions to that.

There are clock domain crossings where you go from one clock to the other clock, and then you actually do have to reason about this probability.

Interesting.

In the main path, you margin it such that you'll get there, like, 25% of the clock cycle in advance, so it's very unlikely.

And where the clock synchronizes, I guess where the registers are, is this not something you determine as a chip designer?

Is it just an artifact of: hey, I want whatever sequence of logic, and then the software you use to convert your Verilog into the thing you send to TSMC just determines that, in order to make this work, you've got to put a register here, here and here, to make sure that there's no one step that is too long, such that it makes the whole clock cycle of the entire chip longer than it has to be?

Yeah.

So this is actually a huge part of the work of designing a chip: inserting them.

It is done in a combination of manual and automatic ways.

Just to show the very dumb version of what you can do here, you can take this logic and split it in half.

So instead of just one cloud of logic, I'm going to have two smaller clouds of logic, which do the same thing, but split up by a register.

Right.

Feeding in like this.

If you split it in the middle, you can hit twice the clock frequency.

That's great.

You get twice the performance at the cost of this extra register.

And so at the cost of some more storage.
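The effect of splitting a cloud of logic with a pipeline register can be sketched numerically. This is a simplified model under stated assumptions: the function name `max_clock_ghz` and the delay figures are hypothetical, and the timing overhead of the register itself and any safety margin are ignored.

```python
# Toy model: the clock period must cover the slowest stage between registers.
# Assumption: register timing overhead and design margin are ignored.

def max_clock_ghz(stage_delays_ns):
    # 1 / nanoseconds gives gigahertz
    return 1.0 / max(stage_delays_ns)


one_cloud = [1.0]        # a single 1 ns cloud of logic
even_split = [0.5, 0.5]  # the same logic split exactly in the middle
uneven_split = [0.7, 0.3]

print(max_clock_ghz(one_cloud))     # 1.0
print(max_clock_ghz(even_split))    # 2.0
print(max_clock_ghz(uneven_split))  # limited by the 0.7 ns stage
```

The uneven case shows why the longest path matters: an unbalanced split gains less than the factor of two from an even one.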

And stepping back, why do we need to synchronize the whole chip?

If you imagine playing Factorio or something, there's no global clock cycle.

It's just done when it's done.

There's iron on the plate.

You can take it if you want.

Yeah.

So taking that analogy, the thing that you need to be mindful of is when I've got two different paths through some logic.

So I have to do a computation like f here and then a computation g here.

And then they're going to come and meet for computation h, somewhere here.

And there's going to be manufacturing variance here.

In some chips, f will take a little longer; in some chips, g will take a little longer.

And so I've got some signal that's propagating through here.

And the results from f and g have to meet up at h.

The thing that can go wrong is that f can get there early and meet the previous value of g, or the next value of g.

And h needs to know exactly when to start. Like, when has this next iteration...

And so this explains why different chips made at the same process node, the same TSMC technology, can have different clock cycles.

You've got two chips made at three nanometer that might have different clock cycles, based on whether they were able to optimize, making sure that there's no one critical path that is so long that it slows down the whole chip's clock cycle.

That's right.

This optimization that I showed here is called pipeline register insertion.

We've inserted a pipeline register in the middle here.

This is a pure trade-off between clock speed and area.

Yeah.

This is the easy case.

There is a harder case, too. What I've drawn here is a pipeline of logic.

But in other cases, you may have some calculation which actually feeds back in on itself.

So it runs some function f and then writes back to itself like this.

For example, this might be an addition.

You've got some number that you're adding into every clock cycle, so this could be a plus, where we're adding in some number every clock cycle.

So this little circuit essentially is just going to sum all of the numbers that were entered on the different clock cycles.

And the challenge is, if this plus takes too long, what can I do?

If I try to put a pipeline register right in the middle of it, like here, this will end up changing the computation that's done.

Instead of forming a running sum of everything that comes in here, I will actually have two different running sums: I'll end up having a running sum of the even numbers and a running sum of the odd numbers.
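The change in behavior can be seen in a small simulation. This is a simplified sketch: the function names are hypothetical, and the extra pipeline register is modeled simply as one more cycle of delay around the feedback loop rather than as a register placed inside a real adder circuit.

```python
# Toy simulation of an accumulator loop, with and without an extra
# pipeline register in the feedback path.

def running_sum(inputs):
    acc = 0  # single register in the loop
    out = []
    for x in inputs:
        acc = acc + x
        out.append(acc)
    return out


def running_sum_with_extra_register(inputs):
    # Two registers in the loop: the value fed back into the adder is
    # two clock cycles old instead of one.
    regs = [0, 0]
    out = []
    for x in inputs:
        new = regs[1] + x      # the adder sees the older register
        regs = [new, regs[0]]  # both registers advance on the clock edge
        out.append(new)
    return out


data = [1, 2, 3, 4, 5, 6]
print(running_sum(data))                      # [1, 3, 6, 10, 15, 21]
print(running_sum_with_extra_register(data))  # [1, 2, 4, 6, 9, 12]
```

The second output interleaves two independent sums: 1, 4, 9 accumulates the first, third and fifth inputs, and 2, 6, 12 accumulates the second, fourth and sixth.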

So this constraint, where I have a loop in my logic, which all chips have somewhere, is actually the hardest thing to address, and it sets the clock cycle.

I don't understand why it's a problem to have that.

Or I'm not sure even what it would mean to have a register there.

Because it's sort of an atomic operation, right?

Yeah, well, plus is not really atomic.

I think as we just demonstrated.

Yeah, it took a whole lot of work to do a transformation.

And so you can take the early parts of that work, then stick a register in the middle, and then take the late parts of that work.

Okay, okay.

Yeah.

And I guess TSMC offers a PDK which says, hey, here are the primitives of logic that we can grant you in the chip, and it's up to them to determine that no primitive is bigger than the clock cycle that the process node targets.

But other than that, what is there further to optimize?

Can't you just say, hey, here are all the primitives from TSMC, and keep adding registers in between the primitives as much as needed until you get your desired clock cycle?

Yeah, the logic designers, the chip architects, set the clock cycle.

Just as one example, the primitives you get from TSMC are on the order of NAND gates or full adders.

It depends a lot on voltage and frequency and which library you choose and so on.

But generally you can have about 10 or 20 or 30 of these in a clock cycle sequentially.

So these primitives are very, very fast.

Like 10 picoseconds or something like that.

And so as a logic designer, in principle, if you literally just had a register and then a NAND gate in a loop like that, you could get an insanely fast clock speed, more than four or five or six gigahertz or something like that.

But if you take this really simple circuit and look at the area you're spending here, this is called one gate equivalent in size.

So it's a unit of one in area, and this thing is a unit of eight in area or something like that.

And so this is, again, almost entirely synchronization or communication cost compared to the actual logic.

So this would be a case where you've gone too far.

You've made your clock speed really, really fast at the cost of spending almost all of your area on pipeline registers.
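A rough worked calculation of that area split is below. It reads the figures above as a NAND gate costing one unit of area and a register costing about eight, which is an interpretation of the drawing rather than something stated outright; the function name and the stage depths tried are likewise assumptions.

```python
# Toy estimate: what fraction of a pipeline stage's area is useful logic?
# Assumptions: a gate costs 1 unit of area, a register costs about 8,
# and each stage is a chain of gates followed by one register.

GATE_AREA = 1
REGISTER_AREA = 8


def logic_area_fraction(gates_per_stage):
    logic = gates_per_stage * GATE_AREA
    return logic / (logic + REGISTER_AREA)


for gates in (1, 10, 20, 30):
    print(gates, round(logic_area_fraction(gates), 2))
# 1 0.11
# 10 0.56
# 20 0.71
# 30 0.79
```

With a single gate between registers, only about a ninth of the area does logic, while the 10 to 30 gates per stage mentioned above leave most of the area for logic.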

Interesting.

So what you're hinting at is a dynamic where you can have really fast clock speed, but you're not getting that much work done.

Yeah.

And so you can have low latency, but low bandwidth, or throughput rather.

Yeah.

It hurts your throughput, in fact, because you can think of the throughput of your chip as the product of how much I can get done per clock cycle, which is based on this area efficiency thing, times how many clocks I get per second.

This is actually so similar to the thing we were discussing last time about batch size, where if you have a low batch size, then any one user can receive their next token really fast, but the total number of tokens that are processed in, say, an hour will be lower than it could otherwise be.

Yeah.

Exactly.

You get less parallelism out if you drive your clock speed up, I believe.

Language models are starting to compete against the best human forecasters.

I sat down with two senior Jane Streeters, Ron Minsky and Dan Pontecorvo, and asked: at some point, does AI just do what Jane Street does?

There's a world that we should take seriously where, you know, we're going to build large language models or some other AI systems that are strictly smarter than all humans on the planet and more capable at all cognitive tasks. Trading in particular feels to me kind of AGI-complete, sort of like NP-complete, because at the end of the day, trading involves figuring out what things are worth, which means making predictions about the future.

Jane Street isn't betting against AI.

They just signed a six million dollar compute deal, but Ron's view is that the edge keeps moving.

I have never been more desperate to hire more engineers and more traders than I am today.

You know, you have the usual thing where the hard parts that we don't yet know how to automate end up being where the competitive edge lies.

You can find these open positions and watch the full interview at janestreet.com/dwarkesh.

Okay, so I remember talking to an FPGA engineer at Jane Street, Clark, who actually helped me prepare for the previous interview we did together, and he was explaining why they use FPGAs.

I imagine that for high-frequency trading, throughput is less important than latency, and so having very specific control of the clock cycle in deterministic ways is the most important thing.

Maybe it would be interesting to talk about why you can't just achieve that with an ASIC, or why you might use an FPGA to have deterministic clock cycles for high-frequency trading.

Yeah, so firstly, let's consider the business case for an FPGA versus an ASIC.

FPGAs and ASICs use largely the same conceptual model, which is that I have a series of gates built from ANDs or XORs or ORs, those very small primitives, connected together with wires and running on a fixed clock cycle.

So anything you can express in an FPGA you can express in an ASIC too, and it'll be about an order of magnitude cheaper and have better energy efficiency on an ASIC than on an FPGA.

The trade-off is that the first FPGA costs you $10,000, whereas the first ASIC you make costs you $30 million, because it requires an entire tape-out.

So the business use case for an FPGA would be that I want something that has this very deterministic latency, fast run time and high parallelism, but I'm going to change it very frequently, change what I do every month or something like that, and so I don't want to pay the tape-out cost every time.

Now, how does an FPGA actually implement it? It emulates the ASIC programming model but in a fixed piece of hardware, so how does that work?

At the base it's got the two components we just talked about: it's got these registers as storage devices, and then it's got these LUTs, lookup tables, which actually provide all of the gates.

And then there's a third component. We have a swarm of these registers and LUTs, all of them available, and they're connected by this big set of muxes. In front of every single one of these we've got one of these muxes, which selects one input from everywhere else, selecting from all of these four things, and we've got a whole bunch of different options feeding into all of these things.

What this allows is that when I program my FPGA, I can take all of these components and superimpose on top of them a particular wiring, which goes through this LUT, then feeds into this LUT, then goes to this register, then feeds into this LUT, something like that.

So what I've drawn in orange is, well, FPGA means field-programmable gate array, and the orange is what has been programmed in the field, whereas the white is all of the wires that must exist in the FPGA in order to make the device in the first place.
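The FPGA building blocks introduced above, lookup tables fed by configurable muxes, can be sketched as follows. This is a minimal illustration under stated assumptions: a 4-input, 1-output LUT (a common size) and a mux choosing among eight nearby single-bit signals. The function names, signal values and select settings are all hypothetical.

```python
# Toy model of an FPGA logic element: muxes pick the inputs, and a lookup
# table (a stored truth table) plays the role of whatever gate was programmed.

def make_lut(func, n_inputs=4):
    """Tabulate func over all 2**n_inputs input combinations."""
    table = []
    for index in range(2 ** n_inputs):
        bits = [(index >> i) & 1 for i in range(n_inputs)]
        table.append(func(*bits) & 1)
    return table  # these entries are the LUT's configuration bits


def lut_eval(table, bits):
    # Read the input bits as a binary number and look up that row.
    index = sum(bit << i for i, bit in enumerate(bits))
    return table[index]


def mux(candidates, select):
    # The select value is configuration: it fixes where an input comes from.
    return candidates[select]


and4 = make_lut(lambda a, b, c, d: a & b & c & d)  # LUT configured as 4-way AND
xor4 = make_lut(lambda a, b, c, d: a ^ b ^ c ^ d)  # same hardware as 4-way XOR

nearby = [1, 0, 1, 1, 0, 1, 1, 1]  # eight nearby single-bit signals
selects = [0, 2, 3, 5]             # one mux setting per LUT input
lut_inputs = [mux(nearby, s) for s in selects]

print(len(and4), sum(and4))        # 16 1
print(lut_eval(and4, lut_inputs))  # 1
print(lut_eval(xor4, lut_inputs))  # 0
```

Programming the device in this model amounts to choosing the table contents and the mux select values; the underlying hardware stays the same.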

What does it mean to me program into a field like programmed in the field so like the device is being deployed in a days of center it's sitting in the field and then you can come and program that field is like electric field no field as in like out there in the world field okay and so if I see look at the how the the field programming comes out of the first look up table and goes in a second one how is it how like where where the wires that may that happen again yeah so I I got a little bit like lazy and drawing a lot of these every single device here has a max sitting in front of it which can select from all of the like nearby like circuits that are available yeah and so the actual configuration of the FPGA is like amounts to it is the max control so like we in this max here we have the data inputs and then we have like the control that selects and so like there's a little storage device sitting next to every single one of these um uh max is saying this is where you're going to source your input from right and so programming it consists of like configuring every single one of these maxes so that makes sense what is happening instead of the look up table yeah so the purpose of the look up table so it's going to also have a little bit of control feeding telling it what to do as well the purpose of the look up table is to function to be able to configureably take the role of an and gate or gate x or any of those different things so there's many ways you could consider doing that the way it is done in sort of traditional FPGA is is to say it will support so it will be a look up table will be it will have four bits of input one bit of output how many different functions are there from four bits to one bit there are 16 different functions and so you can actually just tabulate this as like 16 different like 16 different numbers you go to table of so one one one one one zero zero one 16 entries and so what it does is this this table is stored in this blue configuration 
bit and then it views these four bits as binary looks up the relevant row of the table and emits that bit so this is a truth table view of look up table so essentially okay so the look up table if you think about an and gate or gate nor gate x or gate these are all like take as input those are like two input functions yeah sometimes we have like more complicated like a three input function would be a three way x or right or a four way x or and in this case how many it is depends how big it is but typical size for lots is four input and which is sort of just a sweet spot between um there's another computer communication trade off like here like if it if it has two few inputs then you need to use more lots yeah if it if it is true but basically the look up table is like a truth table it's a truth table yeah and with a truth table you can program in any gate you want that's right and so it's a look up table just think like a programmable gate that's right and so I mean one of the things you can do here is you can see why whether we'll have found that an FPGA is like an order of magnitude more expensive than an ace it comes from um is to count how many gates would be inside this look up table yeah so we can view this look up table essentially as one of these um and so it is a max with has to select between 16 different values and so it is a max with uh sort of n equals 16 options p equals one bits and so what we saw away earlier is that um it this circuit costs like n times p many gates and so it's like um so it costs like n times p equals 16 and gates and also 16 or is this circuit being the max uh yeah exactly the max the max is the core the max it goes into the look up table uh so the look up table itself you can think of as being actually a big max that like selects from all 16 rows down to one output yeah okay that is the look up table but the way you've drawn it here there's like a max and then the look up table it's max is all the way down so I mean there's a 
second max that is inside here this max is is this max guy okay and then the the other max is just saying where it came from in this sort of mess of of of case right and then the second max is okay now you have one value but that value is still um still a four-bit value yeah so I've selected four bits from the soup right um and then and then I use those four bits to select which entry and the look up table yeah I'm gonna use right okay it's just like suppose in the first max there's like eight nearby eight you're you're pulling from eight nearby registrers is input and so that's like a total of like 32 bits going in and then out of that four bits come out those four bits go into the second max which is inside the look up table or so actually I would say in yeah in this case these registers are single bit registers so if there are eight like eight nearby registrers and look up tables then I have eight bits total coming in in nearby I select from eight down to four different values so there's actually like four different maxes one associated with each of these input like little max associated with each of these input bits each of them is selecting one out of eight and one of those they coming from nearby registrers and other lots and each register is one bit yes yeah and so I guess you AMD or whoever makes these um FJs still has to be opinionated about what uh what registrers are connected with registrers and then you can program in the actual gates but they had a wire in the connect like the communication topology right yeah so there's the sort of like you get flexibility and a local green thing there's a sort of nearby neighborhood where you can select from but then more grossly like more costly longer distant connections that you they form independent of right yeah and the reason it's 10x slower is why so if you look at the cost of like building this look up table it's like 32 gates yeah and then it can give me the equivalent of like what's one an interesting 
thing I can do here: I can do a four-way AND gate. So I am using the 32 gates of a lookup table to implement a four-way AND. What is a four-way AND? I would do AND, AND, AND. This is a circuit that I could implement in an ASIC directly using these three AND gates. Using a LUT I can also implement it, but it's going to take these 32 gates instead of three. So the overhead is really coming from the mux in the lookup table: there's a more concise way to describe a truth table than listing out every single possible combination of inputs, which is just to write out the gates.

Yeah, to just place down the gates. That's interesting. One important point made to me is that the reason they prefer FPGAs to CPUs is that they get deterministic clock cycles: they know when a packet will come in and go out. Why is that not a guarantee in CPUs?

So you can actually design a CPU that has deterministic latency as well. In fact, the processors inside a lot of AI chips also have deterministic latency, and GPUs have that in the core as well. The challenge is getting deterministic latency and high speed at the same time. So where does the non-determinism in latency come from? Non-deterministic latency comes from specific design choices in a CPU. It's actually possible to remove those design choices and make a CPU that has deterministic latency. Those are not very attractive in the market, so people don't make those CPUs anymore. But in some sense deterministic latency is the simpler design starting point, and then chip designers have added things that make it non-deterministic.

To take a concrete example, probably the most important one on a CPU is the CPU cache itself. In a CPU you have the CPU die itself, and then there is memory off on the side, the DDR memory. Then you have a cache system inside the die that remembers recent accesses to DDR and stores them. When I'm running through my CPU instructions, every time I have an instruction that accesses memory, it first checks the cache: was the data stored in cache? If not, it fetches from DDR. This is a huge optimization. The cache is about two orders of magnitude faster than the DDR. If you never used the cache, basically all programs would run a hundred times slower. So the presence of a cache is absolutely necessary for a CPU to run at reasonable speed. But whether or not you get a cache hit depends on the ambient environment of the CPU: what other programs are running, what has run recently, what the random number generator inside the cache system is doing. So that is a big source of non-determinism in the runtime of a CPU.

So this is the memory system for a CPU. The big thing you can do differently is this: instead of having an instruction say "read memory" and then the hardware decides whether or not it comes from cache, you can bake this decision into software. That is a different design philosophy, and you see it in, for example, TPUs. I'll draw the same diagram, but I'll call it a scratchpad. So this would be a TPU, with HBM in this case rather than DDR, but it's still an off-chip memory. Instead of the software saying "access memory" and then the hardware deciding, you've got some instructions that go here, this is one kind of instruction, and then a totally different kind of instruction that goes to HBM. This style is generically known as scratchpad instead of cache, the key distinction being that you have one kind of instruction that says read or write scratchpad, and a totally different instruction that says read or write HBM.

So the scratchpad being the cache?

Yeah, this thing here is the scratchpad.

Stepping way back: people say computers have the so-called von Neumann architecture, where there's this serial processing of operations. Maybe it's just because we've been talking about parallel accelerators, but the FPGA is super parallel, the kind of accelerator that a TPU is is super parallel, and even CPUs are super parallel if you think about all the cores they have. So in what sense is modern hardware actually the von Neumann architecture? Is it a fair way to describe modern hardware?

I think it's a fair way to describe CPUs, just given the amount of parallelism. On a CPU the amount of parallelism you've got is about a hundred cores times maybe 16-way vector units, so 1,600, about a thousand-way parallelism on a CPU.

One question: there is a die being used for the CPU. If there are fewer threads, then just as a matter of transistor voltages switching on and off, is it that there's literally one control flow, a small part of the die where voltages are switching on and off? How do you actually occupy the die area of a CPU if there are so few cores? What is taking up all that area in there?

Yeah, the cores are just much bigger and more complicated. I guess we should compare a CPU core, which takes up one one-hundredth of the die, to a LUT. A LUT is only these 16 gates, right? So it's clear why there are so many more LUTs in an FPGA than cores in a CPU. But then the question of why there are more CUDA cores, for example, than CPU cores, what the difference is between a CPU and a GPU, that would be a big difference. Inside the CPU, the top uses of area are the cache; very little is actually the ALUs, mostly it's the register files rather than the logic units. Both of these things have equivalents in a GPU, so that's not a big difference. The thing that does not have any equivalent in a GPU is the branch predictor. There is a whole big area in the CPU which is just a whole bunch of predictors that are saying: when will my next branch be, and where is the branch target for that? Stripping a lot of that out, as well as making the register files tighter in a sense, is driving a lot of where the GPU gains come from.

So what is the purpose of the branch predictor? To execute both branches at once, or what does it do?
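The cache versus scratchpad distinction described above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how any real CPU or TPU is implemented: the class names, the method names, the LRU replacement policy, and the 1-cycle and 100-cycle latencies are all hypothetical (the conversation only says the cache is roughly two orders of magnitude faster than DDR). The point it illustrates is that with a cache the same load instruction can take a different amount of time depending on what ran earlier, while with a scratchpad the software chooses between two instructions whose latencies are each fixed.

```python
from collections import OrderedDict

# Hypothetical latencies in cycles, chosen only to be ~100x apart.
FAST_CYCLES = 1
SLOW_CYCLES = 100


class CachedMemory:
    """Hardware-managed cache: one load instruction, latency depends on history."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()  # cached addresses, least recently used first

    def load(self, addr):
        if addr in self.lines:
            self.lines.move_to_end(addr)  # mark as most recently used
            return FAST_CYCLES
        self.lines[addr] = True
        if len(self.lines) > self.capacity:
            self.lines.popitem(last=False)  # evict the least recently used line
        return SLOW_CYCLES


class ScratchpadMemory:
    """Software-managed scratchpad: two distinct instructions, each with fixed latency."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.resident = set()

    def copy_from_hbm(self, addr):
        # Explicit "read off-chip memory" instruction: always the slow path.
        if addr not in self.resident and len(self.resident) >= self.capacity:
            raise MemoryError("scratchpad full; software must free space first")
        self.resident.add(addr)
        return SLOW_CYCLES

    def load_scratchpad(self, addr):
        # Explicit "read scratchpad" instruction: always the fast path.
        if addr not in self.resident:
            raise KeyError("address was never copied into the scratchpad")
        return FAST_CYCLES


cache = CachedMemory(capacity=2)
first = cache.load(0)    # miss: 100 cycles
second = cache.load(0)   # hit: 1 cycle
cache.load(1)
cache.load(2)            # other traffic evicts address 0
third = cache.load(0)    # same instruction, now 100 cycles again
```

In the scratchpad version there is no equivalent of the third load changing speed: once software has issued `copy_from_hbm(0)`, every `load_scratchpad(0)` costs the same, which is the deterministic-latency property discussed in the conversation.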

So the issue is that when I've got a series of instructions, instruction, instruction, instruction, instruction, and one of them here is a branch, the actual step of processing an instruction takes a really long time, maybe five nanoseconds or something like that. The time to notice that I've got a branch, evaluate the Boolean for whether it's true, update the program counter to the new target, and then read from the instruction memory for that could take five nanoseconds to finish. So in reality this may finish some way down here. But I want to run at a clock speed that is much faster than what five nanoseconds allows: five nanoseconds is a 200 megahertz clock speed, and I'd like to run at one or two gigahertz or something like that. So I need to run other instructions while the branch is being evaluated. I really just want to keep running the following instructions that come after it. But that might have been wrong: if the branch ended up being taken, then instead of evaluating these instructions I actually need to jump to where the target is and run those instructions instead. So the purpose of the branch predictor is genuinely to predict, before I even get to the instruction, maybe five cycles earlier, that there is going to be a branch.

So if I think about how the brain works versus what you're describing here, at a high level the differences might be these. While you can do structured sparsity in these accelerators and save yourself some area that you would otherwise have dedicated to these gates, in the brain there's unstructured sparsity: any neuron can connect to any other neuron, and not in ways where they'd be column-aligned or whatever. Then there's the fact that memory and compute are co-located. I guess you could say that in a way the memory and compute are co-located here too; this is exactly co-location in some sense. So maybe that actually isn't a big difference. The other big difference is that the clock cycle in the brain is much slower than in computers, and partly that's to preserve energy, because the faster the clock cycle, the bigger the voltage needs to be in order for the signal to settle and to identify what state the transistor is in. I don't know if you have other high-level commentary on what the brain might be doing versus how these chips work.

Yeah, so let's take the clock speed one first. The clock speed is quite high on a chip because that drives the highest throughput. When we compare, a GPU running some workload is running batch size a thousand or something like that, whereas the brain is not running batch size a thousand, because there's only one of me. So you could imagine saying: take a GPU and, instead of running it at a gigahertz, run it at a megahertz or something like that, and that would start to look a little bit more like the equivalent of what you're talking about in the brain. In the way that silicon works, that does not give you a 1000x advantage in energy efficiency. What it ends up looking like is that you run the circuit once to stabilization and then it sits idle for a long period of time. It doesn't consume a lot of energy while it's sitting idle, because most of the energy is consumed in toggling bits from zero to one and back.

So actually, let's talk about the energy consumption of a circuit like this. The way to think of a bit being stored is that you've deposited some charge in a capacitor sitting somewhere in the chip implicitly. It becomes charged when the bit becomes a one, and it becomes discharged when it next goes to a zero. That cycle of charging the capacitor and then dumping that charge out to ground is where the energy is consumed. This is called dynamic or switching power, and it is most of the energy consumption of a chip. There is some other energy consumption coming from the fact that insulators aren't perfect insulators, but we'll discount that. Most of the energy consumption comes from the charging and discharging, toggling from zero to one and back to zero. So if you run a chip much slower, and you only clock it once every thousand clock cycles or something, you will have a thousand times fewer transitions and about a thousand times less energy consumption in the same amount of time, but not a substantial advantage in energy efficiency.

Okay, so you've described how a TPU works at a high level. What is the difference, at a high level, between how a GPU and a TPU work?

Yeah, I think there's a high-level organizing principle that is different, and then there's what is different inside the cores, but we'll look at the high level. So we'll take a GPU and a TPU: what does the top-level structure look like? Think of this as the whole chip in each case. The organization of the GPU is mostly a bunch of almost identical units, these are the SMs, then an L2 memory in the middle, and then a bunch more of these SMs on the bottom. So there's this fairly regular grid of cores. If we look at a TPU in comparison, you end up with much coarser-grained units of logic. You end up with maybe just a few MXUs, matrix units, these are the big systolic arrays, then in the middle you've got some vector unit, and then you've got your matrix units at the bottom. So matrix units with a vector unit in the middle: this is the whole TPU chip. You can think of scaling this thing down into a really tiny unit, with smaller matrix units and a smaller vector unit, and that is sort of what an SM is. So from a very high-level point of view, the GPU has a lot of tiny TPUs tiled across the whole chip.
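The switching-energy argument made earlier in this exchange can be put into numbers with a small sketch. It assumes the standard model in which one full charge and discharge of a capacitance C at voltage V dissipates C times V squared. The function names, the capacitance, the voltage, and the toggles-per-operation figure are hypothetical values chosen only for illustration, the voltage is held fixed across clock speeds, and leakage through imperfect insulators is ignored, as in the conversation.

```python
def switching_energy_j(full_toggles, capacitance_f, voltage_v):
    """Energy for a number of 0 -> 1 -> 0 cycles: C * V^2 per full cycle."""
    return full_toggles * capacitance_f * voltage_v ** 2


def run_workload(num_ops, clock_hz, toggles_per_op=1_000,
                 capacitance_f=1e-15, voltage_v=0.8):
    """Toy model: one operation per clock, a fixed number of toggles per operation."""
    energy_per_op_j = switching_energy_j(toggles_per_op, capacitance_f, voltage_v)
    total_energy_j = num_ops * energy_per_op_j
    runtime_s = num_ops / clock_hz
    return {
        "runtime_s": runtime_s,
        "avg_power_w": total_energy_j / runtime_s,
        "energy_per_op_j": energy_per_op_j,
    }


fast = run_workload(num_ops=1_000_000, clock_hz=1e9)  # 1 GHz
slow = run_workload(num_ops=1_000_000, clock_hz=1e6)  # 1 MHz

power_ratio = fast["avg_power_w"] / slow["avg_power_w"]      # about 1000
runtime_ratio = slow["runtime_s"] / fast["runtime_s"]        # about 1000
same_energy_per_op = fast["energy_per_op_j"] == slow["energy_per_op_j"]  # True
```

Under these assumptions the slow chip draws about a thousand times less power, but it also takes about a thousand times longer to do the same work, so the energy spent per operation is unchanged. Any real efficiency gain from slowing down would have to come from something the model deliberately leaves out, such as lowering the voltage.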

So you're suggesting the tensor core within a streaming multiprocessor is analogous to an MXU?

Yeah, it's very similar.

I see. So if you had more of a lack of structure, having a bunch of tiny TPUs makes a lot of sense, whereas if you just have huge matrix multiplications, you ask: why don't we avoid the cost of having the individual SMs with their own registers and warp schedulers and things like that? Why don't we just make a huge thing and amortize those costs across the whole thing?

And I think this shows up in how large you can grow things. We've seen this theme, especially with a systolic array, where a larger systolic array amortizes the register file costs better. This sort of design allows you to have larger systolic arrays, whereas the GPU design constrains you to having small units of everything. There's a trade-off, however. Because of this coarse-grained separation of things, you need to move a lot of data from the vector unit to the matrix units, and so you need to move a lot of data through just these two lines of perimeter here. Whereas if you look at the equivalent thing here, you've got vector units everywhere, and you move data through this line, through this line, through this line, and so on. So the amount of data you can move between a vector unit and a matrix unit is actually much higher in a GPU than in a TPU, because instead of having to move all the data through just these two lines, you're moving all the data through 16 lines or something of wiring in a GPU.

Right. But you might also have to move across less area, which is also a saving in an energy sense.

So if you can operate entirely within an SM, the data movement is much smaller, but the moment you want to operate across SMs it becomes more complicated and expensive.

You don't have to comment, but one might expect that a thing MatX might try to do is to get the GPU-like smaller structure of systolic arrays surrounded by SRAM, but at the same time make it so that the things you need in an SM to support the CUDA architecture, which take up a bunch of space, you might discard.

Yeah. We've talked publicly about something which we call a splittable systolic array, which in some sense you can think of as big systolic arrays that can be small systolic arrays as well.

Okay, I think that's a good note to close on. Reiner, thank you so much.

Thanks for having me here again.
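The amortization point in this closing exchange, that a larger systolic array spreads register file costs over more compute, can be illustrated with a rough count. This is a sketch under simplifying assumptions, not a description of any specific chip: it assumes a square n by n array with the weights held in place, n input values entering and n partial sums leaving per cycle, and it ignores the cost of loading weights. The function name is hypothetical.

```python
def macs_per_operand_moved(n):
    """Multiply-accumulates per value moved to or from the register file, per cycle."""
    macs_per_cycle = n * n        # every cell in the n x n array does one MAC
    operands_per_cycle = 2 * n    # n inputs stream in, n partial sums stream out
    return macs_per_cycle / operands_per_cycle


for size in (8, 32, 128):
    print(size, macs_per_operand_moved(size))
```

Under these assumptions the ratio grows linearly with the side length of the array (it works out to n / 2), which is one way to see why a design that allows a few very large matrix units pays less register file traffic per unit of compute than one built from many small units.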