
Latent Space · 2026-03-10
NVIDIA on Agent Inference at Scale and Developer Experience
Hosts: Vibhu
Guests: Kyle Kranen, Nader Khalil
Why it matters
NVIDIA's Dynamo is a data center scale inference engine optimizing transformer model serving by separating pre-fill and decode phases and enabling scale-out with Kubernetes.
Key claims
- NVIDIA's Dynamo is a data center scale inference engine optimizing transformer model serving by separating pre-fill and decode phases and enabling scale-out with Kubernetes.
- Pre-fill is typically compute-bound while decode is memory-bound; separating them allows specialized resource allocation and better scheduling.
- Brev is a developer tool simplifying GPU access with a user-friendly interface and CLI, now integrated with NVIDIA post-acquisition.
- AI agents at NVIDIA can access files, internet, and execute code but are restricted to only two of these capabilities simultaneously for security.
Briefing memo
Summary
This episode of Latent Space features Kyle Kranen and Nader Khalil from NVIDIA discussing the challenges and innovations in scaling AI inference at data center scale, particularly through their Dynamo inference engine. They highlight the importance of balancing compute and memory demands in pre-fill and decode phases of transformer models, and how Dynamo enables efficient scaling out with Kubernetes integration. The conversation also covers NVIDIA's focus on improving developer experience by simplifying GPU access via tools like Brev and the growing role of coding agents that interact with terminals and APIs to automate workflows.
Security considerations around AI agents are emphasized, especially when agents have capabilities to access files, the internet, and execute code. NVIDIA is actively working on enforcement points to safely enable these powerful agents. The guests also discuss the evolving landscape of AI models, including trends in long context lengths, sparse architectures, and the future of multi-agent systems where agents coordinate sub-agents for complex tasks. Finally, they share insights on NVIDIA's culture of passion-driven innovation, zero-billion-dollar market investments, and upcoming GTC sessions focused on Dynamo and AI agents.
- Coding agents that interact with terminals and APIs are proving highly effective for automating developer workflows.
- NVIDIA invests in zero-billion-dollar markets and encourages passion-driven projects, enabling innovation beyond immediate revenue.
- Long context lengths and sparse model architectures are active research areas, with hybrid models and new attention mechanisms pushing boundaries.
- Future AI systems will likely be composed of multiple cooperating agents and sub-agents, improving modularity and efficiency.
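The compute-bound versus memory-bound claim in this memo can be illustrated with a rough roofline estimate. This is a minimal sketch, not how Dynamo schedules work: the hardware figures (PEAK_FLOPS, PEAK_BW), the model size, and the function names are illustrative assumptions, and the estimate ignores KV-cache traffic and attention FLOPs.

```python
# Rough roofline estimate for one forward pass of a dense transformer.
# Every number here is an illustrative assumption, not a measured value.

PEAK_FLOPS = 1.0e15    # assumed accelerator peak, FLOP/s
PEAK_BW = 3.0e12       # assumed memory bandwidth, bytes/s
PARAMS = 70e9          # assumed parameter count
BYTES_PER_PARAM = 2    # 16-bit weights


def arithmetic_intensity(tokens_per_pass: int) -> float:
    """FLOPs performed per byte of weights read in one pass."""
    flops = 2 * PARAMS * tokens_per_pass     # ~2 FLOPs per parameter per token
    bytes_moved = PARAMS * BYTES_PER_PARAM   # weights are read once per pass
    return flops / bytes_moved


def bound(tokens_per_pass: int) -> str:
    """Classify a pass against the ridge point of the assumed hardware."""
    ridge = PEAK_FLOPS / PEAK_BW             # FLOPs per byte where the roofline bends
    if arithmetic_intensity(tokens_per_pass) > ridge:
        return "compute-bound"
    return "memory-bound"


if __name__ == "__main__":
    print("prefill, 4096-token prompt:", bound(4096))
    print("decode, 1 token per step:  ", bound(1))
```

Under these assumptions a long prompt processed in one pass lands above the ridge point, while single-token decode lands far below it, which is the motivation for giving the two phases different resources.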
Source material
Transcript
Agents can do three things.
They can access your files, they can access the internet, and then now they can write custom code and execute it.
You literally only let an agent do two of those three things.
If it can access your files and write custom code, you don't want it to have internet access, because that's where you'd see the vulnerability, right?
If it has access to the internet and your file system, you should know the full scope of what that agent's capable of doing; otherwise, now it can get prompt-injected or something like that can happen.
And so, that's a lot of what we've been thinking about is like, you know, how do we both enable this because it's clearly the future, but then also, you know, what are these enforcement points that we can start to like, protect?
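The "two of those three things" rule described here can be written down as a small policy check. This is a minimal sketch of the idea as stated in the conversation, not NVIDIA's actual enforcement mechanism; the names Capability, MAX_CAPABILITIES, is_allowed, and check_grant are hypothetical.

```python
from enum import Enum


class Capability(Enum):
    FILES = "files"          # read/write the local file system
    INTERNET = "internet"    # make outbound network requests
    CODE_EXEC = "code_exec"  # write and execute custom code


# Assumption taken from the conversation: at most two of the three at once.
MAX_CAPABILITIES = 2


def is_allowed(requested) -> bool:
    """Return True if the requested capability set respects the limit."""
    return len(set(requested)) <= MAX_CAPABILITIES


def check_grant(requested) -> None:
    """Raise if an agent asks for more capabilities than the policy allows."""
    if not is_allowed(requested):
        names = sorted(c.value for c in set(requested))
        raise PermissionError(f"refusing to grant all of: {names}")


if __name__ == "__main__":
    check_grant({Capability.FILES, Capability.CODE_EXEC})   # passes
    try:
        check_grant(set(Capability))                        # all three
    except PermissionError as err:
        print(err)
```

A real enforcement point would sit outside the agent process (for example at the sandbox or network boundary) rather than trusting the agent to call a check like this itself.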
All right, welcome to the Latent Space podcast in the Chromey Studio, and welcome to all the guests here.
We're back with our guest host, Vibhu. Welcome, good to have you back.
And our friends Nader and Kyle from NVIDIA, welcome.
Yeah, thanks for having us.
Yeah, thank you.
Actually, I don't even know your titles. I know you're, like, architect something of Dynamo.
Yeah, I'm one of the engineering leaders and architects of Dynamo.
And you're director of something, developers... yeah, you're the "developers, developers, developers" guy at NVIDIA.
The source agent marketing, Brev, yeah, dev tools and stuff.
Yeah, I'm gonna focus.
And we're sort of recording this ahead of NVIDIA GTC, which is coming to town again, or taking over town, which we'll all be at, and we'll talk a little bit about your sessions and stuff.
Yeah, yeah, we're super excited for it.
And one of my favorite memories of Nader: you always do marketing stunts.
And while you were at Brev, you had this surfboard that you went down to GTC with.
And NVIDIA apparently liked it so much that they bought you.
Like, oh, what was that, like, what was that?
Yeah, yeah, our logo was a shaka. We were always just trying to keep true to who we were.
I think, you know, sometimes as a startup, you're trying to pretend that you're a bigger, more mature company than you are.
And it was actually Evan Conrad, of SF Compute, who was just like, yeah, guys...
Yeah, it was amazing.
Yeah, yeah, he was just like, guys, you're two dudes in a room, why are you pretending that you're not?
And so then we were like, okay, let's make the logo a shaka.
We brought surfboards to our booth at GTC and the energy was great.
Some palm trees too.
They actually poked out over the walls, so you could see the Brev booth.
And no one else's, just from very far away.
Oh, so you remember it back then?
Yeah, I remember it.
Pre-acquisition.
I was like, oh, that was good.
That makes sense because we, so we signed up really last minute.
And so we had the last booth.
It was all the way in the corner.
And so I was worried that no one was going to come.
So that's why we had like the palm trees.
We really came in with the surfboards.
We even had one of our investors bring her dog.
And then she was just walking the dog around to try to bring energy towards our booth.
Yeah.
Yeah, she's the best.
Yeah, as a conference organizer, I love that, right?
Like, everyone who sponsors the conference comes, does their booth, and they're like, "we are changing the future of AI," and things like that, generic bullshit.
And like, no, like, actually try to stand out and make it fun, right?
And people still remember it after three years.
Yeah, yeah.
Yeah.
You know what?
I'll give you this clip if you want to add it in.
But my wife, at that time my fiancée, was in medical school.
And she came to help us because this was a big moment for us.
And so we bought this Cricut.
It's like a vinyl printer, because how else are we going to label the surfboard?
So we got a surfboard.
Luckily, we were able to purchase that on the company card.
We got a Cricut.
And it was just like "fine-tuning for enterprise" or something like that that we put on the surfboard.
And it's 1 a.m. the day before we go to GTC, and she's helping me put these vinyl stickers on.
And she goes, you son of a, she's like, if you pull this off, you son of a bitch.
So, right, pretty much after the acquisition, I stitched that with the music acquisition.
I sent it to our family group chat.
Oh, oh.
Hey.
No.
Well, she, she made a good choice there.
Was that like basically the origin story for launchables?
Is that... maybe we should explain what Brev is?
Yeah.
I mean, Brev is just a developer tool that makes it really easy to get a GPU.
So, we connect a bunch of different GPU sources.
So the basics of it is like how quickly can we SSH you into a GPU?
And whenever we talk to users, they wanted a GPU.
They wanted an A100.
And if you go to any cloud provisioning page, usually it's three pages of forms, or somewhere in the form there's a dropdown, and in the dropdown there's some weird code that you have to know translates to an A100.
And I remember just thinking, every time someone says they want an A100, the piece of text that they're telling me they want is stuffed away in the corner.
And so we were like, what if the biggest piece of text was what the user's asking for?
And so when you go to Brev, it's just big GPU chips with the type that you know.
Beautiful animations.
And this was pre-... like, now you can just prompt it.
But back in the day, it was our handcrafted artisanal code.
Yeah, I was actually really proud of that, because I made it in Figma.
Yeah.
And then I was really struggling to figure out how to turn it from Figma into React.
So what it actually is is just an SVG.
And I have all the styles.
And so when you change the chip, whether it's like active or not, it changes the SVG.
And that somehow renders so it looks like it's animating, but it just has a slow transition.
But it's just a JavaScript function to change the underlying SVG.
Yeah.
And that was how I ended up like figuring out how to move it from Figma.
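A rough illustration of the approach described above: one SVG whose style depends on whether the chip is active, with a slow CSS transition so the style swap looks animated. The original was a JavaScript function in a React app; this sketch uses Python only to generate the markup, and the colors, sizes, and the names STYLES and render_chip are hypothetical.

```python
# Hypothetical styles for the two states; the values are illustrative only.
STYLES = {
    True: {"fill": "#76b900", "opacity": "1.0"},   # active chip
    False: {"fill": "#444444", "opacity": "0.5"},  # inactive chip
}


def render_chip(label: str, active: bool) -> str:
    """Return SVG markup for a chip in the given state.

    The transition declaration is what makes a style change look animated
    when a browser updates the same element in place.
    """
    style = STYLES[active]
    return (
        '<svg xmlns="http://www.w3.org/2000/svg" width="120" height="80">'
        '<rect width="120" height="80" rx="8" '
        f'style="fill:{style["fill"]};opacity:{style["opacity"]};'
        'transition:fill 0.4s ease, opacity 0.4s ease"/>'
        f'<text x="60" y="45" text-anchor="middle">{label}</text>'
        '</svg>'
    )


if __name__ == "__main__":
    print(render_chip("A100", active=True))
    print(render_chip("A100", active=False))
```

The point of the trick is that no animation library is involved: only the style values differ between the two states, and the browser interpolates between them.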
But yeah, that's artisanal. Speaking of marketing stunts, though,
he actually used those SVGs, or kind of used those SVGs, to make these cards.
Oh, yeah.
Like a GPU gift card.
Yes.
He handed out everywhere.
That was actually my first impression of that.
Yeah.
I think I still have one of them.
Yeah.
They look great.
Yeah.
I still have a ton of them in our garage, actually, but they just don't have labels.
We should honestly bring them back. But I found this old printing press here, actually just around the corner on Van Ness.
And it's a third generation San Francisco shop.
And so I come in as an excited startup founder, and they just have this crazy old machinery.
I'm in awe because the whole building is so physical.
Like you're seeing these machines.
They have like pedals to like move these saws and whatever.
I don't know what the machinery is.
But I saw all the generations.
Like there's like the grandpa, the father and the son, the son was like around my age.
It's like a Holy Trinity.
Yeah.
Yeah.
Yeah.
So I just took the same SVG and we just like printed it and it's foil printing.
So they make a mold that's like an inverse of the A100.
And then they put the foil on it and then they press it into the paper.
And I remember once we got them, he was like, hey, don't forget about us.
You know, I guess like early Apple and Cisco's first business cards were all made there.
And so he was like, yeah, we get like the startup businesses.
But then as they mature, they kind of go somewhere else.
So I actually, I think we're talking with marketing about like using them.
Should we go back and make the purpose?
Yeah.
Yeah.
I remember, you know, as a very, very small Brev investor, I was like, why are we spending time doing these stunts for GPUs?
Like, I think like as a, you know, typical like cloud hardware person, you go into AWS, you pick like T5, XXL, whatever, just like from a list and you look at the specs.
Like why animate this GPU?
And I do think like it just shows the level of care that goes throughout the earth.
Yeah.
And also Daniel.
And at NVIDIA, I think the thing that struck me most when we first came in was the amount of passion that everyone has.
Like, you talk to Kyle, you talk to... every VP that I've met at NVIDIA goes so close to the metal.
Like I remember it was almost a year ago.
And my VP asked me, he's like, hey, what's Cursor, and are you using it?
And if so why?
And I was just surprised at this, and he downloaded Cursor and was asking me to help him use it.
And I thought that was... or just to show him, you know, why we were using it.
And so the amount of care that I think everyone has and the passion and appreciation for the moment, right?
This is a very unique time.
So it's really cool to see everyone really like appreciate that.
Yeah.
One thing I wanted to do before we move over to research topics and the stuff that Kyle's working on is just tell the story of the acquisition, right?
Not many people have been through an acquisition with NVIDIA. What's it like? Just anything you'd like to say.
It's a crazy experience.
I think the thing that was the most exciting for us was that our goal was just to make it easier for developers.
We wanted to find access to GPUs and make it easier to do that.
And then, oh, actually, your question about Launchables.
So Launchables was just: make one-click deploys for any software on top of the GPU.
And so what we really liked about NVIDIA was that it felt like we just got a lot more resources to do all of that.
I think, you know, NVIDIA's goal is to make things as easy for developers as possible.
So there was a really nice synergy there.
I think, when it comes to an acquisition, how much the souls of the products align is going to speak to the success of the acquisition.
Yeah.
So in many ways, feels like we're home.
This is a really great outcome for us.
Like, you know, I love brev.nvidia.com. You should use it.
It's a front page for GPUs.
Yeah.
You get your views.
You go there again.
And internally it's growing very quickly.
I don't remember.
You said some stats there.
Yeah, yeah.
I wish I had the exact numbers, but like internally externally, it's been growing really quickly.
We've been working with a bunch of partners with a bunch of different customers and ISVs.
If you have a solution that runs on the GPU and you want people to use it quickly, we can bundle it up in a Launchable and make it a one-click run.
If you're doing things and you just want a sandbox or something to run on, right?
Like OpenClaw: huge moment, super exciting.
And we'll get into it more, but, you know, internally, people want to run this, and we have to be really careful about the security implications.
We can't let this run on the corporate network.
Security's guidance was, hey, run this on brev.
It's in, you know, it's a VM.
It's sitting in the cloud.
It's off the corporate network.
It's isolated.
And so that's been our stance internally and externally about how to even run something like OpenClaw.
We figure out how to run these things securely.
But yeah, I think you were also almost the right team at the right time, when NVIDIA is starting to invest a lot more in developer experience, or whatever you call it, UX, or I don't know what you call it, software.
Like, obviously NVIDIA's always investing in software, but this is a different audience.
It's a wider developer base.
Yeah.
Right.
Yeah.
Yeah.
It's funny.
It's like, it's not.
So what is it called internally?
What is this?
The people should be aware that it's going on there.
Yeah.
Yeah.
Yeah.
I think strategy here.
Nvidia always wants to make a good developer experience.
The thing is, a lot of the technology's just really complicated.
It's not.
I think AI is having a huge moment, not because, let's say, data scientists in 2018 were quiet then and are much louder now.
It's that there's a whole bunch of new audiences.
My mom's wondering what she's doing.
My sister's learned like taught herself how to code like the, you know, I actually think just generally AI is a big equalizer and you're seeing a more like technologically literate society.
I guess like everyone's learning how to code there isn't really an excuse for that.
And so building a good UX means that you really understand who your end user is.
And when your end user becomes such a wide variety of people, then you have to almost reinvent the practice.
Right.
You have to actually build more developer UX, right?
Because there are tiers of developer base that were added, you know, the hackers that are building on top of OpenClaw, right?
For example, they have never used a GPU, they don't know what CUDA is, they just want to run something.
Yeah.
You need new UX that is not just, hey, you know, how do you program something you can code and run it?
And then, you know, when deep learning was getting big, we built torch.
But recently, the number of layers that are added to that developer stack has just exploded because, yeah, it's become ubiquitous; everyone's using it in different ways.
Yeah.
It's moving fast in every direction.
Vertical horizontal.
Yeah.
You even take it down to hardware.
Like the DGX Spark, you know, it's basically the same system as the big GPU clusters.
Yeah.
Yeah.
It's Grace Blackwell.
Yeah.
We had a preview at last year's GTC, and that was one of the better-performing videos of our NVIDIA coverage so far.
Awesome.
This will be it.
That was like so far.
Yeah.
Even when Grace Blackwell, or when the DGX Spark, was first coming out, getting to be involved in that from the beginning of the developer experience, and it just comes back.
You were involved?
Yeah.
Yeah.
Yeah.
Yeah.
I mean, it was just like, I got an email.
We just got thrown into the loop, and suddenly... it was actually really funny, because I'm still pretty fresh from the acquisition and I'm getting an email from a bunch of the engineering VPs about the new hardware, the GPU chip... well, not chip, but GPU system that we're putting out.
And I'm like, OK, cool, that is now involved with this for the UX and like, what am I going to do here?
So I remember the first meeting I was just like kind of quiet as I was hearing engineering VPs talk about what this box could be, what it could do, how we should use it.
And I remember one of the first ideas that people were writing was, I think a quote was, "the first thing someone's going to want to do with this is get two of them and run a Kubernetes cluster on top of them," and I was like, oh, I think I know why I'm here.
I was like, the first thing we're doing is easy SSH into the machine. And then, you know, just kind of scoping it down: once you can do that, the person who wants to run a Kubernetes cluster on two Sparks has a higher propensity for pain than, you know, someone who buys it and wants to run OpenClaw right now, right?
If you can make sure that that's as effortless as possible, then the rest becomes easy.
So there's a tool called NVIDIA Sync.
It just makes the SSH connection really simple.
So, you know, if you think about it like if you have a Mac or PC or whatever, if you have a laptop and you buy this GPU and you want to use it, you should be able to use it like it's a GPU in the cloud, right?
But there's all this friction of, like, how do you actually get into that. Part of Brev's value proposition is just, you know, there's a CLI that wraps SSH and makes it simple.
And so our goal is just get you into that machine really easily.
And one thing we just launched at CES, it's still in early access, we're ironing out some kinks, but it should be ready by GTC.
You can register your Spark on Brev.
And so now it's, like, remotely managed local... because Brev can already manage other clouds anyway, right?
And you just add your Spark on Brev as well, right?
Yeah, but yeah, exactly.
So you set it up at home, you run the command on it, and then it'll essentially appear in your Brev account.
And then you can take your laptop to a Starbucks or to a cafe and continue to use your Spark just like any other cloud node on Brev.
Yeah, it's just like a pre-provisioned, so you'll be able to do this on your home.
Yeah, exactly.
Yeah, yeah, a tiny little data center.
It's time to go.
It's time to go.
Yeah.
One more thing before we move on to Kyle: you just have so many Jensen stories, and I just love mining Jensen stories.
My favorite so far is SOL.
Oh, yeah.
Oh, yeah.
What is SOL?
SOL is actually, I think, of all the lessons I've learned, definitely my favorite.
It'll always stick with you.
Yeah, yeah.
You know, in your startup, everything's existential, right?
Like we've run out of money.
We were on the risk of losing payroll.
We've had to contract our team because we ran out of money.
And so like because of that, you're really always forcing yourself to, to like understand the root cause of everything.
If you get a date, if you get a timeline, you know exactly why that date or timeline is there.
You're pushing every boundary and like you're not just accepting like a no just because.
And so as you start to introduce more layers, as you start to become a much larger organization, SOL is essentially like, what is the physics, right?
Light moves at a certain speed.
So if light's moving slower, then you know something's in the way.
So before trying to like layer reality back in of like, why can't this be delivered at some date, let's just understand the physics.
What is the theoretical limit to like, how fast this can go?
And then start to tell me why, because otherwise people will start telling you why something can't be done.
But actually, I think any great leader's goal is to create urgency.
There's an infinite impelling events.
Yeah.
Yeah.
All of the terminology at NVIDIA is used to instigate a compelling event. You say, this is done.
How do we get there?
What is the minimum, as-much-as-necessary, as-little-as-possible thing that it takes for us to get exactly here?
And it helps you just break through a bunch of noise.
Yeah.
The one thing I'm unclear about is can only Jensen use the SOL card, like, you know, get the bullshit out.
Because obviously it's Jensen, but like, can someone else be like, no, like, front line engine, you're shooting.
Okay.
It's not so much about, like, get the bullshit out.
It's like, it's like, give me the root understanding, right?
Like, if you tell me something takes three weeks, it's like, yeah, first principles: why is it three weeks?
What's the actual limit? Why is it going to take three weeks?
Let's say you wanted to buy a new computer and someone told you it's going to be here in five days. What's the SOL?
Well, the SOL is, I could walk into a Best Buy and pick it up for you, right?
So then anything that's beyond that... and is that practical?
Is that how we're going to, let's say, give everyone in the company a laptop? Obviously not.
So that's the SOL, and then it's like, okay, well, if we have to get more than 10, suddenly there might be some..., right?
And so now we can kind of piece the reality back.
So this is the Paul Graham "do things that don't scale."
Yeah.
And this is also what people would now call being high agency.
Yeah.
Yeah.
Yeah.
It's actually really interesting because there's a, there's a second hardware angle to SOL that, like, doesn't come up for all of the orgs, so SOL is used, like, culturally, to be there for everything.
I'm also in my name for, like, I think that can be annoying sometimes.
And, like, someone keeps going, SOL as I saw at you, and you're like, guys, like, we have to be stable.
We have to, we have to, we have to fucking play it.
Yeah.
It's an interesting balance.
Yeah.
I encounter that with, like, actually, just with, with Alec, right?
Because we, we have a new conference, so we need to launch, we have goals of what we want to launch, but, uh, by the conference, and like, yeah, at the end of the day, this is GTC system.
Well, this is, like... so, I mean, we did it for CES, we did it for GTC before that.
We're doing it for GTC, and so, I mean, every time we have a new moment, we want to launch something, and we want to do so at SOL.
And that does mean that some, there's some level of prioritization that needs to happen.
And so it is difficult, right?
I think you have to be careful with what you're pushing. You know, stability is important.
And that should be factored into SOL. It isn't just, like, build everything and let it break.
You know, that, that's part of the conversation.
So, as you're layering in all the details, one of them might be, hey, we could build this, but then it's not going to be stable for XYZ reasons.
And so that was one of our conversations for CES: we can get registering your Spark with Brev into early access.
But there are a lot of things that we need to do in order to feel really comfortable from a security perspective, right?
There's a lot of networking involved before we deliver that to users.
So it's like, okay, let's get this to a point where we can at least let people experiment with it.
We had it in a booth.
We had it in Jensen's keynote.
And then let's go iron out all the networking kinks.
And that's not easy.
And so, uh, that can come later.
And so that was the way that we layered that back in.
Yeah.
But it's not really about saying you don't have to do the maintenance or operational work.
It's more about saying, you know, it's kind of like highly to progress as incremental, right?
Like, what is the minimum thing that we can get to?
And then there's SOL for like every component after that.
But there's the SOL to get you to the starting line.
And that's usually how it's asked.
On the other side, you know, SOL came out of hardware at NVIDIA, right?
So SOL is literally: if we ran the accelerator or the GPU at basically full speed with no other constraints, how fast would we be able to make a program go?
Yeah.
Right.
So in training, you know, you then work back to some percentage of MFU, for example.
Yeah.
That's a great example.
So like there's an SOL MFU and then there's like, you know, what's practically achievable.
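The gap between a speed-of-light number and what is practically achieved can be made concrete with MFU (model FLOPs utilization). This is a minimal sketch under stated assumptions: it uses the common rule of thumb of about 6 FLOPs per parameter per token for dense transformer training, and the peak throughput, parameter count, and token rate in the usage example are invented for illustration.

```python
def training_flops_per_second(params: float, tokens_per_second: float) -> float:
    """Approximate useful FLOP/s for dense transformer training.

    Uses the rule of thumb of ~6 FLOPs per parameter per token
    (forward plus backward pass); attention FLOPs are ignored.
    """
    return 6.0 * params * tokens_per_second


def mfu(params: float, tokens_per_second: float, peak_flops: float) -> float:
    """Model FLOPs utilization: achieved useful FLOP/s over hardware peak."""
    return training_flops_per_second(params, tokens_per_second) / peak_flops


def sol_tokens_per_second(params: float, peak_flops: float) -> float:
    """Speed-of-light token rate: what MFU = 1.0 would give on this hardware."""
    return peak_flops / (6.0 * params)


if __name__ == "__main__":
    # Illustrative numbers only: a 7e9-parameter model on a 1e15 FLOP/s device.
    params, peak = 7e9, 1e15
    print("SOL tokens/s:", sol_tokens_per_second(params, peak))
    print("MFU at 10k tokens/s:", mfu(params, 10_000, peak))
```

The speed-of-light figure is the ceiling with no other constraints; the measured MFU is the fraction of that ceiling a real run reaches once communication, memory stalls, and other overheads are layered back in.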
Cool.
So we'll move on to sort of the Kyle side. Kyle, you're coming more from the data science world.
And I mean, you're someone who's done work in tabular stuff, graph neural networks, time series. Basically, when I go to NeurIPS, when I go to ICML, I walk the back halls.
There's always like a small group of graph people, small group of tabular people.
And like there's no one there.
And like it's very like, you know what I mean?
Like yeah.
Like it's important, interesting work if you care about solving the problems that they solve.
Yeah.
Yeah.
I mean, it's like it's like the black hole, right?
Yeah.
Has the bed rise and reached this yet in nerves.
But like, you know, those are those are transformers too.
Yeah.
And those are also like interesting thing.
Anyway, I just want to spend a little bit of time on those that background before we go into dynamo proper.
Yeah, sure.
I took a different path to NVIDIA than Nader. I joined six years ago, seven if you count when I was an intern.
So I joined NVIDIA right out of college.
And the first thing I jumped into was not what I'd done during my internship, which was, you know, some stuff for autonomous vehicles, like heavyweight object detection.
I jumped into, you know, something like recommenders.
This is popular.
And yeah, he did.
He did.
Yeah.
I mean, that was the tabular data at the time, right?
You have tables of audience qualities and item qualities, and you're trying to figure out which member of the audience matches which item, or, more practically, which item matches which member of the audience.
And at the time, really, we were trying to turn recommenders, which had historically been a bit of a CPU-based workflow, into something that ran really well on GPUs.
And it's since been done. There are a bunch of libraries for RecSys that run on GPUs.
The common models, like the deep learning recommendation model, which came out of Meta,
and the Wide & Deep model, which was released by Google, were very accelerated by GPUs, using, you know, the fast HBM on the chips, especially to do vector lookups.
But it was very interesting at the time and super relevant, because we were starting to get this explosion of feeds and things that required recommenders to just actively be on all the time.
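The "vector lookups" mentioned above are embedding-table reads: each user and item ID maps to a vector, and a match score is computed from the looked-up vectors. This is a toy sketch of that pattern only, not of the DLRM or Wide & Deep architectures; the tables, IDs, and function names are made up for illustration, and a real system would hold far larger tables in GPU memory.

```python
# Toy embedding tables keyed by ID; the values are invented for illustration.
USER_EMBEDDINGS = {
    "u1": [1.0, 0.0],
    "u2": [0.0, 1.0],
}
ITEM_EMBEDDINGS = {
    "i1": [0.9, 0.1],
    "i2": [0.2, 0.8],
}


def score(user_id: str, item_id: str) -> float:
    """Look up both vectors and score the pair with a dot product."""
    user_vec = USER_EMBEDDINGS[user_id]   # embedding lookup for the user
    item_vec = ITEM_EMBEDDINGS[item_id]   # embedding lookup for the item
    return sum(u * i for u, i in zip(user_vec, item_vec))


def recommend(user_id: str, k: int = 1) -> list:
    """Return the k items with the highest score for this user."""
    ranked = sorted(ITEM_EMBEDDINGS, key=lambda item: score(user_id, item), reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    print(recommend("u1"))
    print(recommend("u2"))
```

The lookups are cheap in arithmetic but heavy in memory traffic, which is why memory bandwidth matters so much for this workload.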
And I sort of transitioned a little bit towards graph neural networks when I discovered them, because I was like, okay, you can actually use graph neural networks to represent relationships between people, items, concepts.
And that that interested me.
So I jumped into that at NVIDIA and got really involved for like two years.
Yeah.
And something I learned from Bryan Catanzaro is that you can just kind of choose your own path at NVIDIA.
Oh my god.
Yeah.
Which is not a normal big corp thing.
Yeah.
You have a lane.
You stay in your lane.
I think probably the reason why I enjoy being in a big company.
I'm just trying to start up guy.
Yeah.
Like this.
Yeah.
It feels like a big game of pickup basketball.
Like you know, if you want to play basketball, you just go up to the court and you're like, hey, we're going to play this game and we need three and you just like find your three.
That's honestly for every new initiative.
That's what it feels like.
Yeah.
Yeah.
Yeah.
It also like shows, right?
Like Nvidia is just releasing state of the art stuff in every domain.
Like, okay, you expect foundation models with Nemotron, voice, just randomly and pop to your character.
It just comes out.
Another one.
The oyster voice team has always been previously.
There's always just, in every other domain, a paper that comes out, a data set that comes out.
It's like, I mean, it also stems back to what Nvidia has to do, right?
You have to make chips years before they're actually produced, right?
So you need to know... really, the design process starts, like, exactly,
three to five years before the chip gets to the market.
Yeah.
I'm curious more about what that's like, right?
So, like, you have specialist teams. Is it just, you know, people find an interest, you go in, you go deep on whatever, and that kind of feeds back into, you know, okay, we expect... like, the internal predictions at NVIDIA must be crazy, right?
You know, you know, you must not even without selling the people.
You have your own predictions of where things are going and they're very based, very grounded, right?
Yeah.
It's really interesting.
There are two things that I think NVIDIA does that are quite interesting.
One is, we really index into passion.
There's a big sort of organizational, top-down push to ensure that people are working on the things that they're passionate about.
So if someone proposes something that's interesting, many times they can just email someone way up the chain who would find this relevant and say, hey, can I go work on that?
That's actually... like, I worked at a big company for a couple of years before starting on my startup journey.
And it felt very weird if you were to email out of chain, if that makes sense.
But emails at NVIDIA are like mosh pits, and it's just like 60 people, just whatever.
And like, there's something I missy like, reply all, you know, it's insane.
It's insane.
It's what's helped.
You know, it's a good text.
But, but that's actually like, I've actually, so this is a weird thing where I used to be like, why would we send emails?
We have Slack.
I am the exact opposite.
I feel so bad for anyone who's like messaging me on Slack because I'm so unresponsive.
So you know, I know.
I know.
You know as a different email is perfect because we can't work together.
You know, it's great because important threads get bumped back up.
Right.
Yeah.
And so Slack doesn't do that.
So I just have like this casino going off on the right.
They're on the left.
And like, I don't know which thread was from where or what, but like, the threads get.
And then also just like the subject.
So you can have like working threads.
I think what's difficult is like, when you're small, if it's not 40,000 people, I think Slack will work fine.
But there's, I don't know what the inflection point is.
There is going to be a point where that becomes really messy.
And you'll actually prefer having email because you can have working threads.
You can see more than nine people in a thread.
You can force stuff.
You can fork stuff, which is super nice and just like, yeah.
And so that's part of it: you can propose a plan, but you can also just start.
Honestly, momentum is the only authority, right?
So like if you can just start to make a little bit of progress and show someone something and then they can try it.
That has been, you know, I think the most effective way to push anything forward.
And that's both at NVIDIA and, I think, just generally.
Yes, and there's the other concept that is explored a lot at NVIDIA, which is this idea of a zero billion dollar business.
Like, market creation is a big thing.
At NVIDIA, like, you'll want to go and start a zero billion dollar business.
Jensen says we're completely happy investing in zero billion dollar markets.
We don't care if this creates revenue.
It's important for us to know about this market.
We think it will be important in the future.
It can be zero billion dollars for a while.
I'm probably mangling his words here, but like, you know, I'll give an example.
NVIDIA has been working on autonomous driving for a long time.
Like, an NVIDIA car?
No, like, you'd see them driving right around the HQ.
And then I think it finally just got licensed out, and now they're starting to be used quite a bit.
Yeah.
Yeah.
Like for 10 years, you've been seeing Mercedes with NVIDIA logos.
Oh, yeah.
Yeah.
If you're in, like, this is something.
Sanctuary.
Sanctuary.
Yeah.
Yeah.
Yeah.
So zero billion dollar markets are a thing, like, you know, Jensen, I mean, okay, the cars are not a zero billion dollar market, but yeah.
It doesn't matter.
Yeah.
I think that messaging zero today, but or even, like, internally, right?
Like, an org doesn't have to ruthlessly find revenue very quickly to justify its existence, right?
Like, a lot of the important research, a lot of the important technology being developed.
That's kind of where research is very ideologically free at NVIDIA.
Yeah.
Like, they can pursue things.
Were you in research, officially?
I was never in research, officially.
I was always in engineering.
Yeah.
I'm in an org called Deep Learning Algorithms, which is basically just how we make things that are relevant to deep learning go fast.
That sounds freaking cool.
And I think a lot of that is underappreciated, right?
Like, time series: this week Google put out TimesFM, yeah.
A new time series model, right?
RecSys, semantic IDs: people started applying transformers, LLMs, to RecSys.
And when you think about the scale of companies deploying these, right, Amazon recommendations, Google web search, it's huge scale and you want it fast.
Yeah.
Yeah.
Yeah.
Actually, there's a fun moment that brought me like full circle, like, Amazon ads recently gave a talk where they talked about using Dynamo for generative recommendation, which was like super, like weirdly cathartic for me, I'm like, oh my god, I've supplanted what I was working on.
Like, you're using LLMs now to do what I was doing five years ago.
Yeah.
Amazing.
And that's going right into Dynamo.
Maybe introduce, yeah.
Sure.
That's what it top down.
Yeah.
I think at this point, a lot of people are familiar with the term inference.
Like, funnily enough, it went from, you know, inference being a really niche topic to being something that's discussed on, like, normal people's Twitter feeds.
Some billboards here.
Yeah.
Driving, you just see inference ads on the 101.
Inference at scale is becoming a lot more important.
We have these moments like, you know, OpenClaw, where you have these agents that take lots and lots of tokens, but produce incredible results.
There are many different aspects of test time scaling so that, you know, you can use more inference to generate a better result than if you were to use like a short amount of inference.
There's reasoning, there's requering, there's adding agency to the model, allowing it to call tools and use skills.
Dynamo sort of came about at NVIDIA because myself and a couple of others were talking about these concepts. You know, you have inference engines like vLLM, SGLang, and TensorRT-LLM, and they sort of think about things as one single copy, one replica, one version of the model.
But when you're actually serving things at scale, you can't just scale up that replica because you end up with like performance problems.
There's a scaling limit to scaling up replicas.
So you actually have to scale out, to use some Kubernetes terminology.
We kind of realized that there was like a lot of potential optimization that we could do in scaling out and building systems for data center scale inference.
So Dynamo is this data center scale inference engine that sits on top of frameworks like vLLM, SGLang, and TensorRT-LLM and just makes things go faster, because you can leverage the economy of scale.
You have KV cache, which we can define a little bit later, across all of these machines, and each machine's cache is unique, so you want to figure out ways to maximize your cache hits.
Or you want to employ new techniques in inference like disaggregation, which Dynamo introduced to the world in March. Well, not introduced.
It was an academic topic beforehand, but we're one of the first frameworks to start supporting it.
And we want to like sort of combine all these techniques into sort of a modular framework that allows you to accelerate your inference at scale.
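A rough sketch of the "maximize your cache hits" idea, assuming a router that tracks which prefix blocks each replica already holds in its KV cache. `Worker`, `route`, the block size, and the load penalty are hypothetical names and values for illustration, not Dynamo's API.

```python
from dataclasses import dataclass, field

BLOCK = 4  # tokens per cache block (tiny, for illustration only)


def block_hashes(tokens, block=BLOCK):
    """Chain-hash each full block so identical prefixes yield identical hashes."""
    hashes, prev = [], 0
    full = len(tokens) - len(tokens) % block
    for i in range(0, full, block):
        prev = hash((prev, tuple(tokens[i:i + block])))
        hashes.append(prev)
    return hashes


@dataclass
class Worker:
    name: str
    cached: set = field(default_factory=set)  # block hashes held in KV cache
    load: int = 0                             # requests routed so far


def overlap(worker, hashes):
    """Count leading blocks of the request already cached on this worker."""
    n = 0
    for h in hashes:
        if h not in worker.cached:
            break
        n += 1
    return n


def route(workers, tokens, load_penalty=0.5):
    """Pick the worker with the best trade-off of cache overlap and load."""
    hashes = block_hashes(tokens)
    best = max(workers, key=lambda w: overlap(w, hashes) - load_penalty * w.load)
    best.cached.update(hashes)
    best.load += 1
    return best


workers = [Worker("a"), Worker("b")]
first = route(workers, list(range(8)))
second = route(workers, list(range(12)))  # shares its first 8 tokens with the first request
print(first.name, second.name)
```

The second request reuses the first request's prefix, so the router prefers the replica that already did that pre-fill work unless it is too heavily loaded.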
By the way, Kyle and I became friends on my first day at NVIDIA, and I always love it because he always teaches me new things.
By the way, this is why I wanted to put two of you together.
I was like, yeah, this is going to be good.
It's very different, you know, like we've we've talked to each other brunch.
Actually, you asked, like, why can't we scale up?
Yeah, model, you said model replicas.
Yeah, so scale up means making things heavier.
Yeah, heavier, like making things heavier, adding more GPUs, adding more CPUs.
Scale out is just saying, I'm going to duplicate my representation of the model, or a representation of this microservice or something.
And I'm going to replicate it many times to handle the load.
And the reason that you can't scale up past some point is, you know, there are sort of hardware bounds and algorithmic bounds on that type of scaling.
So I'll give you a good example that's like very trivial.
Let's say you're on an H100.
The maximum NVLink domain for H100, for most DGX H100s, is eight GPUs.
Right.
So if you scale past that, you're going to have to figure out ways to handle the fact that now, for the GPUs to communicate, you have to do it over InfiniBand, which is still very fast, but it's not as fast as NVLink.
Is it like one order of magnitude, like 100?
It's about an order of magnitude.
So not terrible.
Yeah, I need to remember the the data sheet here.
Like, I think it's about 500 gigabytes a second unidirectional for NVLink and about 50 gigabytes a second unidirectional for InfiniBand.
It depends on the generation.
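To make the order-of-magnitude gap concrete, here is a back-of-the-envelope calculation using the approximate figures quoted above. The 1 GB payload is a hypothetical example, and real links add latency and protocol overhead that this ignores.

```python
NVLINK_GB_PER_S = 500.0      # approximate unidirectional figure quoted above
INFINIBAND_GB_PER_S = 50.0   # approximate unidirectional figure quoted above


def transfer_ms(gigabytes, gb_per_s):
    """Idealized transfer time in milliseconds: size divided by bandwidth."""
    return gigabytes / gb_per_s * 1000.0


payload_gb = 1.0  # hypothetical amount of data exchanged between GPUs
nvlink_ms = transfer_ms(payload_gb, NVLINK_GB_PER_S)
infiniband_ms = transfer_ms(payload_gb, INFINIBAND_GB_PER_S)

print(f"NVLink:     {nvlink_ms:.1f} ms")
print(f"InfiniBand: {infiniband_ms:.1f} ms")
print(f"Slowdown:   {infiniband_ms / nvlink_ms:.0f}x")
```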
I just want to set this up for people who are not familiar with these kinds of like layers of the transges piece.
Also maybe even just going like a few steps back before that.
Like, most people are very familiar with, you know, what you can use on your laptop, whatever, these LLMs, and you can just run inference.
So you can just run on that laptop.
You can run on laptop.
Then you get to okay, models got pretty big, right?
GLM5, they doubled the size.
So what do you do when you have to go from?
Okay, I can get 128 gigs of memory.
I can run it on a spark.
Then you have to go multi GPU.
Okay, multi GPU, there's some support there.
Now, say I'm a company and I'm not hiring the best researchers for this, right?
But I need to go multi node, right?
I have a lot of servers.
Okay, now there's efficiency problems, right?
You can have multiple 8x H100 nodes.
But, you know, how do you do that efficiently?
Yeah, how do you like represent them?
How do you choose how to represent the model?
Yeah, that's like a hard question everyone asks.
Like how do you size?
Oh, I want to run GLM5, which just came out new model.
There have been like four of them in the past week, by the way.
Like a bunch of new models.
You know why, right?
DeepSeek.
No problem.
Yeah, but GLM5, right?
We have this new model.
It's of like a large size.
And you have to figure out how to both scale up and scale out, right?
Because you have to find the right representation that you care about.
Everyone does this differently.
Let's be very clear.
Everyone figures this out in their own path.
I feel like a lot of AI or ML even is like, is like this.
I think people think, you know, there was some tweet a few months ago that was like, why hasn't fine-tuning as a service taken off?
And you know, and like, it might be me.
Yeah, it might have been you.
Yeah, but people want it to be such an easy recipe to follow.
But even, like, if you look at an ML specific to you.
Yeah, and the model.
And there's a question.
There's so much tinkering.
Like, when you see a model that has however many experts in an MoE model.
It's like, why that many experts, you know, they tried a bunch of things.
And that one seemed to do better.
And I think when it comes to how you're serving inference, you know, you have a bunch of decisions to make.
And you can always argue that you can take something and make it more optimal.
But I think there's this internal calibration and appetite for continued calibration.
Yeah.
And that doesn't mean, like, you know, people aren't taking a shot at this.
Like Tinker from Thinking Machines.
Yeah.
RL as a service.
Yeah.
It's, it also gets even harder when you try to do big model training.
Right.
We're not the best at training MoEs when they're pre-trained.
Like, we saw this with Llama 3, right?
They're trained in such a sparse way that Meta knows there's going to be a bunch of inference done on these, right?
They'll open source it.
But it's very much trained for what Meta's infrastructure wants, right?
They want to, they want to inference it a lot.
Now, the question to basically think about is, okay, say you want to serve a chat application, a coding copilot, right?
You're doing a layer of RL.
You're serving a model for X amount of people.
It's a chat model, coding model.
Dynamo, you know, back to that.
Yeah, sorry.
So we sort of, like, jumped off of that topic. Everyone is kind of on their own journey.
And I like to think of it as defined by like, what is the modeling need?
What is the accuracy you need?
Actually, I talked to Nader about this earlier: there are three axes you care about.
What is the quality they're able to produce?
So like, are you accurate enough, or can you complete the task with enough performance?
Yeah.
There's cost.
Can you serve the model or serve your workflow?
Because it's not just the model anymore.
It's the workflow.
It's the multi-turn with an agent, cheaply enough.
And then can you serve it fast enough?
And we're seeing all three of these like play out.
Like, we saw new models from OpenAI that, you know, are faster; you have these new fast versions of the models.
You can change the amount of thinking to change the amount of quality, right?
Produce more tokens, but at a higher cost and a higher latency.
And really like when you start this journey of like trying to figure out how you want to host a model, you think about three things.
What is the model I need to serve?
How many times do I need to call it?
What is the input sequence length?
What does the workflow look like on top of it?
What is the SLA?
What is the latency SLA that I need to achieve?
Because there's usually some, this is usually like a constant.
You know the SLA that you need to hit.
And then like you try and find the lowest cost version that hits all of these constraints.
Usually, you know, you start with those things and you say, you kind of do like a bit of experimentation across some common configurations.
You change the tensor parallel size, which is a form of parallelism.
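The selection process described here (fix the model and the SLA, sweep a few common configurations, keep the cheapest one that satisfies every constraint) can be sketched as below. All benchmark numbers and the `Config` fields are made up for illustration; in practice they would come from measuring each configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    tensor_parallel: int    # GPUs per replica
    replicas: int           # number of model copies
    p99_latency_ms: float   # hypothetical measured latency
    tokens_per_s: float     # hypothetical measured aggregate throughput

    @property
    def gpus(self):
        return self.tensor_parallel * self.replicas


def cheapest_meeting_sla(configs, sla_ms, required_tokens_per_s):
    """Return the feasible config with the fewest GPUs, or None if none qualifies."""
    feasible = [
        c for c in configs
        if c.p99_latency_ms <= sla_ms and c.tokens_per_s >= required_tokens_per_s
    ]
    return min(feasible, key=lambda c: c.gpus) if feasible else None


# Hypothetical sweep results for one model and one workload.
sweep = [
    Config(tensor_parallel=1, replicas=8, p99_latency_ms=900, tokens_per_s=4000),
    Config(tensor_parallel=2, replicas=4, p99_latency_ms=520, tokens_per_s=3600),
    Config(tensor_parallel=4, replicas=2, p99_latency_ms=310, tokens_per_s=3000),
    Config(tensor_parallel=8, replicas=1, p99_latency_ms=200, tokens_per_s=2200),
    Config(tensor_parallel=4, replicas=4, p99_latency_ms=300, tokens_per_s=6000),
]

best = cheapest_meeting_sla(sweep, sla_ms=400, required_tokens_per_s=2500)
print(best)
```

GPU count stands in for cost here; a real comparison would also weigh GPU type and pricing.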
I think it goes even deeper.
First kind of thing: which model?
Yes, it's like a multi-step design process because, as you said, you can choose a smaller model and then do more test time scaling, and it'll equal the quality of a larger model because you're doing the test time scaling or you're adding a harness or something.
So yes, it goes way deeper than that.
But from the performance perspective, like once you get to the model, you need to host.
You look at that and you say, hey, I have this model.
I need to serve it at this speed.
What is the right configuration for that?
You guys see the recent paper, I just saw it a few days ago, that if you run the same prompt twice, you're getting, like... just try it again.
Yeah, exactly.
And you get a lot.
Yeah.
But the key thing there is you give it the context of the failed try, right?
Yeah.
It takes a shot and this has been like, you know, basic guidance for quite a while.
Just try again because, you know, it's just try again.
Did you try again?
All of my life.
And I think it's a paper from Google, if I'm not mistaken, right?
I think it's like a seven-page little short paper.
Yeah.
The title is very cute and it's just like, yeah, just try again.
Yeah, that's a context.
No, no, no.
You just say, okay, take a little bit more information from the try that failed.
And that basic concept is going pretty deep.
There's self-distillation RL, where you do self-distillation.
You do RL and you have the past failure and, you know, that gives some signal.
"Try it again" can't be said strongly enough for listeners here.
We actually run a second YouTube channel for our paper club, where we just covered this self-distillation work and all that.
That's why he's speaking on it, so check it out.
Yeah, it's just a good practice. Everyone needs a paper club where you just read papers together, and the social pressure just kind of forces you to speak.
We are just like a big inference reading group at the end.
It's so bad every time I, he put it on, like, on our, he shared it.
Yeah, but one of your guys is, is, is big in that.
I forget.
Yeah, he shot, he shot, he shot, he shot, he shot, he shot, he shot, he shot, he shot, there's a, there's a employee transfer between us.
He worked for Nader at Brev.
And now he, he was, he was our head of AI and then, yeah, once we got in.
So, because I'm always looking for, like, okay, can I start another podcast that only does that thing?
And I was trying to, like, nudge him into it: there's something here.
I mean, I don't think there's, there's new emphasis like this every day.
So, it's like, it's like, you would actually be surprised.
You're a man of blog post, you see.
And there was a period where it was like Medusa, Hydra, EAGLE, you know, new forms of speculative decoding all the time.
I mean, it's exciting when you guys put out something like Nemotron, because I remember the paper on this, Nemotron 3, the amount of post-training, the amount of tokens that the GPU rich can just train on.
And it was a hybrid state space model, right?
Yeah, it's co-designed for the hardware.
Yeah, co-designed for the hardware.
And one of the things was always, you know, the state space models don't scale as well.
When you do a conversion or whatever, the performance.
And you guys were like, no, just keep training.
And Nemotron shows a lot of that.
Yeah.
Also something cool about Nemotron.
It was released in layers, if you will, very similar to Dynamo.
Essentially, it was released disaggregated.
The pre-training and post-training datasets are released.
The recipes on how to do it are released.
The model itself is released.
So you can just benefit from us turning on the GPUs.
But there are companies like ServiceNow that took the dataset.
And they train their own model.
And we were super excited.
Yeah.
Like, you know, so they do work.
Yeah.
They do the French model.
Yeah.
The zoom is zoom is easy.
Yeah.
I think, you know, also just to add, a lot of model releases don't put out base models.
And on that question of why fine-tuning isn't taking off, you know, with a base model you can do your own instruction tuning.
You guys put out the base model.
I think you put out everything.
I boo, boo, boo, boo, boo, boo.
Basically, basically cancel it.
It's going to be canceled.
Yeah.
It's going to be canceled.
It's going to be canceled.
Do we get a full picture of Dynamo?
I don't know if we're doing it.
What I'd love is, you mentioned the three axes.
Like, break it down: you know, what's pre-fill and decode, and what are the optimizations that we can get with Dynamo?
Yeah, that's a great point.
So to summarize that three axes problem, right, there are three things that determine whether or not something can be done with inference: cost, quality, latency.
Right.
Dynamo is supposed to be there to provide you the runtime that allows you to pull levers to, you know, mix it up and move around the Pareto frontier, or the Pareto surface, that determines:
Is this actually possible with inference in AI today?
It gives you the knobs.
Yeah, exactly.
It gives you the knobs.
And one thing that like we use a lot in contemporary inference and is, you know, starting to like pick up from, you know, in general knowledge is this concept of disaggregation.
So historically models would be hosted with a single inference engine.
And that inference engine would ping-pong between two phases.
There's pre-fill where you're reading the sequence, generating KV cache, which is basically just a set of vectors that represent the sequence, and then using that KV cache to generate new tokens, which is called decode.
And some brilliant researchers across multiple different papers, essentially made the realization that if you separate these two phases, you actually gain some benefits.
Those benefits are basically, A, you don't have to worry about step-synchronous scheduling.
So the way that an inference engine works is you do one step, and then you finish it, and then you start scheduling the next step.
It's not like fully asynchronous.
And the problem with that is pre-fill and decode are actually very different in terms of both their resource requirements and, sometimes, their runtime.
So you would have pre-fill that would block decode steps.
Because you'd still be pre-filling, and you couldn't schedule, because, you know, the step has to end.
So you remove that scheduling issue.
And then you also allow yourself to like split the work into two different types of pools.
So pre-fill typically, and this changes as model architecture changes, is right now compute bound most of the time.
When the sequence is sufficiently long, it's compute bound.
On the decode side, because you're doing a full pass over all the weights and the entire sequence every time you do a decode step, and you don't have the quadratic computation of KV cache, it's usually memory bound.
You're retrieving a linear amount of memory and doing a linear amount of compute, as opposed to pre-fill, where you retrieve a linear amount of memory and then do a quadratic amount of compute.
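A rough way to see the compute-bound versus memory-bound split is to compare arithmetic intensity (FLOPs per byte moved) for the two phases. This is a simplified sketch for a dense transformer at batch size 1. The parameter-count approximation, the layer and width values, and the accelerator figures are all assumptions chosen for illustration, not measurements of any real model or GPU.

```python
def model_params(layers, d_model):
    """Common rough estimate of dense transformer parameters (assumption)."""
    return 12 * layers * d_model ** 2


def prefill_cost(layers, d_model, seq_len, bytes_per_value=2):
    """FLOPs and bytes for processing seq_len prompt tokens in one pass."""
    params = model_params(layers, d_model)
    # Weight matmuls are linear in tokens; attention is quadratic in tokens.
    flops = 2 * params * seq_len + 4 * layers * d_model * seq_len ** 2
    # Weights are read once; K and V are written once per token.
    bytes_moved = bytes_per_value * (params + 2 * layers * d_model * seq_len)
    return flops, bytes_moved


def decode_step_cost(layers, d_model, context_len, bytes_per_value=2):
    """FLOPs and bytes for generating ONE token against an existing context."""
    params = model_params(layers, d_model)
    flops = 2 * params + 4 * layers * d_model * context_len
    # Every step re-reads all weights and the whole KV cache.
    bytes_moved = bytes_per_value * (params + 2 * layers * d_model * context_len)
    return flops, bytes_moved


def classify(flops, bytes_moved, peak_flops, mem_bandwidth):
    """Compare arithmetic intensity to the hardware's ridge point."""
    intensity = flops / bytes_moved
    ridge = peak_flops / mem_bandwidth
    return intensity, ("compute-bound" if intensity > ridge else "memory-bound")


LAYERS, D_MODEL, TOKENS = 32, 4096, 4096   # hypothetical model and prompt size
PEAK_FLOPS, MEM_BW = 1e15, 3e12            # hypothetical accelerator

prefill_intensity, prefill_kind = classify(
    *prefill_cost(LAYERS, D_MODEL, TOKENS), PEAK_FLOPS, MEM_BW)
decode_intensity, decode_kind = classify(
    *decode_step_cost(LAYERS, D_MODEL, TOKENS), PEAK_FLOPS, MEM_BW)

print("prefill:", prefill_intensity, prefill_kind)
print("decode: ", decode_intensity, decode_kind)
```

Batching many decode requests together raises decode intensity, which is one reason the two phases benefit from being sized and scheduled separately.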
You know, it's funny.
Someone at Exo Labs did a really cool demo with the DGX Spark, which has a lot more compute.
You can do the compute hungry pre-fill on a DGX Spark and then do the decode on a Mac.
Yeah.
And so that's faster.
Yeah.
You can do machine stratification.
Yeah.
And with our future generations of hardware, we actually announced with Rubin this new accelerator that is pre-fill-specific.
It's called Rubin CPX.
So I have a question: when you do this scale out, is scaling out easier with Dynamo, because when you need a new node, you can dedicate it to either pre-fill or decode?
Yeah, so Dynamo actually has a Kubernetes component called Grove that allows you to do this crazy scaling specialization.
It has like this hot, it's a representation that I don't want to go too deep into Kubernetes here.
But there was a previous way that you would like launch multi-node work.
It's called LeaderWorkerSet.
It's in the Kubernetes standard.
And Leader Worker Set is great.
It's served a lot of people super well for a long period of time.
But one of the things that it struggles with is representing a set of cases where you have a multi-node replica that has a pair, pre-fill and decode, or it's not paired, but it has a second stage, that has a ratio that changes over time.
And pre-fill and decode are two different things.
As your workload changes, the amount of pre-fill you'll need to do may change, and the amount of decode that you'll need to do might change.
Let's say you start getting insanely long queries; that probably means that your pre-fill scales harder, because you're hitting this quadratic scaling growth.
Yeah.
And for listeners, pre-fill will be long input, decode will be long output, for example.
Yeah.
So decode scale, I mean, decode is funny because the amount of tokens that you produce scales with the output length, but the amount of work that you do per step scales with the amount of tokens in the context.
Yes.
So it scales with both the input and the output.
That's true.
But on the usage side, say the amount of work you're doing on the decode side stays about the same, or scales a little bit.
And then the pre-fill side jumps up a lot.
Then you actually don't want that ratio to stay the same; you want it to change over time.
So Dynamo has a set of components that, A, tell you how to scale.
It tells you how many pre-fill workers and decode workers it thinks you should have.
And it also provides a scheduling API for Kubernetes that allows you to actually represent and effect this scheduling on your actual hardware, on your compute infrastructure.
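A minimal sketch of why the pre-fill to decode worker ratio should move with the workload. `plan_workers` and the per-worker throughput numbers are hypothetical; this is not Dynamo's planner or Grove's API. The model is deliberately linear, so it ignores the quadratic attention cost in pre-fill and the context-dependent cost of each decode step mentioned above.

```python
import math


def plan_workers(requests_per_s, avg_input_tokens, avg_output_tokens,
                 prefill_tokens_per_s_per_worker, decode_tokens_per_s_per_worker):
    """Size each pool from its own token demand (simplified linear model)."""
    prefill_demand = requests_per_s * avg_input_tokens
    decode_demand = requests_per_s * avg_output_tokens
    prefill_workers = max(1, math.ceil(prefill_demand / prefill_tokens_per_s_per_worker))
    decode_workers = max(1, math.ceil(decode_demand / decode_tokens_per_s_per_worker))
    return prefill_workers, decode_workers


# Hypothetical per-worker throughput figures.
PREFILL_RATE, DECODE_RATE = 20_000, 2_500

short_prompts = plan_workers(10, 2_000, 500, PREFILL_RATE, DECODE_RATE)
long_prompts = plan_workers(10, 16_000, 500, PREFILL_RATE, DECODE_RATE)

print("short prompts (prefill, decode):", short_prompts)
print("long prompts  (prefill, decode):", long_prompts)
```

With the output length unchanged, only the pre-fill pool grows when prompts get longer, which is the changing ratio that a fixed paired deployment cannot express.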
That kind of like, I feel a little embarrassed for being proud of my SVG function earlier.
Yeah, it's really cute.
I like it's all engineering.
It's all engineering.
One thing I'm kind of just curious about, with everything you see at a systems level, everything going on here.
And we're scaling it up in multi-node systems.
I think one thing that's kind of of the moment right now is people are asking, is there any SOL, sort of upper bound, in terms of, let's just call it context length, for want of a better word?
But you can break it down however you like.
Yeah, I just think like, well, yeah, I mean, I clearly, you can engage in hybrid architectures and doing some state-space models in there all you want.
But it looks still looks very attention-heavy.
Yes.
Yeah, long context is attention-heavy.
I mean, we have these hybrid models.
And most models, like, cap out at a million context.
And that's it, like for the last two years has been it.
Yeah, the model-hardware-context co-design thing that we're seeing these days is actually super interesting.
It's like my passion, like my secret side passion.
We see models like Kimi or GPT-OSS.
I'm going to use these because I know specific things about these models.
So Kimi K2 comes out, right?
And it's an interesting model.
It's like a DeepSeek-style architecture.
It's MLA.
It's basically DeepSeek, scaled a little bit differently.
And obviously trained differently as well.
But they talked about why they made the design choices.
For context, Kimi has more experts, but fewer attention heads.
And I believe a slightly smaller attention, like, dimension, but I need to remember.
I need to check that.
It doesn't matter.
But they discussed this actually at length in a blog post on GPU, which is like RGP, which is like super cool.
Yeah.
In RGP.
Yeah, so it's actually an incredible blog post.
Like all the MLAs that I've seen on GPU are like very brilliant.
But the creators of Kimi K2 actually talked about it in their blog post.
And they say, we actually did an experiment, right?
Attention scales with the number of heads.
Obviously, if you have 32 heads versus 64 heads, you do half the work of attention.
You still scale quadratically, but you do half the work.
And they made a very specific sort of barter in their architecture.
They basically said, hey, what if we give it more experts?
So we're going to use more memory capacity, but we keep the amount of activated experts the same.
We increase the expert sparsity.
So the ratio of experts activated to number of experts is smaller.
And we decrease the number of attention heads.
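The trade described here can be put into numbers with a small sketch. The head counts follow the 64 versus 32 example from the conversation; the sequence length, head dimension, and expert counts are hypothetical values for illustration and are not claimed to match any specific model.

```python
def attention_score_flops(seq_len, num_heads, head_dim):
    """Per-layer attention FLOPs: Q @ K^T and scores @ V, each about
    2 * seq_len^2 * head_dim per head."""
    return 4 * seq_len ** 2 * num_heads * head_dim


def expert_activation_ratio(active_experts, total_experts):
    """Fraction of experts used per token; smaller means sparser."""
    return active_experts / total_experts


SEQ_LEN, HEAD_DIM = 8192, 128  # hypothetical

wide = attention_score_flops(SEQ_LEN, num_heads=64, head_dim=HEAD_DIM)
narrow = attention_score_flops(SEQ_LEN, num_heads=32, head_dim=HEAD_DIM)
print("attention work with half the heads:", narrow / wide)

# Doubling the sequence still quadruples attention work in both cases.
longer = attention_score_flops(2 * SEQ_LEN, num_heads=32, head_dim=HEAD_DIM)
print("growth when sequence doubles:", longer / narrow)

# Same number of active experts, more total experts: more memory, sparser model.
before = expert_activation_ratio(active_experts=8, total_experts=256)
after = expert_activation_ratio(active_experts=8, total_experts=384)
print("activation ratio before/after:", before, after)
```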
And kind of for context, what we had been seeing was you make models sparser instead.
So no one was really touching heads.
You're just having, well, they did, they implicitly made it sparser.
Yeah, for Kimmy, they did.
Yes.
They also made it sparser.
But basically what we were seeing was people were at the level of, okay, there's a sparsity ratio.
You want more total parameters less active.
And that's sparsity.
But what you see in papers from labs like Moonshot and DeepSeek is they go to the level of, okay, beyond just the number of experts.
So you can also change how many attention heads, and fewer attention layers.
More attention layers.
Yes.
So, and that's all basically coming back to just tied together.
It's like hardware-model co-design.
Which is hardware-model-context co-design.
Yeah, right.
Like if you were training a model that was really, really short context, or really good at super short context tasks, you may design it in a way such that you don't care about attention scaling, because it hasn't hit that turning point where the quadratic curve takes over.
How do you consider attention or context as a separate part of the co-design?
Like, I would have thought of hardware-model co-design as already being hardware-model-context co-design.
Because the harness and the context that is produced by the harness is a part of the model once it's trained in.
Like even though towards the end, you'll do long context.
You're not changing architecture though.
I mean, you can try.
You're saying everyone's training the harness into the model?
I would say to some degree, or there's a co-design.
I know there's a smaller amount, but I feel like not everyone has gone full send on this.
I think it's important to internalize into the model the harness that you think the model will be running in.
Yeah, interesting.
Okay.
And then Bash is like the universal harness.
Right.
Like I'll give an example here.
Right.
I mean, or just like a like an easy proof.
Right.
If you can train against the harness and you're using that harness for everything, wouldn't you just train with the harness to ensure that you get the best possible quality out of?
Well, I can provide the counter argument, which is that you want to provide a generally useful model for other people to plug into their harnesses.
Right.
So if your harness can be open source, right?
Yes, that's effectively what's happening with Codex.
Yeah.
But like you may want like a different search tool and then you may have to name it differently.
I don't know how much people have pushed on this, but can you train a model?
Have people compared training a model for the harness versus, like, post-training for it?
I think it's the same thing.
That's the same thing.
It's just extra post-training.
I see.
And so, I mean, Cognition does this, Cursor does this.
You just have to, like, if your tool is slightly different, either force your tool to be like the tool that they trained for, or undo their training for their tool, and then it's really annoying.
I would hope that eventually we hit a certain level of generality with this. It's not a new tool.
It's not a GI, like it's really stupid, like learn my tool bitch, like I don't know if I can say that, but I think what my point kind of is is that there's like I look at slopes of the scaling laws and like this slope is not working, man.
We are at a million token context.
Okay, maybe next year, two million.
We're not going to 100 trillion.
You know, like this this is so many interesting.
This doesn't work.
It doesn't work.
What's kind of funny is we always want to see a trend that we can predict, but every time something's come, it's been like a leapfrog.
So I imagine, I don't know how we go from one to two, but I imagine what's likely to happen is we break through that from some new.
Yeah, there's actually, there's an interesting formalization of this.
There's a pretty interesting essay by Leopold Aschenbrenner called Situational Awareness.
Okay, yes.
He introduces a concept in there called an unhobbling, right?
So, you know, Leopold in this essay details, hey, I want to get to this point in intelligence.
And I think that it is four orders of magnitude worth of, like, compute and data in training away.
And, you know, he says, oh, yeah, I think data centers can scale up by about this much.
I think that you can do scale up the data and some other things by this much.
But one of the things that makes the rest of that order-of-magnitude growth possible is this unhobbling.
So, these are scientific discoveries, made during, you know, model architecture search or training, that really impact how you are able to scale.
A good example of this, and this is probably a very tiny unhobbling, but it's important from the performance perspective:
We see a lot of models that are trained with multi-token prediction, natively, during pre-training.
And DeepSeek, in their paper, said this actually helped ensure more stable convergence.
But there are unhobblings that are like that, and then there are rather large unhobblings, right? Architecturally, we had different types of attention, and one of the problems with attention is that you have a lot of KV.
But people found different forms of attention, like grouped-query attention and MLA in DeepSeek, multi-head latent attention, that decrease the burden that KV has on the model, which allows you to grow longer in context.
Yeah, and that was very drastic for DeepSeek.
Yeah, for context, I think the total context of DeepSeek is 128,000 tokens, or it might be 256,000 with RoPE extension. That entire context, I think it's 128,000, fits into eight gigabytes.
And previously, I think the Llama 405B context of a similar size was, like, 40 or 80 gigabytes in the same precision.
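The sizes quoted from memory here can be sanity-checked with a short calculation. The layer counts and dimensions below are assumptions taken from my recollection of the publicly released model configurations (a DeepSeek-V3-style MLA cache and a Llama 3.1 405B-style grouped-query cache) and should be verified against the actual config files. Both are computed at 2 bytes per value for a 128K-token context.

```python
GIB = 1024 ** 3


def gqa_kv_bytes(layers, kv_heads, head_dim, tokens, bytes_per_value=2):
    """Grouped-query attention stores a K and a V vector per KV head per layer."""
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_value


def mla_kv_bytes(layers, latent_dim, rope_dim, tokens, bytes_per_value=2):
    """MLA stores one compressed latent plus a small RoPE key per layer."""
    return layers * (latent_dim + rope_dim) * tokens * bytes_per_value


TOKENS = 128 * 1024

# Assumed Llama 3.1 405B-style config: 126 layers, 8 KV heads, head_dim 128.
gqa_gib = gqa_kv_bytes(layers=126, kv_heads=8, head_dim=128, tokens=TOKENS) / GIB

# Assumed DeepSeek-V3-style config: 61 layers, latent 512, RoPE key 64.
mla_gib = mla_kv_bytes(layers=61, latent_dim=512, rope_dim=64, tokens=TOKENS) / GIB

print(f"GQA cache: {gqa_gib:.1f} GiB")
print(f"MLA cache: {mla_gib:.1f} GiB")
print(f"Reduction: {gqa_gib / mla_gib:.1f}x")
```

Under these assumptions the MLA cache lands close to the eight gigabytes mentioned, and the grouped-query cache falls inside the 40 to 80 gigabyte range.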
Yeah.
So those unhobblers really decrease the size of that stuff, and I wouldn't be surprised if we do see the ability to break through to 10 million, 20 million, a hundred billion context through an unhobbler showing up.
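To make the KV comparison above concrete, here is a minimal back-of-the-envelope sketch. The layer counts and dimensions are assumptions taken from publicly released model configs (they are not quoted in the conversation), and the function names are illustrative only.

```python
# Back-of-the-envelope KV-cache sizing. Layer counts and dimensions are
# assumptions taken from published model configs, not from the speakers.

def gqa_kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_value):
    """Grouped-query attention caches one K and one V vector per KV head, per layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


def mla_kv_bytes_per_token(layers, latent_dim, rope_dim, bytes_per_value):
    """MLA caches one compressed latent plus a small RoPE key per layer."""
    return layers * (latent_dim + rope_dim) * bytes_per_value


def to_gib(num_bytes):
    return num_bytes / 2**30


CONTEXT_TOKENS = 128 * 1024  # the "128,000" context, taken here as 2**17 tokens
BF16_BYTES = 2

# Assumed Llama 3.1 405B shape: 126 layers, 8 KV heads, head dimension 128.
llama_per_token = gqa_kv_bytes_per_token(126, 8, 128, BF16_BYTES)
# Assumed DeepSeek-V3 shape: 61 layers, latent dimension 512, RoPE key dimension 64.
deepseek_per_token = mla_kv_bytes_per_token(61, 512, 64, BF16_BYTES)

llama_gib = to_gib(llama_per_token * CONTEXT_TOKENS)
deepseek_gib = to_gib(deepseek_per_token * CONTEXT_TOKENS)

print(f"Llama-style GQA cache: {llama_gib:.1f} GiB")
print(f"DeepSeek-style MLA cache: {deepseek_gib:.1f} GiB")
print(f"Reduction: {llama_gib / deepseek_gib:.1f}x")
```

Under these assumptions the full-context cache comes out to roughly 63 GiB for the grouped-query layout and roughly 8.6 GiB for the latent layout, which is in line with the "eight gigabytes" versus "40 or 80 gigabytes" figures recalled in the conversation.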
See, and it's just so.
Yeah, it's more deep learning algorithms.
Yeah, more deep learning algorithms.
Yeah, more deep learning algorithms.
I could, I could actually pick up an answer.
Yeah, I could actually give you an example, not a theory, but something theoretical about it.
I'll go to that you're talking about it.
Well, and how about that?
I mean, I haven't seen.
So, it could be a tarpit, it could just not work.
But I would be really excited to see a model that does pre-fill and decode differently.
So, a model that does pre-fill locally, like document-wise pre-fill, where it does it in chunks.
And then you do decode globally across the entire sequence.
Because, logically, to me, it doesn't seem like you would necessarily need to have KV be associative between documents that have no mutual association.
But that places a lot of burden on decode, and on pure attention within the decode phase, to make those connections, since the KV is static at that point.
And you see other techniques that are interesting like this, too.
But if you're able to do that, if pre-fill becomes local and decode is still global, you solve that pre-fill quadratic scaling problem, because you have a bunch of small chunks that you pre-fill independently.
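The scaling argument above can be sketched by counting query-key pairs. This is only an illustration of the arithmetic for the hypothetical "local pre-fill, global decode" model the speaker describes; the function names and the example document sizes are assumptions.

```python
# Counts causal-attention query-key pairs as a rough proxy for attention cost.
# The "local pre-fill, global decode" model is hypothetical, as in the discussion.

def full_prefill_pairs(num_tokens):
    """Causal pre-fill over one sequence: token i attends to tokens 0..i."""
    return num_tokens * (num_tokens + 1) // 2


def local_prefill_pairs(doc_lengths):
    """Each document is pre-filled independently, so costs add instead of compounding."""
    return sum(full_prefill_pairs(n) for n in doc_lengths)


def global_decode_pairs(prompt_tokens, new_tokens):
    """Each generated token attends to the whole prompt plus everything generated so far."""
    return sum(prompt_tokens + i + 1 for i in range(new_tokens))


docs = [1000] * 100           # assumed workload: 100 unrelated documents, 1,000 tokens each
total = sum(docs)

global_cost = full_prefill_pairs(total)
local_cost = local_prefill_pairs(docs)

print(f"global pre-fill pairs: {global_cost:,}")
print(f"local pre-fill pairs:  {local_cost:,}")
print(f"pre-fill reduction:    {global_cost / local_cost:.1f}x")
print(f"decode pairs for 500 new tokens: {global_decode_pairs(total, 500):,}")
```

With equal-sized chunks the pre-fill cost drops by roughly the number of chunks, while decode still pays a cost linear in the full context for every generated token.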
Okay.
All right.
Well, let's, uh, wait and see.
But I think it'll be pretty exciting.
Bring your stress.
Yeah, I'm excited for pre-fill decode on separate hardware.
So, like, yeah, the Groq acquisition, right?
Can we decode on the Groq?
Can we guess if we're fast?
I don't think I'm allowed to comment on this.
The mark is going to issue arrows at us.
Uh, he's got a crowbar.
Yeah, he's in his eyes, really.
He's like, like, basically, yeah.
Yeah, uh, I'm super excited to see the team come in.
And, you know, I've gotten the pleasure of working with some of the Groq people coming in.
And so, you know, yeah, I know Sunny.
We've had him at the sea of Congress.
I say you're at.
Yeah.
Um, and I think you guys are going to be doing some sessions.
At GTC. I don't know if you want to, but this is a good place to plug them.
Yeah.
Yeah.
Yeah.
So, I can't speak to any LPU related sessions at GTC.
I have no idea about that.
I'm on the, on the grock said, yeah, I use this.
So, sort of in video you.
Um, on the NVIDIA Dynamo side, there are a large number of sessions, for those that aren't aware.
You can actually search all of these sessions for GTC online.
Just go to the GTC website.
I don't know what the URL is.
But go there.
Yeah.
And you can just look up Dynamo and you'll get all the sessions; there are about 20.
There are a couple that are hosted by the Dynamo team.
There are a couple that are hosted by people that use Dynamo and want to show off the results they've been able to get.
But there are two that I'm really excited about.
One is just the general Dynamo tutorial.
And this is the one I'm giving with Harry, who's our lead product manager for Dynamo.
And we're sort of talking about how to use Dynamo to get better performance, and also where we see Dynamo going in the future.
And then there's another session that I'm doing with one of our agents teams at NVIDIA to talk about the future of agents in production inference.
So what we're talking about there is this new horizon with respect to agents, because we have these harnesses that actually impart structure upon calls.
Like, if you compare the past and the present with respect to how LLM calls work.
In the early days, when there were chatbots, every call was very different.
There was basically no structure.
You could assume that, if it's conversational, there might be some implicit structure, because you have a multi-turn conversation.
But with agents you have this harness that abides by rules, right?
So it imparts direct structure onto the context.
And you see this: there's an interesting Twitter post about how Claude Code structures its context so that you get as many cache hits as possible.
And I think it was by one of the PMs for Claude Code, and he wrote about it.
And that type of structure that the harness can impart actually goes hand in hand with inference co-design.
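As a minimal sketch of why context structure matters for cache hits: a prefix cache can only reuse the tokens two requests share from the start, so a harness that keeps stable content first and volatile content last reuses far more. The token lists, builder functions, and the timestamp field below are hypothetical, not the layout any particular harness uses.

```python
# Toy model of prefix caching: reuse is limited to the longest common prefix
# between consecutive requests. All names and contents here are hypothetical.

def shared_prefix_len(previous, current):
    """Number of leading tokens the two requests have in common."""
    count = 0
    for a, b in zip(previous, current):
        if a != b:
            break
        count += 1
    return count


SYSTEM_PROMPT = ["sys"] * 50      # stable across every call
TOOL_DEFINITIONS = ["tool"] * 30  # stable across every call


def build_stable_first(history, timestamp):
    # Stable content first, append-only history next, volatile content last.
    return SYSTEM_PROMPT + TOOL_DEFINITIONS + history + [timestamp]


def build_volatile_first(history, timestamp):
    # A changing value at the front invalidates the whole prefix.
    return [timestamp] + SYSTEM_PROMPT + TOOL_DEFINITIONS + history


turn_1 = ["user-1", "assistant-1"]
turn_2 = turn_1 + ["user-2", "assistant-2"]

stable_hits = shared_prefix_len(build_stable_first(turn_1, "t1"),
                                build_stable_first(turn_2, "t2"))
volatile_hits = shared_prefix_len(build_volatile_first(turn_1, "t1"),
                                  build_volatile_first(turn_2, "t2"))

print(f"stable-first reusable tokens:   {stable_hits}")
print(f"volatile-first reusable tokens: {volatile_hits}")
```

In this toy setup the stable-first layout reuses the system prompt, the tool definitions, and the earlier turns, while the volatile-first layout reuses nothing.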
So I'm doing a talk.
I don't know the session name or the session number.
But I am doing a talk.
You can look me up by name on the GTC website. It's on how we accelerate agents and where we see specific optimizations for agents going in Dynamo and in inference in general.
Yeah.
I think there's only one PM for Claude Code and it's Cat Wu.
The rest, there's DevRel.
There's Boris.
Maybe it's Devrelle.
I mean let's go into agents.
I think this was like the last part of the discussion we planned.
How have we not talked about it?
Also, we scheduled it.
Okay, you know like let's have like cohesive sections or yeah.
I mean, there's big news, right? NVIDIA has a huge deployment of Codex.
Yeah, the term it uses everything.
Yeah, it's just cursor and we use this for that.
But that's a pretty big deployment, right? That's tens of thousands of people in total.
Yeah.
Yeah, it goes back to the mosh pit of emails we kind of mentioned earlier, or just how fluid the org feels.
So when there's new technology, people will just email it out and everyone will try it, and if it's making people's lives easier it'll spread like wildfire. A lot of times Jensen will get it and will be like, let's make this work across the company.
Yeah, I guess we're right now.
Honestly, if I was a startup, I feel like that's a cool hack.
If you have something that's going to save someone at NVIDIA time, they'll spread it to a couple of people, and the same thing, right?
It'll just spread like wildfire.
Like careful before you're e-mailable.
So, I mean, I love using Codex.
It's been a ton of fun.
I've been using it personally.
You've been using it at work it's been um yeah.
I don't know it's been great to see the rollout.
Something really funny: on the day that we got Codex and Claude Code access,
I found this person, his name's Carlos, at the company, who wrote an Outlook CLI.
Oh yeah.
And, uh, just a CLI for email, and this is what I've been using.
Yeah, like four or five weeks ago. And so once I got Codex access, I installed the CLI as a skill and I just asked it to go through all of my email, which is very messy.
So if I didn't respond to your email,
I'm really sorry, but I asked it to give me a summary and highlight any escalations that I should look at.
Put any thread that it thinks I should respond to in a folder, and organize everything. And it did.
So if I missed your e-mail it's because it didn't get.
So I should put a prompt injection in my emails to yeah.
What you should do is just space that or she's uh oh yeah my escalation is highest on FaceTime.
But that was magic, and so I sent it in a big email thread to like 500 people, and a bunch of folks tried it out.
I started FaceTiming whoever I could at the company to get them set up with this.
Yeah, um, with that specific example, you guys deal with some pretty sensitive emails.
Yeah, is there a security review with this?
Because one guy made it for himself, but it's not meant for all of it.
Yeah, the security team at NVIDIA is incredible.
Like shout out to them.
They're they're trying to be other.
You have an amazing security team because they're progressive and they know that this is really important technology and we have to bring it in.
If you think about it, if you were at a big company, your laptop's usually very locked down.
You can only access certain things. For NVIDIA engineers, those restrictions aren't there.
So you're expected to understand the risks when you try things out.
And so very quickly, you know, I made sure to chime in with security on what we were doing.
There's actually a lot that we've been thinking about, especially with OpenClaw.
Right?
Like, you know, agents can do three things.
Yeah, agents can do three things.
They can access your files.
They can access the internet, and now they can write custom code and execute it.
And you really should only let an agent do two of those three things.
If it can access your files and it can write custom code, you don't want internet access, because that's what they see.
If it has access to the internet and your file system, you should know the full scope of what that agent's capable of doing; otherwise it can get injected, or something like that can happen.
And so that's a lot of what we've been thinking about: how do we enable this, because it's clearly the future.
But then also, what are these enforcement points that we can start to protect?
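The "two of three" rule described above could be checked at an enforcement point with something like the following minimal sketch. The class, field names, and function are hypothetical and are not NVIDIA's actual policy code.

```python
# Hypothetical enforcement-point check for the "two of three" rule:
# an agent may hold at most two of file access, internet access, code execution.
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class AgentCapabilities:
    file_access: bool
    internet_access: bool
    code_execution: bool

    def enabled(self):
        """Names of the capabilities that are switched on."""
        return [name for name, on in asdict(self).items() if on]


def is_allowed(caps, max_capabilities=2):
    """Reject any configuration that grants more than the allowed number of capabilities."""
    return len(caps.enabled()) <= max_capabilities


coding_sandbox = AgentCapabilities(file_access=True, internet_access=False, code_execution=True)
everything = AgentCapabilities(file_access=True, internet_access=True, code_execution=True)

print(coding_sandbox.enabled(), is_allowed(coding_sandbox))
print(everything.enabled(), is_allowed(everything))
```

A check like this only counts capabilities; a real enforcement point would also have to make sure the disabled capability is actually unreachable, for example by blocking network egress in the sandbox.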
And then is there any directive of, like, hey, we have a company agreement with OpenAI, we use OpenAI models here, or is it, like, choose whatever?
No. So I would never put any company data in a model that we don't have some security around.
Yes.
Yeah.
Like, how do you get around the whole... like you said, obviously you could run your own models, you have Nemotron. And, yeah, we have an internal cluster, so we have, you know, of course... um, yeah, I think we were Dynamo's first customer.
Actually, there's a funny story about how I got the experience that informed what we needed for Dynamo.
At one point there was what's now build.nvidia.com, and also, for us, inference.nvidia.com, that lets people try models, because it's an API service: you can call the model with a REST API and you get a response.
I ran the model side for that, and it was at one point the largest inference deployment, and still may actually be the largest inference deployment, at NVIDIA.
I've since handed it off to some people and they're doing a wonderful job.
But this is an extremely under-known or lesser-known resource.
build.nvidia.com.
You can get any of these open source models, and it's rate limited, but it's free.
So it's perfect for hackers to stay at hand and end.
The SLA on getting models up is like a day.
Yeah.
Like, they're incredibly good at figuring out the right way to host the model to get it up there as soon as it comes out.
And you ran this?
Yeah.
I ran it a long time ago.
It was originally called NVIDIA AI Playground.
Then it was called.
Yeah.
I found it.
And then it was called build.nvidia.com.
And I ran the model side of it.
So there was a large multi-organizational team.
I ran: which models should we host?
How should we host them?
And like what's the proportion of them?
And then of course there was an SRE team that made sure that things ran well and scaled the models as well.
But I ran, you know, how do we get the model to silicon?
And then I also worked with our product team to determine which models were important.
A very long time ago.
Yeah.
There's also like a middle ground in between there.
This is like for the hacker try anything.
There's the Brev console.
Then there's Dynamo.
There was also NIMs, right?
Yeah.
I remember it had its little moment like a year or two ago.
Is it still?
Yeah.
And then nims, you know, inference, oil.
I think it lasted for something.
It's still longer.
Yeah.
Yeah.
It's just a NIM.
Yeah.
NIM is how enterprises can take any of the technology and run it with support and all of that.
And so that includes Dynamo, that includes, I don't know, all of our other optimizations that are packaged for enterprise.
Yep.
Yeah.
Anyway.
So you got a bunch of experience running the internal inference gateway and the playground.
Yeah.
I also built NVIDIA's first internal, like, VS Code thing.
You call the MB code it's like the exception.
Yeah.
It was of course, it was originally like the fork.
We guys going we jokes had so lonely not to choose the world back to you like we should have a four fioscode hackathon where you let's for it's a best four fios yeah.
We were earlier we were doing a how do you make a billion dollars someone from the us code was there.
And he was like someone down the gas involved and I was like, oh, you should do Yeah, that's what I said then that the cool thing became for Chrome hack at the Chrome, and now no IDs and not cool.
I was talking to Joseph from Roboflow, my partner in crime.
We were talking about how, with the new Alpamayo model, so NVIDIA just released an open source, the Mercedes cars, these thought directions aren't crazy.
Yeah, released, open sourced, an autonomous driving model.
I had it.
Yeah, so we were thinking like, could we hackathon the driver this car?
Like, I have my old car.
Let's just try it.
We'll take it out like, click, try it out with a treasure.
I liked it in the middle of the day.
Just see, like, everyone.
Yeah, like, how many cameras do we need, right?
Like, one, two, three, four.
I don't really find things.
I don't know.
Yeah, but I think we're going to try it.
You just do with us.
We can see, we can even have a race.
It's like the first person to automate their driving.
Let me over a weekend.
We do have an autonomy track.
It was fair.
Yeah, we almost there.
Like, yeah, Nvidia did sense people of those for a good, not because you didn't have to drive anything yet.
Yeah, it's cool.
I think Comma also has a version of this, and they've open sourced driving.
They've they've done a fun hackathon.
He and I as well, because I really, what I really want is a Tesla, it was Tesla, that was all driving.
Yeah, but as a Smart car, like a two-seater, that's basically a wheelchair with a roof.
And all the way, they make them and move on.
Did they get the demand?
His DNA is.
Yeah, they didn't waste their orders.
No, they they're like, is this really five years?
Yeah, really?
Yeah, they were a different manufacturer.
Where is it?
Oh, I thought he's one of those things.
We'll we'll see someone buy the brand and it'll be revived.
I would buy it.
Like, I'm proud of that.
We prefer to go.
Someone here is this go buy.
This is Yaka.
Yeah, that's crazy.
No, because they're like, I think my only handbrake says, I'm Mercedes.
I'm in their Sager who is to make them.
Yeah, I don't know if he likes it.
They own the brand and you out.
Yeah, your dream might come true.
Enough.
Okay, we're time to eat your ass and all of all right.
And every time I try to park in San Francisco, I yeah, I have to buy his hard car, because like to play him for a set of the parking lots and San Francisco only fits my cars.
Yeah, so we've picked it really.
That's what I mean, that's like mall.
You're really was late here trying to buy this come folks home on that like basically does not drive you.
That's where the Vespa was a life hack.
Yeah, exactly.
Yeah, you know what happened to the Vespa?
Well, he's at the Z yellow Vespa.
I left it outside the hacker house when we moved out.
It's trend.
It's just it was always there and then like a month ago, it's not there anymore.
I've been meeting to do.
We I don't know.
You could have liked it.
So it's actually been here.
He beat it.
There's you.
You forgot about it.
Yeah.
And let's okay.
Yeah.
No, this is probably has it.
And as we can hack this, I also wanted to give a brief shout out to the world's shortest hackathon.
Let's go.
You did twice.
We're going to what did a handful of times.
Yeah, there's going to be one at GTC.
Oh, we're doing a, pretty much we have a bunch of challenges that we haven't released.
And you get to bring your agent to come and attempt to go through those challenges.
It's like the zero-minute hackathon idea.
Because you just bring your I have rose eight out of a long long time ago.
You just bring your agent.
And then you press the go button.
You're not allowed to code.
It's just the agent doing fun.
It's a good agent eval, right?
Yeah.
Do you make a jarrel?
I thought this is something I would love to see from Cognition or someone else, like: come bring your agent, drop it in.
So you don't know.
You like to improve a little bit.
Be, you know, I'll break a browser order of pizza.
We'll just see it like that.
Chris, Nick, you know, or they'll be, and you don't know what the task is.
Yeah, you don't know what the task is.
Like, you don't even know what the judging categories are.
And then you give it the judging category.
They're like trying to win as much as possible.
It's great.
It turns into like, yeah, so let's build something on dyno poetry.
It's a great person.
It's a great funny story.
Actually, we have a couple of people at NVIDIA.
We've been working with security to bring agents, like, close to compute.
So we now have stuff where you can tell it, like, go run some experiments with Dynamo on X cluster and just try it right now.
Like, queue up, and once you get queued, send this request load.
And we've actually been able to just, you know, one-shot problems. Like, we set this problem where, with Dynamo, you have to find the right configurations.
I mean, we sort of do it automatically for some parts of it.
But you have to have a good initial configuration that you want to use.
And we just had an agent completely one-shot that. It goes,
It gets the compute.
It runs a couple of experiments.
It's like, this is the best.
These are part of the Pareto frontier.
Go run this.
And then we just give that to people.
And it's faster than anything that they have.
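As a rough illustration of the last step in that story, picking the configurations that sit on the Pareto frontier of a benchmark sweep can be sketched as below. The configuration names and numbers are made up for the example; they are not Dynamo settings or measured results.

```python
# Picks the Pareto-optimal configurations from a benchmark sweep.
# Higher throughput is better, lower latency is better.
# Configuration names and numbers are invented for illustration.

def dominates(a, b):
    """True if a is at least as good as b on both metrics and strictly better on one."""
    no_worse = a["throughput"] >= b["throughput"] and a["latency_ms"] <= b["latency_ms"]
    better = a["throughput"] > b["throughput"] or a["latency_ms"] < b["latency_ms"]
    return no_worse and better


def pareto_frontier(results):
    """Keep every result that no other result dominates, sorted by latency."""
    frontier = [r for r in results if not any(dominates(o, r) for o in results)]
    return sorted(frontier, key=lambda r: r["latency_ms"])


sweep = [
    {"name": "config-a", "throughput": 100, "latency_ms": 50},
    {"name": "config-b", "throughput": 150, "latency_ms": 80},
    {"name": "config-c", "throughput": 90, "latency_ms": 60},    # dominated by config-a
    {"name": "config-d", "throughput": 150, "latency_ms": 90},   # dominated by config-b
    {"name": "config-e", "throughput": 200, "latency_ms": 120},
]

frontier_names = [r["name"] for r in pareto_frontier(sweep)]
print(frontier_names)
```

Every configuration left on the frontier is a defensible choice; which one to run depends on whether the deployment cares more about per-request latency or total throughput.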
Agent UX and agent marketing are super important.
There's something we're thinking a lot about.
And, um, Alec is redoing the entire Brev CLI.
So that you can fetch all the different compute types that are available.
I don't know.
It's going to be really soon.
But then you can just browse what GPUs are available and provision one.
SSH into it right there.
And you can pipe all the commands.
But I think it goes back to the Outlook CLI.
Like, coding agents, it's kind of funny.
Really, coding agents have been so much more effective than general purpose agents.
And I think a large part of that is it just has access to the terminal.
Like you said, and that means it has access to everything that you've installed into your terminal.
So, you know, it can write code and it can compile the code.
And if there are errors, it can fix it.
It can run your suite of tests because that's all just in your terminal.
And so that's the idea, or why I've become really excited about the Outlook CLI. We're now just churning through building CLIs for the entire business.
We slaggleding snacks.
Also, we're a DCLI at APPO.
I've also done that for my so first.
Really?
Yeah.
We're going to, we're going to open source all of this.
And, like, yeah, I mean, they're just CLIs for the business applications.
We would love for someone to run with this and build, I don't know, like an open CLI foundation or something.
Yeah.
NVIDIA would love to support anyone that's doing this.
Like, every dev tool should really have good CLI support.
And at one point, it was: you want your docs to be accessible by LLMs, right?
You want, I'll let them go.
Now everything needs some CLI tool.
Yeah.
It's kind of funny, right?
Like, like, computing began with a terminal with a shell.
But we said that it's not empathetic to, yeah, humans.
So we built these nice user interfaces.
And then now we have LLMs navigating our user interfaces.
And ironically, we're not empathetic to the machine anymore.
Yeah.
Just give the LLM access to the shell.
One thing that's slightly, basically, in front of people is, like, why do you have to build CLIs?
So why can't we just expose APIs?
Like, I have an interesting answer to this.
So there are a couple of reasons.
Like, portability is one issue. Sometimes APIs are not discoverable or reachable, right?
But some, you know, touch of things.
There's some element of locality, right?
Like, with the CLI you're literally interfacing with the local system, which is a little bit different.
You could still do it by API.
But there's this highlighting of, like, what is the difference between a CLI and an MCP, right?
Like, they kind of occupy that same purpose: you call them, it does something on the system, and that's done.
I think that in pre-training, there's just an enormous amount of the online data.
Yeah.
Yeah.
Like, even, let's say you're doing no harness, you're doing no post-training.
Just the amount of like CLI versus API documentation for just like navigating this world of the CLI and your file system through that is just enormous.
Yeah.
Right.
I think there's a couple of things, too.
Like, let's say we want to... so one, I think your intuition's right.
The CLI is just wrapping the API, right?
So I'm sure no, the functional read, right?
Yeah.
I think it's nice because, one, you're being very specific, and pedantic even, and that's really good because you're describing the problem space.
So you know what the, I don't know, I don't want to call it, like, the space for vulnerability.
You know what network calls you're making.
It's not arbitrary.
And that's not decided on the fly.
That's pre-decided, which is important from a security perspective.
But then if you were to write a bunch of API requests, you would probably have the model, I don't know, use Python to do so.
I kind of like that, with a CLI, everything is just bash, because it's ubiquitous, like, it's just there.
And you don't have to make sure that there are certain environment variables set up, or your Python version. If we're using the same model to go do the same thing, is it going to write different code? It probably would.
And so it's kind of nice to do work, right?
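The point about a CLI wrapping an API with pre-decided network calls can be sketched as follows. The host, endpoints, and subcommands are hypothetical placeholders rather than any real service's API; the sketch only plans the request and does not send it.

```python
# A CLI as a thin, fixed wrapper over an API: every subcommand maps to one
# pre-decided request, so the set of possible network calls is known ahead of time.
# Host, paths, and subcommands are hypothetical placeholders.
import argparse

BASE_URL = "https://api.example.invalid"

# The full allowlist of calls this tool can ever make.
ROUTES = {
    "list": ("GET", "/v1/instances"),
    "stop": ("POST", "/v1/instances/{id}/stop"),
}


def build_parser():
    parser = argparse.ArgumentParser(prog="demo-cli")
    commands = parser.add_subparsers(dest="command", required=True)
    commands.add_parser("list", help="list instances")
    stop = commands.add_parser("stop", help="stop one instance")
    stop.add_argument("id", help="instance identifier")
    return parser


def plan_request(argv):
    """Translate CLI arguments into the single request they are allowed to trigger."""
    args = build_parser().parse_args(argv)
    method, path = ROUTES[args.command]
    return method, BASE_URL + path.format(id=getattr(args, "id", ""))


print(plan_request(["list"]))
print(plan_request(["stop", "abc123"]))
```

Because the routes are fixed in the tool, an agent calling it from a shell cannot compose arbitrary requests the way it could if it were writing its own HTTP client code on the fly.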
We haven't human.
Yeah.
Though, I think just like making those decisions happen ahead of time versus, yeah.
One last thing on this sort of agent, I guess, maybe co-location or whatever you call it, one pattern I'm tracking for this year.
I was trying to think about what's the theme of this year going to be.
Last year was definitely coding agents.
This year is definitely coding agents breaking out of containment into the broader real world.
I go definitely has to hear her rent a human.
Yeah.
Don't you want me to do it?
Yeah.
I'm one.
Are you real?
What happened?
I'm like $5,000 or so.
Do anything.
Really?
I think so.
I mean, I need the my policy from Costco.
But I think the best part is only the agent can book me, you know?
Yeah.
It's very usually, it's just like another labor marketplace.
Mechanical Turk was this.
So, I have a weird story with why I did it.
So back to your example of just giving an agent access to compute, right?
You guys are GPU rich at NVIDIA.
I hooked up.
He's not shy about it.
I have a 24-7 agent running.
I hooked it up to RunPod.
It doesn't shut down instances.
And I'm like, I've tried prompting it.
I've given instructions: shut down when you're done.
It's like, I need to keep it warm.
I'll need it soon.
It's horrible on time estimates, too.
Because like, they've realized it's like, yeah, I'll need it in 45 minutes.
45 minutes.
I'll shut it down.
45 minutes of human time is actually three minutes of agent time.
So it's like, I'm booting it up.
I'm waiting.
I'll just leave it on all night.
And most models go that chatting down after some inactivity.
I had it on my local server, like a little dual-GPU thing that just stays on.
It's my space heater at home now.
But careful.
So basically, you know, they don't care about the concept of money.
Just burn it.
I need it.
It's useful.
And another way we did you explore.
It'll be really nice.
I think I'm looking at it as it's much about super useful for agents.
Because, yeah, you buy it once, you plug it in, and it can rip.
I'm going to make it.
I'm going to make an NVIDIA ad here.
Okay.
The Blackwell RTX PRO 6000 cards are only, I think, $8,000, or slightly cheaper.
Yeah, well, it's much, it's much cheaper than the data center cards.
Yeah.
And it's got 96 gigabytes of VRAM.
So if you and your crew want to go run a local agent for you in the home,
I feel like it's got a significant amount of VRAM.
I thought about purchasing this and running it in my basement, except my neighbors would hate me.
It's just a single like two, three, two, two, three, two, two, two.
It's more like, it's a PCIe.
Yeah, it's a PCIe.
It's a PCIe.
It's a GPU.
You can go buy that.
I mean, the big difference against the RTX gaming GPUs is it,
I mean, obviously it's Blackwell, it's a pro GPU and it has a lot of VRAM, which means you can run pretty large models. You can stack four of them, for the Max-Q, in a system.
Yeah, this has to be.
It's beefy.
You can run where the 96 is bigger and anything.
That he's like, you're in a lowly seek, but also they, they are slow.
They're not, I mean, performance of speed will be somewhat slower, but the two API like, oh, yeah, that's sugar.
So again, the big learning is that economy of scale allows you to do things that get you both speed and throughput.
Like you can run, I'll give you an example.
There's an optimization called wide EP.
I'm not going to go into it fully, but it featured heavily in InferenceMAX.
And there's a great set of stories from NVIDIA and from SemiAnalysis about why wide EP is important.
But for MoE models, it's basically essential.
And you run it, like, the level of scale-up parallelism used for it is like 32.
So it goes beyond that eight-GPU area.
And it really, really, really is important to have that NVL72, GB200 NVLink to serve at scale.
And, like, I don't remember the, you know, cost improvement.
I think it's against Hopper, right?
Like, it's Hopper versus this NVL72 system.
You're getting like 35 times cheaper per token for a lot of the curve.
Yeah, which is crazy.
Yeah.
And normalized per GPU obviously is part of the GPU's cost, where the code, the GPU's part of the counts.
And one thing I'm exploring is, this year is also the year of the sub-agent, where you have the main agent, but then that also kicks off tools, which are in themselves agents that have limited agency.
And such amount of contact local deals, whatever, right?
Different prompts.
So for example, one thing that Cognition does is, before you kick off a search, they have, like, a fast context model, where you kick off a grep, or you just search across the code base.
And it turns out that is better than indexing a lot of the time, not all the time.
And you should still index for some things.
But the idea that agents should be able to command sub-agents, and probably run them maybe close to the inference, I don't know if that's architecturally possible or... yeah, we're thinking about that for Dynamo.
That's, like, our big theme for the year.
Because if you can design that into your stack, then a lot more people will use it. Right now,
it's just kind of theoretical, because you do pay a lot of back-and-forth coordination costs.
Yes.
I think you'll net speed up, right?
Like, even at a basic level, speculative decoding: you're running a small model.
You're running two instances, but it's a net speedup. That is one example, yes.
Yeah.
But this is like a little bit like different with, like, agents, agents.
Yeah.
I think there's a summarization of that trend that I like to say to my team.
It's like, this is the year of two things.
This is the year of system as model, right?
Where, instead of having a single model be a thing, you have a system of models and components that are working together to emulate the black box model.
So when you make an API call to something that's multi-agent in the background, it still looks like an API call to a model.
You're still getting back tokens, but under the hood...
Yeah, under the hood, it's like a billion different models.
And that's a lot of complexity, and with Dynamo and with other libraries at NVIDIA, we're looking at how to manage that.
Compliance funding.
We actually, for CES, just released a model router for DGX Spark, where you can have a local model that's running on the Spark.
And then also a foundation model.
And then the model router decides when to send queries to which one.
So it's no longer either/or.
It's use the best of everything that's available to you.
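A model router of the kind described above can be sketched with a deliberately simple heuristic. This is not the router released for DGX Spark; the thresholds, keywords, and backend names are assumptions made up for the illustration.

```python
# Toy model router: send easy, short queries to a local model and everything
# else to a larger hosted model. The heuristic, thresholds, keywords, and
# backend names are hypothetical, not the behavior of any released router.

LOCAL = "local"
FOUNDATION = "foundation"

HARD_KEYWORDS = ("refactor", "prove", "multi-step")


def route(prompt, local_max_words=200):
    """Return which backend should serve this prompt."""
    if len(prompt.split()) > local_max_words:
        return FOUNDATION          # long prompts go to the larger model
    lowered = prompt.lower()
    if any(keyword in lowered for keyword in HARD_KEYWORDS):
        return FOUNDATION          # prompts that look hard go to the larger model
    return LOCAL                   # everything else stays on the local device


for example in ("What time zone is Tokyo in?", "Please refactor this module for me."):
    print(f"{route(example):10s} <- {example}")
```

A production router would more likely use a learned classifier or a confidence signal from the local model, but the interface stays the same: one prompt in, one backend choice out.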
You have a good post-training model that's running on.
These are our leads that are also the bread from channel that you're being able to manage the spark.
Oh, yeah.
I did know that.
Yeah.
I did Jay request.
Did you?
Yeah.
There we go.
I actually like a question like I like to like a stand in football.
How much longer do you guys think like agents are going to be running?
Is that something I've been throwing around like what happens when I mean always aren't it even it fits the like back to the pre-filled decode, right?
Like, Codex, I'd say, compared to Claude Code, goes much longer on tasks.
Like yeah.
I think we'll like to run six, seven, eight hours.
I'll run it overnight.
Yeah.
And I'll go back and I have like a little crappy logging software I use.
And there's just times where it wants to like, no, I'm going to go deep on research and it'll eat up 80,000 tokens, go on another, go on another.
Yeah.
Just eat through tokens.
And you know, that's part of it.
Like at the end it does, it does hit a long task and I think you will only see that.
But yeah.
Yeah.
There's insatiable demand for tokens and every improvement that comes kind of just makes our demand even higher.
It's kind of funny.
Like, if you have a teammate and you're asking them to do a task and they're like, should I save some effort and not think too hard about this task?
Fuck no.
I mean, my favorite.
You can have four shots, right?
Like the original codex before the app.
Why do one call?
Like give it four attempts.
Just use all the tokens, like, all right.
Try more.
Try again.
Try more.
It's like the METR index, right?
Is the thing that tracks like how long models are able to run?
I expect that we'll just see log-linear, if not log-superlinear, growth.
We will see, before the end of the year,
an agent that is capable of running for longer than 24 hours with self-consistency the entire time.
I would also poke at different domains having different desires, right?
Like, at a turn-to-human level, I'm getting slightly frustrated at 20 minutes per basic query.
Sure, you can optimize, you know, six, eight hours.
I don't see myself shooting off many one-week agents, right?
Someone doing like, okay, GPU kernel research or medical or biological, like, you know, in those domains, sure, shoot off a lot that take them out of hope.
So I think it will be somewhat domain-dependent, because you also really need to train that in.
You know, it's funny.
Well, it goes doing your taxes, right?
Like that's tax.
Yeah, it's got you almost.
Yeah, okay.
Yeah, exactly.
Get it right.
I wonder if the next sort of, uh, speculative decoding is, like, your agent figuring out at night what you might be prompting it the next day and, like, prefetching.
Yeah, you know, really good at branch prediction.
Oh, what, no, that's, that's, that's too low level, but yes.
Sorry.
Yeah, yeah, yeah.
Well, one question I got to get is, so like, we actually did record a pod with the, the METR folks, uh, the Sarah, right here.
Their chart is the human-equivalent work, uh, hours of work, rather than how long the agents themselves are being autonomous.
There's a huge difference, right?
If a human works five hours and it works 30 minutes, like, it's actually 30 minutes, not, uh, yeah, five hours, right?
Like, so like, that chart that you see is them estimating what the human equivalent replacement is.
Um, I think actually Anthropic released, uh, more of a chart that showed Claude Code autonomy from their production traffic numbers.
And that was 20 to 45 minutes.
That's roughly where we are.
So yeah, that's, that's the sort of realistic thing.
I mean, I do think, like, there's experimental setups, you can just, like, Ralph-loop them, that just prompt it to keep going when it stops, and obviously that can go arbitrarily long.
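The experimental setup described above, where an outer loop re-prompts the agent whenever it stops, can be sketched roughly as follows. This is a sketch under stated assumptions: `run_turn`, `is_done`, the `nudge` text and `FakeAgent` are all hypothetical names for illustration, not any specific tool's API.

```python
from typing import Callable, List


def keep_going(
    run_turn: Callable[[str], str],
    task: str,
    is_done: Callable[[str], bool],
    max_turns: int = 10,
    nudge: str = "Keep going.",
) -> List[str]:
    """Re-prompt the agent each time it stops, until done or out of turns."""
    outputs: List[str] = []
    prompt = task
    for _ in range(max_turns):
        output = run_turn(prompt)
        outputs.append(output)
        if is_done(output):
            break
        # The agent stopped without finishing, so nudge it to continue.
        prompt = nudge
    return outputs


class FakeAgent:
    """Hypothetical stand-in agent that finishes after a fixed number of turns."""

    def __init__(self, turns_needed: int) -> None:
        self.turns_needed = turns_needed
        self.turns = 0

    def __call__(self, prompt: str) -> str:
        self.turns += 1
        if self.turns >= self.turns_needed:
            return "DONE"
        return f"partial progress after turn {self.turns}"


outputs = keep_going(FakeAgent(3), "build the project", lambda out: out == "DONE")
capped = keep_going(FakeAgent(5), "build the project", lambda out: out == "DONE", max_turns=2)
```

Without the `max_turns` cap the loop can run arbitrarily long, which is why wall-clock time from such a setup says little about how long an agent stays productive on its own.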
Really from my experience, yeah, I guess 20 to 40 minutes seems right for what I'm using, like, Codex or Claude Code.
But then like, if I want to spin up, like, a net new project, I'll often start in Replit.
And like, it only unfold, I believe, anchor.
Yeah, like, with their new, like, V3 agent, it'll spin up a web browser and, like, click around and discover new bugs and just keep churning.
Um, so I think, like, my longest was like over an hour that it had been churning.
I think before we see super long running, I think there's going to be a bit of an efficiency hit.
So sure, you can take an hour and go down paths, but you also want, you want to be more efficient, you want to be smarter in your reasoning, right?
So I think that'll actually go down before we go back up, like, you don't want to scale non-optimized systems just for the heck of it.
As much as I love saying use all the tokens, you know, they are expensive.
Like going from dense to reasoning models, that's an added cost, or you're paying for a lot of tokens.
And it doesn't make sense to just scale stuff that's not optimized.
So there's, there's always a little balance.
Yeah.
But, you know, I think you'll see both sides of it.
Yeah.
So 23 was super exciting.
I think if you were in SF, you were like, okay, I know this is going to be a huge world-changing moment, but it seemed like, you know, no one had known yet.
And maybe even before was it 2022, maybe?
Yeah.
I was like, yeah, I ruined at this tweet where like, everyone was in SF from like 2021 to '23.
Yeah.
I understood what it was like to be like, already.
Totally.
Yeah, 2021.
That's when I made my first OpenAI account.
Yeah.
It was crazy.
And I remember it was so funny because at the time, SF had not been doing well.
So pretty much, what it felt like was the concentration of founders in the city had risen, because where my neighbors were used to doing a bunch of stuff, those people had all left.
So the only people that were still in the city were people that really wanted to build, it was cheap tech.
It was, yeah, it was also way cheaper.
I feel really bad for anyone who is trying to get rent now.
But there was Celo, they had a huge office.
So, blockchain, it took over the old Casper building.
Yeah.
They had the showroom and they had the, like, the, well, I think was like the back warehouse.
It was, it was a huge office.
And it's right across from OpenAI and Neuralink.
Yeah.
Is it the original arena?
I named the arena because of it.
Yeah.
Yeah.
And so it was really exciting because like Roboflow, I think I forgot the really lifai.
But yeah, meant lifai, uh, Brev was there.
You guys were there.
I remember that was actually, it was there that you bought the ai.engineer domain.
Yeah.
I didn't know what else was going to do in ai.
Yeah.
I want to do something.
But it was kind of this, it was a really fun moment where we were kind of all in this Celo space.
And it, um, I don't know, it was, it was a really cool community, especially being so early.
Yeah.
And so it came in, you got me early Cruise access.
Oh, yeah.
So there was a going for a time.
Both Cruise and the Waymos were just free.
Yeah.
Always seeing.
If you had it, I mean, they're, they're so back. Celo's opened again.
Yeah.
Hi.
So nature zooms.
So it's actually really cool.
And you guys have this studio so close to, uh, Celo.
Yeah.
There's this rock climbing gym right around the corner.
Like, um, to God.
Oh, yeah.
So um, yeah.
It's an awesome block.
Whoa.
Yeah.
Just in the little bit of a Sarah, because it's going to shoot, but I do think one thing I try to do with the podcast is, like, bring what it's like to be in SF to the rest of the world.
Yeah.
And it also just like, maybe give, uh, I'll take a look at it.
Yeah.
Yeah.
My favorite talk was in the city.
And yeah, stick and stream.
I know it's very good.
Yeah.
And I guess what it's like to be in San Francisco, I think it's just everyone seems to be super supportive.
Uh, sometimes I feel like the city believes in me more than I do.
And even, uh, I don't know if you remember, but I remember posting my first blog post and I had met you on Twitter and you gave me like an hour of your time super randomly and you kind of coached me through, uh, writing content for developers.
And I was trying really hard not to come off salesy or plug myself.
And so I kind of stripped all personality out of the blog post.
Yeah.
And you, you brought that out.
You're like, people do it.
It's okay to talk about what you're doing.
Like, you don't have to be weird about it.
And I remember just that, that I think that really helped me kind of figure out what our voice is and not shy away from it.
And so always really grateful for you.
Hey, you inject your voice into like everything.
Yeah.
Actually, you did manage to be like very genuine about what you care about.
Yeah.
Yeah.
You imagine like summer at summer in for Cindy MCU.
And like, it's like, can you give me feedback on this blog post?
And it's pretty boring.
And you're like, fine.
Like, you know, he looks interesting.
I'll just do a Zoom call.
And then you meet this guy.
Yeah.
All right.
He's so energetic.
So he'll be right there.
I think people are trained to write a certain way in school.
And yeah, they never totally see there's, like, a broader world.
And on writing: writing is thinking.
And like everyone thinks differently.
So like, and as well as just like, yeah, yeah.
Fight your way.
Cool.
Well, thank you for indulging with us, uh, really probably in discussion, but I love, like, you guys are, like, the sort of young faces of NVIDIA, with so much energy.
And, but like, also a lot of technical data.
And I think, uh, people better than about for this session.
So thank you.
This is awesome.
Thank you.
As a thank you for everything that you've done and you take a good talk.
Yeah, and yeah, the fun.
Guys, follow the above and see you at GTC.
Have an hour really.
And wait for it to end.
Yeah.
Cool.
Thanks a lot.
Thank you.
Thank you.