
Latent Space · 2026-06-01
Ethan He on xAI's Video Agent Models and the Future of Multimodal AI
Hosts: Alessio Fanelli, Swyx
Guests: Ethan He
Why it matters
xAI rapidly built Grok Imagine video models within months, leveraging strong compute and iterative training.
Key claims
- xAI rapidly built Grok Imagine video models within months, leveraging strong compute and iterative training.
- Synthetic paired text-video data is essential due to poor natural alignment of internet video metadata.
- Video models compress frames into latent tokens to reduce sequence length and enable transformer training.
- Most quality gains in video generation come from language models acting as prompt rewriters or agents.
Briefing memo
Summary
Ethan He, formerly of NVIDIA and Cosmos, joined xAI in early 2025 to help build video foundation models, contributing to the rapid development of Grok Imagine. He emphasized the critical role of strong compute infrastructure and iterative training cycles in accelerating video model development. Ethan detailed the challenges of training video models, including the need for synthetic paired data due to weak natural text-video alignment on the internet, and the importance of compressing video into latent tokens to manage sequence length.
He highlighted that major improvements in video generation quality increasingly come from advances in language models that act as prompt rewriters or agents, rather than from video diffusion models themselves. Video agents combine language models with video generation and editing tools to produce longer, interactive, and real-time video content. Ethan envisions a future where generative UIs powered by video agents replace traditional interfaces, enabling highly interactive and personalized experiences.
Ethan also discussed the technical challenges of real-time, long-horizon, interactive video world models, including context length management and temporal alignment of audio and video. He sees video agents as a stepping stone toward fully interactive world models that can be used in applications like robotics and immersive environments. Finally, Ethan shared his motivation for leaving xAI to focus more on language model research, believing that language models are the key bottleneck and driver for future multimodal AI advancements.
- xAI rapidly built Grok Imagine video models within months, leveraging strong compute and iterative training.
- Synthetic paired text-video data is essential due to poor natural alignment of internet video metadata.
- Video models compress frames into latent tokens to reduce sequence length and enable transformer training.
- Most quality gains in video generation come from language models acting as prompt rewriters or agents.
- Video agents integrate language models with video generation and editing tools for interactive, long-form video.
- Real-time, long-horizon video world models require managing massive context lengths and temporal alignment.
- Generative UIs powered by video agents could revolutionize user interfaces by generating pixels directly from user intent.
- Ethan left xAI to focus on language model research, viewing language models as the critical bottleneck for multimodal AI.
Source material
Transcript
Okay, we're here in a studio with Eden Ha, most recently of XCI.
Welcome.
Yeah, thank you, glad being here.
We were so here with Vibu.
You were first coming to us or joining Lily in Space World because you were working on Cosmos in a video and you did a great paper.
We loved it.
You presented it as well.
So thank you for doing that.
Yep.
I also presented the memories.
Yes.
Yeah, yeah.
How did you actually hear about us?
Did we reach out to you?
Is it how it worked?
No, actually.
It's a community.
Like I realized, oh, there's this online community.
Yeah.
We had people talk about AI and also learn, learn from each other through papers.
Every week, sort of paper clock.
It's very nice.
Yeah, I learned that.
I think three years.
Non-stop.
We haven't stopped even on Christmas in years.
It's meant many weeks.
I want to stop.
Thank you.
It's good.
No, it's good.
I think you posted that you worked on a paper and I was like, oh, very cool.
We have a paper club.
Present.
Yeah.
Yeah.
Yeah.
Because it's an amateur club.
Yeah.
So it's very unusual.
But we have sometimes people authors come by and actually explain the paper.
Today we just did the poolside paper, which is apparently very good.
Came out yesterday.
Pretty interesting.
Right.
Fully open.
They talk about everything.
Systems.
Is it good?
Bring us up to speed on your transition to X-A-I.
Because I actually don't even know when you joined.
Just like tell those stories about the transition.
Before X-A-I, I was working on Cosmos Words Model as an atom video.
So Cosmos is a giant video foundation models.
I can.
That aims to similarly is a war.
And for it serves as a foundation for all of the roboticists to build on top of.
There.
Once I build the Cosmos one, I realize as they're saying.
Also has a scaling loss.
I'm going to learn English model.
We need to scale up the video models further.
That's why I realize I need to move to somewhere with much more computer resources.
That's how I did in video.
Did you get this game in the song?
Yeah.
And timeline wise, when was Cosmo?
It was pretty early, right?
It was open world model.
Yeah.
It was like end of 2024.
End of 2024.
Yeah.
Then at, at made 2025, I moved to X-A-I.
As that time, I joined the base of time when X-A was about to build video models.
And in much model models.
Never know.
No infer.
No data.
No model.
And just a few engineers.
We built it in three months.
It released the first model grocky magic in 0.9.
And since then, I keep working on video models and move more from portraying and to post training of video models.
For example, like reference to videos.
Kind of like the camera feature and video extensions.
And before I left, I worked on a work model.
They didn't have a small team to focus on the real time, long hours of video generation.
Can you give a rough roadmap of like, okay, you're on a brand new team.
Grock previously was only tags.
So they partnered with BFL for their image and stuff.
What do you, what are the building blocks, right?
You have compute data.
You can procure somewhere.
Like you just, you know, what are the sequence of things that people should think about when you're setting up a new team?
I mean, actually, even deeper, not just data, you can procure.
You guys have to go through getting the data too, right?
So you shipped it pretty fast.
But yeah, a few months is like, from actually very surprised.
Yeah, one thing, I see like thanks to my experience at a video.
Because first time when we were building costments together, we built it for about a year.
So this is like the second time I do it.
Roughly, roughly have an idea like what to do.
I see the most important thing is it's a talent.
Everyone, everyone, we're very strong and clever, very close with each other towards a common goal.
So that's beat up things a lot.
So you reduce the communication bandwidth among people.
And everyone can kind of work towards the same goal.
It's like, every day, there's not that much meeting sounds a kinder.
Maybe like a sink a day.
And after that, it's just all building.
It was pretty fun at that time.
And another thing is that XI has very strong foundations of data, data, inference, model inference.
And the supporting there can help the model develop a lot.
When I look at like training models, I don't.
So I feel the top important thing is like how many iterations can you do like per day?
And the more iteration can you do, you can trends a model much faster.
So if you have very strong infera and you have a lot of compute, you can trends these models in very short period of time.
That can give you a much larger buffer to for arrows.
And it also gives you the opportunity to spot more backs.
Yeah, what is an iteration?
Is it like a few hundred steps or whatever?
Let's say just trend trending the model like from a car new data and maybe design new algorithms and trending new model, maybe as models.
Yeah.
So cycle time for like any hyperparam that you're in.
Yeah.
So cycle time and trend.
You know, like evaluation model is a model better.
It's my previous iteration.
Yeah.
So it's like before you, someone had already set this up.
That you can iterate very quickly.
Yeah.
I think the foundation there is extremely good for developing and research, research models.
And often I find that it's kind of boring.
Like a lot of the improvements.
That's not come from algorithms.
It comes from finding small backs here and there.
And then I've applied in the in the model training pipeline.
Those get those give the biggest boost to the model quality.
It's interesting, right?
So you say it's like small team less communication bandwidth.
But also a lot of quality is like fine little bugs.
It seems counterintuitive.
You have a lot of people.
You can iron out more of those.
But it's interesting to see the other side.
Yeah.
I also want to have you.
Do you try using a lens to look for bugs?
I don't know.
I remember as that time, it was made 2025.
So it's the coding model wasn't quite there yet.
I remember like December 25.
It was extremely good.
Yeah.
I've been I've been using it as a time.
It's it's helpful.
But sometimes it produce codes that are kind of difficult to maintain.
Even though like the first time, it builds something extremely fast.
But it gave the like a spaghetti code.
Thousands of lines that I could maintain.
And the LM itself could then figure out what's what's wrong and how to improve on top of it.
But now I find it much much much better.
Yeah.
I want to wrap another point here is like now coding models are much more easy.
And it can help help us implement stuff much faster.
Compute might become a bottleneck again.
Because previously like if you want to trend on your models.
So you want to generate new synthetic data.
And then or write a new algorithm.
It might take a few weeks.
And during that period of time, you might not have experiments to run.
But now you can build that thing within a few hours.
Then you can immediately trend a model.
Now you have to have enough compute to try all of the ideas.
So compute might be the bottleneck is iterating speed again.
Yeah.
I actually honestly, I think like it's kind of a stressful job.
Because you're like, well, I should be trying everything.
And if I'm not, and I'm not doing my job well.
I mean, there's also the stress of your eating thousands of GPUs per hour, which is very expensive.
And you know, compute can go down with the researchers.
Daddy, you know, daddy, you know, there's still finite amount of compute.
Like you want to use it.
You want to use it.
You want more of it.
That was quite stressful indeed.
Yeah.
One thing is the with coding models.
No, like a lot of these jobs can be automated.
There's just much better.
A second.
It's a marathon.
So you've got to maintain good health and the.
The regular schedule.
It's hard to hear that when you ship from zero to nothing in two months.
Yeah.
I mean, I think obviously the culture.
Actually, it's very famous.
You know, people people work very hard.
One thing I did want to dive into, you know, in a nose that you sent ahead of time.
You had specific comments about the cost of video generating.
Physiomobly is this on that class is one right.
Do I do make a work cluster?
And whatever you want to share.
I think there's there's three things to talk about.
So there's a video agenda.
There's also the image and model that you put out.
Do you want to like complete that?
Okay.
It's a zero to one.
You have a few months.
Just what are the stages of.
Create.
Oh, yeah.
And then, you know, from there, there's video.
Gender's audio.
I'd love to get into those next.
But what is that first few months like so small team.
A lot of bugs iterations were like, you know, what is it like to take something off the shelf to we just.
Get data compute what's what's the few months like how do you go to.
Say that they are image and model.
Do you just start?
Yeah.
I cannot comment specifically how excited.
But it's it's so quite standard process.
I can draw.
Draw some examples from cosmos.
So mainly it's like building building video model.
You actually need to build the image model first.
And building building these two models.
The data you need is a 100% synthetic pair.
If language and image or language to video.
Because on the on the internet, actually the the videos don't naturally associate this text.
So we can say, oh, like on YouTube, you have the title and you have the description and the comments as a video.
But you're only there not relevant to the video itself.
And say, maybe like the video is a natural scene of mountains or something and the title is like, I'm so happy today.
So they have not, they have not no correlation at all.
So the first step is to you have to can a synthetic pair of language with the videos.
So you guys are videos from the internet.
And you use a VR lamp to caption the videos.
So that's part.
Here's a question like, how do you how do you get the VR lamp to begin this?
So like if there's no you fuse the model right?
Say, if there's no like VR lamp exists, like how do you channels the text to the beginning, right?
It's impossible.
Okay.
In the beginning, it's like you ask human to describe the video as a detailed as possible.
For example, you ask them to describe everything.
Like all objects, all characters and all interaction and dialogues in the videos.
So that's in the protocol of cosmos labeling.
We require the objective.
It gave to the laborers was that you have to describe the video as detailed as possible.
Such that a blind person hears a blob of text can reconstruct what the video is like from their head.
Video image.
You're talking about the video or image either.
Either one of them.
Okay.
This was pretty common when we went from like clip and dolly, right?
Yes.
Training on really detailed captioning of images.
So same as applied to video.
But if using multi model model to pass in video images and write with descriptions, you can also.
I mean, I think there's this traditional perspective of supervised or very highly human created thing.
I feel like there's a unlocked with unsupervised.
Right?
Where you have enough to bootstrap that you can just throw common corpus on it or, you know, whatever.
Like unsupervised vision and language pairing, right?
Like where you just have interest burst image and text.
And it just learns.
To me, that is the vial language through.
That is different from the clip.
Different from the pre-alem era.
Yeah.
Yeah.
It's interesting to see that you kind of need both data.
Yes.
For example, for me to bootstrap it up.
Yeah.
For the Chinese and model training, there's also usually like a small percentage of just non-label data.
So the model is interested to generate a video result and the text instruction.
That can also help the model generalize.
So after, after this stage of generalization set up here.
So one important common step is to try to compress or tokenize or if the image or videos.
So because if your trend, if you can technically, it's theoretically trend image or video models on pure pixels.
But the problem is that it's a lot of tokens.
So like what image?
Like, it's a southern biosousin.
It's like one million tokens, one million pixels.
It's impossible to trend transformer on that.
So you need to try to tokenize or which can go from image to latent space and latent space back to image.
So when we name the podcast.
Exactly.
But basically talking about vocabulary.
Yes.
So what is a million is impossible?
In general, the model, the vocab is continuous.
It's a continuous space.
We can think about like you map an image to a vector.
It's a fixed lens vector.
It's like a 16 or 40 something like that.
And then you map that vector back to the image space.
And the mapping is, the mapping is patch-based.
So you say you have a 16 best 16 patch.
Then you map that patch of pixels into this latent space.
We've covered this.
This is a vision transformer.
Yeah, VAEs.
You basically compress your input.
You do your generation, your reasoning all that generation and smaller dimension than you project back out.
Yeah.
VAEs is a form compression, but I think that for me the patching thing is from the IT, right?
Yeah, you can make it.
Literally the paper is titled like 16x16 is all you need.
Something like that.
Yeah.
And then I think also people make a lot of comparisons with this kind of patching with convolutions.
Yes, yes.
Which is, you are kind of reconstructing the old paradigm with the new.
Yeah, actually in VAEs, there are both.
convolution networks and transformers.
You can actually do both.
Yeah.
After this VAEs, so what you've got is you've got latent space tokens.
And you've got the language tokens.
So now the training of the diffusion transformer, you really generate models use diffusion transformers.
It's actually a quest standard.
It's very similar to how you train a language transformer models.
It's not that much difference.
It's just a tokens.
The visual tokens in visual tokens out.
The only difference is they are set in the noisy process.
So you train the model to our mask, some of the noise.
Say you add, you add random noise to the visual tokens.
And then you train the model to remove those noise to, to generate a clean tokens.
And they inference.
The model can iteratively remove noise from 100% noise.
Yeah.
And then there's also, to speed things along on the text tree of diffusion.
This CFG.
Yeah.
And then there's also, I guess, latent diffusion.
There's someone in there.
I think somewhere along the line, obviously like stability and all these other guys pioneered a lot of this architecture.
I don't know if you want to get into there or just go or do the video side after you.
After you train such a model, such image model.
The reason is a foundation for video models is that image models are cheaper to train.
And they have much denser connection between language and text.
So sorry, language and images.
For example, you train a billion, you train a billion images.
And there's a mapping from the text to the image.
And the cost to train the same.
Like the, a billion, a billion text to a billion videos.
That's much more expensive.
Because videos naturally have more tokens than images.
Because the diffusion models, their understanding is language purely come from this mapping.
So if you don't have enough mapping.
So if you only train like a 10 million video or something, there you might not see enough language tokens in your training.
So your model does not understand human intention in us.
So that's why you really, you train your first transit image, the future from models.
And then you boost traps the video model from there.
One thing I did want to ask because actually I think you're the first video model person I've ever talked to.
I think we've talked to Luma and all those folks.
There's all these tricks in video compression where basically frame by frame, there's not that much difference.
So actually you don't have to regenerate or it's reset the whole frame, right?
I think and before compression or something else.
Is it tempting to use that or as far as I can tell, everyone just treats it as, no, we would just generate every frame.
Is it roughly the state of the art?
There are a few different approaches.
Let's say first like you want to just directly use MP4 compression and you use that as tokens for the transformers to train right?
So people actually have tried that, but the main challenge is the latent space for the MP4 tokens or not were not very comprehensible for the models.
It's extremely hard to train on that.
And there's a, so that's why they create a very easy, which creates more continuous latent space.
So the models can understand that latent space and learn from it much easier.
Even they send a very different difficulties of the latent space.
You can imagine something that the simplest, the most naive way is like you have an image and you just shuffle all of the images into a vector.
So you don't need to train an AVE, right?
But that latent space is extremely hard for models to train on hoppers.
That's why there are some debate on how do you compress the tokens.
So you can compress frame back frame.
Also you can compress the temporal dimension.
Yes.
So differences, if you compress temporal dimension, you get a much higher compression rate.
Because there's just temporal redundancy between frames.
Because this frame and the last frame likely they are mostly similar.
So there's only some small difference.
For example, I think in 1, 2.1, they have like an 8 by 8 by 4 compression rate.
So the 4 temporal tokens are compressing into 1 tokens.
That can save a lot of the context lens.
If you do a frame back frame, you have to do maybe like 8 by 8 by 1.
Your context lens will be 4 times larger.
That being said, the benefit of the frame, per frame compression, we might come back to this later, is a real timeness and the interactivity.
Because if you, if you train the opposite of the model, frame back frame, you can, as a model, can respond to any user request immediately.
So if you have like a temporal 4 compression, then it might be laggy.
Yeah, there's also lag there in nature.
So you're very filled on this.
Let's just go ahead and bring it up because we have the visual prepared anyway.
There's some frontier applications of real time video and gen.
So flipbook is one of the examples they went viral recently, right?
What is flipbook?
Flipbook is kind of like a white browser.
You can see like it has the white browser UI on top.
The difference is all of the US are generated by generative image model in real time.
And they think here are fake.
But you can explore inside this imaginary world.
Like we actually have engineering, the great pyramid.
Like the model of generative is for us to understand how it works.
And if we want to navigate around on the same further, we can click on some of the, some of the description here.
And the model of the generative is a new page, new sub page, describing the details we want to know about.
So it's basically kind of, we're playing a video.
But it's pausing for our next interaction.
And then it just plays the next thing based on our interaction.
Yeah.
This is kind of cool.
Yes.
And then you kind of decide your story.
You know, how do you make a pyramid?
Levering technique seemed interesting, right?
It shows how do you take, okay.
I don't know.
The demo feed had more animation between frames.
I think it's just skipping.
I was just skipping a lot of frames.
It's also a lot of video mode.
But I guess a lot of people are using it.
Yeah.
There's a live video stream.
We can try.
Yeah.
So this is an example of the kind of future that you see at the extreme.
We don't, we're obviously not in it today.
But in a world-weight inference, it's completely free.
Yeah.
This is better than generating code in text.
Yeah.
So this is a final state of the relatively bad for work model, I think.
Imagine, imagine, internet doesn't exist.
And then you tap in Google.com.
Like, what should a model show you?
It's a model can't imagine something.
And this is what the model imagined.
And these web pages come to you to not exist.
So I think as a inference cost come down, we are going to have a generative UI for everything.
If you think about how the coding model works.
So they write code for web page and they render the code to be covered into binary.
And the binary render the pixels on the screen.
So we, in machine learning, every time we have some breaks through all this, it's more intrand.
So why don't we have, like, user instruction to the pixel directly?
So the generative UI will be user intention to, to the pixels directly.
And say like, even if I want email, I want to say everyone ever have the same interface.
But I wanted a slightly different.
I wanted an email to show to me, like, a TikTok.
So I guess so I've gotten ready for the emails.
Or maybe you want something else.
We can have, come see different things.
Or like, I have, and look at Instagram stories.
I don't like the, like, button.
I always make clicks.
And the generative UI is audit.
So it's going to be a revolutionary replacement as the interface.
So in the future, we might have much more powerful alms and coding models running behind the scene.
And in the, in the front and the dishwasher model, that actually be the front end to show stuff to you.
That's how I imagine it.
Yeah, diffusion front end deterministic back end.
Yes.
Something like that.
I find it very expensive.
But, you know, I find it interesting.
You called alms writing code on the back end deterministic.
But, okay.
Yeah, you write it once again.
And then you execute.
If you think about the cost, say, let's say it's run hand your cost $1 per hour.
And if you use this, it hours a day and 30 days.
So I'm, every month, you're paying this to 40.
You'll actually not, not want to pay for that.
That's even more expensive than Cloud Cloud Max.
But, if you think about the compute cost come down like two times a year, I think that the future go like the arrive.
It's interesting.
Compute cost comes down.
Compute gets faster.
Model gets smarter.
Or it's just smaller.
Yeah, I don't know what you say two times.
Because I think it's like 100 times.
In language models, it is roughly 100 to 1000 times every 12 to 18 months.
For the same given level of elements is, you know.
That's a net of everything, right?
That's model performance alongside compute.
So different than just compute cost come down.
But, you know, very interesting future.
Yeah.
For the web designers, we'll have this shout out that accessibility is an issue, right?
Yeah.
How do you deal with screen readers or whatever?
But yes, this is higher bandwidth storytelling than anything you can possibly generate with code.
Right?
So I think that's the rough idea.
I'd like to add a little bit that.
So human actually have the maximum bandwidth when we're looking at things, look at videos.
And we also have maximum output bandwidth when we're talking.
So in the future, it might be something like we talk to AI models and the AI model response back with the generative UI.
So that will be the maximum input and output bandwidth to interact with AI models before your link happens.
And I mean, it's also a very custom, right?
Some people are very visual.
Some people are not as visual, right?
They prefer the text, but the best thing about generative UI right can also be text.
Yes.
There's another project that we wanted to highlight, which is the neural OS.
Kind of similar idea, but here you're literally accumulating an operating system with a video model.
Yes.
And you can play doom.
You can firefox.
I find it's like mildly less impressive, obviously, because it's an OS that I can run.
But here, everything is imagined.
I was used to the command, I'll be to close the firefox tag.
I didn't crash the two immersive.
That's too immersive for me.
I wanted to close the tab.
But yes, I can play generated.
This is shockingly fast.
Yeah.
Because I remember there was a demo about maybe one or two years ago.
Someone tried to do the first person shooter with image model.
There was no consistency, it was very slow.
But here, let's say, realistically, this is doom.
I mean, I think there's two sides to that, right?
There's like, okay, what is running a game that heavy part of it is actually the game engine, all the lighting, all that stuff, the graphics?
This is just kind of video, right?
We've solved consistency.
This is still, you know, it looks like a few years old image generation.
There's some temporal consistency, but it's kind of just images that's together as frame video.
But it's a good visual representation to picture the future you want to see, right?
Like, that's what I see in these more so.
This reminds me of how the video models got better and better.
So in your OS, it's kind of, if you're just looking at it, it's just a crappy version of the Windows we could have, right?
But there's a difference.
So the model, this model is always shaded on the existing operating systems.
It can generate nothing different than that.
But it's actually also similar to video models.
When we're training these video models, image model, we train them on internet.
There's no imaginary supernatural stuff on the internet.
But once we train this model, you can prompt the model to generate something supernatural that have never existed in the dataset.
So if you train your OS or your computer on the standard square recordings on the entire internet, some model can imagine a new interface to interact with the computer.
This is one of those things that is magical to me.
Usually, generalizing out of distribution is bad.
But somehow we have learned some kind of internal world model.
Yes.
That you say, you know, this plus, but it looks like rainbows and butterflies, it'll do it and it'll kind of make sense.
So, yeah, that's kind of cool.
Yeah, I don't know if there's any comment more and there.
I do want it to touch a little bit more on the model architecture stuff, which I think you are getting.
It's like refreshing.
We don't get a chance to talk about this enough.
So one of the papers that we covered, we've covered every annual segment anything release.
And I don't know if you follow, I mean, you're a computer vision counselor.
So they did memory attention, which is kind of interesting.
And I always think, like, anything where you can across the temporal dimension keep some consistency.
I think it's like very fascinating.
And I don't know if, basically, like, does that the CV side bleeding into video genocide?
I think it's under explored, right?
Like, we talked about it for labeling, but actually, you can borrow the architecture itself.
There's also a complete different approaches, right?
Like, you brought up the term world model.
So we went from video model to world model.
There is diffusion, but there's also other approaches that people are doing so.
Maybe we get into those after as well, you know?
Yeah, he has a whole definition of all models.
Okay, stuff.
I feel like we threw a lot of you, whatever you want to comment on.
I think one thing that we should actually comment back on is, like, okay, so we were talking about the steps to train image and video model.
One thing we don't see as much of is, like, okay, you brought up the delta and training data, right?
So you won't have as much a video model, might not generalize, but what is the cost of training a large video model?
So we know for our lens roughly, okay, even like the pool sight thing that came out today, right?
It's a Gemma level model trained on roughly 40 trillion tokens at this many H200s over this much time, right?
You can see, what is the exact cost of that?
So how many GPU hours over how much H200 costs?
So how do we do the backhand math, you know?
Same thing for video models, image models, how do you kind of break that down?
I can share some back to the amount of calculation.
So surprisingly, video models is, like, the cost is very, is comparable to language models, and obviously the largest scale is the language model, maybe like a medium scale to language models.
I said, just storing the videos alone, it costs a lot.
You can maybe look up on AWS or something.
You're really like, say, if you have a billion videos, and let's just say, like, each video, like, five megabyte, then you will need, like, five megabyte to just store those videos.
And also remember, we talk about, you use a VDE to compress the videos, and you also need to store typically need to store those continuous features on also in your storage.
That's also comparable size with a video of themselves.
So just storing these videos and the features, it's tens of padabytes alone.
And I just looked up the calculation.
Five petabytes on S3 standard is 100K per month.
Okay.
And you need to, then, like, a tens of padabytes 200K, and even more expensive is you have the ingress and egress.
Thank you.
It's sort of the internet.
You have to just download those videos.
I believe it's more expensive on AWS than just storing those videos.
And each training runs, you'll probably need to pull them once.
If you're trying multiple times, it's even more than that.
So it's like, just storing, storing the network, those costs, it's just, I guess it would be a few moments per month to just storing everything, not to mention the GPU costs.
Okay.
My side-panger, like, in there.
Compute rental, like GPU rentals, very efficient.
There's one side, okay.
You can be XAI and build your data center.
Should we not just build our, like, storage compute as well?
Of course.
Of course.
Of course.
You see so much?
Yeah, exactly.
Especially with, like, egress and stuff.
So, you know, that's a good idea.
It also comes to, there are some of its own challenges.
Of course, of course.
Yeah, like, people who build the GPU data centers, they may not expect this much storage.
And, yeah, people will build storage, they just build it somewhere.
Business CPUs.
I just looked up.
Five.
AWS only charges for egress, not ingress.
Tier five for five petabytes is 230K.
Yeah, even more expensive.
Sorry.
My story is provided by you checking that you can check out.
It's so cool.
It's okay.
It's so slow.
Tier, you know, my, my data data is larger than you think.
Yeah.
My backhand math of GPU hours times GPU costs is also very much, you know, I'm missing some storage.
You're also, you're basically, like, also more eye-opounds than normal training.
Yes.
Yes.
Because, like, data loading, caching everything because super important.
Yeah.
So, in customers, it is a lot of optimizations that make it not outbound, so.
Yeah.
Speaking of the training, actually training the model, the GPU costs.
If you look up like the open source model, how big these video models are, think like, RTX has nanting B parameters.
That's a dense model.
And people are also exploring M.O.E.
So it might be like a 20 B active, and like hundreds of B total.
So that's, that's even, that's similar size as medium size on models.
And if you look at number tokens, they disclose that in cost most.
It's also like tens of trillions of tokens on the visual tokens.
So putting these together, the cost of like training these video models, it's actually comparable with all ends.
Not to mention, the infrared is slightly different from all ends, so it might be less efficient to trim these models.
Do you get the benefits of traditional diffusion speed up?
So for, you know, images, there's LCM or as for, you know, fine tuning.
There's a lot of style matching.
Yeah.
There's so matching.
There's a lot of stuff that's been done.
There's some overlap that applies to diffusion on the infrared side and stuff.
Yeah.
So the difference, the infrared side is a completely different story.
I think for the training side, it might be a little bit hard to reduce that cost.
And for the infrared side, the biggest gain is from the distillation of these models.
You can, it's called step distillation, so that you different from knowledge distillation in all ends.
So you typically for flow matching models, you need like a hundred steps or something.
Like the show show model, even even more like a thousand steps to generate a good image or a video.
A step distillation is try to learn to generate your step from the model itself.
It's kind of like, now we use the full model to generate in a hundred steps and then you take a model that only generate ten steps and that's how models learn from the perfect one.
Yeah.
Why is this work strong to week?
It's kind of like strong to week.
From the modeling perspective, the teacher model is trying to model the image and videos of internet.
And that distribution is extremely complex.
The step distill model is just trying to learn from the teacher.
The teacher is a model and the size is fixed.
The distribution is much simpler than the whole internet.
That's an intuition that have why step distillation can work.
So you release these models, certainly in productions.
They only write in a few steps.
In in cosmos, I believe we have we have that four step and a step.
If you do some simpler task, like an image to image translation, it can even write in a first step that one step in in cosmos transfer.
Yeah.
I think this is the same intuition that guides a lot of the consistency model work.
I sent you a link for SCM.
I don't know if you covered that.
To me, that was actually one of the most impressive papers I've ever seen from open AI.
This is the unifying grand concept of consistency models.
I don't know if you were ever in comments on this.
So there are a few different approaches.
Here is two steps versus twenty or one hundred steps, whatever.
It's already done.
So there are a few different approaches, for example, consistency model.
And there were also, actually we shouldn't forget again.
Yeah.
Actually, that was the OG of the step distillation, because it trend just one step to begin with.
So actually a lot of, for example, there's a distribution matching distillation, which use this gain.
As one of the laws for distillation, it again just tells you, hey, like you can't hear the image.
And then it has a discriminator to tell.
Is this image real or not?
So the model just need to learn one of the distribution, not the full distribution.
Because in training, the model is asked to reconstruct the ground truth image from the internet, which is extremely hard.
And when you're training again, it's the one step process.
It's just a, hey, your general image does this image look, look as real as the image from the internet, which has a much simpler task.
And yeah, combining a lot of these approaches together, people would have to do that like consistency model, distribution matching.
And again, we can get these fuel-stab models.
Okay, then there's one step I wanted to add, which is audio.
Yes.
And video.
Yeah, so, Gorky imagine, they were put in mind.
I believe it's, is a first, first audio video, trans model, deployed at a large scale.
So, and that was your first model.
Yes, I was, Gorky imagine's first model.
It's, it's audio video, uh, a trans generation.
I think the, it's a hard part is like the, some modality alignment.
Because before this trend model, like we have, we have text to video alignment.
We, we have this, uh, correspondence and views and text and video.
Typically, most of the, we all have the, the, the understand images and videos, videos, very, and they don't understand audio, mostly.
And, if you look at the audio generation on the online side, you can talk to them, perfectly fine, but if you ask them to sing a song or something, they're, typically, it's not very good.
Also, they don't have, they don't have music either.
The hard part is that, uh, actually, audio has, too component.
It has like a, a discreet component, a continuous component.
The discreet component is like, there's a language.
So when we speak, it's just a song.
It's an ESR issue.
Yeah.
Yeah.
It's, it's text talking with some characteristics, I'll say, but, uh, but music, I think the speech, guys will disagree.
It's like, this one sees and then, you know, it's like, it's like, it's like, largely.
It's like, it's like, it's like, it's like, it's like, this discreet token seen, language models.
Uh, this is like, it's a hard part for, for, for models is, uh, an automation.
We have to align, text, video, and audio together.
Yeah.
So, how?
So, significant, some significant challenges are like, so, so first, that we, we talk about as the VRLMs.
They cannot understand, they, most of them cannot understand audio.
So, you have to have some way to, to do the synthetic data generation for, for audio.
You have to capture the model.
And that involves, that involve synthetic data and human data effort a lot.
As, and not just surprisingly, mostly all M's are very bad at, recognizing, like, the, the beat, ton, and the details, the, the, they can, they can give some general, general, general, general, which song is this, but it's very hard to describe the details of, if the music.
Um, like, if we mentioned, in image generation, like, you have to describe, image as details, as possible, so that someone, blunt, can reconstruct it.
So, here is like, someone, deaf, deaf, can reconstruct, how the music sounds like, without, it is actually listening to it.
Maybe like, you, you can think of, it, it needs to have, have the, what they call the, the subtitles.
You've got to have, have all the details, it's, it's a music, and the dialogue.
So, is the challenge there, typically stuff like, music and audio, or is it just, like, is there a baseline?
Okay, there's enough data where we can understand, you know, narration, conversation, but there's nuances in audio that, that's where you hit all the, the data issues, or, is it just from state zero, you just do it all right?
So, one important thing is, like, the alignment, so it's a model, the model has to, know, like, the video and audio, they, it has to have a time this alignment, like, at which time step, the video and the audio token corresponds to each other.
We actually don't have these kind of alignment for, for, for most of the other modalities.
If you think about, like, text, text and image, text and videos, they are loosely aligned.
So, you can, you can't have a description, as, what's going on in the video, but you don't have to, exactly, you, you typically don't have exact description, or at, at, time, step, one second.
Yeah, three, two, four second.
Yeah.
So, what is the ideal time step, you have to bleed it, and then it's, like, four seconds or something.
So, that comes down to, how you design the model, yeah, to, as far as a model, to be aware of, as a time, as a time, or that, so the model is, like, a time, and that's something pretty unique, if you think about, all, so, if you ask, all, to come see the task, you ask them, and they would say, oh, this task, we'll probably take, tough hours to come see, and they, they come back in my hours, say, have already spent two days on this, and then, every exhausted, everything.
Yeah, so, so, the, themselves, they don't have a sense of time there.
I actually don't think that's just, I'm not having a sense of time.
I think it's, somewhat based, right?
Like, you tell someone, you know, work on this feature, go implement this.
There's a general understanding, you would have of how long they'll take, without LLM's, work at LLM's feed, right?
So, you think back, like, two years ago, if I tell you to, like, build me, like, a new front end for laden space, have a search where I have all this, you'll estimate that'll take a few days, right?
So, you're telling all, go build this, it'll take me a few days, but, you know, I think it's somewhat grounded as opposed to them, not having the best, not saying that they have a great understanding, but I think that example is, like, you can see where it comes from, you're trained on all over the text, they're trying to estimate what a human would say.
Yes, because that's, that's what the data kind of is.
It can start from, the core for us on the internet, people have an estimate.
Yeah, and not even just in direct, like, training samples, right?
Just your world understanding of tokens of how long stuff takes, right?
Go read a book, it'll take you a while, right?
Yeah.
Even if you do nothing, but read a book, it takes a few days.
So, yeah, I'll let my read it, it took me a few hours.
They don't take me a few hours to go through this research, but, there's somewhat a tangent.
So, I want to, yeah, this is a train I thought I haven't really expressed until now, which is basically, like, the full world model, the model will also be recursive, meaning that the participant in the world model must also be aware that they have a world model, which is like this whole recursive thing down the line.
But, yes, and that the world model can be wrong and that they need to update it.
But, yeah, we've argued this on the newsletter as well, that there needs to be recursive for adversarial models.
Okay, I mean, just, you know, task, how do you define world model?
Yeah, let's go there.
Yes.
So, just for context, you know, we talked about video generation, and then there's a, if you say there's a distinction between world models, what's your definition, how do you see the two?
So, this camera, I'm not going to debate, like, what is world model?
Yeah.
Like, there are many definitions.
So, I'll just talk about my definition, since I came from the multi-model domain.
So, mainly talking from video.
So, world model is like, real-time interactive, long-hars and videos.
So, there are three parts.
Like, so let's talk about them one by one.
So, so, interaction.
So, we just, we just look at saybook and the neural computer.
So, the interaction part of it.
So, you, the world model can allow you to interact with them through keyboard mouse, and maybe also voice.
So, these all, all some modality, you can interact with this model and the model should respond recently.
Second part is real-time.
So, once you, once they, you move your mouse, like, it says, the world model generated the game, like how fast can the game respond.
So, if you're, like, professional CS go upstairs, like, like, oh, you have to respond easily.
So, sub 10 minutes, like, so, or even less.
So, that's not, I guess, most of the, oh, 60FES, let's go.
Oh, 300F.
Yes, 500FES.
Wait.
Okay.
Yeah.
I didn't do the math for you.
Okay.
Sir, 100FES, that's a 3 million second.
So, you have to respond.
Oh, it means, whatever.
Most of the video models cannot do that.
Yeah.
And, but if you, say, if you have a video model, that is, say, like, a digital human, the response might be more generous, maybe, like, typically, like, for a real-time voice interaction, it's like, 200 minutes second.
So, that's, that's much more generous.
But even 200, the second is pretty, it's pretty tricky, because, like, remember, we mentioned, you have this temporal compression, coming from the VAE.
So, if you, if you don't compress the temporal dimension, your sequence lens is going to explode.
So, if you want to have this real-time, real-timeness in your model, you have to do with one context problem.
And the third part is, it's long horizon.
Because, if you're not going to just play with video games, just, like, a few seconds, most of the video models only have few seconds.
We're going to play with minutes, hours.
The model have to be able to generate a long-form content.
So, putting these three together, it's a real-time, long horizon, interactive videos.
I think, the final state will be, for example, like, a video version of the book, where you can, you can interact with a, a neural computer, you, you move your mouse, and you play on the generative interface, and it will reply to you through, sort of pixels, generally, in real-time.
But, getting there, it's a very long way to get there.
So, one of the first steps at Gorky Imagine, where I, a lot of small, or model team there, was, was to build video extension.
So, video extension.
Ah, it's the first step of interactivity.
Yeah, it's the first step.
Yeah, so, it's the first step.
You have it here, video editing, yeah.
Yeah.
Yeah, so, the first step, it's because, this on locks, long, horizontal videos.
Typically, for most of the video generation models, you, you give it a prompt, or an image as an initial frame.
You generate videos, that's it.
That's just one time done.
And some, some creators would try to, like, use the last frame, as a first frame for the second video.
It kind, sometimes it works, but if you do it a few times, it, it's a caught, you would agree.
And it doesn't have that context, with a full video.
So the, Yeah, because you gave it the last frame, of course, right?
Yeah.
Exactly.
It's actually a pretty fun hack.
Like, if you see it, oh, no, he says something better.
Yeah, yeah.
And for example, like a view, I remember you, three has, like, a one second context of the, the last video.
It's sadly better than using the last frame, but it has the same problem, similar problems that it, the caught, he would, the degree, like, if you extend a few times to, like, one minute, the video caught, he would look much worse than the first video.
Second, another problem is, as a model, doesn't have long range knowledge of, like, what's happening before, say, is it generates some dialogue, some, to people speaking, and, as they're wise, my change, over, over some time, especially if it's the one second, conditioning, it does not cover the previous context.
So the, these are the caught challenges.
So, the gorky imagine video extension, it has, historical, context of, all of the previous, generated videos.
And it can, it has, it has the context of, who's speaking, and what, what objects have appeared, and everything, having that, to generate the next video.
So, if we now usually do this, you can imagine, just put all of the previous, history video tokens into the context of context lens, we easily explode.
historical, for video models, that can be, like, a few, a few million context, I would say, context lens, yes, with that.
For example, like, just five seconds of video, it's like a 50, 50k or a 60k number of tokens.
So, like, if you do, if you do 50 seconds, it's a 500k tokens.
If you do longer than that, easily explode.
This long horizon, a problem with the first step, we're trying to solve, word model.
It turns out, people, people love video extension.
I call a lot of creators, love using video extension to create longer form videos.
This is the part I like that.
You have a, you have an intermediate step towards the final goal.
It's just a straight, straight shot to the, the final version, very much.
But I can see, you have a strong vision of where we want to end up.
Yeah.
Does it seem like it's an efficiency issue, like, okay, we're at a few million tokens context, you know, if you draw the parallel to language models, we're very short context.
2008, 2008, you know, you scale it up, 1 million, 10 million.
Sure, there's effective context, you know, but at the end of the day, it's just what's it worth.
Like, sure, there's a whole training data side.
In video, it might be slightly easier, because we have 100 million token video.
I just take a movie with the full context there.
Like, is this efficiency from an inference standpoint, that like, but we know how to solve it, or like, why is this not the approach?
So like, my broader point was on your second point of world models, you say it needs to be interactive and live, right?
You should be able to play a game and see the interaction live.
So one thing I see with research is, a lot of what you actually serve is different than what you build, right?
So we talked about distillation.
You train big model.
You distill it.
You do quantizations by color decoding.
We do all this stuff to serve it efficiently.
Should we not just have a solution, like a world model that can interact well, do inference optimization, serve it distill it, secondary.
So make it real time after you solve it.
So like, another parallel is say, continual learning, right?
What we need is someone to solve it and show it works inefficiently, give it a few years.
People will make it efficient.
Same thing with regular attention, right?
It worked over a few years.
People have different forms of attention.
And we've scaled it to be efficient at log context, you know?
So kind of two things there, right?
One is like, it seems like it works.
You've scaled it.
Can we not just scale it a lot more efficiently over time?
Do we need a separate approach if this works?
And same thing with interaction, right?
If we can get it done, like if we can solve some way that it works, you know, we can solve making it more efficient from an inference standpoint later.
Yeah, that's actually very good point.
So in videos, there's actually a lot of redundancies.
So we solve a lot of the pixel redundancy from VAE, but there's more redundancy in long range and long horizontal videos.
Say, if a character appear in the first clip and then it disappeared, it only reappeared, like as the end of the video, you probably don't need the context, like in the middle of the generation.
So you only need that character there, very you need.
So that's why I help build another feature is reference to video.
Is it the same model release or different way?
It's a different one.
Okay.
You probably need to search on reference to video.
Okay.
So reference video, you need to like upload up to seven images as a condition and generate a video save.
I want, it can be characters or objects or even scenes.
So like I want, I want a condition, a shan's selfie and a holding a blade.
Yeah, we put the door.
We put the door, we put the door and the thing.
Yeah, you can put them there and the video models we are generate the video from and copy the context over.
So that can solve a lot of the problems there.
Like the, the long context problem.
It doesn't need to have a very long context, but it's, I feel like it's an intermediate solution.
It's a model.
Yeah.
Yes, a model should be able to, like, selectively know like, we are, we are sure that it draw references.
So say if I want to generate a movie, I generate a positive, like a 10 second at a time or something.
And now this character up here, I can look back to where it first appeared and bring that back.
Yeah, this one.
I put the references.
Yeah, that's the, Optimus, Einstein.
Nice to meet you.
Oddly enough, I used Grok search to find it and they pulled your LinkedIn post.
But you know, we found it interesting.
But, okay, this is a problem.
This is not your fault, because yeah, it doesn't communicate all this work that you do very well.
Because they just have the model release, and then that's it.
But like, actually this details are very, very good.
It's as far as I understand everything you just described is state of the art, like, no one else has done it.
Thanks.
Yeah.
A lot, Yeah, and then you just put this vocals with the cookies there.
I'm like, this is not enough, you know?
But I obviously this is like the high level numbers that people want to know.
But you know, part of that is also like, some, some labs don't share research research into what happens.
Then this is literally bragging about how good they are.
Right?
Why would you not say that you are capable of extending with full content?
You know, this is not a secret source.
This is like we did the work.
Like yeah.
I don't know.
Yeah.
I guess different labs have some of the different communications styles.
Yeah.
Anyway, we are always happy to help you tell your story.
Yeah.
Okay.
So you did references.
And I think, I think the kind of the point you're making is like, it's sort of like a clue, right?
Like this is, you can do seven, but what about 100?
Yeah.
Right?
Then you need a completely different thing.
So I think it's like, this is like a mechanism to like select the context from the history.
And you might not put the entire history into the context.
So for example, there's a paper called frame pack, which have a heuristic that the latest history, like the last one second that put the entire history.
And the history before that, I would compress it and make the video smaller.
So I follow this pattern.
This is the role pattern that the maximum six and students is fixed.
So the further you are from the current frame, you have a smaller image.
So this is just a heuristic.
I think it can be more automatic.
The models, a very, like, which history part of it can be select.
So this part of the research is actually being actually a workstone that a lot of people.
It's also kind of interesting.
I feel this is actually this part of long context.
It's a little bit ahead of the all and part.
So for example, second in LMS, if you, the context keep growing.
That's if you, you call tool.
And the tool called history is extremely long.
That's still in context.
And keep growing, keep growing, even if you switch the topics to something else.
The whole context with there.
There are some agent take harnesses that help you to say, prune the tool results in the prune, like when you, when you require a file, only show like the top 200 lines or something.
So those are very heuristic driven.
For listeners, we did a write up on the cloud code, leak, whether it's different kinds of pruning, including like you prune the tool results and all that.
So you can, you can read up on that kind of thing.
Yeah.
I think a, one breakthrough in continuing might be like a way to automatically, Yeah.
Manage is on these all characteristics.
And it will be replaced by machine learning.
Yes.
Infissory.
The same thing is being researched in both.
Oh, I'm sending video models.
The interesting thing is also like in the paper you showed.
It's actually happening at the model level.
Right.
Compared to like, like good model show, we have base attention, you know, we'll do our own compression.
We'll do our own pruning, which is separate from model area.
Yeah.
Eventually it all just boils in, hopefully.
Yeah.
Yeah.
I think this is a form of like attention.
But like also, sort of reasoning attention.
I feel like there's different than normal attention.
Does that make sense?
Yeah.
Yeah.
It's different in the sense that attention, not to mention, set sparse attention aside, like normal attention.
Like you can read.
You have to attend to all of the tokens.
Yes.
So you, you don't have a, a parallel mechanism to, to travel, which tokens don't, you don't attend to.
Humans, the tertiary span is surprising small.
Yes.
You, you can only remember it.
You'll have a digital phone number.
Well, I have featured detection, right?
I can detect, oh, that's a sequence of one, two, three, four.
In a phone number that is, eleven digit.
Very good pattern matters.
But humans, contacts can, like, contention can work because we can dynamically pull in, contacts from, from different places.
The same, I think it's going to happen for, all of them.
You know models.
Yeah.
Or else, this is a, some of the recent work.
Yeah.
Is there, which is not that, crazy, but it's just recursive.
I think it's somewhat inherent in models, to really, like, use a nice example.
You pull up these.
You can read it fine.
But, language models are also very good at, slow parsing.
You know, you have a, and it's very good at parsing through noise.
You know, that may be a brute force.
It can look over a reason over it, but like, you know, there's, there's parallels to both.
At least it's really fascinating how you relate the world models, stuff to the video generation, which I don't think a lot of people here directly, from people like, like, you know, I think it's really helpful.
Any other, work, do we cover like, video, audio, world models, any other stuff in that, on the team, I guess.
Or any other work at, xia, you want to talk well.
Seems like everything we see publicly announced, oh, cool cookies.
And then there's so much more to it.
There's a lot of that.
Any underrated stuff, you know, just at the time there.
Yeah, I feel the, is a culture is, it's quite interesting, and a bit underrated.
So, so the culture is, the culture is, is, is, is, is, is, is, and the first principle.
Like you, you know, is a ghost that was very ambitious.
It wasn't very, it wasn't, it wasn't possible to, to achieve it when I, when I, now I was thinking, first and came about it, like for example, I could build, those something in three months.
And was that, like, okay, we're starting to team.
We want image we want video, do it by this deadline, or, you know, how do you work back?
we have a rough buy you know this date we want something out or is this like that's a very good point so it's from first principles thinking if you think about people might say like first principles thinking applied more to the physical world than the some models I would say for example like if you think about some limitation for example the car in data like how how fast can we acquire the videos and if you think about training the models like what's the iteration speed for training a model into end and how how would adding more GPUs accelerate that timeline and maybe if you need human data like who was a turnaround time for for a human data to arrive if you put all of those together that is first principles in came where oh you know like what is a timeline was a minimum number of days that is possible to achieve something I think that's this is a lot of Elon's type of thinking right he's like I think it's famous for saying that the only law you can break is the fossil physics something like that yeah just broadly you you worked a lot with Elon yeah I guess one benefit is like a workout at XAI you got a chance to interact more with Elon so so I was very fortunate to get a few retreats from him and that's that was quite fun and he he also worked very closely with his people like people imagine online like he's very hands-on there are two things one so I was actually looking up uh Elon retweeting you or pull it up he talked about you you tweeting that you have a really good voice mode I don't know I don't know him oh I also did it I actually so I would DM you feedback on voice mode because I was like well really good and I'm like oh the sucks but um I don't know anything you want to talk about about your voice mode building it was a team you worked on as well but that's actually not a part of the team I worked on yeah yeah probably worked on more of the like a video no but a rock voice actually got like very good I a this one those things were like first of all you can speak at 2X which is fun which I listened to 2X so I like to speak at 2X but also I think like the interruption was better than Gemini I don't know how it compares to Chaddu BT real time now but like you know as far as like driving was concerned like having grock in my Tesla and like driving I think it was like it's really good experience you know you like twice more but also just a crazy reach value 50 million viewers are just saying yes true oh my god but you know it's it's pretty cool how fast it came out I guess the other thing is the safety aspect of video mode anything interesting to talk about there so actually spicy question a lot of the countries very easy they don't allow like a generative data generative AI videos based on watermarks so in all of the those countries crocking imagine had watermarks and a lot of the lots of take-downs of the videos were also happening extremely fast I mean it's it's part of running a social platform but also good transfers nicely to the genii side do you have a perspective on synth ID versus other kinds of watermarking yeah I guess it's going to be it's going to be a hard-earned harder to to detect yeah these things so society one thing is previously it was only google and now not like a lot of different ways of developing it as a limitation is like the technology is a paper was out there and people can reverse engineer like how to get rid of it and it's I think even as it advance it's still still possible to reverse engineer it yeah so if you're interested you can go into rid it and people have taken out the exact like uh I don't know what do you call it mask or pattern that google applies and then you can apply it onto any google generated photo and you can reverse out the synth ID yeah and it's it's also are there and harder to just charge by eyes they remember like a couple years ago there's are like a six fingers or something is very obvious my my current is actually the audio feel like the audio is really lacking my way to tell it something is the engineering outside of like okay I think I've seen enough of a decent eye the audio matchup especially of Sora is not great it's all similar style but there's you you know there's those are minor reactions I think the the point is that like actually my closest reference to this is also in google fellow because I think he did like the adversarial game thing where like it's like okay here's a picture of a zebra then you like change one pixel and it becomes a better right this is like a classic computer vision issue yeah if you think about how how these models were were trained like I like I mentioned before like again was in the training process the objective again is you is a model channel into image and the model there's a judge to tell like if the image is real or not so model is trained to make the image more real so as a model become more and more advanced it's going to be harder and harder for me personally now I have to judge by if the these videos have logical sense this video have a warm model yeah no I also like the the audio is too nice like two studio quality the lighting is too good the skin is too clear you know basically the lack of imperfections yeah do we have a good way to do reasoning in diffusion like as that let's separate video generators from world models or in you know we really know how to apply it to order aggressive language models is there a parallel for diffusion video gen world models like on that point right is yes the thing on video agents yeah that's a good question yeah actually I have a I had a pretty big claim the the in the the visual intelligence are actually mostly coming from language like these video models especially from now since the diffusion model technology is more mature the like every time you see there's something improvement on these models I would say mostly is it's again comes from language model not not coming from the video model itself like the video research model themselves in cosmos that could be typically the these models they have two parts like there's a there's a problem through writer or the prompt up simpler part I think in uh in cosmos we use llama or we use mixed mixture and the cosmos video model itself is only 70 and the model the language model is a prompt writer it's it's bigger than that so the prompt for writers task is to take take user instruction and convert it to extremely detailed description of the video so because the video the video the social models I will describe the there kind of dumb because they they take the input instruction literally because in the training process remember that we have to describe the video as as detail as possible when we're creating the synthetic text pair so these models they take those kind of instruction to each other's videos so in when you're taking the user instructions user instruction you really are simple just say a cat or something if you put a cat in in the video models it would take that instruction literally that they would literally show a cat in maybe a white background because you didn't describe the background the cat is not moving because you didn't describe it it takes the instruction quite literally it's kind of it's kind of dumb and the the prompt for writer is actually a much bigger model we show a language model that takes the using instruction and the expanded so the thinking process you mentioned is from there so if you if you look at like typically image like a generator the image in certain minutes it's not all like a pick search generation a lot of time is spending thinking so prompting we're writing now have evolved to like not only just a thinking it can it can also be an agent agentic model for example say you want you wanted to generate the image of today's news so it's likely we'll go to fetch today's news online and then like process some medicine then organize a layout and generate it another thing quite interesting is if I'm not mistaken these are it's no longer a diffusion model right it's order aggressively or is there still there are different approaches for example like a channel on me since it said it's on me I believe it's a single model maybe it's something like a it's a language model with a diffusion head or something like a language model do the thinking do the agentic tool calling and then it would use the diffusion head to channels the image in the end there there were also approaches that cosmos where you you have a separate language model and separate each other models and there there are also like a purely language model like you you're discreet high as the images and then your channel is an image as discreet tokens so there are different approaches I would say like one of one of the claims that seen for why these approaches struggle is because a lot of the benefits for how we currently learn reasoning with language models is you basically iteratively generate reason you have your thought and then you work on that answer right so if you have like omnimodel and then diffusion head you can't feed that back in the continued reasoning right so you can't go like text image text image you can't reason on the output and then go back to diffusion but I guess in the new Gemini Omni you would be able to as long as you have.
Yeah I'm not sure if they have that process I guess it's definitely possible in the Omni paradigm yeah so if you think about like traditional multimodal language model they would have a VIT in coder that can't encode the image so you say how the diffusion has they can change the image and then put that back into the VIT in coder encode that and then do do the iterative refinement if the result yeah I think you have to jointly train the VIT and the diffusion to make that somewhat feasible because otherwise you're kind of like mismatching or feeding in-slop yeah the individual stage of training you might be able to freeze it but anyway also just on your earlier I wanted to also make explicit we do know the nano banana and GPT image are autoregressive language model with diffusion head as far as I can tell from your description of grock image it is not it is it is end to end but I think there's there's different approaches right like you started off saying prompt rewriters is like the big part of the intelligence and even on that I think everyone should try using an early diffusion model if you've used stable diffusion one or whatever if you've seen the prompts like you know ultra high res 4k this dial like oh my god the first time I tried one you don't talk to them like likely to models right you're prompting is very you know come on separate really talking in the labels that were in the data set right yeah but basically like I'm just trying to make the point that prompt rewriters and then image is different from autoregressive language model with diffusion head yeah they're different things yeah just wanted to establish as to like the common part is like the the image part so it's it's quite surprising that a lot of the improvement came from the language that just thinking the the tool calling so I still remember like in Cosmos I had generated a happy sheep thinking if there's already any rewriting it looks so CGI and after rewriting it looks it looks so beautiful I think without any joint reigning yeah actually without any joint training it's with rewriting it's already much better seeing a very interesting thing I guess what happened is the video agents mostly language models they'll cause these alternatives model easier it's a separate model or a decision had whatever as tool so these model can iteratively refine the results or even like a general longer content through the very long trend of thought it's actually very similar to how human create art so so we don't we don't general the pixels directly we we it's really draw something on and I think through this process like the these models not only use diffusion diffusion as one of the tool it can also use traditional tool it can also use image editing tools from Photoshop you can use video video editor, FSM pack whatever to take combination of these and the generative AI technology as a set of tool and they can they can iteratively create a better much better video for production grid quality if you look at existing professional creators they don't they don't end at generating a video from from these models they will take this video to to their editor and edit here and there so much post production yeah and sometimes actually like the the reason the video is good it's not really the video model it's actually the editing and yes we also are engaged in the same process as why would you love to use a video editing model yeah actually there's a grock grock imagine agent better that's I was the that was the first attempt in that direction yeah so I I think the like the process would be similar just like the mode yeah you can you can ask it to it's no blog post for it maybe generate a one minute video like which is not possible if you ask the same prompt to video models but this model is really called different tools to do that so yeah this is actually an interesting thing so when in the first release video editing model like a CLX sound sound people try the video editing feature with like added this video to be a minute because they didn't understand how video editing work video editing typically is just a removal ad replace star transfer this kind of thing but that's actually about it requests under the assumption of some video agents so these agents should be able to understand these kind of longhars and tasks to be able to it's really create a long form video I think this is this is really fascinating because it is kind of take it's taking the same direction as first you have these assisted AI assisted coding kind of like tap composition can have copilot and from there you gradually evolve to codex and cloud code where you do things fully automated so in in ancient in carcass imagine ancient mode you can you can still go in there and do stuff by ourselves gradually as a model capability increase it will be able to do everything fully automated yeah I like that okay so it looks like it's still generating also I did notice the crock imagine was always very very fast I don't know if this is something you guys benchmark but like this is just a tangent compared to when I used to use before the latest open AI's image gen and same with Gemini nanobanena I would often times use crock yeah it's in the benchmark somewhere in the imagine API blog post that they have for the speed things it mostly combination of distillation plus inference yeah there there there are a bunch of things like we talk about distillation if you have to talk about thinking if you don't have any thinking budget the model can just think tremendous and then come back to you and also like inference the inference in fried team was very talented and they were able to accelerate a whole lot of it's these models yeah yeah I mean you know my comment on the video agents things like I'm trying to figure out like when people see video agents when you initially told me about your bet on video agents or your your vision for video agents I was a little bit disappointed I was like oh you mean like models are tapped out now we have to do agents but like I do you have to right the question now is how much model training is is it really going to make make a difference versus just building a better harness like you said the models don't have to be jointly trained you can just take it off the shelf frontier reasoning model slap it on the harness give it grow as a tool that's it that's a video agent doesn't seem super satisfying obviously you can coach in and and get some more percentage points of proper performance but like if your central claim that the majority of video or generative media alpha or whatever is actually coming from language and helges and not image diffusion of video diffusion then that is the future I mean it's a cool opportunity to wait if you pop back at the example you know it generated frames sorry to interrupt you know it's been saying like okay I'm going to start stitching these frames together it's using FM pack using this is what you see image prowess well it's doing right like it's also just writing code in the background and just stitching doing an image pass on the final output it feels this satisfying for the people who want to just train models it's interesting right like it's it's also somewhat exciting like you brought up earlier a lot of the gains don't come as much from the video like I think you can see that in the language model space to rate anthropic very very good at coding their multi model not the best right they have basic input PDF but like you know there's clearly a disconnected the quality of their image video processing audio processing yet intelligence very top tier other labs Gemini opening IXI you can add modalities but it's not like they're unlocking crazy capabilities right so it's interesting yeah it's interesting to see that because the video models capability increase actually come from language model being more intelligent I think video agent like it it can unlock more stuff that you might imagine so there is also a few saying so one thing is when we are prompting these models so most of the people over actually not very good at prompting actually language models have a better sense of how to prompt yeah models yeah models no yeah models better so if you jointly train these models maybe there's a model have a better sense of the how to prompt each model like a different different model maybe different another thing is in my not as simple as just like a generous of few clips and slaps them together using FFM pack that you might there might be more like image and video adding to appear in this process say if you want to exactly add a add a blah blah text at this time step the video model video models may not get that intention very precisely but the these were possible using these deterministic tools as well as the video agents can use all all sorts of tools so you don't have to put all of the capabilities into the transition model itself yeah I think that's very true no so for what is worth I think you're right I think that this will be a big category I think probably you are predicting like the next one you're in video it's going to be all this do you have a time time prediction for how when the stuff ramps up like me they restarted is it's not very good yeah so it's so it's so good I think the last one is longer yeah I didn't give me a minute give me 36 seconds but you know are we feeling in now is there going to be is there any timeline predictions you want to make I guess that's the end of this year is this is going to be a big hit so the the inflection point will be there and the the videos generated by video agents can get to like production grid quality say it can be presented and it can be it can be distributed in ads and once that happens I think the enterprise will have much more budget for video models because the agents are inherently more expensive than than the other video models themselves because they do these it's a process they channel to many many variations yeah but once once these models have this past this usability threshold I think it's going to be a exponential gross beyond that yeah I would find a complete right now based on this so I think you're right one thing I'm surprising I'm reflecting on the whole that past our server conversation you I think you're into world models and video generation for video generation sake I think that a lot of other world models people we've interviewed a lot of them uh general intuition and faithily and not not those guys and and moon dream which I think I told you about moon lake like I keep saying moon dream got them moon lake a lot of them actually say like robotics is the end game like embodied robotics like you want real time you want interactive it is to interact with the physical world you're not that concerned about it I think robotics will be a value a big part of it for sure I guess the the process may happen naturally so some of my prediction on robotics is that the problem is physically I might be solved like if it's actually need to be the real world need to be in the real world so it's my it might get solved by a video all of them is very strong video capability so so remember we'd talk about the real time interactive long horizon video once these models so so now these models are just training training on like screen recordings and computer once these models can use computers and understand the future state of computer extremely well the robots might be maybe one of the one of the tools like a very powerful AI can can use so it's a powerful AI might just be able to control the physical embodiment naturally I see that for sure cool I know I know we are coming up in time you had you left one more spicy topic which is why you left XEI for me there's there's so a lot of a lot of research you want to do that you cannot do at as a company but not so like the the priorities and an objective the for company typically can change various tasks it's also the same for XEI so so now it's kind of like the it's a time so there's some research I want to do especially more on language model side like I cannot do at XEI oh okay yeah because you're basically leaving you you had this whole transition from computer vision to what models video generation to now you're like focusing on how lambs but it seems like you know a lot of you think focusing on how lambs you really in the past I would describe that all ties together right like yeah but I don't know well what do you mean by focusing on how lambs is there I realize the fact that the the video models even like in the beginning the game might come from improvement on diffusion technology but this is the point where actually models again come from the language models and it's a huge black pill for anyone who is like spend their career in like generative media I mean that's an extreme view right that you still definitely need a bit of both right yeah this is just it seems like more pressing impactful work to do now on language model do you have any similar predictions are you so you predict the video agents I think you will be right on the language side what are you looking for in the next one year I think one thing pretty pretty interesting I think my be happening soon is the the long-fished models will be like contacts are there and manages own contacts yeah so some like from from the video model side we've been suffering from it's a long-harsan issue like we want to channel video longer and longer and we've been trying to solve the contacts lens issues through various ways but one one thing is just reinforcing train longer contacts lens another is to manage the contact better I think the same thing in language models also going to be happening soon so far as I'm like the long-harsan models they they're not aware of how long their own contacts lenses when they hit like 80% or something the automatic contacts compassion is getting trigger and the model is not aware of that when it's working and some maybe it's good far as a model to to know oh I'm I'm approaching like 80% or something and something also pretty interesting like as far as I'm playing open claw like you every time you type in something like a attempts the current local time is automatically attached to your message so the model actually know what time is it so this is making the the model timer there and also like in two calling the a lot of the intermediate two call results automatically prune so there's like contacts removal contacts addition and the contacts compaction so all of these are from the harness system cells from our experience the accuracy engineering all sets of models get as absorbed into the model themselves I guess that's something very interesting to explore to infinite context maybe but it's it's interesting right it isn't a space of memory and continuing learning I don't know it's also like in the space of agent harness you right you're saying he doesn't want to do it in harness right but models are also being trained on unit using harnesses right so some of it is you could say implicitly leaking in right you know part of that post-training of language models is okay using it in coding harnesses in which case you know when our agent spawned when it's compaction gonna happen it's not explicit like you know you have this much token window which I don't know if you want it to be as that'll change but it's somewhat leaking in there I mean imagining what if the model have access to the whole there's a code that's the agent harness itself and be able to modify it whatever you want say if the agent harness is short enough you can just put in the context lens it's in the system prompt and then the model is say when I want to spawn a future version of myself I can modify the agent harness for example if the agent harness can be when I'm reading a long document I can choose to read the whole thing in chunks and come back smash the summary together or I can just read the first first 200 lines and discard the rest and all all kinds of choices if they can be made by the models themselves it might be very interesting to to see that as a model can like a program some model can program itself online in pest time yep so the self-monifying harnesses also part of open claw and pie but I think there's a lot more work to do there very cool I think part of me is kind of curious you know like I think you are a part of big lab right and there's this career path of a researcher at a big lab which is you are you train models you get more compute you train better models you keep going and somewhat I feel like you're opting out of that and if I were you I'll be like oh I think this is like a bit of a career risk you know I mean I don't have any comment apart from like you're very strongly convicted I think that a lot of people in your shoes would not be doing what you did yeah speaking of my career if I I look back actually there's ever there were a lot of huge transitions so so 10 years ago I was I was doing a research with the rest net authors that I saw you John and Tenson yeah at that time the research for a completely different it was like a mostly conservation like image recognition object detection object tracking I was also doing your net compression at that time it was quite different from knowledge to solution these days and at that time I I wanted to be a professor and I applied when I applied for a PhD I already had a few first author papers at top conferences so I confident we applied the top schools but turns on I got rejected by all of the top PhD programs so I had to I had to go to the industry at that time I was at Facebook Hair Research Fair that's by I'm a con I want to talk with each other but yeah yeah we kind of leave it as far enough as a time yeah I switched to as I time I switched to self surprise learning it was it was quite different from what I was doing in contribution yeah and and after that is Nvidia cosmos so I realized scaling up was extremely important so at Nvidia I was mainly focusing on scaling so one thing is cosmos scaling the video of the research models to to should be then parameters and another thing is I was working on MOUs so the megatron MOUs was the first it was the first framework open source to be able to train these MOUs at very large schools like a hundred buildings, parameters to even train and parameters efficiently at like forty percent that Matthew and going to switching to XDI was trying to work on even larger compute scale even further and yeah looking look at this trajectory I actually worked on a lot of different things so I feel actually busy in using ML it's actually easier to switch than something you think like a lot of people might have managed it out or I work on a work on computer vision I always have to work on computer vision and I cannot switch to language and but before my experience at least at Nvidia I worked on both language model and MOUs and also video models it's actually not the case of others but lots of the core principles how to train large models are largely the same and yeah for me I feel right now it's a bottleneck for video models it's actually the language part the agent which is why I want to go to work more on all lamps one thing is it's a bit of a challenge I don't think it's a huge trumps yeah I mean it could also you I think you have a lot of strong vision there yeah I think that was mostly everything that we wanted to cover you've been very generous with your time and I you know it's really nice that you are able to share all these things now we don't have to go through XDI to to clear everything but also we I think we we really get you in trouble it's not a lot of good stuff about XDI compared to what you just seen the releases right you don't realize how many more love you are to XDI please do more podcasts yeah but thank thank you for sharing it's been very kind and also like I want to hear more of you I think you are going to embark on your next phase you haven't announced what you're doing next but clearly you have you know more vision and more ambition and this path and I think you're you're basically kind of gradient descent into like whatever your final form is thank you yeah yeah I'll share more about my next chapter son okay thank you for having me thanks for coming