NVIDIA AI Podcast · 2025-01-07

NVIDIA's Ming-Yu Liu on World Foundation Models and Cosmos

Hosts: Noah Kravitz

Guests: Ming-Yu Liu

world foundation models · NVIDIA Cosmos · physical AI · robotics · self-driving cars · humanoid robots · diffusion models · autoregressive models · synthetic data · simulation · CES 2025

Why it matters

Cosmos model weights are released openly for commercial use to support physical AI developers

Key claims

  • NVIDIA Cosmos, announced at CES by Jensen Huang, is a developer platform for world foundation models targeting physical AI
  • World foundation models are physics-focused space-time simulators, distinct from LLMs (text) and generic video models (creative content)
  • Three use cases for physical AI: checkpoint verification, policy pre-training, and test-time planning via future simulation
  • Cosmos ships pretrained models in both diffusion (higher coherence) and autoregressive (faster, easier to integrate) variants, plus tokenizers, fine-tuning scripts, and a video curation toolkit

Episode summary

Summary

Ming-Yu Liu, VP of Research at NVIDIA and IEEE Fellow, joins the NVIDIA AI Podcast to explain world foundation models and NVIDIA's Cosmos platform, which CEO Jensen Huang announced at CES. Liu describes world foundation models as deep learning-based space-time digital simulators that predict future states of the physical world — distinct from LLMs (which generate text) and from generic video foundation models (which prioritize creative use cases). World models are specifically focused on physics-consistent simulation and can be customized for different physical AI setups, such as varying numbers and placements of cameras on a robot or vehicle.

Liu outlines three primary use cases for world foundation models in physical AI: (1) verification of trained policy checkpoints by simulating them across many environments before real-world deployment, (2) pre-training policy models so they require less real-world data, and (3) test-time planning, where a robot simulates multiple futures before choosing an action. The Cosmos development platform bundles pretrained world foundation models in both diffusion and autoregressive variants, video tokenizers, post-training/fine-tuning scripts, and a video curation toolkit. The models are released as open weights for commercial use, with the explicit goal of letting physical AI builders focus on robotics problems rather than training world models from scratch.

Liu notes that diffusion models tend to produce more temporally coherent outputs, while autoregressive models — benefiting from years of GPT-style optimization — run faster and integrate more naturally into physical AI pipelines. NVIDIA is partnering with humanoid robot companies (1X, Figure) and self-driving firms (Wayve, Xpeng) to iterate on Cosmos. Liu acknowledges the field is still in its infancy: robust physics simulation and proper benchmarks remain open problems, and integrating world models effectively into physical AI systems will be a multi-year effort.

  • NVIDIA is partnering with humanoid robotics firms (1X, Figure) and self-driving companies (Wayve, Xpeng) on Cosmos adoption
  • Liu concedes the field is in its infancy — robust physics, proper benchmarks, and integration into physical AI stacks remain open challenges
  • Liu expects rapid progress in physical AI thanks to existing large-scale model infrastructure and strong industry investment

Source material

Transcript

[Music] Hello and welcome to the NVIDIA AI Podcast.

I'm your host, Noah Kravitz.

NVIDIA CEO Jensen Huang recently keynoted the CES Consumer Electronics Show conference in Las Vegas, Nevada.

Amongst the many exciting announcements Jensen talked about was NVIDIA Cosmos.

Cosmos is a development platform for world foundation models, which I think we're all going to be talking a lot about in the coming months and years.

What is a world foundation model?

Well, thankfully, we've got an expert here to tell us all about it.

Ming-Yu Liu is vice president of research at NVIDIA.

He's also an IEEE Fellow, and he's here to tell us all about world foundation models, how they work, what they mean, and why we should care about them going forward.

So without further ado, Ming-Yu, thank you so much for joining the NVIDIA AI Podcast, and welcome.

It's great to be here.

So let's start with the basics, if you would.

What is a world foundation model?

Sure.

So world foundation models are deep learning-based space-time digital simulators that can help us look into the future.

They can simulate physics, and they can simulate people's intentions and activities.

It's like the Doctor Strange of AI.

It can imagine many different environments and simulate the future, so we can make good decisions based on the simulation.

We can leverage world foundation models' imagination and simulation capability to help train physical AI agents.

We can also leverage this capability to help the agents make good decisions during the inference time.

We can generate a virtual world based on text prompts, image prompts, video prompts, action prompts, and their combinations.

So we call it a world foundation model because it can generate many different worlds, and also because it can be customized to different physical AI setups to become a customized world model.

Different physical AI systems have different numbers of cameras in different locations.

So we want the world foundation model to be customizable for different physical AI setups so they can use it in their settings.

So I want to ask you kind of how a world model is similar or different to an LLM and other types of models.

But I think first I want to back up a step and ask you, how is a world model similar or different to a model that generates video?

Because my understanding, and please correct me when I'm wrong, my understanding is that you can prompt a world model to generate a video, but that video is generated based on the things you were talking about, based on understanding of physics and other things in the physical world.

And it's a different process.

So I don't know what the best way is to kind of unpack it for the listeners, but one place to start might be how does a world model differentiate from an LLM or a generative AI video model?

So a world model is different from an LLM in the sense that an LLM is focused on generating text descriptions.

It generates understanding.

A world model is generating simulation and the most common form of simulation is videos.

So they are generating pixels.

And so world models and video foundation models, they are related.

A video foundation model is a general model that generates videos.

It can be for creative use cases.

It can be for other use cases.

In world models, we are focusing on the physics aspect of video generation.

Based on your current observation and the intention of the actors in your world, you roll out the future.

So they are related but with a different focus.

Gotcha.

Thank you.

So why do we need the world models?

I mean, I think I know part of the answer to the question.

We're talking about simulating physical AI and all of these amazing things.

Tell us about the need for world foundation models from your perspective.

So I think world foundation models are important to physical AI developers.

Physical AI systems are systems with AI deployed in the real world.

And unlike digital AI, physical AI systems interact with the environment and can cause damage.

So this could be real harm.

Right.

So a physical AI system might be controlling a robotic arm or some other piece of equipment changing the physical world.

Yeah, I think there are three major use cases for physical AI.

It's all around simulation.

The first one is when you train a physical AI system, you train a deep learning model, you have a thousand checkpoints.

Do you know which one you want to deploy?

Right.

And if you deploy each one individually, it's going to be very time-consuming.

And if it's bad, it's going to damage your kitchen.

So with a world model, you can do verification in the simulation.

So you can quickly test out this policy in many, many different kitchens

before you deploy in a real kitchen.

And after this verification step, you may have narrowed it down to three checkpoints, and then you do the real deployments.

So you have an easier life deploying your physical AI.
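The checkpoint-screening workflow described here can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not Cosmos code: `toy_world_model`, `make_policy`, `score`, and `shortlist` are hypothetical names, the "world" is a single number that a policy tries to drive to zero, and each "checkpoint" is just a gain value.

```python
from typing import Callable, Dict, List

# Hypothetical types for illustration only; none of these names come from Cosmos.
Policy = Callable[[float], float]              # observation -> action
WorldModel = Callable[[float, float], float]   # (state, action) -> next state


def toy_world_model(state: float, action: float) -> float:
    """Stand-in simulator: the state simply moves by the action."""
    return state + action


def make_policy(gain: float) -> Policy:
    """A 'checkpoint' is represented here by a single gain parameter."""
    return lambda obs: -gain * obs


def score(policy: Policy, world_model: WorldModel,
          start_states: List[float], horizon: int = 20) -> float:
    """Mean distance from the goal (state 0) after a simulated rollout; lower is better."""
    total = 0.0
    for state in start_states:
        for _ in range(horizon):
            state = world_model(state, policy(state))
        total += abs(state)
    return total / len(start_states)


def shortlist(checkpoints: Dict[str, Policy], world_model: WorldModel,
              start_states: List[float], k: int = 3) -> List[str]:
    """Rank every checkpoint in simulation and keep the k best for real-world trials."""
    ranked = sorted(checkpoints,
                    key=lambda name: score(checkpoints[name], world_model, start_states))
    return ranked[:k]


# Five made-up checkpoints, screened across several simulated starting conditions
# (the analogue of "many different kitchens").
checkpoints = {f"gain_{g}": make_policy(g) for g in (0.0, 0.1, 0.5, 1.0, 2.5)}
start_states = [-2.0, -0.5, 1.0, 3.0]
best = shortlist(checkpoints, toy_world_model, start_states, k=3)
```

Only the shortlisted checkpoints would then go on to real deployment. In practice the world model would be a learned video simulator and the score a task-specific success measure.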

It reminds me of when we've had podcasts about drug discovery and the guests talking about the ability to simulate experiments and different molecular combinations and all of that work so that they can narrow it down to the ones that are worth trying in the actual physical lab.

Right.

So it sounds like, you know, similar, like just being able to simulate everything and narrow it down must be such a huge advantage to developers.

Yeah.

And the second application is this: if a world model can predict the future, it has some kind of understanding of physics.

It might know the action required to drive the world to that future.

And the policy model, the typical one deployed in physical AI, is all about predicting the right action given the observation.

So the world model can be used as initialization for the policy model.

And then you can train the policy model with less data, because the model is already pre-trained with many different observations in its datasets.

So without a world model, what's the procedure of training a policy like?

So one procedure is you collect data and then you start to do supervised fine-tuning.

Right.

And then you may use.

Yeah.

So it's hands-on, it's manual, you have to get all the data.

It's a lot.

Yeah.

Yeah.

And the third one is, when a world model is good enough, highly accurate and fast, then before a robot takes any action, you just simulate different futures.

Right.

And then check which one really achieves your goal, and take that one.
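A minimal sketch of this "simulate several futures, then act" loop follows. It assumes a one-dimensional toy state and a hand-written stand-in for the world model. The names `rollout` and `plan` are hypothetical and are not part of any Cosmos API.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical stand-in: a real world model would predict future video frames.
WorldModel = Callable[[float, float], float]   # (state, action) -> next state


def toy_world_model(state: float, action: float) -> float:
    """Toy dynamics: the action is added to the state."""
    return state + action


def rollout(world_model: WorldModel, state: float, actions: Sequence[float]) -> float:
    """Simulate one candidate future and return the predicted final state."""
    for action in actions:
        state = world_model(state, action)
    return state


def plan(world_model: WorldModel, state: float,
         candidates: Sequence[Tuple[float, ...]], goal: float) -> Tuple[float, ...]:
    """Pick the candidate action sequence whose simulated outcome lands closest to the goal."""
    return min(candidates,
               key=lambda seq: abs(rollout(world_model, state, seq) - goal))


# Four candidate futures from state 0.0, with the goal at 3.0.
candidates = [(1.0, 1.0), (1.0, 2.0), (2.0, 2.0), (-1.0, 0.0)]
chosen = plan(toy_world_model, 0.0, candidates, goal=3.0)
```

The agent would execute the first action of `chosen` and then replan from its new observation. This only works if the simulator is both fast and accurate, which is the condition stated above.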

Yeah.

You have a Doctor Strange next to you before you make any decision.

Right.

Wouldn't that be great?

You mentioned accuracy when the models are fast enough and accurate enough.

And I don't know if it's a fair question to ask.

So I'll ask it, and you can interpret it the best way.

But like, how do you determine accuracy on a, or measure accuracy on a world model?

And is there a benchmark that, you know, different benchmarks you need to hit to deploy in different situations?

Or how does that work?

Yeah.

It's a good question.

So I think world model development is still in its infancy.

Right.

So people are still trying to figure out the right way to measure the world model performance.

And I think there are several aspects a world model must have.

One is to follow the laws of physics.

When you're dropping a ball, the model should predict that it ends up in the right position based on the physics.

Right.

Right.

And also, in the 3D environment, we have to have object permanence.

Right.

So if you turn away and come back, the object should remain there.

Right.

Without any other players, it should remain in the same location.

So there are many different aspects.

I think we need to capture them, and it's important for the research community to come up with the right benchmark,

so that the community can move forward in the right direction to democratize this important area.

Right.

So speaking of moving forward, maybe we can talk a little bit or you can talk a little bit about Cosmos and what was announced at CES.

So at CES, Jensen announced the Cosmos world model development platform.

It's a developer-first world model platform.

So in this platform, there are several components.

One is pre-trained world foundation models.

Right.

We have two kinds of world foundation model.

One is based on diffusion.

The other is based on autoregressive.

And we also have tokenizers for the world foundation models.

Tokenizers compress videos into tokens so that transformers can consume them for their tasks.

Right.

In addition to these two, we also provide post-training scripts to help physical AI builders fine-tune the pre-trained model to their physical AI setup.

Some cars have eight cameras.

Right.

And we would like our world foundation model to produce eight views.

And lastly, we also have this video curation toolkit.

So processing videos, a lot of videos, is really a compute-heavy task.

There are many pixels that need to be processed.

And NVIDIA has gathered libraries of accelerated computing code together, and we want to help world model developers leverage the library to curate data,

whether they want to build their own world models or fine-tune one based on our pre-trained world foundation models.

So the models provided as part of Cosmos, those are open to developers to use.

They're open to other businesses, enterprises?

Yes.

So this is an open-weight development platform.

Meaning that the model is open-weight: the model weights are released for commercial use.

We feel this is important to physical AI builders.

Right.

So physical AI builders need to solve tons of problems to build really useful robots and self-driving cars for our society.

There are so many problems, and the world model is one of them.

And those companies, they may not have the resources or expertise to build a world model.

Right.

And NVIDIA cares about our developers.

And we know many of them are trying to make a huge impact in physical AI.

So we want to help them.

That's why we created this world model development platform for them to leverage so that they can handle other problems.

Right.

And we can contribute our part to the transformation of our society.

Absolutely.

I wanted to ask you, can you explain a little bit about the difference between diffusion models and autoregressive models, particularly in this context?

Why offer both?

What are the use cases and pros and cons?

So an autoregressive, or AR, model is a model that predicts tokens one at a time, conditioned on what has been observed.

So GPT is probably the most popular autoregressive model.

It predicts a token at a time.

Diffusion, on the other hand, is a model that predicts a set of tokens together.

Right.

And it iteratively removes noise from these initial tokens.
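The two generation styles contrasted here differ mainly in their control flow, which the toy Python sketch below tries to show. It is an illustration under heavy assumptions: `toy_next_token` and `toy_denoiser` are hand-written stand-ins for learned networks, and all function names are hypothetical rather than taken from Cosmos.

```python
import random
from typing import List

VOCAB_SIZE = 8  # assumed toy vocabulary of integer tokens


def toy_next_token(prefix: List[int]) -> int:
    """Stand-in for a learned AR predictor: next token = last token + 1 (mod vocab)."""
    return (prefix[-1] + 1) % VOCAB_SIZE if prefix else 0


def generate_autoregressive(length: int) -> List[int]:
    """Emit one token per step, each conditioned on everything generated so far."""
    tokens: List[int] = []
    for _ in range(length):
        tokens.append(toy_next_token(tokens))
    return tokens


def toy_denoiser(noisy: List[float]) -> List[float]:
    """Stand-in for a learned denoiser: its guess at the clean signal is a ramp 0, 1, 2, ...
    A real denoiser would compute this estimate from the noisy input."""
    return [float(i) for i in range(len(noisy))]


def generate_diffusion(length: int, steps: int = 10, seed: int = 0) -> List[float]:
    """Start every position from noise, then refine all positions together at each step."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(length)]
    for step in range(steps):
        estimate = toy_denoiser(x)
        blend = 1.0 / (steps - step)  # move a growing fraction toward the estimate
        x = [xi + blend * (ei - xi) for xi, ei in zip(x, estimate)]
    return x


ar_tokens = generate_autoregressive(5)
diffusion_values = generate_diffusion(5)
```

The autoregressive loop never revisits an earlier token, while the diffusion loop updates the whole set at every step.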

Right.

Right.

Right.

Yeah.

And the difference is that for AR models, with the significant amount of investment in GPT, there are so many optimizations that they can run very fast.

Right.

Right.

Right.

And with diffusion, because tokens are generated together,

it's easier to have coherent tokens.

Right.

The generation quality tends to be better.

And both of them are useful for physical AI builders.

So some of them need speed, some of them need high accuracy.

So both are good.

Excellent.

So far, the most successful autoregressive model is based on discrete token prediction, like in GPT.

So you pretty much have a set of integers, tokens, and you predict them during training.

And in the case of world foundation models, it means you have to tokenize videos into a set of integers.

And you can imagine it's a challenging compression task.

Right.

And because of this compression, the autoregressive model tends to struggle more on accuracy, but it has other benefits.

For example, its setting is easier to integrate into the physical AI setup.

Got it.

I'm speaking with Ming-Yu Liu.

Ming-Yu is vice president of research at NVIDIA, and he's been telling us about world foundation models, including the announcement of NVIDIA Cosmos, the developer platform for world models that was announced during Jensen's CES keynote.

So we've been talking a lot about, you've been explaining what a world model is, how it's similar and different to other types of AI models, just now the difference between autoregression and diffusion.

Let's kind of change gears a little bit and talk about the applications.

How will Cosmos, how are world foundation models going to impact industries?

Yeah.

So we believe that, first of all, the world foundation model can be used as a synthetic data generation engine to generate different synthetic data.

And like what I said earlier, the world model can also be used as a policy evaluation tool to determine which checkpoint or which policy is a better candidate for you to test out in the physical world.

And also, if it can predict the future, you can probably reconfigure it to predict the action that leads to that future.

So it can serve as a policy pre-training initialization.

And also, you can have a Doctor Strange next to you before any endeavor.

So during test time, you run rollouts and pick the best decision for each moment.

Are there particular industries?

I know work in factories and industrial work, anything involving robotics, but other specific industries that you see benefiting from world models maybe sooner than others?

Yes, I think the self-driving car industry and the humanoid robot industry will benefit a lot from these world model developments.

They can simulate different environments that will be difficult to have in the real world to measure the agent's behavior effectively.

So I think these are two very exciting industries the world models can impact.

And NVIDIA obviously has a long history, as you were saying: it's not just about rolling out the hardware, there's the software, the stack, the ecosystem, all the work to support developers.

Because if the devs aren't building world-changing things with the products, then there's a problem, right?

What are some of the partnerships, the ecosystems, relative to world foundation models?

And maybe there are some partners who are already doing some interesting stuff with the tech you can talk about.

Yes, we are working with a couple of humanoid companies and self-driving car companies, including 1X, Wabi, Deoto, S10, and many others.

So NVIDIA believes in suffering.

We believe that two business comes from suffering.

So working with our partners, we can look at the challenges they are facing to experience their pain and to help us to build a world model platform that is really beneficial to them.

Fantastic, yeah.

So I think this is the important part to make the field move faster.

Absolutely.

All right, so you talked about being able to predict the future and you talked about just now that things moving faster.

What do you see on the horizon?

What's next for world foundation models?

Where do you see this going in the next five years or adjust that timeframe to whatever makes sense?

So I'm trying to be a world model now, try to predict the future.

Exactly, yeah, for Niyong's part.

Yes, I believe we are still in the infancy of world foundation model development.

The model can do physics to some extent, but not well or robustly enough.

That's, I believe, the critical point for making a huge transformation.

It's useful, but we need to make it more useful.

Right.

So the field of AI advances very fast.

Yes.

So from GPT-3 to ChatGPT was just a year or two.

Right, yeah, we forget it's all going so quickly.

Yeah, it's going so fast.

And I believe physical AI development will be very fast too, because the infrastructure for large-scale models has been established.

So this large scale model transformation, right?

And there's a strong need to have physical AI systems for trying to solve the task for humanoid.

And there are also a lot of investments.

So we have a great foundation, and many young researchers want to make a difference.

And we also have a great need and the investments.

I think this is going to be a very exciting area, and I think it's going to move very fast.

I don't want to say that it will be solved in five years.

So I think it's still a long way.

And more importantly, we also need to study how to best integrate these world models into physical AI systems in a way that can really benefit them.

Right.

And does that come through just working with partners out in the field, kind of combining research with application and iterating and learning?

Yeah, I believe so.

I believe in suffering.

So I believe that working hand in hand with our partners to understand their problems is the best way to make progress.

For folks who would like to learn more about any aspects of what we're talking about, there are obviously resources on the NVIDIA site.

And of course, the coverage of Jensen's keynote and the announcements, are there specific places, maybe a research blog, maybe your own blog or social media channels, where people can go to learn more about NVIDIA's work with world models and anything else you think the listeners might find interesting.

Yes.

So we have a white paper written for the Cosmos world model platform.

And you're welcome to download it and take a read. Let me know whether it's useful to you, and let me know your feedback, and we will try to do better for the next release.

Excellent.

Ming-Yu, it was an absolute pleasure talking to you.

I definitely learned more about world models and some of the particulars and the applications going forward.

So I thank you for that.

I'm sure the audience did as well.

But the work that you're doing, as you said, it's early innings and it's all changing so fast.

So we will all keep an eye on the research that you're doing and the applications and best of luck with it.

And I look forward to catching up again and seeing how quickly things evolve from here on out.

Thank you.

Thanks for having me.

It's been fun.

And I hope next time I can share more, maybe more advanced version of the world model.

Absolutely.

Well, thank you again for joining the podcast.

Thank you.

Thank you.

[Music]