NVIDIA AI Podcast · 2025-08-12

NVIDIA's Sanja Fidler on Spatial Intelligence, Cosmos, and Physical AI

Hosts: Noah Kravitz

Guests: Sanja Fidler

Spatial Intelligence Lab · Physical AI · Cosmos world foundation model · 3D Gaussian Splatting · Omniverse · Isaac robotics platform · SIGGRAPH · World models · Video generation · Robotics simulation · Visual language models

Why it matters

Fidler's Spatial Intelligence Lab in Toronto was founded in 2018 and recently rebranded to focus on 3D intelligence for physical AI.

Key claims

Fidler's Spatial Intelligence Lab in Toronto was founded in 2018 and recently rebranded to focus on 3D intelligence for physical AI.
Physical AI framed as the next major industry wave, likely larger than generative/agentic AI, with the thesis that all moving devices will become autonomous.
Four-pillar charter for spatial intelligence: virtual world creation/modeling, physics simulation, 3D understanding, and action/control.
Evolution of NVIDIA world models: GameGAN (Pac-Man) → DriveGAN (driving) → Video LDM (2023, latent diffusion for video) → Cosmos (World Foundation Model targeting physics from real-world video).

Episode summary

In episode 269 of the NVIDIA AI Podcast, host Noah Kravitz speaks with Sanja Fidler, VP of AI Research at NVIDIA and head of the NVIDIA Spatial Intelligence Lab in Toronto, which she founded in 2018. Fidler frames spatial intelligence as "intelligence in 3D"—the 3D counterpart to LLMs and vision-language models—and positions it as the foundation for physical AI, which she argues will likely surpass generative and agentic AI in scale, echoing Jensen Huang's view that anything that moves will eventually be autonomous.

Fidler traces her lab's trajectory from early content-creation work to full-stack physical AI: the four pillars of modeling (virtual world creation), physics-based simulation, 3D understanding, and action. She walks through the evolution of NVIDIA's world models—from GameGAN (Pac-Man) and DriveGAN to Video LDM in 2023—culminating in Cosmos, NVIDIA's World Foundation platform trained on real-world videos specifically targeting physics for physical AI rather than creative video. She emphasizes that Cosmos is intended to work alongside Omniverse and traditional physics solvers rather than replace them.

At SIGGRAPH, NVIDIA announced 3DGRUT, adding ray tracing capabilities to 3D Gaussian Splatting and integrating it with Omniverse and Isaac so users can scan a room with a phone and immediately use it to train robots. Fidler identifies visual language models (VLMs) as the key reasoning breakthrough enabling robots to handle the long tail of novel scenarios, and argues the winning formula will combine traditional physics solvers, world models like Cosmos, and VLM-based reasoning. She closes with a call to the graphics community to help solve open physical-AI problems and describes the breakthrough she's waiting for: useful home robots in her lifetime.

Cosmos is trained on unedited real-world recordings and specifically doubles down on physics rather than creative video, with physically simulated graphics used mainly for benchmarks.
At SIGGRAPH, NVIDIA announced 3DGRUT: ray-tracing-enabled 3D Gaussian Splatting integrated into Omniverse and Isaac, enabling phone-based room scans for immediate robot training.
VLMs (visual language models) identified as the critical breakthrough letting robots reason about novel, long-tail physical-world scenarios.
Fidler argues physical-AI success will require combining traditional physics solvers, world models like Cosmos, and VLM-based reasoning rather than any single approach.

Source material

Transcript

[Music] Hello, and welcome to the NVIDIA AI Podcast.

I'm your host, Noah Kravitz.

This past August, three of NVIDIA's research leaders gave a special address at SIGGRAPH, the annual International Computer Graphics and Interactive Techniques Conference that's been running since 1974.

One of those people is here with us today.

Sanja Fidler is VP of AI Research at NVIDIA, where she leads the NVIDIA Spatial Intelligence Lab in Toronto, Ontario, Canada.

Sanja is here to tell us about the lab, to talk about the research she's most excited about right now, including what was presented at SIGGRAPH, and to share a little bit about her own journey through the worlds of research and artificial intelligence.

So without further ado, let's get to it.

Sanja Fidler, welcome, and thanks for joining the AI Podcast.

Hi Noah, and hi audience.

I'm very excited to be on this AI Podcast.

We are very excited to have you.

Thanks for taking the time.

There's a lot going on, obviously, and congratulations on the special address and everything else at SIGGRAPH.

So we wanted to start with a little bit about your own journey.

You followed your passion for computer vision and artificial intelligence across Europe, and into North America.

Can you tell us a little bit about what first got you interested in the field, and how your journey took you to Toronto?

So maybe I'll start with my youth, and there were actually three important break points that led me to where I am.

And the first one actually starts with my dad.

So my dad would sit on a chair next to my sister and me and tell us bedtime stories.

And surprisingly, he was very good at it.

He was a scientist, and so he would tell us stories about scientists.

For example, he would tell us about Nikola Tesla, who was born in Croatia, and my mom was also born in Croatia.

I was born in Slovenia.

And was this in Slovenia?

Where did you grow up?

I grew up in Slovenia, so my dad was born in Slovenia.

My mom was born in Croatia, my dad was born in Slovenia, and I was born in Slovenia.

So he would tell us stories about how at a young age, Nikola jumped from the roof of their house holding an open umbrella thinking he would fly.

And every night there would be a new episode about his inventions, about the creation of radio, alternating current.

Obviously, we didn't understand what he made it sound like.

And the competition with Thomas Edison, it would be almost like a Netflix series.

And for a child, it was very exciting.

So I just could not wait to hear more the next day.

So, kind of, my childhood heroes were not the movie stars or music stars, they were scientists.

That really kind of shaped me.

So perhaps not surprising, you know, one day I appear in front of my parents and proclaim I want to be an inventor.

And there was even a photo of me and my sister, actually my sister dressed as a robot.

Quite a bit of money for her to put some cardboard boxes around her.

And maybe not surprising, she became an economist.

So this moment, you know, pretty much set the profession I was going to enter, and this was at a very young age.

The second moment was really thanks to my mom.

And this was in primary school, I was very young.

And at some point, I got pretty ill.

It was something like COVID almost, I think it was called whooping cough or something.

So I was home with a fever, coughing, for two to three months.

And I basically missed a lot of school.

I missed like, you know, fractions.

Big chapter on math.

So I come back to school.

And of course, I didn't understand anything they were talking about.

And I developed some sort of resistance to going to school.

And at one point I threw a tantrum, like crying hysterically on the floor: I hate math, I don't want to go back to school.

And my mom is actually a teacher.

And of course, you know, having a school-hating child, that was not an option.

Yeah, it happens.

So even though she was an English teacher, she would sit down with me and work with me on the math.

And she made it very interesting.

So she had this really nice way of teaching me through giving me puzzles, math puzzles.

And I began to like really love it, you know, as kids understand things, they also love it.

And I think to this day, what drives me at the core is solving problems.

And I think that's still kind of stuck with me.

That pretty much settled, you know, what I was going to study. I was determined, at age, I don't know, 12, 13, that I was going to study math.

And the third moment was really, you know, kind of thanks to my grandma.

So I was already doing my PhD.

And I decided to work on computer vision.

I saw this one talk. I actually started with math, even in my PhD.

And I saw this talk on someone recognizing cats and dogs.

And it was very early AI at that point.

And it just kind of spoke to me, you know, I was always kind of dreaming of robots, and computer vision felt like the first step to take.

So, you know, I was there doing my PhD.

And my grandma, you know, she was a very smart woman.

She was actually one of the first female plastic surgeons in Yugoslavia.

Oh, wow.

Yeah, she was always telling me these stories, you know, how she graduated from medical school, and the day they were graduating, they were out having fun, and, you know, the sirens came on, World War Two started, and, you know, she basically had to just go to the operating room, and that was it for years.

And, you know, like, basically fear became alien to her.

And it wasn't for me, really, right?

So I did my PhD in Slovenia, really, for fear of leaving, leaving for the wide open world alone. As a woman, that was somehow not encouraged; my mom would scare the hell out of me.

So towards the end of my PhD, and I was kind of working on this AI, like something similar to Deep Networks, just my own take on it.

I was presenting at a conference, and a famous professor at UC Berkeley stopped by the poster, really liked it, and invited me to visit his group at Berkeley.

And, you know, I was beyond excited, but I still carried this kind of weight of fear and ambition.

And I talked to my grandma and she said, you know, Sanja, don't listen to your mom, just go.

Actually, she passed away a few months later, that was January 13, 2009.

And the next thing I remember, I'm sitting on a plane, I look at my plane tickets to California.

And it was January 13, 2010.

It was exactly one year later.

Exactly a year.

Exactly.

I'm not kidding.

Like, this was exactly, it was only- It was meant to be.

Meant to be.

I was both scared and excited, but, you know, two of my life was- Right.

Was that your first time traveling abroad?

No, I would go before, you know, visit New York with my family.

Okay.

This was the first time I was going alone, and leaving.

Very different.

You know, I packed my bags and there it was, you know.

Yeah.

It was scary, but- And landed in Berkeley of all places.

I spent a few months there, maybe seven, eight months, and came back, graduated, and then I did my postdoc, and that was at U of T.

So that's kind of what brought me to Toronto.

Right.

Amazing.

I feel like the graphics and interactive industry owes a big thank you to many members of your family for inspiring you, and then Grandma kind of giving you that nudge and everything.

That's amazing.

Why Toronto?

What was the link that brought you to Toronto?

Yeah.

Actually, U of T, the University of Toronto, was doing really great stuff in deep learning.

And like I said before, that was kind of my PhD.

I was really inspired by doing these hierarchical representations to recognize objects.

I was reading all these, like, neuroscience papers that basically said this is how the brain works, right?

Yeah.

And I was just like, when I was isolated, I kind of had my own take on what that would look like.

And then I was reading these deep learning papers.

Those papers were really appealing to me, and I was kind of, you know, like, deciding between Berkeley and U of T.

And I decided to go to U of T to kind of like, learn, learn from, you know, Geoff Hinton and people like that.

And that's why I landed here.

Yeah.

Amazing choice.

And so you've been in Toronto since?

Yeah.

I mean, I was there for a postdoc and then I got like a research assistant professorship in Chicago.

So I did a small stop, you know, there for a year and a half.

Then a faculty position opened at U of T, and then I came back in 2014.

Amazing.

And so now you head up the NVIDIA Spatial Intelligence Lab in Toronto.

That's right.

Yeah.

I joined NVIDIA.

That was 2018.

So about seven years ago?

Seven years ago.

I actually met Jensen at a computer vision conference.

That was 2017.

And we had a really great chat about simulation.

I was already working on simulation for robotics at the time.

I was telling him about it.

And I think he was also thinking about it.

So it was great conversation.

And then later he gave me a call or we went on a call and he said, you know, come work with me.

And I had other options, but the fact that he said, come work with me and not for me, just told me everything about joining.

And that was it.

What a great story.

That's fantastic.

So tell us about the lab.

What is for those listening who might not fully get the term, what does spatial intelligence mean?

And what's the charter of your team?

What are you doing at the lab?

And how you may have just said, sorry, I was imagining Jensen, you know, that whole conversation.

So but when was the lab founded?

2018. That basically started with me.

And then we slowly grew and also increased scope.

So we recently renamed ourselves to spatial intelligence.

I would say it's a new, encompassing word.

So spatial intelligence essentially denotes intelligence in 3D, right?

Intelligence in the 3D world.

So the same as we have LLMs representing intelligence in language, you have this whole family of visual language models for intelligence in images.

Now we need to build the same capabilities, but in 3D.

And the question is, of course, you know, what that is and why. Maybe I'll motivate it with robots, because really, you know, that's one of the prime motivations.

So at the end of the day, like robots need to operate in the physical world, in our world.

And this world is three dimensional and conforms to the laws of physics.

And there's humans inside, right, that we need to interact with.

You know, we typically hear such AI, which operates in the real physical world, referred to as physical AI.

So I'll maybe use that term quite a lot, right?

Yeah, physical AI is really kind of the upcoming big industry, very likely larger than generative and agentic AI. You know, Jensen typically says everything that moves, all devices that move, will be autonomous.

So that's kind of the vision.

So a robot, to operate in the real world, obviously needs to understand the world.

What am I seeing?

What is everything I'm seeing doing?

How is it going to react to my, my action, right?

So understanding, it needs to act, you know, if I want to drive you from A to B, make you dinner, you know, I need to actually like control that robot to make an action.

But then there are two other capabilities needed that are perhaps a bit less obvious.

So basically, it's 3D virtual world creation and modeling and simulation.

And the reason is that robots need to have like a virtual playground that almost perfectly or like we would like it to mimic the real world as faithfully as possible, where basically they can train their skills and also test their skills before we're going to deploy them in the real world.

Like this is basically the critical thing we need to solve for deployment of robots.

Basically, spatial intelligence comprises these four core capabilities: modeling, so creation of the virtual world, but then also, you know, modeling how it evolves in time based on our actions, then understanding, and action in the 3D world.

And the applications go beyond robots: you know, architecture, construction, gaming, everyone that kind of has 3D data, 3D world data.

We first started with this virtual world creation, so content creation.

And then we went on, because in order to develop spatial intelligence, you also need physics, how the world evolves in time, and understanding.

A year or so ago, maybe last year, so much has happened with generative AI in particular in the past few years that it kind of blurs together sometimes when I talk about it.

But I remember when video models started coming out, the first, you know, Sora from OpenAI and some of the other ones, and discussion around, well, these video models are actually also physics simulations.

You know, we're discovering we thought we were making a video model, but now we're realizing that, you know, there are properties of physics happening inside of the videos that are output and all of these things.

What makes a good physics model?

And when you're talking about modeling things that are going to happen in the future, I've also heard that described as, well, what an AI model does is really predicting what's going to happen in the future, right?

And if it's a video that's outputted sort of frame by frame, how do you think about the four things you just described relating to one another?

And I don't know, maybe you can talk a little bit about physical AI in particular and the evolution of how these models came to be, you know, so accurate that we can now use them in simulations.

Yeah.

So, NVIDIA Cosmos and the models you're describing, like Sora, Veo 3, and so on, learn their capabilities from videos.

And NVIDIA Cosmos in particular is kind of targeting physical AI, which really means that it's doubling down on modeling physics, not necessarily the creative aspects, but physics, capturing how our world works.

So it's forming these world simulation capabilities by learning purely from videos, and we specifically target collecting videos that are real-world recordings.

You know, there's no human editing involved.

And if there's any graphics data, it's actually all physically simulated.

Okay.

How we're using physics is mainly for benchmarks, actually.

So you want to create, because you have full control, right?

I can have two bouncing balls, three bouncing balls with this material, you know, and more complex world.

And there you can really go like, you know, every single test, how good are you at that?

How good are you at that?

And that's our test, and you kind of hill climb that performance.

Right.

Yeah.

It's an evolution of models, right?

So the first world model came out, I think it was Jürgen Schmidhuber, right?

2019.

It was almost parallel to us.

Ours came like a few months later, where the idea was really kind of like AI replaces the game engine, kind of. You know, it's crazy: you have the user interaction, and the next frame is not human-written code.

It's the AI.

It's generated.

Yeah.

Obviously, that was early on, it was, I forgot exactly what they were using.

We were using a GAN; ours was called GameGAN.

So we trained it on Pac-Man, you know, so you could actually play Pac-Man on a keyboard, with the screen all generated by AI.

We had an episode of the podcast with somebody who created GAN Theft Auto.

So like Grand Theft Auto, but being generated.

Oh, that was yours.

Okay.

That was our stuff.

Cool.

That's cool.

Yeah, yeah, great.

I forgive me, I don't remember offhand who the guest was, but yep, that was so cool.

We released the code, so, you know, people just went crazy, and it was amazing to see what, you know, where it went.

Yeah, we actually also applied it to driving.

That was 2021, it was called DriveGAN, you know, same technology, but just trained on a lot of autonomous driving videos, and it almost kind of became a driving simulator, you know.

Cosmos really took this to new heights, but at the time we were kind of imagining how this could be useful for physical applications.

Also, that was all kind of GAN based with all kind of known limitations.

And, you know, in the meantime, diffusion models came out, and it was clear that, you know, like that's also the next big leap in video modeling.

And actually, in 2023, we kind of partnered up with some of the students that did latent diffusion, which really was kind of a big breakthrough in images, because you didn't model pixels anymore, but these kind of latent codes, which made it significantly more efficient.

So we kind of applied that and extended that to video, and that led to Video LDM, where really, you know, you could see the future by looking at those results.

Obviously, it was not Sora yet, or, you know, Cosmos, but like, we were onto something, right?

And then, you know, the industry actually kind of switched to this latent diffusion architecture.

And then, you know, then it's about scaling, and obviously, the architecture changed a little bit behind the scenes, and data and so on.

And that basically is creating the modern age models.

So I understand that your lab has grown recently.

Can you talk a little bit about the new areas that the lab's now encompassing, and how that kind of furthers the overall goals, the overall charter of the lab?

Yeah, yeah.

So when I joined, we joined Rev's organization, and Rev was building Omniverse.

Omniverse is this, you know, like state-of-the-art simulation platform where robots can be robots, as Jensen says.

Right.

And talking to Rev at the time, he mentioned, you know, it was a huge team working on it.

Obviously, they were able to render really fast, you know, they had this real-time ray tracing and so on.

So really kind of the key missing piece was content.

And mind you, this was like 2018, right?

I was like, AI is ripe for that.

And that's how we started.

I said, okay, like, how can we actually make this platform workable, especially for physical AI, where it's really about modeling the world, which is messy, diverse, you know, like, it's really like challenging.

So we started with content and, yeah, we developed, you know, a bunch of techniques for that.

And through, you know, through kind of the period of our lab, we became more and more ambitious.

And, you know, we realized that the pipeline for physical AI, or this 3D spatial intelligence, also needs to change, because you need to have, you know, better physics algorithms: how physical objects interact with each other, whether it's plastic, whether there's water inside, whether I can put it on fire.

And, you know, there is no cheating, like in a game where I can kind of stage it. Like, it all needs to be simulated.

It's real.

It needs to feel real, right?

I can put my finger on it.

Bad things happen, right?

Like, the robot, if it's training, then it needs to kind of experience it in this way.

So, you know, it was clear, kind of, that we need the next evolution of physics and I can join the team.

And also, you know, perception is obviously important.

And now I joined the team and she was, she's very interested in 3D perception, but going towards open world, meaning, you know, like, anything, anything in this room, I should be able to recognize it and understand my affordances with it.

And then, you know, that can lead to a better action.

So, we expanded the team basically, like, by building blocks that we actually need.

Right.

You know, building the full stack for spatial intelligence.

And you mentioned Omniverse.

Your lab has been very involved with the creation of Omniverse.

What are some of the innovations, some of the research breakthroughs you mentioned, you know, physics models improving.

What are some of the other innovations that really made Omniverse possible and helped it grow into what it is today?

Yeah, I think, you know, first of all, Omniverse is created by many teams at NVIDIA.

True.

Right.

Much, much, much larger than any single team.

Really, kind of, the vision of Jensen and Rev.

It has a mountain of technology, you know, from real-time ray tracing powered by DLSS, which puts, you know, AI in the loop, to AI-powered physics solvers, like I was saying.

So, that's just scratching the surface.

And I really can't take credit for any of that.

So, I can maybe tell you a little bit about where, what we were thinking when we started with our 3D content creation work.

Okay.

And I would really say that we doubled down on two directions.

Both turned out to be very important in the end.

And it's really kind of this perseverance through time that created something of value.

So, the first one was, okay, you know, clearly there's a graphics pipeline.

We know everything and how that works.

So, why don't we lift images and videos to 3D to be fully compatible with existing graphics pipelines?

And we really doubled down on differentiable rendering as this foundational technology.

Meaning, you know, graphics goes from 3D and renders to images.

If this is differentiable, meaning kind of like amenable to AI.

So, this path led to, you know, one of the first image-to-3D models, which we called GANverse3D.

One of the first generative models of 3D assets, GET3D.

And as the latest achievement, we also made foundational improvements for 3D Gaussian Splats.

I don't know whether I need to explain that in further detail, but essentially it's a really, you know, like a new neural graphics primitive that you can easily optimize from videos.

And we added ray tracing capabilities to it.

And at SIGGRAPH, we actually announced the integration of, we call it 3DGRUT, 3DGRUT into Omniverse.

So, basically, now you can download Omniverse or Isaac, which basically helps you train robots.

You can scan, you know, with your phone or webcam, this environment, and boom, you have it in Isaac and you can start, you know, training robots right here.

Like, there is no, you know, you don't take weeks for it.

It's amazing.

It all makes sense in terms of, you know, looking at the way you're describing the way things have built up and the building blocks and adding features.

And, oh, cool, that makes sense.

And then I sort of listen to you describe, like, oh, take your phone, wave it around the room, and now the robot can train in the room.

And it's still, it's just so exciting.

It's so mind-blowing.

It's very cool.

Yeah, yeah, it's exciting.

But that's basically what you want, right?

Like, scale.

I want to just go and take what's up here, and then same, right?

And boom, the robot is training.

So, the second one, the second path is, we kind of saw some fundamental limitations of this graphics pipeline, because, you know, you need to also model agents and physics, like, it all kind of, you know, felt daunting.

So, we also took this bold approach of AI that is basically the world model, right?

That does the whole content creation, world simulation based on user interaction, all in one.

And that was the chain of models that I described earlier, right?

So, like, two different things that all now kind of, like, came together in, like, really, I think, useful capabilities.

Yeah.

So, how has the advent of AI and 3D content creation, and sort of specifically in workflows, changed the way that people get the work done, the way that researchers or designers can create objects and create scenes and kind of manipulate things?

What's the impact of AI been so far on these workflows?

Yeah, I think, like, this technology really democratizes access to these tools.

And basically, it gives everyone the chance to become a creator.

You know, I have no idea how to use 3D software.

I mean, I tried it a few times, but now I could be reasonable, you know?

I could be reasonable if I wanted to, you know, actually use robotics.

I can reasonably get, you know, this object in a simulated world.

Yeah.

The cool thing is that it also gives additional superpowers, right, to creators that have the talent, you know, so artists, designers, they can actually use this technology to now do many more creative things.

I have seen so much amazing stuff coming out that I wouldn't even think of.

I think it's really kind of empowering to the entire population in different ways, which is great to see.

We had Danny Wu from Canva on recently.

He's the head of AI products there, if I got that right.

And he was describing a similar thing, but kind of more on a level I could relate to because I write, I talk, I mostly work with words.

I can't draw or paint to save my life, you know?

And so that ability, having that superpower, if I want to see how something might look, an idea, it lets me do that now, right?

And so I can only imagine, in the 3D physical world, with, you know, 3D design, talking about simulations, the stuff you've seen must be pretty cool.

Yeah, I think so.

So talking about this 3D world and physical AI, and you spoke to it a little bit earlier, but how are all of these advances with the technology and computer vision included enabling robotics, autonomous vehicles?

You talked about it a little bit, but maybe you can kind of put a point on, you know, how physical AI has really started to take off.

Yeah, if there's anything that people take away from this talk, it's that physical AI can't scale through real-world trial and error.

Like, it's simply not possible to put my car out there or a robot out there, and it's going to mess up my kitchen here by bumping into everything and so on, right?

This is super expensive, unsafe, and it's just going to take us forever to get there, right?

So simulation is really the answer here, right?

And if we do it right, if we are actually able to use computer vision and other techniques to basically go and somehow create, you know, these virtual worlds that feel real, then it's possible to train in this kind of parallel virtual universe, safely, essentially, you know, accelerating the time until we can deploy robots and also bringing the overall cost down, right?

Because now we are doing it in the cloud as opposed to having this very...

To remodel your kitchen after every test, yeah.

So what are some of the methods that are key to making simulations physically accurate or more physically accurate as we go?

Yeah, I think like the jury is still out on how exactly to achieve, you know, a physically accurate simulation, something I can completely trust, at the scale, diversity, and realism of the real world.

It's hard in the traditional way with these, like, different physics solvers, you know, that's hard.

For the world models, it's also kind of hard.

There are still hallucinations and, you know, sometimes objects disappear, go through one another, and obviously that's going to keep improving.

So the likely success is going to come in some sort of a combination of both.

And obviously, we're going to keep pushing on each direction as good as possible.

And maybe in between, until we reach a point, there is a combination of that, right?

Using these world models with this more traditional approach that really makes sure that the physics and simulation are correct.

Yeah, the other very important message is that in computer vision and robotics, the really big breakthrough is a VLM, so visual language model, that is able to reason.

Okay.

This is basically, you know, how humans navigate the long tail of very diverse and rare scenarios in the physical world.

So we're kind of bringing that knowledge from language into, you know, the physical world.

We're encountering completely new situations that we have never seen before in training, and now the VLM could come to our rescue, basically like really solve this long tail.

And that is really kind of the discontinuity from before, right?

That is a tool that we have now that before was missing.

Right.

So that's probably the most bold statement I can make right now.

Fair enough.

Do the traditional methods get baked into these models, or how do you go about combining them, the two approaches?

Yeah.

What you could do, for example, is you can use kind of the traditional way, which is also not full of AI everywhere, right?

To kind of have a coarse simulation in 3D, with solvers where we know how to model certain effects, right?

So you can make that simulation.

You can render it out, and that becomes a guidance to our world model.

You know, I think that is important.

Tony is telling me, "Ah, I show you roughly here and here," and then it becomes much more feasible to create 3D pixels out of that, both in time and space.

So that's kind of like what we're thinking right now.
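As a rough illustration of the pipeline described above (coarse solver, render, guidance to a world model), here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the bouncing-ball solver, the function names, and especially `placeholder_world_model`, which only upsamples the guidance and adds noise. It stands in for a real generative video model and is not the Cosmos API or the lab's actual method.

```python
import numpy as np


def simulate_ball(steps=8, dt=0.1, g=-9.81, y0=1.0, restitution=0.8):
    """Coarse solver (hypothetical): a point mass bouncing on the ground."""
    y, vy = y0, 0.0
    heights = []
    for _ in range(steps):
        vy += g * dt              # semi-implicit Euler: update velocity first
        y += vy * dt              # then position
        if y < 0.0:               # ground contact: reflect and damp
            y = -y * restitution
            vy = -vy * restitution
        heights.append(y)
    return np.array(heights)


def render_guidance(heights, size=16, y_max=1.0):
    """Rasterize each simulated state into a low-resolution frame (1 = ball)."""
    frames = np.zeros((len(heights), size, size), dtype=np.float32)
    col = size // 2
    for t, y in enumerate(heights):
        frac = min(max(y / y_max, 0.0), 1.0)      # clamp to the visible range
        row = int(round((1.0 - frac) * (size - 1)))  # row 0 is the top
        frames[t, row, col] = 1.0
    return frames


def placeholder_world_model(guidance, upscale=4, seed=0):
    """Stand-in for a generative video model conditioned on guidance frames.

    It only upsamples the guidance and adds noise, so the output follows
    the simulated trajectory "roughly here and here" in space and time.
    """
    rng = np.random.default_rng(seed)
    block = np.ones((1, upscale, upscale), dtype=np.float32)
    video = np.kron(guidance, block)              # nearest-neighbour upsample
    noisy = video + 0.05 * rng.standard_normal(video.shape)
    return np.clip(noisy, 0.0, 1.0)


heights = simulate_ball()
guidance = render_guidance(heights)
video = placeholder_world_model(guidance)
```

The point of the split is that the solver fixes where things are at each time step, while the generative stage is only asked to fill in appearance around that guidance.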

Yeah.

Cool.

I'm speaking with Sanja Fidler.

Sanja is a Vice President of AI Research at NVIDIA, and we're talking about the work that her Spatial Intelligence Lab in Toronto has been doing, along with the evolution of AI and models and solvers in the mix and all the things that go into making these models more accurate so we can rely on them and trust them, as, Sanja, you were saying.

We've also talked a little bit about SIGGRAPH, and I mentioned at the top that you gave the keynote, a special research address, alongside some other NVIDIANs at this year's SIGGRAPH.

What are some of the notable things from NVIDIA's presence at the show that maybe we can impart to listeners here?

What are some of the things that they should take away from what NVIDIA did at SIGGRAPH?

Yeah.

This year at SIGGRAPH, we really tried to send a message on physical AI in the keynote.

The reason is because this is a really important area with big impact, and the SIGGRAPH community has a lot to give.

A lot to give.

A lot of expertise is already there.

Gaussian splats, NeRFs, that all comes out of that.

We discussed simulation is key.

I literally mean simulation is key.

Everyone in the audience should feel empowered to help us in this quest of robotics.

The cool thing is that it feels also early stage.

Like I said before, it's open-ended.

We don't know yet.

We are hypothesizing.

We really hope the audience connects with, "Here's a new challenge for you."

What do I do next?

Graphics is very mature, but here is a new challenge for you that maybe needs to think outside the box.

I think the key to success, we suspect, will be the combination of Cosmos.

This is NVIDIA's World Foundation platform.

This is both video generation, so simulation of the world, as well as reasoning.

Reasoning about the laws of physics, reasoning about all the agents in the scene, the scenarios, and so on, and physics simulation.

All these three pieces interacting together, that's our bet of creating really physically, but also semantically accurate simulations of the real world in the future.

I think that's really what I hope the SIGGRAPH audience takes away from the keynote.

That spirit of...

It even goes back to what you said about when you met Jensen, and he invited you to come work with him.

That spirit of collaboration, open source being what it is, conferences obviously, but we've had guests across all different industries come on the show and talk about how important it is to share research and trade notes with other people who are on the other side of the world, working in other industries, etc.

As AI continues to touch and evolve and change virtually every industry you can think of, how important is that, or what are you getting from this experience of working across so many industries?

Does it feel like AI is bringing industries together, or does it feel like different industries are hunkering down and siloing in their own approach to how they use these emerging technologies?

Yeah, it's definitely bringing them together, right?

Because the workflows essentially are very similar.

It's very similar, and the difference is the data and the domain expertise that's different.

Actually, there is even sharing in how I do autonomous driving versus humanoid robots versus factory simulation architecture.

There are some commonalities between things that could be shared, and having this data-driven approach to simulation could really bring industries together and benefit from one another and build tech that can essentially make all of these industries better.

Open source, you mentioned open source.

I am a believer in open source, and it's great to see NVIDIA is also a big believer in open source.

Like I said, in a lot of areas, we're also still early on, and that's the only way to keep progress going and really build up these capabilities.

Absolutely.

Even though in a lot of ways the time frame that we've been talking about the last seven, eight years in particular isn't all that long in AI terms, and particularly in this recent generative AI revolution, it's a long time.

So you've been doing this almost since the beginning of early object recognition and AI, now through to everything we're talking about today.

What's next on the horizon?

Is there a breakthrough that you're either waiting for, or maybe more secretly thinking, "I think this is going to happen soon."

Whether it's something that's particular to your own work that you're doing or more broadly, what's the next big breakthrough in AI that you're looking forward to or maybe just hoping to see?

Well, I think it's going to be robots.

I started the story with my sister in a cart.

Your sister, of course.

Me dreaming about a robot taking the dog out in the morning, which my parents made me do, and I really like to sleep in the morning.

That was kind of the early dream.

My grandma lived 30 years alone.

My grandpa died quite early.

So the first talks I gave as a faculty all started with a grandma and a cute, WALL-E-like robot in a kitchen talking to each other, and the robot helping.

It's a common thread of, "Let's build this technology because it can be really powerful and useful."

I now believe after many years in this field that we're likely going to see that in our lifetime.

Robots in some form, autonomous cars, are already out there to some extent, and more is coming.

So that's the breakthrough I'm looking forward to.

Have you ever had a robot in your home with you for a period of time?

You ever lived with a robot?

Have you ever had a robot that kind of wipes the floor?

You have something that does more than that?

So, Sanja, as we look to wrap up here, and this has been fantastic, again, thank you for taking the time.

What advice would you give to researchers out there who are interested in the work that the Spatial Intelligence Lab is doing, might be interested in collaborating, working with you in some way, joining the lab, collaborating from afar?

And then in particular, what are some of the skills and research areas that you think are becoming increasingly important now and will continue to be, at least for the next few years?

Yeah, actually, the bar that we have is both low and high, so I'll explain what I mean.

I think what we are looking for is people with immense passion.

I feel like I still haven't lost the passion of the first day.

I wake up and I am excited.

So I think the passion is what drives us forward.

This is only a podcast, but the passion comes through in your voice.

I'd say you're doing all right.

Yeah, I mean, that's what drives it because, you know, as a researcher, life is not easy.

Most of the time, things don't work.

It could be six months, a year, where you don't get that result.

So you really need that kind of passion and energy that makes you keep going.

I think wanting and having the ability to go technically deep is very important.

You know, not jump from one thing to another.

Things will hurt, but let's learn really the fundamentals to the level we need so we can innovate.

And I guess maybe to my first point, the high level of perseverance, right?

We want to keep going and there is no wall thick enough, right?

The rest, I think we can teach people.

Like if you have these basic things, a lot of the other stuff comes along.

Yeah.

So in terms of like, you know, it's mostly also interest, right?

So we are very interested in this 3D world modeling and understanding 3D worlds.

So people who share a passion for the same topic, you know, please contact us.

Fantastic.

For listeners who want to learn more about the lab, about the work we've been talking about, where are the best places to go online?

Is there a homepage for the lab?

Is it on the NVIDIA site?

Any social media handles to follow?

Where would you send them?

It's all on the NVIDIA website.

Probably if you search for Spatial Intelligence Lab NVIDIA, or Toronto AI Lab NVIDIA, which is our old name, it should pop up.

Yeah.

Fabulous.

Sanja, again, this has been great.

I think that the story, just imagining, you know, your dad telling you those stories as a kid and with your sister, and then the advice your grandma gave you.

Overcoming your fear, get out there.

Just absolutely fantastic.

Congratulations to you and all the team, everyone you work with for all the work you've been doing and SIGGRAPH, of course.

And we really look forward to following your progress in the future.

Best of luck.

Yeah, thanks.

It was really fun talking to you.

[Music]