The a16z Show · 2025-08-16

Google DeepMind on Genie 3: Real-Time Interactive World Models

Hosts: Anjney Midha, Marco Mascorro, Justine Moore

Guests: Shlomi Fruchter, Jack Parker-Holder

world models, Genie 3, real-time video generation, interactive AI, robotics, embodied AI, sim-to-real, foundation models, Veo comparison

Why it matters

Genie 3 creates real-time interactive worlds from text prompts with minute-long memory

Key claims

Genie 3 generates interactive, persistent worlds in real time from text prompts at minute-plus length
Special memory was a deliberate design goal but still surprised researchers when it worked; the team deliberately targeted the conflicting objectives of real-time generation, high resolution, and long memory in one model
The model generates frame-by-frame without explicit 3D representations (no NeRF or Gaussian splatting), which the team credits with enabling generalization
Text-based prompting in Genie 3 replaced Genie 2's image-prompting approach, dramatically improving controllability and text adherence

Episode summary

Google DeepMind researchers Shlomi Fruchter and Jack Parker-Holder join a16z partners Anjney Midha, Marco Mascorro, and Justine Moore to discuss Genie 3, a foundation model that generates fully interactive, persistent worlds in real time from text prompts. Unlike Genie 2, which relied on image prompting, Genie 3 takes text directly as input and generates each frame individually without an explicit 3D representation such as NeRF or Gaussian splatting, an approach the team says is critical for generalization.

The standout capability discussed is Genie 3's "special memory": minute-plus world persistence where objects a user has looked away from reappear correctly. Fruchter and Parker-Holder describe this as both a deliberate design goal and a surprising result when it worked, noting that the conflicting objectives of real-time generation, high resolution, and long memory were set as a deliberate technical challenge. They highlight physical behaviors that emerged with scale, including realistic water, lighting, and terrain-aware interactions (characters slowing on uphill ski slopes, swimming in water, reacting appropriately to puddles).

The team positions Genie 3 as an environment model rather than an agent, designed to be composable with other agents (Demis Hassabis publicly referenced combining it with DeepMind's SIMA agent). They see major applications in robotics training—potentially closing the sim-to-real gap—embodied AI for AGI, entertainment, and education. Genie 3 is currently a research preview with no public timeline for developer access, and the team emphasizes they don't yet know which downstream applications will matter most. They also distinguish Genie 3 from Veo 3, noting different priorities (Veo 3 has audio and higher cinema-quality threshold; Genie 3 is interactive and lacks audio).

Emergent physical behaviors with scale include realistic water, lighting, and terrain-aware agent interactions (skiing speed, swimming in water, puddle reactions)
Genie 3 is designed as an environment, not an agent, and is composable with other agents like DeepMind's SIMA for robotics and embodied AI applications
Team positions this as a step toward AGI via embodied agents; sees potential to bridge sim-to-real gap in robotics by combining data-driven realism with simulation flexibility
Currently a research preview with no public release timeline; explicitly distinct from Veo 3 (which has audio and cinema-quality focus)

Source material

Transcript

All of the applications basically stem from the ability to generate a world just from a few words.

You look at it and you're like, "There's a world that's generated in front of your eyes and it's amazing that it's happening."

I was very excited about how far can we push that.

And it's at the point where, like, a human who is not an expert will watch it and think it looks real, right?

And I think that's pretty incredible.

Genie 3 from Google DeepMind can create fully interactive, persistent worlds in real time from just a few words.

Today, we're joined by the team behind it, Shlomi Fruchter and Jack Parker-Holder from Google DeepMind, plus Anjney Midha, Marco Mascorro, and Justine Moore from a16z.

We'll talk about how it works, the special memory that keeps worlds consistent, the surprising behaviors it's learned, and where world models are headed next.

Let's get into it.

Jack, Shlomi, Genie 3 has taken over the internet.

We're honored to have you on the podcast today.

Has the response surprised you?

Reflect a little bit about the reaction.

We weren't sure how big it's going to be, but today it felt definitely that we have something that was for a long time coming, basically being able to generate environments in real time.

I think a lot of work that was done in Google DeepMind and outside pointed to that direction, but we really wanted to make it happen and I hope we have.

Team, why don't we reflect internally a little bit about what we found so game-changing about Genie 3 and why we're so excited to have this conversation?

Yeah, for sure.

I mean, first of all, it's an amazing model.

I think there's a lot of excitement around the special memory, the consistency across all the frames.

I think this is the first time I can see you can have some sort of interactive way of doing this stuff with videos, because it used to be like you would do one prompt and you would have 15 seconds of a video, but now you can actually have some sort of interactive element to it, which I think is very exciting.

So can you elaborate a little bit more on your insights on this?

Like how was like, for example, figuring out what data you should collect, how you make it very interactive and keeping the flow of the whole video, which I thought was phenomenal.

Sure.

Yeah.

So I think you kind of highlighted a few capabilities, sort of the length of the generation, the consistency of the world, maybe diversity as well of the kind of things you can generate.

I think the main thing is that obviously we made progress in quite a few different fronts in separate efforts.

So we had this Genie 2 project that was much more sort of like 3D environments that it could generate, and it wasn't super high quality.

It felt like it coming from Genie 1, but it wasn't the same quality as things like Veo 2, which was the state-of-the-art video model at the time and came out in December, at roughly the same time; it came out a week later.

And obviously internally, there was a lot of discussion between the two projects about the different directions we were pursuing.

And then Shlomi had also worked on GameNGen, which is the Doom paper as people know it, which I think you guys also wrote a nice piece about straight after it came out.

So I think that also attracted a lot of attention.

And so that we felt that across these different projects, we had quite a lot of interesting things that would naturally kind of combine.

And we could basically take the most ambitious version of the combined project and see if it was possible.

And fortunately it was.

And quite I think the timeline is probably the bit that surprised many of us, because obviously we set ourselves these goals and we tried very hard to achieve them.

But you never be totally sure how it's going to actually feel when you've got to that point.

I think it ended up being something that resonated with people a lot more than maybe we expected, but we always believe us.

Yeah, I'll just add to this that I think the real-time component is really important.

Not many people have experienced it firsthand, but we really tried in the release to at least have a few trusted testers interact with it, and also to convey the feel of it by adding these overlays that show what happens and how people can use the keyboard to control it.

And I think there is something magical about the real time aspect.

I felt it for the first time when our model, the GameNGen model, started working fast enough and we were just like, oh my God, I can actually walk around.

And it was a bit of a wow moment.

And yeah, I think there is something when it responds immediately that is really magical.

I think that's kind of sparked the imagination of many people when the Doom kind of simulation came out and here we really wanted to push it to somewhere we weren't sure it's going to work.

So it was definitely the edge of what's possible.

I think that's how we felt.

So we just said, yeah, let's try and see if we can make it happen.

I think you guys, I don't know if this was on purpose or not, but you perfectly timed it when everyone on X and Reddit and everywhere was making those videos of like characters walking through games, but they obviously weren't interactive.

They weren't real time.

And then you guys came out with this release that was like, now this is an actual product and it blew folks away.

I'm curious because you can imagine so many different applications for this, right?

Like more controllable video generation, or making it much easier to create games, even personal gaming where someone's just kind of creating their own world that they walk through, RL environments for agents, robotics.

Are there any particular use cases that you're most excited about?

I think all of the applications basically stem from the ability to generate a world just from a few words.

And I think for me, I saw this kind of potential when I started looking at video models. I think it was pretty early, when one of the models was Imagen Video, which was a model by Google Research.

But there are a lot of models that they were very basic compared to what we have today, but the ability to simulate something like you look at it and like there's a world that's generated in front of your eyes.

And it's amazing that it's happening.

And I think at this point, I was very excited about how far can we push that, right?

So I think there was one way to do it.

And Genie is definitely another way to make it a bit more interactive.

So I think all of the applications basically stem from this core capability.

So it can be entertainment, of course, as you said, it can be training agents, it can be helping agents to reason about the world, education.

So I don't think any particular application is more important than others.

I think it's really up to how developers in the future will build on top of that.

Yeah, I would give basically the same answer in the end of a different journey to get there.

Which is that I personally worked in reinforcement learning for a few years before starting the Genie project in 2022.

And the motivation originally was like that in RL at the time, we had this problem where we'd say which environment should we try and solve, right?

Because we'd already done Go, which people thought was years or decades away, and then that was solved in 2016. Or not solved, but we reached superhuman level in 2016.

And then StarCraft three years later, which is not a particularly long time for something incrementally significant.

So around 2021, it was a question of what should we try and do with RL. We know that the algorithms can learn superhuman capabilities if they have the right environment, but we don't know what the environment should be.

And so we are working on designing our own ones like with colors.

And then instead, it seemed like the more promising path, when you had the first text-to-image models coming out, was like, what if we just think long term: what's the way to really unlock unlimited environments?

That being said, over the course of the project, and originally we started it, I guess in 2022, it was very focused on that one application, but it seems quite clear now that this could have a big impact on all those other areas you mentioned, right?

So I think it's like language models in 2021: you probably wouldn't have guessed that an IMO gold medal a few years later would come that fast as a direct application of that technology, right?

It was probably, oh, it can help me with my emails or whatever it was.

And I think it's really cool to build these kind of new class of foundation models and then see what people can imagine doing with it.

And that's one of the exciting things about sharing the research preview, right?

So you got this kind of feedback.

So we're hoping a lot of these things can happen.

One of the things in the research preview post, Jack, that blew me away was this, and it wasn't even your first GIF, I think in the blog post, it was either second or third.

You had this visual of somebody painting the wall with the paintbrush and then the character moves out of, right?

Like out of, to a different part of the wall paints and then moves back.

And the original paint is still there.

And I didn't believe it.

I was like, there's no way.

And then I read and you're right, it's described as a special memory.

So the persistence part for me, I'm not taking away from all the other stuff.

The interactivity is amazing.

But I think, broadly speaking, folks expected that at some point video generation, for example, would become real time.

When I saw the Genie 3 post, it was like, okay, they actually went and did it.

But the special memory, the persistence was when I kind of sat up in my chair and I was like, how did that happen?

Could you talk a little bit about when did you discover that as an emergent property or was that a specific design goal?

What's the backstory on that?

Because that feels like a big unlock, Jack.

Why don't we start with you?

Yeah.

So that's a great question.

I'll say a few things.

So the TLDR is it was totally planned for, but still incredibly surprising when it worked that well.

Right.

So that specific sample, when I saw it, it was hard to believe.

I actually wasn't sure that was the model generating it for a second. I had to watch it a few times, and I'd really check, freeze the frames, and look back to check that it was the same.

But go back a few steps.

So obviously Genie 2 had some memory, right?

So this got kind of lost because, I mean, Genie 2 came at a time when there were lots of announcements, very exciting announcements, Veo 2 being only a few days later.

It was a busy time of the year.

And the main headline act was that we could generate new worlds at all.

Right.

So that was the thing that we wanted to emphasize, but it did have a few seconds of memory.

And we had a couple of examples.

Like I created a robot near a pyramid, looked away, looked back, and the pyramid's there, but it's kind of blurry.

It's not perfect.

But some other models around the same time or more recently didn't have this feature.

Right.

So people kind of indexed to that because they didn't notice the early signs of it in the Genie 2 work.

And then for Genie 3, we basically went much more ambitious on the same sort of approach.

Right.

And we made it like a headline goal for ourselves is can we make the memory be what it is?

Right.

We said we want minute plus memory and real time and high resolution all in the same model.

And those are kind of conflicting objectives.

Right.

So we set ourselves this kind of technical challenge.

And we said, if we target this, then it's just about feasible and it'll be pretty incredible.

And then you still don't know, obviously, how it's going to pan out.

So then when you get to the end of the research one, seven months later to see the samples, it still is quite mind blowing to be honest.

So yeah, it's kind of planned for, but still pretty cool and exciting when you see it, because, you know, research projects aren't, like, sure things.

So one thing that we didn't want to do: we didn't want to build an explicit representation, right?

So there are definitely methods that are able to achieve consistency.

And they do that through some explicit 3D representation, you know, NeRFs and Gaussian splatting and other methods that basically say, okay, if we know what the world looks like, and we use these kinds of prior assumptions that the world remains pretty much static, then we can build a representation of what we're looking at.

So that's great, I think for some applications, but we didn't want to go down this path because we felt it's somewhat limiting.

And I think we can definitely say that the model doesn't do that, and that it generates frame by frame.

And we think this was, this is really key for the generalization to actually work.

Every time someone interacts with it for the first time and they like test, they look away and then look back.

I'm always like holding my breath.

And then they look back and it's the same, and I'm like, whoa, it's really cool.

And how long is this special memory?

I don't know if you can talk about it.

You mentioned a minute plus, but is there some sort of like measure that you have?

Is it like, can you keep it for half an hour or what is the limit on that?

There is no like fundamental limitation, but currently the current design will limit it to one minute of this type of memories.

Yeah.

It's also a real-time trade-off, I guess, as well.

We felt that because of the breadth and the other capabilities that like a minute was sufficient.

So this version, like it's quite a significant leap, but obviously eventually you'd want to.

One more related question, on going from Genie 1 to 2 to 3. With LLMs, for example, you have DeepSeek R1, where they saw in the paper that the longer they kept it running, they would suddenly see these interesting behaviors.

Like the model was like reasoning or like would give like a, Oh, I'm wrong in this.

I should self correct.

Do you see anything in kind of like this scaling from two to three, do you see any sort of like interesting behavior that you were not expecting that suddenly just appear by increasing the amount of data and the amount of compute?

Yeah, I'll just say, I think there is a bit of like overall definitely like many generative models would see that improvements happen with scale.

I think that's not secret.

And I think it's not the same type of intelligence as, I would say, LLMs. I'm not sure if reasoning is the right term, but we definitely do see some things, like it can infer that if you approach a door, it makes sense for the agent to maybe open it.

So you might see that it's starting to do that, for example, or there's some better world understanding that happens over time, and things just look better and more realistic.

So I think these are the trends that we've observed.

Yeah.

And from Genie 2 to 3, I think the real-world capabilities really increased.

Right.

So on the physics side, some of the water simulations, and you can see some of the lighting as well, are really breathtaking.

I think we have this example of the storm on the blog.

And that one I think is super cool.

And it's at the point where like a human who is not an expert will watch it and think it looks real.

Right.

And I think that's pretty incredible.

Whereas with Genie 2, it was like, it kind of understands roughly what these things should do, but you know it's not real, right?

You can look at it and you can clearly see that it's sort of not completely photorealistic.

So I think that's quite a big leap on the quality in that side.

Yeah.

One of the things that was really cool in all the examples was the water is sort of a great way to see like, does it understand like what the world is and how objects interact.

And that example, someone posted of the feet going in the puddle was amazing.

But then there was also that example of like a cartoon character.

It was more of an animated style who was like running across this kind of green patch of land and then ran into this blue kind of wavy thing that looks like water.

And he started swimming, which I thought was really interesting.

Like were there particular things you had to do around that for the model to be able to understand like how characters should interact in different environments and different styles?

What you're basically describing is like the real breadth of different kind of environment terrains and worlds and things like that, like water or walking on sand versus going downhill and snow and how the agents sort of interactions should differ given the terrain that they're in.

And I think that that really is a property of scale and breadth of training.

So this is very much like an emergent thing.

I don't think there's anything like really specific we do for this, right?

Again, you hope the model has learned this because it should have general world knowledge.

It doesn't always work perfectly, but in general, it's pretty good.

So for the skiing examples, you do go fast when you go downhill, and then when you turn and try to go back uphill, it's very slow, if it's possible at all.

When you go into water, obviously, you hope, as you said, that the agent will start swimming and splashing and this does typically happen.

When you look down near a puddle, hopefully you're wearing Wellington boots, like this kind of stuff does just kind of make sense.

And I think it feels pretty magical because it very much aligns with what you were thinking about the world and the models just generated it all.

So yeah, that's also one of the really exciting things for sure.

Yeah, and on all that, one kind of trade-off that we typically have is that we want the model to do two things.

We want the model to create the world in a way that looks consistent.

So like Jack said, if you walk in rain or in puddles, then you're probably wearing boots.

But if we provide it with a different description or like the prompt is saying something else, we want it to still follow the prompt.

And there is some tension here because some things are very unlikely, right?

You might say I want to wear flip flops and jump in the rain or whatever.

Then the model still has to try and create something that is very unlikely.

And that's where typically video models maybe find it more challenging.

And that's where our models might find it more challenging, but it's still successful to a surprising degree in going into this kind of low-probability area.

And I think that's really in a way that's what we want.

Like many people, they don't want to just look at the video that looks like their own, no, maybe this wrong.

But something a bit more exciting.

And that's where I think this is the magic of the models that they can take you to places that maybe are not so likely to be in reality.

The text following is really amazing in this model.

And that does feel really magical.

I think that's something that Veo does really well as well, right?

Like pretty much what you asked for.

It's really well aligned with text.

And we have that with Genie 3.

So you could describe very specific worlds and really arbitrary silly things.

And it pretty much works.

We actually had this discussion because people were very disappointed to find out that the video I made of my dog actually was not my dog's photograph.

I just described her in text.

And yeah, I don't know if that's a big secret, but it looks exactly like her.

And the model just kind of knows, right?

I think that's pretty amazing.

So I think that's actually a really important capability that we didn't have with Genie 2 as well, right?

Because we relied on image prompting.

And so there was some transfer issue where you rely on Imagen to generate the image.

And that often does look really good, but it's not necessarily a good image for starting the world.

Whereas going directly from text, you get the controllability to prompt anything you want.

Plus it just kind of naturally works because it's in the correct space for the model to do its thing.

And that's something really powerful.

And why is that, Jack?

What do you think led to such a massive instruction-following and text-adherence gain?

Because it's a pretty hard thing to do.

Well, I mean, our team had never really worked on this.

And so Genie 1 and 2 both worked with image prompting.

And so obviously, for this next phase, we leveraged a lot of the research done internally on other projects.

And personnel-wise, I mean, Shlomi has obviously been co-leading the Veo project.

And so we were able to kind of build on a lot of other work and ideas internally.

And that basically allowed us to kind of turbocharge progress.

So if we'd done this by incrementally building ourselves on an island, it would have taken, I think, a lot longer than being part of Google DeepMind, where we have these teams that have a lot of knowledge in different areas that we can lean and build on.

Which I think is what's super exciting about being in the company right now: we have so many experts in different areas that we can seek out advice and help from.

And Shlomi, a question for you on that, you know, having led the Veo 3 work, which is kind of mind-blowing.

Is there a reason why this is Genie 3 and not like Veo 3 real time?

So I think it's definitely a bit different, right?

Like Genie allows you to navigate an environment and then maybe take actions, right?

And that's not something that Veo at this point can do.

But there are other aspects that are different that Genie doesn't have, right?

Genie doesn't have audio, for example.

So we just think it's, while definitely there are potential similarities, it's sufficiently different.

Also, another thing is that at this point Genie 3 is not available as a product.

And we do think about Veo as a product that is kind of mainstream and became very popular. What the future holds, I don't know.

But I mean, at this point, we just felt it's sufficiently different in terms of what capabilities and how we think about this.

So Genie 3 is pretty much a research preview, right?

It's not something we are releasing at this point.

Yeah, something we think about a lot is what are the edges of a modality?

We're talking about this all the time, which is, you know, the lines start blurring pretty quickly between real time image and video and then real time video and interactive, whatever, world generation, world model.

I don't think we have a good word for what Genie 3 is yet, but you guys call it a world model, which is, I think, a great term.

But in your mind, like, where does the video generation modality stop and real time worlds start?

And do you think in the future, are these converging into basically one modality?

Or if you had to predict over the next few years, do you guys think actually, yeah, these will diverge into completely different disciplines?

It seems like they share kind of one parent today, which is, you know, video generation.

But where is the world going?

Do you think are these two completely different fields?

From my perspective, they're different.

So I would say modality is one thing, right?

We have text, we have audio.

Even within audio, there are different types of sub-modalities.

Speech is not the same as music.

We have different products for music generation.

And we have other models for speech generation, speech understanding.

So even within one modality, you can have different flavors.

And then of course, you have video and other things.

So I think basically, I would say the modality is one dimension, and another is how fast or how quickly we can create new samples.

And maybe another, completely orthogonal, direction or dimension is how much control we have, right?

So I think we kind of picked a specific direction, or a specific vector in the space, for Genie 3.

I think different products, different models can try and go in different directions.

I think the space is pretty big, and there are a lot of trade-offs to be made.

So yeah, I don't know, I think it really depends.

Some people believe there is one model that will do everything.

I think it's still an open question what the best way is.

We're in a place where engineering is a big part of our research, right?

And actually making those like it's not a paper, right?

Where we want to build something that people can actually use.

So I think abstract ideas get you to some point, but to actually build things, you have to make some concrete decisions.

And I think it forces you to decide what you want to do and what you're doing.

Yeah, I think this is a really interesting point, Mike.

And ultimately, it has to be driven by like technical decisions.

And also like the goals, right?

So if you look at the models right now, we obviously made a choice that we wanted Veo 3 and Genie 3 to be separate projects this year, right?

And if you look at them both as they are right now, they each have very different capabilities that the other model does not have.

And technically, to combine all of that already into one model would be, I think, very challenging. I mean, Veo 3 has a higher quality threshold than Genie 3, right?

And it has very different priorities, right?

So then the natural thing you could say is, oh, well, you know, what if we just took these together and combined them?

But that may not be the best next step for either of those two models, right?

So it may not be the case that the thing that the other one has is actually the most compelling thing for a completely different experience.

And I think that given the breadth of interest in both models, right, there's actually quite a small set of people that are like really actively using both and they tend to be more folks like yourself who are just more broadly interested in AI, right?

Rather than like really downstream use cases.

So like you mentioned agent training for one, which is very high action frequency and requires more egocentric, I guess, worlds where tasks can be achieved, but doesn't require, you know, the high-quality cinema-style videos you could generate with the Veo model, right?

It's quite different.

And then on the filmmaking element, I mean, I'm also not sure that Genie 3 is really there at this point.

And that wouldn't necessarily be the goal.

I don't know.

On filmmaking, Justine can do some pretty incredible things with the filmmaking tools today.

You'd be surprised.

Give me access and I will make amazing films with Genie 3.

I guess that did kind of get to one of my questions though, which is the work you guys are doing is incredible.

And you clearly probably have so much going on in your brains just to coordinate training these models and managing these teams.

How much do you also have to think about like, what are the downstream use cases of the model when you're training it?

Because you could imagine a world in which you're just like, we don't really know or care what people are going to do with it yet.

We're just going to go in the research direction.

We think we should go and see what happens.

But based on how you guys are talking about it, it sounds like you've also been pretty thoughtful around what are the different capabilities or features needed for different potential use cases, at least of different models.

Yeah, I'll say that basically, we have some applications in mind, but that's not what drives the research.

It's more about how far can we push in this particular direction?

Can we make all of that work?

Like really great quality, really fast generation, real time, very controllable.

I think that's kind of what drives us.

I think that's how we developed Genie 3.

And the applications kind of follow.

And I don't think, to be honest, I don't know what would be the applications for.

I think we're very surprised.

I'd like to mention, like, Veo 3.

Well, people find new ways in how it can be useful.

And to prompt it, like the visual prompting stuff, people just discovered it, right?

We didn't even think about it initially.

So I expect kind of the same thing.

And I think that's why I am excited for more people to be able to have access in the future.

And in general, our approach is to make sure that over time, there is more access to the models that we build.

And I think that's the only way to discover what's the right potential.

I guess one question somewhat related to that: how do you think about going forward, like Genie 4 or 5 or any other models, like what is top of mind right now?

Like if you wanted, for example, to focus on, I don't know, it seems like gaming could be one of the applications: multiplayer-type games where you have two special memories, or two completely different views, that at some point merge.

How are you thinking of like going forward?

Like, what's next?

Is it like scaling these models?

Or more data, more compute?

Is it creating this sort of, like, multi-universe type of thing where you have multiple players, multiple people looking at the same model from different views?

What's top of mind for you guys?

Top of mind, I think for the next few days might be a vacation.

After that, maybe walking my dog in the real world.

And then I think you mentioned a bunch of really interesting things to be honest.

And like, I think we are, we're still collecting a lot of feedback on this current model.

Right.

And I think that in general, we are most interested in building just the most capable models, right?

And so we would hope to have even broader impact in future and really enable other teams to do cool things with it, right?

Both internally and externally.

And for me, it's like, I just started this with like a very, very focused vision about AGI.

And I still think, honestly, for what I'm excited about for AGI, which is more embodied agents, I really believe this is the fastest path to getting these agents in the real world.

And I think we made a big step towards that.

But I'm still like, sometimes even more excited about applications I never thought of that come up from other people seeing the model, right?

So I think it's kind of this like trade off of, you know, obviously you want to focus on some applications, but then you want to be open minded about others.

And I think that's the real joy of building models like this, right?

As you get to see all of these people can be way more creative than me with it.

So I think that there's always really cool things that we can do.

And I honestly don't really, can't really tell you in one year what the biggest application will be.

But we'll definitely be trying to build better models.

Yeah, I'm really excited.

But I think, as impressive as, you know, maybe the model is, we're still very far from actually simulating the world accurately, and being able to kind of put a person in there and let them do whatever they want.

And I mean, when I say far, it doesn't mean it's far in terms of, you know, calendar time, because the timeline may really even accelerate, but it feels like there is more work to do to get there.

And I think I just imagine like, once we can actually, you know, whatever the form factor would be, but step into this world, and just kind of like maybe tell it how we want to, what you want to experience.

There's so many applications.

Imagine, for example, someone is afraid of talking to people on a stage or in a podcast, right?

They can simulate that, right?

Or you can have someone who is like afraid of spiders, they can maybe actually see themselves getting over that.

So that's like, you know, just one example of something that actually my wife thought about.

It's not my idea.

So I think it's really like there's so many things, right?

So I think it all hinges on the ability to simulate the world and maybe put ourselves in it, maybe seeing yourself from the side, and potentially having agents interacting with things. And yeah, the realism, and really making it work in a way that is similar to our world, I think is really key.

I am actually personally petrified of skiing and the model is already quite good at that.

So I might, when things quieten down, spend some time with it, because I promised my wife that our children would grow up knowing how to ski.

And we're getting close to the age where I have to live up to my promise and I'm not sure if I want to do it yet.

So we have to improve the model for you, Jack, so you can actually get that in distribution.

I hope so.

We were just talking before we started about how we might see applications in robotics. I mean, you were talking about embodied AI, and right now the limitation in robotics is the data, right?

Like how much data you can collect, and now you can probably just generate a lot of different scenes that you were not able to get before purely from recording videos.

So I think that's another thing that is pretty exciting.

And I mean, congrats on the model.

It's phenomenal.

On the robotics application, there was a conversation that I was listening to from Demis yesterday where he was talking about your guys' work on Genie 3.

And he mentioned that there's an agent, I think you guys call it SIMA, right, which can then interact with the Genie agent.

And as I was hearing him describe it, it was kind of breaking my mind: you had one simulation agent asking the world, asking the Genie agent, to essentially create a real-time environment for it to interact in.

Right.

Which was when I realized, oh, the way you guys have built it, it's composable with other agents.

Can you talk a little bit about why that's so important for robotics, like Marco was saying, and what are the major limitations today that you think we'd have to overcome as a space to make the rate of progress in robotics much faster than it is now?

So we designed it to be an environment rather than an agent, right?

So Genie 3 is very much like an environment model.

Like we don't see it as like an agent itself that can like think and act in the world.

It's more just a general purpose sort of simulator in a sense, right?

That can actually simulate experiences for agents.

And we know that like, learning from experience is a really important paradigm for agents, right?

That's how we got AlphaGo, because the agent AlphaGo learned by playing Go by itself, trying new things, right?

And then learning from feedback, the reinforcement learning, learning to improve itself, and actually discovering new things. Like, it discovered new moves, like Move 37, that humans didn't think was a worthwhile move, right?

But actually AlphaGo learned that it was, because it could experience and try things for itself.

And then in robotics, we have this paradigm right now, where there's some data-driven approaches, right?

Where you can collect data in a quite laborious way.

But it looks like the downstream tasks and looks real.

And there's not so much of a mismatch between the two domains.

Or you can learn in simulation, right?

But take the robotics simulations, even the best ones.

And we have some of the best ones.

At DeepMind there's MuJoCo, right, which we work with.

They're still quite far away from the real world, right?

And you have a sim-to-real gap.

But even the sim-to-real gap itself, I think, is kind of poorly named, because what people consider to be real in robotics is typically still a lab or some very constrained environment where you've got a bunch of spotlights on a robot and then tons of researchers crowding around watching. You know, what's really real for me, what I mainly picture, is the ability to walk my dog when I'm too busy: to hold the lead, cross the street, you know, see someone who's scared of dogs and know to go around them, see someone throw a ball, change direction, like all these challenging situations in the real world, right?

And of course, you still have gripping, you still have these other tasks, but you need to really discover your own behaviors from your own experience, right?

And doing that in physical embodied worlds is super challenging, for so many reasons. Firstly, it could be expensive to collect data in those settings: you have to keep moving the robot back to where it started every time it doesn't do something right.

And also it could be unsafe, right?

So there's many reasons why we can't really do learning from experience in the physical world, right?

So we do it in simulation.

But really what we think with Genie 3 is it's the best of both, right, because you're taking a real-world data-driven approach, right?

But then you've got the ability to learn in simulation.

So it kind of combines the good parts of each of those.

And so that's why I think it could be super powerful, not just for a robot example, but I really love this idea: when it rains in London a lot, not having to take my dog for the second walk would be great.

And as you can see, we built a model basically for Jack's personal applications.

That's what's driving the project at this point.

Yeah, but clearly, I just think clearly, Jack, it's time to move to California.

Yeah, that's a solution.

I mean, I personally love California, but my wife's not my most like a bit sorry.

We're gonna answer.

Yeah, just to touch on, you know, maybe put a finer point on the robotics part: I think it's definitely, you know, robotics is more than visual, right?

Like we need to be able to... I think this is an important point.

We can drive the decisions of the robot by looking around, but it still has to connect to the situation, decide where to move, and respond to the environment.

So I think there are definitely some gaps.

But still, at the core of the problem is being able to reason about the environment.

We think this is something that world models, or general-purpose world models such as Genie 3, can really help with, and maybe with future research we can actually bridge those gaps of physical understanding and actually getting physical responses from the world, which is a very interesting direction to explore.

One last question from my side, either of you can answer this, but like, is Genie going to become public?

Like, can developers access it at some point?

Or is there like some sort of idea on this?

So as you can see, we are very excited about having more people accessing it.

So we definitely want to make it happen.

There is no concrete timeline at the moment.

But you know, I'm sure once we have more to share, we will.

Awesome.

One of the things I've been thinking about a lot is we see sort of with every modality, like, you know, maybe first LLMs and then image and video and audio.

There's like early kind of glimmers of something really exciting in a project or a research preview.

And then there's like a ton of data and compute and researchers kind of poured at the problem.

And you hopefully see this sort of like exponential progress till you eventually get to the point where like, you're out of data or the improvements don't come as easily.

I'm wondering for your thoughts, like where we are on sort of that curve for world models?

That's a really good question.

I actually have a super hand wavy, somewhat swerving answer, right?

And I think it's actually both.

So I think the current capabilities are actually already quite compelling.

And so you could make the case that like, if what you wanted was a minute of photorealistic any world generation with memory, that could actually be the end goal, right?

And two or three years ago, I probably would have said that was a five year goal.

And so at that point, if you just wanted to improve that, I think you probably end up with this maybe like, I think the jump from Genie 2 to Genie 3 was absolutely massive.

And it went from being, like, kind of a cool bit of research that was showing signs of life, to something that could already be very compelling.

But I think there's a lot more that you can do with this, and Shlomi kind of referenced this himself, right?

Like it's not the case that you're dropping yourself in the world, right?

And, like, it's like really being in the real world, for example.

It's actually quite different to that.

When you do, you know, take a minute to look away from the screen.

It's quite a bit richer out there.

And that's just for the real world.

We also want this ability to generate completely new things, right?

So I think we've got a huge gap to close, right?

With new capabilities that we want to add.

But I think it's maybe a bit different to language models, or actually maybe it is similar to language models, but with language models, there's been like lots of new steps that have actually come on top, like that maybe we didn't think were possible.

We thought things were plateauing.

And then a new idea came that made a significant change.

And that has happened a couple of times in the past few years.

So I think that there's a few more of those left for sure.

My final question for you guys is, are we living in a simulation?

Oh, yeah, that's every entry as to...

My thinking about that is actually, yeah, I thought about that a little bit.

I think that if we live in a simulation, my take is that it doesn't run on our current hardware, because it's analog and not like now it's continuous.

All of the observations are continuous and there is nothing like...

But maybe the quantum level is some limitation of our...

You wanted to go philosophically, so you go.

It's some kind of like a hardware limitation of the simulation we run on.

So yeah, take it or leave it.

That's a great answer.

Clearly there's a lot of work for the TPU team to do.

Yeah, maybe quantum computing will be actually running our actual simulation.

So yeah.

That's a great place to wrap.

Shlomi, Jack, thank you so much for coming on the podcast.

Thank you guys.

Thanks guys.

Thanks for having us.

Thanks for listening to the a16z podcast.

If you enjoyed the episode, let us know by leaving a review at ratethispodcast.com/a16z.

We've got more great conversations coming your way.

See you next time.

As a reminder, the content here is for informational purposes only.

It should not be taken as legal, business, tax, or investment advice, or be used to evaluate any investment or security, and is not directed at any investors or potential investors in any a16z fund.

Please note that a16z and its affiliates may also maintain investments in the companies discussed in this podcast.

For more details, including a link to our investments, please see a16z.com/disclosures.