Latent Space · 2025-03-04

How Claude 3.7 Plays Pokémon

Hosts: Alessio (Decibel partner/CTO), Vibhu (Latent Space Discord community)

Guests: David Hershey (Anthropic)

Claude Plays PokémonClaude 3.7 Sonnetagent benchmarkslong-horizon taskstool-using agentsvision limitationsprompt engineeringgame-playing AIextended thinking / reasoning models

Why it matters

As models get smarter, Hershey has been *deleting* prompt instructions.

Key claims

  • David Hershey (Anthropic) built "Claude Plays Pokémon" starting June 2024 as both a personal obsession and agent benchmark, iterating through every Claude model release
  • Architecture is deliberately simple: three tools (button presses, knowledge base memory, Navigator for spatial pathing), with 30-message conversation rolls that summarize at ~100K tokens max
  • Claude 3.7 Sonnet shows major improvements—finally beat Brock for the first time in 8 months—but is currently stuck in Mt. Moon for 52+ hours
  • As models get smarter, Hershey has been *deleting* prompt instructions; best results come from giving the model free rein rather than imposing human intuitions about how to solve problems

Episode summary

Summary

David Hershey from Anthropic joins Latent Space to discuss "Claude Plays Pokémon," an agent harness he built to test Claude's long-horizon task abilities by having it play Pokémon Red. He started the project in June 2024 with Claude 3.5 Sonnet and has iterated through every subsequent model release, using Pokémon as both a nostalgic obsession and a pragmatic benchmark for agent capabilities. The architecture is intentionally simple: a tool-using agent with three tools (button presses, a knowledge base for long-term memory, and a Navigator tool that patches Claude's spatial navigation weaknesses), rolling out a conversation of ~30 messages before summarization, peaking around 100K tokens per turn.

A core insight Hershey shares is that as models improve, he's been *deleting* prompt instructions rather than adding them—the best gains come from letting the model use its own intuitions rather than imposing the builder's. Claude 3.7 Sonnet showed clear improvements (it finally beat Brock, caught Pokémon, and navigated out of the lab), but vision and spatial reasoning remain glaring weaknesses—it sometimes walks into walls for 12 hours straight. Hershey sees Pokémon as a useful, if expensive (thousands of dollars in tokens), qualitative proxy for agent progress, and is excited to see how the underlying capability improvements translate to real-world agent applications.

  • David Hershey (Anthropic) built "Claude Plays Pokémon" starting June 2024 as both a personal obsession and agent benchmark, iterating through every Claude model release
  • Architecture is deliberately simple: three tools (button presses, knowledge base memory, Navigator for spatial pathing), with 30-message conversation rolls that summarize at ~100K tokens max
  • Claude 3.7 Sonnet shows major improvements—finally beat Brock for the first time in 8 months—but is currently stuck in Mt. Moon for 52+ hours
  • As models get smarter, Hershey has been *deleting* prompt instructions; best results come from giving the model free rein rather than imposing human intuitions about how to solve problems
  • Vision and spatial reasoning are the biggest weaknesses—Claude can't reliably understand Game Boy screens, sometimes confuses its own character, and walks into walls for hours
  • Pokemon knowledge from training is a double-edged sword: Claude hallucinates things like finding Professor Oak or remembering gym layouts, which often derails it
  • Adding nickname instructions made Claude more emotionally attached and protective of its Pokémon—an emergent quirk
  • Pokemon serves as a reasonable (if expensive) eval for long-horizon agent capabilities, with gym badges functioning as natural progress milestones

Source material

Transcript

Hey everyone, welcome back to another Latent Space lightning pod.

This is Alessio, partner in CTO at Decibel.

There's no swix today.

We got a special co-host, Vibhu, which if you're a part of the Latent Space community on Discord, you've definitely seen.

Welcome Vibhu as a co-host for slime.

What's up guys?

In that we had David Hershey from Anthropic today, who's the person behind Claude Plays Pokemon.

It's funny, I saw we first DM'd about playing Magic the Gathering together, NSF.

I really like that.

On all of the different nerd angles you can get me.

And then people were like, David is the person doing this, and I was like, okay, I'll DM him, and then, yeah, it was cool.

We already had a touch point.

So welcome to the show.

This is our second Anthropic episode.

We had Eric Schlons from the sweet agent before, so welcome.

Thank you.

Glad to be here.

Excited to talk Pokemon.

Yeah, so let's give a little background on this.

So, Sonic 3.7 came out a couple weeks ago.

I don't know.

Time goes by so quickly.

Monday.

I don't know, man.

It feels like two weeks ago.

And then you had this Claude Plays Pokemon thing that kind of went viral, where if people remember, there used to be this thing called Twitch Plays Pokemon, where people could go on Twitch and kind of type in the chat and then busy like figure out what the next section that the emulator would take us.

What you've done instead is given it to Claude and basically have Claude figure out how to walk through it.

I'm looking at it right now.

So far, it's been stuck in Mountain Moon for 52 hours.

Poor guy.

It probably met a 15,000 Zubettes.

So, yeah, let's talk about what gave you the idea for it, kind of the origin story that we're getting, go through the implementation.

Totally.

Yeah, so I actually started working on it in like June of last year for the first time.

And for me, so I work with customers at Anthropic and I just like really wanted to have some way for myself to be able to like experiment with agents, like in a real way, some framework, some harness where I could actually just like go to town and try some different things and see what actually worked to get Claude to do like pretty long running tasks in general.

And so I like had that in one hand and then I was like, okay, what is the thing that will make me the most addicted to making this work?

Like how will I grind the hardest actually trying this?

And Pokemon was like a pretty clear answer.

Someone else at Anthropic could actually like tried once to hook it up.

So I had a little bit of like the shell of what I needed to actually put it together then to like kick off what became an obsession a little bit in the coming months.

So yeah, like I played with it in June in the switch trying things out.

This is like Sonic 3.5 came out in June of last year, which is when I started kicked it around.

It's very good, but like, you can see like kind of signs of life, but like not much really happened.

And then ever since then, as we released new models, it's sort of been like the way that I get to know one of our new models a little bit, right?

So we released the new version of Sonic 3.5 in October and like use this to like really kind of see like what's it better at and it got better.

Like you could see it start to like, it could get out of the house somewhat reliably, which was not always true.

And it got a starter and I like even named it sometimes like it was like doing stuff.

Not great, but like it, it could move along the way to like, I'm just like, we have a quad place Pokemon Slack channel.

Like I'm sort of just like giving people updates.

So over time, as I'm like posting gifs and up-par always updates, like I'm, it's just like slightly growing a popularity of a cult following internally of people who are somewhat interested.

But then like, uh, you know, a couple of weeks ago, I was bashing an early version of Sonic 3.7 and it just like, you could just tell it had like, it was a little different.

It's clearly not still good.

As you said at the top, like it's, it's in my moon for its 50 something an hour.

This is a little bit worse than average from what I've seen so far by now.

But like, this is like, you know, about on brand.

It doesn't really have a great sense of direction.

It's pretty bad at seeing the screen stuff like that, but like, it plays the game, you know, like it gets Pokemon, it catches Pokemon, like it caught its first Pokemon.

It got out of the variety of the first time, like a whole bunch of stuff happened for the first time or like could squint and see a thing play in the game.

And yeah, like posting updates, obviously internally, it was very fun.

Like people were just like kind of going wild at the fact that this was actually happening finally.

And it was like entertaining enough that I could kind of see it.

And the other side is like, we kind of just like got finally a sense that this was like an actually useful way to measure what was going on with this model.

I mean, like, there's one thing that it's like fun and fun follow along, but like internally, like I think we got more of a sense that like, you could actually use this as a bit of a measuring stick for what's going on in the model.

I've spent, you know, how many hours I spent staring at quadpipes.

I have to have seen and read like millions of words that Claude has generated in the course of playing Pokemon over the last eight months.

So like, you can kind of get a feel for like, what's actually going better or what's it getting better at and that kind of thing.

And with this particular release, like I think the fact that it got this much better at this kind of reflects a lot of things that we wanted to be true about the model to begin with.

And those sort of lined up were like, okay, maybe this is like an interesting way to actually tell people about what's going on here for a crowd that maybe doesn't like quite know as much about software engineering and all the other ways we've told people about agents in the past.

Yeah.

Were there any other games that you consider to me seems like Pokemon is good because it's like, you know, isometric, you know, it's kind of like flat so you can score it and it's it doesn't have too many hidden facts about objects, you know, kind of like everything it's described.

Did you consider anything else or was Pokemon just kind of like by far and away the first choice?

I didn't, but it's mainly because like Pokemon was the first game I ever got as a kid, right?

It's like purely coming out of my own nostalgia.

But also like the choice play Pokemon, like I was also something that I cared a lot about a decade ago or whatever that was.

At least it's not a decade ago.

I think it's actually a decade ago, I'm sorry.

Yeah, painfully.

11 years ago.

February 2014.

Yeah, it's nuts.

Pokemon Red is 20 years ago.

Like 20, 25 at least.

So yeah, for me, it was that like since then there have been a lot of people in the probably like, oh, we can do this, we can do this, we can do this.

I think there's like a lot of fun things you can do.

Pokemon is actually really nice because like, if you don't do anything for five seconds, like there's typically not a consequence by the nature of like doing inference on a model every snapshot of time.

It's actually a pretty good game to be able to do this with.

But yeah, it was mostly just like my love for Pokemon coming through here.

You put together a very nice architecture diagram.

Do you want to screen share that so people on YouTube can follow along and then we'll put in the show notes.

If you are just listening.

I know that Vivo had a bunch of questions on that too.

Yeah, let's do it.

Very, very straightforward questions.

Basically, can we just double click into all of it?

Yeah, yeah, yeah.

It's easy.

I found it off Twitch and like no one was talking about it.

So I started sharing it around and I lost the original source.

But basically, everything in here is like pure gold.

The memory is a little interesting.

But yeah, if you want to just go through high level.

Yeah, you got it.

Yeah, I wanted to like preface that I do not claim this is like the world's most incredible agent harness.

In fact, like, I explicitly have like tried not to like hyper engineer this to be like the best chance that exists to be Pokemon.

I think it'd be like trivial to build a better pure program to be Pokemon with quad in the loop.

This is like meant to be some combination of like, understand what called good at and benchmark like an understand quad alongside a simple agent harness.

So what that boils down to is this is like a pretty straightforward tool using agent from my perspective is how I would frame it.

So at the end of the day, like the core loop is just like having a conversation that rolls out.

And it's essentially like you build the prompt, including like everything we've had up till now, you call the model, it sends back some tool use typically, you resolve those tools, then talk about summarization, but like basically some, a few different mechanisms to maintain the information you need to do something long running inside the context window.

So like what this boils down to is like when you think about what an actual prompt looks like, it rolls out kind of like this, you've got tool definitions, which describe three tools that I'll get to in a second, a short system prompt, it's like pretty boring.

It basically tells the model how to use the tools.

And like, there are about six facts about Pokemon that I give it and like a few corrective things that I've seen it do like really horribly wrong.

I'm like, Hey, you might want to consider doing this a little bit better.

But it's like really not a lot of system prompting going on.

We have that knowledge base, which referred to you, I'll talk about this is the main way it stores like long term concepts and memories as it's operating over time.

And then the bulk of things is this conversation history, which is it's like a chain of tool use, there's no like user interjections at all, for the most part.

So it's like, go, and then the model uses the tool, and then it gets resolved back.

And then it uses another tool, it gets resolved back.

So pretty straightforward.

Feel free to like cut me off too, if you've got questions along the way, but otherwise, I'm gonna keep rocking.

Yeah, yeah, go ahead.

Cool.

Okay, so most of the money of this is just like in the tools themselves.

When you think about what's going on, it's really like, it can press buttons, and it can like mess with its knowledge base.

And that's about it.

I'll talk about Navigator separately, because that's like a patch for how it actually can deal with some of its vision deficiencies.

Using the emulators, this will like execute a sequence of button presses, it'll say like press A, B left, right, whatever.

It gets back a screenshot, and screenshot overlaid with coordinates of the game.

These coordinates are used for this Navigator tool that I'll describe in a second, but it's just basically like, help quad get a slightly better spatial sense of what's going on on a Game Boy screen.

I've been through a lot with Sorry, does that come with the emulator?

Or are you adding those in?

I add that in.

Okay.

I have somewhat extensively reverse engineered Pokemon read by this point to like extract roughly every bit of possible information from it.

I don't use most of it, but like I have essentially everything you could know about the current state of the game I have exposed programmatically to be able to tinker with at this point.

I was just reading this diagram like yep, you just get what spaces are walkable based on what's stored in RAM.

And I'm like, Oh, you definitely reverse engineered this little I mean, yeah.

Good news is we also released cloud code this week, if you saw that.

And that has been, this would all not be possible without the help of having quad also go figure out how to do all of this for me because I could have done it.

But there's a lot of like tedious here are addresses in memory map that to a Python program that I had no interest in doing.

So thank goodness for quad code.

So yeah, it gives me two screenshots, it gives like a small blurb of state, which I read straight from the game.

There's a lot of this here, actually, like funny enough, the thing that matters is location.

Quad will like pretty aggressively hallucinate that it succeeded in transitioning between zones.

If you don't like tell it it did not.

This just comes down to like literal vision issues.

And and so like most of the patching of extra help I've given it been like attempts to make it so they could still play despite not being very good seeing Game Boy screens in particular.

And then it gives like a handful of like reminders.

This is this reminders, this is a decent amount of work, but it's like things like, yeah, remember to use knowledge base occasionally, and we can we tell if it gets like stuck, for example.

So if you detect that like hasn't moved in 30 spots or 30 time steps, I once saw it see like a red box on the screen that was like the doormat and then goes a text box and spend 12 hours pressing A overnight to try to clear the text box, which you see that happen once and you add in some some helpful reminders to not do that.

How much knowledge does the model have about the game itself?

You know, so for example, types, right?

They don't know about types, weaknesses and things like that.

Or how much are you trying to put into it?

Yeah, if you go to quad.ai, like it, it will tell you about like some stuff.

I have not yet decided if the knowledge that it has about Pokemon is helpful or harmful towards it playing the game.

Like half of the time when it's like, Oh, I know this about Pokemon, it then like uses that to hallucinate something.

So for example, the beginning of the run on Twitch, you saw like go out of the lab and see like this NPC in the bottom of Pallet Town and be like, it's Professor Oak, I found him.

And it's like very much not Professor Oak, but like the fact that it has like indexed on this concept is like a little stuff like that, that it's like unclear to me where it is.

But it clearly has some information about it.

There's like a million game guides about Pokemon sitting on the internet.

It's unsurprising that like there's a decent amount of information there.

I don't really give it a lot of extra information.

It picks things up.

I watched on this trim the other day, like it tried to use Thunder Shock on a geodude and it failed.

And it's like, Hmm, I forgot about that.

That does not work.

And so like clearly there's like, it knows some stuff.

It's not perfect.

It picks some stuff up as it goes through the run.

Ideally for me, like I think it's just interesting to see like what it actually learns as it's playing.

So the more it does that is the more I'm like actually interested in it.

Yeah, the one of our Discord members, Nanjung, he had a good question about the sense of self.

Yeah, like sometimes it gets confused.

Who is the actual playable character in the in the scene?

Like how do you steer that?

Uh, yeah, I think like sometimes it gets confused.

It can be applied to many things in quad playing Pokemon, in particular when it's trying to like look at the screen and understand what's going on.

So I have like attempted to prompt it all sorts of ways.

Like you are at this exact coordinate and you're in the middle of the screen and you're wearing a red hat and things like that.

And like that's all neat.

But quad doesn't particularly understand like the middle of a Game Boy screen and a whole bunch of concepts like that, which means like you can prompt all around everywhere.

But like this kind of like spatial awareness and where something is with respect to something else is something that quad still just like not great at in its current incarnation.

So one of the side of it sometimes leaves track of who it is on the screen and thinks there's something else there.

I'll keep tracking through this.

So I hinted at this like other tool that I give it called Navigator.

And this is just like the only other patch that I have for the vision issue.

So Navigator basically what it does is like quad can say it wants to go to one of these coordinates that we provide in the screenshot.

And then we like automatically press the buttons to get there.

It has to be something on the screen.

Like I'm not trying to let quad just like navigate a whole map by asking to politely.

But one thing you'll notice if you run it without this tool is if like quad wants to get from one side of a wall to another side of the wall, it like happily just tries to walk through the wall repeatedly because it doesn't quite have the concept of like what's between it.

And I spent a lot of time like prompting around this and it just like isn't it's just not it's one of those things not very good at.

So in order to make it somewhat fun to learn from quad playing Pokemon at all, we use this Navigator tool which like helps it actually get around a little bit better.

So since we covered a bit about the different tools, the prompting and the strategies, I'm curious how many tokens all this is using like there's a part to conversation history and truncating parts of them once it is in state.

Yeah, like yeah, at a high level, how many tokens is this using and then can we kind of go into where those are coming from what's being truncated?

Yeah, you got it.

When you like think about the prompts here, essentially like every step, something that looks like this gets sent.

So if you just go through what each of these looks like, everything in the system prompt is probably like 1000 tokens, pretty small, like a handful of paragraphs knowledge base, I let get up to like 8000 tokens.

So I put some like arbitrary cap on it so it doesn't go to like quad will write put a whole bunch of BS in there if you just let it keep writing stuff.

So like the cap helps constrain it to like try to think about what's actually important a little bit.

And then the conversation history, I haven't like kind of finicky, but it basically rolls out 30 messages.

That's actually like something you can tune.

I've tuned it to be 30 messages about like the best performance I've gotten.

And so what that means is it basically like, use the tool, get a response back, use tool, get response back, it's allowed to do that 30 times.

And then at that point, it triggers the summary, which takes that conversation history, summarizes it makes it the first user message.

And then we kind of roll back out again.

So the bulk of the tokens end up being in the conversation history once it's as longest.

In fact, like this, the bulk past that ends up being these screenshots, which are scaled up a decent amount to fit in.

I do actually like, I allowed to see a number of the previous screenshots, but not all of them, because you start like it ends up being a ton of context.

If you'd let it see like even 30 turns worth of screenshots.

So I'd trim out a few.

That's where the bulk of the actual tokens are.

So in practice, this rollout ends up like at max ending up around a hundred thousand tokens, I think is where it is like the longest message you ever send to the API on one of these turns.

And it will, it will fluctuate in like summarization, depending on the state of knowledge base, probably between like 5,000 and a hundred thousand tokens.

And is that like per Xion state of the game?

And roughly, do you have like a high level ballpark estimate of how long this would, how much and how long it costs to run this?

Like, let's say people want to compete.

Yeah.

Yeah.

Like how much?

I think you'd really want to think about running this as a side project in terms of the impact on your personal wallet and how much you care about Pokemon.

It's not clear to me that without the blessing of anthropic, I would have decided to take on, take on this project for my own wallet's sake.

Especially if you want to like experiment and like try 10 different things.

I mean, it's, it's costly.

I don't know.

Like I haven't spent a lot of time on the exact number.

It's not that hard to estimate.

If you like, I just told you a bunch of numbers, you can kind of back it out.

But like, I think to like do a lot of experimentation, there's like at least thousands of dollars of tokens being consumed.

So it's not a, it is not a, a cheap rollout.

Yeah.

But yeah, in this game also how some people use tokens, it's not terrible.

How many turns are you keeping in memory before you summarize?

It's 30 right now.

Yeah.

I've tried more and less.

I think like one thing you see a lot when you talk to people building agents is there's like some effective context length that actually like has the model be the smartest.

And that seems to vary slightly model by model, but, but for this model, for whatever purpose, like this 30 message work better than 20 and better than 40.

So kind of plot in between those that it worked pretty reasonably.

Yeah.

Does that change based on location?

Like how many would you want to give it to get it out of Monmoone?

So we got it.

We got to bring it with plot home.

We can let him same or 57 hours.

I actually am not sure it does.

Like I, I've, I've tried posting, like, I can have a ton of screenshots, like 20 or 30 screenshots at a time, be able to see.

And it's like not obvious that like that temporal concept is actually super relevant, relevant to it.

And again, this is just like, trust me as someone who has spent like a lot of hours obsessing over this, you can try to prompt quad a lot of different ways to understand how to navigate better.

And anything short telling it exactly what to do does not improve.

It's like actual navigation.

It's just like not a skill.

It's great at it's like good enough to, to like random walk its way through some of the complex mazes and in like good, easy areas.

It's pretty good at popping around.

But yeah, I think I can tell you if there was like a way to prompt this slightly different that would navigate better.

I would believe there is something, but it is not like, it is not an easy lift.

Yeah.

Yeah.

I asked, I just asked quad AI right now, how do you get through a mambo in Pokemon Red?

It does have, it does have a plan.

I don't, I don't know.

I don't know if it's the right, I don't know if it's the right plan.

I have seen it come up with a lot of answers to that question.

And most of them, right?

This is part of the pain when I talk about, I'm not sure if it's knowledge is better or worse.

Like usually, usually fixate like, Oh, I know the exit is on the Eastern wall and it just like spends 12 hours trying that.

Yeah.

Yeah.

It's like unclear to me that, that we're actually not just like harming it by having it.

Think it knows the answer.

Yeah.

I think that's the interesting part, right?

Like you don't want it to just know the answer.

Like model clearly knows a lot about the game.

There's like EV, IV, maxing.

Pokemon was very, very extreme.

But like, if that's what you wanted, we could just hook it up to a knowledge base, like hook it up to a guide.

You know, how to be Pokemon red, but the interesting piece here is actually like, can it figure out what to do without just memorizing the path through?

That's exactly right.

Like, and that's part of why, you know, I don't know, part of what I've realized putting sound in the world is people will draw their line of where purity is anywhere on the spectrum.

Like, is it, is this cheating idea?

Maybe I'm, who knows?

I'm like, frankly, like, I don't particularly care.

The main insight that I have is like, when we put this out, like, you learn a lot about what the model is good and bad at by staring at it.

And that's kind of what I like about it.

So evaluating the model is kind of separate than your emulator and how it can use an emulator, right?

Like we can always improve those things.

I'm curious, as you switched from 3.5 to 3.7 and sort of reasoning models, were there any degradations there?

Like did it, did it kind of get worse at anything?

And was the prompting somewhat consistent?

Like a lot of what we've seen with different reasoning models is like, you kind of prompt them differently, right?

You tell them what to do, let them figure it out.

But yeah, any any insights there?

Yeah, yeah, that's a good question.

One thing that's nice about 3.7 is on it is with like this hybrid reasoning model.

So like, it kind of can do the old thing and the new thing.

And it's actually pretty good at just like being an out of the box model, and having this like thinking mode where it can spend time reasoning.

So I didn't like really run into any like serious degradations.

The one thing I'll say is like, literally every model that has come out with Pokemon, like the main change that I have made to this agent is deleting prompt stuff.

Like, there's a whole bunch of like band-aidy prompt stuff I've added in the past.

It's like trying to like steer it away from doing a lot of the things that it got horribly stuck doing in the past.

And as the models get better, I found that just like making sure it's as simple as possible and giving them as much sort of like free rein to try to solve a problem as possible is useful.

And like the way I think about this is I'm like less confident over time that I understand exactly how a model is intelligent, right?

Like, it's capable of all of these like ridiculous things.

It does PhD level stuff in some ways and like is unable to screen as well as a four year old in other ways.

But like my confidence in like exactly what I need to tell it to do to be smart at playing Pokemon is actually like really small.

If I tell it, this is the way you need to solve this problem, that might not actually be the best way for 3.7 to solve this problem.

It's like just different than I am in terms of how it thinks about these things.

I found that just like kind of like pulling some of the unnecessary instructions where I tried to like use my intuitions about what would make the model better out of the prompt over time is the thing that just like sort of consistently as models got smarter, gotten more juice out of this.

I was watching the stream yesterday or the day before and it was a very tense battle.

I think they were like down to like two HP each and like the opposing Pokemon like missed a scratch or something and it didn't die.

And like you could tell like if I was like, wow, it was like very dramatic.

And I was talking about the game.

How high?

Yeah, is there any thought being put into like trying to have it more like do you prompt it to be more rational to let it know that it's not a real life that it's a game is like it feels like it gets very distressed when they're actually the Pokemon are actually going to die.

It's funny.

They it knows it's like you're playing Pokemon Red, like it does know that and it has a sense of that but it clearly you wrote some attachment.

I'll tell you a fun story.

We tell it to nickname its Pokemon now it will occasionally do without it but it's like more fun if it nicknames its Pokemon.

So that's like in the prompt is like it's fun if you nickname Pokemon you should consider it.

And one thing we found when we started doing that is it got more protective of the Pokemon it nicknamed like it's pretty obvious like when it catches a Pokemon now that it has a nickname it will like go heal it right away if it's hurt and that did not ever happen before.

Which is pretty like so there's some cute little things cute quirks about quad who really wants to protect its precious nicknamed Pokemon which is great.

So I will say it's kind of normal like like when I was five playing Pokemon Red and you know I had 2 HP in the midst of scratch that meant everything that was existential I agree I agree completely.

How about skilled transitioning?

So one question that I had so you're playing Pokemon Red right so you don't play silver or gold next.

Have you thought about how models can kind of learn from these games and like store these learnings and then use them again in the future?

I'm sure it's not part of the project today but curious your thoughts.

I've thought about it only a little bit which is like I think there's some like interesting when you actually read one of the knowledge bases that it has gained like on some of the longer rollouts when they're good like there's actually some like pretty decent tidbits about how it should act and try and do things and like some of the ways it succeeded and actually one of the things that's most unique about 3.7 Sonnet that I've seen is like it will have like meta commentary on what it's good at and bad at and it's knowledge base like I misperceived this thing and so like I need to be careful doing that again.

You'll occasionally see show up there which is um which pretty cool so like I could imagine there being some way to like translate that knowledge base from one game to another.

I think my knowledge base is frankly like kind of kludgy of an implementation right now like it's like more or less a python dictionary that's appended to the prompt and I think like you could you could find better ways if like your goal is to transfer across games and things like that to manage the knowledge base the quad can actually like use more or well in different scenarios um but there's definitely pieces there that like I think it would get be off on a better foot on the next pokemon game if it had that or even if like I were to restart the stream it would like have some some tidbits that it would probably like uh speed up if it like had access to things that I learned in the past that it's interesting yeah yeah I always think of that in card games you know like you have the idea of like temple and a card game and it's like you know it's the same magic as it is and you know star wars flesh and blood all these different things I feel like games is similar where like learnings you get from pokemon you can bring over to similar kind of like open world games and I think it's also like particularly interesting for some of the things that are like how quad learns how to play a game in general where it's like a pressing too many buttons at once is a bad idea like I watch drop what's going on that kind of thing like definitely is stuff that it has learned that is like interesting in a meta way uh that it's like hard to give it that sense of self necessarily in training I think sometimes like it's hard for it to know like what it's getting bad at in some scenarios but it's interesting to think about how it can learn across things well like uh some of this also is due to a simulator right so a lot of what's learning is how do I use a simulator what am I good and bad at but the model internally should know quite a bit about pokemon right like if you've played pokemon going from pokemon red to emerald to diamond having played the first one doesn't help you that much in the second right you kind of get the general concept you get what types are good against other types and the model model knows a good bit of this right but it's still interesting to show this is more so like it's shows that knowledge base has kind of helped with understanding how to use the emulator right like it struggled and then it figured it out so you know with pokemon it's like this thing can now learn how to use them yeah which is pretty cool that has been like part of what's been fun seeing all my progress on this thing I had a bit of a follow-up question to the last one with elastio so if people want to blow thousands of dollars and want to you know improve this a little bit is there anything else that you'd want to see done whether that's like improve emulator try different stuff is this just anything that like anyone watching this you'd you kind of hint them towards what you'd want to work on what they don't work on yeah no doubt if I had to guess like the the biggest lift that exists around this is probably something around the memory which I don't think is like hyper optimized right now the nice thing about the memory is like it's always in the prompt like it's it doesn't go away like some sometimes if you leave it up to quad to try to like read and load and save to memory bases like it will under utilize it or forget things but I think there's probably something there I will say all of the many many hours I've spent tweaking around the edges of this thing nothing quite does it like a new model though like fundamentally I think the limitations right now are like some smarts things like I've seen uh and I mean this in the kindest way but I've seen a lot of people in twitch tell me about ways that they could fix the navigation capabilities with a better prompt they are people would be welcome to try but I would guess that would be like a somewhat fruitless avenue I don't think I think it's just not very good at understanding at the first time I'll give you a very quick anecdote which I think is like my favorite for like why this is particularly hard I have this clip of quad leaving oaks lab and being like great I left oaks lab now I need to go up to the north end to go to route one and it just like hits up on the d-pad and goes straight back into the lab and it's like shoot I'm back in the lab I need to leave and it hits down it's like great I'm out of the lab now I can go up to route one it's straight up it just like goes up and down 12 times and it's like you're not you're not fixing that with a prompt it just literally doesn't get it it doesn't understand and so it's pretty hard to make like little around the edges changes that like make a huge huge difference yeah I mean I've always been fascinated by the fact that twitch plays pokemon actually beat the game yeah from a just look at it and you're like this cannot possibly work because you have people turn to sabotage you too in the chat not everybody's trying to solve it what what so I I just like that up it took 16 days and seven hours for twitch plays pokemon to be read how how close do you think we are to a model that can beat it in less than 16 days and do you think you need like some core like model really big jumps or like do you think it's like we're close I think I think there is model stuff at least from quad like I'm confident there's model stuff that needs to happen for it to be like really capable I can have like four spots in the game stuck in my head it's like I think there's literally no hope it's going to get through that so I think there's like a gap that's mostly around like a divvie to like see and navigate and remember visually like what's going on that I just don't think is like we've figured out yet so to me that's like a pretty big gap I do expect like I think it's going to keep getting better like I have no reason to believe that this is not just like a fundamental like ability to scale learn and understand problems thing that I think is getting better as we train models to be more capable it's sort of these like long horizon tasks like I actually do think this is like a pretty reasonable proxy of that and I think it will continue to get better for a little while I don't know if there are like affordances around images and videos and stuff like that that we need to figure out to make it work it's like unclear to me if that's true or not um but yeah I think we have a little ways before we can beat the game in 16 days I do not have a lot of faith that the uh current stream is gonna gonna be standing in victory road in uh 13 days what's been your favorite moment from like building this to thinking of the idea of just seeing it play any any like major highlight uh I think like the the hypest I have been is uh when it beat Brock the first time where I was just like you know I've been doing this for eight months and then like a few weeks ago like I kick off a run wake up the next morning and it's like oh my god oh my god and I and it was the the other good thing about it's like I woke up at 8 a.m and I checked my I have it send me updates to slack um this is like ridiculous things but um it's like literally like about to start the Brock battle like I open my phone it's like oh this is like happening right now and it's like a pretty hype way to start a day I think that was my uh highlight I have a lot of like other cute things like some of the cute nicknames it's done over time and things like that are are endearing but but that was like the peak hype for me it was like we beat a gym leader like we've got a badge like Claude's doing it you know and then in the follow-up so I noticed that you mentioned it eventually started beating multiple gym leaders were these all the same run was it different ones was it yeah I have like I the the run that you saw that's like on the graph we put out alongside like in our research blog is like a single run that I have watched like get through at least surges gym and then it got a little past that and the reason that that's where we stop reporting is because that's like the physical amount of time that occurred between when I started it when we launched the model so that's like uh it was a very hyper hyper up-to-date graph on on the best run we had so awesome um I know we're running out of time my last question is are we gonna work on magic on cloud place magic next or maybe we can do like the magic arena intro yeah uh funny story there was a project I did right before I joined in thropic that was like training an open source model to like slightly be better at picking draft or cards in a draft like I was training it on like the 17 lands data that exists to like learn how to how to pick cards out of a packs a little bit better uh and I did talk about that in my interview so you get hired at an anthropic so so if I've put time into this I'm ready I am ready for that project too that I have that code sitting around as well somewhere I'm really getting it all my nerd nerd ml slash slash gaming hobbies here yeah no I'm ready I don't know if you're planning on open sourcing any of the pokemon stuff but if you want to work in open source on the magic stuff I'll be happy to take a elaborate awesome we talked about it I don't I don't know yet what the plan is there's like a certain amount of like this is not my day job that I have to figure out how I want to uh yeah and deal with that uh we'll see yeah um awesome David any parting thoughts anything people have missed no I think like the one thing I do like to drive home when I've been talking about this is like I really do think like this is just demonstrating like a thing that is going to make agents better with this model you know like this is a very fun way to see it but like I think the thing is that it like has some ability to like course correct update and figure things out a little bit better than models have in the past and even if there's like stuff it's dumb at like it tends to have an ability to like power through it in a new way and so I think what it's exciting to me is just like I think there will be some real world stuff that comes out of this model once people play with it and I'm pretty excited to see like how people take the skills we put on display a little bit here or or lack thereof in some cases and figure out how to turn them into actual agents that do stuff I have a quick last question on that actually is there any guidance or any way that you like quantitatively measure the evals of this system like a lot of it is vibes a lot of it is how far it gets where it gets stuck but like are there are there any lessons or any specifics about how you measure how it actually does so I've done a lot of like little small tests of like put it in this scenario and see what it does but I like frankly the best test I have is just like run it 10 times on diff on this configuration and like see how quickly it progresses through milestones of the game it's the best thing about games right like it's why the games are such a useful thing they there's literal like benchmarks of gym badges that are moments of progress in a game which are like ways to evaluate what happens and so I think like how quickly it's able to make progress is actually a pretty recent or a reasonable like eval if a slightly expensive one to calculate it's an integration test not a unit test.

Wow awesome David thank you for joining thank you Veeboo for filling in on the whole site too.

Yeah my pleasure thank you for having us I appreciate it.

Awesome good to see you