OpenAI Podcast · 2026-05-14

OpenAI Discusses Image Generation Breakthrough with DALL·E 2.0

Hosts: Andrew Mayne

Guests: Adele Lee, Kenji Hata

image generation, DALL·E 2.0, photorealism, multilingual AI, creative AI agents, education technology, AI product development, OpenAI Codex integration


Why it matters

DALL·E 2.0 offers a major leap in image generation quality, with improved photorealism, text fidelity, and multilingual support.

Key claims

DALL·E 2.0 offers a major leap in image generation quality, with improved photorealism, text fidelity, and multilingual support.
The model generates over 1.5 billion images per week on ChatGPT, fueling diverse viral trends and emergent use cases globally.
Significant improvements include the ability to generate complex infographics, consistent multi-image outputs, and panoramic/360-degree images.
Integration with OpenAI Codex enables zero-shot app and game design workflows, combining visual creativity with coding capabilities.

Briefing memo

Summary

In this episode of the OpenAI Podcast, product lead Adele Lee and researcher Kenji Hata discuss the major advancements in OpenAI's image generation model, DALL·E 2.0. They highlight how the new model represents a paradigm shift with significant improvements in photorealism, text rendering, multilingual capabilities, and creative flexibility. The model now generates over 1.5 billion images weekly on ChatGPT, supporting a wide range of use cases from viral social media trends to professional and educational applications.

The guests emphasize the model's ability to produce high-fidelity images with accurate text, consistent multi-image outputs, and support for various aspect ratios including 360-degree panoramas. They also discuss how DALL·E 2.0 is being integrated into workflows for creative professionals, educators, and developers, including synergy with OpenAI's Codex for app and game design. The conversation touches on future directions such as building creative agents that act as personal assistants to users, enhancing image editing, and improving composition and personalization.

DALL·E 2.0 offers a major leap in image generation quality, with improved photorealism, text fidelity, and multilingual support.
The model generates over 1.5 billion images per week on ChatGPT, fueling diverse viral trends and emergent use cases globally.
Significant improvements include the ability to generate complex infographics, consistent multi-image outputs, and panoramic/360-degree images.
Integration with OpenAI Codex enables zero-shot app and game design workflows, combining visual creativity with coding capabilities.
Educational use cases are expanding, with the model helping create personalized, multilingual study guides and complex scientific visualizations.
Future plans include developing creative agents that understand user preferences and act as personal designers or planners.
Prompt engineering remains important, with users able to guide style and composition for professional or personal outputs.
OpenAI continues to refine model speed and token efficiency while enhancing aesthetic quality and user control.

Source material

Transcript

Hello, I'm Andrew Mayne, and this is the OpenAI Podcast.

On today's episode, we're talking about Images 2.0 with researcher Kenji Hata and product lead Adele Lee.

They'll discuss why the new model represents such a major leap forward, the evaluations that mattered most during development, and what people are creating with it now that it's widely available.

If DALL·E was the Stone Age, ImageGen 2.0 is the Renaissance. It's not only great artistically and aesthetically, but it also incorporates, you know, science or architecture.

All in one image.

We looked at it and we're like, all right, this is better than Images 1, right?

Adele, tell me a little bit about how you became a product manager here.

So I joined OpenAI a little over two years ago, and before OpenAI, I was an investor for my entire career.

Oh, wow.

So I was in private equity and spent three years at Redpoint Ventures, investing in AI and software companies, and when I first joined OpenAI, it was for a completely different role.

I was thinking about how do we build out our data and compute infrastructure.

And over time, I made my way over to the product side, and for the last six months I have been working on ImageGen.

It's interesting how you start out in one role and find yourself in this space here, which is kind of cool, you know, to think about the idea that you have this ability to be useful in different ways.

Absolutely.

And I think the role of a product manager is just to do the job that needs to be done, no matter what it is.

And for ImageGen in particular, it's been really awesome to flex a lot of different muscles when it comes to building products, working with researchers like Kenji, but also thinking about, like, what is the gap in the market today that we want to fill, and what is the opportunity that we want to grasp here?

It's not the same market that it was a year ago when we first released ImageGen 1.0.

Now it's a very different landscape.

There are multiple image generation makers out there.

And ChatGPT is a very different company and product itself, too.

And so really thinking about the evolution of ImageGen and its role within ChatGPT has been really, really exciting to me.

Kenji, how did you end up working on images?

Actually, like, I first started at OpenAI.

I also started about two years ago.

I was working on like some random audio project initially, just as my first project.

And then at the time, I just found my way to helping the team work on Images 1.0 prior to the launch.

And so gradually, I moved more and more onto the project, and then I became full time on it, basically.

What has the reception been like right now for the model?

In the last two weeks since we launched the model, usage is up more than 50%.

More than 1.5 billion images are generated every week on ChatGPT.

And we've seen viral trends emerge across the world, all the way from trends in Asia for color analysis and stickers to the US, where crayon and scribble were going viral.

But also a lot of people exploring emergent use cases.

And I think it shows the dynamic range of the model, but also how people are able to visually grasp the advancement of the model almost immediately.

I think the visual communication and reaction that we've seen from our users, for them to say, hey, this is the best, highest-fidelity, highest-quality aesthetic model that we've seen, has been really awesome.

This felt like a really big shift, almost worthy of maybe not even being an Images 2, but almost like just a new paradigm, because the capabilities are through the roof.

What made them possible?

When we started working on this project, I think we sat down and we discussed, what is the step change in capability and use cases that we wanted to build towards?

And we believe that image generation has the ability to do so much more than what it does today.

You could distill every single output or piece of visual content that you see today into an image.

And so that was the mandate that we set out to improve on.

And with this 2.0 model, we've improved on various different dimensions.

One is text rendering: text on a page now has so much better fidelity.

The language and words actually make sense and they're actual words.

The second is multilingual, so we've really focused on making the model work in many different languages, and we're already seeing that people across the world, in Asia and Europe, are really resonating with these advancements.

The third is photorealism.

I think we really saw a lot of feedback on our previous models that the output wasn't very realistic or altered people's faces or their bodies.

And so one of our mandates was, how do we actually make the image feel more like yourself?

And so all the things that you think the model knows, it does, because it has imbued the knowledge of the world into its consciousness and is able to visually communicate that back to the user.

And so putting that all together, I think we really get a state-of-the-art image generation model that is the best aesthetic model out there on the market right now, one that really represents a new paradigm for image generation, which is a huge part of, I think, AI progress at large that we have the opportunity to work on here.

We often listen back and listen to feedback on social media too, so we kind of just take all these things and basically are just aware of it and try to make sure that they're mitigated or completely fixed in some cases in the next iteration.

What kind of use cases are you seeing?

What do you see people do with this?

I think one that's particularly close to the research team in general is, like, infographics and text.

I think text in images is, like, so much better now.

So I think it just opens up a lot more productive use cases and from like the research side we kind of think, you know, image generation used to always be about fun and maybe like unproductive things.

But now we're really seeing steps forward into productivity and image generation for any type of use case that you can imagine it for.

So you mentioned text.

I remember the early models, no disrespect to chimpanzees, but getting it to spell, like, "OpenAI" even looked like a chimp did it.

And then now I'm looking at pages of text and finely detailed stuff.

And I know that as models get smarter, variable binding, the ability to put things next to each other, improves, but this was just a big improvement.

Yeah.

But I don't think it's like completely unexpected.

I think you see a lot of growth in between. Well, first you see it between DALL·E 3 and, you know, GPT Images 1: if you ask for a grid of random objects, you go from maybe, like, five to eight in DALL·E 3 to maybe around 16 in Images 1.

And then with 1.5, we went to about 25 to 36 consistently and I think now we could probably do over 100.

I think this is, like, a test that we might do internally: we just ask ChatGPT, give me a list of 100 random objects, right?

And then we just send that to our image generator and see how many are correct.

And usually, you know, it'll get almost all 100 correct.

And that's, but you see the, the constant growth over time.

So I don't think it's, like, completely unexpected, it's just a steady pace.
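
[Editor's sketch] The grid test described above (ask for a list of random objects, send it to the image generator, count how many come out correct) could be scored along the following lines. This is only a minimal illustration, not the team's internal tooling: the function name `score_grid` and both input lists are hypothetical, and the sketch assumes that some grader, human or model, has already listed the objects visible in the generated grid.

```python
from collections import Counter


def score_grid(requested, detected):
    """Return (hits, total, missing) for one grid-of-objects sample.

    `requested` is the list of object names sent in the prompt.
    `detected` is the list of object names a grader reports seeing in
    the generated image. Both inputs are hypothetical for this sketch.
    """
    want = Counter(name.strip().lower() for name in requested)
    seen = Counter(name.strip().lower() for name in detected)
    # Multiset intersection: an object requested twice must appear twice.
    hits = sum((want & seen).values())
    # Requested objects the grader did not find, in alphabetical order.
    missing = sorted((want - seen).elements())
    return hits, sum(want.values()), missing


requested = ["teapot", "bicycle", "cactus", "violin"]
detected = ["Teapot", "cactus", "violin", "lamp"]
hits, total, missing = score_grid(requested, detected)
print(f"{hits}/{total} correct, missing: {missing}")
# prints: 3/4 correct, missing: ['bicycle']
```

Running the same scoring over lists of 16, 36, and 100 objects would give the kind of per-release comparison described in the conversation.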

That was a test I used to use for, like, the really old models, back on, like, Ada, Babbage, and Curie: like, list 100 science fiction books.

And then some of them, by the time they got to, like, 22, would just start repeating stuff.

Yeah.

It's as if the model reached the end of it.

So we've seen stuff too, like 360-degree panoramas.

How did that happen?

Yeah, that really came from an emergent capability of the model, which is the ability to render images in any aspect ratio.

We discovered that people were generating really long, amazing panoramas and, you know, skinny bookmarks as well.

And one of the cool capabilities with the model is that not only were you able to generate images in this panoramic aspect ratio, but you could also render your images in the style of 360.

And we saw that it was really fun to actually view these images in a 360 world itself.

And so that was a really fun feature that we ended up adding into the product, and it's available on ChatGPT on web and mobile right now.

First thing I did was I made a version of Dogs Playing Poker.

Yeah.

So you can sit there like you're one of the dogs looking around in there, which was not something I expected, but it's fun.

Yeah.

I mean, it's really awesome to see how people are exploring new use cases and fun things that they're creating with the model, even far beyond what we expected users to be using it for.

I think when we were designing the model, we were really deliberate in understanding what people really wanted to see from image generation.

There was a lot of latent demand in image generation. You know, people were mostly using it for personal use cases, but we definitely saw a lot of inklings of people wanting to push the model in certain directions that the model wasn't good at.

So text rendering was definitely one of those dimensions that we really wanted to improve on; multilingual was another.

And I think world understanding generally is so much better in this model.

And that typically means that, you know, now people online are sharing a bunch of examples of them creating with ImageGen for all different kinds of use cases that we didn't even think, you know, existed out there.

So I think the model's understanding of aesthetic beauty across multiple different outputs, whether it's, like, a fun meme, an image for a five-year-old, or a professional consulting deck, that expansion of opportunity and outputs has been amazing to see in this latest model.

It's funny, too, how one of the things that was trending was taking popular images or photos of people and then having the model make, like, kind of janky-looking Microsoft Paint versions.

Yes.

And did you think that was something you would see, that people were going to use this incredibly capable tool to then go make, you know, silly-looking things?

Yeah, it's funny because it takes a lot of intelligence to actually create something that is imperfect.

That's what I tell people all the time.

Yeah.

And it's definitely very interesting in the viral trends that we're seeing online right now.

One thing that I think people are really striving for is authenticity, imperfection, nostalgia.

We're seeing that in the MS Paint prompt, crayons, all different kinds of generations that people are creating, and that really feels like the theme with consumers: they want to interact with AI in a very authentic and imperfect way.

They want to show their imperfections and use AI to help make them look good, but also show a more fun and goofy side of themselves.

And I think that self-expression via AI is something that we're really excited about.

And you know, I think it's really part of our mission as a company to make it easier for people to learn more and distribute that intelligence, but also letting them express the version of themselves that maybe wasn't possible before.

Kenji, was there a moment with this model where you said to yourself, well, I think this is ready to go?

You know, as it's training, we take a checkpoint and then like we just sample from it, right?

And just see, okay, how good is this thing?

And I think, like, we just sampled an image from a checkpoint of the model, and we looked at it and we're like, all right, this is better than Images 1. You're just like, okay.

I remember watching the iteration of one of the early versions of DALL·E, and how at first it was sort of these wispy, weird sort of tendril things, and talking to one of the researchers, like, is that going to go away?

He's like, I think we're probably two runs away from that. And then, just like that.

The ability to predict that was amazing to me, and all of a sudden it got crisp and clear.

And then also, like, looking at years ago, I'd played with, like, you know, GANs, and doing those things, you'd have to squint and say, I think it's a pickup truck or something like that.

So it's interesting what you see is you say, okay, this just all of a sudden got much better.

Yeah.

I mean, it was just very obvious.

You just, you just take the early checkpoint, you just sample an image from it, and then you just sample an image from, you know, Images 1, and you just look at the two and you're just like, why did I like this garbage?

I don't know what the image was.

It might have just been like a picture of like a woman at a, on the seaside, like, you know, overlooking the seaside.

We just looked at it and we're like, all right, there's like no, no, no, no.

Yeah.

That was the big, the big jump: the photorealism, going from something that looked more like a glossy, idealized magazine cover to something like a really good photograph.

So, help me understand, like, besides just more compute, how did this happen?

How did you get a model that's much better and also that doesn't take an hour to generate an image?

The times are still, I remember in the DALL·E days, like, we would literally have to say, you know, tell us what you want, and then an hour later it'd be on Instagram, to now these things are in ChatGPT and it's faster.

How is it getting both more intelligent while you're maintaining the same speed?

I think we learned a lot in each release, like between 1, 1.5, and now 2.

And so we take each of the learnings that we've made and we, you know, like, for example, speed, right, you know, one of the things is, like, oh, can we make the model more token efficient or something like that?

And, you know, we did a lot of work to make it, to make it produce very good images with less tokens.

I think the post-training for this model was very interesting in the sense that we really had to think about not only whether the model understands world knowledge and how things look in, you know, science concepts, math, et cetera, in an image, but also, what is the taste that will resonate with users? You know, what makes the model's output beautiful?

How do you make it look realistic?

These are all questions that we had to grapple with when we were post-training this model.

Because I think that one of the things that was really important for us was that the model was the strongest aesthetic model out there right now, which means that it has more creativity in various different outputs no matter what that output is, if it's a professional output or a personal output.

And so that range of training and the range of use cases, I think, made training this model a very interesting problem.

Do you have any personal favorite benchmark tests you like to do, things you say?

I want to see it make an image of this.

I have an eval that I call the Mimi eval.

Okay.

It's essentially a hundred photos of myself and my friends and my family.

I put everyone in goofy positions.

I have a birthday card for every single person.

And I think it's a really great eval in the sense that you know the faces of the people around you the best.

You also want to create funny things with the model and do things that are relevant.

And so one thing that I, as the product manager, am testing is not only whether the raw capability of the model is really great, but also whether ChatGPT understands what I want in that context.

You know, ChatGPT even remembers that I have a brother, that I have a mom and dad, and what they like to do.

And so does the model accurately know how to insert pieces of personalization in the moments that matter in the images?

These are things that I'm testing for.

I'll let you.

Besides the grid one I mentioned earlier, that's probably the one I've used the most for a while, I think, Divya and I were doing a lot about what we were trying to in real hard to push on that.

Just basically, I know Divya's favorite one was, like, a woman holding a jug of orange juice.

I don't know if you see it.

Yeah.

And it's like so many, and so a woman holding a jug of orange juice.

Well, actually, I feel like the researchers had a more standard set of images.

Like, ones that they like to lean on.

Yeah, and you get, like, the standard ones: can you do somebody writing with their left hand, or a watch on their right hand, and a clock showing this?

I think the big, the big leap for Images, like, probably 1 or 1.5, was, like, a half-full glass of wine.

Oh, the wine glass filled to the brim.

Yeah, yeah, yeah, yeah, exactly.

And there were ways I was able to prompt it to do it, but, uh, you really had to get really descriptive, like, you know, the red liquid inside this.

This one is so fun to prompt.

There was a thing people said: oh, can it do, like, you know, pixel-accurate, pixel-style art?

And somebody was like, no, it can't.

And when I hear that, I'm like, okay, let's try.

And I found out that if I gave it, like, a 64 by 64 grid and said, go draw the art in there, it did.

It just was able to put art into there.

And that was amazing to see those kinds of results.

The promptability of this is insane.

How do you plan for that?

Does this just happen?

You're like, oh, wow, this is better at understanding this.

People come to ImageGen with very vague prompts.

Yeah.

Make it better.

Make me look better.

You know, make me cuter.

All these things are really vague.

And I think it's really the job of the model and the harness to distill that into what users actually want.

And I think that's a personality of the model that we trained over time.

That we've really hardest to power for.

And honestly, I think it also yields a lot of really surprising results that people may not expect.

And that surprise is just part of the fun of using ImageGen.

I've seen like two kinds of prompting sort of emerge.

And I remember back with DALL·E, I thought, like, oh, I'm a prompt engineer.

I'll be great at this.

Like, I'll be really good at this.

And I'd make a raccoon in space and, like, feel proud.

And then I'd see an artist, somebody who wasn't a prompt engineer, somebody who actually came from that world, and I'd watch them use their language.

And they were doing amazing things.

Yeah.

And that seems like it's still holding true.

Definitely.

I mean, we worked with a group of artists very closely when we developed this model.

And we're very inspired by artists, designers, marketers, all these different professions that I think have a different way of approaching their profession.

And one of the things that was very important for us is we wanted to take the inspiration as well as the best practices for those professions and distill that into the way that people interact with the model.

And so that's something we've deliberately tried to focus on.

One hack that I've seen work really well is the ability to upload inspiration or context into the model.

And the model has an incredible ability to take the spirit of that context and translate it into the output.

But it's interesting because I think that a lot of people worry that, oh, I just push a button, I get something beautiful.

And each model that gets better, it's easier, as you said, to not have to put a lot of effort into it.

But when people do put effort into it, they are getting even more amazing results.

And it seems like, actually, if you're artistically inclined, you're getting even greater control, because now, like you said, it understands more about what you're talking about when you talk about depth of field and these other things, or whatever you're trying to do.

And as you mentioned, it was exciting to see with earlier models, artists who said, oh, I gave it my originals and it gave me these variations.

And I know which one works and just seeing that as this real creative amplifier.

Yeah, definitely.

I think having creative direction or taste or judgment and bringing that to the model is the best way to push it further.

I think one thing about this model that I'm really excited about is how it expands the creative outlet for people.

I think creating multiple different styles or types or variations has never been easier than with this ImageGen model.

And I think it's also its understanding of different contexts, like the way that it's able to shift from generating an architectural diagram all the way to the aesthetics of a children's book.

The ability for it to move so seamlessly across these factors has been really awesome.

The ability to do great infographics and diagrams is very powerful.

What kind of feedback have you been getting from people in research and education?

We actually have an internal alpha channel where we test our models.

And in that there's, like, a sub-channel dedicated specifically to educators at any level, like elementary school all the way up to graduate level.

One of the coolest things I saw was there was a biology professor, and he posted, like, these graduate-level textbook page renderings of things I had no clue about, and he said it was perfectly accurate.

I think the ability for the model to distill very complex topics into something that is really easy to understand within an image is one of its strongest capabilities.

And we've seen this with students and with teachers who are using ImageGen to learn different concepts, to help them create study guides, and also to help create personalized content.

I think personalized learning is a huge trend that we're very passionate about.

And I think the ImageGen model helps you as a teacher create something that every kid can understand in their own language and own preference, and that is something that we're really excited about.

We're thinking about this in the context of also, how do we bring more of the elements of ImageGen into ChatGPT at large, so that when people are trying to learn concepts, we're teaching them with ImageGen.

I remember when I was in school, and kind of prior to a lot of multimedia blowing up, posters were a big thing, classroom posters explaining stuff.

This really reminded me of how powerful an infographic can be because it allows you to bring as much attention as you want to it.

And you can spend the time looking at it and seeing it and you can put a lot more detail into it.

I think one really awesome visual shift that I've seen with ImageGen is that now, in internal presentations, over 50% of the slides are created with ImageGen.

And that permeation of communication via images is so powerful when you're trying to explain your concepts or illustrate what you mean.

And I think infographics and the text rendering capability as well as the composition of the text on the page is incredibly powerful with this model.

The model's understanding of not only what to say but how to present it is a superpower and I'm really excited about future explorations of this where we can think about how do we make this even better, how do we improve the composition, the different kinds of outputs, and also make it editable in the product.

These are directions that we were really excited about.

How do you see the progression of this?

This is great, but typically any time I talk to somebody at OpenAI about what they're working on, they're like, yeah, that's good.

But I think we're still super early and exploring all the different use cases that people are really trying to push the model in.

And so one of the things that we're really excited about is what that next stage is for ImageGen, which is to create the creative agent.

Ultimately, the agent that can work alongside you, be your creative assistant, and really understand how you work, what your preferences are, what is the output that you want to get to?

And build the product and model ecosystem that helps users kind of have a personal interior designer, personal architect, personal, you know, wedding planner, etc.

All in one ImageGen.

I'll tell you one thing, it was kind of amazing.

I write books.

And so every time a book comes out, I've got to change my social media headers.

And I just went and said, oh, find my book cover and, you know, create an appropriately sized social media header that I can put on X or Facebook or whatever.

And, like, first shot, right aspect ratio, everything.

We basically did that from the start: we trained the model to be good at that from the start.

I remember like I worked on the initial dearest of, of every, basically, it could do any aspect ratio that you ask.

Yeah.

Yeah, you can now really just easily specify the outcome that you want.

Yeah.

In the case of yourself, you're like, I want promotional material.

I don't have an idea.

I didn't specify exactly what I wanted.

But the model was able to do the research and then give it to you in the style and aspect ratio that was relevant to you.

And that's super powerful.

We're already seeing this.

You know, you're, you're an author.

I've talked to real estate agents who are using ImageGen to help them create listings for their apartments or stage their listings.

YouTube creators have talked to me about using ImageGen for their thumbnails and promotional content.

I've talked to top artists who want to use ImageGen to connect with their fans.

And I think the ability for all different kinds of professions to start to use ImageGen to help them with visual creation is super powerful, especially if you're working in a visual and creative industry.

ImageGen is such a hack in your professional toolkit.

I think it has to be a part of everyone's everyday workflow in the future.

This does feel like the, I think it feels like the first time where anything I can reasonably come up with, it does a pretty good job of it.

We think it's a new paradigm for image generation altogether.

Like, you know, we said this in the launch video: if DALL·E was the Stone Age, ImageGen 2.0 is the Renaissance.

Yeah.

And I think that is so true because the model, it's not only great artistically and aesthetically, but it also incorporates, you know, science or architecture all in one image together.

And I think that composition and knowledge of the model just means that the outputs are so much more trustworthy and more powerful, and enable so many more use cases.

I think that ImageGen and Codex is also an amazing intersection of the capabilities that we're setting out to create with both ImageGen and coding agents.

So many people are using ImageGen as a first stop for designing a new website or creating a new app.

And I think that intersection of having a really strong aesthetic model, which is image generation in combination with strong coding abilities means that now you're able to zero shot really amazing apps from scratch with both of these tools.

Yeah.

I asked it in Codex.

I took my website and said, since it had ImageGen, could you create me some, you know, some different concepts for it?

And it did these contact sheets.

And as for contact sheets, it gave me, like, four images there.

And I said, oh, the one on the upper right, can you go make that?

And I watched Codex go make that, which was like, this feels like magic.

And then they've implemented it as part of pets.

And so, like, if you're using Codex and you say, hey, I want to have, like... I love ravens.

So I have, like, a raven.

I said, can you make a raven?

And then I watched it pull up ImageGen and iterate and make the sprites for it.

Yeah.

Sprite sheets are going viral.

Yeah.

Same with game design.

People are loving using ImageGen to help them create new worlds.

Any hints on how to do better sprite sheets?

I mean, I've tried to make, you know, GIFs internally.

And I think if I just use, like, the thinking mode or Codex, yeah, you basically just ask it to generate one initial sprite.

It's really good.

And then you can just say, can you make the rest?

The consistency across multiple images has been amazing.

We've seen a lot of people try creating 10 page comic books with consistent storylines, you know, multi page slides.

I think that consistency of characters and aesthetics is completely unique to the model.

That was an example, too, where there were a lot of workflows out there for working with the image models that were kind of janky, but that you had to figure out how to do.

And it's great now because I can do stuff where I can like create characters and say, make a character sheet with the different poses and stuff and just go feed it back in and say, okay, now doing this, now doing that, now doing that.

And that's just such an unlock.

Sometimes what we need is obviously a smarter model, but, like, context length did so much for ChatGPT, did so much for coding.

And an image model that's able to reliably use these references is just way more capable.

Yeah, for sure.

And we're still trying to improve that as well.

It's not perfect today.

We're really trying to develop this visual creation layer for people, because every single person has an aesthetic or personal style or preference.

And we're really trying to imbue that into the product that we're building.

So that people can get to the output that they're wanting easier and faster with ImageGen.

Any parting prompt tips for people?

Well, one of the things I would suggest people try is ImageGen thinking.

So if you navigate to the thinking or pro models, we have a more powerful version of ImageGen in that experience.

And in that model, you actually are able to search the web, analyze files, leverage tools under the hood, which then yields a better quality and higher composition photo.

And the suggestion that I have for prompting that experience is be open-ended.

I think the model will go and do the exploration itself to understand and try to reason and find information that matters.

And I also think giving it a sense of an aesthetic is super helpful; grounding that in a style has been really helpful for a great result.

Good one, good one.

I think just being very particular about the style or what you like in general, for me, I like minimalist infographics.

Sometimes I think the model can be a little dense.

And so I just maybe I'm just a simplistic kind of guy.

So I just like very clean, a very clean look.

So I like that.

Adele, Kenji, thank you very much.