Lenny's Podcast · 2025-10-23

AI Engineering 101 with Chip Huyen

Hosts: Lenny Rachitsky

Guests: Chip Huyen

AI engineering · pre-training vs post-training · RAG · RLHF · fine-tuning · evals · AI product development · voice/multimodal AI · test-time compute · developer productivity

Why it matters

Pre-training is encoding statistical language distributions.

Key claims

  • Pre-training is encoding statistical language distributions; post-training (SFT, RLHF, verifiable rewards) is where frontier labs now differentiate, since internet data is largely saturated.
  • For RAG systems, data preparation—chunk sizing, contextual metadata, hypothetical questions, rewriting docs in Q&A format—matters far more than vector database choice.
  • The data-labeling industry has fragile economics: many small providers serve only a few frontier-lab customers, giving labs significant pricing leverage.
  • Evals aren't all-or-nothing—pick battles based on whether failures are catastrophic; depth-of-coverage across the pipeline beats chasing a fixed count of metrics.

Episode summary

Summary

Chip Huyen, author of O'Reilly's most-read book 'AI Engineering' and former core developer on NVIDIA's NeMo platform, joins Lenny's Podcast for a technical primer on building production AI products. She explains the core building blocks—pre-training vs. post-training, fine-tuning, RAG, and RLHF—in accessible terms, drawing an analogy to Sherlock Holmes decoding language statistics. The conversation challenges the AI hype cycle, noting that while pre-training gains may be slowing, post-training and inference-time compute are driving the next wave of improvements.

A centerpiece is Chip's viral chart contrasting what developers think improves AI apps (chasing the latest model, agonizing over vector databases) versus what actually does (talking to users, better data prep, end-to-end workflow optimization). She argues data preparation—not framework choice—is the dominant lever for RAG quality, and unpacks the economics of the AI data-labeling industry, where many startups serve just a handful of frontier labs.

On adoption, Chip is pragmatic: most companies can't measure productivity gains cleanly, and randomized trials inside firms show senior engineers gain the most from coding agents, while mid-tier engineers follow. She forecasts organizational restructuring—blurring product, engineering, and evals—plus rapid advancement in voice and multimodal AI, where latency, interruption handling, and disclosure regulations remain hard problems. Closing on a personal note, Chip shares she's writing her first novel and draws a parallel between predicting a reader's emotional reaction and predicting the next token.

  • Randomized trials at companies show senior engineers gain the most productivity from coding agents; mid-tier follow; some senior engineers actively resist AI tooling due to high standards.
  • Chip predicts organizational restructuring: evals blur the line between product, engineering, and growth, and companies are reorganizing around senior reviewers + AI/junior producers.
  • Voice AI faces engineering—not just AI—challenges: multi-hop latency, natural interruption detection, and emerging disclosure regulations for human-sounding bots.
  • Test-time compute (sampling multiple answers, extended reasoning chains) is the new frontier for model gains without changing base model weights.

Source material

Transcript

One question that gets asked a lot and a lot:

How do we keep up to date with the latest AI news?

Why do you need to keep up to date with the latest AI news?

If you talk to the users, you understand what they want, what they don’t want, look into the feedbacks, then you can actually improve the application.

Wait, wait, wait, wait, wait.

A lot of companies are building AI products.

A lot of companies are not having a good time building AI products.

We are in an idea crisis.

Now, we have all these really cool tools.

You can do everything from scratch.

You can have it design, you can have it write code, you can have it build a website.

So in theory, we should see a lot more.

But at the same time, people are like somehow stuck.

They don’t know what to build.

There's all this AI hype.

The data is actually showing most companies try it.

It doesn’t do a lot.

They stop.

What do you think is the gap here?

It’s really hard to measure productivity.

So I do ask people to ask their managers, "Would you rather give everyone on your team very expensive coding agent subscriptions, or get an extra headcount?"

Almost everyone, the managers, would say headcount.

But if you ask someone at the VP level, or someone who manages a lot of teams, they would say, "I want AI assistants."

Because as managers, you are still growing.

So for you, having one extra headcount is big.

Whereas for executives, maybe you have more business metrics that you care about.

So you actually think about what actually drives productivity metrics for you.

Today, my guest is Chip Huyen.

Unlike a lot of people who share insights into building great AI products and where things are heading, Chip has built multiple successful AI products, platforms, and tools.

Chip was a core developer on NVIDIA's NeMo platform and an AI researcher at Netflix.

She taught machine learning at Stanford.

She's also a two-time founder and the author of two of the most popular books in the world of AI, including her most recent book, AI Engineering, which has been the most read book on the O'Reilly platform since its launch.

She’s also gotten to work with a lot of enterprises on their AI strategies.

And so she gets to see what’s actually happening on the ground inside a lot of different companies.

In our conversation, Chip explains a lot of the basics.

Like, what exactly does pre-training and post-training look like?

What is RAG?

What is reinforcement learning?

What is RLHF?

We also get into everything she’s learned about how to build great AI products.

Including what people think it takes and what it actually takes.

We talk about the most common pitfalls that companies run into, where she’s seeing the most productivity gains, and so much more.

This episode is quite technical, more technical than most conversations I’ve had, and is meant for anyone looking for a more in-depth conversation about AI.

If you enjoy this podcast, don’t forget to subscribe and follow it in your favorite podcasting app or YouTube.

And if you become an annual subscriber of my newsletter, you get a year free of 16 incredible products, including Devin, Lovable, Replit, Bolt, n8n, Linear, Superhuman, Descript, Wispr Flow, Gamma, Perplexity, Warp, Granola, Magic Patterns, Raycast, ChatPRD, and Mobbin.

Head on over to lennysnewsletter.com and click Product Pass.

With that, I bring you Chip Huyen after a short word from our sponsors.

This episode is brought to you by Dscout.

Design teams today are expected to move fast, but also to get it right.

That's where Dscout comes in.

Dscout is the all-in-one research platform built for modern product and design teams.

Whether you're running usability tests, interviews, surveys, or in-the-wild fieldwork, Dscout makes it easy to connect with real users and get real insights fast.

You can even test your Figma prototypes directly inside the platform.

No juggling tools, no chasing ghost participants.

And with the industry's most trusted panel, plus AI powered analysis, your team gets clarity and confidence to build better without slowing down.

So if you're ready to streamline your research, speed up decisions, and design with impact, head to dscout.com to learn more.

That's dscout.com.

The answers you need to move confidently.

Did you know that I have a whole team that helps me with my podcast and with my newsletter?

I want everyone on my team to be super happy and thrive in their roles.

Justworks knows that your employees are more than just your employees.

They're your people.

My team is spread out across Colorado, Australia, Nepal, West Africa, and San Francisco.

My life would be so incredibly complicated if I had to hire people internationally, pay people on time and in their local currencies, and answer their HR questions 24/7.

But with Justworks, it's super easy.

Whether you're setting up your own automated payroll, offering premium benefits, or hiring internationally, Justworks offers simple software and 24/7 human support from small business experts for you and your people.

They do your human resources right so that you can do right by your people.

Justworks.

For your people.

Chip, thank you so much for being here and welcome to the podcast.

Hi Lenny.

I've been a big fan of the podcast for a while, so I'm really excited to be here.

Thank you for having me.

I want to start with this table/chart that you shared on LinkedIn a while ago that went super viral.

And I think it went super viral because it hit a nerve with a lot of people.

And let me just read this and we'll show this on YouTube for people that are watching.

So it's this very simple table you share of what people think will improve AI apps and what actually improves AI apps.

What people think will improve AI apps.

Staying up to date with the latest AI news.

Adopting the newest agentic framework.

Agonizing what vector databases to use.

Constantly evaluating what model is smarter.

Fine-tuning a model.

And then you have what actually improves AI apps.

Talking to users.

Building more reliable platforms.

Preparing better data.

Optimizing end-to-end workflows.

Writing better prompts.

Why do you think this hit such a nerve with people?

And if you had to boil it down, what do you think people are missing in building successful AI apps?

One question that gets asked a lot and a lot is, how do I keep up to date with the latest AI news?

And I'm like, why do you need to keep up to date with the latest AI news?

I know it's very counterintuitive, but there's so much news out there.

A lot of people also ask me questions like, how do I choose between two different technologies?

Like, recently, MCP versus the agent-to-agent protocol.

And it's like, which one is better?

Or like this or that?

And there's a series of questions I ask them.

First, how much improvement could you get from the optimal solution versus a non-optimal solution, right?

And sometimes they're like, actually it's not much, right?

And I'm like, okay, if it's not much improvement, why do you want to spend so much time debating something that doesn't make that much difference to your performance?

Another question I ask is, if you adopt a new technology, how hard would it be to switch it out for another?

And sometimes they're like, oh, I think it could be a lot of work switching it out.

And I was just like, hmm, let's say here's a new technology.

It hasn't been tested by a lot of people.

And if you adopt it, you would be stuck with it forever.

Like, do you actually want to adopt it?

Right.

And maybe you want to think twice about overcommitting to a new technology that hasn't been battle tested.

I love that your broader advice is just simple: to build successful apps, talk to users, build better data, write better prompts, optimize the user experience, versus just, what is the latest and greatest?

What's the best model to use right now?

What's happening in AI?

Let me follow this thread of this idea of fine tuning and basically post training.

There's all these terms that people hear in AI.

And I think this is going to be a really good opportunity for people to learn what we're actually talking about.

Since you actually do these things, you build these things, you work with companies doing these things.

And there's a few terms I want to sprinkle in through the conversation.

But let's start with this one.

What's the simplest way for someone to understand what is the difference between pre-training and post training?

And then how fine-tuning fits into that, and what fine-tuning actually is?

As a disclaimer, I don't have full visibility into what these big, secretive frontier labs are doing.

But from what I've heard, right?

So I think one is supervised fine-tuning, when you have demonstration data.

You have a bunch of experts, like, okay, here's a prompt, right?

And here is what the answer should be like.

And you just train the model to emulate what the human expert would do.

And that's also what a lot of open source models are doing, and they do it by distillation.

So instead of having human experts write really good answers to the prompts, they get very popular, famous, good models to generate responses, and then train a smaller model to emulate them.

So it's not that easy. I really appreciate the open source community, by the way, but being able to train a model that emulates an existing good model is very different from being able to train a good model that outperforms existing good models.

So it's a big step there.

So yeah, we have supervised fine-tuning.

And another thing that's very big, I'm not sure if you've had guests talking about it already, but reinforcement learning is everywhere.

Let's pause on that, because I definitely want to spend time on that.

And that's such a cool topic that's emerging more and more in my conversations, but just to summarize the things you just shared, which I think is really, really important stuff.

So the idea here is that a model is essentially this algorithm, a piece of code that someone writes, and, say, the frontier labs are feeding it basically the entire Internet of content.

And basically, it's testing itself on predicting, across all that data, the next word. Token is the correct way to think about it.

But a simpler way to think about it is the next word in a text.

And as it gets it wrong, it adjusts these things called weights. Is that a simple way to think about it, even though that's very surface level?

So I think of language modeling as a way of encoding statistical information about language, right?

So let's say that we both speak English, so we have a sense of what is statistically most likely. If I say, "My favorite color is," then you can tell that what follows should be a color. The word "blue" would be much more likely to appear than the word "table," right? Because statistically, "blue" is far more likely to follow "my favorite color is." So it's a way of encoding statistical information.

So when a language model is trained on a large amount of data, it sees a lot of languages, a lot of domains, so given what the user puts in the prompt, it can come up with the next most likely token.
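To make the "statistical information" idea concrete, here is a minimal sketch that is not from the episode: a toy model that counts which word follows which in a tiny invented corpus and turns the counts into probabilities. The corpus and the function name `next_word_distribution` are assumptions for illustration. Real language models condition on far more context than one previous word.

```python
from collections import Counter, defaultdict

# Toy corpus, invented for illustration.
corpus = [
    "my favorite color is blue",
    "my favorite color is green",
    "my favorite color is blue",
    "the table is brown",
]

# Count which word follows each word across the corpus.
follow_counts = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for prev, nxt in zip(words, words[1:]):
        follow_counts[prev][nxt] += 1

def next_word_distribution(prev_word):
    """Estimate P(next word | previous word) from the counts."""
    counts = follow_counts[prev_word]
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

dist = next_word_distribution("is")
print(dist)  # "blue" gets the highest probability; "table" never follows "is"
```

In this toy corpus a color follows "is" most of the time, and "table" never does, which is the kind of statistic the passage describes.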

By the way, it's not a new idea, actually.

It's a very, very old idea, from a 1951 paper on the entropy of English, I think it's by Shannon, and it's a great paper.

And a story I really like is from Sherlock Holmes. Did you read Sherlock Holmes, by the way?

Yeah, I read a few Sherlock Holmes books.

Yeah.

Yeah.

So this is a story where Sherlock Holmes uses this statistical information to help solve a case.

In this story, somebody left a message written with a lot of stick figures.

So Sherlock Holmes was like, okay, he knows that in English, the most common letter is e, so the most common stick figure must be e, right?

And he goes on like that, and eventually he solves the code.
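The first step of that reasoning can be sketched in a few lines. This is an illustration under assumptions, not the story's actual cipher: the symbols and the message are invented, and `most_common_symbol` is a hypothetical helper.

```python
from collections import Counter

def most_common_symbol(ciphertext):
    """Return the symbol that appears most often, ignoring whitespace."""
    counts = Counter(ch for ch in ciphertext if not ch.isspace())
    return counts.most_common(1)[0][0]

# Invented example: "meet me here" with each letter swapped for a symbol
# (e -> #, m -> @, t -> %, h -> &, r -> $).
ciphertext = "@##% @# &#$#"

# Since "e" is the most common letter in English, guess that the most
# frequent symbol stands for "e".
guess = {most_common_symbol(ciphertext): "e"}
print(guess)  # {'#': 'e'}
```

A full solution would repeat this with the next most frequent letters and check the guesses against plausible words.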

So I think, in a way, that's simple language modeling, right?

But instead of doing it at the word level, he does it at the character level.

And a token is something in between, right?

A token is not quite a word, but it's bigger than a character.

We use tokens because they help us reduce the vocabulary. Characters give you the smallest vocabulary, right?

The alphabet has 26 characters, but words can number in the millions and millions, right?

Whereas with tokens, you can get the sweet spot between the two.

So let's say that we have a new word, like, say, podcasting, right?

Let's say it's a new word, but it can be divided into "podcast" and "ing."

So people can say, okay, podcast, we know the meaning, and we know that "ing" makes it, like, a verb, a gerund, whatever it is.

So we can even understand the word podcasting.

So that's why the token comes in.
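As a rough sketch of how an unseen word can be split into known pieces, here is a greedy longest-match splitter over a tiny hand-written vocabulary. Both the vocabulary and the `tokenize` function are assumptions for illustration; production tokenizers learn their vocabulary from data and use different algorithms.

```python
def tokenize(word, vocab):
    """Greedy longest-match split of a word into known subword tokens."""
    tokens = []
    start = 0
    while start < len(word):
        # Try the longest possible piece first, then shrink it.
        for end in range(len(word), start, -1):
            piece = word[start:end]
            if piece in vocab:
                tokens.append(piece)
                start = end
                break
        else:
            # No known piece starts here: fall back to a single character.
            tokens.append(word[start])
            start += 1
    return tokens

# Hypothetical vocabulary that has never seen "podcasting" as a whole word.
vocab = {"podcast", "ing", "pod", "cast"}
print(tokenize("podcasting", vocab))  # ['podcast', 'ing']
```

The vocabulary stays small, yet the new word is still represented by pieces whose meaning is already known.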

But yeah, pre-training is basically encoding statistical information about language to help you predict what is most likely. Though I think "most likely" is just the simplest way of putting it.

Because it's more like building distributions: okay, the next token could be, like, 90% of the time a color, and 10% of the time something else, right?

So you basically have a distribution, and the language model can pick from it depending on your sampling strategy. Do you want it to always pick the most likely token, or do you want it to pick something more creative? So I think sampling strategy is something extremely important.

It can help you boost performance in a huge way, and it's very, very underrated.
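The difference between "always pick the most likely token" and "pick something more creative" can be sketched with temperature sampling, one common sampling strategy. The episode does not name a specific strategy, so this choice, the function name `sample_token`, and the probabilities are assumptions for illustration.

```python
import math
import random

def sample_token(probs, temperature=1.0, rng=random):
    """Pick a token from a {token: probability} distribution.

    temperature == 0 is greedy (always the most likely token);
    higher temperatures flatten the distribution and add variety.
    """
    if temperature == 0:
        return max(probs, key=probs.get)
    # Rescale log-probabilities by the temperature, then renormalize.
    scaled = {t: math.exp(math.log(p) / temperature) for t, p in probs.items()}
    total = sum(scaled.values())
    tokens = list(scaled)
    weights = [scaled[t] / total for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

# Hypothetical next-token distribution after "My favorite color is".
next_token_probs = {"blue": 0.9, "table": 0.1}

print(sample_token(next_token_probs, temperature=0))    # always "blue"
print(sample_token(next_token_probs, temperature=1.5))  # usually "blue", sometimes "table"
```

The model weights are the same in both calls; only the way a token is drawn from the distribution changes.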

Okay, awesome.

So essentially, a model is just code with this whole set of weights, essentially, the statistical model that has learned to predict what comes next after certain words and phrases.

Yeah.

And then post-training, and fine-tuning specifically, is doing that same thing.

So with pre-training, you get something like GPT-5. Fine-tuning is someone taking GPT-5 and doing the same sort of thing,

adjusting these weights a little bit for specific use cases, on data that they find is necessary for their very specific use case.

Is that a simple way to think about it?

Yeah, I think of weights as part of a function, right?

So let's say you have a function, like maybe Lenny's height is 1x plus something, or 2x plus something. The 1 or the 2 is a weight, right?

So you change it until it fits the correct data, which is like my height and your height, right?

So you can think of the weights as just part of the function.

So you adjust the weights so they can fit the data, which is the training data.
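Here is a minimal sketch of "adjusting the weights until they fit the data" for the simplest possible function, a line y = w * x + b, using plain gradient descent. The data points, learning rate, and step count are invented for illustration; a language model does the same kind of adjustment over billions of weights.

```python
# Toy training data, invented for illustration: exactly y = 2 * x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

w, b = 0.0, 0.0        # the "weights" that get adjusted
learning_rate = 0.05

for _ in range(5000):
    # Gradient of the mean squared error with respect to w and b.
    grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / len(xs)
    grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / len(xs)
    # Nudge each weight in the direction that reduces the error.
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(round(w, 2), round(b, 2))  # 2.0 1.0
```

Fine-tuning, in this picture, is starting from weights that already fit one dataset and continuing the same loop on a smaller, more specific one.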

Awesome.

Okay.

So we're talking about pre training, post training, fine tuning, is there anything else here that's important to share about just like what this is exactly, what people need to understand about these parts of training?

So the vast majority of the time, as users, we don't touch pre-trained models. That's already done for us.

It's actually a bit of fun: when my friends are training models, I get the chance to play with their pre-trained models, and they're horrendous.

They're saying things like, oh my God, it's crazy.

So it's very interesting to look at how much post-training can change the model's behavior.

Yeah, and I think that's where a lot of people at the frontier labs are spending energy nowadays: post-training. Because pre-training, I think, has been used to increase the general capabilities of a model.

And you need a lot of data, and you increase the model size, to increase the model's capabilities.

And at some point, we have actually kind of maxed out on the internet data, right?

And with text data maxed out, I think a lot of people are looking at other data, like audio and video, and everyone's trying to think of what the new source of data is. But with post-training, because everyone can have very similar pre-training data, post-training is where they make a big difference nowadays.

This is a good segue. You talked about supervised learning versus unsupervised learning.

I love we're getting into this, by the way, this is super interesting.

So you talked about labeled data, basically supervised learning is AI learning on data that somebody has already labeled and told it here is correct versus incorrect.

For example, this is spam versus not spam.

This is a good short story.

This is not a good short story.

We've had the CEOs of a lot of these companies that do this for labs: Mercor, Scale, Handshake, there's Micro1, there's a few others.

So is that essentially what these companies are doing for labs, giving them labeled data, high-quality data to train on?

It is in a way, but I think it's more like part of a bigger equation.

So there are a lot more different components than that.

So that's why I was talking about reinforcement learning.

I'm not sure if the CEOs you interviewed brought up that term.

So the idea is, let's say you have a model, and you give the model a prompt, right?

And it produces an output, right?

You want to bias, like, you want to reinforce, encourage the model to produce an output that is better, right?

So that raises the question, how do we know that the answer is good or bad, right?

So usually people rely on signals.

So one way to get a signal of good or bad is human feedback, right?

You have two responses, and you can say this one is better than the other.

And the reason to do that is that, as humans, it's very hard for us to give a concrete score, but it's easier to compare things, right?

Like if you ask me, okay, give this song a score.

I'm not a musician, and I don't know how hard it is.

It's like, yeah, I don't know, out of 10, I'd give it a six, you know. And then if you ask me again a month from now, I'll have completely forgotten, so maybe now it's a seven, or only a four, I don't know.

But then if you ask me, okay, here are two songs, which one would you prefer to play for a birthday party?

I'd be like, okay, I can say I prefer this song.

So comparisons are a lot easier.

So you have human feedback, and then you use this human feedback to train a reward model.

And then the reward model helps: okay, the model produces this response, the reward model can score whether it's good or bad, and you're trying to bias the model toward producing better responses.
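One common way to turn "this one is better than that one" into scores is a Bradley-Terry style model, sketched below. The episode does not name a method, so treating it this way is an assumption. The comparison data is invented, and one learnable number per response stands in for a real reward model, which would be a neural network scoring a prompt and response.

```python
import math

# Hypothetical human comparisons: (preferred response, rejected response).
comparisons = [("A", "B"), ("A", "C"), ("B", "C"), ("A", "B")]

# One learnable score per response stands in for a reward model.
scores = {"A": 0.0, "B": 0.0, "C": 0.0}
learning_rate = 0.1

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

for _ in range(200):
    for winner, loser in comparisons:
        # Bradley-Terry: P(winner is preferred) = sigmoid(score_w - score_l).
        p = sigmoid(scores[winner] - scores[loser])
        # Gradient step on -log(p): push the winner up and the loser down.
        scores[winner] += learning_rate * (1.0 - p)
        scores[loser] -= learning_rate * (1.0 - p)

ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)  # ['A', 'B', 'C']
```

Nobody had to assign an absolute score to any response; the scores fall out of the pairwise preferences alone.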

Another way is that, instead of using a human, you can use AI, right?

To say whether the response is good or bad, right?

Or the thing that people are very big on nowadays is verifiable rewards, which is natural.

So basically, they give it a math problem and the expected solution. Say the expected response should be 42. If the model's solution doesn't produce 42, then it's wrong, right?

And it's not a good response.

So yes, a lot of the time, people are using human labelers to produce, say, expert math questions and the expected answers,

and to design systems that are verifiable, so that the models can be trained on them.
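A verifiable reward can be as simple as an exact-match check against the expected answer. The sketch below is an illustration under assumptions: the function name `verifiable_reward`, the example outputs, and the rule that the final answer is the last token of the output are all invented. Real systems usually need more careful answer extraction.

```python
def verifiable_reward(model_output, expected_answer):
    """Return 1.0 if the model's final answer matches the expected one, else 0.0."""
    # Assumption: the final answer is the last whitespace-separated token.
    tokens = model_output.strip().split()
    final_answer = tokens[-1].rstrip(".") if tokens else ""
    return 1.0 if final_answer == expected_answer else 0.0

print(verifiable_reward("6 times 7 is 42.", "42"))  # 1.0
print(verifiable_reward("6 times 7 is 41.", "42"))  # 0.0
```

No human or AI judge is needed at training time; the labeling effort goes into writing the questions and expected answers up front.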

Yeah.

Okay, I'm really glad you went there.

This is essentially RLHF, reinforcement learning from human feedback, which is exactly what I wanted to also talk about.

Right?

Yeah.

So I think in general, it's a way of learning, of training: it's reinforcement learning, whether it learns from human feedback, or AI feedback, or verifiable rewards.

I think those are different ways of collecting signals.

Awesome.

Yeah, I had a co-founder of Anthropic on the podcast, and he talked about their version of RLHF, which is AI-driven reinforcement learning.

I love the way you phrased that, where basically you want to help the model, you want to reinforce correct behavior and correct answers.

And this is the method to do it, whether it's, say, an engineer seeing an output from a model and being like, no, here's how I would code it differently.

And then it's training a different model that the original model works with to tell it, am I correct or not correct?

Is that right?

Yeah, I think that's a way of looking into it.

And I think that space is so exciting nowadays, because there are so many domain-expert tasks that the model developers want models to do well on, right?

Let's say you're, like, an accountant, right?

And you really want to use a model to help with accounting tasks.

So you need a lot of accounting data, like examples from accountants.

So you need to hire a lot of them to do it. Or maybe you want physics problems, or legal questions and stuff, or engineering questions. Or somebody was telling me they want to use coding to solve scientific problems, and not just coding to build products, which is a whole different realm of things.

And also using very specific tooling. I'm not sure what apps you use, but maybe some app like QuickBooks, or Google, or Excel. They require very tool-specific expertise, and you want the models to learn that.

So they need a lot of human experts in these areas to create data to train them.

And it's a massive thing.

Everyone wants a lot of data, and the labs have, like, unlimited budgets.

But I think there's also a little bit of low-key interesting economics here. I'm not sure you've talked to your guests about it. I think it's very interesting because it's very lopsided, right?

Because there's only a very small number of frontier labs, right?

And they want a lot of data.

And there's a massive number of startups or companies providing labeled data.

So you can see these companies, these startups doing data labeling, and maybe they have some massive ARR, but you also ask, okay, so how many customers do you have?

And it could be a very small number.

I'm not sure... I saw you smiling.

Yeah, we chat about that.

Yeah, so it makes me a bit uneasy, right?

Here, I have a company that is growing like crazy, but it's heavily dependent on, like, two or three companies.

And at the same time, if I were one of these frontier labs, what would be the economically right thing for me to do? I'd want a lot of startups, I'd want to have a lot of providers so I can pick and choose.

And then these providers would also compete with each other to lower the price, and they're so dependent on me.

It was certainly regardless.

So yeah, the whole economics is very interesting to me.

And I'm curious to see how it plays out.

What I'm hearing is you're bearish on the future of these data labeling companies.

Because as you said, they don't have a lot of leverage over pricing, because they have so few customers.

And there's so many people getting into the space.

So basically, even though they're some of the fastest growing companies in the world, you're feeling like there's a challenge up ahead.

I'm not sure I'm bearish on it.

I think I'm curious, because I think things have a way of working out in ways I don't expect.

So I think that maybe these companies have a lot of data, and maybe they would be able to use that to get some insights that help them stay ahead of the curve.

And you know, so I don't know.

A very fair answer.

Okay, while we're on this topic, I want to chat about evals, which is a very recurring topic in this podcast.

This is the other piece of data content these companies share that AI labs really need.

Can you just talk about what an eval is, the simplest way to understand it, and then how this helps models get smarter?

So I think when people approach evals, there are two very different problems.

One is as an app builder, right?

Say I have an app that is, maybe, a chatbot.

A very simple example.

First thing that came to my mind.

And I want to know if the chatbot is good or bad, right?

So I need to come up with a way to evaluate the chatbot.

The other I think of as task-specific eval design.

So let's say I'm a model developer, and I want to make my model better at creative writing, right?

And it's like, okay, but how do I even measure creative writing, right?

So I need someone to understand creative writing and think about what makes a story good, and then design the whole dataset and the criteria to evaluate creative writing.

So yeah, I think it's more like eval design.

That is very interesting.

Coming up with the criteria, coming up with guidelines on how to do it.

And then also training people on how to do it effectively.

So I guess, you know, I think evals are really, really fun, because they're extremely creative.

I was looking at different evals people built and was like, wow, this is not dry at all.

It's just like, super, super, super fun.

We had a whole podcast on evals with Hamel and Shreya.

And that's exactly what they talked about: it's actually really fun to create evals, for companies especially.

So let's dig into that one a little bit more.

There's this kind of debate online. I don't know how big of a deal this debate is, but it feels like people spend a lot of time thinking about it: do we need evals for AI products? Some of the best companies say they don't really do evals, they just go on vibes. They're just like, is this working well?

Can I feel it or not?

What's your take on the importance of building evals and the skill of evals for AI apps, not the model companies?

You don't have to be absolutely perfect at things to win.

You just need to be good enough and be consistent about it.

Okay, this is not a philosophy I follow.

But like I have worked with enough companies to see that play out.

So when I say like, why can we don't have all right, let's say you are like an executive, right?

And you want to have a new use case.

So here's a use case you started out with built, and it's like, it works well, right?

The customers are somewhat happy, you don't have the exact measure for it.

But like, so the traffic keeps increasing, like people seem happy, people keep buying stuff, right?

And then here's an engineer coming like, okay, we need evals for it.

And so it's like, okay, how much effort do we need to go into evals?

And they were like, okay, maybe two engineers for this much time.

And it could maybe improve things, and you're like, okay, so how much expected gain can I get from it?

And the engineer would be like, oh, maybe you can improve it from like 80% to like 82%, 95%.

And I was like, okay, but it will take two engineers, and if we launch a new feature instead, it could give me so much more improvement, right?

So I think one of the reasons is that sometimes you look at it without evals and say, okay, this is good enough; if you do spend energy on evals, you would get only incremental improvement, whereas you could spend that energy on another use case.

And maybe it's good enough that you can just vibe check it, right?

So I do think maybe that's what the debate is about.

I do think a lot of the time people just get things to the place where it's, okay, good enough, and then run with it.

But of course, there are a lot of risks associated with that, because if you don't have a clear metric, if you don't have good visibility into how the application and models are performing, it might do something very dumb, or, I don't know, something crazy can happen.

So yeah, I do think evals are very, very important if you operate at scale and where failures can have catastrophic consequences; then you do need to be very tyrannical about what you put in front of the users, and understand the different failure modes, what could go wrong.

And also maybe in the case where the feature is a competitive advantage of the product, right?

You want to be the best at it.

So you want to have a very strong understanding of where you are, and where you are relative to the competitors.

But if it's something more low-key, something that's not the core, or it just helps our users a little, then maybe you don't need to be so obsessed or tyrannical about it; it's okay, that's good enough for now.

And if it fails, it fails; I know it's such a fight.

But like, yeah.

Yeah, I think it's all about the question of what the return on investment is.

I'm a big fan of evals, I love reading evals.

That said, I understand why some people would choose not to focus on evals right away and choose to bring on new functionality instead.

Awesome.

That is a really pragmatic answer.

What I'm hearing is evals are great, very important, especially if you're operating at scale, but pick your battles, you don't need to write evals for every little feature.

Something that Hamel and Shreya shared is that people need just, I don't know, five or seven evals for the most important elements of their product.

Is that what you see?

Or do you see a lot more in production that people build?

I don't think of it as a fixed number of evals; what is the goal of evals, right?

The goal of evals is to guide product development.

The reason I'm a big fan of evals is that they help you uncover opportunities, places where the product is not doing well.

So it's something we've seen very often: we look at the evals and realize, okay, it performs really poorly on this specific segment of users.

And then we look into it, like, okay, what's wrong with it?

And it turns out we just don't have good messaging for it.

So maybe we should just focus on the things that are doing poorly and can improve significantly.

Yeah, so the number of evals really depends; we have seen products with hundreds of different metrics, right?

Like people are like going crazy.

This is because the product is general, right?

They have different types of evals: one for, I don't know, verbosity, one eval for sensitive user data.

And another one for length. But for the number of evals, okay, let's take a good example, something like deep research.

So you have an application where you use a model to do deep research for you, right?

Like, okay, you have a prompt and you submit it.

Okay, do comprehensive research on Lenny's podcast and show me a report on what types of topics he's interested in, what kind of videos get the most views, or what topics he's missing that he should be covering, right?

With that kind of prompt, how do you evaluate the result, right?

I don't think there's one metric that would help. I think somebody has a benchmark where they get something like 100 experts to write a bunch of prompts and go through all the answers from the AI, and it's extremely costly and slow, right?

But you might have something else; here is one way of thinking about it.

I was talking to a friend about it, and one way is to ask: how do you produce the resulting summary, right?

First you need to gather information, and to gather information you need to do a lot of search queries.

You gather the search results, and then from the search results you aggregate, and then maybe say, okay, I'm still missing this.

You have to do another round, and another round, and at the end you write the summary.

So every step of the way you need evaluations, right?

You don't need to do the end-to-end.

So maybe for the search queries, I might first think about, okay, the model writes five search queries.

Then I look into how good these search queries are.

Are they similar to each other? Because you might get five search queries that are very similar, like, okay, Lenny's podcast, Lenny's podcast last month, Lenny's podcast two months ago, right?

That's not very exciting; it's better if the keywords in the queries are more diverse, right?

And then look at the results of the search queries: say you enter a search query like "Lenny podcast data labeling" and it comes up with 10 results, and then you enter, I don't know, "Lenny podcast frontier labs" and have 10 results.

Are those different web pages? How many of them overlap?

Are we getting breadth, getting a lot of pages, but do we also have depth, and also relevance, because we might come up with a search query that is completely irrelevant to the original problem.

So I feel like every aspect of it, it would need a way of evaluating, right?

So I don't think it's like how many evals should I get?

But rather, how many evals do I need to get good coverage and high confidence in my application's performance, and to help me understand where it is not performing well so that I can fix it?
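The per-step checks described here for the search stage can be sketched in a few lines. This is only an illustration under assumptions: the function names are made up, and word-level Jaccard similarity is a stand-in chosen for simplicity, not a metric named in the conversation.

```python
from itertools import combinations


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def query_diversity(queries: list[str]) -> float:
    """1 minus the mean pairwise Jaccard similarity of the queries' word sets."""
    token_sets = [set(q.lower().split()) for q in queries]
    pairs = list(combinations(token_sets, 2))
    if not pairs:
        return 1.0  # a single query has nothing to be redundant with
    return 1.0 - sum(jaccard(a, b) for a, b in pairs) / len(pairs)


def result_overlap(results_a: list[str], results_b: list[str]) -> float:
    """Jaccard overlap between the URLs returned for two different queries."""
    return jaccard(set(results_a), set(results_b))
```

A low diversity score flags near-duplicate queries like the "last month" and "two months ago" variants, and a high overlap between two result lists suggests the second query added little breadth. Relevance to the original prompt would need its own check.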

Awesome.

And I'm hearing also just especially for the very core use case, like the most common path people take in your product is where you want to focus.

Yeah, so yeah.

Okay, there's one more term I want to cover, and then I want to go in a somewhat different direction.

RAG, people see this term a lot.

RAG, what does it mean?

So RAG stands for retrieval-augmented generation, and it's sort of not specific to gen AI.

So the idea is that for a lot of questions we need context to answer.

I think it came pretty early; I think it's from a paper in 2017.

For a bunch of question answering benchmarks, they realized that if we give the model information about the question, the answers can be much, much better.

So what they did is they tried to retrieve information from Wikipedia.

So for a question on some topic, we retrieve that and put it into the context, and the model answers much better.

So I feel like it sounds like a no-brainer, right?

I mean, obviously. So I think that's what RAG is in the simplest sense: providing the model with the relevant context so that it can answer the question.
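In its simplest form, that idea fits in a few lines. The sketch below is an assumption-laden toy: retrieval is plain word overlap rather than embeddings or a search index, the prompt template is invented, and no model is actually called.

```python
def retrieve(question: str, documents: list[str], k: int = 2) -> list[str]:
    """Return the k documents sharing the most words with the question."""
    q_words = set(question.lower().split())
    ranked = sorted(
        documents,
        key=lambda d: len(q_words & set(d.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question: str, documents: list[str], k: int = 2) -> str:
    """Put the retrieved documents into the context ahead of the question."""
    context = "\n".join(f"- {d}" for d in retrieve(question, documents, k))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
```

The string returned by `build_prompt` is what would be sent to the model; everything discussed next about data preparation is about making the `retrieve` step find the right text.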

And things get more and more interesting, because traditionally, when RAG started out, it was mostly text.

So we talk a lot about how to prepare data so that the model can retrieve effectively; let's say, not everything is a Wikipedia page, right?

A Wikipedia page is pretty self-contained, and, you know, everything in it is about one topic.

But a lot of the time, your documents are extremely long, right?

And they are structured in weird ways.

Let's say that you have a document about Lenny's Podcast, right?

And at the beginning the document says, from now on, "the podcast" will refer to Lenny's Podcast, right?

So let's say somebody in the future asks, okay, tell me about Lenny, right?

Lenny is great.

And because the rest of the document does not have the term Lenny, you might not retrieve it.

And the document is long enough that it's chunked into different parts.

So the second part perhaps doesn't have the word Lenny, so you cannot retrieve it.

So you have to find a way to process the data.

That makes sure the system can retrieve the information that's relevant to the query, even though it might not be immediately obvious that it is related.

So people come up with things like contextual retrieval: giving each chunk extra relevant data, maybe a summary or metadata, so that the system knows what the chunk is about. Hypothetical questions are very interesting too: for a given chunk of a document, you generate a bunch of questions that the chunk can help answer.

So that when you have a query, it's, okay, does it match any of the hypothetical questions?

And then it can fetch the chunk.

So it's very interesting approach.
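The effect of adding context and hypothetical questions to a chunk can be shown with a toy index. Everything here is an assumption for illustration: the `Chunk` fields, the word-overlap matching, and the hand-written context and questions (in practice those would typically be generated, and matching would use embeddings).

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    text: str
    context: str = ""  # e.g. a short summary or the document title
    questions: list[str] = field(default_factory=list)  # hypothetical questions

    def searchable_text(self) -> str:
        """Text that gets indexed: context + chunk body + hypothetical questions."""
        return " ".join([self.context, self.text, *self.questions])


def tokens(s: str) -> set[str]:
    """Lowercased word set, with question marks stripped."""
    return set(s.lower().replace("?", "").split())


def search(query: str, chunks: list[Chunk]) -> list[Chunk]:
    """Rank chunks by word overlap with the query; drop chunks with no overlap."""
    q = tokens(query)
    scored = [(len(q & tokens(c.searchable_text())), c) for c in chunks]
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [c for score, c in ranked if score > 0]
```

A chunk whose body never mentions "Lenny" is invisible to a query about Lenny until the context line or a hypothetical question carries that word into the indexed text.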

Okay, so maybe before I go to the next thing, I want to say that data preparation for RAG is extremely important.

And I will say that in a lot of the companies I have seen, the biggest performance gains in their RAG solutions come from better data preparation, not from agonizing over which vector database to use.

The vector database, of course, is very important if you care about things like latency, or if you have very specific access patterns, like read-heavy or write-heavy; of course it matters.

But in terms of pure answer quality, right?

I think data preparation wins hands down.

When you say data preparation, what's an example to make that real and concrete for us to understand?

So one way, as mentioned, is that you have chunks of data.

So you have to think about how big each chunk should be, right?

Because think about it: there's a context budget you want to make the most of. As a very simple example, say you want to retrieve 1000 words, right?

If a data chunk is long, then it's more likely to contain more relevant information, so each chunk carries more.

But if it's too long, say your budget is a thousand words and each chunk is a thousand words, you only get to retrieve one chunk, so it's not very useful. If chunks are short, then you can retrieve a wider range of documents and chunks.

But at the same time, each chunk may be too small to contain the relevant information.

So you have to design chunks carefully: how big each chunk should be.
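The arithmetic behind that trade-off is easy to make concrete. This sketch assumes fixed-size, word-based chunks with no overlap, which is a simplification; the helper names are made up.

```python
def chunk_words(text: str, chunk_size: int) -> list[str]:
    """Split text into consecutive chunks of at most chunk_size words."""
    words = text.split()
    return [
        " ".join(words[i : i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]


def chunks_in_budget(budget_words: int, chunk_size: int) -> int:
    """How many full chunks fit into a retrieval budget measured in words."""
    return budget_words // chunk_size
```

With a 1000-word budget, 1000-word chunks let the retriever return a single chunk, while 100-word chunks allow ten different chunks, each carrying far less surrounding context.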

You add contextual information like a summary, metadata, hypothetical questions.

Somebody was telling me that a very big performance gain they got came from rewriting their data in a question-answering format.

So say they have a podcast, right? Instead of just chunking the podcast, they rewrite it as, here's a question, here's the answer, and produce a lot of those.

You can use AI for that as well.

So that's one example of data processing.

Another example I see a lot is people using AI with specific tools and their documentation, right?

And documentation today is usually written for human reading.

And AI reading is different, because humans have common sense.

And we kind of know what it is.

So when things are written by humans for human experts, the readers have context that AI doesn't quite have.

So somebody told me that a big change they made is this: let's say that you have documentation for a function in some library, and the documentation says, okay, the output of this is, I don't know, some crazy term, maybe a temperature for something on a graph, and it should be one, zero, or minus one.

And as a human expert, you may understand the scale and what one on the scale means.

But an AI just really doesn't understand what that means.

So they actually have another annotation layer for AI.

And it says, okay, temperature equals one means this; it's not an actual temperature, it's associated with the scale over there.

So it's exactly that, all this data processing to make it easier for AI to retrieve the relevant information to answer the questions.
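One way to picture such an annotation layer is a small glossary that gets appended to a documentation snippet before it is indexed or placed in context. This is a guess at the shape of the idea, not the setup the speaker's contact used: the `AI_NOTES` entry, its wording, and the `annotate_for_ai` helper are all hypothetical.

```python
# Hypothetical glossary; the wording of the note is invented for illustration.
AI_NOTES = {
    "temperature": (
        "Not a physical temperature. It is a score on a fixed scale "
        "whose only values are -1, 0, and 1."
    ),
}


def annotate_for_ai(doc: str, notes: dict[str, str]) -> str:
    """Append a note for every glossary term that appears in the doc."""
    lowered = doc.lower()
    lines = [f"- {term}: {note}" for term, note in notes.items() if term in lowered]
    if not lines:
        return doc  # nothing to explain, leave the doc untouched
    return doc + "\n\nNotes for AI readers:\n" + "\n".join(lines)
```

The human-facing documentation stays as written; only the copy handed to the model carries the extra explanation.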

Awesome.

Okay.

So you've talked a bit about how you work with companies on these sorts of things, on their AI strategies, on their AI products, how they build, which tools they build, all these things.

I want to spend a little time here, because a lot of companies are building AI products, a lot of companies are not having a good time building AI products.

Let me ask a few questions along these lines of what you've learned working with companies that are doing this well.

One is just, I guess in terms of AI tool adoption and adoption in general within companies, there's also talk recently of just like all this AI hype, the data is actually showing most companies try it, doesn't do a lot, they stop.

And so there's all this just like maybe this isn't going anywhere.

So in terms of just adoption of tools in AI within companies, what are you seeing there?

For gen AI in companies, I think there are two types of gen AI tooling that I have seen. One is for internal productivity.

Coding tools, Slack chatbots, internal knowledge: a lot of big enterprises have some kind of wrapper around models, with access to maybe some kind of RAG solution.

I think we talked about text-based RAG; I haven't talked about agentic RAG or multimodal RAG yet, but there is a whole very exciting area around that.

Yeah, so basically it allows employees to access internal documents.

Someone might ask, okay, I'm having a baby, what is the maternity or paternity policy, right?

Or, I'm having this operation, would the health benefits cover that?

Or, I want to refer my friend for an interview, what is the process for that?

So a lot of this is having an internal chatbot to help with internal operations.

And the other category is more customer-facing.

Or partner-facing; customer support chatbots are a big one.

If you're a hotel chain, you might have a booking chatbot, which is somehow massive, there are a lot of booking chatbots. I do have this theory that a lot of the applications companies pursue are chosen because they can measure the concrete outcome.

And I feel like a booking or sales chatbot is very clear, right?

Like, what's the conversion rate right now with human operators, and what would the conversion rate be with a chatbot?

It's something with very clear outcomes, and it's easier for companies to buy into these solutions.

So a lot of companies have that kind of customer-facing chatbot.

So yeah, that is another category of tool.

And for customer- or external-facing tools, people are driven to choose applications with clear outcomes.

So the question of adopting them is really based on whether they see the outcome or not. Of course, that's not perfect, because sometimes the outcome can be bad not because the application idea is bad, but because the process of building it was not that great.

Yeah, so for the adoption of internal tooling, internal productivity, that's where it gets tricky.

And with a lot of companies, the way I think of AI strategy is that it usually has two key aspects, right?

One is use cases.

And the second is talent: you might have great data for great use cases, but if you don't have the talent, you cannot do it.

So a lot of the time in the beginning with gen AI, and it's something I really admire a lot of companies for, they say, okay, we need our employees to be very gen AI aware, very AI literate, right?

So what they do is start by adopting a bunch of tools for the team to use; they have upskilling workshops, they encourage learning, and it's a really, really good thing.

And they're also willing to spend a lot of money on giving people access to subscriptions, personal subscriptions, Claude Code subscriptions, to get the employees to be more AI literate.

And the other thing is, a lot of them then say, okay, we spent a ton of money on tooling.

But then, because you can see the usage, it's like, people don't seem to use them as much.

And what is the issue?

So yeah, I think that is tricky.

Yeah.

What do you think is the issue?

Is it just that they don't know how to use them?

Like, what do you think is the gap here?

Do you think we'll get to a place of just like, wow, work is completely different because of AI for a lot of companies?

The main thing is, it's really hard to measure productivity.

So I talked to a lot of people about this; first of all, the most common example is coding, right?

A lot of companies are now using coding agents or coding tools.

And I was asking, do you think it helps with your productivity?

And a lot of the time the answers are very hand-wavy.

It's like, okay, I feel like it's been better, right?

And they say, okay, because we have more PRs, we see more code, and then immediately correct themselves: okay, but of course, the number of lines of code is not a good metric for that, right?

So it's really, really tricky.

And here's something funny.

So I do ask people to ask their managers, because I usually work with people at the VP level, so they have multiple teams under them.

So I tell them, okay, ask your managers.

Like, okay, which would you rather have?

Would you rather give everyone on the team a very expensive coding agent subscription?

Or get an extra headcount, right?

And almost all the managers would say headcount.

But if you ask someone at the VP level, or someone who manages a lot of teams, they would say they want the AI assistant tools.

And the reason, people say, is that as a manager you are still growing; you're not at the level where you manage hundreds or thousands of people.

So for you, having one extra headcount is big.

So you want that not for productivity reasons, but because you just want to have more people working for you.

Whereas as an executive, you have more business metrics that you care about.

So you actually think about what actually drives productivity metrics for you.

So yeah, it's tricky.

And on the question of productivity, I'm not sure it's fundamentally that people aren't more productive; it's just that we don't have a good way of measuring productivity improvement.

Another thing is that it also varies widely.

People do tell me that they notice different buckets of employees having different reactions to AI assistant tools.

First of all, I keep going back to coding because it's big, and it's easier to reason about.

So I have different reports. One person told me that, amongst all his engineers, he thinks senior engineers get the most out of it, become the most productive; and what that person did is very interesting.

He actually divided his team into three buckets, but he didn't tell them, obviously; it was like, okay, here are the currently best-performing, average-performing, and lowest-performing.

And then he ran a randomized trial.

So he gave half of each group access to Cursor.

And then he noticed over time something funny: the group that got the biggest performance boost, in his opinion, since he was very close to his team, was the senior engineers, so the highest-performing engineers got the biggest boost out of it.

And then the second group was the average-performing.

So his opinion is, okay, the highest-performing engineers are normally proactive; they know how to solve problems.

So they use it to solve problems better.

Whereas the people who are lowest-performing already don't care much about work, right?

So it's easier for them to just go on autopilot, get it to generate the code and just go with it, or they don't know how to use it.

Another company, however, tells me that, actually, senior engineers are the ones most resistant to using AI-assisted tooling, because they are more opinionated and they have very high standards; they say, okay, but AI-generated code just sucks.

So they're just very, very resistant to using it.

So I don't know, I haven't quite been able to reconcile the very different reports on that yet.

This is so interesting.

So just to make sure I'm hearing the story right: there's a company you work with that did a three-bucket test with their engineering team, where they created three groups, the highest-performing engineers, mid-performing engineers, and lowest-performing engineers, and gave some of them access to, say, Cursor; was it Cursor?

I think that was Cursor.

Okay, cool.

I didn't work with them; it was a friend's company.

Okay, it's a friend's company.

So did they give like half of the higher performing engineers cursor and half not?

Or how did they do the split?

Yeah, so they gave it to half of the entire company, but half of each bucket.
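Giving the tool to half of each bucket, rather than to a random half of the whole team, is a stratified random assignment. The sketch below shows one way to do it; the mechanics of the actual trial are not described in the conversation, so the function name, the "tool"/"control" labels, and the fixed seed are all assumptions.

```python
import random


def assign_by_bucket(buckets: dict[str, list[str]], seed: int = 0) -> dict[str, str]:
    """Randomly give the tool to half of each performance bucket."""
    rng = random.Random(seed)  # fixed seed so the assignment is reproducible
    assignment: dict[str, str] = {}
    for engineers in buckets.values():
        shuffled = engineers[:]  # copy so the caller's list is not reordered
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        for name in shuffled[:half]:
            assignment[name] = "tool"
        for name in shuffled[half:]:
            assignment[name] = "control"
    return assignment
```

Stratifying this way keeps the tool and control groups balanced across performance levels, so a difference between buckets is less likely to be an artifact of who happened to get access.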

Yeah.

And then they observe the difference in like, I see.

Yeah.

So how did they even do that?

They're just like, okay, you get Cursor, you don't get Cursor? So yeah, I didn't get the mechanics of it.

But but I was like, I respect you for doing a randomized trial.

That is so cool.

Yeah.

Okay, wow.

How large was this engineering team?

Was it like hundreds of people?

It's not that large.

It's about maybe 30 to 40.

Yeah, 30 to 40.

Okay.

Yeah.

Wow.

Okay.

So they found that the highest performing engineers had the most benefit from using AI tools.

And then behind them were the middle-tier engineers, and the lowest performers benefited the least.

Yeah, but it's not the same everywhere.

Right, right, right.

Yeah, different.

Right.

This other example you shared, of senior engineers being the most resistant to changing the way they work, I get.

I do feel like the most valuable people right now, other than ML researchers and AI researchers like yourself, are senior engineers. For junior engineers, so much of the work is now done by AI, but an engineer who knows what they're doing, who understands how things work at a large scale, and who has AI tools that are basically infinite junior engineers doing their bidding, feels like an extremely valuable and powerful asset.

Yeah, as you say, companies really appreciate engineers who have a good understanding of the whole system and have good problem-solving skills, thinking holistically instead of locally. One company I've seen told me that the way they work is completely different now.

They actually restructured the engineering org so that the more senior engineers do more of the peer review; they write guidelines on what good engineering practices are and what the process should look like, so they write a lot of processes on how to work well.

And then they have the more junior engineers just produce code and submit PRs, with the senior engineers more in a reviewing role.

So I think they might be preparing for the future.

Another company actually told me something very similar.

They think that in the future they will only need a very small group of very, very strong engineers to create processes and review code before it gets into production, and get AI or junior engineers to produce the code.

But then the question becomes, how does one become a very strong engineer?

Right.

That's right.

That's right.

That's the problem.

Yeah.

So I don't know who is thinking about the process for that.

No one's thinking about it.

It's just, it's a problem.

We won't have any more in 10, 20 years.

There will be no more engineers because no one's hiring junior engineers.

Although I could make the case that junior engineers, people just getting into computer science right now, are just AI native.

And in theory, you could argue they will become really good, really fast.

If they're curious and aren't just, you know, delegating learning and thinking to AI, but actually using it to learn how to code well and architect correctly.

Like you could argue they will be the most successful engineers in the future.

I do think that what you mentioned, learning to architect.

I group that into system thinking.

I do think it's a very important skill, because I think AI can help automate a lot of disjointed skills, but knowing how to use the skills together to solve a problem is hard.

So there's a webinar with Mehran Sahami, who was one of my favorite professors.

He was the chair of the curriculum at the CS department at Stanford.

So he spent a lot of time thinking about CS education, right?

Like, what should students learn nowadays in the era of AI coding?

And the other person is Andrew Ng, who is of course a legend in the AI space. And Mehran Sahami said something very interesting.

He said a lot of people think that CS is about coding, but it's not; coding is just a means to an end.

CS is about system thinking, using coding to solve an actual problem, and problem solving will never go away, because as AI can automate more stuff, the problems just get bigger.

But the process of understanding what caused an issue and how to design a step-by-step solution to it will always be there.

So an example: I actually have a lot of issues with AI when it comes to debugging.

I'm not sure if you use AI a lot for coding, but something I noticed, and so did my friends, is that it is pretty good when you have very clear, well-defined tasks, maybe writing documentation, fixing specific features, or building an app from scratch, right?

Where it doesn't have to interact with a large existing code base. But if you add something a little more complicated, maybe something that has to interact with a lot of components and so on.

It's usually not that good.

And for example, I was using AI to deploy an application, and I was testing out a new hosting service.

I was not familiar with it.

It was like, okay, like usually they form me.

So what AI does give me is the confidence to try new tools.

Before AI, trying a new tool meant reading the documentation from the beginning, but now it was like, okay, just try it out and learn.

So I was testing out this new hosting service and I kept getting a bug that was very, very annoying.

And it was like, okay.

I asked Claude Code to fix it.

And it kept changing things: maybe change an environment variable, fix the code.

Maybe change from this function to that function, maybe change the language.

Maybe it doesn't process JavaScript, whatever.

And it didn't work.

And it was like, okay, that's it.

I'm just going to read the documentation myself and see what's wrong.

And it turns out I'm on another tier; the feature I want is not available in this tier.

Right.

So I feel like, okay.

So the issue is that Claude Code was just trying to fix things in one component, whereas the issue came from a different component.

So I think about it as understanding how different components work together and where the source of an issue might come from.

You need to get a holistic view of it.

And it made me think, okay, how do we teach AI system thinking like that, right?

Maybe you have human experts write some scaffolding, like, okay, for this kind of problem, look into this, look into that, look into that.

And so on.

So I think that could be one way, but it also made me think, how do we teach humans system thinking?

Um, yeah.

So, so yeah.

So I think it's a very interesting skill.

I do think it's very important.

That's exactly the same insight Bret Taylor shared on the podcast.

He's the co-founder of Sierra.

He created Google Maps.

He was CEO of Salesforce, Quip, a few other things.

And I asked him just like, should people learn to code?

And his point is exactly what you said, which is learning, taking computer science classes is not about learning Java and Python.

It's learning how systems work and how code operates and how software works broadly, not just here's like a function to do a thing.

One thing that I wanted to help people understand: you wrote this book called AI Engineering, which is essentially helping people understand this new genre of engineer.

And you have this really simple way of thinking about the difference between an ML engineer and an AI engineer, which has a really good corollary to product managers.

Now there's an AI product manager versus a non-AI product manager; the way you describe it, and fill in what I'm missing, is just that ML engineers build models themselves.

AI engineers use existing models to build products.

Anything you want to add there?

One thing I really dislike about writing books is that it has to be defined like this.

And I think it's like no definitions will be perfect because there will always be like edge cases.

But yeah, in general, I think it's like gen AI as a service, like models as a service, where somebody builds the models for you and the base model performance is pretty strong.

So it's enabled people to just say, okay, now I want to integrate AI into my product.

I don't need to learn gradient descent, even though knowing that could really help.

But yeah, it's like it makes entry barrier really low for people who want to use AI to build product.

And at the same time, AI capabilities are like so strong.

It's also increased the possibilities, like the types of applications that AI can be used for.

So I think, yes, the entry barrier is super low and the demand for AI applications is a lot bigger.

So I feel this is very, very exciting.

It opens up, like, a whole new world of possibilities.

Yeah, it's like now you don't even have to spend time building this AI brain, now you can just use it to do stuff.

Such an unlock.

Okay, maybe just a final question.

You get to see a lot of what's working, what's not working, where things are heading.

I'm curious just if you have to think about in the next two or three years, just where things are heading.

What do you think?

How do you think building products will be different?

How do you think companies working will be different if you had to think of maybe the biggest change we expect to see in the next few years in terms of how companies work?

I think in a lot of organizations, they don't move that fast, right?

But at the same time, they move faster than expected.

Because again, I think it's like bias, like in a dinosaur company, I don't care.

I think a lot of executives who come to me are like very forward looking.

So maybe I'm very biased towards organizations that move fast.

So yeah, so I think one big change I see is like in organizational structure.

I think there's a lot of value placed in like, so before, we have a lot of disjointed team.

We have very clear engineering team, product team.

But then there's the question of, like, who should write evals?

Who should own the metrics?

And it turns out evals are not a separate problem.

It's a system problem, because you need to look into different components and how they interact with each other.

You need to understand user behavior, and you need to know what users care about so that you can write evals.

The evals should reflect what users care about.

So some of that, like looking into different components and architectures, placing guardrails and stuff, is engineering.

But understanding users is, like, product, right?

So because of all that, evals are extremely important.

So that kind of brings the product team and the engineering team, even the marketing team, like user acquisition, very close to each other.

So yeah, in a way, there's a restructuring, so there's more communication between previously very distinct functions.

Another thing is just like, I also see as teams, of course, I think about like what can be automated in the next few years and what cannot be automated.

And I see that people already like shedding like, actually, it's a little bit like, scary to think about it.

But also things it's like the team's the web topies, it's like, okay, this is a good new and me, but we're like, got rid of these functions, right?

A lot of things were previously outsourced; for example, traditionally, businesses outsource things that are not core to them and that can be more systematized.

So with that, you can actually use AI to automate a lot of that.

And as soon as a separation, people think more about like, what is the value of like junior engineers or senior engineers, how to restructure engineering for that.

So yeah, so I do definitely things that is one thing to success organization, people are just moving pieces around and like, thinking about like use cases, whether you need to like spin out new use cases and who would lead a new effort and like, yeah, that is one big change.

Another thing, on the topic of AI, and I'm not sure how true this is.

I guess I'm also in the camp of thinking there's merit to the view that, okay, with base models we have probably not quite maxed out, but we are unlikely to see, like, really, really, crazily stronger models.

So think about when we had GPT, right?

And then GPT-2, which was a big step up, like dramatically better than GPT, and then GPT-3, which was much, much bigger, and then GPT-4, much, much bigger.

And now, of course, we have GPT-5, but is GPT-5 that same scale of step jump compared to the previous one? I think it's up for debate.

Right?

So I think we're at this point where the base model performance improvement is not going to be as mind-blowing

as it was in the last three years.

So, so I think it's like a lot of like improvements we're going to see in the post training phase in the application building phase.

And also, another place where I feel we will see a lot of improvement, a very interesting one, is multimodality.

So we've seen a lot of text-based use cases, but there are also a lot of audio and video use cases, and that is very, very exciting.

And I think audio is not quite as solved, because I do work with a couple of voice startups, and when you think about voice, it's an entirely different beast.

So let's say you have a chatbot, right, and you go from a text chatbot to a voice chatbot; the concerns are completely different.

Because now with a voice chatbot, right, we need to think about latency, because there are multiple steps: first voice to text, then the text question into a text answer, and then the text answer to voice, right.

So it's, like, multiple hops.

And latency becomes very important.
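
To make the multi-hop point concrete, here is a minimal sketch that adds up per-stage latency in a cascaded voice pipeline (voice to text, text to text, text to voice). The stage names and millisecond values are illustrative assumptions, not measurements from the episode or from any real system.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Stage:
    """One hop in a cascaded voice pipeline."""
    name: str
    latency_ms: float  # hypothetical per-stage latency, for illustration only


def total_latency_ms(stages: List[Stage]) -> float:
    """Hops run one after another, so their latencies add up."""
    return sum(stage.latency_ms for stage in stages)


# Assumed numbers: each hop is tolerable alone, but the user waits for the sum.
cascaded = [
    Stage("speech_to_text", 300.0),
    Stage("text_to_text", 700.0),
    Stage("text_to_speech", 250.0),
]

total = total_latency_ms(cascaded)
bottleneck = max(cascaded, key=lambda stage: stage.latency_ms)
print(f"end-to-end: {total} ms, slowest hop: {bottleneck.name}")
```

Because the hops are sequential, trimming any single stage only helps by that stage's share of the total, which is one reason latency here tends to be treated as a system-level engineering problem.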

And there's a question of, like, what does it take to sound natural?

So for example, when humans talk to each other, if I'm talking and you try to interrupt me and say, Chip, that's right, I would pause and try to hear you out, right.

But sometimes you just say some word to acknowledge, like, mm hmm, mm hmm, and then I shouldn't stop, I should just continue.

So the question of interruption, like, should I stop or not, is a big part of what is perceived as a natural conversation.

And there's also regulation, right?

Because a lot of the time people want to build voice chatbots that sound like humans, almost to trick users into thinking they're talking to humans.

But also, right, there may be potential regulations saying, okay, you have to disclose to users whether the bot they're talking to is human or AI.

So I think there's a whole space there that's not quite as solved as you think, but it's also not quite an AI foundation model problem, right?

Because human interruption detection is actually a classical machine learning problem.

Like, with a different framing, you can build a classifier for that.
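
As a rough sketch of what framing interruption detection as classification could look like, the snippet below trains a tiny nearest-centroid classifier that separates a short acknowledgement ("mm hmm") from a real interruption. The two features (speech duration and word count), the labels, and the toy training data are all hypothetical; a real system would use richer audio features and a stronger model.

```python
from math import dist
from statistics import mean
from typing import Dict, List, Tuple

# Hypothetical features: (speech duration in seconds, number of words spoken)
Features = Tuple[float, float]


def fit_centroids(examples: Dict[str, List[Features]]) -> Dict[str, Features]:
    """Average the feature vectors of each class to get one centroid per label."""
    return {
        label: (mean(p[0] for p in points), mean(p[1] for p in points))
        for label, points in examples.items()
    }


def classify(centroids: Dict[str, Features], features: Features) -> str:
    """Return the label whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda label: dist(centroids[label], features))


# Toy, made-up training data for illustration only.
training = {
    "backchannel": [(0.3, 1.0), (0.4, 1.0), (0.5, 2.0)],   # e.g. "mm hmm"
    "interruption": [(1.2, 3.0), (1.5, 4.0), (2.0, 6.0)],  # e.g. "wait, that's not..."
}
centroids = fit_centroids(training)

short_ack = classify(centroids, (0.35, 1.0))   # bot should keep talking
real_cut_in = classify(centroids, (1.8, 5.0))  # bot should stop and listen
print(short_ack, real_cut_in)
```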

Or the question of latency: it's actually a massive engineering challenge, not an AI challenge.

Of course, it can be an AI challenge, because people are trying to build voice-to-voice models.

So instead of having to first transcribe the voice into text, then get a model to generate a text answer, and then get another model to turn the text into speech, you can go from voice to voice directly.

So that is something we're working on.

But it's like very hard.

Yeah, so even audio, I think, is easier than video, right?

Because video has both image and voice, and audio is already pretty hard.

So I think there are a lot of challenges in that space.

That was an awesome list of things.

Let me mirror back real quick.

So what you're predicting in the next few years, things that will change in the way we work.

And these actually resonate with so many conversations I've had on this podcast.

So it's just kind of doubling down on where things are heading.

One is the blurring of lines between different functions: instead of just, like, eng and design, everyone's gonna be doing a lot of different things now.

Two is just more of work being automated with agents and all these AI tools, and just in theory productivity going up.

Third is shifting from pre-training models to post-training, fine-tuning, and things like that, because, to your point, models maybe are slowing down in how smart they're getting.

Although I'll point folks to the chat with the co-founder of Anthropic; he made a really good point here.

He's like, we're really bad at understanding what exponentials feel like when we're in the middle of them.

And also models are being released more often.

So the difference between them, we may not notice, because they're just happening more often, versus GPT-3 came out like a year after GPT-2.

So maybe true, maybe not.

And then the fourth point you made is this idea of multimodal, investing in multimodal experiences.

I cannot wait for ChatGPT voice mode to get better at interruption, like exactly what you're saying.

I'm just like talking to it and so it makes a little sound.

It's like, okay, and then you have to and then it's like, and then it stops talking.

So annoying.

I'm shocked that we don't have a better voice assistant at home yet.

I think I have been testing out a bunch.

I keep hoping, oh my God, this would be the one, and I don't know how many of them I just had to give away because they're not that good.

I think it's coming.

I hear it's coming; Anthropic's working on something, I don't know if it's launched or not yet.

Yeah.

I'm sorry.

I want to go back to what you mentioned, about what your guest from Anthropic said about the performance improvement.

I think there's a big distinction.

I think there's a difference between the base model's capability,

which is what I was talking about, like the pre-trained model, right?

versus the perceived performance.

So, I'm not sure if you're familiar with the term test-time compute.

I don't think so.

Yeah.

So the idea is, okay, you have some fixed amount of compute, right?

So you're going to spend a lot of compute on pre-training the model, and then some compute on fine-tuning.

The ratio for apportioning compute between pre-training and post-training is very different now.

And then you also have to spend compute on inference: once you have trained and fine-tuned a model, now you want to serve your users.

So I might type a question, a prompt, and the model has to do inference, and that requires compute.

And so you have this discussion of, like, should I spend more compute on pre-training, or fine-tuning, or inference, right?

Inference compute is what people refer to as test-time compute.

So spending more compute on inference is called test-time compute, as a strategy of just allocating more compute resources to inference, which can bring better performance.

And how does that do it?

Like, let's say, let's say you have a math questions, right?

And maybe instead of generating one answer, you generate four different answers and say, okay, whichever is the best according to some standard, or like, okay, I have four answers, and maybe three of them say 42, and one of them says 20; okay, three of them are in agreement, so the answer should be 42, right?

So people can just generate a bunch of answers.

Or another thing is, a lot of the time with reasoning, thinking is just being able to generate more thinking tokens, to spend more time thinking before showing the final answer.

It requires more compute, but it gives better performance.

So yeah, I think from the user perspective, right, when the model spends more time exploring different potential answers and thinking longer, it can give you much better final answers.

But the base model itself does not change.
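
The "generate several answers and take the one most of them agree on" strategy can be sketched in a few lines. This is a minimal illustration, not any lab's implementation: `sample_answer` stands in for a call to a fixed model, and the canned sampler and its outputs are hypothetical stand-ins so the example runs without a model.

```python
from collections import Counter
from typing import Callable, Iterable, List


def majority_vote(answers: List[str]) -> str:
    """Pick the answer that appears most often among the sampled candidates."""
    counts = Counter(answer.strip() for answer in answers)
    return counts.most_common(1)[0][0]


def answer_with_more_samples(
    sample_answer: Callable[[str], str], question: str, n_samples: int = 4
) -> str:
    """Spend more inference compute: sample n answers from the same model, then vote."""
    candidates = [sample_answer(question) for _ in range(n_samples)]
    return majority_vote(candidates)


def make_canned_sampler(outputs: Iterable[str]) -> Callable[[str], str]:
    """Hypothetical stand-in for a model call; replays fixed outputs in order."""
    iterator = iter(outputs)
    return lambda question: next(iterator)


# Three samples agree and one disagrees, so the vote returns the majority answer.
sampler = make_canned_sampler(["42", "42", "20", "42"])
result = answer_with_more_samples(sampler, "What is 6 * 7?", n_samples=4)
print(result)
```

The model behind `sample_answer` never changes here; only the number of samples, and therefore the inference compute, goes up.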

Does it make sense?

Yes, that does.

Absolutely.

Yeah, that is a good corollary to Ben Mann's point.

Yeah, Chip, we covered a lot of ground.

I've gone through everything I was hoping to learn and more.

Before we get to a very exciting lightning round, is there anything else that you wanted to share anything else you want to leave listeners with?

So I do work with a few companies that do this thing where they want employees to come up with ideas.

So there's a big debate on, like, what is the better approach for a strategy?

Should it be top-down or bottom-up, right? Should executives come up with, like, one or two killer use cases and have everyone allocate resources to that?

Or should you let engineers and PMs and smart people come up with ideas?

And I think there's a mixture of both.

So some companies, it was okay, we hire a bunch of smart people, like let's see what they come up with.

And they organize, like, hackathons and internal challenges to get people to build products.

And one thing I noticed is that a lot of people just don't know what to build.

And it shocked me; I feel like we are in some kind of idea crisis, right?

Now we have all these really cool tools to help you do everything from scratch: AI can help you design, it can help you write code, it can help you build a website.

So in theory, we should see a lot more.

But at the same time, people are like somehow stuck, like they don't know what to build.

And I think maybe a lot of it has to do with societal expectations.

Because we have gone into this phase of specialization, where people are very highly specialized and are supposed to focus on doing one thing really well instead of seeing the big picture, and when you don't have a big-picture view, it's hard to come up with ideas of what to build.

So when I work with these companies, we do work out, like, a guideline on how to come up with ideas.

And usually what we think of is, like, okay, one tip is, look at the last week, right?

Like for a week, just like pay attention to what you do and what frustrates you.

And once something frustrates you, think like, is there anything we can do?

Is there like can it be done a different way?

So it's not frustrating.

And you can talk, like, people can swap notes with other teams.

And if you see like common frustrations, maybe there's something you can think about is just to build something around that.

So yeah, so I feel like just like notice like how we work, thinking of like ways to like constantly ask questions like how can it be better.

And then I just build something to like address the frustrations.

I think it's a good way to just like learn and adopt AI.

I think people have felt exactly what you're describing every time they open up one of these vibe coding tools, where you could just describe anything you want.

I'm like, I don't know what do I want?

And I love this very tactical piece of advice, just like what frustrates you just pay attention to where you're frustrated.

For example, I just built a very cool little vibe coded app.

I was working on a newsletter post inside Google Docs.

And I pasted all these images into the Google Doc from screenshots and stuff.

And then I forgot, oh, yeah, you can't take images out of Google Docs.

It's like this Hotel California experience where you can paste stuff into it.

Very hard to get images back out.

So I just went to one of the vibe coding tools and just built an app that you can give a Google Doc URL.

And it lets me download all the images automatically.

And it worked amazingly well.

And I made it really cute.

And I'll link to it in the show notes.

Oh, I'd love to see that.

I'm very bullish on using AI to just create, like, micro tools.

Like just something that makes your life a bit easier.

100%.

I feel like that's one of the main ways people are using these tools, just like a little niche problem they have.

With that, Chip, we've reached our very exciting lightning round. I've got five questions for you.

Are you ready?

Yeah, always depends on how hard the questions are.

They're very consistent across every guest.

So I imagine you've heard them before.

First question, what are two or three books that you find yourself recommending most to other people?

I'm really terrified of, like, book recommendations, because I feel like what books someone should read depends on what they want and where they are in life and where they want to get to.

But there are several books that I do think have really changed the way I see the world.

So one is The Selfish Gene.

It actually helped me with the question of, like, whether I want to have kids or not.

Because it's about understanding that a lot of how we function or how we operate is a function of our genes.

And genes want to do one thing, it's like to procreate.

So yes, in a way, but another thing it proposes is, like, everyone wants to live forever, right?

And maybe it's not conscious, but subconsciously, we do want that.

And there are two ways: one is through genes, like genes, once passed on, go on forever, but the other is through ideas.

I think there's something to that.

It's like, suppose you put some ideas out there, and they last for a long time.

So in a way, you live on.

I know it's like, it's a little bit like abstract, but I thought it's very interesting.

The other book I really, really like is the book from Singapore's former prime minister.

I think he's known as the father of Singapore.

Lee Kuan Yew. I'm not sure what the title is, but he was the one who changed Singapore from a third world country to a first world country within 25 years.

And I have never seen any country leader spend so much effort on putting down his thoughts on how to build a country like that.

And he talks a lot about public policy, like how to create policies that encourage people to do the right things that are good for the nation.

And he also talks about foreign affairs, foreign policy, like the relationship of the country with others.

So it's a really good book to think about.

For me, it's like system thinking, but it's a different kind of system, a country, which a lot of us don't ever get a chance to experiment with in our lives.

So it's good to learn about that.

What was the name of that second book?

It's called, like, From Third World to First.

I think I have it somewhere here.

Yeah, there it is, very show and tell.

That's, that's awesome.

I definitely want to read that.

That's a really good tip.

I've heard a lot about just the impact he's had.

And I've seen all these videos on Twitter, just his really wise insights into how to build a thriving society.

And clearly it worked.

How does he have time to write such a thick book?

It's like insane.

That is Claude, please summarize.

I'm just joking.

By the way, The Selfish Gene, I also absolutely love that book.

That is such a good choice.

It's such an under the radar kind of book that really changed the way I see the world as well.

So really good pick.

Okay, next question.

Do you have a favorite recent movie or TV show you really enjoyed?

So I watch a lot of movies and TV shows as research because I'm working on my first novel and I recently sold it.

So I'm interested in, like, what makes a drama.

It's not science fiction or anything that, like, tech people usually read.

So I know it's very out of left field, so I've been watching TV to see what kind of stories become popular, trying to understand the tropes and stuff like that.

So I'm not sure if the audience are like...

Well, what's one?

What's one that taught you something about writing?

I think you like YAMC, it's a Chinese TV show.

Cool.

Okay.

I haven't heard that one on the podcast before.

Okay, cool.

Next question.

Do you have a life motto that you often think about, come back to when you're dealing with something hard, whether it's in work or in life?

This sounds very nihilist.

I think society is like in the end, nothing really matters.

Usually I think of like in the grand scheme or in a billion years, nothing will like no one will ever be there.

I think, okay, someone will argue with me about that.

So I go to say like, I've told my theories like in a billion years, like none of us will ever exist.

So like whatever like messy things, like crazy things we do or like how bad we do it, I mean, no one would be there to remember it.

And I think in a way, it's like it's so scary, but it's very liberating because this allows me to, okay, let's just try things out.

Right?

Like, why does it matter?

And there's a story: recently, we had a family member who passed away.

And I was talking to my dad because I couldn't be home for that.

I was asking my dad, like, okay, is there anything I can do to make that person more comfortable, is there anything I can get that person?

And my dad was like, what can he possibly want at this moment?

And this made me feel like at the end of life, there's nothing material that can bring you joy.

There's no like money, no product, nothing.

And in a way, it left me feeling like, okay, what do I really care about at the end of the day?

So I guess it's like I think about it.

It's like, okay, maybe I fail it, maybe I don't get that contract, maybe do things like in the end, at the end of life, like, I don't think that actually really matters.

So in a way, just like it's kind of liberating.

I know you said it might be nihilistic.

This is what Steve Jobs shared too.

And one of his most famous speeches just, we will all die someday.

So don't take things so seriously.

And it is freeing.

Absolutely.

It just makes you appreciate every moment every day you have just like, yeah, let's just do something hard and scary.

Okay, final question.

You talked about how you're writing a novel.

Most people in tech have never written something creative and fiction.

What's just like one thing you learned in the process about how to write better stories, better fiction?

A lot of time when we read, we get tripped up by some small things.

So I wanted to do creative writing because I just want to be a better writer.

And I told myself, maybe trying a different audience could help me become better at anticipating what a different type of audience would want to hear and what they care about.

So it's a way for me to get better.

So I think writing, or even any kind of content creation, is about predicting the user's reactions.

Right.

The next token.

Just kidding.

Yeah.

So like you do a podcast, it's like, okay, what kind of things that the users could find engaging, right?

And I find this like a little bit like, and a lot of companies like you have like launch a product, you have a narrative coming out.

So okay, what kind?

How do we position this product in a ways of like users want, right?

So I feel like I have done technical writing for a while and I have had some experience trying to predict what engineers would want to hear or care about.

But then I don't have experience with this completely different type of audience.

So that's why I want to do creative writing, writing a story.

And that's why I was doing a lot of research on the question.

I mean, when research, especially enjoy a lot, like watching a lot of dramas, I just see like what, what it will like.

So one thing that I care about is just like, I think a lot is like, well, like emotional journey was from an editor, right?

So like, when we write something that we care about, like how users would feel like across the story, like we want something in the beginning, right?

We want some things that's like, we need to have a hook so that people continue reading.

But we also don't want too much of like drama, because we'll get like too tired, right?

Because you get emotionally exhausted, because it's like you're being emotionally manipulated a lot of the time.

So you have, like, an emotional journey, maybe some climax and then something more chill. And another thing I didn't realize: for technical writing, you entirely focus on the content, and the argument is very impersonal, right?

For example, for people reading about ML compilers, it doesn't matter if they like the person telling them about compilers or not, right?

Because it's just objective, but for a novel, people care about character likability.

So in the first version of my story, I made the characters very, very logical, very rational, doing everything very rationally.

And then the feedback I got is I have a very good friend, and he was he's an amazing person.

He's a great person.

And he was like, Chip, I'll be honest with you, I hate that person.

So it doesn't matter what the story is.

It's just that the person is so unlikable.

So he didn't want to continue reading.

So in the second version, I made the person more likable.

Like what how she makes a character more likable is that you put in some vulnerability, like some of the time that it can be a person like have set back because I'm not working related to it.

See, in a lot of ways, like, it's very interesting.

It's like a lot of it is like, yeah, a lot of it is about like understand the emotional bits, like how the users feel, not just about the story, but also about the characters.

That is so interesting.

Wow, I learned a lot more there than I thought.

That was awesome.

Really good example.

Chip, two final questions.

Where can folks find you online, if they want to reach out and maybe work with you or maybe even just share the stuff that you offer if folks want to reach out.

And then how can listeners be useful to you?

I'm on, like, most social media: LinkedIn, Twitter. I post a lot, but I keep telling myself that I should do more because I kind of like the conversation with readers.

So actually, I'm also starting a Substack.

So I have, like, a placeholder for a Substack right now.

And I'm thinking of doing it more on system thinking because I think it's a very interesting skill.

And I'm also thinking of doing a YouTube channel on book reviews, basically books that help you think better.

So I think the first book I'm going to review is probably this book, because it's my favorite book growing up.

And I have been, like, rereading it.

So yeah, as for how you can be helpful: send me books that you like, books that have changed the way you think, or changed the way you do anything.

So I would appreciate it.

Amazing.

I'm excited to read that book.

Chip, thank you so much for being here.

Thank you so much, Lenny, for having me.

Bye, everyone.

Thank you so much for listening.

If you found this valuable, you can subscribe to the show on Apple Podcasts, Spotify, or your favorite podcast app.

Also, please consider giving us a rating or leaving a review as that really helps other listeners find the podcast.

You can find all past episodes or learn more about the show at lennyspodcast.com.

See you in the next episode.