The Cognitive Revolution · 2026-02-25

OpenAI's Vision for Universal Medical Intelligence with Karan Singhal

Hosts: Daniel Jeffries

Guests: Karan Singhal

medical AI, healthcare AI, AI safety, scalable oversight, multimodal AI, privacy, clinical adoption, AI benchmarks

Why it matters

OpenAI collaborates with 260+ physicians to guide AI behavior and ensure safety in medical contexts.

Key claims

OpenAI collaborates with 260+ physicians to guide AI behavior and ensure safety in medical contexts.
The HealthBench benchmark evaluates models on 49,000 criteria across 5,000+ conversations, measuring accuracy, uncertainty, escalation, and cultural appropriateness.
230 million weekly users already engage with ChatGPT for health-related queries, signaling rapid consumer adoption.
First randomized trial with Kenya's Penda Health showed statistically significant patient outcome improvements using AI co-pilots for clinicians.

Episode summary

Summary

In this episode of The Cognitive Revolution, Karan Singhal, Head of Health AI at OpenAI, discusses the rapid advancements and deployment of AI in healthcare. Singhal highlights OpenAI's approach to building trustworthy medical AI systems through collaboration with over 250 physicians, extensive evaluation benchmarks like HealthBench, and a focus on safety and uncertainty calibration. The conversation covers OpenAI's plans to make ChatGPT Health widely accessible for free, integrating multimodal data sources such as electronic medical records and wearables, and the ongoing efforts to raise both the floor and ceiling of human health outcomes globally.

Singhal also addresses challenges in clinical adoption, emphasizing the importance of real-world studies, partnerships with health systems, and thoughtful rollout strategies to gain trust among medical professionals. The episode explores the intersection of AI safety and healthcare, scalable oversight, and the potential for AI to revolutionize biomedical research and personalized treatment planning. OpenAI's commitment to privacy, data segregation, and not using user health data for model training is underscored as a key part of their strategy to build user trust and maximize impact.

ChatGPT Health product enables users to connect electronic medical records and wearable data with strong privacy protections; data is not used for training foundation models.
OpenAI plans to offer ChatGPT Health globally for free with no ads, aiming for universal access to medical intelligence.
AI models are approaching attending physician-level performance in many clinical scenarios, though subtle human clinical judgment remains valuable.
Research and product development focus on scalable oversight, uncertainty calibration, multimodal data integration, and improving AI personas for healthcare use.

Source material

Transcript

Hello, and welcome back to The Cognitive Revolution.

Today's episode is brought to you in part by Granola.

To help new users experience the power of the Granola platform, Granola is featuring AI recipes from AI thought leaders, including several past guests of this show.

There's a Replit recipe that converts discussion notes to an application build plan, a Ben Tossell recipe that creates content production plans, and a Dan Shipper recipe that looks across multiple sessions to identify cultural trends at your company.

My own recipe is a blind spot finder.

It looks back at recent conversations, and attempts to identify things that I might be missing.

This has already proven useful in the context of contingency planning for my son's cancer treatment, and as I use it more and more, it's getting better and better at suggesting AI topic areas that I've neglected and really ought to explore.

See the link in our show notes to try my blind spot finder recipe, and experience how Granola makes your meeting notes awesome.

Now, today, my guest is Karan Singhal, who leads health AI at OpenAI, and who was just named to the TIME100 Health list for his pioneering work.

This episode began to come together last year on Thanksgiving, when I emailed Karan, whom I'd met a couple of times at AI events, to say thank you for all of his work on AI for health, and to let him know what a difference ChatGPT had made for me and my family in the context of my son's cancer diagnosis.

As it turned out, that was just as OpenAI was preparing to make a major product push with ChatGPT Health, which allows users to connect ChatGPT to data sources including electronic medical record systems and consumer wearables, plus a physician-facing ChatGPT for Healthcare, both launching in early 2026.

In this episode, we dig in to how Karan and team have achieved attending-physician-level performance with their latest models, their plan to ensure that this capability does benefit all of humanity, and their vision to raise not just the floor, but also the ceiling of human health, with continued research and even better models to come.

Highlights of this conversation include how OpenAI works with more than 250 human doctors to ensure accurate, robust, and culturally appropriate responses.

How they built HealthBench, which contains some 49,000 evaluation criteria to measure models' performance, and how models have already gone from a 0% score on HealthBench Hard by GPT-4o, when the benchmark was first created, to 40% today.

Plus, an overview of my experience using large language models to navigate a health emergency, including the critical importance of giving models as much context as possible on your situation, and how that's about to get dramatically easier as ChatGPT Health rolls out globally.

We also discuss how 230 million people are already using ChatGPT for health questions on a weekly basis.

The first randomized trial of an AI co-pilot for physicians, which OpenAI conducted with Kenya's Penda Health, and which did show a statistically significant improvement in outcomes for patients whose doctors used AI.

And why Karan believes, based on the reception that OpenAI is getting from health systems, that 2026 will be the year that using AI becomes a standard part of medical practice.

From there, we go on to cover the steps that OpenAI is taking to ensure privacy and security of users' health information.

How they're using worst-of-n measures to make sure models first do no harm, while at the same time striving to maximize value by training AIs to acknowledge their uncertainty as they offer their best guesses.

How Karan understands the relationship between AI for health, AI safety plans such as scalable oversight, and AI alignment more broadly.

Karan's report that the reasoning of OpenAI's models has not drifted toward neuralese as much as some reports had previously caused me to believe.

The future of medical multimodality, which will do a much better job of converting data to value, and which inspired me to buy a WHOOP wristband to start collecting data on myself.

And the compounding effect of parallel advances in AI for science, the growing potential for n-of-1 treatment plans and medical Move 37s, and the possible need for an update to the rules governing access to experimental medicines and information sharing.

Finally, Karan describes OpenAI's utopian plan to make ChatGPT Health available to all users globally for free with no ads, an early form of universal basic intelligence that I really think everyone ought to celebrate as a triumph of human ingenuity and goodwill.

Zooming out, in the grand scheme of AI development, I think it is fair to say that we have far more questions than answers, and in my mind all outcomes, from a post-scarcity utopia to literal human extinction, absolutely remain on the table.

I signed a recent call for a ban on superintelligence because I do worry that an AI arms race driven by recursive self-improvement loops could easily get out of control.

And yet, at the same time, capabilities like this, which have been so valuable for me and my family, and which will undoubtedly save millions of lives in the coming years, are for me both an incredibly inspiring accomplishment and a practically irrefutable argument for the upside of AI.

The question at this point is not whether we will create powerful AI systems, but exactly what form they will take and under what circumstances and incentives they'll be developed and deployed.

Karan's work demonstrates that, for the moment at least, we can have it all: AI systems meticulously crafted to minimize downside risk, which are both capable and efficient enough to meaningfully improve the human condition globally.

There is a ton of work left to be done, both inside and outside of the frontier companies, to make sure that these lofty standards don't slip in the face of intensifying competition.

But today, if you or a loved one are facing a complex health challenge, you owe it to yourself to take full advantage of the incredible medical expertise that Karan and others have managed to build into systems like ChatGPT Health.

With that, I hope you enjoy this inspiring look at the frontier of medical AI with OpenAI's Head of Health.

Karan Singhal, Head of Health AI at OpenAI, welcome to The Cognitive Revolution.

Thanks for having me.

I'm super excited about this.

It's rare that a guest's work has had as much impact on my life as your work in health, at OpenAI and at Google before that, has had on the last three months for me.

So, regular listeners know the story.

My son got cancer.

I've been an intensive user of all the frontier language models over the last three months to advise us as we've gone through this process.

And boy, have they been a game changer!

Thank you for all your hard work, and for making the mental health side of the equation over the last three months dramatically better for me than it otherwise would have been.

And also for moving the needle a bit on my son's treatment and our confidence that we were actually doing the right thing for him.

It has been invaluable and the consumer surplus has been off the charts.

Amazing.

Thanks for the kind intro.

And thank you for sharing the story.

I think it resonated with a lot of people.

So thank you for that.

A big takeaway from this conversation, I think, will be this: if you find yourself in a medical emergency, or even just want to do a better job of managing your health in general, the frontier models today are getting really good at that.

I always go to the example of the AI doctor.

There's obviously a relative scarcity of medical expertise even in a wealthy country for a privileged person like myself in the United States.

You broaden your worldview and look around the world, and the shortage is extreme.

And I've always kind of felt like, boy, this would be just an absolutely killer use case that everyone could agree on.

When I started talking about it a couple of years ago, it felt like it was getting close but still a ways off.

And I've done a bunch of episodes over time with Vivek and some of your former teammates at Google DeepMind who work on similar topics.

And they've always been very appropriately cautious over time, saying: yeah, it's not quite there yet.

We're not quite ready to roll it out to production, but we're encouraged by the progress.

That sort of thing.

But boy, again, it is really getting there.

But the first question I wanted to ask is: how did you get into this?

And what were you expecting?

How big of a dream were you daring to dream when you first got into AI for healthcare some years ago now?

Yeah, I started working on AI for healthcare about four years ago now.

And I kind of transitioned to doing it full-time around that time.

I was thinking a lot about a few fundamental research problems when I was at Google at that time.

A few problems were around foundational work in representation learning, privacy preserving learning.

There's a lot of interesting applications in healthcare.

I think the background for this was a conviction that I had since undergrad, honestly, that AGI would be a pretty big deal and it would probably happen within our lifetimes.

And I thought there were probably two things that I could do to make that better.

One is to work on safety and the other is to work on benefits.

And I saw healthcare as kind of the most obvious area for benefit, like you.

And like you said, we've been on this kind of amazing exponential over time, both with model capabilities and, I think more recently, with people's adoption and kind of the Overton window shift around trust in these models.

As we're seeing just a bunch of people start to use these models across individual patients, individual conditions, researchers and seeing a lot of the benefits become a lot more tangible.

And all of this, for us at OpenAI, comes down to thinking about what it means to make our mission real.

Our mission is to ensure AGI is beneficial for all of humanity.

And one part of the mission is to build and deploy AGI.

The second is to prevent downside risk, whether that's short-term kinds of risk or long-term kinds of frontier risk.

And finally, thinking about how we can make benefits happen.

And like you say, I think health is one of the most obvious and tangible ways those benefits can happen.

I think when I started working on health back in, I guess, the end of 2022 was kind of when it was full time for me.

So like right before ChatGPT.

For me, it was a lot of thinking around this kind of capability overhang between where LLMs were at (you're scaling these models up, you're seeing that you can instruction-tune these models, and they can do amazing things) and how they were being adopted and thought about in the clinical AI and health care worlds.

And so my ambition at that time was really two things.

One was to get the health care and clinical AI world to think about LLMs as a thing that could work.

And again, this is prior to ChatGPT.

And the second was to think about the work that we need to do in safety and reliability to make the models trustworthy for this setting.

And then over time, I think those ambitions became less and less ambitious.

And my ambitions at OpenAI became larger.

And for our work at OpenAI, we kind of started out with three goals.

The first is to make access to medical expertise more universal.

The second was thinking about the ways in which we can use this setting as a way to ground our work in safety and alignment.

And I could talk more about the safety motivation there.

And the third thing was thinking about how we can bring society along with a high-stakes technology: work with partnerships, work on products, and work with policymakers to think about the right ways to iteratively deploy in a setting like this.

I think things like this sounded really ambitious about two years ago, when we set out to work on health at OpenAI.

Now, actually, we're kind of feeling like these aren't ambitious enough.

So we're really excited about what's to come.

So we're really excited about what's to come.

I think you did a great job there of kind of laying out a taxonomy or sort of scaffold for this conversation.

Let's maybe talk about the capabilities first.

But they are sort of inseparable.

I'm wondering kind of from a capabilities perspective, is there also a like Hippocratic oath kind of mindset that you bring to the table that is sort of focused on making sure that the AI performs well in medicine that maybe is still distinct from sort of the bigger picture, safety agenda that motivates you.

How do you guys think about the way that the model should perform in terms of doing no harm?

Because I think GPT-4, while it did add value, could also definitely do some harm in terms of giving you wrong ideas.

And I do think we've come a long way since then.

But I wonder how you think about that?

100%.

The way we think about the health work that we've been doing at OpenAI, we've kind of been operating in three phases.

One is laying the foundation for the work.

And a lot of that has been really around the safety research and work on making the models not just better reasoners but also better at bedside manner, and things like making sure that they're able to convey uncertainty well and that they're able to escalate to a doctor when needed.

And so I can talk a little bit about that kind of work in a second.

We're recently now in this phase of adoption.

So as the foundations have solidified just a bunch of people have been using it.

It's been one of our fastest growing use cases.

And we shared recently that over 230 million people a week are using it for various health and wellness related queries.

And this year, we're really focused on kind of scaling the impact of the work.

And so all that I think comes from the foundation.

And so, your question of how do we think about imbuing the models with the right kind of prior of how to behave, and how to ensure that there's minimal harm.

I think we were very thoughtful about this as we were laying the foundation for this work and starting to work on health at OpenAI.

And a lot of this comes down to our really close partnership with this cohort of 260 physicians or so that we've been working with since about two years ago.

And what we've been doing with them is basically this. One way of imbuing models with a certain kind of behavior is to write a spec from first principles.

Maybe you or I just sit down and write down a spec and say, this is how the model should behave.

It should say X, Y, and Z in this situation, and this in another situation.

There are pros and cons to this. One pro is that it's very easy to explain and understand what's going on, and it's easy to be transparent about model behavior.

A con, though, is that it's very hard for you or me to say what should happen or what should not happen.

And it's difficult to say anything beyond "do no harm", because, while harm is an excellent thing to avoid, that doesn't tell you what to do in most scenarios.

And so what you want to do is figure out a way that you can move from really large scale principles that matter to what do you do in various specific scenarios and how do you make sure that's guided by the expertise of not just one or two people but actually hundreds of experts.

And so that's kind of the approach that we've taken and you can see it in our approach to evaluation.

So one way to think about how we encode behavior in models is: what are the evals that we care most about?

Evals are really the lifeblood for any researcher.

And so we put out, back in May of 2025, this work on HealthBench, which is an evaluation of how large language models perform in realistic health conversations between users and models, where users could be either lay users or health professionals.

And we went about it in a way that leaned on the expertise of these 250-plus physicians rather than writing down specs from first principles, and I can explain that more, but I'm sure you have many more questions as well.

Yeah I mean go on as long as you'd like.

I guess to double click on the physicians for a second: what's the nature of the relationship with them?

I could imagine anything from, you know, they sit there doing side-by-side, RLHF-style "prefer this to that", or you can imagine some of them may be much more deeply integrated with the research team.

Yeah, we have three different layers in how we think about physician expertise, and we've been very thoughtful about how to bring physician expertise into our team in a way that balances and combines well with the research expertise that we have on our team.

So we kind of have three layers.

The first layer is high-level advisors, who are actually more informal, who help us with strategy in various ways; we share our roadmap with them and things like this.

The second layer is folks who we work with in, you can think about it as kind of like a human data operation in the way that you're describing, but a little bit more closely. We're Slacking with them all the time, they're not just going off and doing tasks, and we ask them for advice and questions all the time.

So we basically have a Slack community where we're interacting with these folks, and some part of what they're doing is comparing model outputs and red teaming model outputs and things like this.

Some part of it is looking for ways in which we might have blind spots today and listing those so that we can prioritize them in the future, testing new products, things like this.

So as an example, for the work on ChatGPT for Healthcare, which we announced on January 8th, we had red teamers test this product over nine waves for six months, and this was in close collaboration with this physician cohort. The final layer, the top of the pyramid of how we rely on physician expertise, is really close advisors who work most closely with our team. They're actually the ones who are channeling and combining the voice of these hundreds of physicians, interfacing most closely with our research team, and translating that into evals and model training data and things like this, so that we can then improve our models.

Hey, we'll continue our interview in a moment after a word from our sponsors.

One of the best pieces of advice I can give to anyone who wants to stay on top of AI capabilities is to develop your own personal private benchmarks challenging but familiar tasks that allow you to quickly evaluate new models.

For me drafting the intro essays for this podcast has long been such a test.

I give models a PDF containing 50 intro essays that I previously wrote plus a transcript of the current episode and a simple prompt you know it.

Claude has held the number one spot on my personal leaderboard for 99% of the days over the last couple years saving me countless hours but as you've probably heard Claude is the AI for minds that don't stop at good enough.

It's the collaborator that actually understands your entire workflow and thinks with you. Whether you're debugging code at midnight or strategizing your next business move, Claude extends your thinking to tackle the problems that matter, and with Claude Code I'm now taking writing support to a whole new level.

Claude has coded up its own tools to export, store, and index the last five years of my digital history from the podcast and from sources including Gmail, Slack, and iMessage.

And the result is that I can now ask Claude to draft just about anything for me.

For the recent live show I gave it 20 names of possible guests and asked it to conduct research and write outlines of questions.

Based on those, I asked it to draft a dozen personalized email invitations, and to promote the show I asked it to draft a thread in my style featuring prominent tweets from the six guests that booked a slot.

I do rewrite Claude's drafts, not because they're bad but because it's important to me to be able to fully stand behind everything I publish.

But still, this process, which took just a couple of prompts once I had the initial setup complete, easily saved me a full day's worth of tedious information gathering work and allowed me to focus on understanding our guests' recent contributions and preparing for a meaningful time.

Truly amazing stuff.

Are you ready to tackle bigger problems?

Get started with Claude today at claude.ai slash TCR.

That's claude.ai slash TCR.

And check out Claude Pro, which includes access to all of the features mentioned in today's episode.

Once more, that's claude.ai slash TCR.

Your IT team wastes half their day on repetitive password resets, access requests, and onboarding, all pulling them away from meaningful work.

With Serval, you can cut help desk tickets by more than 50 percent.

While legacy players are bolting AI onto decades-old systems, Serval allows your IT team to describe what they need in plain English and then writes automations in seconds.

As someone who does AI consulting for a number of different companies, I've seen firsthand how painful and costly manual provisioning can be.

It often takes a week or more before I can start actual work.

If only the companies I work with were using Serval, I'd be productive from day one.

Serval powers the fastest growing companies in the world, like Perplexity, Verkada, Mercor, and Clay.

And Serval guarantees 50 percent help desk automation by week four of your free pilot.

So get your team back to the work they enjoy.

Book your free pilot at serval.com slash cognitive.

That's S-E-R-V-A-L dot com slash cognitive.

We talked about channeling the voice of the physician and calling back to the Hippocratic Oath.

One thing that I do find kind of frustrating, honestly, about my experience in the medical system recently is I think there's a little bit too much emphasis on doing no harm.

And this also connects in a pretty deep way to questions about how we should even conceptualize how to talk to our AIs about what they should do.

I mean, we have the detailed, you know, Talmudic rule set as one extreme possibility, where for every corner case we try to map out what you shouldn't do.

And hopefully it learns that and lives by it.

And then on the other, I don't know if it's the other extreme end, but there's the Anthropic constitution, the Claude constitution, which recently has at least demonstrated that you can get pretty far as well with something that is less rule-based and more about trying to teach the model to have good character and use good judgment as it goes through all the situations that it finds itself in.

I would say my critique of the human doctors, who have really served us well, is that they definitely want to do no harm, to the point where they're sometimes too reluctant to engage in hypotheticals or to act on something.

I do have one good friend who's a doctor who also happens to have had the same or a very similar kind of cancer to my son.

And notably, he behaves differently with me in private, one on one.

You know, he's like, "I don't need a randomized controlled trial to tell you that that makes sense."

But you don't get that too much when you're actually at the hospital.

There's a reluctance among physicians, certainly in our experience.

I think it's a pretty commonly shared perception that there's a reluctance to act on things that make sense because they aren't guaranteed to work out.

Obviously, biology is super messy and there's kind of diversity and there's just a lot that we don't know.

So how do you think about that kind of challenge?

And so I kind of wonder where you guys want to land in terms of, you know, only adhering to the most rigorously defensible advice versus, you know, doing a little bit more of the amount of askl thing and being like, well, it does take, you know, you've got to be willing to take some risk to help people sometimes.

So maybe we should have the models do that.

Yeah, it's a quick question.

You're pointing out problems, I think, on two sides of the health care ecosystem.

One is, as a patient, you have this challenge of needing to advocate potentially pretty hard for yourself.

And I think in your son's story, you pointed out a couple of false starts that you had where you saw a doctor.

They said it was probably normal.

You had an abnormal blood test reading a couple of times.

And then they were just kind of like, yeah, it's probably okay.

And so this is kind of the experience that patients often have when they're having something that they feel or know to be an issue: needing to advocate for themselves, and feeling like their doctor isn't hearing them.

And on the clinician-facing side, you have this challenge that medical evidence is increasing rapidly.

It's very hard to keep up with the latest of what's going on.

You're overloaded with documentation.

And there's a bunch of challenges on both sides.

And so you have this amazing thing, which is AI that is able to do a few things.

You have AI that's able to talk to patients, understand the concerns that they have, and integrate knowledge and information across both their previous history and the latest medical evidence.

One of the things that these models are obviously incredible at these days is taking in a huge amount of health context, not just on you, but also on the latest medical evidence and things like this, integrating that all together into one context, and doing something that I think is very difficult for a human to do.

And then on the physician-facing side, you have, again, that same capability to integrate information, now facing the physician.

And so that's a lot of why we're doing this work is because we see this kind of gap between where the models are at and how people are using them and we think that it's really important with our upcoming products.

And to get closer to answering the question, you mentioned, like, where do we see how should the models navigate places where there's a, you know, potentially a lack of medical consensus or potentially physicians would disagree about what to do?

This is pretty fundamental to our approach.

And this is why we don't have, you know, one or two or three experts who are determining what the model's outputs are.

We have a pretty multi-pronged approach.

A lot of this comes down to presenting information to the user, but being sure to present uncertainty when it exists.

So I mentioned the safety research motivation for a lot of the work that we're doing.

One of the directions that we've been exploring is: are these models well calibrated in their uncertainty, and can we get better at having these models verbalize their uncertainty?

So for example, if there are, like, three to five potential paths for your son's next treatment, a doctor might, just for the sake of simplicity and clear communication, communicate or focus on just one.

A model can potentially communicate three to five of them, but mention that the state of the evidence may be somewhat limited, and mention caveats.

And so one of the things that we've been investing in is, first, can models become better at understanding their own uncertainty?

And can that be a thing that we measure and improve?

And second, can they verbalize it better?

And this is a big part of finding the right balance between being more aggressive in sharing potentially early results or early evidence and being overly conservative.

So that's kind of how we think about that.

What are you seeing in terms of trying to get the models to understand their own uncertainty?

Because I remember that famous graph from the GPT-4 model card where the pre-trained model seemed to be much more calibrated with respect to its own uncertainty than the post-trained model.

And you know, we've only scaled post training since then, right?

I haven't seen like an update to that kind of research in a minute, but it seems like there was at least a fundamental challenge kind of opening up there with respect to that kind of, you know, introspective self-awareness of, you know, how confident am I in this answer?

So is that why I haven't seen that sort of graph in a while?

Well, I think there's a measurement challenge.

So the plot that you're referring to, and the kind of plot that existed back then, was kind of like: given the next token for, for example, a multiple choice question, did the model's probability of that next token correspond to how likely it was to actually be correct in choosing A?
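
A minimal sketch of the kind of calibration check described here, in Python: it bins a model's stated probabilities for its chosen answers and compares each bin's average confidence with its actual accuracy (a quantity often called expected calibration error). The function name, the ten-bin default, and the input format are illustrative assumptions, not details taken from the GPT-4 plot or from OpenAI's evaluations.

```python
from typing import List


def expected_calibration_error(
    confidences: List[float], correct: List[bool], n_bins: int = 10
) -> float:
    """Weighted average gap between mean confidence and accuracy per bin."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if any(c < 0.0 or c > 1.0 for c in confidences):
        raise ValueError("confidences must lie in [0, 1]")
    total = len(confidences)
    if total == 0:
        return 0.0

    # Group each (confidence, correctness) pair into an equal-width bin.
    bins: List[List[tuple]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        # Each bin contributes in proportion to how many answers fall in it.
        ece += (len(members) / total) * abs(mean_conf - accuracy)
    return ece
```

A value near zero means the stated probabilities track how often the answers are right; a large value means the model is over- or under-confident.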

And what you're seeing now is, I think, that it's become more difficult to measure that, for two reasons.

One is that we have higher expectations of our models than answering multiple choice questions.

And it's hard to say when they're correct or not correct.

And the second, this is a little bit more technical.

The models now emit reasoning tokens, or thinking tokens, between initially being given something and their final answer.

So the result is that you can't ask the model what the probability is of that next token being A in exactly the same way that you would before.

And so there's a couple of things that you can do to handle this.

One is you can go in the direction of richer ways of measuring whether a model is correct or is doing the thing that you want, rather than measuring the log probability of a certain letter.

The second thing you can do is repeatedly sample from a model and then see whether performance stays the same or degrades as you repeatedly sample from the model.

So actually HealthBench is an example of doing both of these things at the same time.

So in HealthBench, we did this work around basically measuring not just one or two or three different aspects of model performance in health, but actually, across these 250-plus physicians and across 5,000 conversations, measuring about 49,000 different axes on which model performance could differ.

And part of this is whether or not the model expresses uncertainty the right way, whether this is the right fact, whether it escalates to a physician when needed, or things like this.

Again, 49,000 different things that are measured in HealthBench.

And so one of the things that you can do there is then measure correctness in a way that is less about multiple choice accuracy and more about are the right facts included that are really important to emphasize to the user and things like this.
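
One way to turn that kind of rubric into a number can be sketched in Python as follows. The point values, the `Criterion` fields, and the normalize-and-clip rule are assumptions made for illustration; the conversation does not spell out the exact scoring formula.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Criterion:
    description: str  # e.g. "advises seeking emergency care"
    points: int       # positive for desired behavior, negative for harmful behavior
    met: bool         # whether a grader judged the response to satisfy the criterion


def rubric_score(criteria: List[Criterion]) -> float:
    """Points earned on met criteria divided by the maximum positive points, clipped to [0, 1]."""
    max_points = sum(c.points for c in criteria if c.points > 0)
    if max_points == 0:
        raise ValueError("rubric needs at least one positively weighted criterion")
    earned = sum(c.points for c in criteria if c.met)
    return min(1.0, max(0.0, earned / max_points))
```

Under this scheme, a response that includes the important facts but also triggers a negatively weighted criterion loses points rather than simply being marked right or wrong.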

The second thing you get out of that is this metric which we call worst-at-n, which is: when you repeatedly sample from the model and try to measure performance on HealthBench, what is the worst performance you get on n samples?

So sample from the model 20 times: what is the worst performance you get?

So now you have a way of measuring, instead of the logprob-based way, even with a reasoning model, whether or not it produces a consistent result, conditional on the different kinds of thinking that it's doing.
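
The worst-at-n idea can be sketched in a few lines of Python. This is a simplified illustration, assuming each example already has a list of per-sample scores and that the first n samples are used; the way HealthBench actually aggregates over samples may differ.

```python
from statistics import mean
from typing import Sequence


def worst_at_n(scores_per_example: Sequence[Sequence[float]], n: int) -> float:
    """Average, over examples, of the lowest score among the first n samples."""
    if n < 1:
        raise ValueError("n must be at least 1")
    worst_scores = []
    for scores in scores_per_example:
        if len(scores) < n:
            raise ValueError("every example needs at least n sampled scores")
        worst_scores.append(min(scores[:n]))
    return mean(worst_scores)
```

With n = 1 this reduces to the ordinary average score; as n grows, the number can only stay flat or fall, and how fast it falls is a measure of how inconsistent the model is across samples.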

And so what I would say is basically that it's harder to produce plots like we could before, because now the thing that we're measuring is so much more complicated.

But when we do produce the plots, as we did in HealthBench, the plots are also looking the same, and models have improved quite a bit at that.

So how should I, as a user, think about the worst-at-n thing? Maybe you could just describe the result: how much worse is the worst of n than, say, the next worst, or the average? And then if I'm a user, which I am, can I get security just by running it twice?

Is there a practical upshot of that work that could give me confidence that, if I do X, I can be sure I'm not getting something that's way worse than what the model is typically going to output?

Yeah, the way I think about it is, the more compute you spend on things, the better results you will get, though the gains may be marginal over time.

So one thing you could do as a user of a model is sample from the model 10 times and have an LLM synthesize the outputs of, you know, an LLM council, combining them into one output, and then produce that as an answer.

I think that will be marginally better than the answer that you get from just running one model, and this is not so dissimilar from what GPT-5 Pro, and stuff like this, does under the hood anyway.
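
The sample-then-synthesize pattern described here can be sketched in Python as below. The `sample` and `synthesize` callables are hypothetical stand-ins for whatever model calls a user has available; this is not a description of how any OpenAI product works under the hood.

```python
from typing import Callable, List


def council_answer(
    prompt: str,
    sample: Callable[[str], str],
    synthesize: Callable[[str, List[str]], str],
    n: int = 10,
) -> str:
    """Draw n independent drafts for a prompt, then merge them into one answer."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Each call to sample() is assumed to return a fresh, independently sampled draft.
    drafts = [sample(prompt) for _ in range(n)]
    # synthesize() sees the original prompt and all drafts, and returns one answer.
    return synthesize(prompt, drafts)
```

The cost grows roughly linearly with n, which is the trade-off between extra compute and marginal gains that the answer above points to.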

And so, I think you can do a few things if you're a user, and you want to kind of make the best of this.

One is, you can use GPT-5 Pro, or something similar, to do things like this.

A second thing you can do is, you can actually increase the amount of reasoning, because I think it has a very similar effect to just running it multiple times.

I think in both cases, my current sense is, we're getting to the point where current models, as they are now, are performing incredibly well for most people most of the time.

But I know, for example, that you using GPT-5 Pro was an important part of working through your son's situation.

I think we're reaching a point where, except for the most complicated of cases, most people are best served by just using the model on a default reasoning setting.

I would recommend using the reasoning models instead of the more instant models, using GPT-5.2 Thinking rather than GPT-5.2 Instant, for a lot of health-related things.

But I think most people can get the best of both worlds between latency and performance by doing that.

I think the way to think about the worst-at-n result, really broadly, is just that if you sample from something 20 or 50 times, you'll have varying performance across model outputs.

What we saw with worst-at-n was also that recent models have improved pretty significantly, where the worst performance of, for example, o3 at that time was way, way better than the best performance of GPT-4o, and we've continued to see that over time.

So we've been shipping model improvements in health pretty rapidly over time, and the model improvements in the last year have been more than in previous years since ChatGPT's launch.

And so, as an example, today the nano models, the GPT-5 nano models that you can get through the API, or open source models, are actually performing similarly to o3, which was our best and greatest model not so long ago.

And the latest reasoning models continue to kind of push the frontier of how much you can do with less and less reasoning.

So if you try using them by default on health queries, and this is true for 5.3 Codex as well as for 5.2 Thinking, they'll actually think a little bit less, but produce better results.

And we'll continue to try to push that frontier of not just needing to pour more compute into it to get a good result, but also getting better performance at a given level of compute.

And the result is that the models are way, way better than they were even a year ago.

Yeah, it's been crazy.

In my just three months of intensive use, nothing has brought home the pace of shipping quite as much as how many updates there have been just in this one chapter of my life. It has been wild to see.

How should we think about the density of effectiveness of the reasoning tokens?

I mean, I did an episode with the folks at Apollo.

I'm sure you know of their work, if not know them personally.

And one of the really interesting things that they observed when they got access to the chain of thought, at least for, I think it was o3 at the time, although I'm not 100% sure which model it was off the top of my head, was the seeming development of a new dialect, basically internal to the model, you know, the famous sort of "watchers, watchers," whatever.

Can you share anything about how you're balancing the obvious good of efficiency and, you know, denser thinking per token, more value per token created, with the seeming tendency to have the internal chain of thought go off in weird and potentially hard-to-parse directions?

I guess there's also the commitment from OpenAI to not train on the chain of thought, or at least not, you know, apply certain kinds of pressure to it.

I don't know if that's an absolute ban on any feedback on the chain of thought, but I'd love to hear your thoughts on that, because the world moves past these big stories so quickly these days.

That one seems to have kind of come and gone, and I'm not really sure what the state of it is now.

Yeah.

It's a super interesting question.

Also near and dear to my heart as a safety researcher.

So chain-of-thought interpretability has been one of the nice advances in safety in the last year or two, which has been really cool to see.

So as models have become reasoning or thinking models, they've also emitted tokens that effectively explain their work and what they're thinking.

And this provides a form of interpretability for institutions who want to understand whether models are doing the thing that they expect as safety researchers do.

And so it's been this really cool way of measuring whether models are doing things like scheming, or otherwise producing outputs that are not desirable in various ways.

And this has been relevant beyond health to a bunch of other domains as well.

I think a lot of the results that people have shared and studied have actually been in coding.

So the danger that you're pointing out is: as you put more pressure in reinforcement learning to get the models to produce a good output, will they slip away from the prior of having their thinking tokens be simple English that is easy for researchers to understand?

I think what we've seen is actually pleasantly surprising, which is that, at least until now, we haven't seen a lot of large-scale evidence of the slip into what's called neuralese, using chain-of-thought tokens in a way that is not explainable and understandable in English.

And in general, as we've been trying to understand the monitorability of our models over time, we haven't really seen that effect as we've scaled up RL.

I'm not sure if that'll continue to be the case in the future.

But so far, I've been fairly pleasantly surprised that this side effect of the reasoning paradigm has continued to be useful and hasn't robustly been seen to become unreliable.

I think the result that you're pointing out has been sort of continuous over time, with these weird blips in the interpretability of the models at times, but we haven't seen a continuous increase in that, and we haven't seen clear evidence, even though we've actually tried to study it, that scaling RL causes that to happen more.

We haven't seen that yet.

I would expect that in the limit it does.

Okay, that's quite interesting.

So the way I interpreted that: first, I thought it was happening more than it sounds like you're saying it is happening.

And I naively assumed that there was some sort of like brevity reward signal that was being applied in addition to, you know, ultimate correctness.

And certainly, you know, it's intuitively comprehensible why something like that would start to happen if you did that.

Should I infer from what you're saying now that there isn't really a brevity signal, and that this is more of just an emergent weird phenomenon that doesn't happen that often?

And so it's kind of one of those weird language model things that we keep an eye on, but don't obsess about too much because it is rare.

Is that a fair summary of the state of play?

I do think it's important to pay attention to.

I think the right way to think about it is like there's nothing reinforcing that this should happen during training, right?

Like, there's no reason that it should happen that models produce chains of thought that are human-interpretable during training.

The reason it happens is because they have a prior of using the English language and when you give them this space with thinking tokens to produce a more correct and helpful answer to the user, they're actually using it in English because that is just the easiest thing for them to do.

I think this is basically an empirical phenomenon that is extremely useful for safety research, and I would love to keep it that way as much as possible, and I'd love to see more research on how we can maintain that.

And with OpenAI's commitment to avoid optimization pressure as much as possible, and also the commitments of the labs, I think it's really exciting progress.

I do think it's important to watch out for so I think it's a great question.

Yeah, to be continued.

So you mentioned the incredible complexity of the evaluations that you're running. By simple math, right?

If there's 5,000 conversations and almost 50,000 criteria of evaluation, I imagine some of those criteria are reused across conversations.

At a minimum, we've got something like 10 evaluation criteria per conversation and probably a lot more criteria per conversation than that.
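
The back-of-the-envelope arithmetic here, assuming every criterion is attached to exactly one conversation and using the 49,000 and 5,000 figures quoted earlier, works out to:

```latex
\[
\frac{49{,}000 \text{ criteria}}{5{,}000 \text{ conversations}} \approx 9.8 \text{ criteria per conversation}
\]
```

If criteria are shared across conversations, the count per conversation would be higher than this average.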

It's hard to summarize how good they are.

Nevertheless, I'm going to ask you: how good are they?

And how should we think about that?

There's also this HealthBench Hard thing that maybe could be a way of saying, they're good at most things, but here's some things that they still struggle on, kind of defining the frontier, as one way to say how good they are. But how do you communicate to the world how good the latest models are when it comes to health?

Yeah, I think it's good to keep in context the arc of work around LLM evaluation in health.

A couple years ago, people were mainly focused on evaluating LLMs by thinking about their performance on multiple choice questions, like medical exams and things like this.

And then with the work that collaborators and I did at Google, we started increasingly investing in what it looks like for specialized or unspecialized LLMs to answer general health questions, with some of the Med-PaLM work, and potentially go in the direction of asking follow-up questions to a user, with the AMIE work.

And so, increasingly over time, we've moved towards higher and higher fidelity evaluations, and I think HealthBench is kind of the latest big step in that direction: wide coverage of LLM performance and safety, but covering it in a way that spans many different axes of performance that actually matter for the real world.

And so, again, you see these 5,000 conversations, these 49,000 different axes of performance.

There's actually three different versions of HealthBench.

So one version of HealthBench is that full data set.

The second version is HealthBench Consensus and the third version is HealthBench Hard.

And these kind of mirror what we view as the three high-level principles when designing an eval for health.

The first is that you want it to be meaningful, which means if the number goes up, hopefully, human health will improve.

The second is that you want it to be trustworthy, which means that it's backed by the consensus of doctors, for example, or other experts.

The third is that you want it to be challenging, so you don't want it to be at 100%.

So one thing that's happened over the years is that all the previous benchmarks that meant anything at all have gone to like 95 to 100% over time, and HealthBench actually remains unsaturated to this day.

And so what you have with HealthBench, HealthBench Consensus, and HealthBench Hard is actually a little bit of focusing on these individual axes.

So HealthBench overall is a number where we think, if it goes up for a model and people are using that model, human health will improve, and we feel like that is a statement that we can defend with the rigor behind the work.

The second is HealthBench Consensus, where we specifically focus on a bunch of criteria for evaluation where, for each example, a majority of multiple physicians agreed that it was applicable.

So not only did you have physicians write these criteria, you also had a bunch of physicians check whether these criteria hold for a given conversation and whether they're the right thing to be evaluating about a given model and a given conversation.

And the final thing is HealthBench Hard, where we actually, somewhat adversarially against a bunch of different models across all the different model providers, chose the examples that existing models fare the worst at but that still seem high quality, and then we turned that into a benchmark.

So HealthBench Hard has kind of been my favorite external benchmark for whether an open model is doing really well. When it came out, GPT-4o was literally zero on this benchmark, so it's an incredibly hard benchmark, because this is just how we chose the examples. Over time we've improved to a performance of around 40%, which is still not near saturation on this benchmark, and I think the benchmark has a lot to go for OpenAI's models. I think competitor models are more in the kind of 20 range for current models.

So that's the way I think about the HealthBench family of evals: we have this commitment to work on evals that are meaningful, trustworthy, and challenging.

And if you want to focus more on the evals that are super trustworthy, then we have this subset, Consensus, which is really focused on that.

As part of the HealthBench work we also did a couple of additional analyses. HealthBench involves grading these individual 49,000 different rubric items using a model-based grader.

We had physicians compare the model-based grader to the grading of other physicians, and we actually found that the model-based grader was doing a better job than the average physician.

And so what that tells you is the grading for HealthBench is pretty high quality compared to what you'd expect from a physician.

Signs of recursive self-improvement alert.

How about just a little bit of an intuitive sense for what's in HealthBench Hard?

I mean, I could give you just kind of my experiential sense of the frontier from the bedside over the last couple of months.

And I would summarize it pretty simply.

This is what I say to, like, my neighbors and stuff when I'm saying, hey, by the way, you should really use a language model next time you're facing a health challenge.

I basically say, look, I was in the hospital for initially like 30 days of really intense treatment where everything felt super high stakes.

We didn't always know what was going on.

We also didn't know how much we could even trust our doctors at that point.

Everything was so new and stressful at the same time.

And what I found was basically that the frontier models were step-for-step with the attending oncologist on almost everything.

And they have basically been.

And by the way, that means they're, like, a lot better than the residents.

They're just much more knowledgeable.

They're at the attending level for sure.

There were maybe a half a dozen times over the course of that month where there was some disagreement. Initially I was just using GPT-5 Pro, whatever exact version of GPT-5 it was, and then as these other frontier ones came out, I started to do everything in duplicate with Gemini 3 and then in triplicate with the latest Claude.

Interestingly, I would say that the AIs disagree with each other even less than the models disagree with the attending, but it's quite limited disagreement between the models and the attending.

And typically when there is disagreement, I found it to be a very minor thing.

Okay, his electrolytes have gone a little low.

Should we give him electrolytes today or not?

And of those like half a dozen things, there's not really a major trend.

I would probably score it like six to four for the doctors with the benefit of hindsight, in terms of: we usually followed what the doctors said, and, you know, did we in the end feel like they were right, or did we kind of wish we had gone with the AIs?

I'd say maybe two out of three times we have felt like they were probably right.

And if I tried to chalk it up to something that they have that gave them an advantage, it almost always was one of those situations where, whatever I'm putting into ChatGPT, whatever data (of course, now we've got integrations for this, but I was always kind of exporting the latest results from the EMR and so on and dropping in the PDFs that they gave me), they had something more.

The difference would usually come down to this: in view of all that, they would additionally say, taking all that into account, but also just looking at him right now, kind of watching how he's breathing, looking at his color, I'm pretty sure he's fine.

It was that kind of very intuitive, very multimodal, very subtle sense that obviously these folks have developed over quite a few years of clinical practice that, on these very fine margins, seemed to give them a slight edge over the models.

So that's kind of my account, but I guess what that means maybe is that my situation isn't that hard, and I think that is actually true in the sense that the cancer that my son has is not a super rare one and the treatment protocol for it is quite well established, so it's not a super hard call in terms of the main line of what to do.

You know, maybe even though it's a hard thing for him to go through maybe it's not that hard in terms of like clinical judgment, I wouldn't say there's, you know, I haven't experienced anything that I would say the models were only 40% on.

So with that, maybe you can tell us a little bit of like, well, what's out there still on the frontier where we do have ground truth or at least some sort of consensus that we feel like is solid enough to grade models on where they're still only at 40%.

Yeah, it's a great question.

Let me describe a little bit more about what HealthBench is evaluating and the ways in which models have improved over time.

And then talk a little bit about the next frontier for model improvement.

So HealthBench measures many, many different things.

It has, you know, these themes, which are focuses of different evaluation examples, and of the few examples of focuses that we had, one was: are models appropriately escalating to care when needed, versus not escalating to care unnecessarily?

You want to balance this because you don't want to be, for example, overwhelming the health system with a bunch of patients who are worried by alarmist medical advice, versus not escalating to care when needed.

Another kind of aspect that we measured was thinking about the ways in which models can adjust to the different demographics or different epidemiological conditions or different levels of access to care of people globally.

So this is, you know, both making sure that you adjust to a user that's male or female, but also, if somebody asks a question in a region where tuberculosis is more common versus less common, making sure you adjust for that, and things like this.

And so that specifically, we call that global health.

That was the biggest single focus of HealthBench, because that's, I think, one of the biggest ways that AI can be most impactful.

And then a bunch of other things we talked a little bit about calibration.

One of the ways that the models have gotten significantly better over time is, when they know they're uncertain, not only flagging that uncertainty, but actually browsing to get more information.

For example, getting the latest resources and being able to synthesize that information together.

And another thing is asking follow-up questions, and the right kinds of follow-up questions, over time.

Initially, ChatGPT would almost never ask follow-up questions in health settings; now the models do much more often.

And they're much more likely to prioritize the right follow-up questions for you.

You know, you have this story around when you were using the model for your son, where you actually learned over time to ask the model to interview you, and figured out that you can do the physical exam yourself over time, and things like this.

The models have gotten better at knowing that would be useful and flagging that as well.

So it's just everything from pure reasoning and solving benchmarks and medical calculations and figuring out the diagnosis, all the way to how the model behaves: its bedside manner, is it comforting, and things like this.

And again, a balance of both difficulty and high-trustworthiness signal that we're getting in working with our experts.

And over time, as the models have improved from, you know, o1 to o3, GPT-4.1, GPT-5, and so on, all these models have improved significantly in health.

And today, every major stage of model training for every model that we ship at OpenAI actually now benefits from our work in health.

And that's going to continue for future models as well.

I think the frontiers are in a few different areas now.

I think one is, a lot of text-based performance that's outside of sub-specialty areas is actually pretty good.

So if your goal is to keep up with the latest evidence, even as a physician, I think the models are doing an incredibly good job today, and people are finding a lot of value in it, and I think that's one of the greatest data points towards that point.

I think now models have continued to also improve in their ability to integrate information.

You mentioned this thing of taking a bunch of different information as a physician, like looking at the pallor of your son's skin over time, and integrating that all into one context.

I think a big challenge for models today, and this is less of a model issue and more of a how-do-you-surface-them issue, is actually getting the right context, right?

And one of the challenges that you faced in your story is pulling the right information from the health system so that you could upload it, ask the right questions yourself, and advocate for yourself.

We're looking to make that better with the release of our ChatGPT Health offering, which I can talk a little bit more about later.

But in short, this is basically an experience within ChatGPT that enables you to connect your health information from your medical records, or any wearables, or Apple Health, and provides additional purpose-built privacy protections for that.

So we can talk about that a little bit later.

But on the model-related side, I think there's a few challenges.

One is getting that context again.

I think the models are incredible when they have the right context.

The second is thinking about various modalities that are not well captured by things like HealthBench, which is really focused on text.

So thinking about really good performance in multimodal, in image and voice, I think people will start to rely on the models in more and more modalities.

And I think there is still significant room to improve there.

And I hope that in the future, the best models in the world for doing various imaging modalities in health are actually the models that are most easily and readily available to people today.

Now it's occurring to me, as you say the multimodal thing, I'm like, hey, maybe I was even undershooting it.

I had the intuition that like taking a picture of my kid as he sits in his bed wouldn't necessarily help, and might even confuse, but maybe I'm wrong.

Like, should I, in the future, would you advise me to start just including cell phone camera pictures with my daily synopsis?

Well, I think you're right to point out that there is still a little bit of a challenge: for the average user, it's difficult to know exactly what data you can connect and how to connect it.

Right?

And you were highly motivated in this case.

And so there's a little bit of a gap between the most motivated, most expert user, who's willing to wade through signing into their patient portal and things like this and take screenshots or copy things manually, as you did, versus the person who wants that health data integrated and wants to be able to understand their health and advocate for their health or that of a loved one, but has challenges in doing so.

And so I think that's the thing that we do hope to work on on the product side with ChatGPT Health, really lowering the activation energy to do that kind of thing.

I think in general, the models will benefit from having more context.

I'm not sure if specifically they would have helped in that case with additional information, but I would generally advocate for people to, if they find it useful, to try putting in various kinds of context that they can, and I think people have been surprised.

I think you were surprised in your own story, how useful it can be if the models have more context.

You know, an interesting tidbit here is there's been studies showing since like two or three years ago that for these clinical case challenges, which are effectively these crossword puzzles for doctors, where you get all the patient context up front, potentially multimodal context as well.

The models did an incredible job at figuring out the next steps for diagnosis or treatment, and these are extremely challenging puzzles, really difficult for doctors, and people even found that they did, even improved on the performance of these experts.

And that's been true for some amount of time.

And I think a lot of the challenge in the ensuing time has been, what does it look like to get the models to have a back-and-forth conversation to elicit the right kinds of context from a user?

And then with ChatGPT Health we're taking that, again, a step further and thinking about the right ways that the product interface can make it actually a lot easier for you to do that.

And in a way that's secure, and we're not training on your data, and people can trust that as well.

How about for these larger modalities, you know, when you get like a PET scan or whatever?

My understanding, though I haven't really been able to get my hands on too much of that data, is that these are like gigabytes, and I'd have to get it on a disk, and I don't even have a computer that takes a disk, so I'd have to also get a disk drive to read that.

Obviously like that data can be compressed.

How do you think about the pipeline of like raw scan type data and feeding that into tokens?

Are we talking like certain specialized perceiver modules that kind of ingest those and do the reduction, or some other, mysterious third thing?

Yeah, this is a super great question.

One of the interesting things about biomedicine from a modeling perspective is that there's kind of a long tail of modalities that are interesting and relevant and often they're fairly difficult to put into models today.

And so I think there's going to be two broad approaches and this is talking more about kind of external research.

It's kind of two broad approaches that I've seen in external research.

One is thinking about the right ways for models to call specialized tools or run code over various kinds of data.

So if you think about how does a human view a gigapixel scan?

Let's say a pathology image.

How does a human view it?

They basically view slices of that image, or even like a 3D or 4D scan.

They view slices, like your son's MRI.

They view slices of that image, right?

No human can see it as a 3D or 4D modality.

And so what they're effectively doing is using a tool to understand which slices of the image might be most important and then taking a look at those slices and manipulating them.

That's actually a thing that models can do with tools and Python and things like this.

So that's number one.

There's a sub-bullet of that, which is that you can also use very specialized tools that are fit for purpose, like specific Python libraries or even additional models that specifically encode these modalities.
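To make the slice-viewing idea concrete, here is a minimal Python sketch of a tool a model might call over a volumetric scan. It is an illustration only, not anything described in the episode: the function names `rank_slices` and `make_toy_volume`, the toy data, and the choice of pixel variance as a stand-in for "which slices might be most important" are all assumptions.

```python
from statistics import pvariance


def rank_slices(volume, top_k=3):
    """Return indices of the top_k 2D slices with the highest pixel variance.

    `volume` is a list of 2D slices (each a list of rows of floats).
    Variance is only a crude, assumed proxy for "interesting" content.
    """
    scores = []
    for index, slice_2d in enumerate(volume):
        pixels = [value for row in slice_2d for value in row]
        scores.append((pvariance(pixels), index))
    scores.sort(reverse=True)
    return [index for _, index in scores[:top_k]]


def make_toy_volume(depth=8, size=4):
    """Build a hypothetical flat volume with bright spots in slices 3 and 4."""
    volume = [[[0.0] * size for _ in range(size)] for _ in range(depth)]
    volume[3][1][1] = 1.0
    volume[4][1][1] = 1.0
    volume[4][2][2] = 1.0
    return volume


toy_volume = make_toy_volume()
interesting = rank_slices(toy_volume, top_k=2)
print(interesting)  # indices of the two slices worth looking at first
```

In a real pipeline the scoring step would more plausibly be a purpose-built library or a specialized model, as mentioned in the conversation; the point here is only the pattern of ranking slices and then inspecting a few of them.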

The second approach is you actually have these models encode the modalities themselves.

You have a way of basically putting them into token space in some way.

It's kind of what's been done for images and video and audio for various models.
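As a rough illustration of what "putting them into token space" can mean for an image-like modality, the sketch below splits a 2D array into patches and linearly projects each patch to a small vector. This is a generic patch-embedding pattern and an assumption on my part, not a description of how any particular model encodes medical data; `patchify`, `embed`, the 8x8 toy image, and the random weights are all hypothetical.

```python
import random


def patchify(image, patch_size):
    """Split a 2D image (list of rows) into flattened, non-overlapping patches."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height - patch_size + 1, patch_size):
        for left in range(0, width - patch_size + 1, patch_size):
            patch = [
                image[top + r][left + c]
                for r in range(patch_size)
                for c in range(patch_size)
            ]
            patches.append(patch)
    return patches


def embed(patches, weights):
    """Project each flattened patch to a vector: one dot product per output dim."""
    return [
        [sum(p * w for p, w in zip(patch, weight_row)) for weight_row in weights]
        for patch in patches
    ]


rng = random.Random(0)  # fixed seed so the toy example is repeatable
image = [[float(r * 8 + c) for c in range(8)] for r in range(8)]
patches = patchify(image, patch_size=4)

embed_dim = 3
# Untrained, random projection weights: one row of 16 weights per output dim.
weights = [[rng.gauss(0.0, 0.02) for _ in range(16)] for _ in range(embed_dim)]
tokens = embed(patches, weights)
print(len(tokens), len(tokens[0]))  # number of tokens, size of each token
```

In practice the projection would be learned jointly with the rest of the model, and 3D or 4D data would need patches over more than two axes.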

I think what you'll see in biomedicine is a bit of a mix of both.

I think there are benefits to both approaches and pros and cons for both.

I think you'll see like a little bit of a hybrid approach where researchers will kind of end up doing a mix of both depending on the modality.

It's just, like, quite striking how good AIs are. Not necessarily even the general-purpose AIs, but we have these specialized AI models that can fold the protein in a superhuman way, and what seems to be missing is the latent-space joining of that modality to text. But my expectation, and you can tell me if you think I'm right or wrong or how long I have to wait, is that we're going to see that more, and that that is going to be a major driver of truly superhuman performance in a lot of domains, because we just don't have people that can, like, intuitively fold the protein.

But if you had a person who could do that and they had their general reasoning capabilities it seems like you would have a really different kind of like a qualitatively different kind of intelligence.

So I'm really interested in sort of what you see as like the roadmap, the timeline, the expectations for that kind of deep integration.

I think it's a really optimistic picture.

I mean I totally agree if you think about how do we lean into the natural capabilities of these models?

These models can take in vast amounts of information into their context, and they have the ability, at least in theory, to take in a lot of modalities of data and effectively merge that together, and I think the research for that is becoming increasingly solid for these other modalities.

I think there is significant research to do depending on the biological modality. Like I said, there's kind of a long tail of biological modalities that matter, and so depending on the modality, depending on the relevance and impact and availability of data and all these kinds of factors, I think it could take more or less time. So it's hard for me to give you a universal answer, right?

I share your optimism.

I just saw one in the last couple of days, SleepFM, from Professor James Zou and others. He's at Stanford, and the collaborators are probably at multiple institutions, but they had a really interesting finding.

The idea there is they use, I think, like six, maybe up to eight different modalities that are all measured during one night of sleep, like how you're breathing and, you know, various things that are easily measured in kind of a sleep study setting. It just takes kind of one night of sleep to gather all this data, and then they've gotten really good at predicting all kinds of different diseases based on that data.

So they're integrating all these modalities with each other into a holistic understanding and using that for this sort of narrow set of predictions, which are very high-value predictions. It's not yet going all the way to text, but it's just another data point that is top of mind for me right now that's like, man, it's all happening.

All the latent spaces will be joined, and I can't shake the idea that that's a big part of what superintelligence ends up looking like.

Going back for a second to bedside manner: you mentioned global health and context being a big part of HealthBench, and kind of what you're trying to do to make sure AI benefits all of humanity. And it was funny when you said, like, what you're doing to make sure you're doing that right.

One day you'll scale.

It's only two hundred million users today.

There's already some scale.

The thing that I want to address there is the bedside manner, and also how that relates to the sort of n-of-one context of the individual user, their preferences, their memories. Because now, of course, all these products have an integrated memory module, which I'm sure is also going to become a deeper kind of integration over time.

Right now I think it's not disclosed exactly what it looks like.

People generally understand that it's kind of a scratch pad of like key highlights that the model can kind of reference.

I believe still in like mostly text modality as it goes.

But regardless of exactly how that works today, safe to say it will continue to become a deeper and deeper integration as everybody kind of pursues various strategies for continual learning.

But even just that, like, there's also integration with tools, and it now has access to my Gmail, and I've got even just system prompts.

Like, I can just tell the model that I want it to behave a certain way or a different kind of way.

And it just strikes me that that creates, like, an impossible surface area for you to manage.

You know, we've seen weird sort of emergent things like sycophancy here and whatever else there.

How do you think about making sure? Because this is one thing I do think the human doctors generally do a good job of, kind of sizing the person up and being like, okay, how do I cut through what I need to cut through to get this person to understand what they need to understand.

And that is something I think models probably have not been as good at yet.

But really seems to matter a lot in the medical health care domain.

So is there some additional suite of testing that you do?

Or how do you think about that, you know, just insane long tail of things that 200 million health users bring to the product?

Well you pointed out two problems and I'll talk about them in reverse order.

One of them is that people are bringing in an incredible array of experiences, of different settings and ways that they use ChatGPT, and their memories are different, and all these different things mean that the experience of one person using ChatGPT can be different from the experience of another person using ChatGPT.

And for reasons that are not a different model or something like this.

We do a few things as a company to kind of understand and improve model behavior.

And one of them is kind of evals that we run on models before launching. HealthBench is a great example of that, and we have additional evals that we run and things like this.

Another thing that we do is monitor.

So we monitor in-production traffic in ways that are privacy-preserving and run classifiers over it to understand where we might see safety risk, whether it's anything from health to kind of frontier risk and things like this.

And so we're actually able to measure these things. There's actually a great example of this kind of work in the blog post that we had about sensitive mental health conversations, where we were able to look at both the evaluations that we are doing in mental health and the patterns that we are able to see in production traffic, again, logged in a privacy-preserving way with models running over them.

We are able to see kind of a correspondence between exactly how people are using the models and what that looks like, and what our evals are measuring.

And they were actually very well correlated.

And so that's kind of how we think about closing the gap between evals that you run on models and what people are doing in the real world.
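One simple way to check whether offline evals track what classifiers see in production is to correlate the two across model versions. The sketch below is a minimal illustration of that arithmetic under stated assumptions: the `pearson` helper is written out by hand, and the per-model numbers are made up for the example, not figures from OpenAI or from the blog post mentioned.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical values, one entry per model version (illustrative only).
eval_failure_rate = [0.20, 0.15, 0.10, 0.05]          # offline eval
production_flag_rate = [0.021, 0.017, 0.010, 0.006]   # classifier over traffic

r = pearson(eval_failure_rate, production_flag_rate)
print(round(r, 3))
```

A high correlation on real data would suggest the offline eval is a useful proxy for production behavior; with only a handful of model versions, though, such a number should be read cautiously.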

I would actually add that there's an intermediate step as well, which is doing real-world studies.

And here I'll plug our work with Penda Health.

This was a study that we did, I think the first real-world study of an LLM-based copilot for clinicians, where we had some clinicians in this group of clinics in Kenya use AI as a copilot or a safety net.

And as they were typing in their electronic medical record it would flag things if things were interesting or alarming or potentially incorrect.

And other clinicians did not have that.

And for the patients who were treated by clinicians with the AI versus without the AI, there was a statistically significant improvement in diagnosis and treatment outcomes.
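For readers unfamiliar with what "statistically significant" means for a two-group comparison like this, here is a minimal sketch of a two-proportion z-test. The counts are invented purely to show the arithmetic and are not results from the study; the function name and the choice of this particular test are also assumptions, since the episode does not say how the analysis was done.

```python
from math import erfc, sqrt


def two_proportion_z_test(events_a, n_a, events_b, n_b):
    """Two-sided z-test for a difference between two proportions."""
    p_a = events_a / n_a
    p_b = events_b / n_b
    pooled = (events_a + events_b) / (n_a + n_b)
    standard_error = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / standard_error
    # Two-sided p-value under the standard normal: 2 * (1 - Phi(|z|)).
    p_value = erfc(abs(z) / sqrt(2))
    return z, p_value


# Made-up counts: visits with an error, with and without the AI copilot.
z, p_value = two_proportion_z_test(events_a=120, n_a=1000, events_b=160, n_b=1000)
print(round(z, 2), round(p_value, 4))
```

With these invented numbers the p-value falls below the conventional 0.05 threshold, which is the sense in which a difference would be called statistically significant.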

And this is again another example of how you move from offline evaluation that's potentially not super realistic, like medical multiple-choice exams, to increasingly realistic evaluation that you can run offline, like HealthBench, and then increasingly into the real world.

And so you can view the HealthBench study as kind of like a forward-looking study.

And you can view our analyses of production traffic as kind of more retrospective-looking, and again exactly capturing a lot of the differences and variance that you're pointing out.

Yeah.

That's really interesting.

Do you want to talk a little bit about the privacy-preserving nature?

And I guess I would preface by saying I think people, at least as individuals, if they think about it at all, typically way over-index on this.

I've, you know, had the occasion to think, like, on my son's behalf.

Should I be putting this information out there or you know, whatever?

And then I see, you know, what somebody like Sid from GitLab has done with his cancer, like literally open your brain.

You know, all of his own biology, down to the DNA level, and just an incredible amount of very individualized data that he's put out there.

And I'm like, I think that's the way because I can't really come up with too many, I mean, you can get really sci-fi about it.

But unless you're getting really sci-fi about it, it's hard to come up with a way that anybody would really use that against you.

And there's a lot of ways that it might stand to benefit you.

And you know, even just in telling my son's story, I don't know, maybe something could still happen, but what has happened is that people have reached out to me with interesting opportunities for connection, and I've been able to tap into expertise, including the team that Sid has built to help him, and also they're starting all kinds of individualized therapy companies with fascinating developments there.

So I think my advice to people would be, seek the benefit and don't worry too much about whether or not your data is sitting in some logs somewhere. That doesn't seem to be a huge concern.

But anyway, that's my role to say that.

It's your role to build a product that, you know, deals with people as they come and people do seem to really worry about that.

So what would you want people to know about the privacy preserving nature of the infrastructure that you guys have built?

Yeah.

I mean, privacy is incredibly important to a lot of people.

And I think it's a very personal thing, people's preferences for what data they share and how they share it.

I think if you zoom out for a second, think about what the level of ownership of your own data is as a patient today.

And maybe what it was in the past, even before things like the 21st Century Cures Act and things like this.

Most patients, let's say 10 years ago, most patients had no real way to access their own health data.

So, regardless, not only did they not have a way to control who and what they shared it with, it was just out there and shared anyway in various ways.

And a patient didn't have a chance to even look at their own data or have an implementable path to actually look at their own data.

Now we're in a slightly better place, I would say, but still not in an ideal place, where most patients still feel like it's incredibly difficult to get access to their data or their loved ones' and take control of it and how it's used.

And people like you and many other people with incredible stories have found incredible ownership and advocacy that is enabled only by having access to the data and being able to use it in the ways that they see fit.

So we want to respect that.

We want people to be able to access their data, and this is again part of what we hope to enable in ChatGPT Health, as just a lower-activation-energy way for you to connect your health data and make that useful.

At the same time, it is super important to a lot of people that this is done in a way where it's clear that there's not some kind of competing incentive here.

So one of the things that we're super clear about with ChatGPT Health is that none of the data here that you connect is actually used to train our foundation models.

And the reason we do that is because, first, we think our foundation models are great already.

And this isn't actually the most important way we can improve them.

And the second is because we think this will further lower the activation energy for people to do incredible things with our models.

And we think like if people think that there's some tension between privacy and utility that fewer people will go for utility but we actually don't want that to be the case.

We want people to see the value and to use it.

And so specifically, in the case of ChatGPT, it has a bunch of really cool privacy protections built in, including encryption and things like this.

With Health, there are actually additional, kind of purpose-built layers of encryption just for health data.

And so the result is that you have a couple of things.

You have encryption of the data.

You also have isolation of the data from other data that you have in ChatGPT.

So for example, if you have other apps or memories from ChatGPT or things like this, those are actually kept separate from Health, and stuff that you do in Health can benefit from that.

But your other conversations in ChatGPT don't have your health context or health information and things like this.

So you can keep that completely separate and have that within the separate experience.
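The one-way boundary described here (Health can draw on general context, but general conversations cannot see health context) can be pictured with a small toy model. The sketch below is purely illustrative and an assumption on my part: `ScopedMemory` and its methods are hypothetical names, and nothing here reflects how ChatGPT actually stores or encrypts data.

```python
class ScopedMemory:
    """Toy store with a one-way boundary between 'general' and 'health' data."""

    SPACES = ("general", "health")

    def __init__(self):
        self._items = {space: [] for space in self.SPACES}

    def _check(self, space):
        if space not in self._items:
            raise ValueError(f"unknown space: {space}")

    def write(self, space, item):
        """Store an item in exactly one space."""
        self._check(space)
        self._items[space].append(item)

    def visible_to(self, space):
        """Return what a conversation in `space` is allowed to read."""
        self._check(space)
        if space == "health":
            # Health conversations may use general context as well.
            return self._items["general"] + self._items["health"]
        # General conversations never see health items.
        return list(self._items["general"])


memory = ScopedMemory()
memory.write("general", "prefers short answers")
memory.write("health", "latest lab results uploaded")

health_view = memory.visible_to("health")
general_view = memory.visible_to("general")
print(health_view)
print(general_view)
```

A production system would enforce this kind of boundary at the storage and encryption layer rather than in application code; the sketch only shows the access rule itself.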

And as we continue to get feedback, as people continue to use it while we roll it out on the wait list, we'll continue to improve these protections because we think it's really important for users.

And we think it's actually not as important as it seems for improving the models.

Yeah, interesting.

And is that segregation of data a new feature with Health?

So, like, all that we've done to date has been in kind of one product experience, but that fork has just now been introduced?

Yeah, I've been motivated to get a Whoop to wear on my wrist to start to collect data.

I've been fortunate throughout my life to have generally good health and haven't created much of a trail in terms of medical history.

But now, and I've kind of always looked at these, you know, quantified self things and been like, it's sort of interesting, but you know, who's got time for that?

And am I really going to like do all the processing on this?

And the answer's kind of always been no.

And so I haven't worn anything to date, but now I'm thinking, alright, it's time.

I can actually get this thing.

I can get the value from wearing this thing because now I'll have a product that runs in the background that's, you know, it's going to be a great point that will, you know, really bring to light what I need to know.

And so I'm, like, motivated to, you know, change my behavior, at least on that level.

And I honestly think it's also probably going to end up encouraging me to exercise more.

In fact, that's kind of my own North Star metric right now for like all things AI.

And that's not just health products either.

But like, I will know that I've been spending less time in this chair and moving my body more.

And when I feel like I can, you know, untether myself from the desk, I'll feel like I'm, you know, I'm really winning.

And I'm getting kind of close to getting there.

I know of one friend who uses voice mode like nonstop and says he does a ton of it from the gym while he's working out.

And I'm like, I want to be like you.

I'm not quite there yet, but I do think it's really interesting to think about both the, you know, the ability to crunch all this data.

And also the freedom, you know, the liberation from the desk to go out and, like, make more exercise records in the first place.

I'm really looking forward to it.

Those are like, you know, major things I think on the horizon in 2026.

And I'm super excited about it.

Yeah, I'd just add on to that.

I think people should expect that the value of data that they collect on themselves should increase over time as model intelligence increases.

So the ability for models, from a research perspective and a product perspective, to analyze that data and come up with useful insights that could be useful in the future, I'd expect that to increase over time.

And so if there was ever a time to get a watch or anything, I would say it's now.

So I do, at this point, basically everything in triplicate for my son. My, like, morning routine is, when the lab results come back, export them out of the EMR and drop the PDF into all three of the frontier models. Grok has not cracked that top tier. Maybe it should, but it didn't in my initial evaluation, and I haven't gone back.

So it's Claude, Gemini, and ChatGPT.

And the sort of tasting notes on like, how the models interact with me, is in short, Gemini is by far, this is quite surprising for a Google product, but it's like definitely clear.

It is the most inclined to push me to advocate for something.

It's also the most confident when it says, I don't have something to worry about.

And Gemini, I would say it's like very accurate.

Sometimes I'm a little uncomfortable with how opinionated it is, because I'm like, what if you're wrong?

That's a surprising sort of behavioral profile, I would say, for a Google product, especially because everybody, you know, historically has said, oh, they're going to be so conservative, you know, they can't build AI products because they can't live with this sort of risk or whatever. Gemini 3 does not reflect that analysis.

I would say ChatGPT is kind of on the other end of the spectrum. And again, no system prompt; it does have, like, my memories and stuff, but no intentional attempt by me to shape how it's responding to me.

It tends to give me the longest answers, the most information, and it's kind of the most, like, clinical and neutral in tone.

It's sort of a report style, you know, there's like nine issues, and we kind of go through and address each one, and then kind of summarize them all again at the bottom. And then Claude is kind of somewhere in the middle, where it's generally much briefer than ChatGPT.

It's more like a Gemini-length response, but more kind of measured, in a more ChatGPT-like way, certainly compared to Gemini's, you know, more opinionated persona.

What do you think of that?

How do you think about sort of the balance between like, thoroughness and digestibility?

I guess that's probably the main tension I would, you know, want your opinion on there, specifically for ChatGPT.

Yeah, I think it's hard to know what's right and what's wrong.

One of the things that we've aimed to improve, and also measure with HealthBench, but also improve in our recent models, and this is especially true of our most recent models, is having the knowledge of when the user is likely a health professional versus a lay user and tailoring the response accordingly. Part of this is making sure that you're applying a level of detail and technical jargon that makes sense for the level of expertise of the user.

And so this is actually a specific thing that we train and evaluate on. In our open-source HealthBench eval, there's actually a part of it that is focused on this.

And so this is something that we've seen improve, especially recently with 5.2 Thinking.

And so definitely an area that we've been investing in.

But overall, I think it's very hard to say what is ideal and what is not ideal, and personally, I think it's amazing that other folks, other competitors, can be in the space and actually push the Overton window along with us, because I think it's hard for any one company to do.

And a large percentage of the work that we're trying to do is to think about where the models are at today, and where is the Overton window of user trust for the models?

And can we move the Overton window along in the right ways?

And I think additional models, additional products from other competitors, from other companies, from other players in the health ecosystem, actually go a long way to helping shift that as well.

And I think it would be a hard battle for us to win if it was just us.

So I actually think personally it's amazing that there's other folks here.

Okay, so you, it is a striking thing to say that we don't need your data.

Basically we've got enough.

I am interested in whatever you can tell us about like, where that data is coming from.

This seems like an area where synthetic data, maybe I'm naive in saying this, but it seems like you'd really want some real ground truth from actual human medical trajectories. You know, I would feel a lot more comfortable synthesizing chats than trying to dial that in via a synthetic process. I wouldn't feel like you can kind of fully synthesize data all the way to superhuman doctor performance.

So I'm interested in kind of where data is coming from.

And I have this notion that you may think again is like kind of beside the point, given the tricks that you guys have up your sleeves, but I've been imagining a sort of possible new social contract, not just between like patients and AI providers, but even kind of patients and like the medical regulatory system, where I'm like, you know, I don't think my son is going to end up in this position and I certainly hope not.

There's a lot of people who are in a spot where they're like, I would kind of try anything if it had a chance to save my life and a lot of times they can't even get access to it.

And now we're in a spot where the AIs increasingly are going to know about it.

You might not know about it as the first one.

The AI is going to know about it.

They're going to tell you about it.

They're going to make a pretty damn compelling argument in a lot of cases that this was probably the best thing out there, and I have been down this path in kind of a contingency planning frame of mind.

So everything's gone well for my son so far, but what if there were a relapse, you know, what would we do?

And I have been really impressed by what the models have been able to give me there as well.

So I imagine this situation where there's just a mounting pressure from patients that are like, look, AI knows about it, you know, and it's telling me this is the right thing to do, and you're telling me I can't have access.

That seems like it is going to be a very hard gate for the establishment to keep for all that much longer.

Then the other thing I would love to see kind of on the other side of that trade is, okay, if you are going to be given access to unproven treatments, then what we as society, even again, more broadly than like the AI companies, what we as society want back are your results because we want to fold this into our general understanding.

It seems to me that we are getting to a point where, and this probably is going to be mostly driven by AI, we could envision not necessarily doing away with clinical trials, but moving beyond clinical trials, doing a lot of n-of-one things that are just like, this is the best guess that we have for you based on all available information. And then if we can capture the result of that and train on it, as humans, as AIs, you know, as society collectively, it seems like there is so much room to learn so much more and to move so much faster, all while delivering outcomes to some people who are trying these things that they couldn't otherwise get their hands on. And I really am feeling motivated to advocate for that.

I guess there's maybe two data questions of like, how do you not need more like run of the mill data, and what sort of, you know, is that the kind of data that you still need, and what would you think about sort of an evolution of the social contract, I call that AI and a right to try?

Yeah, that's a super interesting idea.

We talked a little bit about the patient-facing problem of data fragmentation, right? The experience of this as a patient is that your healthcare data is just kind of in the ether somewhere. You don't know where it's going and how it's going. It's governed by HIPAA for the most part, and the result of that is that there are pathways for that data to go places that don't involve your consent.

And so the experience today for a patient is that you both have limited access to data and limited control over where that data goes.

If you think about it actually from the provider perspective, this is also a problem, right?

So if I'm a healthcare provider, I want to understand the whole picture of a patient's health and integrate across modalities, as doctors are so expert at doing, to improve that experience.

Right now doctors have a lot of trouble actually, like if you go to a new health system, pulling the relevant records from other health systems, from your entire context. You can imagine a future where ChatGPT actually has a lot of context on you, and the doctor is able to pull that data in as well.

I think that could be a really cool future.

And then the third point that you're talking about is research, right?

So right now a lot of clinical trials, the majority of clinical trials that fail prior to completion, are failing due to recruitment challenges, right?

And so we basically have a failure to recruit the right patients who meet some eligibility criteria.

And a lot of this comes down to the data not being in one place on who these patients are and whether they are eligible.

And so that is also a problem that is facing researchers, and it's hampering our ability to kind of raise the ceiling of human health as well.
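To illustrate why having patient data in one place matters for recruitment, here is a minimal sketch of screening a consolidated record set against a trial's eligibility criteria. Everything in it is hypothetical: the `Patient` and `Trial` fields, the example criteria, and the consent flag are assumptions for illustration, and real eligibility criteria are far richer than an age range and a single diagnosis.

```python
from dataclasses import dataclass


@dataclass
class Patient:
    patient_id: str
    age: int
    diagnoses: frozenset
    consented_to_research: bool  # assumed opt-in flag


@dataclass
class Trial:
    name: str
    min_age: int
    max_age: int
    required_diagnosis: str


def eligible_patients(trial, patients):
    """Return IDs of consenting patients who meet the trial's simple criteria."""
    return [
        p.patient_id
        for p in patients
        if p.consented_to_research
        and trial.min_age <= p.age <= trial.max_age
        and trial.required_diagnosis in p.diagnoses
    ]


# Hypothetical records and trial, for illustration only.
patients = [
    Patient("p1", 34, frozenset({"asthma"}), True),
    Patient("p2", 52, frozenset({"type 2 diabetes"}), True),
    Patient("p3", 47, frozenset({"type 2 diabetes"}), False),
    Patient("p4", 71, frozenset({"type 2 diabetes"}), True),
]
trial = Trial("example-trial", min_age=40, max_age=65,
              required_diagnosis="type 2 diabetes")

matches = eligible_patients(trial, patients)
print(matches)
```

The filter itself is trivial once the records sit in one structure; the hard part described in the conversation is that, today, they usually do not.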

So you're pointing out a potential future where not only do patients have more access to data, providers are able to access the data of patients in a way that's unhobbled, as long as the patient consents, but also potentially researchers are able to access the patient's data, again if the patient consents.

I think it would be really cool to imagine a future where patients are able to kind of opt in for an additional consent and are able to experience the benefits of, what did you call it, AI-enabled...

AI and the right to try.

I think that would be potentially a really cool future.

I do think clinical trials and the standard of evidence there are battle-tested, and there is something important about that as well.

But I'm optimistic about a future where, because data is a little bit more central and a little bit less fragmented, we just have a better ability to advance the science.

And because patients have access to their own data, they're actually more able to advocate for themselves because of AI.

They're actually more able to figure out what clinical trials they may be eligible for or what experimental treatments may be interesting for them.

So that if they do kind of have that active and informed consent, then they can do that, our understanding of science can improve, our models can improve, things like this.

I think that would be a really optimistic future.

And I think the thinking, by the way, is not that the models have run out of data.

I think data is always a helpful thing for models.

But it's more thinking about what is the right way to roll this out to society, and where can we find the most impactful path?

Is the most impactful path to lean into privacy and lowering the activation energy for users, or is it to lean into getting data?

And I think here the decision that we made is the right decision for our users, to lean into not using the data for training our foundation models.

And I think that's actually what's going to have the most long-term impact if we think about the arc of improving human health with AI: moving the Overton window along in a way that's trustworthy, that shows that we respect the value of people's privacy and data.

And then in the long run, I think things like additional consent and different changes to the contract of how research is done are really cool things to explore.

Yeah.

As you were thinking about this, as you mentioned, like additional consent.

I did get a packet of 40 pages of paper at the hospital one day.

And they clearly really do want to collect this.

There's like a person whose job it is there to go around and visit the parents of these kids at the Children's Hospital and ask and explain and answer questions.

And I was like probably the easiest customer that she'd had in terms of being predisposed to sign.

So I signed and I initialed in a ton of places and whatever, to share whatever data we could share with the general public good.

And still, I just got this in the mail from the University of Minnesota that is like, can we have your data?

And I'm just like, man, I thought I agreed to this already.

And what is the yield on those things that they're sending out?

I mean, it's got to be quite poor.

I haven't sent mine back just because I've been busy and I like fully intend to.

So it's, man, those barriers are really tough.

Would you think about doing something like, I mean, obviously, you know, much has been made recently about ads coming to ChatGPT, and I think the argument that, hey, we need some way to support this for billions of people and ads is a proven way to do that is a pretty compelling argument.

I don't know if it extends to health. You know, obviously there's a lot of pharmaceutical advertising in the world today.

Would there be such a thing as pharmaceutical advertising in the world of ChatGPT Health at any point?

And would you consider like a, you know, you can get ChatGPT Pro for free if you'll give us your data?

You know, that seems like something that a lot of people probably would opt into in a way that I would think that they would still feel good about.

So I don't know, any other sort of trade, you know, because again, we do want this to go to billions, right?

So are there any other trades you could see OpenAI making to support that reach?

Yeah, so ads aren't coming into ChatGPT Health, and we don't plan for that right now.

Again, we think it's really important to kind of create a clear separation between our health impact work and things that could be seen as contributing to other incentives for the company.

And so that's kind of the line that we've taken there.

I think you're right that access is a really important point for this, and this is why we've made ChatGPT Health free.

This is not the default path: providing a reasoning model for free without rate limits to all users.

But that is the path that we're charting with ChatGPT Health, and that's actually not a thing that's available otherwise in the product.

And so we are deliberately optimizing as much as we can for access and doing as much as we can.

I think there's going to be some limits to how we can do it, but I don't see any trade-offs in the immediate future.

Cool, yeah, that's admirable.

I mean, I think my sense of OpenAI on this dimension is like very appreciative.

You know, the pains that have been taken to support a free user base of hundreds of millions of people at, obviously, not insignificant cost when probably a lot more revenue could have been extracted from those people is, I think, a pretty admirable thing.

So to see that, and I hadn't even heard that there was this plan to go to free without rate limits for health for all.

I mean, that's, I've had this idea for a while of universal basic intelligence, which is basically, you know, this sort of, again, I'm obsessed with ideas of new social contracts, but this is like one version of it that is, you know, a pretty, huge needle mover, right?

It's going to be an unbelievable needle mover for a huge, huge number of people.

That's awesome.

Do you want to talk a little bit more about the medical establishment's response?

At the hospital, my experience is still mostly lack of awareness is kind of what I see among the providers as I've built up a rapport with certain people.

I tell them, you know, I'm consulting with AI on this, and they're kind of usually okay with that.

They're not like hostile to it.

They're sometimes skeptical.

I have had a couple of interactions where they're like, tell me what it said.

And I'll tell them, and they'll be like, oh, okay, pretty good.

That kind of doubt can turn around pretty quickly when you get the right answer.

There is, you know, mention certainly of OpenEvidence.

There is broadly though, I think, just kind of a lack of awareness of how good the systems have become.

So my guess would be that that sort of awareness is probably the biggest concern that you have.

I think in a lot of professions, we should expect to see closing of guild ranks and raising of, you know, barriers to entry.

And, you know, the cynic in me thinks maybe that'll happen in medicine, the idealist is like, maybe not.

You know, maybe this is a chance to live up to the actual mission of the profession and do the right thing.

And maybe it's also the case that people are just so overworked in medicine that they would be happy to take whatever help they can get.

But what is the, you know, how would you characterize the broad reaction from doctors writ large?

I think broadly, you're right to point out that the Overton window shift for consumers has actually been faster than the Overton window shift for doctors.

And so already in the last year, the rate of adoption of ChatGPT for health, even before we launched the recent products, was incredible, and it's been one of the most amazing things to see.

Again, over 200 million people a week are using ChatGPT for health and wellness questions.

I think on the doctor's side, you do see a rapid increase in adoption, but it's not quite as fast.

And so I would expect that to come with a couple of things.

One is kind of like interesting interactions between patients and doctors who are maybe at different time scales and different points in their kind of journey of adopting AI.

It sounds like you've had a couple of them.

And I think it's interesting because, when I talk to physicians, a lot of them first hear about AI through patients who are using AI, rather than through AI that is actually built specifically for them, which is one of the things that we're doing with ChatGPT for Healthcare, which I can talk about.

This is all super interesting. Over my time here at OpenAI and also at Google, I've done a bunch of work on real-world studies and things like this, and we've learned a lot about the best way to shift the Overton window.

I think the thing that I found most effective for shifting people's opinions about AI and health, especially as the models have become more capable and safe, has been just putting the technology in people's hands.

So that's actually what we've been doing on the ChatGPT for Healthcare side.

So ChatGPT for Healthcare was announced the day after ChatGPT Health, and it's actually more of an industry-facing announcement.

And so what this is, is basically a version of ChatGPT that is actually purpose-built for the workflows of health professionals and specifically clinicians.

It includes HIPAA compliance and additional features like specific evidence retrieval for medical guidelines and things like this.

And additional workflows that are specific to enterprises, and writing workflows that doctors do, and things like this.

And so we launched that actually with eight of the leading institutions across the country.

And one of the most important things for building trust and credibility in the kind of the medical establishment is actually working with these leading partners.

And so working with these partners, we've been receiving amazing feedback.

We've gotten actually a ton of inbounds since that announcement, like more than our team can actually handle.

And so my hope is that with this announcement, and also the work preceding it on the research side with our study with Penda Health, we hopefully see a little bit of a wave of adoption, not just among individual clinicians, which is what we're starting to see, but also at the level of health systems and potentially even governments.

And so I think we're just at the beginning of that wave.

I think that's part of what I meant by we are seeing adoption, but not scaled impact yet.

And I think scaled impact will start to happen more this year, especially on the medical-establishment-facing side.

Yeah, there's a question, and I'm sure you've thought about this, that's sort of analogous to the privacy question on the patient side, which is: what is the form factor or mode of rollout or use that is going to be best received by doctors?

Because if I'm a patient, I might say, hey, what I want is for the hospital or even some bigger organization than that, to just like architect some workflows that just grind through everything, like I want nothing missed.

I want every record of mine examined.

And I don't really care if that offends a doctor at some point, like I just want the best results, right?

But I could also then imagine that if you were to do something like that, you might ruffle some feathers, create some immune response, so to speak from the profession that you would rather avoid.

So is there a way, analogously to the privacy thing, where you might take a little bit off the fastball in terms of how much immediate value you could create, in order to make the rollout more acceptable to decision makers, so that they, you know, hopefully work with you as opposed to, in some cases, maybe being more inclined to fight you over time?

I think the trade-offs are actually less than one would expect.

You mentioned this possibility of possible protectionism in the industry.

And we've actually seen much, much less of that than one would expect.

And I think the reason is a lot of these doctors actually use it for themselves, or for people that they take care of.

And when you do that, and you see the value of it, and you see it getting incredible things correct, and maybe pointing out things that you hadn't thought of and things like this, then that becomes like the easiest kind of conversation.

And when we talk to health system executives, you can tell instantly who's used it and who hasn't used it for these cases, and the conversation just becomes incredibly easy when people have used it.

What we see today, I think is not a lack of top-down interest in adopting.

We actually see a huge wave of top-down interest in adopting.

I think the reality, though, is that in health care, workflows are fairly entrenched in a lot of ways, and it takes some time to make changes.

And this is again the kind of thing that we found in our work with Penda Health, where we had to do more than just roll out a cool new tool.

We did this work on active change management, where we actually brought people along, showed them how to use the technology, had sessions where we hosted a bunch of them together, and had them learn together about how to use the technology more in the future.

And that was a really important part of rolling this out.

And that kind of change management, I think, will be really important here as well.

So I do think that takes a bit of time, but my expectation is that this will be one of the faster rollouts of software in healthcare history.

I think that's already been happening with people's usage of AI, and I think that's just at the beginning.

So I wouldn't expect, I don't think it'll take years and years for people to adopt AI more and more in clinical workflows.

One of our goals for this year is for AI-assisted care to become more a part of the norm of care.

And I think by end of year, that'll, I think, be the case, we hope.

But I don't think it'll happen over weeks, which is something that us AI innovators are used to, I think.

Let's talk about the connection between health and AI safety more broadly.

That I think is super interesting.

I remember the classic question from Ilya, once upon a time, how do we teach AI to love humanity?

I understand that that notion is not entirely gone from OpenAI, and is still part of what you're thinking about.

So I'd love to hear more about that.

Yeah, absolutely.

I think it starts with kind of the foundations of the work that we've been doing on health at OpenAI and how we started it.

So I have a bit of a background as a researcher who cares about safety, and I have worked on safety research in the past.

And part of the motivation for coming to OpenAI to work on health, for me, was thinking about this setting as a place that can provide a kind of concrete grounding for technical work on safety and alignment.

And I had a feeling, and this was about two years ago, I had a feeling back then that a lot of the most ambitious work, the most medium and long-term work on safety and alignment, was going on in kind of toy settings, or going on with math problems or things like this.

And it felt like if only there was a setting where the problems that people were working on were well motivated, one that provided more concrete feedback loops to researchers and more short-term incentive for the research, then the research could happen better.

And so that was kind of part of the thesis for our approach to this work at OpenAI.

And we've done a few kinds of research, and I talked about one of them, which is kind of work on calibration.

Another problem that we've really thought a lot about is thinking about scalable oversight.

So how do we supervise AI systems that are potentially more capable than us in certain ways?

And this is a problem that we've actually had for some time in our work with physicians, right?

So in many ways AI doesn't match your expectations of a doctor in the ability to integrate across a bunch of different modalities and things like this, but in many specific narrow ways models can even outperform physicians.

And in those specific narrow ways, when we evaluate models with physicians or have physician signal be part of the training signal for training models, we have to invest in research around this problem of how you supervise systems that are potentially more capable than you, which is one version of the problem of scalable oversight.

And so this is a problem that goes beyond health and is actually, I think, a really important problem for how we think about AI alignment, one of the more important problems there.

But our work in health has given it a bunch of concrete grounding, because we do have models that in some ways are more capable.

So a lot of our focus has been thinking about the right way to approach that problem and make that better.

And again, that has kind of been some of the work that we've done on evals, and some of the ways that we've pushed the high-compute RL paradigm in this setting, so that we can learn from the aggregate of the opinions of experts and things like this.

And that's been an important part of our motivation.

But broadly, I think a lot of where this has been heading for us has been thinking about how do we kind of extract out the right personas or characters from the model?

Because if the models become more and more capable over time, they have more and more inputs, let's say they have more and more context that they're getting about a patient's medical record because the users are more proactively uploading them or the activation energy is lowered via product and they have more ability to take output actions.

Whether that's telling you things or outputting, you know, a note that you can give to your doctor, or other things that they can do, probably the most important thing in the very long term is: are the models the kind of models that would do the right thing for the patient, or for the user, for the clinician, or for the researcher?

So that's the kind of thing that we've been investing in a bunch: how do we think about extracting the right kinds of personas for models?

How do we do so in a way that's surgical?

That gets the best aspects of the personas and less of the parts that we don't like.

So for example, avoiding a bias towards being overly conservative in challenging situations involving unclear medical consensus, but also navigating the uncertainty super well.

And so that work, which is kind of closer to persona or character or soul in its form, we're actually grounding a lot in our work in health, which has been a really exciting advance.

And I hope we'll have more to share on that in the future.

Can you talk a little bit more about scalable oversight?

I mean, this was, I think at one point kind of the plan, right?

The Superalignment team, I think, was sort of premised on this idea, and maybe other ideas too, but certainly my understanding was that scalable oversight was going to be a big part of it.

I think one of the things that stood out was like the strong student sometimes was just like ignoring or sort of overriding the instruction of the weak teacher.

And it was like, we're the weak teacher in this situation, right?

So how are we supposed to feel about the already emerging trend, as of probably at least two years ago now, that at times the strong student just decides that it should ignore the teacher or exercise its own judgment, despite the fact that it may contradict the weak teacher, which again in the analogy is the humans?

Do we, has there been, have there been like paradigmatic advances in that that give you confidence that this is going to really work?

Or is it kind of the usual understanding I have today from OpenAI around safety, which is like, well, it's going to be sort of a defense-in-depth strategy: everything will work a little bit, we'll kind of gradually chip away at the problem, and hopefully with enough layers of defense it'll be okay?

Do you have like more ambitious ideas about what scalable oversight can achieve still in your mind than that?

Well, I think my thinking on how scalable oversight will proceed has kind of changed a little bit over time.

I think of the problem as having two parts.

One I call rater scaling, which is: how do you think about the right way to elicit the opinions and values from people or experts or whatever?

And that's an example of something that we've been investing in in health, and so then you can imagine schemes where you have AI be part of that loop and help improve the ability of humans to critique, for example, AI outputs.

That's one area of work that the company has been investing in. For the second area of work, I think the framing has become more expansive over time, and sometimes we don't use the term scalable oversight to refer to it anymore, and I think the same is true at the other labs.

Given some idea of what the values are, whether they're elicited using rater scaling or not, how do you spend a lot of compute on training the models to have those values?

And so we call that, I call that internally, value oversight.

So there's rater scaling and there's this value oversight.

And so the second problem I think is also a problem where even though we don't use the word scalable oversight, I think things have advanced quite a bit over time.

So one example of this which again doesn't use the word scalable oversight is the work that people are doing on specs and constitutions and measuring and improving models to adhere to these.

This is one way of saying we have values that we care about and we want the model to be very consistent in following them.

And I think if you check various system cards or things like this, you actually see that the models have gotten much better at doing this.

And I think people have been finding increasing generalization in training models to have certain personas or characters or things like this, in terms of whether they comply with certain safety requirements or specs or constitutions or things like this.

So what I'd say is basically I think the research has been actually advancing but in a way that actually looks different from what it did before.

It looks a little bit more like humdrum post-training than people imagined it would look like.

And I think this is a good thing.

I think this is a little bit more grounded in how systems will look and how we will train them and things like this.

I don't think we've solved all the research problems, but I think sometimes we're working on the research problems without referring to them in the same way as we used to.

And would you say the core of the reason to think that this will work is that a somewhat less capable model with a super big inference budget can be expected to catch, you know, a deceptive sort of failure mode in the smarter model, even if the smarter model is more capable per token?

It's a somewhat less capable model with just a much bigger token budget.

I mean, this idea has been out there for a while, but it really sparked when you said, like, spend lots of compute to make it work.

Like just really giving that thing a lot of time to think.

It can be enough, as long as the delta isn't so big that it can't detect anything that's going wrong in the more capable model.

So as long as you've got a small enough delta and a sufficiently aligned current model, then you can in theory continue to bootstrap your way into ever more capable models without losing the alignment.

Is that, would you say that's kind of a, the core of the idea?

I think that's part of it.

I think another part of this is that the task of discrimination or critique seems to be easier than the task of generating good outputs.

And so when we study the performance of monitors, a given model can monitor itself pretty well, especially given privileged information like the chain of thought, which is a little bit surprising on its face. But if you think about the fact that discrimination has been easier than generation, and this is kind of underlying a lot of work in things like RLAIF and Constitutional AI and things like this, then it's not so surprising.

I think that's another part of it.

But I don't think we have a full understanding of how this will scale, especially when we're thinking about the regime of having trusted, but less capable models aligning a more capable model.

So far I've mainly been talking about this: if we have the values in some format, whether from humans or models, then how do we instill those into models in a way that is trustworthy?

I think that's part of the problem, but I don't think the whole problem is solved for sure.

Yeah.

How do you think we're doing on safety as a whole?

I mean, I have been an Eliezer reader since 2007, and I actually read a lot more of this stuff way back in the day than I have more recently.

I would say, broadly speaking, it's like, hey, this is, in some sense, gone amazingly well, relative to baseline expectations, right?

We do have models that undeniably have like a pretty good sense of human values, and that's just like manifestly obvious on a day-to-day basis.

That was, I think, considered to be a very unlikely outcome years ago, and if you were to teleport back to 2007 and drop a post on Overcoming Bias indicating as much, you know, it would be shocking, or it would be considered laughable that this would be accomplished in this way.

And yet, you know, my mental model of this sort of seesaw that it seems like we're on is that every generation of new models has new capabilities generally, and then it also usually seems to have some new emergent problem, whether it's deception or, now, eval awareness, which has become front and center, and we sort of are like, okay, that's a big problem.

The next generation, you know, gets more powerful, we kind of tamp down that last problem.

It doesn't go to zero, but it's at least reduced. But it seems like we're headed toward this strange world if you extrapolate both the METR trend and the kind of, whatever, two-thirds to order-of-magnitude reduction in bad behaviors that seems to happen from one generation to the next, at least once a given bad behavior is recognized and addressed.

If you extrapolate that out to like 2028 or whatever, you're like, okay, I can imagine a model now that can do a month's worth of human work or a couple months' worth of human work, but it also maybe has, for any given run, a one in a thousand or a one in 10,000 or a one in 100,000 chance of actively screwing me over in some, you know, super bizarre way.

And that just seems like a really weird thing to contemplate, you know, it's a really weird world to live in.

But, you know, if we believe in like straight lines on a lot of graphs, that seems to be kind of where we're going, right?

So, do you think that's where we are going?

And, you know, if not like maybe a different mental model of what the sort of alignment, you know, versus emergent problems, balance would be in a couple years time.
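The extrapolation described above can be made concrete with a little arithmetic. The sketch below is illustrative only: the starting failure rate of 1 in 100 and the three-generation horizon are hypothetical assumptions, not figures from the episode. Under those assumptions, a two-thirds cut per generation (dividing by 3) lands at roughly 1 in 2,700 per run, and an order-of-magnitude cut lands at 1 in 100,000, which brackets the range mentioned in the question.

```python
# Illustrative arithmetic only: the starting rate and the number of
# generations are hypothetical assumptions, not figures from the episode.

def extrapolate_failure_rate(initial_rate: float, reduction_factor: float, generations: int) -> float:
    """Per-run failure rate after `generations` model generations,
    if each generation divides the rate by `reduction_factor`."""
    return initial_rate / (reduction_factor ** generations)


def prob_at_least_one_failure(rate: float, runs: int) -> float:
    """Chance of at least one failure across `runs` independent runs."""
    return 1.0 - (1.0 - rate) ** runs


if __name__ == "__main__":
    start = 0.01  # assumed: a 1-in-100 per-run failure rate today
    gens = 3      # assumed: three model generations out
    for label, factor in [("two-thirds cut (3x)", 3.0), ("order of magnitude (10x)", 10.0)]:
        rate = extrapolate_failure_rate(start, factor, gens)
        print(f"{label}: about 1 in {round(1 / rate):,} per run")
        # Even a small per-run rate adds up over many independent runs.
        print(f"  at least one failure in 10,000 runs: {prob_at_least_one_failure(rate, 10_000):.2f}")
```

The second function is there because a low per-run rate still compounds at scale: if runs are treated as independent, a 1-in-10,000 rate over 10,000 runs gives about a 63% chance of at least one failure.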

Yeah.

I think the future world will definitely be pretty weird.

And so, it's very hard for me to predict what will happen.

I do think models will become more capable.

I do think they will be able to do things over a longer time horizon, and I think that's very important.

I do think also we've been pleasantly surprised by the extent of safety generalization or alignment generalization, which is underlying the trend that you're pointing out of a lot of various safety benchmarks or failure modes decreasing over time.

So, that's been a relatively pleasant development, because if you think about it, we've had two major kinds of scaling laws for deep learning.

One is kind of the pre-training scaling law.

What we found is that when we scaled up pre-training, we had models that were relatively good role models, that could be relatively easily tuned via SFT, or small amounts of supervised learning, to exhibit a certain persona, like the helpful, useful assistant.

And that's a lot of what that early work on instruction tuning and things like this did, and also what the early work on Med-PaLM did for the health setting.

Which is really great.

I think we saw a lot of generalization there.

I think there was a question for the reasoning setting of whether that generalization would hold, whether we'd see similar generalization.

And so far my sense is at scale we do, we do see it.

At smaller scales we saw a lot of it.

And so that's a promising development, which is that if you are able to figure out the right ways and patterns to get these models to be broadly beneficial and not harmful, then even if they're put in settings or doing things that you didn't foresee, doing a month's worth of work in days or things like this, they can continue to generalize and be safe in those settings.

And those safety curves will continue to go down.

That said, I think you're right in pointing out that there's a rapid increase in capabilities and in the surfaces in which these models will be used, and also a rapid decrease in the safety failures that we're seeing over time.

It's a little bit hard to predict how both those curves will net out.

I do think it's really important.

And this is why I care so much about the parts of our mission that are about proactively making the tangible benefits happen and also about working on mitigating safety risks that we really stay ahead of these curves and think about the right ways to shape them in the right direction.

What does a move 37 look like in health?

And are we going to see that in the current paradigm, or would it take a sort of deep integration of modalities like we were touching on earlier? Or another idea that I'm kind of enamored with is this idea of maybe a different kind of training objective: instead of getting right answers to questions, it's more about, at a fundamental level, predicting what the state of health is going to be for a patient, in a way where you could even begin to do in silico experiments, right?

Like, I've got this whole profile of this patient.

What if I change this variable?

What would their health look like in that case?

That seems like it might be a pretty different paradigm from kind of like read the whole internet and make sure you give me the right answer.

What do you think? Touch on any of those that you want.

But maybe also just take the opportunity to zoom out and give us a sense of like, what are your big visions and ambitions?

And what do you think people can kind of expect as you guys continue to do your thing over the next year plus?

Yeah, I love the move 37 framing.

Just to explain the reference.

This is a reference to the famous move 37 during a game between AlphaGo, the AI system playing Go, and Lee Sedol, during one of their now-famous games.

And the interesting thing about this move was that it was a move that everybody agreed humans would not have made, but in hindsight was brilliant and was key to winning the game.

And so I think your question is kind of like, can we imagine a world where models are able to do something, make some kind of interesting prediction, that a human probably would not have been able to make, but that was again brilliant and impactful in hindsight?

My view is that this is not too far away.

In many ways, many people report to me that they saw many doctors for their case, and only after talking to ChatGPT were they actually able to flag the thing that they then shared with their doctor, and then they were able to come to a diagnosis together and things like this.

I think that seems to happen somewhat routinely, and whether it rises to the level of a move 37 or not is, I think, a matter of taste depending on the case.

I think what I'd say is that I think you should expect world models in health to improve.

And what I mean by world models is we should expect our models to have a much better understanding over time of us, our health, our trajectories of our health and how that intersects with our biology.

And that's going to be really key to thinking about what this move 37 could look like.

Like you mentioned this idea of simulations of people and simulating things in silico.

I think full simulations are obviously very expensive and difficult, and I think they aren't a thing that models are most designed to do in their current form.

So I would say that they are probably a little bit different from the current kinds of training that people are doing.

But if it's more along the lines of given a lot of context about a user predict something interesting about them or predict the results of some intervention, I think this is something that models first are already getting better at and second could get a lot better at over time.

And I think that would be potentially quite impactful.

I think we'll start to reach a point where there'll be more and more clear demonstrations of this.

I think you're starting to see interesting and increasingly clear demonstrations of models doing interesting science and math in the public.

I think you'll see a little bit more in this space as well.

But I do think it'll take a little bit of time.

But my view on paradigms is kind of that you can get a lot out of extending and tweaking the current paradigms.

And you can in fact view everything that has happened so far as just one paradigm.

But between pre-training and scaling, reasoning, and post-training, I think you can get quite a lot of juice.

And I think even integrating additional modalities and things like this doesn't require too many additional changes or tweaks.

I think broadly our team's mission today is to do whatever it takes to ensure AGI is beneficial for human health, for all of humanity.

And so we see that happening through three channels.

The first is helping consumers understand and navigate their health, which is already happening through ChatGPT Health.

The second is empowering the health system and thinking about the ways in which AI-assisted care can become part of the standard of care, and how it can reduce the extent to which clinicians are bogged down by paperwork, so that they are spending more time seeing patients and thinking about the problems that matter most for improving care.

The final thing is really pushing up the ceiling of research, and I found your vision pretty inspiring.

What I would say is, I think you can expect a lot of the core problems in health care that we've talked about, the fragmentation of data and the fragmentation of patient experience, to improve.

As those happen, you should expect an acceleration of the application of intelligence to that data.

And I think biology and health is one of the areas in which marginal gains in intelligence have the most obvious value for humanity.

There are many examples of previous breakthroughs in biology that were like, there was nothing stopping that breakthrough from happening five or ten years earlier, except just more human ingenuity applied to it.

But there was no physical blocker.

And so, when that's the case, I think you can assume that long-running models, long-running agents connected to the right data, can do really incredible things.

And I think the hope is that, in addition to our existing work, which has really been focused on how you raise the floor of human health, we start to raise the ceiling of human health as well.

Yeah, it's going to be not just the AI doctor, but also the AI biomedical research scientist as well.

Wow, okay, it's an exciting time to be alive.

I've been increasingly saying that at the end of these conversations these days.

So yeah, anything else you want to make sure people are aware of that I didn't ask you about?

Like I mentioned earlier, I think the right way to think about our work on health at OpenAI is that it's really operating in three phases.

One is really laying the foundation.

A lot of that focused on our work on safety.

And so, examples of this are our work on HealthBench, where again, we had this kind of evaluation, which you can run offline, of large language model performance and safety in health.

We had this study of the first AI clinical co-pilot with Penda Health.

We had a bunch of model improvements over the last year.

And all these kind of laid a foundation for the work that we're doing today.

The second thing is this kind of rise in adoption, which we've been seeing.

We've been seeing an incredible rate of individual users, especially patients, adopting the technology.

And we've seen hundreds of millions of people asking health questions a week, which is rapid growth since last year.

The final thing is really the future of scaling the impact of this work.

And that's really where our work on ChatGPT Health, which is this kind of consumer-facing product that enables you to connect your health data with additional privacy protections, and also our work on ChatGPT for Healthcare, which is facing health systems and really enabling health professionals to use AI as a co-pilot in their workflows, come in.

And so, between all these things, I'm really excited about the work that we're doing to scale the impact of this work in the next year.

And I'm looking forward to what's to come.

Karan Singhal, Head of Health AI at OpenAI.

Thank you, legitimately, for all your hard work.

It really is incredibly valuable and incredibly impactful, and that's obviously just going to continue to grow exponentially, along with so many things in the AI space.

And thank you for being part of The Cognitive Revolution.

Thanks for having me.

If you're finding value in the show, we'd appreciate it if you take a moment to share with friends, post online, write a review on Apple Podcasts or Spotify, or just leave us a comment on YouTube.

Of course, we always welcome your feedback, guest and topic suggestions, and sponsorship inquiries, either via our website, cognitiverevolution.ai, or by DMing me on your favorite social network.

The Cognitive Revolution is part of the Turpentine network, a network of podcasts, which is now part of a16z, where experts talk technology, business, economics, geopolitics, culture, and more.

We're produced by AI Podcasting.

If you're looking for podcast production help for everything from the moment you stop recording to the moment your audience starts listening, check them out and see my endorsement at aipodcast.ing.

And thank you to everyone who listens for being part of The Cognitive Revolution.