This appeared last week:
John Halamka on the risks and benefits of clinical LLMs
At HIMSS24, the president of Mayo Clinic Platform offered some tough truths about the challenges of deploying genAI – touting its enormous potential while spotlighting patient-safety dangers to guard against in provider settings.
By Mike Miliard
March 13, 2024 11:13 AM
ORLANDO – At HIMSS24 on Tuesday, Dr. John Halamka, president of Mayo Clinic Platform, offered a frank discussion about the substantial potential benefits – and very real potential for harm – in both predictive and generative artificial intelligence used in clinical settings.
Healthcare AI has a credibility problem, he said – mostly because the models so often lack transparency and accountability.
"Do
you have any idea what training data was used on the algorithm, predictive or
generative, you're using now?" Halamka asked. "Is the result of that
predictive algorithm consistent and reliable? Has it been tested in a clinical
trial?"
The goal, he said, is to figure out some strategies so "the AI future we all want is as safe as we all need."
It starts with good data, of course. And that's easier discussed than achieved.
"All
algorithms are trained on data," said Halamka. "And the data that we
use must be curated, normalized. We must understand who gathered it and for
what purpose – that part is actually pretty tough."
For instance, "I don't know if any of you have actually studied the data integrity of your electronic health record systems, and your databases and your institutions, but you will actually find things like social determinants of health are poorly gathered, poorly representative," he explained. "They're sparse data, and they may not actually reflect reality. So if you use social determinants of health for any of these algorithms, you're very likely to get a highly biased result."
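Halamka showed no code, but the audit he is describing is easy to sketch. Below is a minimal illustration, assuming an EHR extract loaded as a pandas DataFrame; the SDOH column names are invented for the example, not taken from any real schema.

# Minimal sketch of a data-integrity audit for social-determinants-of-health
# (SDOH) fields, assuming an EHR extract in a pandas DataFrame.
# Column names are illustrative, not from any real schema.
import pandas as pd

SDOH_COLUMNS = ["housing_status", "food_insecurity", "transport_access", "education_level"]

def sdoh_sparsity_report(df: pd.DataFrame) -> pd.DataFrame:
    """Report, per SDOH column, how often the field is missing."""
    rows = []
    for col in SDOH_COLUMNS:
        if col not in df.columns:
            rows.append({"column": col, "present": False, "pct_missing": 100.0})
            continue
        pct_missing = df[col].isna().mean() * 100
        rows.append({"column": col, "present": True, "pct_missing": round(pct_missing, 1)})
    return pd.DataFrame(rows)

# Usage: flag any SDOH field too sparse to train on.
# report = sdoh_sparsity_report(ehr_extract)
# print(report[report["pct_missing"] > 50])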
More questions to be answered: "Who is presenting that data to you? Your providers? Your patients? Is it coming from telemetry? Is it coming from automated systems that extract metadata from images?"
Once those questions are answered satisfactorily – once you've made sure the data has been gathered in a comprehensive enough fashion to develop the algorithm you want – then it's just a question of identifying potential biases and mitigating them. Easy enough, right?
"In
the dataset that you have, what are the multimodal data elements? Just patient
registration is probably not sufficient to create an AI model. Do you have such
things as text, the notes, the history and physical [exam], the operative note,
the diagnostic information? Do you have images? Do you have telemetry? Do you
have genomics? Digital pathology? That is going to give you a sense of data
depth – multiple different kinds of data, which are probably going to be used
increasingly as we develop different algorithms that look beyond just
structured and unstructured data."
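As a rough illustration of that notion of data depth, one might simply count how many modalities are actually populated per patient. The record structure below is hypothetical, purely to make the idea concrete.

# Rough sketch of scoring "data depth" per patient: how many distinct
# modalities (notes, images, telemetry, genomics, pathology) are on file.
# The record structure is hypothetical, for illustration only.
from dataclasses import dataclass, field

@dataclass
class PatientRecord:
    demographics: dict = field(default_factory=dict)  # registration data
    notes: list = field(default_factory=list)         # H&P, operative notes
    images: list = field(default_factory=list)        # radiology studies
    telemetry: list = field(default_factory=list)     # monitor waveforms
    genomics: list = field(default_factory=list)      # variant calls
    pathology: list = field(default_factory=list)     # digital slides

def data_depth(rec: PatientRecord) -> int:
    """Count how many modalities beyond registration are non-empty."""
    modalities = [rec.notes, rec.images, rec.telemetry, rec.genomics, rec.pathology]
    return sum(1 for m in modalities if m)

# A record with only registration data scores 0 – probably not enough
# to build a useful model on, per Halamka's point.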
Then it's time to think about data breadth. "How many patients do you have? I talked to several colleagues internationally that say, well, we have a registry of 5,000 patients, and we're going to develop AI on that registry. Well, 5,000 is probably not breadth enough to give you a highly resilient model."
And what about "heterogeneity or spread?" Halamka asked. "Mayo has 11.2 million patients in Arizona, Florida, Minnesota and internationally. But does it offer representative data of France, or a representative Nordic population?"
"Any dataset from any one institution is probably going to lack the spread to create algorithms that can be globally applied," said Halamka. In fact, you could probably argue that no one can create an unbiased algorithm in one geography that will work seamlessly in another.
What that implies, he said, is that you need a global network of federated participants to help with model creation, model testing and local tuning if we're going to deliver the AI result we want on a global basis.
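The article does not describe how such a network would be built. As a generic illustration only, one round of federated averaging might look like the sketch below, where each site trains on its own data and only model weights ever leave the institution.

# Generic sketch of federated model averaging (FedAvg-style): each site
# trains locally and shares only model weights, never patient data.
# This illustrates the concept; it is not Mayo's actual implementation.
import numpy as np

def local_update(weights: np.ndarray, X: np.ndarray, y: np.ndarray,
                 lr: float = 0.1) -> np.ndarray:
    """One gradient step of logistic regression on a site's private data."""
    preds = 1.0 / (1.0 + np.exp(-X @ weights))
    grad = X.T @ (preds - y) / len(y)
    return weights - lr * grad

def federated_round(weights: np.ndarray, sites: list) -> np.ndarray:
    """Average locally updated weights across participating sites."""
    updates = [local_update(weights, X, y) for X, y in sites]
    return np.mean(updates, axis=0)

# Each (X, y) pair stays at its own institution; only `weights` travel,
# which is what makes a global network of participants feasible.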
On that front, one of the biggest challenges is that "not every country on the planet has fully digitized records," said Halamka, who was recently in Davos, Switzerland for the World Economic Forum.
"Why
haven't we created an amazing AI model in Switzerland?" he asked.
"Well, Switzerland has extremely good chocolate – and extremely bad
electronic health records. And about 90% of the data of Switzerland is on
paper."
But even with good digitized data – and even after accounting for that data's depth, breadth and spread – there are still other questions to consider. For instance, what data should be included in the model?
"If
you want a fair, appropriate, valid, effective and safe algorithm, should you
use race ethnicity as an input to your AI model? The answer is to be really
careful with doing that, because it may very well bias the model in ways you
don't want," said Halamka.
"If
there was some sort of biological reason to have race ethnicity as a data
element, OK, maybe it's helpful. But if it's really not related to a disease
state or an outcome you're predicting, you're going to find – and I'm sure
you've all read the literature about overtreatment, undertreatment,
overdiagnosis – these kinds of problems. So you have to be very careful when
you decide to build the model, what data to include."
Even more steps: "Then, once you have the model, you need to test it on data that's not the development set – and that may be a segregated data set in your organization, or maybe another organization in your region or around the world. And the question I would ask you all is, what do you measure? How do you evaluate a model to make sure that it is fair? What does it mean to be fair?"
Halamka has been working for some time with the Coalition for Health AI, which was founded with the idea that, "if we're going to define what it means to be fair, or effective, or safe, that we're going to have to do it as a community."
CHAI started with just six organizations. Today, it's got 1,500 members from around the world, including all the big tech organizations, academic medical centers, regional healthcare systems, payers, pharma and government.
"You
now have a public private organization capable of working as a community to
define what it means to be fair, how you should measure what is a testing and
evaluation framework, so we can create data cards, what data went into the
system and model cards, how do they perform?"
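CHAI's actual artifact formats live on its site; purely to give a flavor of what a data card and model card capture, a minimal hypothetical pair might look like this (every value below is a made-up placeholder).

# Purely illustrative: the flavor of what a "data card" and "model card"
# record. CHAI's real formats are at CoalitionforHealthAI.org; all values
# here are invented placeholders.
data_card = {
    "sources": ["EHR structured data", "clinical notes"],
    "collection_period": "2015-2023",
    "gathered_by": "treating clinicians at point of care",
    "known_gaps": ["SDOH fields sparse", "single-region population"],
}

model_card = {
    "intended_use": "risk screening, not diagnosis",
    "training_data": "see data_card",
    "performance": {"overall_auroc": 0.84},          # placeholder number
    "subgroup_performance": {"low_bmi_ppv": 0.91,    # placeholder numbers
                             "high_bmi_ppv": 0.62},
    "out_of_scope": ["populations unlike the training cohort"],
}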
It's a fact that every algorithm will have some sort of inherent bias, said Halamka.
That's why "Mayo has an assurance lab, and we test commercial algorithms and self-developed algorithms," he said. "And what you do is you identify the bias and then you mitigate it. It can be mitigated by retuning the algorithm on different kinds of data, or just an understanding that the algorithm can't be completely fair for all patients. You just have to be exceedingly careful where and how you use it.
"For
example, Mayo has a wonderful cardiology algorithm that will predict cardiac
mortality, and it has incredible predictive, positive predictive value for a
body mass index that is low and a really not good performance for a body mass
index that is high. So is it ethical to use that algorithm? Well, yes, on
people whose body mass index is low, and you just need to understand that bias
and use it appropriately."
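That BMI example translates directly into a subgroup check. The sketch below computes positive predictive value separately per BMI stratum; the cut point of 30 and the arrays are illustrative assumptions, not Mayo's method.

# Sketch of surfacing the kind of subgroup bias Halamka describes:
# positive predictive value (PPV) computed separately per BMI stratum.
# The BMI cut point of 30 is an illustrative assumption.
import numpy as np

def ppv(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """PPV = true positives / all predicted positives."""
    predicted_pos = y_pred == 1
    if not predicted_pos.any():
        return float("nan")
    return float(np.mean(y_true[predicted_pos] == 1))

def ppv_by_bmi(y_true, y_pred, bmi, cut: float = 30.0) -> dict:
    """Stratify PPV by BMI above/below the cut point."""
    low, high = bmi < cut, bmi >= cut
    return {"bmi_low": ppv(y_true[low], y_pred[low]),
            "bmi_high": ppv(y_true[high], y_pred[high])}

# If the bmi_high PPV is far below bmi_low, the ethical path per Halamka
# is to restrict use to the stratum where performance is known to hold.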
Halamka noted that the Coalition for Health AI has created an extensive series of metrics and artifacts and processes – available at CoalitionforHealthAI.org. "They're all for free. They're international. They're for download."
Over the next few months, CHAI "will be turning its attention to a lot of generative AI topics," he said, "because generative AI evaluation is harder."
With predictive models, "I can understand what data went in, what data comes out, how it performs against ground truth. Did you have the diagnosis or not? Was the recommendation used or helpful?"
With generative AI, "it may be a completely well-developed technology, but based on the prompt you give it, the answer could either be accurate or kill the patient."
Halamka offered a real example.
"We
took a New England Journal of Medicine
CPC
case and gave it to a commercial narrative AI product. The case said the
following: The patient is a 59-year-old with crushing, substantial chest pain,
shortness of breath – and left leg radiation.
"Now,
for the clinicians in the room, you know that left leg radiation is kind of
odd. But remember, our generative AI systems are trained to look at language.
And, yeah, they've seen that radiation thing on chest pain cases a thousand
times.
"So
ask the following question on ChatGPT or Anthropic or whatever it is you're
using: What is the diagnosis? The diagnosis came back: 'This patient is having
myocardial infarction. Anticoagulate them immediately.'
"But
then ask a different question: 'What diagnosis shouldn't I miss?'"
To that query, the AI responded: "'Oh, don't miss dissecting aortic aneurysm and, of course, left leg pain,'" said Halamka. "In this case, this was an aortic aneurysm – for which anticoagulation would have instantly killed the patient.
"So
there you go. If you have a product, depending on the question you ask, it
either gives you a wonderful bit of guidance or kills the patient. That is not
what I would call a highly reliable product. So you have to be exceedingly
careful."
At the Mayo Clinic, "we've done a lot of derisking," he said. "We've figured out how to de-identify data and how to keep it safe, the generation of models, how to build an international coalition of organizations, how to do validation, how to do deployment."
Not every health system is as advanced and well-resourced as Mayo, of course.
"But
my hope is, as all of you are on your AI journey – predictive and generative –
that you can take some of the lessons that we've learned, take some of the
artifacts freely available from the Coalition for Health AI, and build a
virtuous life cycle in your own organization, so that we'll get the benefits of
all this AI we need while doing no patient harm," he said.
More here:
https://www.healthcareitnews.com/news/john-halamka-risks-and-benefits-clincial-llms
It is well worth reading this article and following up the ideas offered. A really high-value talk, I reckon!
David.