STAT+: New Stanford tool evaluates AI models on tasks that actually matter in health care

Stanford researchers have developed a new tool to evaluate AI language models for routine health care tasks.

Mar 17, 2025 - 09:34

Harvard Medical School professor Isaac Kohane remembers being asked, when he was a trainee doctor, to diagnose a child with low blood sugar in the intensive care unit. He delivered a beautifully comprehensive list of everything it could possibly be, he recalled — “Mwah!” Then his attending asked him a simple question: “When were the IVs switched?”

Sure enough, looking back at the logs, there had been a five-minute period where there was no glucose flowing into the child’s line, and the built-up insulin in their body dropped their blood sugar. Kohane, who is now the chair of biomedical informatics at Harvard, laughs at himself now. “He was thinking about the way the real world works. And I was focusing on book smarts,” he said.

Some experts worry that a similar situation is building with artificial intelligence as the health care industry rushes to implement AI language models largely on the strength of such models' ability to pass knowledge tests like the U.S. medical licensing exam. There's little evidence that AI models can reliably perform as well as or better than clinicians in real-world settings: When researchers built a test to see how well AI answered physician queries and instructions, they found that GPT-4 had a 35% error rate compared with answers written by humans.
