STAT+: OpenAI leaps into health care with AI benchmark to evaluate models

OpenAI on Monday released a large dataset for evaluating how well large language models answer health care questions. Experts lauded the open-source data and detailed evaluation rubrics, calling them “unprecedented” in scale and breadth. The project, HealthBench, marks OpenAI’s first foray into health care applications of AI beyond its external partnerships.
“Our mission as OpenAI is to ensure AGI is beneficial to humanity,” said Karan Singhal, who leads OpenAI’s health AI team, referring to OpenAI’s goal of developing artificial general intelligence. “One part of that is building and deploying technology. Another part of it is ensuring that positive applications like health care have a place to flourish and that we do the right work to ensure that the models are safe and reliable in these settings,” he said.
OpenAI’s HealthBench contains 5,000 “realistic health conversations,” each with a custom rubric for grading a model’s responses to health-related questions. The questions and rubrics were curated by a group of 262 physicians who have practiced in a combined 60 countries, the company said. In total, the rubrics encompass over 57,000 unique criteria, allowing the company to measure model performance along many more dimensions than traditional benchmarks do, Singhal said.