What Will It Take to Translate AI Research into Clinical Advances?
How to get from benchmarks to bedside
Perspective
Photo illustration: Valerie Chiang
When I decided to join a natural language processing lab during my residency in radiation oncology, I encountered some skepticism from colleagues.
Natural language processing (NLP) is a field of artificial intelligence (AI) that focuses on “teaching” computers to understand and process human language, and it falls outside the continuum of basic science to clinical research that residents in my field typically focus on. “Why not leave computer science to the engineers?” I was often asked. “Can’t we clinicians just use the tools they create?”
I joined the lab — Guergana Savova’s clinical NLP lab in the Computational Health Informatics Program at Boston Children’s Hospital — in 2018, right around the time that Google released its groundbreaking model BERT, the predecessor of today’s large language models (LLMs). The impressive performance of BERT and similar models on many traditionally challenging NLP tasks captured the attention of the computer science community, drawing new scientists to the field. But it wasn’t until 2022, when OpenAI released ChatGPT, that the public really began to interact with, understand, and envision the potential of LLMs. By that time, I had established my own lab — focused on translating AI into the clinic — within the Department of Radiation Oncology at Brigham and Women’s Hospital/Dana-Farber Cancer Institute.
LLMs are now nearly synonymous with AI in the general vernacular, and the rapid advances in their development have played a major role in driving research on and investments in AI. These technologies undoubtedly hold the potential to advance human health, but the enthusiasm surrounding them often overshadows the on-the-ground reality of translating them into clinical care. Within months of ChatGPT’s release, studies reported that LLMs could pass benchmark biomedical exams, including the United States Medical Licensing Exam (USMLE), implying that they had the ability to conduct clinical reasoning. These reports received a lot of media attention and spurred efforts to employ the models for patient portal messaging, ambient documentation, and other uses.
At the same time, my research group and others found that these models fell short when moving from multiple-choice exams to more realistic use cases that require deeper clinical reasoning, such as responding to patient messages or answering patients’ questions about their cancer care. For example, in one study we found that while ChatGPT almost always recommended at least one correct component of multimodal treatment, one-third of its responses also included an inappropriate modality, such as surgery or radiotherapy for noncurative diagnoses, or newer systemic therapies, including targeted therapies and immunotherapies, for diagnoses in which they are not indicated. More recently, clinical pilots using LLMs to help clinicians respond to patients’ questions have failed to produce the anticipated efficiency gains, although other use cases, such as ambient documentation, have shown more promise.
This gap between the hype surrounding AI and the reality of its role in health care is impeding the translation of AI advances into practical, user-centered applications that are actually needed by health care institutions, clinicians, and — most importantly — patients. My field of radiation oncology has a long history of close collaborations between clinicians, engineers, and physicists to develop technologies that address significant clinical challenges, such as stereotactic radiosurgery. We need to foster that type of translational expertise in the development of AI technologies to ensure they are grounded in real health care needs.
Better benchmarks
One step in this direction is to develop benchmarks that reflect real-world goals. For developers, benchmarks and leaderboards play a major role in driving competition and innovation. A benchmark is a dataset that can be used to score a model’s performance. Leaderboards track and rank model performance across a group of benchmarks as a proxy for performance in a given domain or task. For example, the Open Medical-LLM Leaderboard includes the USMLE as well as other datasets pertaining to biomedicine that were mostly generated from existing multiple-choice biomedical exams. Developers use such leaderboards to demonstrate their model’s strength in the domain of health care.
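To make the mechanics concrete, here is a minimal sketch, in Python, of how a multiple-choice benchmark scores a model: a set of question-answer items, a call to the model, and an accuracy score of the kind leaderboards rank. The sample items and the ask_model() stub are illustrative placeholders I am assuming for the example, not drawn from any actual benchmark or model API.

```python
# Minimal sketch of how a multiple-choice benchmark scores a model.
# The items and ask_model() are illustrative placeholders, not a real
# benchmark dataset or a real model API.

benchmark = [
    {"question": "Which electrolyte abnormality most commonly causes peaked T waves on ECG?",
     "options": {"A": "Hypokalemia", "B": "Hyperkalemia", "C": "Hypocalcemia", "D": "Hypernatremia"},
     "answer": "B"},
    {"question": "Which imaging modality is first-line for suspected acute cholecystitis?",
     "options": {"A": "CT", "B": "MRI", "C": "Ultrasound", "D": "Plain radiograph"},
     "answer": "C"},
]

def ask_model(question: str, options: dict) -> str:
    """Placeholder for a call to an LLM; here it simply guesses the first option."""
    return sorted(options)[0]

def score(items: list) -> float:
    """Fraction of items where the model's letter choice matches the answer key."""
    correct = sum(ask_model(item["question"], item["options"]) == item["answer"]
                  for item in items)
    return correct / len(items)

if __name__ == "__main__":
    # Leaderboards rank models by aggregate scores like this one.
    print(f"Benchmark accuracy: {score(benchmark):.0%}")
```

A score like this is easy to compute and compare across models, which is exactly why benchmarks drive development; the open question this essay raises is how well such a number reflects performance on real clinical tasks.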
These benchmarks are already having a real effect on accelerating the integration of LLMs into health care, but the clinical community is not usually included in constructing them and may not even be aware that they exist. This focus on benchmark performance inadvertently marginalizes the voices of clinicians and patients. Chasing leaderboards leads to the development of LLMs that are really good at answering questions found in the benchmark exams, but it’s not clear how this kind of performance relates to clinical use. My research group has shown that popular benchmarks are not always reliable and may not demonstrate generalizable knowledge or understanding. For example, we found that using the brand name of a medication in a prompt to a model can result in different answers than using the generic name.
The result of this focus on existing benchmarks is a mismatch between the purported clinical performance of a model and its actual performance, risks, and utility. More realistic benchmarks require close collaboration between clinicians and computer scientists — neither can do this alone. Clinicians can identify the clinically relevant task for a model to perform, create the dataset that models will be scored against, and determine threshold performance metrics needed to proceed to clinical testing. Computer scientists can ensure benchmarks effectively stress-test models and are evaluable, accessible, secure, and robust.
Testing, testing
A second step is to address the tension between the speed of AI advances and the time needed to implement and evaluate new approaches in the clinic. Although we don’t want unnecessary delays, uptake of AI tools based on insufficient, inadequately aligned testing and anecdotal experiences risks early harms and loss of trust, which over the long term could slow implementation. Benchmarks can’t model the complexities of clinical care or of the interaction between humans and AI. To understand the full spectrum of risks and benefits, we need realistic preclinical simulations and lower-risk applications where there is a human expert in the loop to oversee and fix errors.
For example, my research group found that use of LLMs to respond to simulated patient questions about cancer care led to responses that contained more educational content than responses written by clinicians alone, and clinicians felt more efficient. These results indicate that using an LLM could improve health literacy and mitigate burnout. But responses drafted by the models also sometimes included errors, and clinicians exhibited automation bias when given the drafts written by models — that is, clinicians were likely to accept the model’s output even when they would have answered differently if writing the response on their own.
These results indicate that clinical decision-making was affected and that errors could make their way to patients if clinicians aren’t vigilant. There were also a few severe errors that resulted when models inadequately communicated urgency, a critical type of error that is not tested in biomedical exam benchmarks. This straightforward study gave us a path to improve models for our use case, develop more realistic, safety-focused benchmarks, and explore strategies to better support clinicians in overseeing AI output.
Approaches to evaluating AI can draw on existing standards for medical research, ethics, and regulation. Like all medical technologies, AI should be evaluated in a risk-stratified manner suited to the clinical domain, specific use case, and implementation strategy. Research focused on understanding where and how to integrate new AI technologies into the clinical workflow will ensure that we don’t miss an otherwise promising technology that was suboptimally implemented. At a more transformative level, clinical expertise is needed to reimagine how medical data is entered and used in order to realize the true potential of AI in health care.
Clinicians’ role
As a third step, we need to make it easier for clinicians and computer scientists to collaborate. I have worked with many talented computer scientists who are passionate about developing safe, ethical AI innovations to improve the lives of patients and clinicians. But most work outside of health care and do not have easy access to the clinical expertise and datasets needed to build tools that address pressing medical needs.
Computer scientists and clinicians largely exist in separate academic spheres. When I attend NLP or broader AI conferences, I am often one of only a handful of clinicians present, even at conferences with special sessions and workshops dedicated to AI in health care. Computer scientists rarely attend clinical conferences, even those that have strong communities of clinical and biomedical informatics researchers. We also publish our research in different venues and might not even know where to find one another’s papers.

There is a lot we can learn from other translational research fields. We can improve collaboration via joint conferences, collaborative journals, and interdisciplinary research teams. Such efforts could be incentivized by multidisciplinary funding opportunities, promotion criteria, and requirements for meaningful engagement between clinicians and developers who want to bring their models to the clinic. I urge my colleagues interested in health care AI to consider attending AI conferences. You will certainly meet many scientists interested in using their skills to improve health care, and you might even find you have an outsized voice in aligning technical advances with clinical needs.
AI is becoming more accessible, but the nuanced understanding of clinical contexts remains a critical, nonautomatable asset unique to clinicians. As clinicians, we can take a leadership role in identifying high-value use cases; overseeing clinically appropriate and reproducible evaluations that demonstrate value for patients, the workforce, and payers; advocating for transparency about AI models we’re using; and ensuring that patients are included in decisions about AI strategies. Bringing AI to the clinic requires critical evaluation under uncertainty, balancing risks and benefits, and navigating and communicating new information quickly. We practice these skills every day. We are well-equipped to partner with our computer science colleagues to move beyond the hype cycle toward genuine improvements in health care.
Danielle Bitterman is an HMS assistant professor in the Department of Radiation Oncology at Brigham and Women’s Hospital/Dana-Farber Cancer Institute and a faculty member in the AI in Medicine program at Mass General Brigham. She is a clinical lead for data science/AI at Mass General Brigham Digital.