Application of unified health large language model evaluation framework to In-Basket message replies: bridging qualitative and quantitative assessments.
IF 4.7; Zone 2 (Medicine); JCR Q1 (COMPUTER SCIENCE, INFORMATION SYSTEMS)
Chuan Hong, Anand Chowdhury, Anthony D Sorrentino, Haoyuan Wang, Monica Agrawal, Armando Bedoya, Sophia Bessias, Nicoleta J Economou-Zavlanos, Ian Wong, Christian Pean, Fan Li, Kathryn I Pollak, Eric G Poon, Michael J Pencina
Journal of the American Medical Informatics Association. Published 2025-03-10. DOI: 10.1093/jamia/ocaf023 (https://doi.org/10.1093/jamia/ocaf023)
Citations: 0
Abstract
Objectives: Large language models (LLMs) are increasingly utilized in healthcare, transforming medical practice through advanced language processing capabilities. However, the evaluation of LLMs predominantly relies on human qualitative assessment, which is time-consuming, resource-intensive, and may be subject to variability and bias. There is a pressing need for quantitative metrics to enable scalable, objective, and efficient evaluation.
Materials and methods: We propose a unified evaluation framework that bridges qualitative and quantitative methods to assess LLM performance in healthcare settings. This framework maps evaluation aspects (such as linguistic quality, efficiency, content integrity, trustworthiness, and usefulness) to both qualitative assessments and quantitative metrics. We apply our approach to empirically evaluate the Epic In-Basket feature, which uses an LLM to generate patient message replies.
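One way to picture the aspect-to-method mapping described above is as a small lookup structure. The sketch below is illustrative only: the five aspect names come from the abstract, but the class name `AspectMapping`, the helper `unmapped_aspects`, and every method label are hypothetical placeholders, since the abstract does not say which assessments or metrics the authors assign to each aspect.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class AspectMapping:
    """Links one evaluation aspect to its qualitative and quantitative methods."""
    aspect: str
    qualitative: Tuple[str, ...]   # human assessments (placeholder labels)
    quantitative: Tuple[str, ...]  # automated metrics (placeholder labels)


# Aspect names are taken from the abstract; the method labels are
# hypothetical placeholders, NOT the methods used in the paper.
ASPECTS = (
    "linguistic quality",
    "efficiency",
    "content integrity",
    "trustworthiness",
    "usefulness",
)

FRAMEWORK = tuple(
    AspectMapping(
        aspect=name,
        qualitative=(f"reviewer rating for {name} (placeholder)",),
        quantitative=(f"automated metric for {name} (placeholder)",),
    )
    for name in ASPECTS
)


def unmapped_aspects(framework: Sequence[AspectMapping]) -> List[str]:
    """Return aspects that lack a qualitative or a quantitative method."""
    return [m.aspect for m in framework if not m.qualitative or not m.quantitative]


if __name__ == "__main__":
    for mapping in FRAMEWORK:
        print(mapping.aspect, "->", mapping.qualitative, "|", mapping.quantitative)
    print("Aspects missing a method:", unmapped_aspects(FRAMEWORK))
```

Keeping both method lists on the same record makes it easy to check that every aspect is covered on both sides before any automated metric is used as a stand-in for human review.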
Results: The empirical evaluation demonstrates that while Artificial Intelligence (AI)-generated replies exhibit high fluency, clarity, and minimal toxicity, they face challenges with coherence and completeness. Clinicians' manual decisions to use AI-generated drafts correlate strongly with quantitative metrics, suggesting that these metrics could reduce human effort in the evaluation process and make it more scalable.
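The abstract does not state which statistic underlies the reported correlation. As a hedged illustration of how a binary "draft used" decision can be compared with a continuous metric score, the sketch below computes a point-biserial correlation, which is a choice made here for illustration and not necessarily the authors' method. The function name `point_biserial` and the toy values are hypothetical and are not data from the study.

```python
from math import sqrt
from statistics import mean, pstdev
from typing import Sequence


def point_biserial(used: Sequence[int], scores: Sequence[float]) -> float:
    """Point-biserial correlation between a 0/1 indicator and a numeric score.

    r_pb = (M1 - M0) / s_n * sqrt(p * (1 - p)), where M1 and M0 are the mean
    scores of the two groups, s_n is the population standard deviation of all
    scores, and p is the share of observations with indicator 1.
    """
    if len(used) != len(scores):
        raise ValueError("used and scores must have the same length")
    group1 = [s for u, s in zip(used, scores) if u]
    group0 = [s for u, s in zip(used, scores) if not u]
    if not group1 or not group0:
        raise ValueError("both groups need at least one observation")
    spread = pstdev(scores)
    if spread == 0:
        raise ValueError("scores must not all be identical")
    p = len(group1) / len(scores)
    return (mean(group1) - mean(group0)) / spread * sqrt(p * (1 - p))


if __name__ == "__main__":
    # Toy values for illustration only (not study data):
    # 1 = clinician used the AI draft, 0 = clinician did not.
    draft_used = [1, 1, 0, 0]
    metric_score = [0.9, 0.7, 0.4, 0.2]
    print(round(point_biserial(draft_used, metric_score), 3))
```

A value near 1 or -1 would indicate that the metric separates used from unused drafts well; a value near 0 would suggest the metric carries little information about the clinicians' decision.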
Discussion: Our study highlights the potential of a unified evaluation framework that integrates qualitative and quantitative methods, enabling scalable and systematic assessments of LLMs in healthcare. Automated metrics streamline evaluation and monitoring processes, but their effective use depends on alignment with human judgment, particularly for aspects requiring contextual interpretation. As LLM applications expand, refining evaluation strategies and fostering interdisciplinary collaboration will be critical to maintaining high standards of accuracy, ethics, and regulatory compliance.
Conclusion: Our unified evaluation framework bridges the gap between qualitative human assessments and automated quantitative metrics, enhancing the reliability and scalability of LLM evaluations in healthcare. Automated quantitative evaluations are not ready to fully replace qualitative human evaluations, but they can enhance the process. With relevant benchmarks derived from the unified framework proposed here, they can also be applied to LLM monitoring and to the evaluation of updated versions of a technology originally evaluated using qualitative human standards.
About the journal:
JAMIA is AMIA's premier peer-reviewed journal for biomedical and health informatics. Covering the full spectrum of activities in the field, JAMIA includes informatics articles in the areas of clinical care, clinical research, translational science, implementation science, imaging, education, consumer health, public health, and policy. JAMIA's articles describe innovative informatics research and systems that help to advance biomedical science and to promote health. Case reports, perspectives, and reviews also help readers stay connected with the most important informatics developments in implementation, policy, and education.