Leona Cilar Budler, Hongyu Chen, Aokun Chen, Maxim Topaz, Wilson Tam, Jiang Bian, Gregor Stiglic
{"title":"A Brief Review on Benchmarking for Large Language Models Evaluation in Healthcare","authors":"Leona Cilar Budler, Hongyu Chen, Aokun Chen, Maxim Topaz, Wilson Tam, Jiang Bian, Gregor Stiglic","doi":"10.1002/widm.70010","DOIUrl":null,"url":null,"abstract":"This paper reviews benchmarking methods for evaluating large language models (LLMs) in healthcare settings. It highlights the importance of rigorous benchmarking to ensure LLMs' safety, accuracy, and effectiveness in clinical applications. The review also discusses the challenges of developing standardized benchmarks and metrics tailored to healthcare‐specific tasks such as medical text generation, disease diagnosis, and patient management. Ethical considerations, including privacy, data security, and bias, are also addressed, underscoring the need for multidisciplinary collaboration to establish robust benchmarking frameworks that facilitate LLMs' reliable and ethical use in healthcare. Evaluation of LLMs remains challenging due to the lack of standardized healthcare‐specific benchmarks and comprehensive datasets. Key concerns include patient safety, data privacy, model bias, and better explainability, all of which impact the overall trustworthiness of LLMs in clinical settings.","PeriodicalId":501013,"journal":{"name":"WIREs Data Mining and Knowledge Discovery","volume":"7 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2025-04-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"WIREs Data Mining and Knowledge Discovery","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1002/widm.70010","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0
Abstract
This paper reviews benchmarking methods for evaluating large language models (LLMs) in healthcare settings. It highlights the importance of rigorous benchmarking to ensure LLMs' safety, accuracy, and effectiveness in clinical applications. The review also discusses the challenges of developing standardized benchmarks and metrics tailored to healthcare‐specific tasks such as medical text generation, disease diagnosis, and patient management. Ethical considerations, including privacy, data security, and bias, are also addressed, underscoring the need for multidisciplinary collaboration to establish robust benchmarking frameworks that facilitate LLMs' reliable and ethical use in healthcare. Evaluation of LLMs remains challenging due to the lack of standardized healthcare‐specific benchmarks and comprehensive datasets. Key concerns include patient safety, data privacy, model bias, and better explainability, all of which impact the overall trustworthiness of LLMs in clinical settings.