{"title":"Clinical Trial Design Approach to Auditing Language Models in Health Care Setting.","authors":"Lovedeep Gondara, Jonathan Simkin, Shebnum Devji","doi":"10.1200/CCI-24-00331","DOIUrl":null,"url":null,"abstract":"<p><strong>Purpose: </strong>Rapid advancements in natural language processing have led to the development of sophisticated language models. Inspired by their success, these models are now used in health care for tasks such as clinical documentation and medical record classification. However, language models are prone to errors, which can have serious consequences in critical domains such as health care, ensuring that their reliability is essential to maintain patient safety and data integrity.</p><p><strong>Methods: </strong>To address this, we propose an innovative auditing process based on principles from clinical trial design. Our approach involves subject matter experts (SMEs) manually reviewing pathology reports without previous knowledge of the model's classification. This single-blind setup minimizes bias and allows us to apply statistical rigor to assess model performance.</p><p><strong>Results: </strong>Deployed at the British Columbia Cancer Registry, our audit process effectively identified the core issues in the operational models. Early interventions addressed these issues, maintaining data integrity and patient care standards.</p><p><strong>Conclusion: </strong>The audit provides real-world performance metrics and underscores the importance of human-in-the-loop machine learning. Even advanced models require SME oversight to ensure accuracy and reliability. To our knowledge, we have developed the first continuous audit process for language models in health care, modeled after clinical trial principles. This methodology ensures that audits are statistically sound and operationally feasible, setting a new standard for evaluating language models in critical applications.</p>","PeriodicalId":51626,"journal":{"name":"JCO Clinical Cancer Informatics","volume":"9 ","pages":"e2400331"},"PeriodicalIF":3.3000,"publicationDate":"2025-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"JCO Clinical Cancer Informatics","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1200/CCI-24-00331","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2025/6/3 0:00:00","PubModel":"Epub","JCR":"Q2","JCRName":"ONCOLOGY","Score":null,"Total":0}
Abstract
Purpose: Rapid advancements in natural language processing have led to the development of sophisticated language models. Inspired by their success, these models are now used in health care for tasks such as clinical documentation and medical record classification. However, language models are prone to errors, which can have serious consequences in critical domains such as health care; ensuring their reliability is therefore essential to maintaining patient safety and data integrity.
Methods: To address this, we propose an innovative auditing process based on principles from clinical trial design. Our approach involves subject matter experts (SMEs) manually reviewing pathology reports without prior knowledge of the model's classifications. This single-blind setup minimizes bias and allows us to apply statistical rigor when assessing model performance.
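As an illustration of the kind of statistical rigor such an audit can apply (the abstract does not specify the authors' exact procedure), the sketch below computes a clinical-trial-style sample size for estimating a model's classification accuracy within a chosen margin of error, using the standard normal approximation for a binomial proportion. The function name and parameter values are illustrative assumptions, not the paper's implementation.

```python
import math

def audit_sample_size(expected_accuracy: float, margin: float, z: float = 1.96) -> int:
    """Number of pathology reports SMEs must review to estimate model
    accuracy within +/- `margin` at ~95% confidence (z = 1.96), using the
    normal approximation for a binomial proportion, as in classical
    clinical trial sample-size planning."""
    p = expected_accuracy
    n = (z ** 2) * p * (1 - p) / margin ** 2
    return math.ceil(n)

# Example: if we expect ~90% accuracy and want a +/-3% margin of error,
# the audit needs roughly 385 blinded SME reviews.
print(audit_sample_size(expected_accuracy=0.90, margin=0.03))
```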
Results: Deployed at the British Columbia Cancer Registry, our audit process effectively identified the core issues in the operational models. Early interventions addressed these issues, maintaining data integrity and patient care standards.
Conclusion: The audit provides real-world performance metrics and underscores the importance of human-in-the-loop machine learning. Even advanced models require SME oversight to ensure accuracy and reliability. To our knowledge, we have developed the first continuous audit process for language models in health care, modeled after clinical trial principles. This methodology ensures that audits are statistically sound and operationally feasible, setting a new standard for evaluating language models in critical applications.
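To make the idea of real-world performance metrics concrete, here is a minimal sketch, assuming the blinded SME labels serve as the gold standard (the names and figures below are hypothetical, not drawn from the paper). It reports observed accuracy with a Wilson score interval, which remains well behaved at the modest sample sizes typical of periodic audits.

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score ~95% confidence interval for a binomial proportion,
    e.g., the fraction of model classifications that blinded SME review
    confirmed as correct."""
    p = successes / n
    denom = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2))
    return center - half, center + half

# Hypothetical example: SMEs agreed with the model on 352 of 385 audited reports.
lo, hi = wilson_interval(successes=352, n=385)
print(f"accuracy = {352 / 385:.3f}, 95% CI = ({lo:.3f}, {hi:.3f})")
```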