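Empirical evaluation of artificial intelligence distillation techniques for ascertaining cancer outcomes from electronic health records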

Impact factor 15.1 · Zone 1 (Medicine) · JCR Q1 (Health Care Sciences & Services)
Irbaz Bin Riaz, Syed Arsalan Ahmed Naqvi, Noman Ashraf, Gordon J. Harris, Kenneth L. Kehl
Journal: npj Digital Medicine
Publication date: 2025-06-10
Publication type: Journal Article
DOI: 10.1038/s41746-025-01646-7 (https://doi.org/10.1038/s41746-025-01646-7)
Citations: 0

Abstract

Phenotypic information for cancer research is embedded in unstructured electronic health records (EHR), requiring effort to extract. Deep learning models can automate this but face scalability issues due to privacy concerns. We evaluated techniques for applying a teacher-student framework to extract longitudinal clinical outcomes from EHRs. We focused on the challenging task of ascertaining two cancer outcomes—overall response and progression according to Response Evaluation Criteria in Solid Tumors (RECIST)—from free-text radiology reports. Teacher models with hierarchical Transformer architecture were trained on data from Dana-Farber Cancer Institute (DFCI). These models labeled public datasets (MIMIC-IV, Wiki-text) and GPT-4-generated synthetic data. “Student” models were then trained to mimic the teachers’ predictions. DFCI “teacher” models achieved high performance, and student models trained on MIMIC-IV data showed comparable results, demonstrating effective knowledge transfer. However, student models trained on Wiki-text and synthetic data performed worse, emphasizing the need for in-domain public datasets for model distillation.
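
The teacher-student workflow described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the study used hierarchical Transformer models, whereas this example substitutes a tiny bag-of-words logistic model so that it runs with the standard library alone. The vocabulary, teacher weights, example texts, and all function names (`featurize`, `teacher`, `student`, `distill`) are hypothetical. The only idea carried over from the abstract is the data flow: a teacher trained on private data scores unlabeled public text, and a student is fitted to those scores without ever seeing the private data.

```python
import math

# Hypothetical toy vocabulary and teacher weights; a stand-in for a model
# trained on private institutional reports (the study used Transformers).
VOCAB = ["progression", "enlarged", "new", "stable", "decreased", "unchanged"]
TEACHER_W = [2.0, 1.5, 1.0, -2.0, -1.5, -1.0]

# Hypothetical unlabeled "public" reports; no human labels are used.
PUBLIC = [
    "new enlarged nodes consistent with progression",
    "stable disease unchanged from prior",
    "lesion decreased in size",
    "enlarged mass with new lesions",
    "findings unchanged and stable",
    "no acute findings",
]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def featurize(text, vocab):
    """Count occurrences of each vocabulary word in the whitespace-split text."""
    tokens = text.lower().split()
    return [float(tokens.count(word)) for word in vocab]


def predict(weights, bias, x):
    """Logistic model: probability of the positive class (e.g. progression)."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def teacher(text):
    """Frozen teacher: returns a soft label (probability) for a report."""
    return predict(TEACHER_W, 0.0, featurize(text, VOCAB))


def student(weights, bias, text):
    """Student prediction using weights learned by distill()."""
    return predict(weights, bias, featurize(text, VOCAB))


def distill(teacher_fn, public_texts, vocab, epochs=500, lr=0.5):
    """Fit a student to the teacher's soft labels on unlabeled public text."""
    xs = [featurize(t, vocab) for t in public_texts]
    soft_labels = [teacher_fn(t) for t in public_texts]
    weights, bias = [0.0] * len(vocab), 0.0
    for _ in range(epochs):
        for x, p in zip(xs, soft_labels):
            q = predict(weights, bias, x)
            grad = q - p  # d(cross-entropy with soft target p)/d(logit)
            weights = [w - lr * grad * xi for w, xi in zip(weights, x)]
            bias -= lr * grad
    return weights, bias


if __name__ == "__main__":
    w, b = distill(teacher, PUBLIC, VOCAB)
    for text in PUBLIC + ["new lesion suggests progression"]:
        print(f"{teacher(text):.3f}  {student(w, b, text):.3f}  {text}")
```

In this toy setting the public texts share vocabulary with the teacher, which is why the student can recover its behavior. Text that shares no features with the teacher's domain would produce uninformative soft labels near 0.5, loosely mirroring the abstract's finding that in-domain public data (MIMIC-IV) transferred knowledge better than Wiki-text or synthetic data.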

Source journal metrics
CiteScore: 25.10
Self-citation rate: 3.30%
Articles published: 170
Review time: 15 weeks
About the journal: npj Digital Medicine is an online open-access journal that publishes peer-reviewed research in digital medicine. Its scope includes the application and implementation of digital and mobile technologies in clinical settings, virtual healthcare, and the use of artificial intelligence and informatics. The journal's primary goal is to support innovation and the advancement of healthcare through the integration of new digital and mobile technologies. In assessing manuscripts, it considers four criteria: novelty, clinical relevance, scientific rigor, and digital innovation.