Synthetic data distillation enables the extraction of clinical information at scale
Elizabeth Geena Woo, Michael C. Burkhart, Emily Alsentzer, Brett K. Beaulieu-Jones
npj Digital Medicine, published 2025-05-10. DOI: 10.1038/s41746-025-01681-4
Large language models (LLMs) show promise for extracting information from clinical notes, but deployment challenges include high computational costs and privacy concerns. We used synthetic data distillation to fine-tune smaller, open-source LLMs to achieve performance comparable to larger models while enabling deployment on local hardware or at reduced cloud cost. Using Llama-3.1-70B-Instruct, we generated synthetic question-answer training pairs to fine-tune smaller Llama models. We evaluated performance across three tasks: synthetic clinical trial criteria, the i2b2 2018 Clinical Trial Eligibility Challenge, and apixaban trial criteria questions. The 8B-parameter model achieved high accuracy across all tasks and sometimes outperformed the 70B-Instruct teacher model. Fine-tuning on only the most challenging questions still improved performance, demonstrating the value of targeted training. Results from the 3B- and 1B-parameter models showed a clear size-performance tradeoff. This work demonstrates the potential of synthetic data distillation to enable clinical information extraction at scale.
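To make the workflow in the abstract concrete, below is a minimal sketch of a synthetic-data-distillation loop: a large teacher model writes question-answer pairs about a clinical note, and the pairs are saved as chat-style records for supervised fine-tuning of a smaller student. The model IDs are real Hugging Face identifiers, but the prompt wording, the toy note, and the output format are illustrative assumptions; the paper's actual prompts, filtering, and training configuration are not reproduced here.

```python
# Hedged sketch of synthetic data distillation (teacher-generated Q-A pairs for a student).
# Prompt text, the example note, and the JSONL schema are assumptions, not the paper's pipeline.
import json
from transformers import pipeline

TEACHER = "meta-llama/Llama-3.1-70B-Instruct"   # teacher that generates training pairs
STUDENT = "meta-llama/Llama-3.1-8B-Instruct"    # student to be fine-tuned on those pairs

# 1) Teacher generates a synthetic eligibility question and answer for a note.
teacher = pipeline("text-generation", model=TEACHER, device_map="auto")

note = "72-year-old on apixaban 5 mg BID for atrial fibrillation; eGFR 48."  # toy note
prompt = (
    "You are annotating clinical notes for trial eligibility.\n"
    f"Note: {note}\n"
    "Write one yes/no eligibility question about this note, then answer it.\n"
    "Format: Question: ... Answer: ..."
)
completion = teacher(prompt, max_new_tokens=128, do_sample=False)[0]["generated_text"]
qa_text = completion[len(prompt):].strip()  # keep only the teacher's continuation

# 2) Store the pair as a chat-style record suitable for supervised fine-tuning.
with open("distillation_pairs.jsonl", "a") as f:
    record = {
        "messages": [
            {"role": "user", "content": f"Note: {note}\n{qa_text.split('Answer:')[0].strip()}"},
            {"role": "assistant", "content": qa_text.split("Answer:")[-1].strip()},
        ]
    }
    f.write(json.dumps(record) + "\n")

# 3) The resulting JSONL file can then be used by a standard supervised
#    fine-tuning loop (e.g., TRL's SFTTrainer) to train the 8B/3B/1B students.
```

In practice the same loop would be run over many notes and criteria, and (as the abstract notes) restricting fine-tuning to the most challenging generated questions can itself improve the student.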
About the journal:
npj Digital Medicine is an online open-access journal that focuses on publishing peer-reviewed research in the field of digital medicine. The journal covers various aspects of digital medicine, including the application and implementation of digital and mobile technologies in clinical settings, virtual healthcare, and the use of artificial intelligence and informatics.
The primary goal of the journal is to support innovation and the advancement of healthcare through the integration of new digital and mobile technologies. When determining if a manuscript is suitable for publication, the journal considers four important criteria: novelty, clinical relevance, scientific rigor, and digital innovation.