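LitAutoScreener: Development and Validation of an Automated Literature Screening Tool in Evidence-Based Medicine Driven by Large Language Models.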

Health data science. Pub Date: 2025-09-02. eCollection Date: 2025-01-01. DOI: 10.34133/hds.0322

Yiming Tao, Xuehu Li, Zuhar Yisha, Sihan Yang, Siyan Zhan, Feng Sun

Volume 5, article 0322. Publication type: Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12404845/pdf/

Citations: 0

Abstract

Background: Manual literature screening is time-consuming and labor-intensive. A pressing question is how large language models can improve the efficiency and quality of evidence-based evaluations of drug efficacy and safety.

Methods: This study used a manually curated reference literature database (covering vaccine, hypoglycemic agent, and antidepressant evaluation studies) that our team had previously built through conventional systematic review methods. This validated database served as the gold standard for developing and optimizing LitAutoScreener. The screening algorithm follows the PICOS (Population, Intervention, Comparison, Outcomes, Study Design) principles and uses chain-of-thought reasoning with few-shot learning prompts. We then evaluated LitAutoScreener on 2 independent validation cohorts, assessing both classification accuracy and processing efficiency.

Results: In the respiratory syncytial virus vaccine safety validation, title-abstract screening with our tools based on GPT (GPT-4o), Kimi (moonshot-v1-128k), and DeepSeek (deepseek-chat 2.5) reached accuracies of 99.38%, 98.94%, and 98.85%, respectively, in inclusion/exclusion decisions. Recall rates were 100.00%, 99.13%, and 98.26%; the difference was statistically significant (χ² = 5.99, P = 0.048), with GPT outperforming the other models. Exclusion reason concordance rates were 98.85%, 94.79%, and 96.47% (χ² = 30.22, P < 0.001). In full-text screening, all models maintained perfect recall (100.00%), with accuracies of 100.00% (GPT), 100.00% (Kimi), and 99.45% (DeepSeek). Processing times averaged 1 to 5 s per article for title-abstract screening and 60 s for full-text processing (including PDF preprocessing).

Conclusions: LitAutoScreener offers a new approach to efficient literature screening in drug intervention studies, achieving high accuracy and significantly improving screening efficiency.
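
As a minimal sketch of how accuracy and recall figures like those reported in the abstract are defined, the following Python compares a tool's include/exclude decisions with gold-standard decisions. The function name `evaluate_screening`, the `ScreeningMetrics` class, and the toy data are illustrative assumptions; they are not part of LitAutoScreener and do not reproduce the study's data.

```python
from dataclasses import dataclass


@dataclass
class ScreeningMetrics:
    accuracy: float  # share of all decisions that match the gold standard
    recall: float    # share of gold-standard inclusions the tool also included


def evaluate_screening(gold: list[bool], predicted: list[bool]) -> ScreeningMetrics:
    """Compare include/exclude decisions (True = include) against a gold standard."""
    if len(gold) != len(predicted) or not gold:
        raise ValueError("gold and predicted must be non-empty and the same length")

    # Records both the reviewers and the tool included.
    true_pos = sum(g and p for g, p in zip(gold, predicted))
    # Records both the reviewers and the tool excluded.
    true_neg = sum((not g) and (not p) for g, p in zip(gold, predicted))

    gold_includes = sum(gold)
    if gold_includes == 0:
        raise ValueError("recall is undefined without any gold-standard inclusions")

    return ScreeningMetrics(
        accuracy=(true_pos + true_neg) / len(gold),
        recall=true_pos / gold_includes,
    )


# Hypothetical toy data: 10 records, 4 of which the reviewers included.
gold = [True, True, True, True, False, False, False, False, False, False]
predicted = [True, True, True, False, False, False, False, False, False, True]

metrics = evaluate_screening(gold, predicted)
print(f"accuracy={metrics.accuracy:.2%}, recall={metrics.recall:.2%}")
```

In screening, recall is usually the metric to watch most closely, because a missed eligible study cannot be recovered at a later review stage, whereas a wrongly included one can still be excluded at full-text review.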

Source journal

Health data science

CiteScore

3.70

Self-citation rate

0.00%

Articles published