How Well Do ChatGPT and Claude Perform in Study Selection for Systematic Review in Obstetrics?

IF 5.7 | Q1, HEALTH CARE SCIENCES & SERVICES (JCR) | Tier 3, Medicine (CAS)
Suppachai Insuk, Kansak Boonpattharatthiti, Chimbun Booncharoen, Panitnan Chaipitak, Muhammed Rashid, Sajesh K Veettil, Nai Ming Lai, Nathorn Chaiyakunapruk, Teerapon Dhippayom
Citations: 0

Abstract


The use of generative AI in systematic review workflows has gained attention for enhancing study selection efficiency. However, evidence on its screening performance remains inconclusive, and direct comparisons between different generative AI models are still limited. The objective of this study is to evaluate the performance of ChatGPT-4o and Claude 3.5 Sonnet in the study selection process of a systematic review in obstetrics.

A literature search was conducted using PubMed, EMBASE, Cochrane CENTRAL, and EBSCO Open Dissertations from inception to February 2024. Titles and abstracts were screened using a structured prompt-based approach, comparing decisions by ChatGPT, Claude, and junior researchers with decisions by an experienced researcher serving as the reference standard. For the full-text review, short and long prompt strategies were applied. We reported title/abstract screening and full-text review performance using accuracy, sensitivity (recall), precision, F1-score, and negative predictive value.

In the title/abstract screening phase, human researchers demonstrated the highest accuracy (0.9593), followed by Claude (0.9448) and ChatGPT (0.9138). The F1-score was highest among human researchers (0.3853), followed by Claude (0.3724) and ChatGPT (0.2755). Negative predictive value (NPV) was high across all screeners: ChatGPT (0.9959), Claude (0.9961), and human researchers (0.9924). In the full-text screening phase, ChatGPT with a short prompt achieved the highest accuracy (0.904), highest F1-score (0.90), and an NPV of 1.00, surpassing the performance of Claude and human researchers.

Generative AI models perform close to human levels in study selection, as evidenced in obstetrics. Further research should explore their integration into evidence synthesis across different fields.
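The five metrics the abstract reports all follow from a single confusion matrix of each screener's include/exclude decisions against the reference standard. A minimal sketch of that computation is below; the counts are illustrative only (the study's actual record counts are not given in the abstract), and the function name is our own. The example also mirrors the abstract's pattern: when relevant records are rare, a screener can show a low F1-score yet a very high NPV.

```python
# Hypothetical sketch: computing the screening metrics reported in the
# abstract (accuracy, sensitivity/recall, precision, F1-score, and
# negative predictive value) from a confusion matrix of one screener's
# decisions versus the reference-standard researcher.
# tp = correctly included, fp = wrongly included,
# tn = correctly excluded, fn = wrongly excluded (missed) records.

def screening_metrics(tp, fp, tn, fn):
    """Return the five metrics used to compare screeners."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    sensitivity = tp / (tp + fn)            # recall
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    npv = tn / (tn + fn)                    # negative predictive value
    return {"accuracy": accuracy, "sensitivity": sensitivity,
            "precision": precision, "f1": f1, "npv": npv}

# Illustrative counts (not the study's data): a screener includes 20
# relevant and 80 irrelevant records, misses 5 relevant ones, and
# correctly excludes 895 of 1000 total records.
m = screening_metrics(tp=20, fp=80, tn=895, fn=5)
print({k: round(v, 4) for k, v in m.items()})
# Note the imbalance: F1 is modest (precision suffers from the 80 false
# positives) while NPV stays near 1 because few relevant records are missed.
```

Because systematic review screening datasets are heavily skewed toward irrelevant records, a high NPV is the metric most directly tied to the risk of missing eligible studies, which is why the abstract highlights it alongside F1.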

Source journal

Journal of Medical Systems (Medicine – Health Care Sciences & Services)
CiteScore: 11.60
Self-citation rate: 1.90%
Articles published: 83
Review time: 4.8 months
About the journal: Journal of Medical Systems provides a forum for the presentation and discussion of the increasingly extensive applications of new systems techniques and methods in hospital, clinic, and physician's office administration; pathology, radiology, and pharmaceutical delivery systems; medical records storage and retrieval; and ancillary patient-support systems. The journal publishes informative articles, essays, and studies across the entire scale of medical systems, from large hospital programs to novel small-scale medical services. Education is an integral part of this amalgamation of sciences, and selected articles are published in this area. Since existing medical systems are constantly being modified to fit particular circumstances and to solve specific problems, the journal includes a special section devoted to status reports on current installations.