How Artificial Intelligence Differs From Humans in Peer Review.

Impact Factor 2.3 · JCR Q2 · CAS Medicine, Tier 3 · DENTISTRY, ORAL SURGERY & MEDICINE
Michael V Joachim, Thomas B Dodson, Amir Laviv
Journal of Oral and Maxillofacial Surgery · Published 2025-03-28 · DOI: 10.1016/j.joms.2025.03.015 · Citations: 0

Abstract

Background: The peer review process faces challenges of reviewer fatigue and bias. Artificial intelligence (AI) may help address these issues, but its application in the oral and maxillofacial surgery peer review process remains unexplored.

Purpose: The purpose of the study was to measure and compare manuscript review performance among 4 large language models and human reviewers. Large language models are AI systems trained on vast text datasets that can generate human-like responses.

Study design/setting/sample: In this cross-sectional study, we evaluated original research articles submitted to the Journal of Oral and Maxillofacial Surgery between January and December 2023. Manuscripts were randomly selected from all submissions that received at least one external peer review.

Predictor variable: The predictor variable was source of review: human reviewers or AI models. We tested 4 AI models: Generative Pretrained Transformer-4o and Generative Pretrained Transformer-o1 (OpenAI, San Francisco, CA), Claude (version 3.5; Anthropic, San Francisco, CA), and Gemini (version 1.5; Google, Mountain View, CA). These models will be referred to by their architectural design characteristics, i.e., dense transformer, sparse-expert, multimodal, and base transformer, to highlight their technical differences rather than their commercial identities.

Outcome variables: Primary outcomes included reviewer recommendations (accept = 3 to reject = 0) and responses to 6 Journal of Oral and Maxillofacial Surgery editor questions. Secondary outcomes comprised temporal stability (consistency of AI evaluations over time) analysis, domain-specific assessments (methodology, statistical analysis, clinical relevance, originality, and presentation clarity; 1 to 5 scale), and model clustering patterns.

Analyses: Agreement between AI and human recommendations was assessed using weighted Cohen's kappa. Intermodel reliability and temporal stability (24-hour interval) were evaluated using intraclass correlation coefficients. Domain scoring patterns were analyzed using multivariate analysis of variance with post hoc comparisons and hierarchical clustering.
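The agreement statistic described above can be illustrated with a short sketch. This is not the authors' code: the recommendation values are hypothetical, and the quadratic weighting scheme is an assumption, since the abstract says only "weighted Cohen's kappa" without specifying the weights.

```python
# Illustrative sketch: weighted Cohen's kappa between human and AI
# recommendations on the 0-3 scale (reject = 0 ... accept = 3).
# Data are made up; "quadratic" weighting is an assumption.
from sklearn.metrics import cohen_kappa_score

human_recs = [0, 0, 1, 3, 0, 2, 0, 1]  # hypothetical human recommendations
ai_recs    = [1, 1, 1, 3, 2, 2, 1, 1]  # hypothetical AI recommendations

# Weighted kappa penalizes large disagreements (e.g., 0 vs 3) more
# than adjacent ones (e.g., 1 vs 2), which suits an ordinal scale.
kappa = cohen_kappa_score(human_recs, ai_recs, weights="quadratic")
print(f"weighted kappa = {kappa:.2f}")
```

Values near 0.38 to 0.46, as reported in the Results, would conventionally be read as fair-to-moderate agreement.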

Results: From 22 manuscripts, human reviewers rejected 15 (68.2%), while AI rejection rates were statistically significantly lower (0% to 9.1%, P < .001). AI models demonstrated high consistency in their evaluations over time (intraclass correlation coefficient = 0.88, P < .001) and showed moderate agreement with human decisions (κ = 0.38 to 0.46).

Conclusions: While AI models showed reliable internal consistency, they were less likely to recommend rejection than human reviewers. This suggests their optimal use is as screening tools complementing expert human review rather than as replacements.

Source journal: Journal of Oral and Maxillofacial Surgery (Medicine – Dentistry & Oral Surgery)
CiteScore: 4.00
Self-citation rate: 5.30%
Articles published: 0
Review time: 41 days
Journal description: This monthly journal offers comprehensive coverage of new techniques, important developments and innovative ideas in oral and maxillofacial surgery. Practice-applicable articles help develop the methods used to handle dentoalveolar surgery, facial injuries and deformities, TMJ disorders, oral cancer, jaw reconstruction, anesthesia and analgesia. The journal also includes specifics on new instruments and diagnostic equipment and modern therapeutic drugs and devices. Journal of Oral and Maxillofacial Surgery is recommended for first or priority subscription by the Dental Section of the Medical Library Association.