{"title":"Evaluating the potential risks of employing large language models in peer review","authors":"Lingxuan Zhu, Yancheng Lai, Jiarui Xie, Weiming Mou, Lihaoyun Huang, Chang Qi, Tao Yang, Aimin Jiang, Wenyi Gan, Dongqiang Zeng, Bufu Tang, Mingjia Xiao, Guangdi Chu, Zaoqu Liu, Quan Cheng, Anqi Lin, Peng Luo","doi":"10.1002/ctd2.70067","DOIUrl":null,"url":null,"abstract":"<div>\n \n \n <section>\n \n <h3> Objective</h3>\n \n <p>This study aims to systematically investigate the potential harms of Large Language Models (LLMs) in the peer review process.</p>\n </section>\n \n <section>\n \n <h3> Background</h3>\n \n <p>LLMs are increasingly used in academic processes, including peer review. While they can address challenges like reviewer scarcity and review efficiency, concerns about fairness, transparency and potential biases in LLM-generated reviews have not been thoroughly investigated.</p>\n </section>\n \n <section>\n \n <h3> Methods</h3>\n \n <p>Claude 2.0 was used to generate peer review reports, rejection recommendations, citation requests and refutations for 20 original, unmodified cancer biology manuscripts obtained from <i>eLife</i>'s new publishing model. Artificial intelligence (AI) detection tools (zeroGPT and GPTzero) assessed whether the reviews were identifiable as LLM-generated.All LLM-generated outputs were evaluated for reasonableness by two expert on a five-point Likert scale.</p>\n </section>\n \n <section>\n \n <h3> Results</h3>\n \n <p>LLM-generated reviews were somewhat consistent with human reviews but lacked depth, especially in detailed critique. The model proved highly proficient at generating convincing rejection comments and could create plausible citation requests, including requests for unrelated references. AI detectors struggled to identify LLM-generated reviews, with 82.8% of responses classified as human-written by GPTzero.</p>\n </section>\n \n <section>\n \n <h3> Conclusions</h3>\n \n <p>LLMs can be readily misused to undermine the peer review process by generating biased, manipulative, and difficult-to-detect content, posing a significant threat to academic integrity. Guidelines and detection tools are needed to ensure LLMs enhance rather than harm the peer review process.</p>\n </section>\n </div>","PeriodicalId":72605,"journal":{"name":"Clinical and translational discovery","volume":"5 4","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2025-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1002/ctd2.70067","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Clinical and translational discovery","FirstCategoryId":"1085","ListUrlMain":"https://onlinelibrary.wiley.com/doi/10.1002/ctd2.70067","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Objective
This study aims to systematically investigate the potential harms of Large Language Models (LLMs) in the peer review process.
Background
LLMs are increasingly used in academic processes, including peer review. While they can help address challenges such as reviewer scarcity and limited review efficiency, concerns about fairness, transparency and potential biases in LLM-generated reviews have not been thoroughly investigated.
Methods
Claude 2.0 was used to generate peer review reports, rejection recommendations, citation requests and refutations for 20 original, unmodified cancer biology manuscripts obtained from eLife's new publishing model. Artificial intelligence (AI) detection tools (ZeroGPT and GPTZero) assessed whether the reviews were identifiable as LLM-generated. All LLM-generated outputs were evaluated for reasonableness by two experts on a five-point Likert scale.
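To make the setup concrete, the following is a minimal sketch of how such a review could be generated programmatically. It assumes the Anthropic Python SDK; the prompt wording, the manuscript placeholder and the exact model string are illustrative assumptions, not details reported in the study.

import anthropic

# The client reads the API key from the ANTHROPIC_API_KEY environment variable.
client = anthropic.Anthropic()

# Placeholder for the full text of one manuscript (the study used 20 unmodified
# eLife cancer-biology manuscripts; the variable name here is illustrative).
manuscript_text = "..."

# Illustrative prompt asking the model to draft a rejection-oriented review.
prompt = (
    "You are acting as a peer reviewer. Read the manuscript below and write "
    "a review report that recommends rejection, pointing out methodological "
    "weaknesses.\n\n" + manuscript_text
)

response = client.messages.create(
    model="claude-2.0",  # model family named in the study; a current model can be substituted
    max_tokens=1024,
    messages=[{"role": "user", "content": prompt}],
)

# The generated review text, which the study then scored by expert raters
# and ran through AI-text detectors.
review_text = response.content[0].text
print(review_text)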
Results
LLM-generated reviews were somewhat consistent with human reviews but lacked depth, especially in detailed critique. The model proved highly proficient at generating convincing rejection comments and could produce plausible citation requests, including requests for unrelated references. AI detectors struggled to identify LLM-generated reviews, with GPTZero classifying 82.8% of responses as human-written.
Conclusions
LLMs can be readily misused to undermine the peer review process by generating biased, manipulative, and difficult-to-detect content, posing a significant threat to academic integrity. Guidelines and detection tools are needed to ensure LLMs enhance rather than harm the peer review process.