{"title":"Evaluating the potential risks of employing large language models in peer review","authors":"Lingxuan Zhu, Yancheng Lai, Jiarui Xie, Weiming Mou, Lihaoyun Huang, Chang Qi, Tao Yang, Aimin Jiang, Wenyi Gan, Dongqiang Zeng, Bufu Tang, Mingjia Xiao, Guangdi Chu, Zaoqu Liu, Quan Cheng, Anqi Lin, Peng Luo","doi":"10.1002/ctd2.70067","DOIUrl":null,"url":null,"abstract":"<div>\n \n \n <section>\n \n <h3> Objective</h3>\n \n <p>This study aims to systematically investigate the potential harms of Large Language Models (LLMs) in the peer review process.</p>\n </section>\n \n <section>\n \n <h3> Background</h3>\n \n <p>LLMs are increasingly used in academic processes, including peer review. While they can address challenges like reviewer scarcity and review efficiency, concerns about fairness, transparency and potential biases in LLM-generated reviews have not been thoroughly investigated.</p>\n </section>\n \n <section>\n \n <h3> Methods</h3>\n \n <p>Claude 2.0 was used to generate peer review reports, rejection recommendations, citation requests and refutations for 20 original, unmodified cancer biology manuscripts obtained from <i>eLife</i>'s new publishing model. Artificial intelligence (AI) detection tools (zeroGPT and GPTzero) assessed whether the reviews were identifiable as LLM-generated.All LLM-generated outputs were evaluated for reasonableness by two expert on a five-point Likert scale.</p>\n </section>\n \n <section>\n \n <h3> Results</h3>\n \n <p>LLM-generated reviews were somewhat consistent with human reviews but lacked depth, especially in detailed critique. The model proved highly proficient at generating convincing rejection comments and could create plausible citation requests, including requests for unrelated references. AI detectors struggled to identify LLM-generated reviews, with 82.8% of responses classified as human-written by GPTzero.</p>\n </section>\n \n <section>\n \n <h3> Conclusions</h3>\n \n <p>LLMs can be readily misused to undermine the peer review process by generating biased, manipulative, and difficult-to-detect content, posing a significant threat to academic integrity. Guidelines and detection tools are needed to ensure LLMs enhance rather than harm the peer review process.</p>\n </section>\n </div>","PeriodicalId":72605,"journal":{"name":"Clinical and translational discovery","volume":"5 4","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2025-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1002/ctd2.70067","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Clinical and translational discovery","FirstCategoryId":"1085","ListUrlMain":"https://onlinelibrary.wiley.com/doi/10.1002/ctd2.70067","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Objective
This study aims to systematically investigate the potential harms of Large Language Models (LLMs) in the peer review process.
Background
LLMs are increasingly used in academic processes, including peer review. While they can help address challenges such as reviewer scarcity and limited review efficiency, concerns about fairness, transparency and potential biases in LLM-generated reviews have not been thoroughly investigated.
Methods
Claude 2.0 was used to generate peer review reports, rejection recommendations, citation requests and refutations for 20 original, unmodified cancer biology manuscripts obtained from eLife's new publishing model. Artificial intelligence (AI) detection tools (ZeroGPT and GPTZero) assessed whether the reviews were identifiable as LLM-generated. All LLM-generated outputs were evaluated for reasonableness by two experts on a five-point Likert scale.
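To make the setup concrete, the following is a minimal sketch of how such a review could be generated programmatically. It assumes the Anthropic Python SDK; the prompt wording, the manuscript placeholder and the exact model string are illustrative assumptions, not details reported in the study.

import anthropic

# The client reads the API key from the ANTHROPIC_API_KEY environment variable.
client = anthropic.Anthropic()

# Placeholder for the full text of one manuscript (the study used 20 unmodified
# eLife cancer-biology manuscripts; the variable name here is illustrative).
manuscript_text = "..."

# Illustrative prompt asking the model to draft a rejection-oriented review.
prompt = (
    "You are acting as a peer reviewer. Read the manuscript below and write "
    "a review report that recommends rejection, pointing out methodological "
    "weaknesses.\n\n" + manuscript_text
)

response = client.messages.create(
    model="claude-2.0",  # model family named in the study; a current model can be substituted
    max_tokens=1024,
    messages=[{"role": "user", "content": prompt}],
)

# The generated review text, which the study then scored by expert raters
# and ran through AI-text detectors.
review_text = response.content[0].text
print(review_text)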
Results
LLM-generated reviews were somewhat consistent with human reviews but lacked depth, especially in detailed critique. The model proved highly proficient at generating convincing rejection comments and could produce plausible citation requests, including requests for unrelated references. AI detectors struggled to identify LLM-generated reviews, with GPTZero classifying 82.8% of responses as human-written.
Conclusions
LLMs can be readily misused to undermine the peer review process by generating biased, manipulative, and difficult-to-detect content, posing a significant threat to academic integrity. Guidelines and detection tools are needed to ensure LLMs enhance rather than harm the peer review process.