Scott A. Helgeson MD, Patrick W. Johnson BS, Nilaa Gopikrishnan MD, Tapendra Koirala MD, Pablo Moreno-Franco MD, Rickey E. Carter PhD, Zachary S. Quicksall MS, Charles D. Burger MD
{"title":"人类审稿人区分人类撰写或人工智能生成的医学手稿的能力:一项随机调查研究。","authors":"Scott A. Helgeson MD , Patrick W. Johnson BS , Nilaa Gopikrishnan MD , Tapendra Koirala MD , Pablo Moreno-Franco MD , Rickey E. Carter PhD , Zachary S. Quicksall MS , Charles D. Burger MD","doi":"10.1016/j.mayocp.2024.08.029","DOIUrl":null,"url":null,"abstract":"<div><h3>Objective</h3><div>To assess the ability of humans to differentiate human-authored vs artificial intelligence (AI)–generated medical manuscripts.</div></div><div><h3>Methods</h3><div>This is a prospective randomized survey study from October 1, 2023, to December 1, 2023, from a single academic center. Artificial intelligence–generated medical manuscripts were created using ChatGPT 3.5 and were evaluated alongside randomly selected human-authored manuscripts. Participants, who were blinded from manuscript selection and creation, were randomized to receive three manuscripts that were either human-authored or AI-generated and had to fill out a survey questionnaire after review regarding who authored the manuscript. The primary outcome was accuracy of human reviewers in differentiating manuscript authors. Secondary outcomes were to identify factors that influenced prediction accuracy.</div></div><div><h3>Results</h3><div>Fifty-one physicians were included in the study, including 12 post-doctorates, 19 assistant professors, and 20 associate or full professors. The overall specificity of 55.6% (95% CI, 30.8% to 78.5%), sensitivity of 31.2% (95% CI,11.0% to 58.7%), positive predictive value of 38.5% (95% CI,13.9% to 68.4%) and negative predictive value of 47.6% (95% CI, 25.7% to 70.2%). A stratified analysis of human-authored manuscripts indicated that high-impact factor manuscripts were identified with higher accuracy than low-impact factor ones (<em>P</em>=.037). For individual-level data, neither academic rank nor prior manuscript review experience significantly predicted the accuracy. The frequency of AI interaction was a significant factor, with occasional (odds ratio [OR], 8.20; <em>P</em>=.016), fairly frequent (OR, 7.13; <em>P</em>=.033), and very frequent (OR, 8.36; <em>P</em>=.030) use associated with correct identification. Further analysis revealed no significant predictors among the papers' qualities.</div></div><div><h3>Conclusion</h3><div>Generative AI such as ChatGPT could create medical manuscripts that could not be differentiated from human-authored manuscripts.</div></div>","PeriodicalId":18334,"journal":{"name":"Mayo Clinic proceedings","volume":"100 4","pages":"Pages 622-633"},"PeriodicalIF":6.9000,"publicationDate":"2025-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Human Reviewers' Ability to Differentiate Human-Authored or Artificial Intelligence–Generated Medical Manuscripts\",\"authors\":\"Scott A. Helgeson MD , Patrick W. Johnson BS , Nilaa Gopikrishnan MD , Tapendra Koirala MD , Pablo Moreno-Franco MD , Rickey E. Carter PhD , Zachary S. Quicksall MS , Charles D. Burger MD\",\"doi\":\"10.1016/j.mayocp.2024.08.029\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><h3>Objective</h3><div>To assess the ability of humans to differentiate human-authored vs artificial intelligence (AI)–generated medical manuscripts.</div></div><div><h3>Methods</h3><div>This is a prospective randomized survey study from October 1, 2023, to December 1, 2023, from a single academic center. 
Artificial intelligence–generated medical manuscripts were created using ChatGPT 3.5 and were evaluated alongside randomly selected human-authored manuscripts. Participants, who were blinded from manuscript selection and creation, were randomized to receive three manuscripts that were either human-authored or AI-generated and had to fill out a survey questionnaire after review regarding who authored the manuscript. The primary outcome was accuracy of human reviewers in differentiating manuscript authors. Secondary outcomes were to identify factors that influenced prediction accuracy.</div></div><div><h3>Results</h3><div>Fifty-one physicians were included in the study, including 12 post-doctorates, 19 assistant professors, and 20 associate or full professors. The overall specificity of 55.6% (95% CI, 30.8% to 78.5%), sensitivity of 31.2% (95% CI,11.0% to 58.7%), positive predictive value of 38.5% (95% CI,13.9% to 68.4%) and negative predictive value of 47.6% (95% CI, 25.7% to 70.2%). A stratified analysis of human-authored manuscripts indicated that high-impact factor manuscripts were identified with higher accuracy than low-impact factor ones (<em>P</em>=.037). For individual-level data, neither academic rank nor prior manuscript review experience significantly predicted the accuracy. The frequency of AI interaction was a significant factor, with occasional (odds ratio [OR], 8.20; <em>P</em>=.016), fairly frequent (OR, 7.13; <em>P</em>=.033), and very frequent (OR, 8.36; <em>P</em>=.030) use associated with correct identification. Further analysis revealed no significant predictors among the papers' qualities.</div></div><div><h3>Conclusion</h3><div>Generative AI such as ChatGPT could create medical manuscripts that could not be differentiated from human-authored manuscripts.</div></div>\",\"PeriodicalId\":18334,\"journal\":{\"name\":\"Mayo Clinic proceedings\",\"volume\":\"100 4\",\"pages\":\"Pages 622-633\"},\"PeriodicalIF\":6.9000,\"publicationDate\":\"2025-04-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Mayo Clinic proceedings\",\"FirstCategoryId\":\"3\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0025619624004890\",\"RegionNum\":2,\"RegionCategory\":\"医学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"MEDICINE, GENERAL & INTERNAL\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Mayo Clinic proceedings","FirstCategoryId":"3","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0025619624004890","RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"MEDICINE, GENERAL & INTERNAL","Score":null,"Total":0}
Human Reviewers' Ability to Differentiate Human-Authored or Artificial Intelligence–Generated Medical Manuscripts
Objective
To assess the ability of humans to differentiate between human-authored and artificial intelligence (AI)–generated medical manuscripts.
Methods
This was a prospective randomized survey study conducted from October 1, 2023, to December 1, 2023, at a single academic center. Artificial intelligence–generated medical manuscripts were created with ChatGPT 3.5 and evaluated alongside randomly selected human-authored manuscripts. Participants, who were blinded to manuscript selection and creation, were randomized to receive three manuscripts, each either human-authored or AI-generated, and after review completed a survey indicating who they believed had authored each manuscript. The primary outcome was the accuracy of human reviewers in differentiating manuscript authorship. The secondary outcome was identification of factors that influenced prediction accuracy. A minimal sketch of this allocation scheme appears below.
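The sketch below illustrates one plausible reading of the randomization described above: each blinded participant is assigned three manuscripts drawn from a single arm (human-authored or AI-generated). The pools, participant IDs, and the all-from-one-arm assumption are hypothetical placeholders; the study's actual allocation scheme may have differed.

```python
# A minimal sketch of the randomization described in Methods, assuming each
# participant's three manuscripts came from a single arm. All identifiers
# below are hypothetical placeholders, not the study's materials.
import random

human_pool = ["H1", "H2", "H3", "H4", "H5"]  # hypothetical human-authored manuscripts
ai_pool = ["A1", "A2", "A3", "A4", "A5"]     # hypothetical AI-generated manuscripts

def assign(participant_id: str, rng: random.Random) -> dict:
    """Randomize one blinded participant to three manuscripts of one type."""
    arm, pool = rng.choice([("human", human_pool), ("ai", ai_pool)])
    return {
        "participant": participant_id,
        "arm": arm,                        # known to the study team, not the reviewer
        "manuscripts": rng.sample(pool, k=3),
    }

rng = random.Random(2023)  # fixed seed for a reproducible illustration
print(assign("reviewer_01", rng))
```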
Results
Fifty-one physicians were included in the study: 12 postdoctoral fellows, 19 assistant professors, and 20 associate or full professors. Overall specificity was 55.6% (95% CI, 30.8% to 78.5%), sensitivity was 31.2% (95% CI, 11.0% to 58.7%), positive predictive value was 38.5% (95% CI, 13.9% to 68.4%), and negative predictive value was 47.6% (95% CI, 25.7% to 70.2%). A stratified analysis of human-authored manuscripts indicated that manuscripts from high-impact-factor journals were identified with higher accuracy than those from low-impact-factor journals (P=.037). At the individual level, neither academic rank nor prior manuscript review experience significantly predicted accuracy. The frequency of AI interaction was a significant factor, with occasional (odds ratio [OR], 8.20; P=.016), fairly frequent (OR, 7.13; P=.033), and very frequent (OR, 8.36; P=.030) use associated with correct identification. Further analysis revealed no significant predictors among the manuscripts' qualities.
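For readers less familiar with these diagnostic-accuracy metrics, the sketch below shows how sensitivity, specificity, and the predictive values are computed from a 2×2 confusion matrix, assuming "AI-generated" is treated as the positive class. The counts are hypothetical and chosen only so the outputs land near the point estimates reported above; they are not the study's actual data.

```python
# Minimal sketch of the diagnostic-accuracy metrics reported in Results.
# The "positive" class is assumed to be "manuscript is AI-generated".

def accuracy_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute standard 2x2 confusion-matrix metrics.

    tp: AI-generated manuscripts correctly labeled AI-generated
    fp: human-authored manuscripts incorrectly labeled AI-generated
    tn: human-authored manuscripts correctly labeled human-authored
    fn: AI-generated manuscripts incorrectly labeled human-authored
    """
    return {
        "sensitivity": tp / (tp + fn),  # true-positive rate
        "specificity": tn / (tn + fp),  # true-negative rate
        "ppv": tp / (tp + fp),          # positive predictive value
        "npv": tn / (tn + fn),          # negative predictive value
    }

# Hypothetical counts, chosen only to illustrate the arithmetic.
print(accuracy_metrics(tp=5, fp=8, tn=10, fn=11))
# -> sensitivity 0.3125, specificity ~0.556, ppv ~0.385, npv ~0.476
```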
Conclusion
Generative AI tools such as ChatGPT can create medical manuscripts that human reviewers cannot reliably differentiate from human-authored manuscripts.
About the Journal
Mayo Clinic Proceedings is a premier peer-reviewed clinical journal in general medicine. Sponsored by Mayo Clinic, it is one of the most widely read and highly cited scientific publications for physicians. Since 1926, Mayo Clinic Proceedings has continuously published articles that focus on clinical medicine and support the professional and educational needs of its readers. The journal welcomes submissions from authors worldwide and has published Nobel Prize-winning research. With an Impact Factor of 8.9, Mayo Clinic Proceedings is ranked #20 out of 167 journals in the Medicine, General and Internal category, placing it in the top 12% of these journals. It invites manuscripts on clinical and laboratory medicine, health care policy and economics, medical education and ethics, and related topics.