Artificial Intelligence-Generated Editorials in Radiology: Can Expert Editors Detect Them?

Burak Berksu Ozkara, Alexandre Boutet, Bryan A. Comstock, Johan Van Goethem, Thierry A. G. M. Huisman, Jeffrey S. Ross, Luca Saba, Lubdha M. Shah, Max Wintermark, Mauricio Castillo

American Journal of Neuroradiology, published September 17, 2024. DOI: 10.3174/ajnr.a8505
Abstract
BACKGROUND AND PURPOSE
We aimed to evaluate GPT-4's ability to write radiology editorials and to compare these with human-written counterparts, thereby determining their real-world applicability for scientific writing.
MATERIALS AND METHODS
Sixteen editorials from eight journals were included. To generate the AI-written editorials, a summary of each of the 16 human-written editorials was fed into GPT-4. Six experienced editors reviewed the articles. First, an unpaired approach was used: the raters evaluated the content of each article across specified metrics on a 1-5 Likert scale and then judged whether each editorial was written by a human or by AI. The articles were then evaluated in pairs to determine which article was AI-generated and which should be published. Finally, the articles were analyzed with an AI detector and checked for plagiarism.
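For illustration, the generation step could be scripted along the lines below. This is a minimal sketch using the OpenAI Python client; the model name, prompt wording, and system message are assumptions for illustration, not the authors' actual protocol.

```python
# Minimal sketch of the summary-to-editorial generation step.
# The prompt and parameters are illustrative assumptions, not the
# study's actual protocol.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def generate_editorial(summary: str) -> str:
    """Ask GPT-4 to draft a radiology editorial from a short summary."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": "You are an experienced neuroradiology editor."},
            {"role": "user",
             "content": f"Write a journal editorial based on this summary:\n\n{summary}"},
        ],
    )
    return response.choices[0].message.content
```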
RESULTS
The human-written articles had a median AI probability score of 2.0%, whereas the AI-written articles had a median of 58%. The median similarity (plagiarism) score among AI-written articles was 3%. In the unpaired setting, 58% of articles were correctly classified by authorship; accuracy rose to 70% in the paired setting. AI-written articles received slightly higher scores on most metrics, but when stratified by perception, articles perceived as human-written were rated higher in most categories. In the paired setting, raters strongly preferred publishing the article they perceived as human-written (82%).
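One way to read these accuracies is against the 50% expected by chance. The sketch below runs a one-sided binomial test with SciPy; the judgment counts are assumptions for illustration (6 raters judging 32 unpaired articles and 16 pairs), and the paper's own statistical analysis may have been done differently.

```python
# Chance-level check for the reported classification accuracies.
# Judgment counts are assumptions (6 raters x 32 unpaired articles,
# 6 raters x 16 pairs), not figures from the paper.
from scipy.stats import binomtest


def above_chance(accuracy: float, n_judgments: int) -> float:
    """One-sided binomial test of observed accuracy against 50% chance."""
    k = round(accuracy * n_judgments)  # number of correct judgments
    return binomtest(k, n_judgments, p=0.5, alternative="greater").pvalue


print(above_chance(0.58, 6 * 32))  # unpaired: 58% over 192 judgments
print(above_chance(0.70, 6 * 16))  # paired: 70% over 96 judgments
```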
CONCLUSIONS
GPT-4 can write high-quality editorials that iThenticate does not flag as plagiarized, that editors may fail to recognize as AI-written, and that AI detection tools identify only to a limited extent. Editors showed a positive bias toward human-written articles.
ABBREVIATIONS
AI = artificial intelligence; LLM = large language model; SD = standard deviation.
About the Journal
The mission of AJNR is to further knowledge in all aspects of neuroimaging, head and neck imaging, and spine imaging for neuroradiologists, radiologists, trainees, scientists, and associated professionals through print and/or electronic publication of quality peer-reviewed articles that lead to the highest standards in patient care, research, and education, and to promote discussion of these and other issues through its electronic activities.