Rohan Khera, Aline F Pedroso, Vipina K Keloth, Hua Xu, Gisele S Silva, Lee H Schwamm
Scientific Writing in the Era of Large Language Models: A Computational Analysis of AI- Versus Human-Created Content
Stroke, pp. 3078-3083. Published 2025-10-01 (Epub 2025-08-15). DOI: 10.1161/STROKEAHA.125.051913
Citations: 0
Abstract
Background: Large language models (LLMs) are artificial intelligence (AI) tools that can generate human expert-like content and accelerate the synthesis of scientific literature, but they can also spread misinformation by producing misleading content. This study sought to characterize the linguistic features that distinguish AI-generated from human-authored scientific text and to evaluate the performance of AI detection tools for this task.
Methods: We conducted a computational synthesis of 34 essays on cerebrovascular topics (12 generated by large language models [Generative Pre-trained Transformer 4, Generative Pre-trained Transformer 3.5, Llama-2, and Bard] and 22 written by human scientists). Each essay was rated as AI-generated or human-authored by up to 38 members of the Stroke editorial board. We compared the collective performance of the experts against GPTZero, a widely used online AI detection tool. We extracted and compared linguistic features spanning syntax (word count, complexity, and so on), semantics (polarity), readability (Flesch scores), grade level (Flesch-Kincaid), and language perplexity (or predictability) to characterize linguistic differences between AI-generated and human-written content.
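The readability metrics named above have closed-form definitions over simple text counts. As a rough illustration (not the authors' actual pipeline; the syllable counter here is a crude vowel-group heuristic, whereas production tools typically use pronunciation dictionaries), Flesch Reading Ease and Flesch-Kincaid grade level can be computed as:

```python
import re

def count_syllables(word):
    # Crude heuristic: count contiguous vowel groups, discount a trailing
    # silent "e"; real readability tools use pronunciation dictionaries.
    word = word.lower()
    n = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and n > 1:
        n -= 1
    return max(n, 1)

def readability(text):
    # Returns (Flesch Reading Ease, Flesch-Kincaid grade level).
    sentences = max(len(re.findall(r"[.!?]+", text)), 1)
    words = re.findall(r"[A-Za-z']+", text)
    n_words = max(len(words), 1)
    syllables = sum(count_syllables(w) for w in words)
    wps = n_words / sentences   # average words per sentence
    spw = syllables / n_words   # average syllables per word
    flesch = 206.835 - 1.015 * wps - 84.6 * spw
    fk_grade = 0.39 * wps + 11.8 * spw - 15.59
    return flesch, fk_grade
```

Lower Flesch scores indicate harder-to-read text, and higher Flesch-Kincaid values indicate a higher school grade level, which matches the direction of the differences reported in the Results.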
Results: A majority (>50%) of the stroke experts who reviewed the study essays correctly identified 10 of the 12 (83.3%) AI-generated essays as AI, whereas they misclassified 7 of the 22 (31.8%) human-written essays as AI. GPTZero accurately classified 12 (100%) of the AI-generated and 21 (95.5%) of the human-written essays, although the tool relied on only a few key sentences for classification. Compared with human essays, AI-generated content had a lower word count and complexity and exhibited significantly lower perplexity (median, 15.0 for human versus 7.2 for AI; P<0.001), lower readability scores (Flesch median, 42.1 versus 26.4; P<0.001), and a higher grade level (Flesch-Kincaid median, 13.1 versus 14.8; P=0.006).
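Perplexity summarizes how predictable a text is under a language model: it is the exponential of the negative mean per-token log-probability, so more predictable (lower-perplexity) text is the signature the study associates with AI-generated content. A minimal sketch, assuming per-token log-probabilities are already available from some language model (the study's exact perplexity estimator is not described here):

```python
import math

def perplexity(token_logprobs):
    # perplexity = exp(-average log-probability per token);
    # confident (high-probability) predictions yield low perplexity.
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Toy illustration: a model that assigns probability 0.5 to every
# token gives a perplexity of exactly 2.
uniform_half = [math.log(0.5)] * 4
```

In practice the log-probabilities would come from scoring each token of an essay with a pretrained language model; text the model finds predictable (as AI-generated prose tends to be) scores lower than idiosyncratic human writing.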
Conclusions: Large language models generate scientific content with measurable differences from human-written text, but these differences are not consistently identifiable even by human experts and require sophisticated AI detection tools to surface. Given the challenges that experts face in distinguishing AI from human content, technology-assisted tools are needed wherever human provenance is essential to safeguarding the integrity of scientific communication.
About the journal:
Stroke is a monthly publication that collates reports of clinical and basic investigation of any aspect of the cerebral circulation and its diseases. The publication covers a wide range of disciplines including anesthesiology, critical care medicine, epidemiology, internal medicine, neurology, neuro-ophthalmology, neuropathology, neuropsychology, neurosurgery, nuclear medicine, nursing, radiology, rehabilitation, speech pathology, vascular physiology, and vascular surgery.
The audience of Stroke includes neurologists, basic scientists, cardiologists, vascular surgeons, internists, interventionalists, neurosurgeons, nurses, and physiatrists.
Stroke is indexed in Biological Abstracts, BIOSIS, CAB Abstracts, Chemical Abstracts, CINAHL, Current Contents, Embase, MEDLINE, and Science Citation Index Expanded.