Can AI outperform professional writers in summarizing foot and ankle literature?

Seth L. Warren, DPM; Steven R. Cooperman, DPM, MBA, AACFAS

Foot & Ankle Surgery (New York, N.Y.), 5(3), Article 100522
Publication date: June 6, 2025
DOI: 10.1016/j.fastrc.2025.100522
URL: https://www.sciencedirect.com/science/article/pii/S2667396725000576
Abstract
This study evaluates the performance of an advanced large language model in summarizing scientific literature within the specialized field of foot and ankle surgery. Building upon prior work that demonstrated ChatGPT-3.5's comparability to podiatric residents, this investigation compares ChatGPT-4.5 directly against paid, professionally written summaries sourced from Foot and Ankle Quarterly. Ten original research articles were summarized by ChatGPT-4.5 and matched with corresponding professionally written summaries. Quantitative analysis using BLEU and ROUGE metrics assessed textual similarity, while Flesch Reading Ease and Flesch-Kincaid Grade Level scores evaluated readability. A qualitative preference survey was conducted among three blinded, fellowship-trained foot and ankle surgeons. AI-generated summaries were preferred in 73.33% of comparisons and contained no factual inaccuracies. Although professionally written summaries scored as more readable, AI-generated summaries maintained more consistent language complexity. ROUGE scores suggested substantial content overlap between AI-generated and reference summaries, whereas lower BLEU scores reflected differences in phrasing, possibly attributable to the shorter length of the AI summaries. These findings suggest ChatGPT-4.5 can reliably and efficiently produce accurate, high-quality summaries, potentially surpassing paid academic writers in certain domains. Broader implications include improved efficiency in academic research and literature review. Continued investigation and oversight are necessary to guide the responsible integration of AI tools into clinical and scholarly workflows.

Level of evidence
III, comparative study
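For readers unfamiliar with the four metrics named in the abstract, the sketch below shows how such a pairwise comparison could be computed in Python. The paper does not specify its tooling, so the package choices (nltk, rouge-score, textstat) and the compare_summaries helper are illustrative assumptions, not the authors' pipeline.

# Minimal sketch of the kind of metric pipeline the study describes:
# BLEU and ROUGE for textual similarity against a professional reference
# summary, Flesch scores for readability of each summary on its own.
# Package and function choices here are assumptions; the paper does not
# state which implementations were used.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer
import textstat

def compare_summaries(ai_summary: str, professional_summary: str) -> dict:
    # BLEU: n-gram precision of the AI summary against the professional
    # reference; smoothing avoids zero scores on short texts.
    bleu = sentence_bleu(
        [professional_summary.split()],
        ai_summary.split(),
        smoothing_function=SmoothingFunction().method1,
    )
    # ROUGE-1 / ROUGE-L: unigram and longest-common-subsequence overlap,
    # reported as F-measures.
    scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
    rouge = scorer.score(professional_summary, ai_summary)
    return {
        "bleu": bleu,
        "rouge1_f": rouge["rouge1"].fmeasure,
        "rougeL_f": rouge["rougeL"].fmeasure,
        # Readability: higher Flesch Reading Ease means easier text;
        # Flesch-Kincaid approximates a U.S. school grade level.
        "ai_flesch_ease": textstat.flesch_reading_ease(ai_summary),
        "ai_fk_grade": textstat.flesch_kincaid_grade(ai_summary),
        "pro_flesch_ease": textstat.flesch_reading_ease(professional_summary),
        "pro_fk_grade": textstat.flesch_kincaid_grade(professional_summary),
    }

Note how these metrics can diverge: a short AI summary that reuses the reference's vocabulary but little of its exact phrasing would score well on ROUGE-1 yet poorly on BLEU, which is consistent with the pattern the abstract reports.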