Artificial intelligence meets medical expertise: evaluating GPT-4's proficiency in generating medical article abstracts

Ergin Sağtaş, Furkan Ufuk, H. Peker, A. B. Yağcı
{"title":"Artificial intelligence meets medical expertise: evaluating GPT-4's proficiency in generating medical article abstracts","authors":"Ergin Sağtaş, Furkan Ufuk, H. Peker, A. B. Yağcı","doi":"10.31362/patd.1487575","DOIUrl":null,"url":null,"abstract":"Purpose: The advent of large language models like GPT-4 has opened new possibilities in natural language processing, with potential applications in medical literature. This study assesses GPT-4's ability to generate medical abstracts. It compares their quality to original abstracts written by human authors, aiming to understand the effectiveness of artificial intelligence in replicating complex, professional writing tasks. \nMaterials and Methods: A total of 250 original research articles from five prominent radiology journals published between 2021 and 2023 were selected. The body of these articles, excluding the abstracts, was fed into GPT-4, which then generated new abstracts. Three experienced radiologists blindly and independently evaluated all 500 abstracts using a five-point Likert scale for quality and understandability. Statistical analysis included mean score comparison inter-rater reliability using Fleiss' Kappa and Bland-Altman plots to assess agreement levels between raters. \nResults: Analysis revealed no significant difference in the mean scores between original and GPT-4 generated abstracts. The inter-rater reliability yielded kappa values indicating moderate to substantial agreement: 0.497 between Observers 1 and 2, 0.753 between Observers 1 and 3, and 0.645 between Observers 2 and 3. Bland-Altman analysis showed a slight systematic bias but was within acceptable limits of agreement. \nConclusion: The study demonstrates that GPT-4 can generate medical abstracts with a quality comparable to those written by human experts. This suggests a promising role for artificial intelligence in facilitating the abstract writing process and improving its quality.","PeriodicalId":506150,"journal":{"name":"Pamukkale Medical Journal","volume":"30 29","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-06-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Pamukkale Medical Journal","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.31362/patd.1487575","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Purpose: The advent of large language models like GPT-4 has opened new possibilities in natural language processing, with potential applications in medical literature. This study assesses GPT-4's ability to generate medical abstracts and compares their quality to the original abstracts written by human authors, aiming to understand how effectively artificial intelligence can replicate complex, professional writing tasks.

Materials and Methods: A total of 250 original research articles from five prominent radiology journals published between 2021 and 2023 were selected. The body of each article, excluding the abstract, was fed into GPT-4, which then generated a new abstract. Three experienced radiologists blindly and independently evaluated all 500 abstracts for quality and understandability using a five-point Likert scale. Statistical analysis included comparison of mean scores, inter-rater reliability assessed with Fleiss' Kappa, and Bland-Altman plots to evaluate agreement between raters.

Results: Analysis revealed no significant difference in mean scores between the original and GPT-4-generated abstracts. Inter-rater reliability yielded kappa values indicating moderate to substantial agreement: 0.497 between Observers 1 and 2, 0.753 between Observers 1 and 3, and 0.645 between Observers 2 and 3. Bland-Altman analysis showed a slight systematic bias that remained within acceptable limits of agreement.

Conclusion: The study demonstrates that GPT-4 can generate medical abstracts of a quality comparable to those written by human experts, suggesting a promising role for artificial intelligence in facilitating the abstract-writing process and improving its quality.
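The abstract-generation step can be illustrated with a short sketch. This is a hypothetical reconstruction assuming access to GPT-4 through the OpenAI Python SDK; the paper does not specify the prompt or interface it used, so the system prompt and the `generate_abstract` helper below are illustrative only.

```python
# Hypothetical sketch of the abstract-generation step via the OpenAI Python
# SDK. The prompt wording and the generate_abstract helper are assumptions
# for illustration; the paper does not describe its exact setup.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_abstract(article_body: str) -> str:
    """Ask GPT-4 to write a structured abstract from an article's full text."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a medical writer. Write a structured abstract "
                    "(Purpose, Materials and Methods, Results, Conclusion) "
                    "for the following radiology research article."
                ),
            },
            {"role": "user", "content": article_body},
        ],
    )
    return response.choices[0].message.content
```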
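The statistical analysis (pairwise Fleiss' Kappa and Bland-Altman limits of agreement) could be reproduced along these lines. This is a minimal sketch assuming the ratings are held as a 500 × 3 NumPy matrix of Likert scores; the variable names and random placeholder data are not from the paper. Running each pair of raters separately mirrors the paper's pairwise reporting of agreement between Observers 1, 2, and 3.

```python
# Minimal sketch of the inter-rater analysis, assuming ratings are held in a
# (500 abstracts x 3 radiologists) integer matrix of Likert scores (1-5).
# The random placeholder data below stands in for the study's actual ratings.
from itertools import combinations

import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

rng = np.random.default_rng(0)
ratings = rng.integers(1, 6, size=(500, 3))  # placeholder, not study data

def bland_altman(a, b):
    """Bias (mean difference) and 95% limits of agreement for two raters."""
    diff = a.astype(float) - b.astype(float)
    bias = diff.mean()
    sd = diff.std(ddof=1)
    return bias, (bias - 1.96 * sd, bias + 1.96 * sd)

for i, j in combinations(range(3), 2):
    # Fleiss' kappa expects a (subjects x categories) count table.
    table, _ = aggregate_raters(ratings[:, [i, j]])
    kappa = fleiss_kappa(table, method="fleiss")
    bias, (lo, hi) = bland_altman(ratings[:, i], ratings[:, j])
    print(f"Observers {i + 1} & {j + 1}: kappa={kappa:.3f}, "
          f"bias={bias:.2f}, LoA=[{lo:.2f}, {hi:.2f}]")
```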