Evaluating the Capability of Large Language Model Chatbots for Generating Plain Language Summaries in Radiology

iRadiology | Pub Date: 2025-08-02 | DOI: 10.1002/ird3.70030
Pradosh Kumar Sarangi, Pratisruti Hui, Himel Mondal, Debasish Swapnesh Kumar Nayak, M. Sarthak Swarup, Ishan, Swaha Panda
{"title":"评估大型语言模型聊天机器人在放射学中生成简单语言摘要的能力","authors":"Pradosh Kumar Sarangi,&nbsp;Pratisruti Hui,&nbsp;Himel Mondal,&nbsp;Debasish Swapnesh Kumar Nayak,&nbsp;M. Sarthak Swarup,&nbsp; Ishan,&nbsp;Swaha Panda","doi":"10.1002/ird3.70030","DOIUrl":null,"url":null,"abstract":"<div>\n \n \n <section>\n \n <h3> Background</h3>\n \n <p>Plain language summary (PLS) are essential for making scientific research accessible to a broader audience. With the increasing capabilities of large language models (LLMs), there is the potential to automate the generation of PLS from complex scientific abstracts. This study assessed the performance of six LLM chatbots: ChatGPT, Claude, Copilot, Gemini, Meta AI, and Perplexity, in generating PLS from radiology research abstracts.</p>\n </section>\n \n <section>\n \n <h3> Methods</h3>\n \n <p>A total of 100 radiology abstracts were collected from PubMed. Six LLM chatbots were tasked with generating PLS for each abstract. Two expert radiologists independently evaluated the generated summaries for accuracy and readability, with their average scores being used for comparisons. Additionally, the Flesch–Kincaid (FK) grade level and Flesch reading ease score were applied to objectively assess readability.</p>\n </section>\n \n <section>\n \n <h3> Results</h3>\n \n <p>Comparisons of LLM-generated PLS revealed variations in both accuracy and readability across the models. Accuracy was highest for ChatGPT (4.94 ± 0.18) followed by Claude (4.75 ± 0.31). Readability was highest for ChatGPT (4.83 ± 0.27) followed by Perplexity (4.82 ± 0.29). The Flesch reading ease score was highest for Claude (62.53 ± 10.98) and lowest for ChatGPT (40.10 ± 11.24).</p>\n </section>\n \n <section>\n \n <h3> Conclusion</h3>\n \n <p>LLM chatbots show promise in the generation of PLS, but performance varies significantly between models in terms of both accuracy and readability. This study highlights the potential of LLMs to aid in science communication but underscores the need for careful model selection and human oversight.</p>\n </section>\n </div>","PeriodicalId":73508,"journal":{"name":"iRadiology","volume":"3 4","pages":"289-294"},"PeriodicalIF":0.0000,"publicationDate":"2025-08-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1002/ird3.70030","citationCount":"0","resultStr":"{\"title\":\"Evaluating the Capability of Large Language Model Chatbots for Generating Plain Language Summaries in Radiology\",\"authors\":\"Pradosh Kumar Sarangi,&nbsp;Pratisruti Hui,&nbsp;Himel Mondal,&nbsp;Debasish Swapnesh Kumar Nayak,&nbsp;M. Sarthak Swarup,&nbsp; Ishan,&nbsp;Swaha Panda\",\"doi\":\"10.1002/ird3.70030\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div>\\n \\n \\n <section>\\n \\n <h3> Background</h3>\\n \\n <p>Plain language summary (PLS) are essential for making scientific research accessible to a broader audience. With the increasing capabilities of large language models (LLMs), there is the potential to automate the generation of PLS from complex scientific abstracts. This study assessed the performance of six LLM chatbots: ChatGPT, Claude, Copilot, Gemini, Meta AI, and Perplexity, in generating PLS from radiology research abstracts.</p>\\n </section>\\n \\n <section>\\n \\n <h3> Methods</h3>\\n \\n <p>A total of 100 radiology abstracts were collected from PubMed. Six LLM chatbots were tasked with generating PLS for each abstract. 
Two expert radiologists independently evaluated the generated summaries for accuracy and readability, with their average scores being used for comparisons. Additionally, the Flesch–Kincaid (FK) grade level and Flesch reading ease score were applied to objectively assess readability.</p>\\n </section>\\n \\n <section>\\n \\n <h3> Results</h3>\\n \\n <p>Comparisons of LLM-generated PLS revealed variations in both accuracy and readability across the models. Accuracy was highest for ChatGPT (4.94 ± 0.18) followed by Claude (4.75 ± 0.31). Readability was highest for ChatGPT (4.83 ± 0.27) followed by Perplexity (4.82 ± 0.29). The Flesch reading ease score was highest for Claude (62.53 ± 10.98) and lowest for ChatGPT (40.10 ± 11.24).</p>\\n </section>\\n \\n <section>\\n \\n <h3> Conclusion</h3>\\n \\n <p>LLM chatbots show promise in the generation of PLS, but performance varies significantly between models in terms of both accuracy and readability. This study highlights the potential of LLMs to aid in science communication but underscores the need for careful model selection and human oversight.</p>\\n </section>\\n </div>\",\"PeriodicalId\":73508,\"journal\":{\"name\":\"iRadiology\",\"volume\":\"3 4\",\"pages\":\"289-294\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2025-08-02\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://onlinelibrary.wiley.com/doi/epdf/10.1002/ird3.70030\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"iRadiology\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://onlinelibrary.wiley.com/doi/10.1002/ird3.70030\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"iRadiology","FirstCategoryId":"1085","ListUrlMain":"https://onlinelibrary.wiley.com/doi/10.1002/ird3.70030","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0

Abstract



Background

Plain language summaries (PLS) are essential for making scientific research accessible to a broader audience. With the increasing capabilities of large language models (LLMs), there is potential to automate the generation of PLS from complex scientific abstracts. This study assessed the performance of six LLM chatbots (ChatGPT, Claude, Copilot, Gemini, Meta AI, and Perplexity) in generating PLS from radiology research abstracts.
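
To make the workflow concrete, below is a minimal sketch of how an abstract could be submitted to a chatbot for PLS generation. It assumes the OpenAI Python client as an illustrative stand-in for any of the six models; the abstract does not specify the prompt wording or whether the chatbots were accessed through their web interfaces or an API, so both are assumptions here.

```python
# Minimal sketch: generate a plain language summary (PLS) from one abstract.
# Assumption: the OpenAI Python client stands in for any of the six chatbots;
# the prompt wording is hypothetical, not the study's actual instruction.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_pls(abstract: str) -> str:
    """Ask the model to rewrite a scientific abstract in plain language."""
    response = client.chat.completions.create(
        model="gpt-4o",  # assumed model name, for illustration only
        messages=[
            {"role": "system",
             "content": ("Rewrite scientific abstracts as plain language "
                         "summaries that a general reader can understand.")},
            {"role": "user", "content": abstract},
        ],
    )
    return response.choices[0].message.content
```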

Methods

A total of 100 radiology abstracts were collected from PubMed. Six LLM chatbots were tasked with generating a PLS for each abstract. Two expert radiologists independently evaluated the generated summaries for accuracy and readability, and their average scores were used for comparison. Additionally, the Flesch–Kincaid (FK) grade level and Flesch reading ease score were applied to objectively assess readability.
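
The two readability metrics named above are standard formulas over word, sentence, and syllable counts: Flesch reading ease = 206.835 − 1.015 × (words/sentences) − 84.6 × (syllables/words), and FK grade level = 0.39 × (words/sentences) + 11.8 × (syllables/words) − 15.59. A minimal sketch of computing them, assuming the textstat package (the abstract does not name the tool actually used):

```python
# Minimal sketch: objective readability scoring of a generated PLS.
# Assumption: the textstat package; the study's actual tooling is unstated.
import textstat

pls = ("We looked at how well chatbots can explain radiology research "
       "in simple words that most people can understand.")

fk_grade = textstat.flesch_kincaid_grade(pls)      # U.S. school-grade level
reading_ease = textstat.flesch_reading_ease(pls)   # 0-100; higher is easier

print(f"FK grade level: {fk_grade:.1f}")
print(f"Flesch reading ease: {reading_ease:.1f}")
```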

Results

Comparisons of LLM-generated PLS revealed variations in both accuracy and readability across the models. Accuracy was highest for ChatGPT (4.94 ± 0.18) followed by Claude (4.75 ± 0.31). Readability was highest for ChatGPT (4.83 ± 0.27) followed by Perplexity (4.82 ± 0.29). The Flesch reading ease score was highest for Claude (62.53 ± 10.98) and lowest for ChatGPT (40.10 ± 11.24).
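
The mean ± SD comparisons reported above can be reproduced from a long-format score table; the sketch below uses pandas with hypothetical column names and toy values, not the study's data.

```python
# Minimal sketch: per-model mean and SD, mirroring how results are reported.
# Assumption: column names and values are hypothetical, not the study's data.
import pandas as pd

scores = pd.DataFrame({
    "model": ["ChatGPT", "ChatGPT", "Claude", "Claude"],
    "accuracy": [5.0, 4.9, 4.8, 4.7],           # averaged expert ratings
    "reading_ease": [41.2, 39.0, 63.1, 62.0],   # Flesch reading ease
})

summary = scores.groupby("model").agg(["mean", "std"]).round(2)
print(summary)
```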

Conclusion

LLM chatbots show promise in the generation of PLS, but performance varies significantly between models in terms of both accuracy and readability. This study highlights the potential of LLMs to aid in science communication but underscores the need for careful model selection and human oversight.
