Evaluating the Capability of Large Language Model Chatbots for Generating Plain Language Summaries in Radiology

Pradosh Kumar Sarangi, Pratisruti Hui, Himel Mondal, Debasish Swapnesh Kumar Nayak, M. Sarthak Swarup, Ishan, Swaha Panda

iRadiology, 3(4), 289-294. Published 2025-08-02. DOI: 10.1002/ird3.70030 (https://onlinelibrary.wiley.com/doi/10.1002/ird3.70030)
Abstract
Background
Plain language summaries (PLS) are essential for making scientific research accessible to a broader audience. With the increasing capabilities of large language models (LLMs), there is potential to automate the generation of PLS from complex scientific abstracts. This study assessed the performance of six LLM chatbots (ChatGPT, Claude, Copilot, Gemini, Meta AI, and Perplexity) in generating PLS from radiology research abstracts.
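The abstract does not describe how the chatbots were prompted, so the following is an illustrative sketch only: it shows the general idea of asking an LLM for a plain language summary through an API-style call, using the OpenAI Python client and a generic instruction as assumptions. The study itself used the six chatbots through their own interfaces, with prompts not given here.

```python
# Illustrative sketch only: not the authors' protocol.
# Assumes the OpenAI Python client (openai>=1.0) and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

def plain_language_summary(abstract: str) -> str:
    """Ask the model to rewrite a research abstract for a lay audience."""
    response = client.chat.completions.create(
        model="gpt-4o",  # hypothetical model choice for this sketch
        messages=[
            {"role": "system",
             "content": "Rewrite the following radiology abstract as a plain "
                        "language summary that a reader without medical "
                        "training can understand. Keep it accurate and brief."},
            {"role": "user", "content": abstract},
        ],
    )
    return response.choices[0].message.content
```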
Methods
A total of 100 radiology abstracts were collected from PubMed. Six LLM chatbots were tasked with generating a PLS for each abstract. Two expert radiologists independently evaluated the generated summaries for accuracy and readability, and their average scores were used for comparisons. Additionally, the Flesch–Kincaid (FK) grade level and the Flesch reading ease score were applied to assess readability objectively.
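For orientation, the two readability measures named above are standard formulas over words, sentences, and syllables. The sketch below implements both; the vowel-group syllable counter is a rough heuristic and an assumption of this sketch, since the abstract does not say which implementation the authors used (dedicated tools such as the textstat package apply more careful rules and will give slightly different scores).

```python
# Minimal sketch of the Flesch reading ease score and Flesch-Kincaid grade level.
import re

def count_syllables(word: str) -> int:
    # Count groups of consecutive vowels as syllables (rough approximation).
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def readability(text: str) -> tuple[float, float]:
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    n_words = max(1, len(words))
    n_syllables = sum(count_syllables(w) for w in words)

    words_per_sentence = n_words / sentences
    syllables_per_word = n_syllables / n_words

    # Flesch reading ease: higher values mean easier text.
    fre = 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word
    # Flesch-Kincaid grade level: approximate US school grade required.
    fkgl = 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59
    return fre, fkgl

fre, fkgl = readability("The scan showed no sign of disease. The patient can go home today.")
print(f"Flesch reading ease: {fre:.1f}, FK grade level: {fkgl:.1f}")
```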
Results
Comparisons of the LLM-generated PLS revealed variations in both accuracy and readability across the models. Accuracy was highest for ChatGPT (4.94 ± 0.18), followed by Claude (4.75 ± 0.31). Readability was highest for ChatGPT (4.83 ± 0.27), followed by Perplexity (4.82 ± 0.29). The Flesch reading ease score was highest for Claude (62.53 ± 10.98) and lowest for ChatGPT (40.10 ± 11.24).
Conclusion
LLM chatbots show promise in generating PLS, but performance varies significantly between models in terms of both accuracy and readability. This study highlights the potential of LLMs to aid science communication but underscores the need for careful model selection and human oversight.