Performance of several large language models when answering common patient questions about type 1 diabetes in children: accuracy, comprehensibility and practicality.

Impact Factor: 2.0 · JCR: Q2 (Pediatrics) · CAS: Tier 3 (Medicine)
Yasemin Denkboy Ongen, Ayla İrem Aydın, Meryem Atak, Erdal Eren
{"title":"Performance of several large language models when answering common patient questions about type 1 diabetes in children: accuracy, comprehensibility and practicality.","authors":"Yasemin Denkboy Ongen, Ayla İrem Aydın, Meryem Atak, Erdal Eren","doi":"10.1186/s12887-025-05945-6","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>The use of large language models (LLMs) in healthcare has expanded significantly with advances in natural language processing. Models, such as ChatGPT and Google Gemini, are increasingly used to generate human-like responses to questions, including those posed by patients and their families. With the rise in the incidence of type 1 diabetes (T1D) among children, families frequently seek reliable answers regarding the disease. Previous research has focused on type 2 diabetes, but studies on T1D in a pediatric population remain limited. This study aimed to evaluate and compare the performance and effectiveness of different LLMs when answering common questions about T1D.</p><p><strong>Methods: </strong>This cross-sectional, comparative study used questions frequently asked by children with T1D and their parents. Twenty questions were selected from inquiries made to pediatric endocrinologists via social media. The performance of ChatGPT-3.5 ChatGPT-4 ChatGPT-4o was assessed using a standard prompt for each model. The responses were evaluated by five pediatric endocrinologists interested in diabetes using the General Quality Scale (GQS), a 5-point Likert scale, assessing factors such as accuracy, language simplicity, and empathy.</p><p><strong>Results: </strong>All five LLMs responded to the 20 selected questions, with their performance evaluated by GQS scores. ChatGPT-4o had the highest mean score (3.78 ± 1.09), while Gemini had the lowest (3.40 ± 1.24). Despite these differences, no significant variation was observed between the models (p = 0.103). However, ChatGPT-4o, ChatGPT-4, and Gemini Advanced produced the highest-quality answers compared to ChatGPT-3.5 and Gemini, scoring consistently between 3 and 4 points. ChatGPT-3.5 had the smallest variation in response quality, indicating consistency but not reaching the higher performance levels of other models.</p><p><strong>Conclusions: </strong>This study demonstrated that all evaluated LLMs performed similarly in answering common questions about T1D. LLMs such as ChatGPT-4o and Gemini Advanced can provide above-average, accurate, and patient-friendly answers to common questions about T1D. Although no significant differences were observed, the latest versions of LLMs show promise for integration into healthcare, provided they continue to be evaluated and improved. Further research should focus on developing specialized LLMs tailored for pediatric diabetes care.</p>","PeriodicalId":9144,"journal":{"name":"BMC Pediatrics","volume":"25 1","pages":"799"},"PeriodicalIF":2.0000,"publicationDate":"2025-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12512316/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"BMC Pediatrics","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1186/s12887-025-05945-6","RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"PEDIATRICS","Score":null,"Total":0}
引用次数: 0

Abstract

Background: The use of large language models (LLMs) in healthcare has expanded significantly with advances in natural language processing. Models such as ChatGPT and Google Gemini are increasingly used to generate human-like responses to questions, including those posed by patients and their families. With the rising incidence of type 1 diabetes (T1D) among children, families frequently seek reliable answers about the disease. Previous research has focused on type 2 diabetes, but studies of T1D in a pediatric population remain limited. This study aimed to evaluate and compare the performance and effectiveness of different LLMs in answering common questions about T1D.

Methods: This cross-sectional, comparative study used questions frequently asked by children with T1D and their parents. Twenty questions were selected from inquiries made to pediatric endocrinologists via social media. The performance of ChatGPT-3.5, ChatGPT-4, ChatGPT-4o, Gemini, and Gemini Advanced was assessed using a standard prompt for each model. The responses were evaluated by five pediatric endocrinologists with an interest in diabetes using the General Quality Scale (GQS), a 5-point Likert scale assessing factors such as accuracy, language simplicity, and empathy.
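To make the scoring protocol concrete, here is a minimal sketch of how per-model GQS summaries (mean ± standard deviation) could be computed once each rater's 5-point scores are collected. The model names match the study, but the score values are hypothetical placeholders, not the study's data.

```python
# Minimal sketch: summarizing GQS ratings per model.
# The score lists below are hypothetical placeholders, NOT the study's data.
import statistics

# In the study, five raters scored each of 20 answers on a 5-point scale,
# giving 100 ratings per model; a short list stands in for them here.
gqs_scores = {
    "ChatGPT-3.5":     [3, 4, 3, 3, 4],
    "ChatGPT-4":       [4, 4, 3, 5, 4],
    "ChatGPT-4o":      [4, 5, 4, 3, 5],
    "Gemini":          [3, 3, 4, 2, 4],
    "Gemini Advanced": [4, 4, 5, 3, 4],
}

for model, scores in gqs_scores.items():
    mean = statistics.mean(scores)
    sd = statistics.stdev(scores)  # sample standard deviation
    print(f"{model}: {mean:.2f} ± {sd:.2f}")
```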

Results: All five LLMs responded to the 20 selected questions, and their performance was evaluated by GQS scores. ChatGPT-4o had the highest mean score (3.78 ± 1.09), while Gemini had the lowest (3.40 ± 1.24). Despite these differences, no statistically significant variation was observed among the models (p = 0.103). However, ChatGPT-4o, ChatGPT-4, and Gemini Advanced produced the highest-quality answers compared with ChatGPT-3.5 and Gemini, scoring consistently between 3 and 4 points. ChatGPT-3.5 showed the smallest variation in response quality, indicating consistency, although it did not reach the performance levels of the other models.
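The abstract does not name the test behind p = 0.103; for ordinal Likert-type ratings compared across several models, a Kruskal-Wallis test is one common choice. A sketch under that assumption, again using hypothetical placeholder scores rather than the study's data:

```python
# Assumption: the between-model comparison used a Kruskal-Wallis test
# (the abstract does not specify the test). Scores are hypothetical placeholders.
from scipy import stats

chatgpt_35 = [3, 4, 3, 3, 4]
chatgpt_4  = [4, 4, 3, 5, 4]
chatgpt_4o = [4, 5, 4, 3, 5]
gemini     = [3, 3, 4, 2, 4]
gemini_adv = [4, 4, 5, 3, 4]

h, p = stats.kruskal(chatgpt_35, chatgpt_4, chatgpt_4o, gemini, gemini_adv)
print(f"H = {h:.2f}, p = {p:.3f}")  # p >= 0.05: no significant difference
```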

Conclusions: This study demonstrated that all evaluated LLMs performed similarly in answering common questions about T1D. LLMs such as ChatGPT-4o and Gemini Advanced can provide above-average, accurate, and patient-friendly answers to common questions about T1D. Although no significant differences were observed, the latest versions of LLMs show promise for integration into healthcare, provided they continue to be evaluated and improved. Further research should focus on developing specialized LLMs tailored for pediatric diabetes care.

Source journal: BMC Pediatrics
CiteScore: 3.70 · Self-citation rate: 4.20% · Articles per year: 683 · Review time: 3-8 weeks
Journal description: BMC Pediatrics is an open access journal publishing peer-reviewed research articles in all aspects of health care in neonates, children and adolescents, as well as related molecular genetics, pathophysiology, and epidemiology.