Yasemin Denkboy Ongen, Ayla İrem Aydın, Meryem Atak, Erdal Eren
{"title":"几种大型语言模型在回答儿童1型糖尿病患者常见问题时的表现:准确性、可理解性和实用性。","authors":"Yasemin Denkboy Ongen, Ayla İrem Aydın, Meryem Atak, Erdal Eren","doi":"10.1186/s12887-025-05945-6","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>The use of large language models (LLMs) in healthcare has expanded significantly with advances in natural language processing. Models, such as ChatGPT and Google Gemini, are increasingly used to generate human-like responses to questions, including those posed by patients and their families. With the rise in the incidence of type 1 diabetes (T1D) among children, families frequently seek reliable answers regarding the disease. Previous research has focused on type 2 diabetes, but studies on T1D in a pediatric population remain limited. This study aimed to evaluate and compare the performance and effectiveness of different LLMs when answering common questions about T1D.</p><p><strong>Methods: </strong>This cross-sectional, comparative study used questions frequently asked by children with T1D and their parents. Twenty questions were selected from inquiries made to pediatric endocrinologists via social media. The performance of ChatGPT-3.5 ChatGPT-4 ChatGPT-4o was assessed using a standard prompt for each model. The responses were evaluated by five pediatric endocrinologists interested in diabetes using the General Quality Scale (GQS), a 5-point Likert scale, assessing factors such as accuracy, language simplicity, and empathy.</p><p><strong>Results: </strong>All five LLMs responded to the 20 selected questions, with their performance evaluated by GQS scores. ChatGPT-4o had the highest mean score (3.78 ± 1.09), while Gemini had the lowest (3.40 ± 1.24). Despite these differences, no significant variation was observed between the models (p = 0.103). However, ChatGPT-4o, ChatGPT-4, and Gemini Advanced produced the highest-quality answers compared to ChatGPT-3.5 and Gemini, scoring consistently between 3 and 4 points. ChatGPT-3.5 had the smallest variation in response quality, indicating consistency but not reaching the higher performance levels of other models.</p><p><strong>Conclusions: </strong>This study demonstrated that all evaluated LLMs performed similarly in answering common questions about T1D. LLMs such as ChatGPT-4o and Gemini Advanced can provide above-average, accurate, and patient-friendly answers to common questions about T1D. Although no significant differences were observed, the latest versions of LLMs show promise for integration into healthcare, provided they continue to be evaluated and improved. Further research should focus on developing specialized LLMs tailored for pediatric diabetes care.</p>","PeriodicalId":9144,"journal":{"name":"BMC Pediatrics","volume":"25 1","pages":"799"},"PeriodicalIF":2.0000,"publicationDate":"2025-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12512316/pdf/","citationCount":"0","resultStr":"{\"title\":\"Performance of several large language models when answering common patient questions about type 1 diabetes in children: accuracy, comprehensibility and practicality.\",\"authors\":\"Yasemin Denkboy Ongen, Ayla İrem Aydın, Meryem Atak, Erdal Eren\",\"doi\":\"10.1186/s12887-025-05945-6\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><strong>Background: </strong>The use of large language models (LLMs) in healthcare has expanded significantly with advances in natural language processing. 
Models, such as ChatGPT and Google Gemini, are increasingly used to generate human-like responses to questions, including those posed by patients and their families. With the rise in the incidence of type 1 diabetes (T1D) among children, families frequently seek reliable answers regarding the disease. Previous research has focused on type 2 diabetes, but studies on T1D in a pediatric population remain limited. This study aimed to evaluate and compare the performance and effectiveness of different LLMs when answering common questions about T1D.</p><p><strong>Methods: </strong>This cross-sectional, comparative study used questions frequently asked by children with T1D and their parents. Twenty questions were selected from inquiries made to pediatric endocrinologists via social media. The performance of ChatGPT-3.5 ChatGPT-4 ChatGPT-4o was assessed using a standard prompt for each model. The responses were evaluated by five pediatric endocrinologists interested in diabetes using the General Quality Scale (GQS), a 5-point Likert scale, assessing factors such as accuracy, language simplicity, and empathy.</p><p><strong>Results: </strong>All five LLMs responded to the 20 selected questions, with their performance evaluated by GQS scores. ChatGPT-4o had the highest mean score (3.78 ± 1.09), while Gemini had the lowest (3.40 ± 1.24). Despite these differences, no significant variation was observed between the models (p = 0.103). However, ChatGPT-4o, ChatGPT-4, and Gemini Advanced produced the highest-quality answers compared to ChatGPT-3.5 and Gemini, scoring consistently between 3 and 4 points. ChatGPT-3.5 had the smallest variation in response quality, indicating consistency but not reaching the higher performance levels of other models.</p><p><strong>Conclusions: </strong>This study demonstrated that all evaluated LLMs performed similarly in answering common questions about T1D. LLMs such as ChatGPT-4o and Gemini Advanced can provide above-average, accurate, and patient-friendly answers to common questions about T1D. Although no significant differences were observed, the latest versions of LLMs show promise for integration into healthcare, provided they continue to be evaluated and improved. Further research should focus on developing specialized LLMs tailored for pediatric diabetes care.</p>\",\"PeriodicalId\":9144,\"journal\":{\"name\":\"BMC Pediatrics\",\"volume\":\"25 1\",\"pages\":\"799\"},\"PeriodicalIF\":2.0000,\"publicationDate\":\"2025-10-10\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12512316/pdf/\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"BMC Pediatrics\",\"FirstCategoryId\":\"3\",\"ListUrlMain\":\"https://doi.org/10.1186/s12887-025-05945-6\",\"RegionNum\":3,\"RegionCategory\":\"医学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"PEDIATRICS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"BMC Pediatrics","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1186/s12887-025-05945-6","RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"PEDIATRICS","Score":null,"Total":0}
Performance of several large language models when answering common patient questions about type 1 diabetes in children: accuracy, comprehensibility and practicality.
Background: The use of large language models (LLMs) in healthcare has expanded significantly with advances in natural language processing. Models such as ChatGPT and Google Gemini are increasingly used to generate human-like responses to questions, including those posed by patients and their families. With the rising incidence of type 1 diabetes (T1D) among children, families frequently seek reliable answers about the disease. Previous research has focused on type 2 diabetes, but studies on T1D in pediatric populations remain limited. This study aimed to evaluate and compare the performance and effectiveness of different LLMs in answering common questions about T1D.
Methods: This cross-sectional, comparative study used questions frequently asked by children with T1D and their parents. Twenty questions were selected from inquiries made to pediatric endocrinologists via social media. The performance of five LLMs (ChatGPT-3.5, ChatGPT-4, ChatGPT-4o, Gemini, and Gemini Advanced) was assessed using a standard prompt for each model. The responses were evaluated by five pediatric endocrinologists with an interest in diabetes using the Global Quality Scale (GQS), a 5-point Likert scale assessing factors such as accuracy, language simplicity, and empathy.
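The abstract does not include the study's prompt or query pipeline; the sketch below illustrates, under stated assumptions, how one standardized prompt could be posed to each ChatGPT variant through the OpenAI Python SDK (the Gemini models would be queried analogously through Google's SDK). The prompt wording, model identifiers, and example question are hypothetical, not the study's.

    # A minimal sketch, not the study's actual pipeline: one standardized
    # prompt template applied to each question, sent to each model in turn.
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    # Hypothetical wording; the study's standard prompt is not given in
    # the abstract.
    STANDARD_PROMPT = (
        "Answer this question from the family of a child with type 1 "
        "diabetes in clear, patient-friendly language:\n\n{question}"
    )

    def ask(model: str, question: str) -> str:
        """Send one question to one model and return the answer text."""
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user",
                       "content": STANDARD_PROMPT.format(question=question)}],
        )
        return response.choices[0].message.content

    # One illustrative question standing in for the twenty in the study.
    questions = ["How should insulin doses be adjusted on sick days?"]

    answers = {}
    for model in ("gpt-3.5-turbo", "gpt-4", "gpt-4o"):
        for question in questions:
            answers[(model, question)] = ask(model, question)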
Results: All five LLMs responded to the 20 selected questions, and their performance was evaluated using GQS scores. ChatGPT-4o had the highest mean score (3.78 ± 1.09), while Gemini had the lowest (3.40 ± 1.24). Despite these differences, no significant variation was observed among the models (p = 0.103). However, ChatGPT-4o, ChatGPT-4, and Gemini Advanced produced higher-quality answers than ChatGPT-3.5 and Gemini, scoring consistently between 3 and 4 points. ChatGPT-3.5 showed the smallest variation in response quality, indicating consistency, though it did not reach the performance levels of the other models.
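For context, a short sketch of how ratings like these could be aggregated and compared. The abstract reports means ± SD and a single p-value but does not name the statistical test; a Kruskal-Wallis test, a common nonparametric choice for Likert ratings across several groups, is assumed here, and the scores are random placeholders rather than the study's data.

    # Aggregate 5 raters x 20 questions = 100 GQS scores per model, then
    # compare the five groups. All numbers below are placeholders.
    import numpy as np
    from scipy.stats import kruskal

    rng = np.random.default_rng(0)
    ratings = {
        model: rng.integers(1, 6, size=100)  # Likert scores in 1..5
        for model in ("ChatGPT-3.5", "ChatGPT-4", "ChatGPT-4o",
                      "Gemini", "Gemini Advanced")
    }

    for model, scores in ratings.items():
        print(f"{model}: {scores.mean():.2f} ± {scores.std(ddof=1):.2f}")

    # Kruskal-Wallis H-test across the five models; p > 0.05 would match
    # the abstract's finding of no significant difference (p = 0.103).
    stat, p = kruskal(*ratings.values())
    print(f"H = {stat:.2f}, p = {p:.3f}")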
Conclusions: This study demonstrated that all evaluated LLMs performed similarly in answering common questions about T1D. LLMs such as ChatGPT-4o and Gemini Advanced can provide above-average, accurate, and patient-friendly answers to common questions about T1D. Although no significant differences were observed, the latest versions of LLMs show promise for integration into healthcare, provided they continue to be evaluated and improved. Further research should focus on developing specialized LLMs tailored for pediatric diabetes care.
About the journal:
BMC Pediatrics is an open access journal publishing peer-reviewed research articles in all aspects of health care in neonates, children and adolescents, as well as related molecular genetics, pathophysiology, and epidemiology.