Yasemin Denkboy Ongen, Ayla İrem Aydın, Meryem Atak, Erdal Eren
{"title":"几种大型语言模型在回答儿童1型糖尿病患者常见问题时的表现:准确性、可理解性和实用性。","authors":"Yasemin Denkboy Ongen, Ayla İrem Aydın, Meryem Atak, Erdal Eren","doi":"10.1186/s12887-025-05945-6","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>The use of large language models (LLMs) in healthcare has expanded significantly with advances in natural language processing. Models, such as ChatGPT and Google Gemini, are increasingly used to generate human-like responses to questions, including those posed by patients and their families. With the rise in the incidence of type 1 diabetes (T1D) among children, families frequently seek reliable answers regarding the disease. Previous research has focused on type 2 diabetes, but studies on T1D in a pediatric population remain limited. This study aimed to evaluate and compare the performance and effectiveness of different LLMs when answering common questions about T1D.</p><p><strong>Methods: </strong>This cross-sectional, comparative study used questions frequently asked by children with T1D and their parents. Twenty questions were selected from inquiries made to pediatric endocrinologists via social media. The performance of ChatGPT-3.5 ChatGPT-4 ChatGPT-4o was assessed using a standard prompt for each model. The responses were evaluated by five pediatric endocrinologists interested in diabetes using the General Quality Scale (GQS), a 5-point Likert scale, assessing factors such as accuracy, language simplicity, and empathy.</p><p><strong>Results: </strong>All five LLMs responded to the 20 selected questions, with their performance evaluated by GQS scores. ChatGPT-4o had the highest mean score (3.78 ± 1.09), while Gemini had the lowest (3.40 ± 1.24). Despite these differences, no significant variation was observed between the models (p = 0.103). However, ChatGPT-4o, ChatGPT-4, and Gemini Advanced produced the highest-quality answers compared to ChatGPT-3.5 and Gemini, scoring consistently between 3 and 4 points. ChatGPT-3.5 had the smallest variation in response quality, indicating consistency but not reaching the higher performance levels of other models.</p><p><strong>Conclusions: </strong>This study demonstrated that all evaluated LLMs performed similarly in answering common questions about T1D. LLMs such as ChatGPT-4o and Gemini Advanced can provide above-average, accurate, and patient-friendly answers to common questions about T1D. Although no significant differences were observed, the latest versions of LLMs show promise for integration into healthcare, provided they continue to be evaluated and improved. Further research should focus on developing specialized LLMs tailored for pediatric diabetes care.</p>","PeriodicalId":9144,"journal":{"name":"BMC Pediatrics","volume":"25 1","pages":"799"},"PeriodicalIF":2.0000,"publicationDate":"2025-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12512316/pdf/","citationCount":"0","resultStr":"{\"title\":\"Performance of several large language models when answering common patient questions about type 1 diabetes in children: accuracy, comprehensibility and practicality.\",\"authors\":\"Yasemin Denkboy Ongen, Ayla İrem Aydın, Meryem Atak, Erdal Eren\",\"doi\":\"10.1186/s12887-025-05945-6\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><strong>Background: </strong>The use of large language models (LLMs) in healthcare has expanded significantly with advances in natural language processing. 
Models, such as ChatGPT and Google Gemini, are increasingly used to generate human-like responses to questions, including those posed by patients and their families. With the rise in the incidence of type 1 diabetes (T1D) among children, families frequently seek reliable answers regarding the disease. Previous research has focused on type 2 diabetes, but studies on T1D in a pediatric population remain limited. This study aimed to evaluate and compare the performance and effectiveness of different LLMs when answering common questions about T1D.</p><p><strong>Methods: </strong>This cross-sectional, comparative study used questions frequently asked by children with T1D and their parents. Twenty questions were selected from inquiries made to pediatric endocrinologists via social media. The performance of ChatGPT-3.5 ChatGPT-4 ChatGPT-4o was assessed using a standard prompt for each model. The responses were evaluated by five pediatric endocrinologists interested in diabetes using the General Quality Scale (GQS), a 5-point Likert scale, assessing factors such as accuracy, language simplicity, and empathy.</p><p><strong>Results: </strong>All five LLMs responded to the 20 selected questions, with their performance evaluated by GQS scores. ChatGPT-4o had the highest mean score (3.78 ± 1.09), while Gemini had the lowest (3.40 ± 1.24). Despite these differences, no significant variation was observed between the models (p = 0.103). However, ChatGPT-4o, ChatGPT-4, and Gemini Advanced produced the highest-quality answers compared to ChatGPT-3.5 and Gemini, scoring consistently between 3 and 4 points. ChatGPT-3.5 had the smallest variation in response quality, indicating consistency but not reaching the higher performance levels of other models.</p><p><strong>Conclusions: </strong>This study demonstrated that all evaluated LLMs performed similarly in answering common questions about T1D. LLMs such as ChatGPT-4o and Gemini Advanced can provide above-average, accurate, and patient-friendly answers to common questions about T1D. Although no significant differences were observed, the latest versions of LLMs show promise for integration into healthcare, provided they continue to be evaluated and improved. Further research should focus on developing specialized LLMs tailored for pediatric diabetes care.</p>\",\"PeriodicalId\":9144,\"journal\":{\"name\":\"BMC Pediatrics\",\"volume\":\"25 1\",\"pages\":\"799\"},\"PeriodicalIF\":2.0000,\"publicationDate\":\"2025-10-10\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12512316/pdf/\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"BMC Pediatrics\",\"FirstCategoryId\":\"3\",\"ListUrlMain\":\"https://doi.org/10.1186/s12887-025-05945-6\",\"RegionNum\":3,\"RegionCategory\":\"医学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"PEDIATRICS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"BMC Pediatrics","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1186/s12887-025-05945-6","RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"PEDIATRICS","Score":null,"Total":0}
Performance of several large language models when answering common patient questions about type 1 diabetes in children: accuracy, comprehensibility and practicality.
Background: The use of large language models (LLMs) in healthcare has expanded significantly with advances in natural language processing. Models such as ChatGPT and Google Gemini are increasingly used to generate human-like responses to questions, including those posed by patients and their families. With the rising incidence of type 1 diabetes (T1D) among children, families frequently seek reliable answers about the disease. Previous research has focused on type 2 diabetes, but studies on T1D in pediatric populations remain limited. This study aimed to evaluate and compare the performance and effectiveness of different LLMs in answering common questions about T1D.
Methods: This cross-sectional, comparative study used questions frequently asked by children with T1D and their parents. Twenty questions were selected from inquiries made to pediatric endocrinologists via social media. The performance of five LLMs (ChatGPT-3.5, ChatGPT-4, ChatGPT-4o, Gemini, and Gemini Advanced) was assessed using a standard prompt for each model. The responses were evaluated by five pediatric endocrinologists with an interest in diabetes using the Global Quality Scale (GQS), a 5-point Likert scale assessing factors such as accuracy, language simplicity, and empathy.
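The abstract does not include the study's prompt or query pipeline; the sketch below illustrates, under stated assumptions, how one standardized prompt could be posed to each ChatGPT variant through the OpenAI Python SDK (the Gemini models would be queried analogously through Google's SDK). The prompt wording, model identifiers, and example question are hypothetical, not the study's.

    # A minimal sketch, not the study's actual pipeline: one standardized
    # prompt template applied to each question, sent to each model in turn.
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    # Hypothetical wording; the study's standard prompt is not given in
    # the abstract.
    STANDARD_PROMPT = (
        "Answer this question from the family of a child with type 1 "
        "diabetes in clear, patient-friendly language:\n\n{question}"
    )

    def ask(model: str, question: str) -> str:
        """Send one question to one model and return the answer text."""
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user",
                       "content": STANDARD_PROMPT.format(question=question)}],
        )
        return response.choices[0].message.content

    # One illustrative question standing in for the twenty in the study.
    questions = ["How should insulin doses be adjusted on sick days?"]

    answers = {}
    for model in ("gpt-3.5-turbo", "gpt-4", "gpt-4o"):
        for question in questions:
            answers[(model, question)] = ask(model, question)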
Results: All five LLMs responded to the 20 selected questions, and their performance was evaluated using GQS scores. ChatGPT-4o had the highest mean score (3.78 ± 1.09), while Gemini had the lowest (3.40 ± 1.24). Despite these differences, no significant variation was observed among the models (p = 0.103). However, ChatGPT-4o, ChatGPT-4, and Gemini Advanced produced higher-quality answers than ChatGPT-3.5 and Gemini, scoring consistently between 3 and 4 points. ChatGPT-3.5 showed the smallest variation in response quality, indicating consistency, though it did not reach the performance levels of the other models.
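For context, a short sketch of how ratings like these could be aggregated and compared. The abstract reports means ± SD and a single p-value but does not name the statistical test; a Kruskal-Wallis test, a common nonparametric choice for Likert ratings across several groups, is assumed here, and the scores are random placeholders rather than the study's data.

    # Aggregate 5 raters x 20 questions = 100 GQS scores per model, then
    # compare the five groups. All numbers below are placeholders.
    import numpy as np
    from scipy.stats import kruskal

    rng = np.random.default_rng(0)
    ratings = {
        model: rng.integers(1, 6, size=100)  # Likert scores in 1..5
        for model in ("ChatGPT-3.5", "ChatGPT-4", "ChatGPT-4o",
                      "Gemini", "Gemini Advanced")
    }

    for model, scores in ratings.items():
        print(f"{model}: {scores.mean():.2f} ± {scores.std(ddof=1):.2f}")

    # Kruskal-Wallis H-test across the five models; p > 0.05 would match
    # the abstract's finding of no significant difference (p = 0.103).
    stat, p = kruskal(*ratings.values())
    print(f"H = {stat:.2f}, p = {p:.3f}")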
Conclusions: This study demonstrated that all evaluated LLMs performed similarly in answering common questions about T1D. LLMs such as ChatGPT-4o and Gemini Advanced can provide above-average, accurate, and patient-friendly answers to common questions about T1D. Although no significant differences were observed, the latest versions of LLMs show promise for integration into healthcare, provided they continue to be evaluated and improved. Further research should focus on developing specialized LLMs tailored for pediatric diabetes care.
About the journal:
BMC Pediatrics is an open access journal publishing peer-reviewed research articles in all aspects of health care in neonates, children and adolescents, as well as related molecular genetics, pathophysiology, and epidemiology.