Evaluating large language models in pediatric fever management: a two-layer study.

IF 3.2 Q1 HEALTH CARE SCIENCES & SERVICES
Frontiers in Digital Health · Pub Date: 2025-09-03 · eCollection Date: 2025-01-01 · DOI: 10.3389/fdgth.2025.1610671
Guijun Yang, Hejun Jiang, Shuhua Yuan, Mingyu Tang, Jing Zhang, Jilei Lin, Jiande Chen, Jiajun Yuan, Liebin Zhao, Yong Yin
{"title":"评估儿童发烧管理中的大语言模型:一项双层研究。","authors":"Guijun Yang, Hejun Jiang, Shuhua Yuan, Mingyu Tang, Jing Zhang, Jilei Lin, Jiande Chen, Jiajun Yuan, Liebin Zhao, Yong Yin","doi":"10.3389/fdgth.2025.1610671","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>Pediatric fever is a prevalent concern, often causing parental anxiety and frequent medical consultations. While large language models (LLMs) such as ChatGPT, Perplexity, and YouChat show promise in enhancing medical communication and education, their efficacy in addressing complex pediatric fever-related questions remains underexplored, particularly from the perspectives of medical professionals and patients' relatives.</p><p><strong>Objective: </strong>This study aimed to explore the differences and similarities among four common large language models (ChatGPT3.5, ChatGPT4.0, YouChat, and Perplexity) in answering thirty pediatric fever-related questions and to examine how doctors and pediatric patients' relatives evaluate the LLM-generated answers based on predefined criteria.</p><p><strong>Methods: </strong>The study selected thirty fever-related pediatric questions answered by the four models. Twenty doctors rated these responses across four dimensions. To conduct the survey among pediatric patients' relatives, we eliminated certain responses that we deemed to pose safety risks or be misleading. Based on the doctors' questionnaire, the thirty questions were divided into six groups, each evaluated by twenty pediatric relatives. The Tukey <i>post-hoc</i> test was used to check for significant differences. Some of pediatric relatives was revisited for deeper insights into the results.</p><p><strong>Results: </strong>In the doctors' questionnaire, ChatGPT3.5 and ChatGPT4.0 outperformed YouChat and Perplexity in all dimensions, with no significant difference between ChatGPT3.5 and ChatGPT4.0 or between YouChat and Perplexity. All models scored significantly better in accuracy than other dimensions. In the pediatric relatives' questionnaire, no significant differences were found among the models, with revisits revealing some reasons for these results.</p><p><strong>Conclusions: </strong>Internet searches (YouChat and Perplexity) did not improve the ability of large language models to answer medical questions as expected. Patients lacked the ability to understand and analyze model responses due to a lack of professional knowledge and a lack of central points in model answers. When developing large language models for patient use, it's important to highlight the central points of the answers and ensure they are easily understandable.</p>","PeriodicalId":73078,"journal":{"name":"Frontiers in digital health","volume":"7 ","pages":"1610671"},"PeriodicalIF":3.2000,"publicationDate":"2025-09-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12441047/pdf/","citationCount":"0","resultStr":"{\"title\":\"Evaluating large language models in pediatric fever management: a two-layer study.\",\"authors\":\"Guijun Yang, Hejun Jiang, Shuhua Yuan, Mingyu Tang, Jing Zhang, Jilei Lin, Jiande Chen, Jiajun Yuan, Liebin Zhao, Yong Yin\",\"doi\":\"10.3389/fdgth.2025.1610671\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><strong>Background: </strong>Pediatric fever is a prevalent concern, often causing parental anxiety and frequent medical consultations. 
While large language models (LLMs) such as ChatGPT, Perplexity, and YouChat show promise in enhancing medical communication and education, their efficacy in addressing complex pediatric fever-related questions remains underexplored, particularly from the perspectives of medical professionals and patients' relatives.</p><p><strong>Objective: </strong>This study aimed to explore the differences and similarities among four common large language models (ChatGPT3.5, ChatGPT4.0, YouChat, and Perplexity) in answering thirty pediatric fever-related questions and to examine how doctors and pediatric patients' relatives evaluate the LLM-generated answers based on predefined criteria.</p><p><strong>Methods: </strong>The study selected thirty fever-related pediatric questions answered by the four models. Twenty doctors rated these responses across four dimensions. To conduct the survey among pediatric patients' relatives, we eliminated certain responses that we deemed to pose safety risks or be misleading. Based on the doctors' questionnaire, the thirty questions were divided into six groups, each evaluated by twenty pediatric relatives. The Tukey <i>post-hoc</i> test was used to check for significant differences. Some of pediatric relatives was revisited for deeper insights into the results.</p><p><strong>Results: </strong>In the doctors' questionnaire, ChatGPT3.5 and ChatGPT4.0 outperformed YouChat and Perplexity in all dimensions, with no significant difference between ChatGPT3.5 and ChatGPT4.0 or between YouChat and Perplexity. All models scored significantly better in accuracy than other dimensions. In the pediatric relatives' questionnaire, no significant differences were found among the models, with revisits revealing some reasons for these results.</p><p><strong>Conclusions: </strong>Internet searches (YouChat and Perplexity) did not improve the ability of large language models to answer medical questions as expected. Patients lacked the ability to understand and analyze model responses due to a lack of professional knowledge and a lack of central points in model answers. When developing large language models for patient use, it's important to highlight the central points of the answers and ensure they are easily understandable.</p>\",\"PeriodicalId\":73078,\"journal\":{\"name\":\"Frontiers in digital health\",\"volume\":\"7 \",\"pages\":\"1610671\"},\"PeriodicalIF\":3.2000,\"publicationDate\":\"2025-09-03\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12441047/pdf/\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Frontiers in digital health\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.3389/fdgth.2025.1610671\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"2025/1/1 0:00:00\",\"PubModel\":\"eCollection\",\"JCR\":\"Q1\",\"JCRName\":\"HEALTH CARE SCIENCES & SERVICES\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Frontiers in digital health","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.3389/fdgth.2025.1610671","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2025/1/1 0:00:00","PubModel":"eCollection","JCR":"Q1","JCRName":"HEALTH CARE SCIENCES & SERVICES","Score":null,"Total":0}
Citations: 0

Abstract

Background: Pediatric fever is a prevalent concern, often causing parental anxiety and frequent medical consultations. While large language models (LLMs) such as ChatGPT, Perplexity, and YouChat show promise in enhancing medical communication and education, their efficacy in addressing complex pediatric fever-related questions remains underexplored, particularly from the perspectives of medical professionals and patients' relatives.

Objective: This study aimed to explore the differences and similarities among four common large language models (ChatGPT3.5, ChatGPT4.0, YouChat, and Perplexity) in answering thirty pediatric fever-related questions and to examine how doctors and pediatric patients' relatives evaluate the LLM-generated answers based on predefined criteria.

Methods: The study selected thirty fever-related pediatric questions answered by the four models. Twenty doctors rated these responses across four dimensions. To conduct the survey among pediatric patients' relatives, we eliminated certain responses that we deemed to pose safety risks or be misleading. Based on the doctors' questionnaire, the thirty questions were divided into six groups, each evaluated by twenty pediatric patients' relatives. The Tukey post-hoc test was used to check for significant differences. Some of the pediatric patients' relatives were revisited for deeper insights into the results.
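For readers unfamiliar with the statistical step above, the following is a minimal sketch of a one-way comparison followed by Tukey's post-hoc test across the four models, written in Python with statsmodels. The rating scale, group sizes, and all score values are hypothetical placeholders, not data from the study.

```python
# Minimal sketch: one-way ANOVA followed by Tukey's HSD post-hoc test
# across four models. All scores below are synthetic placeholders,
# not ratings collected in the study.
import numpy as np
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(0)

# Hypothetical 1-5 ratings from twenty doctors per model on one dimension.
scores = {
    "ChatGPT3.5": rng.normal(4.2, 0.5, 20),
    "ChatGPT4.0": rng.normal(4.3, 0.5, 20),
    "YouChat":    rng.normal(3.6, 0.5, 20),
    "Perplexity": rng.normal(3.5, 0.5, 20),
}

values = np.concatenate(list(scores.values()))
labels = np.repeat(list(scores.keys()), [len(v) for v in scores.values()])

# Omnibus test: do mean ratings differ across the four models at all?
f_stat, p_val = stats.f_oneway(*scores.values())
print(f"ANOVA: F = {f_stat:.2f}, p = {p_val:.4g}")

# Tukey HSD: every pairwise model comparison, controlling the
# family-wise error rate at alpha = 0.05.
print(pairwise_tukeyhsd(values, labels, alpha=0.05))
```

In the study itself, the inputs would be the doctors' (and relatives') ratings on each evaluation dimension rather than simulated draws; Tukey's HSD is the natural follow-up to an omnibus test here because it compares all model pairs while controlling the family-wise error rate.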

Results: In the doctors' questionnaire, ChatGPT3.5 and ChatGPT4.0 outperformed YouChat and Perplexity in all dimensions, with no significant difference between ChatGPT3.5 and ChatGPT4.0 or between YouChat and Perplexity. All models scored significantly higher on accuracy than on the other dimensions. In the pediatric patients' relatives' questionnaire, no significant differences were found among the models; the revisit interviews revealed some reasons for these results.

Conclusions: Internet search integration (as in YouChat and Perplexity) did not improve the ability of large language models to answer medical questions as expected. Patients lacked the ability to understand and analyze the models' responses, both because they lacked professional knowledge and because the answers lacked clear central points. When developing large language models for patient use, it is important to highlight the central points of the answers and to ensure they are easily understandable.
