Yong Yin, Mei Zeng, Hansong Wang, Haibo Yang, Caijing Zhou, Feng Jiang, Shufan Wu, Tingyue Huang, Shuahua Yuan, Jilei Lin, Mingyu Tang, Jiande Chen, Bin Dong, Jiajun Yuan, Dan Xie
DOI: 10.3389/fped.2025.1461026
Journal: Frontiers in Pediatrics, vol. 13, pp. 1461026 (Q2, Pediatrics; Impact Factor 2.1)
Published: 2025-04-25 (Journal Article, eCollection 2025)
Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12062090/pdf/
A clinician-based comparative study of large language models in answering medical questions: the case of asthma.
Objective: This study aims to evaluate and compare the performance of four major large language models (GPT-3.5, GPT-4.0, YouChat, and Perplexity) in answering 32 common asthma-related questions.
Materials and methods: Seventy-five clinicians from various tertiary hospitals participated in this study. Each clinician evaluated the responses generated by the four large language models (LLMs) to 32 common clinical questions on pediatric asthma. Using predefined criteria, participants subjectively rated the accuracy, correctness, completeness, and practicality of the LLMs' answers, assigning numerical scores to quantify each model's performance on pediatric asthma-related questions.
Results: GPT-4.0 performed best across all dimensions, while YouChat performed worst. Both GPT-3.5 and GPT-4.0 outperformed the other two models, but there was no significant difference between GPT-3.5 and GPT-4.0, or between YouChat and Perplexity.
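The evaluation workflow described above, clinicians rating each model's answers on four dimensions and the per-model mean scores determining the ranking, can be sketched as follows. All scores here are invented for illustration; the study's actual data, rating scale, and aggregation method are not reported in the abstract.

```python
# Hypothetical sketch of the scoring workflow: each clinician rates every
# model's answer on four dimensions, and pooled mean scores rank the models.
# All ratings below are made up for illustration only.
from statistics import mean

DIMENSIONS = ("accuracy", "correctness", "completeness", "practicality")

# ratings[model][dimension] -> list of clinician scores (e.g. a 1-5 Likert scale)
ratings = {
    "GPT-4.0":    {"accuracy": [5, 4, 5], "correctness": [5, 5, 4],
                   "completeness": [4, 5, 5], "practicality": [5, 4, 4]},
    "GPT-3.5":    {"accuracy": [4, 4, 5], "correctness": [4, 5, 4],
                   "completeness": [4, 4, 4], "practicality": [4, 4, 3]},
    "Perplexity": {"accuracy": [3, 4, 3], "correctness": [3, 3, 4],
                   "completeness": [3, 3, 3], "practicality": [3, 2, 3]},
    "YouChat":    {"accuracy": [2, 3, 2], "correctness": [3, 2, 2],
                   "completeness": [2, 2, 3], "practicality": [2, 3, 2]},
}

def overall_mean(model_scores):
    """Mean of a model's scores pooled across all four dimensions."""
    return mean(s for dim in DIMENSIONS for s in model_scores[dim])

# Rank models from highest to lowest pooled mean score.
ranking = sorted(ratings, key=lambda m: overall_mean(ratings[m]), reverse=True)
```

With these illustrative numbers the ranking reproduces the reported ordering (GPT-4.0 first, YouChat last); deciding whether adjacent models differ *significantly* would additionally require a statistical test on the score distributions, which the sketch omits.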
Conclusion: GPT and other large language models can answer medical questions with a reasonable degree of completeness and accuracy. However, clinicians should critically appraise internet-sourced information, distinguishing accurate from inaccurate content, rather than accepting these models' outputs uncritically. With advances in key technologies, LLMs may one day become a safe option for doctors seeking information.
Journal description:
Frontiers in Pediatrics (Impact Factor 2.33) publishes rigorously peer-reviewed research broadly across the field, from basic to clinical research that meets ongoing challenges in pediatric patient care and child health. Field Chief Editors Arjan Te Pas at Leiden University and Michael L. Moritz at the Children's Hospital of Pittsburgh are supported by an outstanding Editorial Board of international experts. This multidisciplinary open-access journal is at the forefront of disseminating and communicating scientific knowledge and impactful discoveries to researchers, academics, clinicians and the public worldwide.
Frontiers in Pediatrics also features Research Topics, Frontiers special theme-focused issues managed by Guest Associate Editors, addressing important areas in pediatrics. In this fashion, Frontiers serves as an outlet to publish the broadest aspects of pediatrics in both basic and clinical research, including high-quality reviews, case reports, editorials and commentaries related to all aspects of pediatrics.