Yong Yin, Mei Zeng, Hansong Wang, Haibo Yang, Caijing Zhou, Feng Jiang, Shufan Wu, Tingyue Huang, Shuahua Yuan, Jilei Lin, Mingyu Tang, Jiande Chen, Bin Dong, Jiajun Yuan, Dan Xie
DOI: 10.3389/fped.2025.1461026
Journal: Frontiers in Pediatrics, vol. 13, pp. 1461026 (Q2, Pediatrics; Impact Factor 2.1)
Published: 2025-04-25 (Journal Article, eCollection 2025)
Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12062090/pdf/
A clinician-based comparative study of large language models in answering medical questions: the case of asthma.
Objective: This study aims to evaluate and compare the performance of four major large language models (GPT-3.5, GPT-4.0, YouChat, and Perplexity) in answering 32 common asthma-related questions.
Materials and methods: Seventy-five clinicians from various tertiary hospitals participated in this study. Each clinician evaluated the responses generated by the four large language models (LLMs) to 32 common clinical questions on pediatric asthma. Using predefined criteria, participants subjectively rated the accuracy, correctness, completeness, and practicality of the LLMs' answers, assigning numerical scores to quantify each model's performance on pediatric asthma-related questions.
Results: GPT-4.0 performed best across all dimensions, while YouChat performed worst. Both GPT-3.5 and GPT-4.0 outperformed the other two models, but there was no significant difference between GPT-3.5 and GPT-4.0, or between YouChat and Perplexity.
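The evaluation workflow described above, clinicians rating each model's answers on four dimensions and the per-model mean scores determining the ranking, can be sketched as follows. All scores here are invented for illustration; the study's actual data, rating scale, and aggregation method are not reported in the abstract.

```python
# Hypothetical sketch of the scoring workflow: each clinician rates every
# model's answer on four dimensions, and pooled mean scores rank the models.
# All ratings below are made up for illustration only.
from statistics import mean

DIMENSIONS = ("accuracy", "correctness", "completeness", "practicality")

# ratings[model][dimension] -> list of clinician scores (e.g. a 1-5 Likert scale)
ratings = {
    "GPT-4.0":    {"accuracy": [5, 4, 5], "correctness": [5, 5, 4],
                   "completeness": [4, 5, 5], "practicality": [5, 4, 4]},
    "GPT-3.5":    {"accuracy": [4, 4, 5], "correctness": [4, 5, 4],
                   "completeness": [4, 4, 4], "practicality": [4, 4, 3]},
    "Perplexity": {"accuracy": [3, 4, 3], "correctness": [3, 3, 4],
                   "completeness": [3, 3, 3], "practicality": [3, 2, 3]},
    "YouChat":    {"accuracy": [2, 3, 2], "correctness": [3, 2, 2],
                   "completeness": [2, 2, 3], "practicality": [2, 3, 2]},
}

def overall_mean(model_scores):
    """Mean of a model's scores pooled across all four dimensions."""
    return mean(s for dim in DIMENSIONS for s in model_scores[dim])

# Rank models from highest to lowest pooled mean score.
ranking = sorted(ratings, key=lambda m: overall_mean(ratings[m]), reverse=True)
```

With these illustrative numbers the ranking reproduces the reported ordering (GPT-4.0 first, YouChat last); deciding whether adjacent models differ *significantly* would additionally require a statistical test on the score distributions, which the sketch omits.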
Conclusion: GPT and other large language models can answer medical questions with a reasonable degree of completeness and accuracy. However, clinicians should critically appraise internet-sourced information, distinguishing accurate from inaccurate content, rather than accepting these models' outputs uncritically. With advances in key technologies, LLMs may one day become a safe option for doctors seeking information.
Journal description:
Frontiers in Pediatrics (Impact Factor 2.33) publishes rigorously peer-reviewed research broadly across the field, from basic to clinical research that meets ongoing challenges in pediatric patient care and child health. Field Chief Editors Arjan Te Pas at Leiden University and Michael L. Moritz at the Children's Hospital of Pittsburgh are supported by an outstanding Editorial Board of international experts. This multidisciplinary open-access journal is at the forefront of disseminating and communicating scientific knowledge and impactful discoveries to researchers, academics, clinicians and the public worldwide.
Frontiers in Pediatrics also features Research Topics, Frontiers special theme-focused issues managed by Guest Associate Editors, addressing important areas in pediatrics. In this fashion, Frontiers serves as an outlet to publish the broadest aspects of pediatrics in both basic and clinical research, including high-quality reviews, case reports, editorials and commentaries related to all aspects of pediatrics.