[Preliminary exploration of the applications of five large language models in the field of oral auxiliary diagnosis, treatment and health consultation].

C L Han, S Z Bai, T M Zhang, C Liu, Y C Liu, X X Hu, Y M Zhao
{"title":"[Preliminary exploration of the applications of five large language models in the field of oral auxiliary diagnosis, treatment and health consultation].","authors":"C L Han, S Z Bai, T M Zhang, C Liu, Y C Liu, X X Hu, Y M Zhao","doi":"10.3760/cma.j.cn112144-20241107-00418","DOIUrl":null,"url":null,"abstract":"<p><p><b>Objective:</b> To evaluate the accuracy of the oral healthcare information provided by different large language models (LLM) to explore their feasibility and limitations in the application of oral auxiliary, treatment and health consultation. <b>Methods:</b> This study designed eight items comprising 47 questions in total related to the diagnosis and treatment of oral diseases [to assess the performance of LLM as an artificial intelligence (AI) medical assistant], and five items comprising 35 questions in total about oral health consultations (to assess the performance of LLM as a simulated doctor). These questions were answered individually by the five LLM models (Erine Bot, HuatuoGPT, Tongyi Qianwen, iFlytek Spark, ChatGPT). Two attending physicians with more than 5 years of experience independently rated the responses using the 3C criteria (correct, clear, concise), and the consistency between the raters was assessed using the Spearman rank correlation coefficient, and the Kruskal-Wallis test and Dunn post hoc test were used to assess the statistical differences between the models. Additionally, this study used 600 questions from the 2023 dental licensing examination to evaluate the time taken to answer, scores, and accuracy of each model. <b>Results:</b> As an AI medical assistant, LLM can assist doctors in diagnosis and treatment decision-making, with an inter-evaluator Spearman coefficient of 0.505 (<i>P</i><0.01). As a simulated doctor, LLM can carry out patient popularization, with an inter-evaluator Spearman coefficient of 0.533 (<i>P</i><0.01). 
The 3C scoring results were represented by the median (lower quartile, upper quartile), and the 3C scores of each model as an AI medical assistant and a simulated doctor were respectively: 2.00 (1.00, 3.00) and 2.00 (1.00, 3.00) points of Erine Bot, 1.00 (1.00, 2.00) and 2.00 (1.00, 2.00) points of HuatuoGPT, 2.00 (1.00, 2.00) and 2.00 (1.00, 3.00) points of Tongyi Qianwen, 2.00 (1.00, 2.00) and 2.00 (1.75, 2.25) points of iFlytek Spark, 3.00 (2.00, 3.00) and 3.00 (2.00, 3.00) points of ChatGPT (full score of 4 points). The Kruskal-Wallis test results showed that, as an AI medical assistant or a simulated doctor, there were statistically differences in the 3C scores among the five large language models (all <i>P</i><0.001). The average score of the 5 LLMs on the dental licensing examination was 370.2, with an accuracy rate of 61.7% (370.2/600) and a time consumption of 94.6 minutes. Specifically, Erine Bot took 115 minutes, scored 363 points with an accuracy rate of 60.5% (363/600), HuatuoGPT took 224 minutes and scored 305 points with an accuracy rate of 50.8% (305/600), Tongyi Qianwen took 43 minutes, scored 438 points with an accuracy rate of 73.0% (480/600), iFlytek Spark took 32 minutes, scored 364 points with an accuracy rate of 60.7% (364/600), and ChatGPT took 59 minutes, scored 381 points with an accuracy rate of 63.5% (381/600). <b>Conclusions:</b> Based on the evaluation of LLM's dual roles as an AI medical assistant and a simulated doctor, ChatGPT performes the best, with basically correct, clear and concise answers, followed by Erine Bot, Tongyi Qianwen and iFlytek Spark, with HuatuoGPT lagging behind significantly. In the dental licensing examination, all the 4 LLM, except for HuatuoGPT, reach the passing level, and the time consumpution for answering is significantly reduced compared to the 8 h required for the exam regulations in all of the five models. 
LLM has the feasibility of application in oral auxiliary, treatment and health consultation, and it can help both doctors and patients obtain medical information quickly. Howere, their outputs carry a risk of errors (since the 3C scoring results do not reach the full marks), so prudent judgment should be exercised when using them.</p>","PeriodicalId":23965,"journal":{"name":"中华口腔医学杂志","volume":"60 8","pages":"871-878"},"PeriodicalIF":0.0000,"publicationDate":"2025-07-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"中华口腔医学杂志","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.3760/cma.j.cn112144-20241107-00418","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q4","JCRName":"Medicine","Score":null,"Total":0}
引用次数: 0

Abstract

Objective: To evaluate the accuracy of the oral healthcare information provided by different large language models (LLMs) and to explore their feasibility and limitations in oral auxiliary diagnosis, treatment and health consultation. Methods: This study designed eight items comprising 47 questions in total related to the diagnosis and treatment of oral diseases [to assess the performance of LLMs as an artificial intelligence (AI) medical assistant], and five items comprising 35 questions in total about oral health consultations (to assess the performance of LLMs as a simulated doctor). These questions were answered individually by five LLMs (ERNIE Bot, HuatuoGPT, Tongyi Qianwen, iFlytek Spark, ChatGPT). Two attending physicians, each with more than 5 years of experience, independently rated the responses using the 3C criteria (correct, clear, concise); inter-rater consistency was assessed with the Spearman rank correlation coefficient, and the Kruskal-Wallis test with Dunn post hoc test was used to assess statistical differences between the models. Additionally, this study used 600 questions from the 2023 dental licensing examination to evaluate each model's answering time, score and accuracy. Results: As an AI medical assistant, LLMs can assist doctors in diagnosis and treatment decision-making, with an inter-rater Spearman coefficient of 0.505 (P<0.01). As a simulated doctor, LLMs can provide patient health education, with an inter-rater Spearman coefficient of 0.533 (P<0.01).
The 3C scores are presented as median (lower quartile, upper quartile). As an AI medical assistant and as a simulated doctor respectively, the models scored: ERNIE Bot 2.00 (1.00, 3.00) and 2.00 (1.00, 3.00); HuatuoGPT 1.00 (1.00, 2.00) and 2.00 (1.00, 2.00); Tongyi Qianwen 2.00 (1.00, 2.00) and 2.00 (1.00, 3.00); iFlytek Spark 2.00 (1.00, 2.00) and 2.00 (1.75, 2.25); ChatGPT 3.00 (2.00, 3.00) and 3.00 (2.00, 3.00) (full score 4 points). The Kruskal-Wallis test showed statistically significant differences in the 3C scores among the five models, both as an AI medical assistant and as a simulated doctor (all P<0.001). On the dental licensing examination, the five LLMs averaged 370.2 points, an accuracy rate of 61.7% (370.2/600), with an average answering time of 94.6 minutes. Specifically, ERNIE Bot took 115 minutes and scored 363 points, an accuracy rate of 60.5% (363/600); HuatuoGPT took 224 minutes and scored 305 points, 50.8% (305/600); Tongyi Qianwen took 43 minutes and scored 438 points, 73.0% (438/600); iFlytek Spark took 32 minutes and scored 364 points, 60.7% (364/600); and ChatGPT took 59 minutes and scored 381 points, 63.5% (381/600). Conclusions: Across the dual roles of AI medical assistant and simulated doctor, ChatGPT performs best, with answers that are largely correct, clear and concise, followed by ERNIE Bot, Tongyi Qianwen and iFlytek Spark, with HuatuoGPT lagging significantly behind. On the dental licensing examination, all four LLMs except HuatuoGPT reached the passing level, and all five models' answering times were far below the 8 h allotted by the examination regulations.
LLMs are feasible for application in oral auxiliary diagnosis, treatment and health consultation, and can help both doctors and patients obtain medical information quickly. However, their outputs carry a risk of errors (the 3C scores did not reach full marks), so prudent judgment should be exercised when using them.
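For illustration, the arithmetic behind the reported examination averages and the statistical workflow described in the Methods (Spearman rank correlation between the two raters, Kruskal-Wallis across the five models) can be sketched in Python. The per-model exam scores and times are taken from the Results above; the 3C rater scores below are synthetic placeholders, not the study's data, and `numpy`/`scipy` are assumed to be available.

```python
import numpy as np
from scipy import stats

# Reported per-model exam scores (out of 600) and answer times, from the Results
scores = {"ERNIE Bot": 363, "HuatuoGPT": 305, "Tongyi Qianwen": 438,
          "iFlytek Spark": 364, "ChatGPT": 381}
times_min = {"ERNIE Bot": 115, "HuatuoGPT": 224, "Tongyi Qianwen": 43,
             "iFlytek Spark": 32, "ChatGPT": 59}

avg_score = sum(scores.values()) / len(scores)       # 370.2, as reported
avg_time = sum(times_min.values()) / len(times_min)  # 94.6 min, as reported
accuracy = {m: s / 600 for m, s in scores.items()}   # e.g. Tongyi Qianwen: 0.73

# Synthetic 3C ratings (0-4) from two raters over 47 questions, illustration only
rng = np.random.default_rng(0)
rater1 = rng.integers(0, 5, size=47)
rater2 = np.clip(rater1 + rng.integers(-1, 2, size=47), 0, 4)
rho, p_rho = stats.spearmanr(rater1, rater2)  # inter-rater consistency

# Kruskal-Wallis test across five models' (synthetic) 3C score distributions
per_model = [rng.integers(0, 5, size=47) for _ in range(5)]
h_stat, p_kw = stats.kruskal(*per_model)
# A Dunn post hoc test (e.g. via the third-party scikit-posthocs package)
# would follow a significant Kruskal-Wallis result, as in the study.
```

Note that the averages and accuracy rates recomputed this way match the figures reported in the abstract.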

Source journal: 中华口腔医学杂志 (Chinese Journal of Stomatology), Medicine (all). CiteScore: 0.90; self-citation rate: 0.00%; articles: 9692.
About the journal: Founded in August 1953, Chinese Journal of Stomatology is a monthly academic journal of stomatology, published publicly at home and abroad, sponsored by the Chinese Medical Association and co-sponsored by the Chinese Stomatological Association. It mainly reports leading scientific research results and clinical diagnosis and treatment experience in the field of oral medicine, as well as basic theoretical research that guides oral clinical practice and is closely combined with it. Over the years, Chinese Journal of Stomatology has been indexed in Medline, Scopus, Toxicology Abstracts, Chemical Abstracts, the American Cancer database, Russian Abstracts, the China Core Journal of Science and Technology list, the Peking University Core Journal list, CSCD, and more than 20 other important domestic and international databases and retrieval systems.