Can Chatbots Provide Accurate and Readable Information for Patients With Temporomandibular Disorders?

IF 2.6 | Zone 3, Medicine | JCR Q2 | Dentistry, Oral Surgery & Medicine

Journal of Oral and Maxillofacial Surgery Pub Date : 2025-09-03 DOI:10.1016/j.joms.2025.08.012

Luís Eduardo Charles Pagotto, Dennys Ramon de Melo Fernandes Almeida, Thiago de Santana Santos, Everton Freitas de Morais


Citations: 0

Abstract


Can Chatbots Provide Accurate and Readable Information for Patients With Temporomandibular Disorders?

Background: Temporomandibular disorders (TMDs) are common musculoskeletal and neuromuscular conditions that impair jaw function and quality of life. Patients often lack access to reliable health information. Large language models (LLMs) have introduced chatbots as potential educational tools, yet concerns remain regarding accuracy, readability, empathy, and citation integrity.

Purpose: This study evaluated whether LLM-based chatbots can provide clinically accurate, empathic, and readable responses to patient-friendly questions about TMDs and whether their cited references are authentic.

Study design, setting, sample: This cross-sectional in silico study was conducted in March 2025. Twenty-three standardized TMD-related questions were used as prompts for each chatbot.

Predictor/exposure/independent variable: The predictor variable was the chatbot platform, reflecting distinct LLM architectures: GPT-4 (transformer-based autoregressive model, OpenAI), Gemini Pro (multimodal transformer, Google), and DeepSeek-V3 (mixture-of-experts transformer, DeepSeek).

Main outcome variables: Accuracy was defined as the proportion of responses judged clinically correct by two board-certified oral medicine specialists. Empathy was assessed by expert scoring of tone. Readability was determined with Flesch-Kincaid Reading Ease and Grade Level. Citation reliability was assessed by verifying whether references were authentic and retrievable in PubMed or other authoritative databases.
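The two readability scores named above follow fixed, published formulas based on average sentence length and average syllables per word. The sketch below shows the arithmetic in Python. It assumes a crude vowel-group syllable counter (`count_syllables` is a hypothetical helper); the abstract does not say which tool the authors used, so dedicated readability software will give somewhat different numbers.

```python
import re


def count_syllables(word: str) -> int:
    """Rough syllable estimate: count runs of vowels (illustrative heuristic only)."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def readability(text: str) -> tuple[float, float]:
    """Return (Flesch Reading Ease, Flesch-Kincaid Grade Level) for English text."""
    # Naive tokenization: sentences end at . ! ?, words are runs of letters.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")

    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = sum(count_syllables(w) for w in words) / len(words)

    # Standard Flesch and Flesch-Kincaid coefficients.
    reading_ease = 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word
    grade_level = 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59
    return reading_ease, grade_level
```

Higher Reading Ease means easier text, while Grade Level approximates the US school grade needed to follow it, so the two scores move in opposite directions as text gets harder.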

Covariates: No formal covariates were included; exploratory correlations between variables were performed.

Analyses: Descriptive statistics, 1-way analysis of variance with Tukey's post hoc tests, Pearson correlation, and χ² tests were performed. Statistical significance was set at P < .05.
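As an illustration of the χ² test listed above, the sketch below tests whether the share of hallucinated references differs across three chatbots. The counts are hypothetical placeholders, because the abstract reports only percentages and not the number of references checked. The closed-form p-value used here holds only for 2 degrees of freedom, which is what a 3 × 2 table gives.

```python
import math


def chi_square_independence(table: list[list[int]]) -> tuple[float, int]:
    """Pearson chi-square statistic and degrees of freedom for an r x c count table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of row and column.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected

    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


def p_value_dof2(statistic: float) -> float:
    """Upper-tail p-value of a chi-square variable with exactly 2 degrees of freedom."""
    return math.exp(-statistic / 2)


# HYPOTHETICAL counts, not the study's data: [hallucinated, authentic] per chatbot.
counts = [
    [40, 60],  # chatbot A
    [20, 80],  # chatbot B
    [10, 90],  # chatbot C
]
statistic, dof = chi_square_independence(counts)
p_value = p_value_dof2(statistic)
print(f"chi2 = {statistic:.2f}, dof = {dof}, p = {p_value:.2g}")
```

For tables of other sizes, a statistics library would be needed to obtain the p-value from the general chi-square distribution.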

Results: No statistically significant differences were observed in accuracy (P = .2) or empathy (P = .2). The mixture-of-experts transformer provided the most readable content (Flesch-Kincaid Reading Ease = 28.47; Flesch-Kincaid Grade Level = 12.19; P < .001). The transformer-based autoregressive model produced the highest proportion of hallucinated references (47.2%), compared with the multimodal transformer (18.8%) and the mixture-of-experts transformer (10.1%) (P < .001). A weak positive correlation was found between accuracy and readability (r = 0.27; P = .03), with no correlation between accuracy and empathy.
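Two of the reported figures are easier to read with a little arithmetic. The sketch below assumes the commonly cited Flesch difficulty bands, which are not stated in the abstract, and squares the Pearson coefficient to get the share of variance the two measures have in common. `flesch_band` and `shared_variance` are illustrative helper names.

```python
def flesch_band(score: float) -> str:
    """Map a Flesch Reading Ease score to its conventional difficulty label."""
    bands = [
        (90.0, "very easy"),
        (80.0, "easy"),
        (70.0, "fairly easy"),
        (60.0, "standard"),
        (50.0, "fairly difficult"),
        (30.0, "difficult"),
    ]
    for lower_bound, label in bands:
        if score >= lower_bound:
            return label
    return "very difficult"


def shared_variance(r: float) -> float:
    """Coefficient of determination (r squared) for a Pearson correlation."""
    return r * r


print(flesch_band(28.47))               # band for the reported Reading Ease
print(round(shared_variance(0.27), 3))  # variance shared at the reported r
```

Under these assumptions, even the most readable output sits in the hardest band, and the accuracy-readability correlation accounts for only about 7% of the variance.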

Conclusions and relevance: While all LLM-based chatbots delivered generally accurate and empathetic responses, the mixture-of-experts transformer outperformed others in readability and citation reliability. The high rate of hallucinated references in the transformer-based autoregressive model underscores the need for human oversight in clinical applications.


Source journal

Journal of Oral and Maxillofacial Surgery (Medicine: Dentistry and Oral Surgery)

CiteScore: 4.00

Self-citation rate: 5.30%

Articles published: not listed

Review time: 41 days

About the journal: This monthly journal offers comprehensive coverage of new techniques, important developments and innovative ideas in oral and maxillofacial surgery. Practice-applicable articles help develop the methods used to handle dentoalveolar surgery, facial injuries and deformities, TMJ disorders, oral cancer, jaw reconstruction, anesthesia and analgesia. The journal also includes specifics on new instruments and diagnostic equipment and modern therapeutic drugs and devices. Journal of Oral and Maxillofacial Surgery is recommended for first or priority subscription by the Dental Section of the Medical Library Association.