Luís Eduardo Charles Pagotto, Dennys Ramon de Melo Fernandes Almeida, Thiago de Santana Santos, Everton Freitas de Morais
Journal of Oral and Maxillofacial Surgery. Published 2025-09-03. DOI: 10.1016/j.joms.2025.08.012 (https://doi.org/10.1016/j.joms.2025.08.012)
Can Chatbots Provide Accurate and Readable Information for Patients With Temporomandibular Disorders?
Background: Temporomandibular disorders (TMDs) are common musculoskeletal and neuromuscular conditions that impair jaw function and quality of life. Patients often lack access to reliable health information. Large language models (LLMs) have introduced chatbots as potential educational tools, yet concerns remain regarding accuracy, readability, empathy, and citation integrity.
Purpose: This study evaluated whether LLM-based chatbots can provide clinically accurate, empathic, and readable responses to patient-friendly questions about TMDs and whether their cited references are authentic.
Study design, setting, sample: This cross-sectional in silico study was conducted in March 2025. Twenty-three standardized TMD-related questions were used as prompts for each chatbot.
Predictor/exposure/independent variable: The predictor variable was the chatbot platform, reflecting distinct LLM architectures: GPT-4 (transformer-based autoregressive model, OpenAI), Gemini Pro (multimodal transformer, Google), and DeepSeek-V3 (mixture-of-experts transformer, DeepSeek).
Main outcome variables: Accuracy was defined as the proportion of responses judged clinically correct by two board-certified oral medicine specialists. Empathy was assessed by expert scoring of tone. Readability was determined with Flesch-Kincaid Reading Ease and Grade Level. Citation reliability was assessed by verifying whether references were authentic and retrievable in PubMed or other authoritative databases.
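For readers unfamiliar with the readability metrics, the following is a minimal sketch of the standard Flesch Reading Ease and Flesch-Kincaid Grade Level formulas. It is not the tool the authors used, which the abstract does not name. The function names `count_syllables` and `readability` are hypothetical, and the syllable counter is a rough vowel-group heuristic, so scores will differ somewhat from dictionary-based tools.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic (an assumption, not a dictionary lookup):
    count vowel groups and discount a trailing silent 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith(("le", "ee")) and count > 1:
        count -= 1
    return max(count, 1)


def readability(text: str) -> tuple[float, float]:
    """Return (Flesch Reading Ease, Flesch-Kincaid Grade Level) for `text`."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")

    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)

    # Standard published coefficients for both formulas.
    reading_ease = 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word
    grade_level = 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59
    return reading_ease, grade_level


if __name__ == "__main__":
    ease, grade = readability("The cat sat on the mat.")
    print(f"Reading Ease: {ease:.2f}, Grade Level: {grade:.2f}")
```

Higher Reading Ease means easier text, while Grade Level approximates the US school grade needed to follow it. The two scores therefore move in opposite directions as sentences and words get longer.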
Covariates: No formal covariates were included; exploratory correlations between variables were performed.
Analyses: Descriptive statistics, 1-way analysis of variance with Tukey's post hoc tests, Pearson correlation, and χ² tests were performed. Statistical significance was set at P < .05.
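As an illustration of the test statistics named above, the sketch below computes a 1-way ANOVA F statistic, a Pearson correlation coefficient, and a χ² statistic for a contingency table in plain Python. The function names are hypothetical and every number in the usage section is a made-up placeholder, because the study's raw data are not part of this abstract. P values and Tukey's post hoc comparisons are deliberately left out, since they require reference distributions that a statistics package would normally supply.

```python
from math import sqrt


def anova_f(groups):
    """1-way ANOVA: return (F, df_between, df_within) for a list of samples."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between, df_within = k - 1, n_total - k
    return (ss_between / df_between) / (ss_within / df_within), df_between, df_within


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for a 2-D count table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat, (len(row_totals) - 1) * (len(col_totals) - 1)


if __name__ == "__main__":
    # Placeholder values only; NOT data from the study.
    print(anova_f([[1, 2, 3], [2, 3, 4], [3, 4, 5]]))
    print(pearson_r([1, 2, 3], [2, 4, 6]))
    print(chi_square([[10, 20], [20, 10]]))
```

In this study's design, the ANOVA groups would correspond to the three chatbots, and the χ² table to counts of authentic versus hallucinated references per chatbot.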
Results: No statistically significant differences were observed in accuracy (P = .2) or empathy (P = .2). The mixture-of-experts transformer provided the most readable content (Flesch-Kincaid Reading Ease = 28.47; Flesch-Kincaid Grade Level = 12.19; P < .001). The transformer-based autoregressive model produced the highest proportion of hallucinated references (47.2%), compared with the multimodal transformer (18.8%) and the mixture-of-experts transformer (10.1%) (P < .001). A weak positive correlation was found between accuracy and readability (r = 0.27; P = .03), with no correlation between accuracy and empathy.
Conclusions and relevance: While all LLM-based chatbots delivered generally accurate and empathetic responses, the mixture-of-experts transformer outperformed others in readability and citation reliability. The high rate of hallucinated references in the transformer-based autoregressive model underscores the need for human oversight in clinical applications.
About the journal:
This monthly journal offers comprehensive coverage of new techniques, important developments and innovative ideas in oral and maxillofacial surgery. Practice-applicable articles help develop the methods used to handle dentoalveolar surgery, facial injuries and deformities, TMJ disorders, oral cancer, jaw reconstruction, anesthesia and analgesia. The journal also includes specifics on new instruments and diagnostic equipment and modern therapeutic drugs and devices. Journal of Oral and Maxillofacial Surgery is recommended for first or priority subscription by the Dental Section of the Medical Library Association.