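Comparative Evaluation of ChatGPT-4o and Grok-3 on Cleft Lip and Palate and Presurgical Infant Orthopedics: A Multidisciplinary Assessment by Orthodontists, Pediatricians, and Plastic Surgeons.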

IF 1.3 | Zone 4, Medicine | JCR Q2, Dentistry

Cleft Palate-Craniofacial Journal | Pub Date: 2025-09-16 | DOI: 10.1177/10556656251378591

Esra Ekizer, Kevser Kurt Demirsoy, Süleyman Kutalmış Büyük, Semih Canpolat, Ahmet Bilirer


Citations: 0

Abstract

Objective: This study aimed to evaluate and compare the accuracy, clarity, and clinical applicability of 2 state-of-the-art large language models (LLMs), Chat Generative Pretrained Transformer (ChatGPT)-4o and Grok-3, in generating health information related to cleft lip and palate (CLP) and presurgical infant orthopedics (PSIO). To ensure a multidisciplinary perspective, experts from orthodontics, pediatrics, and plastic surgery independently evaluated the responses. Methods: Six structured questions addressing general and presurgical aspects of CLP were submitted to both ChatGPT-4o and Grok-3. Forty-five blinded specialists (15 from each specialty) assessed the 12 generated responses using 2 validated instruments: the DISCERN tool and the Global Quality Scale (GQS). We conducted interspecialty comparisons to explore variations in model evaluation. Results: We observed no statistically significant differences between ChatGPT-4o and Grok-3 in DISCERN or GQS scores (P > .05). However, pediatricians consistently assigned higher ratings than orthodontists and plastic surgeons in terms of reliability, clarity, and treatment-related content. Patient-directed questions received higher overall scores than those aimed at healthcare professionals. Grok-3 performed slightly better on questions about PSIO, whereas ChatGPT-4o provided more comprehensive and structured answers. Conclusion: Both LLMs demonstrated notable potential in producing readable, informative responses about CLP and PSIO. While they may aid in patient communication and support clinical education, professional oversight remains critical to ensure medical accuracy. The inclusion of Grok-3 in this orthodontic evaluation provides valuable insights and sets the stage for future research on artificial intelligence integration in interdisciplinary cleft care.


Source journal

Cleft Palate-Craniofacial Journal DENTISTRY, ORAL SURGERY & MEDICINE-SURGERY

CiteScore: 2.20

Self-citation rate: 36.40%

Publication volume: not listed

Review time: 4-8 weeks

About the journal: The Cleft Palate-Craniofacial Journal (CPCJ) is the premier peer-reviewed, interdisciplinary, international journal dedicated to current research on etiology, prevention, diagnosis, and treatment in all areas pertaining to craniofacial anomalies. CPCJ reports on basic science and clinical research aimed at better elucidating the pathogenesis, pathology, and optimal methods of treatment of cleft and craniofacial anomalies. The journal strives to foster communication and cooperation among professionals from all specialties.