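Evaluation Large Language Models' Time Dependent Consistency in Aesthetic Surgery Consultations and Comparison of Their Performance Across Different Clinical Domains.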

IF 2.8 | CAS Zone 3, Medicine | JCR Q2, Surgery

Aesthetic Plastic Surgery | Pub Date: 2025-10-22 | DOI: 10.1007/s00266-025-05308-7

Munur Selcuk Kendir, Muaz Zuhurlu


Citations: 0

Abstract


Evaluation Large Language Models' Time Dependent Consistency in Aesthetic Surgery Consultations and Comparison of Their Performance Across Different Clinical Domains.

Background: The integration of large language models (LLMs) into plastic and aesthetic surgery has shown promise. However, research comparing different LLMs in handling clinical scenarios and their temporal consistency remains limited. This study evaluated the performance of ChatGPT-4o, Gemini 1.5 Pro, and Claude 3.5 Sonnet in aesthetic surgery scenarios. The objectives were to compare their overall performance, analyze reliability in complicated and uncomplicated cases, assess temporal consistency, and evaluate performance across five clinical domains: preoperative cautions, postoperative care, holistic approach, algorithmic approach, and surgical planning.

Methods: Twenty-four case scenarios (12 complicated, 12 uncomplicated) were input into the LLMs at three time points (T1, T2, T3) over two weeks. Three blinded board-certified plastic surgeons evaluated responses using a 5-point Likert scale. Statistical analyses were applied.
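The abstract does not name the statistical tests that were used. Purely as an illustration, one common way to check whether ordinal scores drift across three repeated time points is the Friedman test. The sketch below is a minimal pure-Python version under stated assumptions: the function names are hypothetical, the input is one (T1, T2, T3) score triple per scenario for a single model, and the example scores are invented for demonstration and are not the study's data.

```python
import math


def average_ranks(values):
    """Rank values ascending; tied values share the mean of their positions.

    Returns the ranks (in input order) and the size of each tie group.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    tie_sizes = []
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 1-based
        for idx in order[i:j + 1]:
            ranks[idx] = mean_rank
        tie_sizes.append(j - i + 1)
        i = j + 1
    return ranks, tie_sizes


def friedman_three_timepoints(scores):
    """Tie-corrected Friedman statistic and p-value for k = 3 time points.

    scores: list of (T1, T2, T3) tuples, one per scenario (assumed layout).
    """
    n, k = len(scores), 3
    rank_sums = [0.0] * k
    ties = 0
    for row in scores:
        ranks, tie_sizes = average_ranks(list(row))
        for j in range(k):
            rank_sums[j] += ranks[j]
        ties += sum(t * (t * t - 1) for t in tie_sizes)
    correction = 1 - ties / (n * k * (k * k - 1))
    if correction == 0:
        # Every scenario scored identically at all three time points.
        return 0.0, 1.0
    raw = 12 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3 * n * (k + 1)
    q = raw / correction
    # With k - 1 = 2 degrees of freedom the chi-square tail is exp(-q / 2).
    p = math.exp(-q / 2)
    return q, p


# Hypothetical Likert scores for six scenarios (illustration only).
example_scores = [(5, 5, 4), (4, 5, 5), (5, 4, 4), (3, 4, 3), (5, 5, 5), (4, 3, 4)]
q_stat, p_value = friedman_three_timepoints(example_scores)
print(f"Friedman Q = {q_stat:.3f}, p = {p_value:.3f}")
```

Under this reading, a p-value above 0.05 would correspond to the "temporal stability" wording in the Results, while a p-value below 0.05 would indicate that scores differed between time points. The closed-form p-value holds only for exactly three time points; other designs would need a general chi-square tail function.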

Results: ChatGPT-4o achieved the highest mean score (4.92), outperforming Gemini 1.5 Pro (3.62) and Claude 3.5 Sonnet (3.21) (p < 0.001). It performed consistently across complicated (4.87) and uncomplicated cases (4.96) (p > 0.05) and demonstrated temporal stability (p > 0.05). Gemini 1.5 Pro showed temporal consistency in complicated cases (p > 0.05), but not in uncomplicated cases. Claude 3.5 Sonnet exhibited significant temporal inconsistencies (p < 0.05). In the domain-specific analyses, ChatGPT-4o was superior to the other models. Claude 3.5 Sonnet had the lowest scores in most domains, except algorithmic approach, where it outperformed Gemini 1.5 Pro (4.4 vs. 4.1, p < 0.05).

Conclusions: LLMs could be a promising tool for supporting surgical decision-making. Future research should aim to enhance LLM reliability and validate real-world applications of these models.

Level of evidence i: This journal requires that authors assign a level of evidence to each article. For a full description of these Evidence-Based Medicine ratings, please refer to the Table of Contents or the online Instructions to Authors www.springer.com/00266 .


Source journal

Aesthetic Plastic Surgery (Medicine: Surgery)

CiteScore: 4.40

Self-citation rate: 25.00%

Articles published: 479

Review time: 3 months

About the journal: Aesthetic Plastic Surgery is a publication of the International Society of Aesthetic Plastic Surgery and the official journal of the European Association of Societies of Aesthetic Plastic Surgery (EASAPS), Società Italiana di Chirurgia Plastica Ricostruttiva ed Estetica (SICPRE), Vereinigung der Deutschen Aesthetisch Plastischen Chirurgen (VDAPC), the Romanian Aesthetic Surgery Society (RASS), Asociación Española de Cirugía Estética Plástica (AECEP), La Sociedad Argentina de Cirugía Plástica, Estética y Reparadora (SACPER), the Rhinoplasty Society of Europe (RSE), the Iranian Society of Plastic and Aesthetic Surgeons (ISPAS), the Singapore Association of Plastic Surgeons (SAPS), the Australasian Society of Aesthetic Plastic Surgeons (ASAPS), the Egyptian Society of Plastic and Reconstructive Surgeons (ESPRS), and the Sociedad Chilena de Cirugía Plástica, Reconstructiva y Estética (SCCP). Aesthetic Plastic Surgery provides a forum for original articles advancing the art of aesthetic plastic surgery. Many describe surgical craftsmanship; others deal with complications in surgical procedures and methods by which to treat or avoid them. Coverage includes "second thoughts" on established techniques, which might be abandoned, modified, or improved. Also included are case histories; improvements in surgical instruments, pharmaceuticals, and operating room equipment; and discussions of problems such as the role of psychosocial factors in the doctor-patient and the patient-public interrelationships. Aesthetic Plastic Surgery is covered in Current Contents/Clinical Medicine, SciSearch, Research Alert, Index Medicus-Medline, and Excerpta Medica/Embase.