Evaluation Large Language Models' Time Dependent Consistency in Aesthetic Surgery Consultations and Comparison of Their Performance Across Different Clinical Domains.
Authors: Munur Selcuk Kendir, Muaz Zuhurlu
Journal: Aesthetic Plastic Surgery (impact factor 2.8; JCR Q2, Surgery)
Publication date: 2025-10-22
Publication type: Journal Article
DOI: 10.1007/s00266-025-05308-7 (https://doi.org/10.1007/s00266-025-05308-7)
Background: The integration of large language models (LLMs) into plastic and aesthetic surgery has shown promise. However, research comparing how different LLMs handle clinical scenarios, and how consistent their responses remain over time, is limited. This study evaluated the performance of ChatGPT-4o, Gemini 1.5 Pro and Claude 3.5 Sonnet in aesthetic surgery scenarios. The objectives were to compare their overall performance, analyze reliability in complicated and uncomplicated cases, assess temporal consistency, and evaluate performance across five clinical domains: preoperative cautions, postoperative care, holistic approach, algorithmic approach, and surgical planning.
Methods: Twenty-four case scenarios (12 complicated, 12 uncomplicated) were input into the LLMs at three time points (T1, T2, T3) over two weeks. Three blinded board-certified plastic surgeons evaluated responses using a 5-point Likert scale. Statistical analyses were applied.
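The abstract does not name the statistical tests that were used. Purely as an illustration of how temporal consistency across three time points could be checked in a repeated-measures design like this one, the sketch below computes a tie-corrected Friedman test in plain Python. The choice of test, the function names, and the example scores are all assumptions made for this sketch; none of them come from the study.

```python
import math
from typing import Sequence


def average_ranks(values: Sequence[float]) -> list[float]:
    """Rank values from 1..k, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # positions are 0-based, ranks 1-based
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks


def friedman_three_timepoints(scores: Sequence[Sequence[float]]) -> tuple[float, float]:
    """Tie-corrected Friedman test for exactly three repeated measurements.

    scores holds one (T1, T2, T3) row per scenario, e.g. the mean rater
    score for that scenario at each time point (hypothetical layout).
    Returns (Q, p), where p uses the chi-square approximation with 2 degrees
    of freedom, which is only rough for small numbers of scenarios.
    """
    k = 3
    n = len(scores)
    if n == 0 or any(len(row) != k for row in scores):
        raise ValueError("expected a non-empty list of (T1, T2, T3) rows")

    rank_sums = [0.0] * k
    tie_term = 0.0
    for row in scores:
        row = list(row)
        for j, rank in enumerate(average_ranks(row)):
            rank_sums[j] += rank
        for value in set(row):
            t = row.count(value)
            tie_term += t**3 - t  # contributes 0 when the value is not tied

    q = 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)
    correction = 1.0 - tie_term / (n * k * (k * k - 1))
    if correction == 0.0:
        # Every scenario received identical scores at all three time points.
        return 0.0, 1.0
    q /= correction
    p = math.exp(-q / 2.0)  # chi-square survival function for 2 degrees of freedom
    return q, p


if __name__ == "__main__":
    # Hypothetical placeholder scores, NOT data from the study.
    example = [(5, 5, 4), (4, 5, 5), (5, 4, 4), (5, 5, 5)]
    q_stat, p_value = friedman_three_timepoints(example)
    print(f"Q = {q_stat:.3f}, p = {p_value:.3f}")
```

Under this reading, a p-value above 0.05 would be taken as no detectable drift between T1, T2 and T3 for a given model. A library routine such as SciPy's `friedmanchisquare` would normally be preferred to a hand-written version.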
Results: ChatGPT-4o achieved the highest mean score (4.92), outperforming Gemini 1.5 Pro (3.62) and Claude 3.5 Sonnet (3.21) (p < 0.001). It performed consistently across complicated (4.87) and uncomplicated (4.96) cases (p > 0.05) and demonstrated temporal stability (p > 0.05). Gemini 1.5 Pro showed temporal consistency in complicated cases (p > 0.05) but not in uncomplicated cases. Claude 3.5 Sonnet exhibited significant temporal inconsistencies (p < 0.05). In the domain-specific analyses, ChatGPT-4o was superior to the other models. Claude 3.5 Sonnet had the lowest scores in most domains; the exception was the algorithmic approach, where it outperformed Gemini 1.5 Pro (4.4 vs. 4.1, p < 0.05).
Conclusions: LLMs could be promising tools for supporting surgical decision-making. Future research should aim to enhance the reliability of LLMs and validate their real-world applications.
Level of Evidence I: This journal requires that authors assign a level of evidence to each article. For a full description of these Evidence-Based Medicine ratings, please refer to the Table of Contents or the online Instructions to Authors at www.springer.com/00266.
About the journal:
Aesthetic Plastic Surgery is a publication of the International Society of Aesthetic Plastic Surgery and the official journal of the European Association of Societies of Aesthetic Plastic Surgery (EASAPS), Società Italiana di Chirurgia Plastica Ricostruttiva ed Estetica (SICPRE), Vereinigung der Deutschen Aesthetisch Plastischen Chirurgen (VDAPC), the Romanian Aesthetic Surgery Society (RASS), Asociación Española de Cirugía Estética Plástica (AECEP), La Sociedad Argentina de Cirugía Plástica, Estética y Reparadora (SACPER), the Rhinoplasty Society of Europe (RSE), the Iranian Society of Plastic and Aesthetic Surgeons (ISPAS), the Singapore Association of Plastic Surgeons (SAPS), the Australasian Society of Aesthetic Plastic Surgeons (ASAPS), the Egyptian Society of Plastic and Reconstructive Surgeons (ESPRS), and the Sociedad Chilena de Cirugía Plástica, Reconstructiva y Estética (SCCP).
Aesthetic Plastic Surgery provides a forum for original articles advancing the art of aesthetic plastic surgery. Many describe surgical craftsmanship; others deal with complications in surgical procedures and methods by which to treat or avoid them. Coverage includes "second thoughts" on established techniques, which might be abandoned, modified, or improved. Also included are case histories; improvements in surgical instruments, pharmaceuticals, and operating room equipment; and discussions of problems such as the role of psychosocial factors in the doctor-patient and the patient-public interrelationships.
Aesthetic Plastic Surgery is covered in Current Contents/Clinical Medicine, SciSearch, Research Alert, Index Medicus-Medline, and Excerpta Medica/Embase.