Thisun Udagedara, Ashley Tran, Sumaya Bokhari, Sharon Shiraga, Stuart Abel, Caitlin Houghton, Katie Galvin, Kamran Samakar, Luke R Putnam
{"title":"流行人工智能聊天机器人治疗腹股沟疝的准确性和可读性对比分析。","authors":"Thisun Udagedara, Ashley Tran, Sumaya Bokhari, Sharon Shiraga, Stuart Abel, Caitlin Houghton, Katie Galvin, Kamran Samakar, Luke R Putnam","doi":"10.1177/00031348251353065","DOIUrl":null,"url":null,"abstract":"<p><p>BackgroundArtificial intelligence (AI), particularly large language models (LLMs), has gained attention for its clinical applications. While LLMs have shown utility in various medical fields, their performance in inguinal hernia repair (IHR) remains understudied. This study seeks to evaluate the accuracy and readability of LLM-generated responses to IHR-related questions, as well as their performance across distinct clinical categories.MethodsThirty questions were developed based on clinical guidelines for IHR and categorized into four subgroups: diagnosis, perioperative care, surgical management, and other. Questions were entered into Microsoft Copilot®, Google Gemini®, and OpenAI ChatGPT-4®. Responses were anonymized and evaluated by six fellowship-trained, minimally invasive surgeons using a validated 5-point Likert scale. Readability was assessed with six validated formulae.ResultsGPT-4 and Gemini outperformed Copilot in overall mean scores for response accuracy (Copilot: 3.75 ± 0.99, Gemini: 4.35 ± 0.82, and GPT-4: 4.30 ± 0.89; <i>P</i> < 0.001). Subgroup analysis revealed significantly higher scores for Gemini and GPT-4 in perioperative care (<i>P</i> = 0.025) and surgical management (<i>P</i> < 0.001). Readability scores were comparable across models, with all responses at college to college-graduate reading levels.DiscussionThis study highlights the variability in LLM performance, with GPT-4 and Gemini producing higher-quality responses than Copilot for IHR-related questions. However, the consistently high reading level of responses may limit accessibility for patients. These findings underscore the potential of LLMs to serve as valuable adjunct tools in surgical practice, with ongoing advancements expected to further enhance their accuracy, readability, and applicability.</p>","PeriodicalId":7782,"journal":{"name":"American Surgeon","volume":" ","pages":"31348251353065"},"PeriodicalIF":1.0000,"publicationDate":"2025-06-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"A Comparative Analysis of the Accuracy and Readability of Popular Artificial Intelligence-Chat Bots for Inguinal Hernia Management.\",\"authors\":\"Thisun Udagedara, Ashley Tran, Sumaya Bokhari, Sharon Shiraga, Stuart Abel, Caitlin Houghton, Katie Galvin, Kamran Samakar, Luke R Putnam\",\"doi\":\"10.1177/00031348251353065\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><p>BackgroundArtificial intelligence (AI), particularly large language models (LLMs), has gained attention for its clinical applications. While LLMs have shown utility in various medical fields, their performance in inguinal hernia repair (IHR) remains understudied. This study seeks to evaluate the accuracy and readability of LLM-generated responses to IHR-related questions, as well as their performance across distinct clinical categories.MethodsThirty questions were developed based on clinical guidelines for IHR and categorized into four subgroups: diagnosis, perioperative care, surgical management, and other. Questions were entered into Microsoft Copilot®, Google Gemini®, and OpenAI ChatGPT-4®. 
Responses were anonymized and evaluated by six fellowship-trained, minimally invasive surgeons using a validated 5-point Likert scale. Readability was assessed with six validated formulae.ResultsGPT-4 and Gemini outperformed Copilot in overall mean scores for response accuracy (Copilot: 3.75 ± 0.99, Gemini: 4.35 ± 0.82, and GPT-4: 4.30 ± 0.89; <i>P</i> < 0.001). Subgroup analysis revealed significantly higher scores for Gemini and GPT-4 in perioperative care (<i>P</i> = 0.025) and surgical management (<i>P</i> < 0.001). Readability scores were comparable across models, with all responses at college to college-graduate reading levels.DiscussionThis study highlights the variability in LLM performance, with GPT-4 and Gemini producing higher-quality responses than Copilot for IHR-related questions. However, the consistently high reading level of responses may limit accessibility for patients. These findings underscore the potential of LLMs to serve as valuable adjunct tools in surgical practice, with ongoing advancements expected to further enhance their accuracy, readability, and applicability.</p>\",\"PeriodicalId\":7782,\"journal\":{\"name\":\"American Surgeon\",\"volume\":\" \",\"pages\":\"31348251353065\"},\"PeriodicalIF\":1.0000,\"publicationDate\":\"2025-06-25\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"American Surgeon\",\"FirstCategoryId\":\"3\",\"ListUrlMain\":\"https://doi.org/10.1177/00031348251353065\",\"RegionNum\":4,\"RegionCategory\":\"医学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q3\",\"JCRName\":\"SURGERY\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"American Surgeon","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1177/00031348251353065","RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"SURGERY","Score":null,"Total":0}
A Comparative Analysis of the Accuracy and Readability of Popular Artificial Intelligence-Chat Bots for Inguinal Hernia Management.
Background: Artificial intelligence (AI), particularly large language models (LLMs), has gained attention for its clinical applications. While LLMs have shown utility in various medical fields, their performance in inguinal hernia repair (IHR) remains understudied. This study seeks to evaluate the accuracy and readability of LLM-generated responses to IHR-related questions, as well as their performance across distinct clinical categories.

Methods: Thirty questions were developed based on clinical guidelines for IHR and categorized into four subgroups: diagnosis, perioperative care, surgical management, and other. Questions were entered into Microsoft Copilot®, Google Gemini®, and OpenAI ChatGPT-4®. Responses were anonymized and evaluated by six fellowship-trained minimally invasive surgeons using a validated 5-point Likert scale. Readability was assessed with six validated formulae.

Results: GPT-4 and Gemini outperformed Copilot in overall mean scores for response accuracy (Copilot: 3.75 ± 0.99, Gemini: 4.35 ± 0.82, GPT-4: 4.30 ± 0.89; P < 0.001). Subgroup analysis revealed significantly higher scores for Gemini and GPT-4 in perioperative care (P = 0.025) and surgical management (P < 0.001). Readability scores were comparable across models, with all responses at college to college-graduate reading levels.

Discussion: This study highlights the variability in LLM performance, with GPT-4 and Gemini producing higher-quality responses than Copilot for IHR-related questions. However, the consistently high reading level of responses may limit accessibility for patients. These findings underscore the potential of LLMs to serve as valuable adjunct tools in surgical practice, with ongoing advancements expected to further enhance their accuracy, readability, and applicability.
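To illustrate the kind of analysis the Methods and Results describe, the sketch below scores a chatbot response with six common readability formulae and compares accuracy ratings across the three models. This is a minimal sketch under stated assumptions: the abstract does not name the six formulae (Flesch Reading Ease, Flesch-Kincaid Grade, Gunning Fog, SMOG, Coleman-Liau, and ARI are assumed here) or the statistical test (a Kruskal-Wallis test is used as a stand-in), and the Likert scores and sample text are placeholders, not study data.

```python
# Minimal sketch of a readability + accuracy-rating analysis.
# Assumptions (not stated in the abstract): the six readability formulae and
# the Kruskal-Wallis test are illustrative choices; all data below are made up.
# Requires: pip install textstat scipy

import textstat
from scipy.stats import kruskal


def readability_profile(text: str) -> dict[str, float]:
    """Score one chatbot response with six commonly used readability formulae."""
    return {
        "flesch_reading_ease": textstat.flesch_reading_ease(text),
        "flesch_kincaid_grade": textstat.flesch_kincaid_grade(text),
        "gunning_fog": textstat.gunning_fog(text),
        "smog_index": textstat.smog_index(text),
        "coleman_liau_index": textstat.coleman_liau_index(text),
        "automated_readability_index": textstat.automated_readability_index(text),
    }


# Hypothetical reviewer Likert scores (1-5), one value per rated response.
copilot_scores = [4, 3, 4, 3, 5, 4, 3, 4]
gemini_scores = [5, 4, 5, 4, 5, 4, 4, 5]
gpt4_scores = [5, 4, 4, 5, 4, 5, 4, 4]

# Non-parametric comparison of accuracy ratings across the three chatbots.
stat, p_value = kruskal(copilot_scores, gemini_scores, gpt4_scores)
print(f"Kruskal-Wallis H = {stat:.2f}, P = {p_value:.3f}")

# Readability check on a placeholder response.
sample_response = (
    "Inguinal hernia repair is typically performed via open or laparoscopic "
    "approaches, and the choice depends on patient factors and surgeon expertise."
)
for name, score in readability_profile(sample_response).items():
    print(f"{name}: {score:.1f}")
```

In practice, grade-level outputs from formulae like these could then be averaged per model and compared against the roughly sixth-grade reading level commonly recommended for patient-facing materials, which is the gap the Discussion highlights.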
Journal Introduction:
The American Surgeon is a monthly peer-reviewed publication published by the Southeastern Surgical Congress. Its area of concentration is clinical general surgery, as defined by the content areas of the American Board of Surgery: alimentary tract (including bariatric surgery), abdomen and its contents, breast, skin and soft tissue, endocrine system, solid organ transplantation, pediatric surgery, surgical critical care, surgical oncology (including head and neck surgery), trauma and emergency surgery, and vascular surgery.