Artificial Intelligence in Leg and Foot Pathology: Can Large Language Models Replace Our Practice?

Florencio Pablo Segura, Facundo Segura, J. Porta, Natalia Heredia, Ignacio Masquijo, Federico Anain, Leando Casola, Agustina Trevisson, V. Cafruni, Maria Paz Lucero Zudaire, I. Toledo, Florencio Vicente Segura
{"title":"Inteligencia Artificial en patología de pierna y pie: ¿Pueden los los grandes modelos de lenguaje reemplazar nuestra práctica?","authors":"Florencio Pablo Segura, Facundo Segura, J. Porta, Natalia Heredia, Ignacio Masquijo, Federico Anain, Leando Casola, Agustina Trevisson, V. Cafruni, Maria Paz Lucero Zudaire, I. Toledo, Florencio Vicente Segura","doi":"10.30795/jfootankle.2024.v18.1757","DOIUrl":null,"url":null,"abstract":"Objective: Determine if large language models (LLMs) provide better or similar information compared to an expert trained in foot and ankle pathology in various aspects of daily practice (definition and treatment of pathology, general questions). Methods: Three experts and two artificial intelligent (AI) models, ChatGPT (GPT-4) and Google Bard, answered 15 specialty-related questions, divided equally among definitions, treatments, and general queries. After coding, responses were redistributed and evaluated by five additional experts, assessing aspects like clarity, factual accuracy, and patient usefulness. The Likert scale was used to score each question, enabling experts to gauge their agreement with the provided information. Results: Using the Likert scale, each question could score between 5 and 25 points, totaling 375 or 75 points for evaluations. Expert 2 led with 69.86%, followed by Expert 1 at 68.53%, ChatGPT at 64.80%, Expert 3 at 58.40%, and Google Bard at 54.93%. Comparing experts, significant differences emerged, especially with Google Bard. The rankings varied in specific sections like definitions and treatments, highlighting GPT-4’s variability across sections. The results emphasize the differences in performance among experts and AI models. Conclusion: Our findings indicate that GPT-4 often performed comparably to or even better than experts, particularly in definition and general question sections. However, both LLMs lagged notably in the treatment section. These results underscore the potential of LLMs as valuable tools in orthopedics but highlight their limitations, emphasizing the irreplaceable role of expert expertise in intricate medical contexts. Evidence Level: III, observational, analytics.","PeriodicalId":436014,"journal":{"name":"Journal of the Foot & Ankle","volume":"57 7","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-05-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of the Foot & Ankle","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.30795/jfootankle.2024.v18.1757","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Objective: To determine whether large language models (LLMs) provide better or similar information compared with an expert trained in foot and ankle pathology across several aspects of daily practice (definition of pathology, treatment, and general questions). Methods: Three experts and two artificial intelligence (AI) models, ChatGPT (GPT-4) and Google Bard, answered 15 specialty-related questions divided equally among definitions, treatments, and general queries. After coding, the responses were redistributed and evaluated by five additional experts, who assessed aspects such as clarity, factual accuracy, and usefulness to patients. A Likert scale was used to score each question, allowing the evaluators to gauge their agreement with the information provided. Results: With the Likert scale, each question could score between 5 and 25 points, so each respondent's total could range from 75 to 375 points. Expert 2 led with 69.86%, followed by Expert 1 at 68.53%, ChatGPT at 64.80%, Expert 3 at 58.40%, and Google Bard at 54.93%. Comparisons among respondents revealed significant differences, particularly involving Google Bard. Rankings varied across specific sections such as definitions and treatments, highlighting GPT-4's variability between sections. These results emphasize the performance differences among experts and AI models. Conclusion: Our findings indicate that GPT-4 often performed comparably to, or even better than, the experts, particularly in the definition and general question sections. However, both LLMs lagged notably in the treatment section. These results underscore the potential of LLMs as valuable tools in orthopedics but also highlight their limitations, emphasizing the irreplaceable role of specialist expertise in complex medical contexts. Evidence Level: III, observational, analytical.
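As an illustrative note (not part of the original article), the scoring arithmetic described above can be made concrete with a short sketch. It assumes five evaluators each rate an answer on a 1-5 Likert scale, so a single question scores between 5 and 25 points, a 15-question set scores between 75 and 375 points, and the percentages reported in the Results correspond to a respondent's total divided by the 375-point maximum; the example total of 262 points is hypothetical.

```python
# Illustrative sketch (assumed scoring scheme, not taken from the paper):
# converting summed Likert points into the percentage scores reported in the Results.

N_QUESTIONS = 15        # questions answered by each respondent
N_EVALUATORS = 5        # experts rating each answer
LIKERT_MIN, LIKERT_MAX = 1, 5

# Per-question range: 5 evaluators x 1..5 points -> 5..25 points per question
min_total = N_QUESTIONS * N_EVALUATORS * LIKERT_MIN   # 75 points
max_total = N_QUESTIONS * N_EVALUATORS * LIKERT_MAX   # 375 points

def percentage(total_points: int) -> float:
    """Express a respondent's summed Likert points as a share of the 375-point maximum."""
    return 100 * total_points / max_total

# Hypothetical example: 262 points out of 375 is roughly 69.9%,
# on the same scale as the percentages reported for the experts and LLMs.
print(f"{percentage(262):.2f}%")
```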