Evaluating the progression of artificial intelligence and large language models in medicine through comparative analysis of ChatGPT-3.5 and ChatGPT-4 in generating vascular surgery recommendations

Arshia P. Javidan MD, MSc, Tiam Feridooni MD, PhD, Lauren Gordon MD, PhD, Sean A. Crawford MD, PhD

JVS-Vascular Insights, Volume 2, Article 100049, 2024. https://doi.org/10.1016/j.jvsvi.2023.100049
Objective
Artificial intelligence (AI) is becoming increasingly integrated into clinical medicine. Generative AI models, particularly large language models (LLMs) such as ChatGPT-3.5 and ChatGPT-4, have shown promise in generating human-like text, providing a potential tool for augmenting clinical care. These online AI chatbots have already demonstrated remarkable clinical potential, having passed the United States Medical Licensing Examination (USMLE), for example. Evaluation of these LLMs in the surgical literature, especially as it applies to surgical judgment and decision-making, is sparse. This study aimed to (1) evaluate the efficacy of ChatGPT-4 in providing clinician-level vascular surgery recommendations and (2) compare its performance with that of its predecessor, ChatGPT-3.5, to gauge the progression of the clinical competencies of LLMs.
Methods
Clinical experts generated a set of 40 clinician-level questions spanning four domains of vascular surgery: carotid artery disease, visceral artery aneurysms, abdominal aortic aneurysms, and chronic limb-threatening ischemia. These domains were chosen based on the availability of updated guidelines published before September 2021, the training data cutoff date for both LLMs. The questions, without additional context or prompting, were input into ChatGPT-3.5 and ChatGPT-4 between March 20 and March 25, 2023. Responses were independently evaluated by two blinded reviewers using a 5-point Likert scale assessing comprehensiveness, accuracy, and consistency with guidelines. The Flesch-Kincaid grade level of each response was also determined. The independent samples t test and Fisher's exact test were used for comparative analysis.
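The querying and readability steps can be outlined in a short script. This is a minimal sketch, not the authors' code: the study entered questions through the ChatGPT web interface, so the OpenAI API client, the model identifiers, the questions.txt file, and the use of the textstat package to compute the Flesch-Kincaid grade level are all illustrative assumptions.

    # Minimal sketch of the querying-and-readability pipeline.
    # Assumptions: API access stands in for the ChatGPT web interface;
    # "questions.txt" holds one clinician-level question per line.
    from openai import OpenAI
    import textstat

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def ask(model: str, question: str) -> str:
        """Submit one question with no added context or prompting."""
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": question}],
        )
        return response.choices[0].message.content

    with open("questions.txt") as f:
        questions = [line.strip() for line in f if line.strip()]

    for q in questions:
        for model in ("gpt-3.5-turbo", "gpt-4"):
            answer = ask(model, q)
            # Flesch-Kincaid grade = 0.39*(words/sentences)
            #                      + 11.8*(syllables/words) - 15.59
            grade = textstat.flesch_kincaid_grade(answer)
            print(model, len(answer.split()), grade)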
Results
ChatGPT-4 significantly outperformed ChatGPT-3.5, providing appropriate recommendations for 38 of 40 questions (95%) compared with 13 of 40 (32.5%) for ChatGPT-3.5 (Fisher's exact test, P < .001). Despite longer responses (ChatGPT-4 mean, 317 ± 58 words vs ChatGPT-3.5 mean, 265 ± 74 words; P < .001), the reading ease of the two models remained similar, corresponding to college graduate-level texts.
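As a rough arithmetic check, both reported comparisons can be reconstructed from the summary figures quoted above. The sketch below uses SciPy and assumes the abstract's counts and word-count summaries stand in for the raw study data.

    # Reconstruction of the reported statistics from summary values
    # quoted in the abstract (an approximation, not the raw data).
    from scipy.stats import fisher_exact, ttest_ind_from_stats

    # 2x2 table of appropriate vs. inappropriate recommendations:
    #                 appropriate  inappropriate
    #   ChatGPT-4          38            2
    #   ChatGPT-3.5        13           27
    odds_ratio, p_fisher = fisher_exact([[38, 2], [13, 27]])
    print(f"Fisher's exact test: OR = {odds_ratio:.1f}, P = {p_fisher:.2e}")

    # Independent samples t test on response length, rebuilt from the
    # reported means and standard deviations (n = 40 per model).
    t_stat, p_t = ttest_ind_from_stats(
        mean1=317, std1=58, nobs1=40,   # ChatGPT-4
        mean2=265, std2=74, nobs2=40,   # ChatGPT-3.5
    )
    print(f"t test: t = {t_stat:.2f}, P = {p_t:.4f}")

Run as written, both P values fall below .001, consistent with the results reported above.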
Conclusions
ChatGPT-4 can consistently respond accurately to complex clinician-level vascular surgery questions. This represents a substantial advancement in performance over its predecessor, released only a few months prior, and highlights the rapid progress of LLMs in clinical medicine. Several limitations persist with the use of LLMs, including hallucinations, data privacy issues, and the black box problem. However, these findings suggest that, with further refinement, LLMs like ChatGPT-4 have the potential to become indispensable tools in clinical decision-making, marking an exciting frontier in the fusion of AI with clinical medicine and vascular surgery.