Esteban Cabezas, David Toro-Tobon, Thomas Johnson, Marco Álvarez, Javad R Azadi, Camilo Gonzalez-Velasquez, Naykky Singh Ospina, Oscar J Ponce, Megan E Branda, Juan P Brito
Endocrine Practice. Published 2025-03-24. DOI: 10.1016/j.eprac.2025.03.008 (https://doi.org/10.1016/j.eprac.2025.03.008). Impact factor: 3.7; JCR quartile: Q2 (Endocrinology & Metabolism).
ChatGPT-4's Accuracy in Estimating Thyroid Nodule Features and Cancer Risk from Ultrasound Images.
Objective: To evaluate the performance of GPT-4 and GPT-4o in accurately identifying features and categories from thyroid nodule ultrasound (TUS) images following the American College of Radiology Thyroid Imaging Reporting and Data System (TI-RADS).
Methods: This comparative validation study, conducted between October 2023 and May 2024, used 202 TUS images sourced from three open-access databases. Both complete and cropped versions of each image were independently evaluated by expert radiologists to establish a reference standard for TI-RADS features and categories. GPT-4 and GPT-4o were prompted to analyze each image, and their generated TI-RADS outputs were compared to the reference standard.
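The comparison against the reference standard can be pictured as a one-vs-rest tally per TI-RADS category. The sketch below is only an illustration of that idea, not the authors' analysis code: the function name `one_vs_rest_counts` and the six toy labels are hypothetical, and the abstract does not describe how the comparison was implemented.

```python
from collections import Counter


def one_vs_rest_counts(reference, predicted, category):
    """Tally TP/FN/FP/TN for one category, treating all others as negative.

    `reference` holds the radiologists' category per image and `predicted`
    holds the model's category for the same images, in the same order.
    """
    if len(reference) != len(predicted):
        raise ValueError("reference and predicted must have the same length")

    counts = Counter()
    for ref, pred in zip(reference, predicted):
        if ref == category and pred == category:
            counts["tp"] += 1  # both say this category
        elif ref == category:
            counts["fn"] += 1  # reference says yes, model says something else
        elif pred == category:
            counts["fp"] += 1  # model says yes, reference says something else
        else:
            counts["tn"] += 1  # neither assigns this category
    return {key: counts[key] for key in ("tp", "fn", "fp", "tn")}


# Toy labels for six images (hypothetical, NOT data from the study).
reference = ["TR1", "TR4", "TR4", "TR5", "TR3", "TR4"]
predicted = ["TR4", "TR4", "TR4", "TR4", "TR3", "TR5"]

tr4 = one_vs_rest_counts(reference, predicted, "TR4")
print(tr4)
```

Repeating the tally for each category from TR1 to TR5 gives one 2x2 table per category, from which per-category sensitivity, specificity, and accuracy can be derived.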
Results: GPT-4 demonstrated high specificity but low sensitivity when assessing complete TUS images across most TI-RADS categories, resulting in mixed overall accuracy. For low-risk nodules (TR1), GPT-4 achieved 25.0% sensitivity, 99.5% specificity, and 93.6% accuracy. In contrast, in the higher-risk TR4 category, GPT-4 showed 75% sensitivity, 22.2% specificity, and 42.1% accuracy. While GPT-4 effectively identified features like smooth margins (73% vs. 65% in the reference standard), it struggled to identify other features like isoechoic echogenicity (5% vs. 46%) and echogenic foci (3% vs. 27%). The assessment of cropped images using both GPT-4 and GPT-4o followed similar patterns, though with slight deviations indicating a decrease in performance compared to GPT-4's assessment of complete images.
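The TR1 figures show how accuracy can stay high while sensitivity is low when a category is rare. The following sketch works through the standard definitions. The counts are hypothetical: the abstract does not report the underlying 2x2 table, and these values were chosen only because they are arithmetically consistent with the reported TR1 percentages for 202 images. The names `Confusion` and `one_vs_rest_metrics` are likewise illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """One-vs-rest 2x2 table for a single TI-RADS category."""
    tp: int  # reference positive, model positive
    fn: int  # reference positive, model negative
    fp: int  # reference negative, model positive
    tn: int  # reference negative, model negative


def one_vs_rest_metrics(c: Confusion) -> dict:
    """Return sensitivity, specificity, and accuracy as fractions."""
    total = c.tp + c.fn + c.fp + c.tn
    return {
        "sensitivity": c.tp / (c.tp + c.fn),  # share of true positives found
        "specificity": c.tn / (c.tn + c.fp),  # share of true negatives found
        "accuracy": (c.tp + c.tn) / total,    # share of all calls that agree
    }


# HYPOTHETICAL counts (not from the paper): 16 reference-TR1 images out of 202.
tr1 = Confusion(tp=4, fn=12, fp=1, tn=185)
metrics = one_vs_rest_metrics(tr1)

for name, value in metrics.items():
    print(f"{name}: {value * 100:.1f}%")
```

Under these assumed counts, 186 of the 202 images are not TR1, so correctly rejecting nearly all of them dominates the accuracy figure even though only a quarter of the TR1 images are identified.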
Conclusion: While GPT-4 and GPT-4o models show potential for improving the efficiency of thyroid nodule triage, their performance remains suboptimal, particularly in higher-risk categories. Further refinement and validation of these models are necessary before clinical implementation.
About the journal:
Endocrine Practice (ISSN: 1530-891X), a peer-reviewed journal published twelve times a year, is the official journal of the American Association of Clinical Endocrinologists (AACE). The primary mission of Endocrine Practice is to enhance the health care of patients with endocrine diseases through continuing education of practicing endocrinologists.