Su Miaojiao, Liang Xia, Zeng Xian Tao, Hong Zhi Liang, Cheng Sheng, Wu Songsong
{"title":"使用大型语言模型进行乳腺成像报告和数据系统分类及恶性肿瘤预测以增强乳腺超声诊断:回顾性研究。","authors":"Su Miaojiao, Liang Xia, Zeng Xian Tao, Hong Zhi Liang, Cheng Sheng, Wu Songsong","doi":"10.2196/70924","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>Breast ultrasound is essential for evaluating breast nodules, with Breast Imaging Reporting and Data System (BI-RADS) providing standardized classification. However, interobserver variability among radiologists can affect diagnostic accuracy. Large language models (LLMs) like ChatGPT-4 have shown potential in medical imaging interpretation. This study explores its feasibility in improving BI-RADS classification consistency and malignancy prediction compared to radiologists.</p><p><strong>Objective: </strong>This study aims to evaluate the feasibility of using LLMs, particularly ChatGPT-4, to assess the consistency and diagnostic accuracy of standardized breast ultrasound imaging reports, using pathology as the reference standard.</p><p><strong>Methods: </strong>This retrospective study analyzed breast nodule ultrasound data from 671 female patients (mean 45.82, SD 9.20 years; range 26-75 years) who underwent biopsy or surgical excision at our hospital between June 2019 and June 2024. ChatGPT-4 was used to interpret BI-RADS classifications and predict benign versus malignant nodules. The study compared the model's performance to that of two senior radiologists (≥15 years of experience) and two junior radiologists (<5 years of experience) using key diagnostic metrics, including accuracy, sensitivity, specificity, area under the receiver operating characteristic curve, P values, and odds ratios with 95% CIs. Two diagnostic models were evaluated: (1) image interpretation model, where ChatGPT-4 classified nodules based on BI-RADS features, and (2) image-to-text-LLM model, where radiologists provided textual descriptions, and ChatGPT-4 determined malignancy probability based on keywords. Radiologists were blinded to pathological outcomes, and BI-RADS classifications were finalized through consensus.</p><p><strong>Results: </strong>ChatGPT-4 achieved an overall BI-RADS classification accuracy of 96.87%, outperforming junior radiologists (617/671, 91.95% and 604/671, 90.01%, P<.01). For malignancy prediction, ChatGPT-4 achieved an area under the receiver operating characteristic curve of 0.82 (95% CI 0.79-0.85), an accuracy of 80.63% (541/671 cases), a sensitivity of 90.56% (259/286 cases), and a specificity of 73.51% (283/385 cases). The image interpretation model demonstrated performance comparable to senior radiologists, while the image-to-text-LLM model further improved diagnostic accuracy for all radiologists, increasing their sensitivity and specificity significantly (P<.001). Statistical analyses, including the McNemar test and DeLong test, confirmed that ChatGPT-4 outperformed junior radiologists (P<.01) and showed noninferiority compared to senior radiologists (P>.05). Pathological diagnoses served as the reference standard, ensuring robust evaluation reliability.</p><p><strong>Conclusions: </strong>Integrating ChatGPT-4 into an image-to-text-LLM workflow improves BI-RADS classification accuracy and supports radiologists in breast ultrasound diagnostics. 
These results demonstrate its potential as a decision-support tool to enhance diagnostic consistency and reduce variability.</p>","PeriodicalId":56334,"journal":{"name":"JMIR Medical Informatics","volume":"13 ","pages":"e70924"},"PeriodicalIF":3.1000,"publicationDate":"2025-06-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12175873/pdf/","citationCount":"0","resultStr":"{\"title\":\"Using a Large Language Model for Breast Imaging Reporting and Data System Classification and Malignancy Prediction to Enhance Breast Ultrasound Diagnosis: Retrospective Study.\",\"authors\":\"Su Miaojiao, Liang Xia, Zeng Xian Tao, Hong Zhi Liang, Cheng Sheng, Wu Songsong\",\"doi\":\"10.2196/70924\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><strong>Background: </strong>Breast ultrasound is essential for evaluating breast nodules, with Breast Imaging Reporting and Data System (BI-RADS) providing standardized classification. However, interobserver variability among radiologists can affect diagnostic accuracy. Large language models (LLMs) like ChatGPT-4 have shown potential in medical imaging interpretation. This study explores its feasibility in improving BI-RADS classification consistency and malignancy prediction compared to radiologists.</p><p><strong>Objective: </strong>This study aims to evaluate the feasibility of using LLMs, particularly ChatGPT-4, to assess the consistency and diagnostic accuracy of standardized breast ultrasound imaging reports, using pathology as the reference standard.</p><p><strong>Methods: </strong>This retrospective study analyzed breast nodule ultrasound data from 671 female patients (mean 45.82, SD 9.20 years; range 26-75 years) who underwent biopsy or surgical excision at our hospital between June 2019 and June 2024. ChatGPT-4 was used to interpret BI-RADS classifications and predict benign versus malignant nodules. The study compared the model's performance to that of two senior radiologists (≥15 years of experience) and two junior radiologists (<5 years of experience) using key diagnostic metrics, including accuracy, sensitivity, specificity, area under the receiver operating characteristic curve, P values, and odds ratios with 95% CIs. Two diagnostic models were evaluated: (1) image interpretation model, where ChatGPT-4 classified nodules based on BI-RADS features, and (2) image-to-text-LLM model, where radiologists provided textual descriptions, and ChatGPT-4 determined malignancy probability based on keywords. Radiologists were blinded to pathological outcomes, and BI-RADS classifications were finalized through consensus.</p><p><strong>Results: </strong>ChatGPT-4 achieved an overall BI-RADS classification accuracy of 96.87%, outperforming junior radiologists (617/671, 91.95% and 604/671, 90.01%, P<.01). For malignancy prediction, ChatGPT-4 achieved an area under the receiver operating characteristic curve of 0.82 (95% CI 0.79-0.85), an accuracy of 80.63% (541/671 cases), a sensitivity of 90.56% (259/286 cases), and a specificity of 73.51% (283/385 cases). The image interpretation model demonstrated performance comparable to senior radiologists, while the image-to-text-LLM model further improved diagnostic accuracy for all radiologists, increasing their sensitivity and specificity significantly (P<.001). 
Statistical analyses, including the McNemar test and DeLong test, confirmed that ChatGPT-4 outperformed junior radiologists (P<.01) and showed noninferiority compared to senior radiologists (P>.05). Pathological diagnoses served as the reference standard, ensuring robust evaluation reliability.</p><p><strong>Conclusions: </strong>Integrating ChatGPT-4 into an image-to-text-LLM workflow improves BI-RADS classification accuracy and supports radiologists in breast ultrasound diagnostics. These results demonstrate its potential as a decision-support tool to enhance diagnostic consistency and reduce variability.</p>\",\"PeriodicalId\":56334,\"journal\":{\"name\":\"JMIR Medical Informatics\",\"volume\":\"13 \",\"pages\":\"e70924\"},\"PeriodicalIF\":3.1000,\"publicationDate\":\"2025-06-11\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12175873/pdf/\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"JMIR Medical Informatics\",\"FirstCategoryId\":\"3\",\"ListUrlMain\":\"https://doi.org/10.2196/70924\",\"RegionNum\":3,\"RegionCategory\":\"医学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"MEDICAL INFORMATICS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"JMIR Medical Informatics","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.2196/70924","RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"MEDICAL INFORMATICS","Score":null,"Total":0}
Using a Large Language Model for Breast Imaging Reporting and Data System Classification and Malignancy Prediction to Enhance Breast Ultrasound Diagnosis: Retrospective Study.
Background: Breast ultrasound is essential for evaluating breast nodules, with the Breast Imaging Reporting and Data System (BI-RADS) providing standardized classification. However, interobserver variability among radiologists can affect diagnostic accuracy. Large language models (LLMs) such as ChatGPT-4 have shown potential in medical imaging interpretation. This study explores the feasibility of ChatGPT-4 for improving BI-RADS classification consistency and malignancy prediction relative to radiologists.
Objective: This study aims to evaluate the feasibility of using LLMs, particularly ChatGPT-4, to assess the consistency and diagnostic accuracy of standardized breast ultrasound imaging reports, using pathology as the reference standard.
Methods: This retrospective study analyzed breast nodule ultrasound data from 671 female patients (mean age 45.82, SD 9.20 years; range 26-75 years) who underwent biopsy or surgical excision at our hospital between June 2019 and June 2024. ChatGPT-4 was used to interpret BI-RADS classifications and to predict benign versus malignant nodules. The model's performance was compared with that of two senior radiologists (≥15 years of experience) and two junior radiologists (<5 years of experience) using key diagnostic metrics, including accuracy, sensitivity, specificity, area under the receiver operating characteristic curve (AUC), P values, and odds ratios with 95% CIs. Two diagnostic models were evaluated: (1) an image interpretation model, in which ChatGPT-4 classified nodules based on BI-RADS features, and (2) an image-to-text-LLM model, in which radiologists provided textual descriptions and ChatGPT-4 determined malignancy probability from keywords. Radiologists were blinded to pathological outcomes, and BI-RADS classifications were finalized by consensus.
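As a rough illustration of the image-to-text-LLM workflow described above, the sketch below shows how a radiologist's textual findings might be structured into a prompt for BI-RADS categorization and malignancy assessment. The paper does not publish its prompts or code, so everything here is hypothetical: the `UltrasoundReport` fields are illustrative BI-RADS lexicon descriptors, and `call_llm` is a stand-in for whatever chat-completion client is in use.

```python
# Illustrative sketch only: the paper does not publish its prompts or code.
# `UltrasoundReport` fields and `call_llm` are hypothetical stand-ins.
from dataclasses import dataclass
from typing import Callable

@dataclass
class UltrasoundReport:
    shape: str           # e.g. "irregular"
    margin: str          # e.g. "spiculated"
    echo_pattern: str    # e.g. "hypoechoic"
    calcifications: str  # e.g. "microcalcifications present"
    orientation: str     # e.g. "not parallel to the skin"

def build_prompt(report: UltrasoundReport) -> str:
    """Turn a radiologist's textual findings into a structured BI-RADS query."""
    return (
        "You are assisting with breast ultrasound interpretation.\n"
        f"Findings: shape={report.shape}; margin={report.margin}; "
        f"echo pattern={report.echo_pattern}; "
        f"calcifications={report.calcifications}; "
        f"orientation={report.orientation}.\n"
        "Assign a BI-RADS category and estimate the probability that the "
        "nodule is malignant. Answer as 'BI-RADS: <category>; P(malignant): <0-1>'."
    )

def assess_nodule(report: UltrasoundReport, call_llm: Callable[[str], str]) -> str:
    # call_llm is assumed to wrap whatever chat-completion API is in use
    return call_llm(build_prompt(report))
```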
Results: ChatGPT-4 achieved an overall BI-RADS classification accuracy of 96.87%, outperforming both junior radiologists (617/671, 91.95% and 604/671, 90.01%; P<.01). For malignancy prediction, ChatGPT-4 achieved an AUC of 0.82 (95% CI 0.79-0.85), an accuracy of 80.63% (541/671 cases), a sensitivity of 90.56% (259/286 cases), and a specificity of 73.51% (283/385 cases). The image interpretation model performed comparably to the senior radiologists, while the image-to-text-LLM model further improved diagnostic accuracy for all radiologists, significantly increasing their sensitivity and specificity (P<.001). Statistical analyses, including the McNemar test and the DeLong test, confirmed that ChatGPT-4 outperformed the junior radiologists (P<.01) and was noninferior to the senior radiologists (P>.05). Pathological diagnoses served as the reference standard, ensuring robust evaluation reliability.
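For readers who want to reproduce this kind of evaluation, a minimal sketch follows, assuming per-case predicted labels, model scores, and pathology labels as arrays (names like `y_true` and `pred_a` are illustrative). The metric definitions and the continuity-corrected McNemar test are standard; the DeLong test for comparing AUCs is omitted for brevity.

```python
# Minimal sketch: standard diagnostic metrics against pathology labels,
# plus a continuity-corrected McNemar test for paired reader comparisons.
import numpy as np
from scipy.stats import chi2
from sklearn.metrics import confusion_matrix, roc_auc_score

def diagnostic_metrics(y_true, y_pred, y_score):
    """Accuracy, sensitivity, specificity, and AUC (1 = malignant, 0 = benign)."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),  # true-positive rate on malignant cases
        "specificity": tn / (tn + fp),  # true-negative rate on benign cases
        "auc": roc_auc_score(y_true, y_score),
    }

def mcnemar_p(pred_a, pred_b, y_true):
    """P value for whether readers A and B differ in accuracy on paired cases."""
    correct_a = np.asarray(pred_a) == np.asarray(y_true)
    correct_b = np.asarray(pred_b) == np.asarray(y_true)
    b = int(np.sum(correct_a & ~correct_b))  # A right where B is wrong
    c = int(np.sum(~correct_a & correct_b))  # B right where A is wrong
    if b + c == 0:
        return 1.0  # no discordant pairs; no evidence of a difference
    stat = (abs(b - c) - 1) ** 2 / (b + c)  # continuity-corrected chi-square
    return chi2.sf(stat, df=1)
```

With the counts reported above, these formulas reproduce the stated figures: 259/286 correctly flagged malignant cases gives a sensitivity of 90.56%, and 283/385 correctly cleared benign cases gives a specificity of 73.51%.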
Conclusions: Integrating ChatGPT-4 into an image-to-text-LLM workflow improves BI-RADS classification accuracy and supports radiologists in breast ultrasound diagnostics. These results demonstrate its potential as a decision-support tool to enhance diagnostic consistency and reduce variability.
About the journal:
JMIR Medical Informatics (JMI, ISSN 2291-9694) is a top-rated, tier A journal that focuses on clinical informatics, big data in health and health care, decision support for health professionals, electronic health records, eHealth infrastructures, and implementation. It emphasizes applied, translational research, with a broad readership including clinicians, CIOs, engineers, industry, and health informatics professionals.
Published by JMIR Publications, publisher of the Journal of Medical Internet Research (JMIR), the leading eHealth/mHealth journal (Impact Factor 2016: 5.175), JMIR Med Inform has a slightly different scope: it emphasizes applications for clinicians and health professionals rather than consumers/citizens (the focus of JMIR), publishes even faster, and accepts papers that are more technical or more formative than those published in the Journal of Medical Internet Research.