{"title":"胸部 X 射线病理学放射报告自动标注:大型语言模型框架的开发与评估。","authors":"Abdullah Abdullah, Seong Tae Kim","doi":"10.2196/68618","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>Labeling unstructured radiology reports is crucial for creating structured datasets that facilitate downstream tasks, such as training large-scale medical imaging models. Current approaches typically rely on Bidirectional Encoder Representations from Transformers (BERT)-based methods or manual expert annotations, which have limitations in terms of scalability and performance.</p><p><strong>Objective: </strong>This study aimed to evaluate the effectiveness of a generative pretrained transformer (GPT)-based large language model (LLM) in labeling radiology reports, comparing it with 2 existing methods, CheXbert and CheXpert, on a large chest X-ray dataset (MIMIC Chest X-ray [MIMIC-CXR]).</p><p><strong>Methods: </strong>In this study, we introduce an LLM-based approach fine-tuned on expert-labeled radiology reports. Our model's performance was evaluated on 687 radiologist-labeled chest X-ray reports, comparing F1 scores across 14 thoracic pathologies. The performance of our LLM model was compared with the CheXbert and CheXpert models across positive, negative, and uncertainty extraction tasks. Paired t tests and Wilcoxon signed-rank tests were performed to evaluate the statistical significance of differences between model performances.</p><p><strong>Results: </strong>The GPT-based LLM model achieved an average F1 score of 0.9014 across all certainty levels, outperforming CheXpert (0.8864) and approaching CheXbert's performance (0.9047). For positive and negative certainty levels, our model scored 0.8708, surpassing CheXpert (0.8525) and closely matching CheXbert (0.8733). Statistically, paired t tests indicated no significant difference between our model and CheXbert (P=.35) but a significant improvement over CheXpert (P=.01). Wilcoxon signed-rank tests corroborated these findings, showing no significant difference between our model and CheXbert (P=.14) but confirming a significant difference with CheXpert (P=.005). The LLM also demonstrated superior performance for pathologies with longer and more complex descriptions, leveraging its extended context length.</p><p><strong>Conclusions: </strong>The GPT-based LLM model demonstrates competitive performance compared with CheXbert and outperforms CheXpert in radiology report labeling. These findings suggest that LLMs are a promising alternative to traditional BERT-based architectures for this task, offering enhanced context understanding and eliminating the need for extensive feature engineering. Furthermore, with large context length LLM-based models are better suited for this task as compared with the small context length of BERT based models.</p>","PeriodicalId":56334,"journal":{"name":"JMIR Medical Informatics","volume":"13 ","pages":"e68618"},"PeriodicalIF":3.1000,"publicationDate":"2025-03-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11970564/pdf/","citationCount":"0","resultStr":"{\"title\":\"Automated Radiology Report Labeling in Chest X-Ray Pathologies: Development and Evaluation of a Large Language Model Framework.\",\"authors\":\"Abdullah Abdullah, Seong Tae Kim\",\"doi\":\"10.2196/68618\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><strong>Background: </strong>Labeling unstructured radiology reports is crucial for creating structured datasets that facilitate downstream tasks, such as training large-scale medical imaging models. Current approaches typically rely on Bidirectional Encoder Representations from Transformers (BERT)-based methods or manual expert annotations, which have limitations in terms of scalability and performance.</p><p><strong>Objective: </strong>This study aimed to evaluate the effectiveness of a generative pretrained transformer (GPT)-based large language model (LLM) in labeling radiology reports, comparing it with 2 existing methods, CheXbert and CheXpert, on a large chest X-ray dataset (MIMIC Chest X-ray [MIMIC-CXR]).</p><p><strong>Methods: </strong>In this study, we introduce an LLM-based approach fine-tuned on expert-labeled radiology reports. Our model's performance was evaluated on 687 radiologist-labeled chest X-ray reports, comparing F1 scores across 14 thoracic pathologies. The performance of our LLM model was compared with the CheXbert and CheXpert models across positive, negative, and uncertainty extraction tasks. Paired t tests and Wilcoxon signed-rank tests were performed to evaluate the statistical significance of differences between model performances.</p><p><strong>Results: </strong>The GPT-based LLM model achieved an average F1 score of 0.9014 across all certainty levels, outperforming CheXpert (0.8864) and approaching CheXbert's performance (0.9047). For positive and negative certainty levels, our model scored 0.8708, surpassing CheXpert (0.8525) and closely matching CheXbert (0.8733). Statistically, paired t tests indicated no significant difference between our model and CheXbert (P=.35) but a significant improvement over CheXpert (P=.01). Wilcoxon signed-rank tests corroborated these findings, showing no significant difference between our model and CheXbert (P=.14) but confirming a significant difference with CheXpert (P=.005). The LLM also demonstrated superior performance for pathologies with longer and more complex descriptions, leveraging its extended context length.</p><p><strong>Conclusions: </strong>The GPT-based LLM model demonstrates competitive performance compared with CheXbert and outperforms CheXpert in radiology report labeling. These findings suggest that LLMs are a promising alternative to traditional BERT-based architectures for this task, offering enhanced context understanding and eliminating the need for extensive feature engineering. Furthermore, with large context length LLM-based models are better suited for this task as compared with the small context length of BERT based models.</p>\",\"PeriodicalId\":56334,\"journal\":{\"name\":\"JMIR Medical Informatics\",\"volume\":\"13 \",\"pages\":\"e68618\"},\"PeriodicalIF\":3.1000,\"publicationDate\":\"2025-03-28\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11970564/pdf/\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"JMIR Medical Informatics\",\"FirstCategoryId\":\"3\",\"ListUrlMain\":\"https://doi.org/10.2196/68618\",\"RegionNum\":3,\"RegionCategory\":\"医学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"MEDICAL INFORMATICS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"JMIR Medical Informatics","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.2196/68618","RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"MEDICAL INFORMATICS","Score":null,"Total":0}
Automated Radiology Report Labeling in Chest X-Ray Pathologies: Development and Evaluation of a Large Language Model Framework.
Background: Labeling unstructured radiology reports is crucial for creating structured datasets that facilitate downstream tasks, such as training large-scale medical imaging models. Current approaches typically rely on Bidirectional Encoder Representations from Transformers (BERT)-based methods or manual expert annotations, which have limitations in terms of scalability and performance.
Objective: This study aimed to evaluate the effectiveness of a generative pretrained transformer (GPT)-based large language model (LLM) in labeling radiology reports, comparing it with 2 existing methods, CheXbert and CheXpert, on a large chest X-ray dataset (MIMIC Chest X-ray [MIMIC-CXR]).
Methods: In this study, we introduce an LLM-based approach fine-tuned on expert-labeled radiology reports. Our model's performance was evaluated on 687 radiologist-labeled chest X-ray reports, comparing F1 scores across 14 thoracic pathologies. The performance of our LLM model was compared with the CheXbert and CheXpert models across positive, negative, and uncertainty extraction tasks. Paired t tests and Wilcoxon signed-rank tests were performed to evaluate the statistical significance of differences between model performances.
Results: The GPT-based LLM model achieved an average F1 score of 0.9014 across all certainty levels, outperforming CheXpert (0.8864) and approaching CheXbert's performance (0.9047). For positive and negative certainty levels, our model scored 0.8708, surpassing CheXpert (0.8525) and closely matching CheXbert (0.8733). Statistically, paired t tests indicated no significant difference between our model and CheXbert (P=.35) but a significant improvement over CheXpert (P=.01). Wilcoxon signed-rank tests corroborated these findings, showing no significant difference between our model and CheXbert (P=.14) but confirming a significant difference with CheXpert (P=.005). The LLM also demonstrated superior performance for pathologies with longer and more complex descriptions, leveraging its extended context length.
Conclusions: The GPT-based LLM model demonstrates competitive performance compared with CheXbert and outperforms CheXpert in radiology report labeling. These findings suggest that LLMs are a promising alternative to traditional BERT-based architectures for this task, offering enhanced context understanding and eliminating the need for extensive feature engineering. Furthermore, with large context length LLM-based models are better suited for this task as compared with the small context length of BERT based models.
期刊介绍:
JMIR Medical Informatics (JMI, ISSN 2291-9694) is a top-rated, tier A journal which focuses on clinical informatics, big data in health and health care, decision support for health professionals, electronic health records, ehealth infrastructures and implementation. It has a focus on applied, translational research, with a broad readership including clinicians, CIOs, engineers, industry and health informatics professionals.
Published by JMIR Publications, publisher of the Journal of Medical Internet Research (JMIR), the leading eHealth/mHealth journal (Impact Factor 2016: 5.175), JMIR Med Inform has a slightly different scope (emphasizing more on applications for clinicians and health professionals rather than consumers/citizens, which is the focus of JMIR), publishes even faster, and also allows papers which are more technical or more formative than what would be published in the Journal of Medical Internet Research.