{"title":"Classification of Interventional Radiology Reports into Technique Categories with a Fine-Tuned Large Language Model.","authors":"Koichiro Yasaka, Takuto Nomura, Jun Kamohara, Hiroshi Hirakawa, Takatoshi Kubo, Shigeru Kiryu, Osamu Abe","doi":"10.1007/s10278-024-01370-w","DOIUrl":null,"url":null,"abstract":"<p><p>The aim of this study is to develop a fine-tuned large language model that classifies interventional radiology reports into technique categories and to compare its performance with readers. This retrospective study included 3198 patients (1758 males and 1440 females; age, 62.8 ± 16.8 years) who underwent interventional radiology from January 2018 to July 2024. Training, validation, and test datasets involved 2292, 250, and 656 patients, respectively. Input data involved texts in clinical indication, imaging diagnosis, and image-finding sections of interventional radiology reports. Manually classified technique categories (15 categories in total) were utilized as reference data. Fine-tuning of the Bidirectional Encoder Representations model was performed using training and validation datasets. This process was repeated 15 times due to the randomness of the learning process. The best-performed model, which showed the highest accuracy among 15 trials, was selected to further evaluate its performance in the independent test dataset. The report classification involved one radiologist (reader 1) and two radiology residents (readers 2 and 3). The accuracy and macrosensitivity (average of each category's sensitivity) of the best-performed model in the validation dataset were 0.996 and 0.994, respectively. For the test dataset, the accuracy/macrosensitivity were 0.988/0.980, 0.986/0.977, 0.989/0.979, and 0.988/0.980 in the best model, reader 1, reader 2, and reader 3, respectively. The model required 0.178 s required for classification per patient, which was 17.5-19.9 times faster than readers. In conclusion, fine-tuned large language model classified interventional radiology reports into technique categories with high accuracy similar to readers within a remarkably shorter time.</p>","PeriodicalId":516858,"journal":{"name":"Journal of imaging informatics in medicine","volume":" ","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of imaging informatics in medicine","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1007/s10278-024-01370-w","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
The aim of this study was to develop a fine-tuned large language model that classifies interventional radiology reports into technique categories, and to compare its performance with that of human readers. This retrospective study included 3198 patients (1758 males and 1440 females; age, 62.8 ± 16.8 years) who underwent interventional radiology from January 2018 to July 2024. The training, validation, and test datasets comprised 2292, 250, and 656 patients, respectively. Input data consisted of the text of the clinical indication, imaging diagnosis, and imaging findings sections of the interventional radiology reports. Manually classified technique categories (15 in total) served as the reference standard. A Bidirectional Encoder Representations from Transformers (BERT) model was fine-tuned using the training and validation datasets. Because of the randomness of the learning process, this procedure was repeated 15 times. The best-performing model, defined as the one with the highest validation accuracy among the 15 trials, was selected for further evaluation on the independent test dataset. Report classification was also performed by one radiologist (reader 1) and two radiology residents (readers 2 and 3). The accuracy and macro-sensitivity (the unweighted average of per-category sensitivities) of the best-performing model in the validation dataset were 0.996 and 0.994, respectively. For the test dataset, the accuracy/macro-sensitivity were 0.988/0.980, 0.986/0.977, 0.989/0.979, and 0.988/0.980 for the best-performing model, reader 1, reader 2, and reader 3, respectively. The model required 0.178 s per patient for classification, which was 17.5-19.9 times faster than the readers. In conclusion, the fine-tuned large language model classified interventional radiology reports into technique categories with accuracy comparable to that of human readers, in a remarkably shorter time.
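The abstract does not give implementation details, but the described workflow (fine-tuning a BERT-style classifier on report text, repeating the run 15 times, keeping the trial with the best validation accuracy, and scoring accuracy plus macro-sensitivity) maps onto a standard sequence-classification setup. Below is a minimal sketch under assumed choices: the checkpoint name ("bert-base-uncased"), hyperparameters, and all function names are illustrative, not taken from the paper.

```python
# Hedged sketch of the study's workflow using Hugging Face transformers.
# Assumptions (not stated in the abstract): checkpoint, max length, epochs,
# learning rate, and batch size are placeholders for illustration only.
import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset
from transformers import BertForSequenceClassification, BertTokenizerFast

NUM_CATEGORIES = 15  # technique categories, per the abstract


class ReportDataset(Dataset):
    """Report texts (clinical indication + imaging diagnosis + imaging
    findings, concatenated) with manually assigned category labels."""

    def __init__(self, texts, labels, tokenizer, max_length=512):
        self.enc = tokenizer(texts, truncation=True, padding="max_length",
                             max_length=max_length, return_tensors="pt")
        self.labels = torch.tensor(labels)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, i):
        return {"input_ids": self.enc["input_ids"][i],
                "attention_mask": self.enc["attention_mask"][i],
                "labels": self.labels[i]}


def fine_tune_once(train_ds, epochs=3, lr=2e-5, batch_size=16, device="cuda"):
    """One fine-tuning trial. The study repeats this 15 times (random
    initialization/shuffling differ per trial) and keeps the trial with
    the highest validation accuracy."""
    model = BertForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=NUM_CATEGORIES).to(device)
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    loader = DataLoader(train_ds, batch_size=batch_size, shuffle=True)
    model.train()
    for _ in range(epochs):
        for batch in loader:
            batch = {k: v.to(device) for k, v in batch.items()}
            loss = model(**batch).loss
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model


def macro_sensitivity(y_true, y_pred, num_classes=NUM_CATEGORIES):
    """Macro-sensitivity as defined in the abstract: the unweighted mean
    of each category's sensitivity (per-class recall)."""
    recalls = []
    for c in range(num_classes):
        mask = y_true == c
        if mask.any():
            recalls.append(float((y_pred[mask] == c).mean()))
    return float(np.mean(recalls))
```

Model selection would then amount to calling fine_tune_once 15 times, computing validation accuracy for each returned model, and evaluating only the best one on the held-out test set, mirroring the selection procedure the abstract describes.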