Automatic sequence identification in multicentric prostate multiparametric MRI datasets for clinical machine-learning.

IF 4.1 | Zone 2, Medicine | Q1, Radiology, Nuclear Medicine & Medical Imaging
José Guilherme de Almeida, Ana Sofia Castro Verde, Carlos Bilreiro, Inês Santiago, Joana Ip, Manolis Tsiknakis, Kostas Marias, Daniele Regge, Celso Matos, Nickolas Papanikolaou
Insights into Imaging, 16(1):75. Published 2025-03-27. DOI: 10.1186/s13244-025-01938-2 (https://doi.org/10.1186/s13244-025-01938-2)

Abstract

Objectives: To present an accurate machine-learning (ML) method and knowledge-based heuristics for automatic sequence-type identification in multi-centric multiparametric MRI (mpMRI) datasets for prostate cancer (PCa) ML.

Methods: Retrospective prostate mpMRI studies were classified into five series types: T2-weighted (T2W), diffusion-weighted images (DWI), apparent diffusion coefficients (ADC), dynamic contrast-enhanced (DCE) and other series types (others). Metadata were processed for all series, and two models were trained (XGBoost after custom categorical tokenization and CatBoost with raw categorical data) using 5-fold cross-validation (CV) with different data fractions for learning curve analyses. For validation, two test sets (a hold-out test set and a temporal split) were used. A leave-one-group-out (LOGO) CV analysis was performed with centres as groups to understand the effect of dataset-specific data.
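The leave-one-group-out scheme described in the Methods can be illustrated with a minimal sketch. It is not the authors' code: the function name `leave_one_group_out` and the toy centre labels are assumptions made for illustration. Each centre is held out once, and the model would be trained on all remaining centres.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterator, List, Sequence, Tuple


def leave_one_group_out(
    groups: Sequence[Hashable],
) -> Iterator[Tuple[Hashable, List[int], List[int]]]:
    """Yield (held_out_group, train_indices, test_indices), one split per group."""
    by_group: Dict[Hashable, List[int]] = defaultdict(list)
    for idx, group in enumerate(groups):
        by_group[group].append(idx)
    for held_out, test_idx in by_group.items():
        # Train on every series that does not come from the held-out centre.
        train_idx = [i for i, g in enumerate(groups) if g != held_out]
        yield held_out, train_idx, test_idx


# Hypothetical centre label for each of six series (not data from the study).
centres = ["A", "A", "B", "C", "C", "C"]
splits = list(leave_one_group_out(centres))
for held_out, train_idx, test_idx in splits:
    print(held_out, train_idx, test_idx)
```

Unlike ordinary 5-fold CV, no series from the held-out centre is ever seen during training, which is what lets this analysis isolate the contribution of centre-specific data.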

Results: 4045 studies (31,053 series) and 1004 studies (7891 series) from 11 centres were used to train and test series identification models, respectively. Test F1-scores were consistently above 0.95 (CatBoost) and 0.97 (XGBoost). Learning curves demonstrate learning saturation, while temporal validation shows that the models remain capable of correctly identifying all T2W/DWI/ADC triplets. However, optimal performance requires centre-specific data: controlling for the model and the feature sets used, when comparing CV with LOGO CV, F1-scores dropped for T2W, DCE and others (-0.146, -0.181 and -0.179, respectively), with larger performance decreases for CatBoost (-0.265). Finally, we delineate heuristics to assist researchers in series classification for PCa mpMRI datasets.
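As a rough illustration of how a per-class F1-score difference between two evaluation settings can be computed, the sketch below uses toy labels. The helper `per_class_f1` and all example predictions are assumptions, not results or code from the study.

```python
from typing import Dict, Sequence


def per_class_f1(
    y_true: Sequence[str], y_pred: Sequence[str], labels: Sequence[str]
) -> Dict[str, float]:
    """Compute the one-vs-rest F1-score for each label."""
    scores: Dict[str, float] = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0.0 when the label never occurs.
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


labels = ["T2W", "DWI", "ADC"]
y_true = ["T2W", "T2W", "DWI", "ADC"]      # toy ground truth
cv_pred = ["T2W", "T2W", "DWI", "ADC"]     # toy predictions under ordinary CV
logo_pred = ["T2W", "DWI", "DWI", "ADC"]   # toy predictions under LOGO CV

cv_f1 = per_class_f1(y_true, cv_pred, labels)
logo_f1 = per_class_f1(y_true, logo_pred, labels)
delta = {label: logo_f1[label] - cv_f1[label] for label in labels}
print(delta)
```

A negative entry in `delta` means the class is identified less reliably when its centre was excluded from training, which is the kind of drop reported above.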

Conclusions: Automatic series-type identification is feasible and can enable automated data curation. However, dataset-specific data should be included to achieve optimal performance.

Critical relevance statement: Organising large collections of data is time-consuming but necessary to train clinical machine-learning models. To address this, we outline and validate an automatic series identification method that can facilitate this process. Finally, we outline a set of metadata-based heuristics that can be used to further automate series-type identification.
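To give a sense of what a metadata-based heuristic could look like, here is a hypothetical sketch that guesses the series type from a free-text series description. The keyword rules, their order, and the function name `guess_series_type` are illustrative assumptions; they are not the heuristics outlined in the paper, which this abstract does not list.

```python
import re
from typing import List, Pattern, Tuple

# Illustrative keyword rules only (assumed, not taken from the paper).
# Order matters: ADC maps are often derived from diffusion series, so "adc"
# is checked before the more general diffusion keywords.
RULES: List[Tuple[str, Pattern[str]]] = [
    ("ADC", re.compile(r"adc|apparent")),
    ("DCE", re.compile(r"dce|dyn|perf")),
    ("DWI", re.compile(r"dwi|diff")),
    ("T2W", re.compile(r"t2")),
]


def guess_series_type(series_description: str) -> str:
    """Return the first matching label, or "others" when no rule matches."""
    text = series_description.lower()
    for label, pattern in RULES:
        if pattern.search(text):
            return label
    return "others"


# Hypothetical series descriptions, not taken from the study's datasets.
for description in ["t2_tse_tra", "ep2d_diff_b50_800", "ep2d_diff_ADC", "localizer"]:
    print(description, "->", guess_series_type(description))
```

Rules like these are brittle across vendors and sites, which is consistent with the abstract's point that dataset-specific information is needed for optimal performance.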

Key points: Multi-centric prostate MRI studies were used for sequence annotation model training. Automatic sequence annotation requires few instances and generalises temporally. Sequence annotation, necessary for clinical AI model training, can be performed automatically.

Source journal: Insights into Imaging
Subject area: Medicine - Radiology, Nuclear Medicine and Imaging
CiteScore: 7.30
Self-citation rate: 4.30%
Articles published: 182
Review time: 13 weeks
Journal description: Insights into Imaging (I³) is a peer-reviewed open access journal published under the brand SpringerOpen. All content published in the journal is freely available online to anyone, anywhere. I³ continuously updates scientific knowledge and progress in best-practice standards in radiology through the publication of original articles and state-of-the-art reviews and opinions, along with recommendations and statements from the leading radiological societies in Europe. Founded by the European Society of Radiology (ESR), I³ creates a platform for educational material, guidelines and recommendations, and a forum for topics of controversy. A balanced combination of review articles, original papers, short communications from European radiological congresses and information on society matters makes I³ an indispensable source for current information in this field. I³ is owned by the ESR; however, authors retain copyright to their article according to the Creative Commons Attribution License (see Copyright and License Agreement). All articles can be read, redistributed and reused for free, as long as the author of the original work is cited properly. The open access fees (article-processing charges) for this journal are kindly sponsored by ESR for all Members. The journal went open access in 2012, which means that all articles published since then are freely available online.