Automatic sequence identification in multicentric prostate multiparametric MRI datasets for clinical machine-learning.

Impact factor: 4.5 · Zone 2, Medicine · JCR Q1: RADIOLOGY, NUCLEAR MEDICINE & MEDICAL IMAGING

Insights into Imaging · Pub Date: 2025-03-27 · DOI: 10.1186/s13244-025-01938-2

José Guilherme de Almeida, Ana Sofia Castro Verde, Carlos Bilreiro, Inês Santiago, Joana Ip, Manolis Tsiknakis, Kostas Marias, Daniele Regge, Celso Matos, Nickolas Papanikolaou

Citations: 0

Abstract

Automatic sequence identification in multicentric prostate multiparametric MRI datasets for clinical machine-learning.

Objectives: To present an accurate machine-learning (ML) method and knowledge-based heuristics for automatic sequence-type identification in multi-centric multiparametric MRI (mpMRI) datasets for prostate cancer (PCa) ML.

Methods: Retrospective prostate mpMRI studies were classified into 5 series types: T2-weighted (T2W), diffusion-weighted images (DWI), apparent diffusion coefficients (ADC), dynamic contrast-enhanced (DCE) and other series types (others). Metadata was processed for all series and two models were trained (XGBoost after custom categorical tokenization and CatBoost with raw categorical data) using 5-fold cross-validation (CV) with different data fractions for learning curve analyses. For validation, two test sets were used: a hold-out test set and a temporal split. A leave-one-group-out (LOGO) CV analysis was performed with centres as groups to understand the effect of dataset-specific data.
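
The leave-one-group-out analysis described in the Methods can be sketched as follows. This is a minimal, dependency-free illustration and not the authors' code: the function name `leave_one_group_out` and the toy centre labels are assumptions made for the example.

```python
from collections import defaultdict


def leave_one_group_out(groups):
    """Yield (held_out_group, train_indices, test_indices) once per group.

    `groups[i]` is the group (here: the acquiring centre) of sample i.
    Each group is held out exactly once, so a model evaluated on it
    never sees data from that centre during training.
    """
    by_group = defaultdict(list)
    for index, group in enumerate(groups):
        by_group[group].append(index)

    for held_out, test_indices in by_group.items():
        train_indices = [
            index
            for group, indices in by_group.items()
            if group != held_out
            for index in indices
        ]
        yield held_out, sorted(train_indices), test_indices


# Hypothetical toy data: six series acquired at three centres.
centres = ["A", "A", "B", "C", "B", "C"]
folds = list(leave_one_group_out(centres))
for held_out, train_idx, test_idx in folds:
    print(held_out, train_idx, test_idx)
```

Unlike ordinary 5-fold CV, where series from one centre can land in both the training and validation folds, this split measures how a model behaves on a centre it has never seen.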

Results: 4045 studies (31,053 series) and 1004 studies (7891 series) from 11 centres were used to train and test series identification models, respectively. Test F1-scores were consistently above 0.95 (CatBoost) and 0.97 (XGBoost). Learning curves demonstrate learning saturation, while temporal validation shows that the models remain capable of correctly identifying all T2W/DWI/ADC triplets. However, optimal performance requires centre-specific data: controlling for model and feature set when comparing CV with LOGO CV, the F1-score dropped for T2W, DCE and others (-0.146, -0.181 and -0.179, respectively), with larger performance decreases for CatBoost (-0.265). Finally, we delineate heuristics to assist researchers in series classification for PCa mpMRI datasets.
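
The Results report F1-scores per series type. As a reminder of how that metric is computed, here is a small sketch using the standard definition F1 = 2·TP / (2·TP + FP + FN). The helper `per_class_f1` and the labels below are hypothetical and do not reproduce any figure from the paper.

```python
def per_class_f1(y_true, y_pred, labels):
    """Return {label: F1}, with F1 = 2*TP / (2*TP + FP + FN) per label."""
    scores = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denominator = 2 * tp + fp + fn
        # A label that never occurs and is never predicted gets 0.0 here.
        scores[label] = 2 * tp / denominator if denominator else 0.0
    return scores


SERIES_TYPES = ["T2W", "DWI", "ADC", "DCE", "others"]

# Hypothetical annotations and predictions for eight series.
y_true = ["T2W", "T2W", "DWI", "ADC", "DCE", "DCE", "others", "others"]
y_pred = ["T2W", "others", "DWI", "ADC", "DCE", "others", "others", "others"]

scores = per_class_f1(y_true, y_pred, SERIES_TYPES)
for series_type in SERIES_TYPES:
    print(f"{series_type}: {scores[series_type]:.3f}")
```

Computing the same per-class scores once under 5-fold CV and once under LOGO CV, then subtracting, gives differences of the kind quoted in the abstract.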

Conclusions: Automatic series-type identification is feasible and can enable automated data curation. However, dataset-specific data should be included to achieve optimal performance.

Critical relevance statement: Organising large collections of data is time-consuming but necessary to train clinical machine-learning models. To address this, we outline and validate an automatic series identification method that can facilitate this process. Finally, we outline a set of metadata-based heuristics that can be used to further automate series-type identification.

Key points: Multi-centric prostate MRI studies were used for sequence annotation model training. Automatic sequence annotation requires few instances and generalises temporally. Sequence annotation, necessary for clinical AI model training, can be performed automatically.

Source journal

Insights into Imaging (Medicine: Radiology, Nuclear Medicine and Imaging)

CiteScore: 7.30
Self-citation rate: 4.30%
Articles published: 182
Review time: 13 weeks

About the journal: Insights into Imaging (I³) is a peer-reviewed open access journal published under the brand SpringerOpen. All content published in the journal is freely available online to anyone, anywhere! I³ continuously updates scientific knowledge and progress in best-practice standards in radiology through the publication of original articles and state-of-the-art reviews and opinions, along with recommendations and statements from the leading radiological societies in Europe. Founded by the European Society of Radiology (ESR), I³ creates a platform for educational material, guidelines and recommendations, and a forum for topics of controversy. A balanced combination of review articles, original papers, short communications from European radiological congresses and information on society matters makes I³ an indispensable source for current information in this field. I³ is owned by the ESR; however, authors retain copyright to their article according to the Creative Commons Attribution License (see Copyright and License Agreement). All articles can be read, redistributed and reused for free, as long as the author of the original work is cited properly. The open access fees (article-processing charges) for this journal are kindly sponsored by ESR for all Members. The journal went open access in 2012, which means that all articles published since then are freely available online.