Matias Aiskovich, Eduardo Castro, Jenna M. Reinen, S. Fadnavis, Anushree Mehta, Hongyang Li, Amit Dhurandhar, Guillermo Cecchi, Pablo Polosecki
{"title":"Fusion of biomedical imaging studies for increased sample size and diversity: a case study of brain MRI","authors":"Matias Aiskovich, Eduardo Castro, Jenna M. Reinen, S. Fadnavis, Anushree Mehta, Hongyang Li, Amit Dhurandhar, Guillermo Cecchi, Pablo Polosecki","doi":"10.3389/fradi.2024.1283392","DOIUrl":null,"url":null,"abstract":"Data collection, curation, and cleaning constitute a crucial phase in Machine Learning (ML) projects. In biomedical ML, it is often desirable to leverage multiple datasets to increase sample size and diversity, but this poses unique challenges, which arise from heterogeneity in study design, data descriptors, file system organization, and metadata. In this study, we present an approach to the integration of multiple brain MRI datasets with a focus on homogenization of their organization and preprocessing for ML. We use our own fusion example (approximately 84,000 images from 54,000 subjects, 12 studies, and 88 individual scanners) to illustrate and discuss the issues faced by study fusion efforts, and we examine key decisions necessary during dataset homogenization, presenting in detail a database structure flexible enough to accommodate multiple observational MRI datasets. We believe our approach can provide a basis for future similarly-minded biomedical ML projects.","PeriodicalId":73101,"journal":{"name":"Frontiers in radiology","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-04-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Frontiers in radiology","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.3389/fradi.2024.1283392","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0
Abstract
Data collection, curation, and cleaning constitute a crucial phase in Machine Learning (ML) projects. In biomedical ML, it is often desirable to leverage multiple datasets to increase sample size and diversity, but this poses unique challenges, which arise from heterogeneity in study design, data descriptors, file system organization, and metadata. In this study, we present an approach to the integration of multiple brain MRI datasets with a focus on homogenization of their organization and preprocessing for ML. We use our own fusion example (approximately 84,000 images from 54,000 subjects, 12 studies, and 88 individual scanners) to illustrate and discuss the issues faced by study fusion efforts, and we examine key decisions necessary during dataset homogenization, presenting in detail a database structure flexible enough to accommodate multiple observational MRI datasets. We believe our approach can provide a basis for future similarly-minded biomedical ML projects.