Overcoming Site Variability in Multisite fMRI Studies: an Autoencoder Framework for Enhanced Generalizability of Machine Learning Models.

IF 3.1 · CAS Region 4 (Medicine) · JCR Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS

Neuroinformatics Pub Date : 2025-09-02 DOI:10.1007/s12021-025-09746-1

Fahad Almuqhim, Fahad Saeed

Neuroinformatics 23(3): 46. Publication type: Journal Article.

Citations: 0

Abstract

Harmonizing multisite functional magnetic resonance imaging (fMRI) data is crucial for eliminating site-specific variability that hinders the generalizability of machine learning (ML) models. Traditional harmonization techniques, such as ComBat, depend on additive and multiplicative factors and may struggle to capture the non-linear interactions between scanner hardware, acquisition protocols, and signal variations across imaging sites. In addition, these statistical techniques require data from all sites during model training, which can cause data leakage in ML models trained on the harmonized data. Such models may show low reliability and reproducibility when tested on unseen data sets, limiting their applicability for general clinical use. In this study, we propose autoencoders (AEs) as an alternative for harmonizing multisite fMRI data. Our framework leverages the non-linear representation learning capabilities of AEs to reduce site-specific effects while preserving biologically meaningful features. Our evaluation on the Autism Brain Imaging Data Exchange I (ABIDE-I) dataset, containing 1,035 subjects collected from 17 centers, demonstrates statistically significant improvements in leave-one-site-out (LOSO) cross-validation. All AE variants (AE, SAE, TAE, and DAE) significantly outperformed the baseline model (p < 0.01), with mean accuracy improvements ranging from 3.41% to 5.04%. Our findings demonstrate the potential of AEs to harmonize multisite neuroimaging data effectively, enabling robust downstream analyses across neuroscience applications while reducing data leakage and preserving neurobiological features. Our open-source code is available at https://github.com/pcdslab/Autoencoder-fMRI-Harmonization .
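The leakage concern and the LOSO protocol described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the helper names (`loso_splits`, `fit_scaler`, `apply_scaler`), the toy data, and the site labels are all hypothetical, and a simple per-feature standardization stands in for the harmonization step. The point it shows is only that any transform is fitted on the training sites and then reused, unchanged, on the held-out site.

```python
from collections import defaultdict
from statistics import mean, pstdev


def loso_splits(sites):
    """Yield (held_out_site, train_idx, test_idx) for leave-one-site-out CV."""
    by_site = defaultdict(list)
    for i, site in enumerate(sites):
        by_site[site].append(i)
    for held_out in sorted(by_site):
        test_idx = by_site[held_out]
        train_idx = [i for s in sorted(by_site) if s != held_out for i in by_site[s]]
        yield held_out, train_idx, test_idx


def fit_scaler(rows):
    """Estimate per-feature mean and std from the given rows only."""
    cols = list(zip(*rows))
    means = [mean(c) for c in cols]
    stds = [pstdev(c) or 1.0 for c in cols]  # fall back to 1.0 for constant features
    return means, stds


def apply_scaler(rows, params):
    """Standardize rows with previously fitted parameters (no refitting)."""
    means, stds = params
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]


# Hypothetical toy data: 6 subjects, 2 features, 3 sites (all values made up).
X = [[1.0, 10.0], [2.0, 12.0], [3.0, 11.0], [4.0, 13.0], [5.0, 9.0], [6.0, 14.0]]
sites = ["A", "A", "B", "B", "C", "C"]

folds = []
for held_out, train_idx, test_idx in loso_splits(sites):
    params = fit_scaler([X[i] for i in train_idx])            # training sites only
    X_train = apply_scaler([X[i] for i in train_idx], params)
    X_test = apply_scaler([X[i] for i in test_idx], params)   # reuse fitted params
    folds.append((held_out, train_idx, test_idx))
    # A harmonizer and classifier would be trained on X_train and scored on X_test here.
```

In a real pipeline the standardization stand-in would be replaced by the harmonization model, fitted inside each fold in the same way, so that no statistics from the held-out site influence training.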


Source journal

Neuroinformatics (Medicine: Computer Science, Interdisciplinary Applications)

CiteScore

6.00

Self-citation rate

6.70%

Articles published

Review time

3 months

About the journal: Neuroinformatics publishes original articles and reviews with an emphasis on data structure and software tools related to analysis, modeling, integration, and sharing in all areas of neuroscience research. The editors particularly invite contributions on: (1) Theory and methodology, including discussions on ontologies, modeling approaches, database design, and meta-analyses; (2) Descriptions of developed databases and software tools, and of the methods for their distribution; (3) Relevant experimental results, such as reports accompanied by the release of massive data sets; (4) Computational simulations of models integrating and organizing complex data; and (5) Neuroengineering approaches, including hardware, robotics, and information theory studies.