Simon Schlumbohm, Julia E Neumann, Philipp Neumann
HarmonizR: blocking and singular feature data adjustment improve runtime efficiency and data preservation.
BMC Bioinformatics 26(1):47. Published 2025-02-11. DOI: 10.1186/s12859-025-06073-9 (https://doi.org/10.1186/s12859-025-06073-9)
Abstract
Background: Data adjustment is an essential tool for increasing statistical power during analysis, for example in the case of complex multi-experiment (single-cell) RNA, proteomics and other omics data. Despite its benefits, data integration introduces internal biases, so-called batch effects. Because such methods inherently produce missing values, and data integration introduces more of them, established algorithms such as ComBat and limma are unable to perform batch effect adjustment on such data. Recently, the HarmonizR framework was presented for these cases as a tool for missing-value-tolerant data adjustment.
Results: In this contribution, we provide significant improvements to the HarmonizR approach. A novel blocking strategy is introduced to substantially reduce runtime while still supporting parallel architectures. Additionally, a "unique removal" strategy has been integrated into HarmonizR to retain even more features for adjustment. In this work, we show (1) substantially improved runtime for both small and large real datasets and (2) the ability to retain more features from the integrated dataset during adjustment, with a feature rescue of up to 103.9% for our tested datasets.
Conclusion: The proposed improvements address shortcomings of the previously published HarmonizR version. Since HarmonizR was mainly developed for dataset integration on rare tumor entities, it did not include runtime improvements beyond parallelization; this update addresses that gap. The improved feature rescue further enhances the algorithm's ability to perform batch effect reduction quickly and robustly.
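The abstract does not describe the mechanics of the blocking strategy. As a rough illustration only, the sketch below assumes that features are grouped by the batches in which they have enough observed values, and that blocking coarsens this grouping by treating runs of consecutive batches as one unit. The function names, the `min_obs` threshold and the rule that a block counts as present only if every batch in it is present are all hypothetical choices made for this example, not the authors' implementation.

```python
from collections import defaultdict
from math import isnan

def batch_presence(row, batch_of_column, min_obs=2):
    """Return the set of batches in which a feature has at least min_obs observed values."""
    counts = defaultdict(int)
    for value, batch in zip(row, batch_of_column):
        if not isnan(value):
            counts[batch] += 1
    return {batch for batch, n in counts.items() if n >= min_obs}

def group_by_pattern(matrix, batch_of_column, block_size=1, min_obs=2):
    """Group feature names by their presence pattern over blocks of consecutive batches."""
    batches = sorted(set(batch_of_column))
    blocks = [batches[i:i + block_size] for i in range(0, len(batches), block_size)]
    groups = defaultdict(list)
    for name, row in matrix.items():
        present = batch_presence(row, batch_of_column, min_obs)
        # Assumed rule: a block is "present" only if all of its batches are present.
        pattern = tuple(all(b in present for b in block) for block in blocks)
        groups[pattern].append(name)
    return dict(groups)

# Toy data: four batches with two samples each; NaN marks a missing value.
nan = float("nan")
batch_of_column = [1, 1, 2, 2, 3, 3, 4, 4]
matrix = {
    "f1": [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0],  # observed everywhere
    "f2": [1.0, 1.0, nan, nan, 1.0, 1.0, 1.0, 1.0],  # missing in batch 2
    "f3": [nan, nan, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0],  # missing in batch 1
    "f4": [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, nan, nan],  # missing in batch 4
}

per_batch = group_by_pattern(matrix, batch_of_column, block_size=1)
per_block = group_by_pattern(matrix, batch_of_column, block_size=2)
print(len(per_batch), len(per_block))
```

In this toy case, grouping per batch yields four distinct patterns, while grouping per block of two batches yields three, because f2 and f3 fall into the same group. Fewer groups means fewer sub-matrices to adjust separately, which is one plausible way a blocking scheme could reduce runtime.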
About the journal:
BMC Bioinformatics is an open access, peer-reviewed journal that considers articles on all aspects of the development, testing and novel application of computational and statistical methods for the modeling and analysis of all kinds of biological data, as well as other areas of computational biology.
BMC Bioinformatics is part of the BMC series which publishes subject-specific journals focused on the needs of individual research communities across all areas of biology and medicine. We offer an efficient, fair and friendly peer review service, and are committed to publishing all sound science, provided that there is some advance in knowledge presented by the work.