HarmonizR: blocking and singular feature data adjustment improve runtime efficiency and data preservation.

IF 2.9 | Region 3, Biology | Q2 BIOCHEMICAL RESEARCH METHODS

BMC Bioinformatics | Pub Date: 2025-02-11 | DOI: 10.1186/s12859-025-06073-9

Simon Schlumbohm, Julia E Neumann, Philipp Neumann

BMC Bioinformatics 26(1):47. Publication type: Journal Article. Full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11817103/pdf/

Citations: 0

Abstract

Background: Data adjustment is an essential tool for increasing statistical power during analysis, for example for complex multi-experiment data from (single-cell) RNA, proteomics and other omics experiments. Despite its benefits, data integration introduces internal biases, so-called batch effects. Because such methods inherently produce missing values, and data integration introduces additional ones, established algorithms such as ComBat and limma are unable to perform batch effect adjustment. The HarmonizR framework was recently presented for these cases as a tool for missing-value-tolerant data adjustment.

Results: In this contribution, we provide significant improvements to the HarmonizR approach. A novel blocking strategy is introduced to substantially reduce runtime while still supporting parallel architectures. Additionally, a "unique removal" strategy has been integrated into HarmonizR to retain even more features for adjustment. In this work, we show (1) substantially improved runtime for both small and large real datasets and (2) the ability to retain more features from the integrated dataset during adjustment, with a feature rescue of up to 103.9% for our tested datasets.
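
The blocking idea in the Results can be illustrated with a small sketch. It assumes (the abstract does not state this) that features are grouped by the pattern of batches in which they are present, that each group is adjusted as its own sub-matrix, and that blocking treats several neighboring batches as one unit when these patterns are formed. The function names, the `block_size` parameter, and the rule that a block counts as present only when all of its batches are present are hypothetical choices for illustration, not the HarmonizR API.

```python
from collections import defaultdict


def group_by_pattern(presence):
    """Group features that share the same presence pattern across batches.

    presence maps a feature name to a tuple of booleans, one per batch.
    Each resulting group stands for one sub-matrix to adjust separately.
    """
    groups = defaultdict(list)
    for feature, pattern in presence.items():
        groups[tuple(pattern)].append(feature)
    return dict(groups)


def block_pattern(pattern, block_size):
    """Collapse a per-batch pattern into a per-block pattern.

    Assumption for this sketch: a block is usable only if the feature
    is present in every batch that belongs to the block.
    """
    return tuple(
        all(pattern[i:i + block_size])
        for i in range(0, len(pattern), block_size)
    )


def group_by_blocked_pattern(presence, block_size):
    """Group features by their block-level pattern instead of batch-level."""
    blocked = {f: block_pattern(p, block_size) for f, p in presence.items()}
    return group_by_pattern(blocked)


# Hypothetical toy data: five features measured across four batches.
presence = {
    "p1": (True, True, True, True),
    "p2": (True, True, False, False),
    "p3": (True, True, True, False),
    "p4": (True, True, False, True),
    "p5": (False, True, True, True),
}

unblocked = group_by_pattern(presence)
blocked = group_by_blocked_pattern(presence, block_size=2)
print(len(unblocked), "sub-matrices without blocking")
print(len(blocked), "sub-matrices with blocks of two batches")
```

In this toy setup, fewer distinct patterns means fewer sub-matrices to adjust, which is where a runtime gain would come from. Under the assumed all-batches rule, a feature such as `p3` loses its values in a partially covered block, so the sketch also shows why a feature-rescue step can matter alongside blocking.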

Conclusion: The proposed improvements address the shortcomings of the previously published HarmonizR version. Since HarmonizR was mainly developed for dataset integration on rare tumor entities, it did not include runtime improvements beyond parallelization; this update addresses that gap. The improved feature rescue further enhances the algorithm's ability to perform batch effect reduction quickly and robustly.

Source journal

BMC Bioinformatics (Biology: Biochemical Research Methods)

CiteScore: 5.70

Self-citation rate: 3.30%

Articles published: 506

Review time: 4.3 months

About the journal: BMC Bioinformatics is an open access, peer-reviewed journal that considers articles on all aspects of the development, testing and novel application of computational and statistical methods for the modeling and analysis of all kinds of biological data, as well as other areas of computational biology. BMC Bioinformatics is part of the BMC series which publishes subject-specific journals focused on the needs of individual research communities across all areas of biology and medicine. We offer an efficient, fair and friendly peer review service, and are committed to publishing all sound science, provided that there is some advance in knowledge presented by the work.