Does combining numerous data types in multi-omics data improve or hinder performance in survival prediction? Insights from a large-scale benchmark study.

Impact factor 3.3, CAS Zone 3 (Medicine), JCR Q2 (Medical Informatics)

BMC Medical Informatics and Decision Making, Pub Date: 2024-09-02, DOI: 10.1186/s12911-024-02642-9

Yingxia Li, Tobias Herold, Ulrich Mansmann, Roman Hornung



Abstract

Background: Predictive modeling based on multi-omics data, which incorporates several types of omics data for the same patients, has shown potential to outperform single-omics predictive modeling. Most research in this domain focuses on incorporating numerous data types, despite the complexity and cost of acquiring them. The prevailing assumption is that increasing the number of data types necessarily improves predictive performance. However, the integration of less informative or redundant data types could potentially hinder this performance. Therefore, identifying the most effective combinations of omics data types that enhance predictive performance is critical for cost-effective and accurate predictions.

Methods: In this study, we systematically evaluated the predictive performance of all 31 possible non-empty combinations of five genomic data types (mRNA, miRNA, methylation, DNAseq, and copy number variation) using 14 cancer datasets with right-censored survival outcomes, publicly available from the TCGA database. We employed various prediction methods and up-weighted clinical data in every model to leverage their predictive importance. Harrell's C-index and the integrated Brier score were used as performance measures. To assess the robustness of our findings, we performed a bootstrap analysis at the level of the included datasets. Statistical testing was conducted for key results, limiting the number of tests to ensure a low risk of false positives.
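
As a rough illustration of two quantities named in the Methods, the sketch below enumerates the non-empty combinations of five data types (2^5 - 1 = 31) and computes a naive Harrell's C-index for right-censored data. It is not the authors' code: the function names, the abbreviation `CNV`, and the toy inputs are assumptions made for this example, and the study's actual prediction methods and software are not specified in the abstract.

```python
from itertools import combinations

# Data-type names taken from the Methods paragraph; "CNV" is our own
# shorthand for copy number variation (an assumption for this sketch).
DATA_TYPES = ["mRNA", "miRNA", "methylation", "DNAseq", "CNV"]


def nonempty_combinations(items):
    """Return every subset of `items` that contains at least one element."""
    return [
        combo
        for size in range(1, len(items) + 1)
        for combo in combinations(items, size)
    ]


def harrell_c_index(times, events, risk_scores):
    """Naive O(n^2) Harrell's C-index for right-censored survival data.

    A pair (i, j) is comparable when subject i has the strictly shorter
    time and an observed event. The pair is concordant when subject i
    also has the higher risk score; tied risk scores count as one half.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored subject cannot be the earlier member of a pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


if __name__ == "__main__":
    combos = nonempty_combinations(DATA_TYPES)
    print(len(combos))  # number of data-type combinations to evaluate

    # Toy data: higher risk score should correspond to shorter survival.
    times = [1, 2, 3, 4]
    events = [1, 1, 0, 1]  # 1 = event observed, 0 = right-censored
    print(harrell_c_index(times, events, [4, 3, 2, 1]))
```

A C-index of 1 means every comparable pair is ranked correctly, 0.5 corresponds to random ranking, and 0 means every comparable pair is ranked in reverse. The integrated Brier score additionally requires an estimate of the censoring distribution and is omitted here for brevity.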

Results: Contrary to expectations, we found that using only mRNA data or a combination of mRNA and miRNA data was sufficient for most cancer types. For some cancer types, the additional inclusion of methylation data led to improved prediction results. Far from enhancing performance, the introduction of more data types most often resulted in a decline in performance, which varied between the two performance measures.

Conclusions: Our findings challenge the prevailing notion that combining multiple omics data types in multi-omics survival prediction improves predictive performance. Thus, the widespread approach in multi-omics prediction of incorporating as many data types as possible should be reconsidered to avoid suboptimal prediction results and unnecessary expenditure.


Source journal: BMC Medical Informatics and Decision Making (Medicine: Medical Informatics)

CiteScore: 7.20; self-citation rate: 5.70%; articles published: 297; review time: 1 month

About the journal: BMC Medical Informatics and Decision Making is an open access journal publishing original peer-reviewed research articles in relation to the design, development, implementation, use, and evaluation of health information technologies and decision-making for human health.