Scalable randomized kernel methods for multiview data integration and prediction with application to Coronavirus disease.

Impact factor: 2.0 · Zone 3 (Mathematics) · JCR Q3 (Mathematical & Computational Biology)

Biostatistics Pub Date : 2024-12-31 DOI:10.1093/biostatistics/kxaf001

Sandra E Safo, Han Lu

Biostatistics 26(1). Full text (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11839864/pdf/

Citations: 0

Abstract

Four years into the pandemic, much remains to be learned about the pathobiology of coronavirus disease (COVID-19). A multiomics approach offers a comprehensive view of the disease and has the potential to yield deeper insight into its pathogenesis. Previous multiomics integrative analysis and prediction studies of COVID-19 severity and status have assumed simple (i.e., linear) relationships between omics data and between omics and COVID-19 outcomes. However, these linear methods do not account for the underlying nonlinear structure of these different types of data. The motivation behind this work is to model nonlinear relationships in multiomics and COVID-19 outcomes, and to determine key multidimensional molecules associated with the disease. Toward this goal, we develop scalable randomized kernel methods for jointly associating data from multiple sources or views and simultaneously predicting an outcome or classifying a unit into one of two or more classes. We also determine the variables or groups of variables that contribute most to the relationships among the views. We use the idea that random Fourier bases can approximate shift-invariant kernel functions to construct nonlinear mappings of each view, and we use these mappings and the outcome variable to learn view-independent low-dimensional representations. We demonstrate the effectiveness of the proposed methods through extensive simulations. When the proposed methods were applied to gene expression, metabolomics, proteomics, and lipidomics data pertaining to COVID-19, we identified several molecular signatures for COVID-19 status and severity. Our results agree with previous findings and suggest potential avenues for future research. Our algorithms are implemented in PyTorch, interfaced in R, and available at https://github.com/lasandrall/RandMVLearn.
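The random Fourier idea mentioned in the abstract can be illustrated with a small sketch. This is not the authors' RandMVLearn code: the Gaussian kernel, the bandwidth `sigma`, and the names `gaussian_kernel` and `make_rff_map` are assumptions chosen for illustration, and only the Python standard library is used. The sketch draws random frequencies and phases, builds a cosine feature map `z`, and compares the inner product `z(x)·z(y)` with the exact kernel value.

```python
import math
import random


def gaussian_kernel(x, y, sigma=1.0):
    """Exact Gaussian (shift-invariant) kernel exp(-||x - y||^2 / (2 sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def make_rff_map(dim, n_features, sigma=1.0, seed=0):
    """Return a feature map z such that z(x) . z(y) approximates the Gaussian kernel.

    Frequencies are drawn from N(0, sigma^-2 I) and phases from U[0, 2*pi],
    which is the standard random Fourier feature construction for this kernel.
    """
    rng = random.Random(seed)
    freqs = [[rng.gauss(0.0, 1.0 / sigma) for _ in range(dim)]
             for _ in range(n_features)]
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n_features)]
    scale = math.sqrt(2.0 / n_features)

    def z(x):
        return [scale * math.cos(sum(w_j * x_j for w_j, x_j in zip(w, x)) + b)
                for w, b in zip(freqs, phases)]

    return z


# Illustrative inputs (hypothetical values, not data from the paper).
x = [0.2, -0.1, 0.4]
y = [0.0, 0.3, 0.1]

z = make_rff_map(dim=3, n_features=5000, sigma=1.0, seed=0)
exact = gaussian_kernel(x, y, sigma=1.0)
approx = sum(a * b for a, b in zip(z(x), z(y)))

print(f"exact kernel value:  {exact:.4f}")
print(f"random-feature estimate: {approx:.4f}")
```

Because the kernel is replaced by an explicit finite-dimensional map, later steps can work with an n-by-D feature matrix per view rather than an n-by-n kernel matrix, which is presumably where the scalability described in the abstract comes from. The approximation error shrinks roughly as one over the square root of the number of random features.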

Source journal

Biostatistics (Biology: Mathematical & Computational Biology)

CiteScore: 5.10

Self-citation rate: 4.80%

Review time: 6-12 weeks

About the journal: Among the important scientific developments of the 20th century is the explosive growth in statistical reasoning and methods for application to studies of human health. Examples include developments in likelihood methods for inference, epidemiologic statistics, clinical trials, survival analysis, and statistical genetics. Substantive problems in public health and biomedical research have fueled the development of statistical methods, which in turn have improved our ability to draw valid inferences from data. The objective of Biostatistics is to advance statistical science and its application to problems of human health and disease, with the ultimate goal of advancing the public's health.