Fast matrix completion in epigenetic methylation studies with informative covariates.

IF 2.0 · Zone 3 (Mathematics) · JCR Q3 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Biostatistics Pub Date : 2024-10-01 DOI:10.1093/biostatistics/kxae016

Mélina Ribaud, Aurélie Labbe, Khaled Fouda, Karim Oualkacha

Biostatistics, pp. 1062-1078. Journal Article. PDF (PubMed Central): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11471954/pdf/

Citations: 0

Abstract

DNA methylation is an important epigenetic mark that modulates gene expression by inhibiting the binding of transcriptional proteins to DNA. As in many other omics experiments, missing values are a major issue, and appropriate imputation techniques are needed both to avoid an unnecessary reduction in sample size and to make full use of the information collected. We consider the case where relatively few samples are processed via an expensive high-density whole genome bisulfite sequencing (WGBS) strategy and a larger number of samples are processed using more affordable low-density, array-based technologies. In such cases, one can impute the low-coverage (array-based) methylation data using the high-density information provided by the WGBS samples. In this paper, we propose an efficient Linear Model of Coregionalisation with informative Covariates (LMCC) to predict missing values based on observed values and covariates. Our model assumes that, at each site, the methylation vector of all samples is linked to a set of fixed factors (covariates) and a set of latent factors. Furthermore, we exploit the functional nature of the data and the spatial correlation across sites by assuming Gaussian processes on the fixed and latent coefficient vectors, respectively. Our simulations show that the use of covariates can significantly improve the accuracy of imputed values, especially when the missing data contain relevant information about the explanatory variable. We also show that the proposed model is particularly efficient when the number of columns is much greater than the number of rows, which is usually the case in methylation data analysis. Finally, we apply our proposed method to two real methylation datasets and compare it with alternative approaches, showing how covariates such as cell type, tissue type, or age can enhance the accuracy of imputed values.
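
The setting described in the abstract can be illustrated with a small simulation. The sketch below is not the authors' LMCC estimator: it only generates data with the structure the abstract describes (covariate effects plus latent factors, both varying smoothly across sites) and compares two naive imputers, column means and a per-site least-squares fit on the covariates. All names and settings (`n_dense`, `array_step`, the squared-exponential kernel, the matrix sizes, the unconstrained value scale) are assumptions made for illustration.

```python
import numpy as np


def smooth_rows(n_rows, sites, length_scale, rng):
    """Draw n_rows Gaussian-process paths over the site positions."""
    d = sites[:, None] - sites[None, :]
    K = np.exp(-0.5 * (d / length_scale) ** 2)  # squared-exponential kernel (assumed)
    vals, vecs = np.linalg.eigh(K)
    root = vecs * np.sqrt(np.clip(vals, 0.0, None))  # root @ root.T equals K
    return (root @ rng.standard_normal((sites.size, n_rows))).T


def simulate(n_samples=40, n_sites=200, n_cov=2, n_latent=2, noise_sd=0.1, seed=0):
    """Rows are samples, columns are sites: Y = X B + A W + noise."""
    rng = np.random.default_rng(seed)
    sites = np.linspace(0.0, 1.0, n_sites)
    B = smooth_rows(n_cov, sites, 0.1, rng)     # fixed-effect coefficients per site
    W = smooth_rows(n_latent, sites, 0.1, rng)  # latent coefficients per site
    X = rng.standard_normal((n_samples, n_cov))           # observed covariates
    A = 0.3 * rng.standard_normal((n_samples, n_latent))  # latent sample loadings
    Y = X @ B + A @ W + noise_sd * rng.standard_normal((n_samples, n_sites))
    return X, Y


def make_mask(n_samples, n_sites, n_dense=10, array_step=10):
    """True where observed: a few dense (WGBS-like) rows, sparse array-like rows."""
    mask = np.zeros((n_samples, n_sites), dtype=bool)
    mask[:n_dense] = True                # dense samples observe every site
    mask[n_dense:, ::array_step] = True  # other samples observe a subset of sites
    return mask


def impute_with_column_means(Y_obs, mask):
    """Fill each missing entry with the mean of the observed values at that site."""
    return np.where(mask, Y_obs, np.nanmean(Y_obs, axis=0))


def impute_with_covariates(Y_obs, mask, X):
    """Per-site least squares on the covariates, fitted on the observed samples."""
    n, p = Y_obs.shape
    design = np.column_stack([np.ones(n), X])
    Y_hat = Y_obs.copy()
    for j in range(p):
        obs = mask[:, j]
        if obs.all():
            continue  # nothing to impute at this site
        coef, *_ = np.linalg.lstsq(design[obs], Y_obs[obs, j], rcond=None)
        Y_hat[~obs, j] = design[~obs] @ coef
    return Y_hat


def rmse_on_missing(Y_hat, Y, mask):
    """Root mean squared error over the entries that were hidden."""
    return float(np.sqrt(np.mean((Y_hat[~mask] - Y[~mask]) ** 2)))


def run_demo(seed=0):
    X, Y = simulate(seed=seed)
    mask = make_mask(*Y.shape)
    Y_obs = np.where(mask, Y, np.nan)
    rmse_mean = rmse_on_missing(impute_with_column_means(Y_obs, mask), Y, mask)
    rmse_cov = rmse_on_missing(impute_with_covariates(Y_obs, mask, X), Y, mask)
    return rmse_mean, rmse_cov


if __name__ == "__main__":
    print("column-mean RMSE, covariate-regression RMSE:", run_demo())
```

In this toy setup the covariates explain most of the variance, so the regression-based imputer should recover hidden entries more accurately than column means. Neither baseline uses the correlation between neighbouring sites, which is the part the abstract says the Gaussian process assumptions are meant to exploit.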

Source journal

Biostatistics (Biology: Mathematical & Computational Biology)

CiteScore: 5.10

Self-citation rate: 4.80%

Articles published:

Review time: 6-12 weeks

Journal description: Among the important scientific developments of the 20th century is the explosive growth in statistical reasoning and methods for application to studies of human health. Examples include developments in likelihood methods for inference, epidemiologic statistics, clinical trials, survival analysis, and statistical genetics. Substantive problems in public health and biomedical research have fueled the development of statistical methods, which in turn have improved our ability to draw valid inferences from data. The objective of Biostatistics is to advance statistical science and its application to problems of human health and disease, with the ultimate goal of advancing the public's health.