Fast matrix completion in epigenetic methylation studies with informative covariates.

Impact factor 1.8 · Zone 3 (Mathematics) · JCR Q3 · Mathematical & Computational Biology
Mélina Ribaud, Aurélie Labbe, Khaled Fouda, Karim Oualkacha
Biostatistics, pages 1062-1078. Published 2024-10-01. Journal Article.
DOI: 10.1093/biostatistics/kxae016 (https://doi.org/10.1093/biostatistics/kxae016)
PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11471954/pdf/

Abstract

DNA methylation is a key epigenetic mark that modulates gene expression by inhibiting the binding of transcriptional proteins to DNA. As in many other omics experiments, missing values are a major issue, and appropriate imputation techniques are needed both to avoid an unnecessary reduction in sample size and to make optimal use of the information collected. We consider the case where relatively few samples are processed via an expensive high-density whole genome bisulfite sequencing (WGBS) strategy and a larger number of samples are processed using more affordable low-density, array-based technologies. In such cases, one can impute the low-coverage (array-based) methylation data using the high-density information provided by the WGBS samples. In this paper, we propose an efficient Linear Model of Coregionalisation with informative Covariates (LMCC) to predict missing values based on observed values and covariates. Our model assumes that, at each site, the methylation vector of all samples is linked to a set of fixed factors (covariates) and a set of latent factors. Furthermore, we exploit the functional nature of the data and the spatial correlation across sites by placing Gaussian process priors on the fixed and latent coefficient vectors, respectively. Our simulations show that the use of covariates can significantly improve the accuracy of imputed values, especially when the missing data contain relevant information about the explanatory variable. We also show that our proposed model is particularly efficient when the number of columns is much greater than the number of rows, which is usually the case in methylation data analysis. Finally, we apply our proposed method to two real methylation datasets and compare it with alternative approaches, showing how covariates such as cell type, tissue type or age can enhance the accuracy of imputed values.
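
The model structure described in the abstract can be illustrated with a small simulation. The sketch below is not the authors' LMCC implementation. It only generates data of the general form "covariate effects plus latent-factor effects plus noise", with coefficients that vary smoothly across sites, and then hides entries in a WGBS-versus-array pattern. The squared-exponential kernel, the real-valued (unbounded) methylation scale, all parameter values, and the function names `rbf_kernel`, `simulate_methylation` and `mask_low_density` are assumptions made for illustration.

```python
import numpy as np


def rbf_kernel(positions, length_scale):
    """Squared-exponential covariance between site positions (assumed kernel)."""
    diff = positions[:, None] - positions[None, :]
    return np.exp(-0.5 * (diff / length_scale) ** 2)


def simulate_methylation(n_samples=20, n_sites=200, n_covariates=2,
                         n_latent=3, noise_sd=0.1, seed=0):
    """Simulate a samples-by-sites matrix with smooth site-wise coefficients."""
    rng = np.random.default_rng(seed)
    positions = np.sort(rng.uniform(0.0, 1.0, size=n_sites))

    # Small diagonal jitter keeps the Cholesky factorisation numerically stable.
    cov = rbf_kernel(positions, length_scale=0.1) + 1e-6 * np.eye(n_sites)
    chol = np.linalg.cholesky(cov)

    # Each row of a coefficient matrix is one Gaussian-process draw across sites.
    fixed_coef = (chol @ rng.normal(size=(n_sites, n_covariates))).T
    latent_coef = (chol @ rng.normal(size=(n_sites, n_latent))).T

    covariates = rng.normal(size=(n_samples, n_covariates))
    latent_factors = rng.normal(size=(n_samples, n_latent))

    signal = covariates @ fixed_coef + latent_factors @ latent_coef
    values = signal + noise_sd * rng.normal(size=(n_samples, n_sites))
    return positions, covariates, values


def mask_low_density(values, n_dense=4, array_fraction=0.25, seed=1):
    """Keep all sites for the first n_dense samples, a site subset for the rest."""
    rng = np.random.default_rng(seed)
    n_sites = values.shape[1]
    n_array = int(array_fraction * n_sites)
    array_sites = rng.choice(n_sites, size=n_array, replace=False)

    observed = np.zeros(values.shape, dtype=bool)
    observed[:n_dense, :] = True      # high-density (WGBS-like) samples
    observed[:, array_sites] = True   # sites covered by the array for everyone
    masked = np.where(observed, values, np.nan)
    return masked, observed


if __name__ == "__main__":
    positions, covariates, values = simulate_methylation()
    masked, observed = mask_low_density(values)
    print(masked.shape, observed.mean())
```

The entries set to NaN in `masked` are the ones an imputation method would need to predict from the observed entries and from `covariates`. Any completion method can be plugged in at that point and scored against the held-out entries of `values`.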

Source journal: Biostatistics (Biology: Mathematical & Computational Biology)
CiteScore: 5.10
Self-citation rate: 4.80%
Articles published: 45
Review time: 6-12 weeks
Journal description: Among the important scientific developments of the 20th century is the explosive growth in statistical reasoning and methods for application to studies of human health. Examples include developments in likelihood methods for inference, epidemiologic statistics, clinical trials, survival analysis, and statistical genetics. Substantive problems in public health and biomedical research have fueled the development of statistical methods, which in turn have improved our ability to draw valid inferences from data. The objective of Biostatistics is to advance statistical science and its application to problems of human health and disease, with the ultimate goal of advancing the public's health.