Semi-Supervised Non-Parametric Bayesian Modelling of Spatial Proteomics.

Impact factor 1.3; Zone 4 (Mathematics); JCR Q2 (Statistics & Probability)
Oliver M Crook, Kathryn S Lilley, Laurent Gatto, Paul D W Kirk
Journal: Annals of Applied Statistics, volume 16, issue 4
Publication date: 2022-12-01
DOI: 10.1214/22-AOAS1603 (https://doi.org/10.1214/22-AOAS1603)
PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7613899/pdf/EMS143956.pdf
Citations: 0

Abstract

Understanding sub-cellular protein localisation is an essential component in the analysis of context-specific protein function. Recent advances in quantitative mass-spectrometry (MS) have led to high-resolution mapping of thousands of proteins to sub-cellular locations within the cell. Novel modelling considerations to capture the complex nature of these data are thus necessary. We approach analysis of spatial proteomics data in a non-parametric Bayesian framework, using K-component mixtures of Gaussian process regression models. The Gaussian process regression model accounts for correlation structure within a sub-cellular niche, with each mixture component capturing the distinct correlation structure observed within each niche. The availability of marker proteins (i.e. proteins with a priori known labelled locations) motivates a semi-supervised learning approach to inform the Gaussian process hyperparameters. We also provide an efficient Hamiltonian-within-Gibbs sampler for our model. Furthermore, we reduce the computational burden associated with inversion of covariance matrices by exploiting the structure in the covariance matrix. A tensor decomposition of our covariance matrices allows extended Trench and Durbin algorithms to be applied to reduce the computational complexity of inversion and hence accelerate computation. We provide detailed case studies on Drosophila embryos and mouse pluripotent embryonic stem cells to illustrate the benefit of semi-supervised functional Bayesian modelling of the data.
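
The Trench and Durbin algorithms mentioned in the abstract are classically stated for Toeplitz matrices, which arise when a stationary kernel is evaluated on an equally spaced grid. The sketch below illustrates that idea under stated assumptions: it uses the textbook Levinson recursion (closely related to Durbin's algorithm), not the authors' extended version, and the squared-exponential kernel, the hyperparameter values, the example profile, and all function names are hypothetical choices made for illustration.

```python
import math


def sq_exp_first_column(n, amplitude, length_scale, noise_sd):
    """First column of a squared-exponential covariance on the grid 0, 1, ..., n-1.

    On an equally spaced grid the entry K[i][j] depends only on |i - j|,
    so the whole matrix is symmetric Toeplitz and this column defines it.
    """
    col = [amplitude ** 2 * math.exp(-0.5 * (lag / length_scale) ** 2)
           for lag in range(n)]
    col[0] += noise_sd ** 2  # observation noise on the diagonal
    return col


def levinson_solve(t, y):
    """Solve T x = y for a symmetric positive-definite Toeplitz T with first column t."""
    f = [1.0 / t[0]]   # forward vector: T_k f = (1, 0, ..., 0)
    x = [y[0] / t[0]]  # solution of the leading k-by-k system
    for k in range(1, len(t)):
        # Mismatch in the new last row when f is padded with a zero.
        eps = sum(t[k - i] * f[i] for i in range(k))
        padded_f = f + [0.0]
        padded_b = [0.0] + f[::-1]  # backward vector is f reversed (symmetry)
        f = [(pf - eps * pb) / (1.0 - eps * eps)
             for pf, pb in zip(padded_f, padded_b)]
        b = f[::-1]  # now T_{k+1} b = (0, ..., 0, 1)
        # Correct the padded solution so the new last equation holds.
        eps_x = sum(t[k - i] * x[i] for i in range(k))
        x = [xi + (y[k] - eps_x) * bi for xi, bi in zip(x + [0.0], b)]
    return x


def toeplitz_matvec(t, x):
    """Dense product T x, used only to check the solver."""
    n = len(t)
    return [sum(t[abs(i - j)] * x[j] for j in range(n)) for i in range(n)]


# Hypothetical protein profile across 8 equally spaced fractions.
profile = [0.3, 0.5, 0.9, 1.2, 1.0, 0.6, 0.4, 0.2]
column = sq_exp_first_column(len(profile), amplitude=1.0, length_scale=2.0, noise_sd=0.1)
weights = levinson_solve(column, profile)  # K^{-1} y in O(n^2) operations
residual = max(abs(r - p) for r, p in zip(toeplitz_matvec(column, weights), profile))
print(f"max residual: {residual:.2e}")
```

The recursion needs only the first column of the covariance matrix and O(n^2) arithmetic, compared with O(n^3) for a generic dense solve. This is the kind of saving that exploiting covariance structure can offer when the Gaussian process likelihood is evaluated many times inside a sampler.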

Source journal
Annals of Applied Statistics (Social Sciences: Statistics & Probability)
CiteScore: 3.10
Self-citation rate: 5.60%
Articles published: 131
Review time: 6-12 weeks
About the journal: Statistical research spans an enormous range from direct subject-matter collaborations to pure mathematical theory. The Annals of Applied Statistics, the newest journal from the IMS, is aimed at papers in the applied half of this range. The journal is published quarterly in both print and electronic form, and its goal is to provide a timely and unified forum for all areas of applied statistics.