Penalized likelihood optimization for censored missing value imputation in proteomics.

IF 2.0; Zone 3 (Mathematics); JCR Q3 (Mathematical & Computational Biology)

Biostatistics. Pub Date: 2024-12-31. DOI: 10.1093/biostatistics/kxaf006

Lucas Etourneau, Laura Fancello, Samuel Wieczorek, Nelle Varoquaux, Thomas Burger

Biostatistics, volume 26, issue 1. Publication type: Journal Article.

Citations: 0

Abstract

Label-free bottom-up proteomics using mass spectrometry and liquid chromatography has long been established as one of the most popular high-throughput analysis workflows for proteome characterization. However, it produces data hindered by complex and heterogeneous missing values, whose imputation has long remained problematic. To cope with this, we introduce Pirat, an algorithm that addresses this challenge using an original likelihood maximization strategy. Notably, it models the instrument limit by learning a global censoring mechanism from the available data. Moreover, it estimates the covariance matrix between enzymatic cleavage products (i.e., peptides or precursor ions), while offering a natural way to integrate complementary transcriptomic information when multi-omic assays are available. Our benchmarking on several datasets covering a variety of experimental designs (number of samples, acquisition mode, missingness patterns, etc.) and using a variety of metrics (differential analysis ground truth or imputation errors) shows that Pirat outperforms all pre-existing imputation methods. Beyond the interest of Pirat as an imputation tool, these results pinpoint the need for a paradigm change in proteomics imputation, as most pre-existing strategies could be boosted by incorporating similar models to account for the instrument censorship or for the correlation structures, either grounded in the analytical pipeline or arising from a multi-omic approach.
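To make the idea of censoring-aware likelihood maximization concrete, here is a minimal sketch in Python. It is not Pirat's model: the abstract describes a censoring mechanism learned globally from the data together with a peptide covariance matrix, whereas this toy example assumes a single peptide, Gaussian log-abundances, and a known hard detection limit below which values go missing. The function name `fit_left_censored_normal` and all numerical settings are hypothetical and chosen only for illustration.

```python
import math
import random


def norm_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def norm_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def fit_left_censored_normal(observed, n_censored, limit, n_iter=500):
    """EM estimate of (mu, sigma) when values below `limit` are missing.

    Assumption (not from the paper): a hard, known detection limit.
    Returns the estimates and the conditional mean E[x | x < limit],
    which can serve as an imputed value for each censored entry.
    """
    n = len(observed) + n_censored
    s1 = sum(observed)
    s2 = sum(x * x for x in observed)
    # Start from the (biased) moments of the observed values only.
    mu = s1 / len(observed)
    sigma = math.sqrt(max(s2 / len(observed) - mu * mu, 1e-12))
    for _ in range(n_iter):
        # E-step: first two moments of a normal truncated to x < limit.
        a = (limit - mu) / sigma
        lam = norm_pdf(a) / max(norm_cdf(a), 1e-300)
        m1 = mu - sigma * lam
        var = sigma * sigma * (1.0 - a * lam - lam * lam)
        m2 = var + m1 * m1
        # M-step: update the parameters from the completed moments.
        mu_new = (s1 + n_censored * m1) / n
        sigma = math.sqrt(max((s2 + n_censored * m2) / n - mu_new * mu_new, 1e-12))
        mu = mu_new
    a = (limit - mu) / sigma
    imputed = mu - sigma * norm_pdf(a) / max(norm_cdf(a), 1e-300)
    return mu, sigma, imputed


# Simulated log-intensities with values below the limit removed.
random.seed(0)
TRUE_MU, TRUE_SIGMA, LIMIT = 20.0, 2.0, 19.0
values = [random.gauss(TRUE_MU, TRUE_SIGMA) for _ in range(2000)]
observed = [v for v in values if v >= LIMIT]
n_censored = len(values) - len(observed)

mu_hat, sigma_hat, imputed = fit_left_censored_normal(observed, n_censored, LIMIT)
naive_mean = sum(observed) / len(observed)
print(f"naive mean of observed values: {naive_mean:.2f}")
print(f"censoring-aware estimate: mu={mu_hat:.2f}, sigma={sigma_hat:.2f}")
print(f"imputed value for censored entries: {imputed:.2f}")
```

In this sketch the mean of the observed values overestimates the true abundance because low values are systematically missing, while the censoring-aware fit corrects for it. A multivariate extension with a learned censoring mechanism, as described in the abstract, would need considerably more machinery than shown here.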


Source journal

Biostatistics (Biology: Mathematical & Computational Biology)

CiteScore: 5.10

Self-citation rate: 4.80%

Review time: 6-12 weeks

Journal description: Among the important scientific developments of the 20th century is the explosive growth in statistical reasoning and methods for application to studies of human health. Examples include developments in likelihood methods for inference, epidemiologic statistics, clinical trials, survival analysis, and statistical genetics. Substantive problems in public health and biomedical research have fueled the development of statistical methods, which in turn have improved our ability to draw valid inferences from data. The objective of Biostatistics is to advance statistical science and its application to problems of human health and disease, with the ultimate goal of advancing the public's health.