Flexible model-based non-negative matrix factorization with application to mutational signatures.

IF 0.9, Zone 4 Mathematics, Q3 Mathematics

Statistical Applications in Genetics and Molecular Biology Pub Date : 2024-05-16 eCollection Date: 2024-01-01 DOI:10.1515/sagmb-2023-0034

Ragnhild Laursen, Lasse Maretty, Asger Hobolth


Citations: 0

Abstract

Somatic mutations in cancer can be viewed as a mixture distribution of several mutational signatures, which can be inferred using non-negative matrix factorization (NMF). Mutational signatures have previously been parametrized using either simple mono-nucleotide interaction models or general tri-nucleotide interaction models. We describe a flexible and novel framework for identifying biologically plausible parametrizations of mutational signatures, and in particular for estimating di-nucleotide interaction models. Our novel estimation procedure is based on the expectation-maximization (EM) algorithm and regression in the log-linear quasi-Poisson model. We show that di-nucleotide interaction signatures are statistically stable and sufficiently complex to fit the mutational patterns. Di-nucleotide interaction signatures often strike the right balance between appropriately fitting the data and avoiding over-fitting. They provide a better fit to data and are biologically more plausible than mono-nucleotide interaction signatures, and the parametrization is more stable than the parameter-rich tri-nucleotide interaction signatures. We illustrate our framework in a large simulation study in which we compare it with state-of-the-art methods, and show results for three data sets of somatic mutation counts from patients with cancer in the breast, liver and urinary tract.
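
As a rough illustration of the kind of factorization the abstract refers to, the sketch below fits a plain Poisson NMF using the classical multiplicative updates for the generalized Kullback-Leibler divergence, which coincide with EM updates under a Poisson count model. It is not the authors' procedure: the framework described above additionally parametrizes the signatures through log-linear quasi-Poisson regression, which is not attempted here. The function names, the matrix orientation (patients by mutation types), and the simulated counts are all assumptions made for this example.

```python
import numpy as np


def poisson_nmf(V, rank, n_iter=500, seed=0, eps=1e-12):
    """Approximate a non-negative count matrix V (patients x mutation types)
    as W @ H, with W the exposures and H the signatures.

    Uses multiplicative updates for the generalized KL divergence,
    i.e. maximum likelihood under V_ij ~ Poisson((W @ H)_ij).
    """
    rng = np.random.default_rng(seed)
    n_patients, n_types = V.shape
    W = rng.random((n_patients, rank)) + eps  # exposures (hypothetical name)
    H = rng.random((rank, n_types)) + eps     # signatures (hypothetical name)

    for _ in range(n_iter):
        # Update exposures given the current signatures.
        WH = W @ H + eps
        W *= ((V / WH) @ H.T) / (H.sum(axis=1) + eps)
        # Update signatures given the new exposures.
        WH = W @ H + eps
        H *= (W.T @ (V / WH)) / (W.sum(axis=0)[:, None] + eps)

    # Rescale so each signature is a probability distribution over
    # mutation types; the product W @ H is left unchanged.
    scale = H.sum(axis=1)
    return W * scale, H / scale[:, None]


def gkl_divergence(V, WH, eps=1e-12):
    """Generalized KL divergence between V and WH (with 0 * log 0 = 0)."""
    mask = V > 0
    log_term = np.sum(V[mask] * np.log(V[mask] / (WH[mask] + eps)))
    return float(log_term - V.sum() + WH.sum())


if __name__ == "__main__":
    # Simulated data only: 20 patients, 96 mutation types, 3 signatures.
    rng = np.random.default_rng(1)
    H_true = rng.dirichlet(np.ones(96), size=3)
    W_true = rng.gamma(2.0, 200.0, size=(20, 3))
    V = rng.poisson(W_true @ H_true).astype(float)

    W, H = poisson_nmf(V, rank=3)
    print("divergence:", gkl_divergence(V, W @ H))
```

In this unconstrained form every signature has one free parameter per mutation type, which corresponds to the parameter-rich end of the range the abstract discusses. A mono- or di-nucleotide interaction model would instead restrict how the entries of each row of H can vary.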


Source journal

Statistical Applications in Genetics and Molecular Biology (Biology: Biochemistry and Molecular Biology)

CiteScore

1.20

Self-citation rate

11.10%

Articles published

Review time

6-12 weeks

Journal description: Statistical Applications in Genetics and Molecular Biology seeks to publish significant research on the application of statistical ideas to problems arising from computational biology. The focus of the papers should be on the relevant statistical issues, but they should contain a succinct description of the relevant biological problem being considered. The range of topics is wide and includes linkage mapping, association studies, gene finding and sequence alignment, protein structure prediction, design and analysis of microarray data, molecular evolution and phylogenetic trees, DNA topology, and database search strategies. Both original research and review articles will be warmly received.