Geometric Structure Guided Model and Algorithms for Complete Deconvolution of Gene Expression Data

Impact factor: 1.7, JCR Q2 (Mathematics, Applied)

Foundations of data science (Springfield, Mo.), Pub Date: 2022-09-01, DOI: 10.3934/fods.2022013

Duan Chen, Shaoyu Li, Xue Wang

Pages: 441-466. Full-text PDF (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10798655/pdf/

Citations: 0

Abstract


Complete deconvolution of bulk RNA-seq data helps determine whether differences in disease-associated gene expression profiles (GEPs) between tissues of patients and normal controls arise from changes in the cellular composition of the tissue samples or from changes in the GEPs of specific cells. One of the major techniques for complete deconvolution is nonnegative matrix factorization (NMF), which also has a wide range of applications in the machine learning community. However, NMF is well known to be strongly ill-posed, so applying it directly to RNA-seq data yields solutions that are very difficult to interpret. In this paper, we develop an NMF-based mathematical model and corresponding computational algorithms to improve solution identifiability when deconvoluting bulk RNA-seq data. Our approach combines the biological concept of marker genes with the solvability conditions of NMF theory to build a geometric-structure-guided optimization model. The geometric structure of the bulk tissue data is first explored by spectral clustering. The marker-gene information identified in this step is then integrated as solvability constraints, while the overall correlation graph is used as manifold regularization. Validation on both synthetic and biological data shows that the proposed model and algorithms significantly improve solution interpretability and accuracy.
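
To make the ingredients named in the abstract concrete, here is a minimal Python sketch of NMF with a marker-gene constraint and a graph (manifold) regularizer. It is not the authors' model or algorithm: the abstract does not give the objective, so the sketch assumes the standard graph-regularized NMF objective ||X - WH||_F^2 + lam * tr(H L H^T) with multiplicative updates, assumes the graph is built over samples from positive correlations, and assumes the marker genes are already known (the paper identifies them by spectral clustering, which is omitted here). The names graph_nmf, make_synthetic, to_proportions, mask, and lam are hypothetical.

```python
import numpy as np


def make_synthetic(m=60, n=40, k=3, n_markers=5, seed=1):
    """Toy bulk data X = W @ H with known marker genes (illustrative only)."""
    rng = np.random.default_rng(seed)
    W = rng.gamma(2.0, 1.0, size=(m, k))          # genes x cell types
    mask = np.ones((m, k), dtype=bool)            # True = entry may be nonzero
    for j in range(k):
        rows = slice(j * n_markers, (j + 1) * n_markers)
        mask[rows, :] = False                     # a marker gene is expressed ...
        mask[rows, j] = True                      # ... in exactly one cell type
    W = W * mask
    H = rng.dirichlet(np.ones(k), size=n).T       # cell types x samples, columns sum to 1
    return W @ H, W, H, mask


def graph_nmf(X, k, mask=None, lam=0.1, n_iter=500, seed=0, eps=1e-10):
    """Graph-regularized NMF: X (genes x samples) ~ W (genes x k) @ H (k x samples)."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    W = rng.random((m, k)) + 0.1
    H = rng.random((k, n)) + 0.1
    if mask is not None:
        # Multiplicative updates keep zeros at zero, so masking the
        # initial W enforces the marker-gene pattern for all iterations.
        W = W * mask

    # Assumed graph: sample-sample affinity from positive correlations.
    A = np.clip(np.corrcoef(X.T), 0.0, None)
    np.fill_diagonal(A, 0.0)
    D = np.diag(A.sum(axis=1))                    # degree matrix; Laplacian is D - A

    losses = []
    for _ in range(n_iter):
        W *= (X @ H.T) / (W @ (H @ H.T) + eps)
        H *= (W.T @ X + lam * H @ A) / (W.T @ W @ H + lam * H @ D + eps)
        R = X - W @ H
        losses.append(np.sum(R * R) + lam * np.trace(H @ (D - A) @ H.T))
    return W, H, losses


def to_proportions(H):
    """Rescale each sample's column of H to sum to 1 (post-processing step)."""
    return H / H.sum(axis=0, keepdims=True)


X, W_true, H_true, mask = make_synthetic()
W, H, losses = graph_nmf(X, k=3, mask=mask)
rel_err = np.linalg.norm(X - W @ H) / np.linalg.norm(X)
print("relative reconstruction error:", rel_err)
print("estimated proportions, first sample:", to_proportions(H)[:, 0])
```

The mask is the part that speaks to identifiability: plain NMF is only determined up to scaling, permutation, and other invertible mixing of the factors, whereas fixing which genes are expressed in a single cell type pins each column of W to one cell type. The regularization weight lam and the correlation-based graph are tuning choices of this sketch, not values taken from the paper.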


Source journal

Foundations of data science (Springfield, Mo.)

CiteScore

3.30

Self-citation rate

0.00%

Articles published