{"title":"应用于单细胞测序数据分析的高维过度分散广义因子模型","authors":"Jinyu Nie, Zhilong Qin, Wei Liu","doi":"10.1002/sim.10213","DOIUrl":null,"url":null,"abstract":"<p><p>The current high-dimensional linear factor models fail to account for the different types of variables, while high-dimensional nonlinear factor models often overlook the overdispersion present in mixed-type data. However, overdispersion is prevalent in practical applications, particularly in fields like biomedical and genomics studies. To address this practical demand, we propose an overdispersed generalized factor model (OverGFM) for performing high-dimensional nonlinear factor analysis on overdispersed mixed-type data. Our approach incorporates an additional error term to capture the overdispersion that cannot be accounted for by factors alone. However, this introduces significant computational challenges due to the involvement of two high-dimensional latent random matrices in the nonlinear model. To overcome these challenges, we propose a novel variational EM algorithm that integrates Laplace and Taylor approximations. This algorithm provides iterative explicit solutions for the complex variational parameters and is proven to possess excellent convergence properties. We also develop a criterion based on the singular value ratio to determine the optimal number of factors. Numerical results demonstrate the effectiveness of this criterion. Through comprehensive simulation studies, we show that OverGFM outperforms state-of-the-art methods in terms of estimation accuracy and computational efficiency. Furthermore, we demonstrate the practical merit of our method through its application to two datasets from genomics. To facilitate its usage, we have integrated the implementation of OverGFM into the R package GFM.</p>","PeriodicalId":21879,"journal":{"name":"Statistics in Medicine","volume":" ","pages":"4836-4849"},"PeriodicalIF":1.8000,"publicationDate":"2024-11-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"High-Dimensional Overdispersed Generalized Factor Model With Application to Single-Cell Sequencing Data Analysis.\",\"authors\":\"Jinyu Nie, Zhilong Qin, Wei Liu\",\"doi\":\"10.1002/sim.10213\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><p>The current high-dimensional linear factor models fail to account for the different types of variables, while high-dimensional nonlinear factor models often overlook the overdispersion present in mixed-type data. However, overdispersion is prevalent in practical applications, particularly in fields like biomedical and genomics studies. To address this practical demand, we propose an overdispersed generalized factor model (OverGFM) for performing high-dimensional nonlinear factor analysis on overdispersed mixed-type data. Our approach incorporates an additional error term to capture the overdispersion that cannot be accounted for by factors alone. However, this introduces significant computational challenges due to the involvement of two high-dimensional latent random matrices in the nonlinear model. To overcome these challenges, we propose a novel variational EM algorithm that integrates Laplace and Taylor approximations. This algorithm provides iterative explicit solutions for the complex variational parameters and is proven to possess excellent convergence properties. We also develop a criterion based on the singular value ratio to determine the optimal number of factors. Numerical results demonstrate the effectiveness of this criterion. Through comprehensive simulation studies, we show that OverGFM outperforms state-of-the-art methods in terms of estimation accuracy and computational efficiency. Furthermore, we demonstrate the practical merit of our method through its application to two datasets from genomics. To facilitate its usage, we have integrated the implementation of OverGFM into the R package GFM.</p>\",\"PeriodicalId\":21879,\"journal\":{\"name\":\"Statistics in Medicine\",\"volume\":\" \",\"pages\":\"4836-4849\"},\"PeriodicalIF\":1.8000,\"publicationDate\":\"2024-11-10\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Statistics in Medicine\",\"FirstCategoryId\":\"3\",\"ListUrlMain\":\"https://doi.org/10.1002/sim.10213\",\"RegionNum\":4,\"RegionCategory\":\"医学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"2024/9/5 0:00:00\",\"PubModel\":\"Epub\",\"JCR\":\"Q3\",\"JCRName\":\"MATHEMATICAL & COMPUTATIONAL BIOLOGY\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Statistics in Medicine","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1002/sim.10213","RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2024/9/5 0:00:00","PubModel":"Epub","JCR":"Q3","JCRName":"MATHEMATICAL & COMPUTATIONAL BIOLOGY","Score":null,"Total":0}
引用次数: 0
摘要
目前的高维线性因子模型无法解释不同类型的变量,而高维非线性因子模型往往忽略了混合型数据中存在的过度分散性。然而,在实际应用中,尤其是在生物医学和基因组学研究等领域,超分散现象十分普遍。针对这一实际需求,我们提出了一种超分散广义因子模型(OverGFM),用于对超分散混合型数据进行高维非线性因子分析。我们的方法包含一个额外的误差项,以捕捉仅靠因子无法解释的超分散性。然而,由于非线性模型中涉及两个高维潜在随机矩阵,这给计算带来了巨大挑战。为了克服这些挑战,我们提出了一种整合拉普拉斯和泰勒近似的新型变分电磁算法。该算法为复杂的变分参数提供了迭代显式解,并被证明具有出色的收敛特性。我们还开发了一种基于奇异值比率的标准,以确定最佳因子数。数值结果证明了这一标准的有效性。通过全面的模拟研究,我们表明 OverGFM 在估计精度和计算效率方面都优于最先进的方法。此外,我们还通过将该方法应用于两个基因组学数据集,证明了它的实用性。为了方便使用,我们将 OverGFM 的实现集成到了 R 软件包 GFM 中。
High-Dimensional Overdispersed Generalized Factor Model With Application to Single-Cell Sequencing Data Analysis.
The current high-dimensional linear factor models fail to account for the different types of variables, while high-dimensional nonlinear factor models often overlook the overdispersion present in mixed-type data. However, overdispersion is prevalent in practical applications, particularly in fields like biomedical and genomics studies. To address this practical demand, we propose an overdispersed generalized factor model (OverGFM) for performing high-dimensional nonlinear factor analysis on overdispersed mixed-type data. Our approach incorporates an additional error term to capture the overdispersion that cannot be accounted for by factors alone. However, this introduces significant computational challenges due to the involvement of two high-dimensional latent random matrices in the nonlinear model. To overcome these challenges, we propose a novel variational EM algorithm that integrates Laplace and Taylor approximations. This algorithm provides iterative explicit solutions for the complex variational parameters and is proven to possess excellent convergence properties. We also develop a criterion based on the singular value ratio to determine the optimal number of factors. Numerical results demonstrate the effectiveness of this criterion. Through comprehensive simulation studies, we show that OverGFM outperforms state-of-the-art methods in terms of estimation accuracy and computational efficiency. Furthermore, we demonstrate the practical merit of our method through its application to two datasets from genomics. To facilitate its usage, we have integrated the implementation of OverGFM into the R package GFM.
期刊介绍:
The journal aims to influence practice in medicine and its associated sciences through the publication of papers on statistical and other quantitative methods. Papers will explain new methods and demonstrate their application, preferably through a substantive, real, motivating example or a comprehensive evaluation based on an illustrative example. Alternatively, papers will report on case-studies where creative use or technical generalizations of established methodology is directed towards a substantive application. Reviews of, and tutorials on, general topics relevant to the application of statistics to medicine will also be published. The main criteria for publication are appropriateness of the statistical methods to a particular medical problem and clarity of exposition. Papers with primarily mathematical content will be excluded. The journal aims to enhance communication between statisticians, clinicians and medical researchers.