Estimation and model selection for finite mixtures of Tukey's g-&-h distributions.

IF 1.6 | Region 2 (Mathematics) | JCR Q2, COMPUTER SCIENCE, THEORY & METHODS

Statistics and Computing | Pub Date: 2025-01-01 | Epub Date: 2025-03-15 | DOI: 10.1007/s11222-025-10596-9

Tingting Zhan, Misung Yi, Amy R Peck, Hallgeir Rui, Inna Chervoneva

Statistics and Computing, vol. 35(3), p. 67. Journal Article. Full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11910465/pdf/

Citations: 0

Abstract


Estimation and model selection for finite mixtures of Tukey's g-&-h distributions.


A finite mixture of distributions is a popular statistical model, which is especially meaningful when the population of interest may include distinct subpopulations. This work is motivated by analysis of protein expression levels quantified using immunofluorescence immunohistochemistry assays of human tissues. The distributions of cellular protein expression levels in a tissue often exhibit multimodality, skewness and heavy tails, but there is a substantial variability between distributions in different tissues from different subjects, while some of these mixture distributions include components consistent with the assumption of a normal distribution. To accommodate such diversity, we propose a mixture of 4-parameter Tukey's g-&-h distributions for fitting finite mixtures with both Gaussian and non-Gaussian components. Tukey's g-&-h distribution is a flexible model that allows variable degree of skewness and kurtosis in mixture components, including normal distribution as a particular case. Since the likelihood of the Tukey's g-&-h mixtures does not have a closed analytical form, we propose a quantile least Mahalanobis distance (QLMD) estimator for parameters of such mixtures. QLMD is an indirect estimator minimizing the Mahalanobis distance between the sample and model-based quantiles, and its asymptotic properties follow from the general theory of indirect estimation. We have developed a stepwise algorithm to select a parsimonious Tukey's g-&-h mixture model and implemented all proposed methods in the R package QuantileGH available on CRAN. A simulation study was conducted to evaluate performance of the Tukey's g-&-h mixtures and compare to performance of mixtures of skew-normal or skew-t distributions. The Tukey's g-&-h mixtures were applied to model cellular expressions of Cyclin D1 protein in breast cancer tissues, and resulting parameter estimates evaluated as predictors of progression-free survival.
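To make the idea of matching sample and model quantiles concrete, here is a minimal Python sketch for a single g-and-h component. It uses the standard textbook form of Tukey's transformation, X = A + B * ((exp(g*Z) - 1) / g) * exp(h*Z^2 / 2) with Z standard normal, which reduces to a normal distribution when g = h = 0. All function names are hypothetical and are not the QuantileGH API. As a simplifying assumption, the distance uses diagonal weights (identity by default) rather than a full inverse covariance matrix, so it is only a special case of a Mahalanobis distance and not the paper's QLMD estimator; mixtures would additionally require numerically inverting the mixture CDF.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def gh_transform(z, A=0.0, B=1.0, g=0.0, h=0.0):
    """Map a standard normal value z to a Tukey g-and-h value."""
    # Skewness part: (exp(g*z) - 1) / g, with the limit z as g -> 0.
    skew = z if g == 0.0 else math.expm1(g * z) / g
    # Tail part: exp(h * z^2 / 2) stretches both tails when h > 0.
    return A + B * skew * math.exp(h * z * z / 2.0)


def gh_quantile(p, A=0.0, B=1.0, g=0.0, h=0.0):
    """Quantile of one g-and-h component (transform is monotone for h >= 0)."""
    return gh_transform(_STD_NORMAL.inv_cdf(p), A, B, g, h)


def sample_quantile(sorted_x, p):
    """Linearly interpolated sample quantile of pre-sorted data."""
    pos = p * (len(sorted_x) - 1)
    lo = math.floor(pos)
    hi = min(lo + 1, len(sorted_x) - 1)
    return sorted_x[lo] + (pos - lo) * (sorted_x[hi] - sorted_x[lo])


def quantile_distance(sorted_x, probs, params, weights=None):
    """Weighted squared distance between sample and model quantiles.

    Assumption: diagonal weights only; a full Mahalanobis distance would
    use the inverse covariance matrix of the sample quantiles instead.
    """
    weights = weights or [1.0] * len(probs)
    total = 0.0
    for p, w in zip(probs, weights):
        diff = sample_quantile(sorted_x, p) - gh_quantile(p, **params)
        total += w * diff * diff
    return total


# Usage: deterministic "data" placed on the quantiles of a skewed,
# heavy-tailed component, then scored under two candidate parameter sets.
probs = [i / 10 for i in range(1, 10)]
data = sorted(gh_quantile((i + 0.5) / 200, g=0.5, h=0.1) for i in range(200))
d_true = quantile_distance(data, probs, {"g": 0.5, "h": 0.1})
d_normal = quantile_distance(data, probs, {"g": 0.0, "h": 0.0})
print(d_true < d_normal)
```

In this sketch the generating parameters give a much smaller distance than the normal special case, which is the signal an optimizer would follow when searching over (A, B, g, h).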


Source journal

Statistics and Computing (Mathematics; Computer Science: Theory & Methods)

CiteScore

3.20

Self-citation rate

4.50%

Articles published

Review time

6-12 weeks

About the journal: Statistics and Computing is a bi-monthly refereed journal which publishes papers covering the range of the interface between the statistical and computing sciences. In particular, it addresses the use of statistical concepts in computing science, for example in machine learning, computer vision and data analytics, as well as the use of computers in data modelling, prediction and analysis. Specific topics which are covered include: techniques for evaluating analytically intractable problems such as bootstrap resampling, Markov chain Monte Carlo, sequential Monte Carlo, approximate Bayesian computation, search and optimization methods, stochastic simulation and Monte Carlo, graphics, computer environments, statistical approaches to software errors, information retrieval, machine learning, statistics of databases and database technology, huge data sets and big data analytics, computer algebra, graphical models, image processing, tomography, inverse problems and uncertainty quantification. In addition, the journal contains original research reports, authoritative review papers, discussed papers, and occasional special issues on particular topics or carrying proceedings of relevant conferences. Statistics and Computing also publishes book review and software review sections.