Testing for the existence of clusters.

Pub Date : 2009-07-01

Claudio Fuentes, George Casella

Volume 33(2), pages 115-157. Journal Article.

Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3184008/pdf/nihms238157.pdf

Abstract

Detecting and characterizing the clusters present in a sample has long been a concern for researchers in many fields. In particular, experimenters have repeatedly asked whether the clusters they find are statistically significant. The question arose again recently in a maize genetics study, where determining the significance of clusters is a crucial first step in identifying a genome-wide collection of mutants that may affect kernel composition.

Although several efforts have been made in this direction, little has been done to develop an actual hypothesis test for the significance of clusters. In this paper, we propose a new methodology for testing H_0: κ = 1 vs. H_1: κ = k, where κ denotes the number of clusters present in a population. Our procedure, based on Bayesian tools, yields closed-form expressions for the posterior probabilities corresponding to the null hypothesis. We then calibrate these results by estimating the frequentist null distribution of the posterior probabilities, which gives the p-values associated with the observed posterior probabilities. In most cases, evaluating the posterior probabilities is computationally intensive, and several algorithms have been discussed in the literature. Here, we propose a simple estimation procedure, based on MCMC techniques, that permits an efficient and easily implemented evaluation of the test. Finally, we present simulation studies that support our conclusions, and we apply our method to NIR spectroscopy data from the genetic study that motivated this work.
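The calibration step described above (estimating the null distribution of a statistic by simulation and reading off a p-value) can be illustrated with a minimal Python sketch. This is not the paper's procedure: the function names `split_ratio` and `calibrated_p_value` are hypothetical, the statistic is a simple stand-in for the posterior probability of the null hypothesis, and the null model is assumed to be a single univariate normal population.

```python
import random


def split_ratio(xs):
    """Stand-in statistic (NOT the paper's posterior probability).

    Returns the smallest within-group sum of squares over all two-group
    splits of the sorted data, divided by the total sum of squares.
    Small values suggest two well-separated groups.
    """
    xs = sorted(xs)
    n = len(xs)
    mean = sum(xs) / n
    total = sum((x - mean) ** 2 for x in xs)
    best = total
    for i in range(1, n):
        left, right = xs[:i], xs[i:]
        mean_left = sum(left) / len(left)
        mean_right = sum(right) / len(right)
        within = (sum((x - mean_left) ** 2 for x in left)
                  + sum((x - mean_right) ** 2 for x in right))
        best = min(best, within)
    return best / total


def calibrated_p_value(data, statistic, n_sim=500, seed=0):
    """Monte Carlo p-value of `statistic` under an assumed one-cluster null.

    Null data sets are drawn from N(0, 1). Because `split_ratio` does not
    change under shifts and rescalings of the data, this covers a normal
    null with any mean and variance. Small statistic values count as
    evidence against the null.
    """
    rng = random.Random(seed)
    n = len(data)
    observed = statistic(data)
    null_stats = [
        statistic([rng.gauss(0.0, 1.0) for _ in range(n)])
        for _ in range(n_sim)
    ]
    at_least_as_extreme = sum(1 for s in null_stats if s <= observed)
    # Add-one correction keeps the estimated p-value strictly positive.
    return (at_least_as_extreme + 1) / (n_sim + 1)


# Usage: 15 points around -5 and 15 points around +5 (illustrative data).
data_rng = random.Random(1)
clustered = ([data_rng.gauss(-5.0, 1.0) for _ in range(15)]
             + [data_rng.gauss(5.0, 1.0) for _ in range(15)])
p_clustered = calibrated_p_value(clustered, split_ratio)
```

In the setting of the abstract, the stand-in statistic would be replaced by the posterior probability of H_0, with small observed values leading to small p-values in the same way.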
