Concerning the NJ algorithm and its unweighted version, UNJ
O. Gascuel
Mathematical Hierarchies and Biology. DOI: 10.1090/dimacs/037/09 (https://doi.org/10.1090/dimacs/037/09)
Citations: 137
Abstract
In this paper we will present UNJ, an unweighted version of the NJ algorithm (Saitou and Nei 1987; Studier and Keppler 1988). We will demonstrate that UNJ is well suited when the data are of the type $(\delta_{ij}) = (d_{ij}) + (\varepsilon_{ij})$, where $(d_{ij})$ is a tree distance and the $\varepsilon_{ij}$ are independent and identically distributed noise variables. Simulations confirm this theory. On a more general level, we will study the three main components of the agglomerative approach as applied to the reconstruction of tree distances. (i) We will demonstrate that the criterion used by NJ and UNJ to select the pair to be agglomerated retains its meaning whatever the variances and covariances of the $\delta_{ij}$ estimates. We will also provide a new proof of the correctness of this criterion, based on an interpretation in terms of acentrality proposed by Mirkin (1996). (ii) Using the results of Vach (1989), for which we will provide a simple new proof, we propose an analytical formula that gives the correct least-squares estimate of edge lengths in $O(n)$ time, where $n$ is the number of objects. (iii) We will provide a class of admissible reduction formulae that are guaranteed to find the true tree when the data are additive. Among these formulae, we propose to choose the minimum-variance reduction, so that at each step the estimates used to choose the pair to be agglomerated are as reliable as possible. We will present the general solution and apply it to the particular data model retained here.
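The three components named in the abstract (data model, selection criterion, reduction) can be illustrated with a minimal Python sketch. It is not the paper's UNJ: the selection criterion is the standard NJ form $Q_{ij} = (n-2)\,\delta_{ij} - \sum_k \delta_{ik} - \sum_k \delta_{jk}$, the edge lengths use the classic NJ estimate rather than the least-squares formula the abstract refers to, and the weight `lam` is a hypothetical parameter standing in for a family of weighted reductions (`lam = 0.5` gives the classic NJ reduction). The function names, the Gaussian noise and the five-leaf example tree are assumptions made for illustration.

```python
import random


def noisy_distances(d, sigma, rng):
    """Data model from the abstract: delta_ij = d_ij + eps_ij with i.i.d. noise.

    Gaussian noise is an assumption; the abstract only requires i.i.d. noise.
    """
    n = len(d)
    delta = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            delta[i][j] = delta[j][i] = d[i][j] + rng.gauss(0.0, sigma)
    return delta


def nj_select(delta):
    """Return the pair (i, j), i < j, minimising the standard NJ criterion."""
    n = len(delta)
    row = [sum(r) for r in delta]
    best, best_q = None, float("inf")
    for i in range(n):
        for j in range(i + 1, n):
            q = (n - 2) * delta[i][j] - row[i] - row[j]
            if q < best_q:
                best, best_q = (i, j), q
    return best


def reduce_pair(delta, i, j, lam=0.5):
    """Merge i and j into a new node u, appended as the last row and column.

    Edge lengths use the classic NJ estimate; lam weights the two estimates
    of the distance from u to each remaining node (hypothetical parameter).
    """
    n = len(delta)
    r_i, r_j = sum(delta[i]), sum(delta[j])
    d_iu = 0.5 * delta[i][j] + (r_i - r_j) / (2.0 * (n - 2))
    d_ju = delta[i][j] - d_iu
    keep = [k for k in range(n) if k != i and k != j]
    to_u = [lam * (delta[i][k] - d_iu) + (1.0 - lam) * (delta[j][k] - d_ju)
            for k in keep]
    reduced = [[delta[a][b] for b in keep] + [to_u[x]]
               for x, a in enumerate(keep)]
    reduced.append(to_u + [0.0])
    return reduced, d_iu, d_ju


# Additive distances of a made-up five-leaf tree with cherries (0, 1) and (3, 4).
D = [
    [0, 5, 7, 6, 10],
    [5, 0, 8, 7, 11],
    [7, 8, 0, 7, 11],
    [6, 7, 7, 0, 6],
    [10, 11, 11, 6, 0],
]

pair = nj_select(D)
reduced, d_iu, d_ju = reduce_pair(D, *pair)
print(pair, d_iu, d_ju)  # (3, 4) 1.0 5.0
```

On additive input such as `D`, the two estimates `delta[i][k] - d_iu` and `delta[j][k] - d_ju` coincide, so every value of `lam` returns the same reduced matrix. The choice of weight only matters once noise is added, for example with `noisy_distances(D, 0.1, random.Random(0))`, which is the setting in which the abstract argues for a minimum-variance choice.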