{"title":"一个更准确和有效的全基因组系统发育","authors":"P. Chan, T. Lam, S. Yiu","doi":"10.1142/9781860947292_0037","DOIUrl":null,"url":null,"abstract":"To reconstruct a phylogeny for a given set of species, most of the previous approaches are based on the similarity information derived from a subset of conserved regions (or genes) in the corresponding genomes. In some cases, the regions chosen may not reflect the evolutionary history of the species and may be too restricted to differentiate the species. It is generally believed that the inference could be more accurate if whole genomes are being considered. The best existing solution that makes use of complete genomes was proposed by Henz et al.13 They can construct a phylogeny for 91 prokaryotic genomes in 170 CPU hours with an accuracy of about 70% (based on the measurement of non-trivial splits) while other approaches that use whole genomes can only deal with no more than 20 species. Note that Henz et al. measure the distance between the species using BLASTN which is not primarily designed for whole genome alignment. Also, their approach is not scalable, for example, it probably takes over 1000 CPU hours to construct a phylogeny for all 230 prokaryotic genomes published by NCBI. In addition, we found that non-trivial splits is only a rough indicator of the accuracy of the phylogeny. In this paper, we propose the followings. (1) To evaluate the quality of a phylogeny with respect to a model answer, we suggest to use the concept of the maximum agreement subtree as it can capture the structure of the phylogeny. (2) We propose to use whole genome alignment software (such as MUMmer) to measure the distances between the species and derive an efficient approach to generate these distances. From the experiments on real data sets, we found that our approach is more accurate and more scalable than Henz et al.’s approach. We can construct a phylogenetic tree for the same set of 91 genomes with an accuracy more than 20% higher (with respect to both evaluation measures) in 2 CPU hours (more than 80 times faster than their approach). Also, our approach is scalable and can construct a phylogeny for 230 prokaryotic genomes with accuracy as high as 85% in only 9.5 CPU hours.","PeriodicalId":74513,"journal":{"name":"Proceedings of the ... Asia-Pacific bioinformatics conference","volume":"15 1","pages":"337-352"},"PeriodicalIF":0.0000,"publicationDate":"2005-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"6","resultStr":"{\"title\":\"A More Accurate and Efficient Whole Genome Phylogeny\",\"authors\":\"P. Chan, T. Lam, S. Yiu\",\"doi\":\"10.1142/9781860947292_0037\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"To reconstruct a phylogeny for a given set of species, most of the previous approaches are based on the similarity information derived from a subset of conserved regions (or genes) in the corresponding genomes. In some cases, the regions chosen may not reflect the evolutionary history of the species and may be too restricted to differentiate the species. It is generally believed that the inference could be more accurate if whole genomes are being considered. The best existing solution that makes use of complete genomes was proposed by Henz et al.13 They can construct a phylogeny for 91 prokaryotic genomes in 170 CPU hours with an accuracy of about 70% (based on the measurement of non-trivial splits) while other approaches that use whole genomes can only deal with no more than 20 species. Note that Henz et al. measure the distance between the species using BLASTN which is not primarily designed for whole genome alignment. Also, their approach is not scalable, for example, it probably takes over 1000 CPU hours to construct a phylogeny for all 230 prokaryotic genomes published by NCBI. In addition, we found that non-trivial splits is only a rough indicator of the accuracy of the phylogeny. In this paper, we propose the followings. (1) To evaluate the quality of a phylogeny with respect to a model answer, we suggest to use the concept of the maximum agreement subtree as it can capture the structure of the phylogeny. (2) We propose to use whole genome alignment software (such as MUMmer) to measure the distances between the species and derive an efficient approach to generate these distances. From the experiments on real data sets, we found that our approach is more accurate and more scalable than Henz et al.’s approach. We can construct a phylogenetic tree for the same set of 91 genomes with an accuracy more than 20% higher (with respect to both evaluation measures) in 2 CPU hours (more than 80 times faster than their approach). Also, our approach is scalable and can construct a phylogeny for 230 prokaryotic genomes with accuracy as high as 85% in only 9.5 CPU hours.\",\"PeriodicalId\":74513,\"journal\":{\"name\":\"Proceedings of the ... Asia-Pacific bioinformatics conference\",\"volume\":\"15 1\",\"pages\":\"337-352\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2005-12-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"6\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the ... Asia-Pacific bioinformatics conference\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1142/9781860947292_0037\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the ... Asia-Pacific bioinformatics conference","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1142/9781860947292_0037","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
A More Accurate and Efficient Whole Genome Phylogeny
To reconstruct a phylogeny for a given set of species, most of the previous approaches are based on the similarity information derived from a subset of conserved regions (or genes) in the corresponding genomes. In some cases, the regions chosen may not reflect the evolutionary history of the species and may be too restricted to differentiate the species. It is generally believed that the inference could be more accurate if whole genomes are being considered. The best existing solution that makes use of complete genomes was proposed by Henz et al.13 They can construct a phylogeny for 91 prokaryotic genomes in 170 CPU hours with an accuracy of about 70% (based on the measurement of non-trivial splits) while other approaches that use whole genomes can only deal with no more than 20 species. Note that Henz et al. measure the distance between the species using BLASTN which is not primarily designed for whole genome alignment. Also, their approach is not scalable, for example, it probably takes over 1000 CPU hours to construct a phylogeny for all 230 prokaryotic genomes published by NCBI. In addition, we found that non-trivial splits is only a rough indicator of the accuracy of the phylogeny. In this paper, we propose the followings. (1) To evaluate the quality of a phylogeny with respect to a model answer, we suggest to use the concept of the maximum agreement subtree as it can capture the structure of the phylogeny. (2) We propose to use whole genome alignment software (such as MUMmer) to measure the distances between the species and derive an efficient approach to generate these distances. From the experiments on real data sets, we found that our approach is more accurate and more scalable than Henz et al.’s approach. We can construct a phylogenetic tree for the same set of 91 genomes with an accuracy more than 20% higher (with respect to both evaluation measures) in 2 CPU hours (more than 80 times faster than their approach). Also, our approach is scalable and can construct a phylogeny for 230 prokaryotic genomes with accuracy as high as 85% in only 9.5 CPU hours.