Thu Kim Le, L. Vinh, Dong Do Due, Bui Ngoc Thang, Thao Thi Phuong Nguyen
iK-means: an improvement of the iterative k-means partitioning algorithm
2020 12th International Conference on Knowledge and Systems Engineering (KSE), published 2020-11-12
DOI: https://doi.org/10.1109/KSE50997.2020.9287221
Abstract
Evolutionary processes vary among the sites of an alignment, and this variation strongly affects the accuracy of phylogenetic tree reconstruction. A proper strategy is to partition the alignment into sub-alignments such that the evolutionary processes at sites within the same sub-alignment are highly similar. Gene features might be used as reasonable indicators to partition an alignment. However, gene feature information is not always available or efficient. Computational partitioning methods such as iterative k-means have been proposed to automatically partition sites into groups based on the similarity of their evolutionary rates. Despite obtaining compelling results in terms of AICc and BIC measurements, the k-means method forms a group consisting of all and only the invariant sites, which results in biased or incorrect phylogenetic trees. In this paper, we improve the k-means algorithm by re-classifying invariant sites into different sub-alignments based on their likelihood values. Experimental results on simulated and empirical DNA datasets show that the new method, called iK-means, overcomes this pitfall of the k-means method and consequently helps improve the quality of the resulting sub-alignments. We recommend using the iK-means method to improve the accuracy of phylogenetic tree inference.
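
The re-classification idea can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the exact procedure, so the sketch assumes that a per-site log-likelihood under each sub-alignment's model has already been computed elsewhere, and that each invariant site is simply moved to the sub-alignment with the highest value. The names `is_invariant`, `reassign_invariant_sites`, and `site_loglik` are hypothetical.

```python
from typing import Dict, List, Sequence


def is_invariant(column: str) -> bool:
    """Return True when the column holds at most one distinct nucleotide.

    Assumption: gaps and ambiguity codes are ignored, so an all-gap
    column is also treated as invariant.
    """
    bases = {c for c in column.upper() if c in "ACGT"}
    return len(bases) <= 1


def reassign_invariant_sites(
    columns: Sequence[str],
    site_loglik: Sequence[Sequence[float]],
) -> Dict[int, List[int]]:
    """Assign each invariant site to the group with the highest log-likelihood.

    columns[i] is alignment column i (one character per taxon).
    site_loglik[i][g] is a hypothetical, precomputed log-likelihood of
    site i under the model fitted to group g. Variable sites are skipped
    and keep whatever group they already have.
    """
    assignment: Dict[int, List[int]] = {}
    for i, column in enumerate(columns):
        if not is_invariant(column):
            continue
        scores = site_loglik[i]
        # Index of the group whose model explains this site best.
        best = max(range(len(scores)), key=lambda g: scores[g])
        assignment.setdefault(best, []).append(i)
    return assignment


# Toy input: four sites, two candidate groups.
columns = ["AAAA", "ACGT", "CCCC", "AAGA"]
site_loglik = [[-1.2, -3.4], [-5.0, -4.0], [-2.9, -2.1], [-4.4, -4.6]]
result = reassign_invariant_sites(columns, site_loglik)
```

In this toy input only sites 0 and 2 are invariant, so `result` maps group 0 to site 0 and group 1 to site 2. In practice the log-likelihoods would come from a phylogenetic inference tool rather than being supplied by hand.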