iK-means: an improvement of the iterative k-means partitioning algorithm

2020 12th International Conference on Knowledge and Systems Engineering (KSE). Pub Date: 2020-11-12. DOI: 10.1109/KSE50997.2020.9287221

Thu Kim Le, L. Vinh, Dong Do Due, Bui Ngoc Thang, Thao Thi Phuong Nguyen


Citations: 0

Abstract

Evolutionary processes vary among the sites of an alignment, and this variation strongly affects the accuracy of phylogenetic tree reconstruction. A proper strategy is to partition the alignment into sub-alignments such that the evolutionary processes at sites within the same sub-alignment are highly similar. Gene features can serve as reasonable indicators for partitioning an alignment. However, gene feature information is not always available or efficient. Computational partitioning methods such as iterative k-means have been proposed to automatically partition sites into groups based on the similarity of their evolutionary rates. Despite obtaining compelling results in terms of AICc and BIC scores, the k-means method forms a group consisting of all and only the invariant sites, which results in biased or incorrect phylogenetic trees. In this paper, we improve the k-means algorithm by re-classifying invariant sites into different sub-alignments based on their likelihood values. Experimental results on simulated and empirical DNA datasets show that the new method, called iK-means, overcomes this pitfall of the k-means method and consequently helps improve the quality of the resulting sub-alignments. We recommend using the iK-means method to improve the accuracy of phylogenetic tree inference.
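The re-classification step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact procedure, so this is an assumption-laden illustration rather than the authors' implementation: the function names, the input layout, and the rule "move each invariant site to the sub-alignment under which its log-likelihood is highest" are all hypothetical. The per-site log-likelihoods are taken as given inputs; in practice they would have to come from a phylogenetic likelihood program.

```python
from typing import Dict, List, Sequence


def is_invariant(column: str) -> bool:
    """Return True when every sequence shows the same character at this site."""
    return len(set(column)) == 1


def reassign_invariant_sites(
    columns: Sequence[str],
    labels: Sequence[int],
    site_log_likelihoods: Sequence[Dict[int, float]],
) -> List[int]:
    """Move each invariant site to the candidate sub-alignment that scores best.

    columns: one string per alignment site (one character per sequence).
    labels: the sub-alignment index assigned to each site by k-means.
    site_log_likelihoods: for each site, a hypothetical mapping from a
        candidate sub-alignment index to the site's log-likelihood under
        that sub-alignment's model. The all-invariant group is assumed
        to be excluded from the candidates.
    """
    new_labels = list(labels)
    for i, column in enumerate(columns):
        if is_invariant(column):
            scores = site_log_likelihoods[i]
            # Assumed rule: pick the sub-alignment with the highest log-likelihood.
            new_labels[i] = max(scores, key=scores.get)
    return new_labels


# Toy example: sites 0 and 3 are invariant and start in their own group (label 2).
columns = ["AAAA", "ACGT", "AAGG", "CCCC"]
labels = [2, 0, 1, 2]
site_log_likelihoods = [
    {0: -1.2, 1: -0.8},
    {0: -5.0, 1: -6.0},
    {0: -4.0, 1: -3.5},
    {0: -0.5, 1: -0.9},
]
new_labels = reassign_invariant_sites(columns, labels, site_log_likelihoods)
print(new_labels)  # [1, 0, 1, 0]
```

In this toy run the invariant-only group (label 2) disappears, because both of its sites are redistributed among the variable-site sub-alignments. That is the outcome the abstract describes as the goal of iK-means.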

