An Assessment of PC-mer's Performance in Alignment-Free Phylogenetic Tree Construction

arXiv - CS - Mathematical Software. Pub Date: 2023-11-21. DOI: arxiv-2311.12898

Saeedeh Akbari Rokn Abadi, Melika Honarmand, Ali Hajialinaghi, Somayyeh Koohi



Abstract

Background: Sequence comparison is essential in bioinformatics, serving purposes such as taxonomy, functional inference, and drug discovery. The traditional approach of aligning sequences before comparing them is time-consuming, especially with large datasets. Alignment-free methods have emerged as an alternative: they compare sequences directly and prioritize the comparison score over the alignment itself. However, accurately representing the relationships between sequences remains a significant challenge in the design of these tools.

Methods: One family of alignment-free approaches uses the frequencies of fixed-length substrings, known as K-mers, which serve as the foundation for many sequence comparison methods. A challenge arises as the substring length K increases, because the number of possible K-mers grows exponentially. In this work, we explore the PC-mer method, which uses a more limited set of words whose number grows as 2^K rather than 4^K. We compared sequences and evaluated how the reduced input vector size influenced the performance of the PC-mer method.

Results: For the evaluation, we selected Clustal Omega as our reference approach, alongside three alignment-free methods: kmacs, FFP, and alfpy (word count). These methods also rely on K-mer frequencies. We applied all five methods (the reference, the three alignment-free tools, and PC-mer) to 9 datasets. The results were compared using phylogenetic trees and metrics such as the Robinson-Foulds distance and the normalized quartet distance (nQD).

Conclusion: Our findings indicate that, whereas reducing the input features degrades other alignment-free methods, the PC-mer method remains competitive with the aforementioned methods, especially when the input sequences are highly diverse.
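The growth in feature-vector size described in the Methods can be made concrete with a small sketch. The Python below builds a normalized K-mer frequency vector over the four-letter DNA alphabet (4^K entries) and over a two-letter recoding (2^K entries). The purine/pyrimidine recoding, the helper names `kmer_vector` and `euclidean`, and the toy sequences are assumptions chosen for illustration; this is not the published PC-mer encoding, only a way to see how a binary alphabet shrinks the input vector.

```python
from collections import Counter
from itertools import product
import math


def kmer_vector(seq, k, alphabet="ACGT"):
    """Return normalized frequencies of all len(alphabet)**k words in seq."""
    words = ["".join(p) for p in product(alphabet, repeat=k)]
    # Count every overlapping window of length k.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[w] for w in words)
    return [counts[w] / total if total else 0.0 for w in words]


def euclidean(u, v):
    """Euclidean distance between two equal-length frequency vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Assumed two-letter recoding: purines (A, G) -> R, pyrimidines (C, T) -> Y.
# Illustrative only; the actual PC-mer encoding may differ.
TO_RY = str.maketrans("ACGT", "RYRY")

# Toy sequences, invented for this example.
seq_a = "ACGTACGTGACG"
seq_b = "ACGTTCGTGACC"
k = 3

full_a = kmer_vector(seq_a, k)
full_b = kmer_vector(seq_b, k)
reduced_a = kmer_vector(seq_a.translate(TO_RY), k, alphabet="RY")
reduced_b = kmer_vector(seq_b.translate(TO_RY), k, alphabet="RY")

print(len(full_a), len(reduced_a))  # 64 8
print(euclidean(full_a, full_b), euclidean(reduced_a, reduced_b))
```

Pairwise distances computed this way could then feed a distance-based tree builder; which distance and tree method the authors actually used is not stated in the abstract.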


