An Assessment of PC-mer's Performance in Alignment-Free Phylogenetic Tree Construction
Saeedeh Akbari Rokn Abadi, Melika Honarmand, Ali Hajialinaghi, Somayyeh Koohi
arXiv:2311.12898 (arXiv - CS - Mathematical Software), published 2023-11-21
Abstract
Background: Sequence comparison is essential in bioinformatics, serving
various purposes such as taxonomy, functional inference, and drug discovery.
The traditional method of aligning sequences for comparison is time-consuming,
especially with large datasets. To overcome this, alignment-free methods have
emerged as an alternative approach, prioritizing comparison scores over
alignment itself. These methods directly compare sequences without the need for
alignment. However, accurately representing the relationships between sequences
is a significant challenge in the design of these tools. Methods: One of the
alignment-free comparison approaches utilizes the frequency of fixed-length
substrings, known as K-mers, which serves as the foundation for many sequence
comparison methods. However, a challenge arises in these methods when
increasing the length of the substring (K), as it leads to an exponential
growth in the number of possible states. In this work, we explore the PC-mer
method, which uses a more limited set of words whose number grows more slowly
with K (2^K instead of 4^K). We conducted a comparison of sequences
and evaluated how the reduced input vector size influenced the performance of
the PC-mer method. Results: For the evaluation, we selected the Clustal Omega
method as our reference approach, alongside three alignment-free methods:
kmacs, FFP, and alfpy (word count). These methods also leverage the frequency
of K-mers. We applied all five methods (PC-mer, Clustal Omega, kmacs, FFP, and alfpy) to nine datasets for a comprehensive
analysis. The results were compared using phylogenetic trees and metrics such
as Robinson-Foulds and normalized quartet distance (nQD). Conclusion: Our
findings indicate that, unlike what is observed when the input features of other
alignment-free methods are reduced, the PC-mer method remains competitive
with the aforementioned methods, especially when the input
sequences vary widely.
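
The size argument in the Methods part can be illustrated with a minimal sketch. It builds a normalized K-mer frequency vector over the four-letter DNA alphabet (4^K entries) and over a two-letter recoding (2^K entries). The purine/pyrimidine recoding, the function names `kmer_vector`, `reduced_vector`, and `euclidean`, and the use of Euclidean distance are assumptions chosen for illustration; they are not taken from the paper and are not necessarily the encoding or distance that PC-mer uses.

```python
from collections import Counter
from itertools import product
import math

def kmer_vector(seq, k, alphabet="ACGT"):
    """Normalized frequency of every length-k word over `alphabet`.

    The vector has len(alphabet) ** k entries, so it grows as 4^k for DNA.
    Windows containing symbols outside the alphabet are ignored.
    """
    words = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[w] for w in words)
    return [counts[w] / total if total else 0.0 for w in words]

# Hypothetical two-class recoding (assumption, for illustration only):
# purines A, G -> R; pyrimidines C, T -> Y.
RECODE = str.maketrans("ACGT", "RYRY")

def reduced_vector(seq, k):
    """Frequency vector over the recoded two-letter alphabet (2^k entries)."""
    return kmer_vector(seq.translate(RECODE), k, alphabet="RY")

def euclidean(u, v):
    """Euclidean distance between two equal-length frequency vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

if __name__ == "__main__":
    a, b = "ACGTACGTTGCA", "ACGTTCGATGCA"  # toy sequences, not real data
    for k in (2, 4, 6):
        full = len(kmer_vector(a, k))
        small = len(reduced_vector(a, k))
        print(f"k={k}: full vector {full} entries, reduced vector {small} entries")
    print("full distance:   ", euclidean(kmer_vector(a, 3), kmer_vector(b, 3)))
    print("reduced distance:", euclidean(reduced_vector(a, 3), reduced_vector(b, 3)))
```

A pairwise distance matrix computed this way could then be passed to a tree-building method such as neighbor joining, and the resulting tree compared against a reference tree with Robinson-Foulds or nQD; that pipeline is only outlined here, not implemented.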