Lawrence Buckingham, Timothy Chappell, J. Hogan, S. Geva
Similarity Projection: A Geometric Measure for Comparison of Biological Sequences
2017 IEEE 13th International Conference on e-Science (e-Science), published 2017-10-01
DOI: 10.1109/eScience.2017.46 (https://doi.org/10.1109/eScience.2017.46)
Citations: 2
Abstract
Sequence comparison is a fundamental task in computational biology, traditionally dominated by alignment-based methods such as the Smith-Waterman and Needleman-Wunsch algorithms, or by alignment-based heuristics such as BLAST, the ubiquitous Basic Local Alignment Search Tool. For more than a decade, researchers have examined a range of alignment-free alternatives to these approaches, citing concerns over scalability in the era of Next Generation Sequencing, the emergence of petascale sequence archives, and a lack of robustness of alignment methods in the face of structural sequence rearrangements. While some of these approaches have proven successful for particular tasks, many continue to exhibit a marked decline in sensitivity as closely related sequence sets diverge. Avoiding the alignment step allows the methods to scale to the challenges of modern sequence collections, but only at the cost of noticeably inferior search. In this paper we re-examine the problem of similarity measures for alignment-free sequence comparison, and introduce a new method which we term Similarity Projection. Similarity Projection offers markedly enhanced sensitivity – comparable to alignment-based methods – while retaining the scalability characteristic of alignment-free approaches. As with earlier alignment-free approaches, we rely on collections of k-mers (overlapping substrings of length k of the molecular sequence, collected without reference to position), but similarity is computed using variants of the Hausdorff set distance, allowing the score to reflect more effectively those components which match, while lessening the impact of those which do not. Formally, the algorithm generates a large mutual similarity matrix between sequence pairs based on their component fragments; successive reduction steps yield a final score over the sequences.
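The k-mer collection, mutual similarity matrix, and reduction steps described above can be illustrated with a minimal Python sketch. Everything specific in it is an assumption made for illustration and is not taken from the paper: the function names, the toy k-mer score (a count of identical positions, where a real implementation would more likely use a substitution matrix), and the particular Hausdorff-style variant (best match per k-mer, averaged in each direction, then the minimum of the two directions).

```python
from typing import List


def kmers(seq: str, k: int) -> List[str]:
    """All overlapping substrings of length k; positions are discarded."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def kmer_similarity(a: str, b: str) -> int:
    """Toy k-mer score (assumption): number of identical positions."""
    return sum(x == y for x, y in zip(a, b))


def similarity_matrix(xs: List[str], ys: List[str]) -> List[List[int]]:
    """Mutual similarity matrix: rows are k-mers of A, columns k-mers of B."""
    return [[kmer_similarity(x, y) for y in ys] for x in xs]


def hausdorff_style_score(seq_a: str, seq_b: str, k: int = 3) -> float:
    """Reduce the matrix to one score in [0, 1] (hypothetical variant)."""
    xs, ys = kmers(seq_a, k), kmers(seq_b, k)
    if not xs or not ys:
        return 0.0  # a sequence shorter than k has no k-mers
    m = similarity_matrix(xs, ys)
    row_best = [max(row) for row in m]        # best partner for each k-mer of A
    col_best = [max(col) for col in zip(*m)]  # best partner for each k-mer of B
    a_to_b = sum(row_best) / len(row_best)    # first reduction: average of bests
    b_to_a = sum(col_best) / len(col_best)
    return min(a_to_b, b_to_a) / k            # second reduction: symmetrise


if __name__ == "__main__":
    print(hausdorff_style_score("ACDEFG", "ACDEFG"))
    print(hausdorff_style_score("ACDEFG", "WWWWWW"))
```

Taking the best match per k-mer is what lets matching fragments dominate the score while unmatched fragments contribute little, which is the behaviour the abstract attributes to the Hausdorff-based measure.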
However, only a small fraction of these underlying comparisons need be performed, and by use of an approximate scheme based on vector quantization, we are able to achieve an order of magnitude improvement in execution time over the naive approach. We evaluate the approach on two large protein collections obtained from UniProtKB, showing that Similarity Projection achieves accuracy rivalling, and at times clearly exceeding, that of BLAST, while exhibiting markedly superior execution speed.
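One way to picture how vector quantization can cut down the number of k-mer comparisons is sketched below in Python. This is a rough illustration under stated assumptions, not the authors' scheme: the abstract does not say how the codebook is built or used, so the hand-picked `codebook`, the function names, and the rule "compare only k-mers that map to the same codeword" are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def kmers(seq: str, k: int) -> List[str]:
    """All overlapping substrings of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def sim(a: str, b: str) -> int:
    """Toy k-mer score (assumption): number of identical positions."""
    return sum(x == y for x, y in zip(a, b))


def quantize(kmer: str, codebook: List[str]) -> int:
    """Index of the most similar codeword; the first codeword wins ties."""
    return max(range(len(codebook)), key=lambda i: sim(kmer, codebook[i]))


def bucket(kmer_list: List[str], codebook: List[str]) -> Dict[int, List[str]]:
    """Group k-mers by the codeword they quantize to."""
    buckets: Dict[int, List[str]] = defaultdict(list)
    for km in kmer_list:
        buckets[quantize(km, codebook)].append(km)
    return buckets


def approx_directed_score(seq_a: str, seq_b: str,
                          codebook: List[str], k: int = 3) -> float:
    """Average best match from A to B, searching only same-codeword k-mers."""
    xs = kmers(seq_a, k)
    if not xs:
        return 0.0
    b_buckets = bucket(kmers(seq_b, k), codebook)
    total = 0
    for x in xs:
        # Only k-mers of B sharing x's codeword are compared with x.
        candidates = b_buckets.get(quantize(x, codebook), [])
        total += max((sim(x, y) for y in candidates), default=0)
    return total / (len(xs) * k)


if __name__ == "__main__":
    codebook = ["ACD", "WWW"]  # hypothetical, hand-picked codewords
    print(approx_directed_score("ACDEFG", "ACDEFG", codebook))
```

Because each k-mer is compared only against candidates in its own bucket, the result can never exceed the exhaustive best-match score; the trade-off between speed and accuracy then depends on how the codebook partitions k-mer space.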