Similarity Projection: A Geometric Measure for Comparison of Biological Sequences

2017 IEEE 13th International Conference on e-Science (e-Science). Pub Date: 2017-10-01. DOI: 10.1109/eScience.2017.46

Lawrence Buckingham, Timothy Chappell, J. Hogan, S. Geva


Citations: 2

Abstract

Sequence comparison is a fundamental task in computational biology, traditionally dominated by alignment-based methods such as the Smith-Waterman and Needleman-Wunsch algorithms, or by alignment-based heuristics such as BLAST, the ubiquitous Basic Local Alignment Search Tool. For more than a decade researchers have examined a range of alignment-free alternatives to these approaches, citing concerns over scalability in the era of Next Generation Sequencing, the emergence of petascale sequence archives, and a lack of robustness of alignment methods in the face of structural sequence rearrangements. While some of these approaches have proven successful for particular tasks, many continue to exhibit a marked decline in sensitivity as closely related sequence sets diverge. Avoiding the alignment step allows the methods to scale to the challenges of modern sequence collections, but only at the cost of noticeably inferior search. In this paper we re-examine the problem of similarity measures for alignment-free sequence comparison, and introduce a new method which we term Similarity Projection. Similarity Projection offers markedly enhanced sensitivity – comparable to alignment-based methods – while retaining the scalability characteristic of alignment-free approaches. Like earlier alignment-free methods, we rely on collections of k-mers: overlapping substrings of length k of the molecular sequence, collected without reference to position. Similarity, however, relies on variants of the Hausdorff set distance, allowing it to be scored more effectively to reflect those components which match, while lessening the impact of those which do not. Formally, the algorithm generates a large mutual similarity matrix between sequence pairs based on their component fragments; successive reduction steps yield a final score over the sequences. However, only a small fraction of these underlying comparisons need be performed, and by use of an approximate scheme based on vector quantization, we are able to achieve an order of magnitude improvement in execution time over the naive approach. We evaluate the approach on two large protein collections obtained from UniProtKB, showing that Similarity Projection achieves accuracy rivalling, and at times clearly exceeding, that of BLAST, while exhibiting markedly superior execution speed.
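
As a rough illustration of the general idea described in the abstract (k-mer sets, a pairwise matrix over fragments, then reduction steps to a single score), the following Python sketch may help. It is not the authors' implementation: the function names, the use of Hamming distance between k-mers as a stand-in for a substitution-matrix score, and the choice of an averaged Hausdorff-style reduction are all assumptions made for this example. It is also the naive all-pairs form and does not include the vector quantization speed-up.

```python
from typing import List


def kmers(seq: str, k: int) -> List[str]:
    """Distinct overlapping length-k substrings of seq, collected without position."""
    return sorted({seq[i:i + k] for i in range(len(seq) - k + 1)})


def kmer_distance(a: str, b: str) -> int:
    """Hamming distance between two equal-length k-mers.

    Assumption: a toy stand-in for whatever residue-level scoring
    a real system would use (e.g. a substitution matrix).
    """
    return sum(x != y for x, y in zip(a, b))


def distance_matrix(xs: List[str], ys: List[str]) -> List[List[int]]:
    """Pairwise k-mer distances: rows index xs, columns index ys."""
    return [[kmer_distance(x, y) for y in ys] for x in xs]


def hausdorff_like_distance(s: str, t: str, k: int = 3) -> float:
    """Averaged Hausdorff-style distance between the k-mer sets of s and t.

    Reduction steps (hypothetical, chosen for illustration):
      1. for each k-mer, keep only the distance to its best match on the other side;
      2. average those best-match distances in each direction;
      3. take the larger of the two directed averages.
    """
    xs, ys = kmers(s, k), kmers(t, k)
    if not xs or not ys:
        raise ValueError("both sequences must have length >= k")
    m = distance_matrix(xs, ys)
    row_best = [min(row) for row in m]        # best match for each k-mer of s
    col_best = [min(col) for col in zip(*m)]  # best match for each k-mer of t
    forward = sum(row_best) / len(row_best)
    backward = sum(col_best) / len(col_best)
    return max(forward, backward)


if __name__ == "__main__":
    print(hausdorff_like_distance("ACDE", "ACDF", k=3))  # 0.5
```

Because each k-mer contributes only through its best match, fragments that agree pull the score toward zero while unmatched fragments add a bounded penalty. Averaging rather than taking the worst case, as the classical Hausdorff distance does, is one way a variant can lessen the impact of the fragments that do not match.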

