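Selective tree growing: a deterministic constant-space linear-time algorithm for pattern discovery and for computing multiple sequence alignment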

Proceedings. IEEE Computer Society Bioinformatics Conference. Pub Date: 2002-08-14. DOI: 10.1109/CSB.2002.1039367

Mashilamani Sambasivam


Citations: 0

Abstract

Summary form only given. Given a set of n sequences, the multiple sequence alignment problem is to align them, with or without gaps, so that what the sequences have in common is brought out appropriately. Let m be the total length of the input sequences, A the alphabet size, and P the final number of unique patterns, fixed by the user, that cause an alignment between sequences. The algorithm then runs in O(m(A + P)) time in the worst case, which is linear in m for fixed A and P. It handles sequences over both small and large alphabets. Because it forms the alignment by first discovering patterns, it is also a pattern discovery solution. We support our theoretical conclusions with experimental results from running the algorithm on GenPept sequences and on human genome sequences from the public-domain GenBank database. The algorithm uses direct n-wise alignment and constant memory space irrespective of the value of m. What differentiates it from most other algorithms is that it is deterministic: it is guaranteed, and theoretically proved, to find all patterns of arbitrary length that occur in at least k sequences and are responsible for the multiple sequence alignment, where k is specified by the user.
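The abstract does not describe the selective tree growing procedure itself, so the following is only a naive illustration of the support criterion it mentions: reporting patterns that occur in at least k of the input sequences. The function name `patterns_with_support` and the fixed `length` parameter are assumptions made for this sketch. Unlike the algorithm claimed in the abstract, it handles one pattern length at a time and uses memory proportional to the input rather than constant space.

```python
from collections import defaultdict


def patterns_with_support(sequences, length, k):
    """Map each substring of the given length to the indices of the
    sequences containing it, keeping only substrings found in >= k sequences.

    Hypothetical helper for illustration; not the paper's algorithm.
    """
    support = defaultdict(set)
    for idx, seq in enumerate(sequences):
        # Slide a window of the requested length over the sequence.
        for start in range(len(seq) - length + 1):
            support[seq[start:start + length]].add(idx)
    # Keep only patterns whose support reaches the user-specified k.
    return {p: sorted(ids) for p, ids in support.items() if len(ids) >= k}


if __name__ == "__main__":
    seqs = ["ACGTAC", "TTACGA", "GACGTT"]
    print(patterns_with_support(seqs, length=3, k=3))
```

With k equal to the number of sequences, only patterns shared by every sequence survive; lowering k admits patterns that anchor an alignment among a subset of the sequences.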
