A Linear Time Algorithm that Infers Hidden Strings from Their Concatenations


IPSJ Transactions on Bioinformatics Pub Date : 2008-01-01 DOI:10.2197/IPSJTBIO.1.13

Tomohiro Yasuda



Abstract

Let T be a set of hidden strings and S be a set of their concatenations. We address the problem of inferring T from S. Any formalization of the problem as an optimization problem would be computationally hard, because it is NP-complete even to determine whether there exists a T smaller than S, and because it is also NP-complete to partition just two strings into the smallest common collection of substrings. In this paper, we devise a new algorithm that infers T by finding common substrings in S and splitting them. This algorithm is scalable and runs in O(L) time regardless of the cardinality of S, where L is the sum of the lengths of all strings in S. In computational experiments, 40,000 random concatenations of randomly generated strings were successfully decomposed, and the effectiveness of our method on this problem was compared with that of multiple sequence alignment programs. We also present the results of a preliminary experiment on the transcriptome of Homo sapiens and describe problems that arise in applications where real large-scale cDNA sequences are analyzed.
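
To make the idea of "finding common substrings in S and splitting them" concrete, the following is a minimal illustrative sketch, not the paper's algorithm. It assumes a much simpler setting in which a fragment is split only when another fragment occurs inside it verbatim, and it runs a naive fixpoint loop rather than the O(L)-time procedure described above. The function name split_on_common_substrings and the min_len threshold are hypothetical and introduced only for this illustration.

```python
def split_on_common_substrings(strings, min_len=3):
    """Naive illustration: repeatedly split any fragment that contains
    another fragment as a substring, until no such pair remains.

    This is NOT the linear-time algorithm of the paper; it only shows the
    general idea of decomposing concatenations at shared substrings.
    min_len (hypothetical parameter) prevents very short fragments from
    triggering spurious splits.
    """
    fragments = {s for s in strings if s}

    def ordered(frags):
        # Deterministic order: shorter fragments first, then alphabetical.
        return sorted(frags, key=lambda f: (len(f), f))

    changed = True
    while changed:
        changed = False
        for a in ordered(fragments):
            if len(a) < min_len:
                continue
            for b in ordered(fragments):
                if a == b:
                    continue
                i = b.find(a)
                if i < 0:
                    continue
                # Replace b by the pieces around the first occurrence of a.
                # a itself is already in the set, so it is not re-added.
                fragments.remove(b)
                for piece in (b[:i], b[i + len(a):]):
                    if piece:
                        fragments.add(piece)
                changed = True
                break
            if changed:
                break
    return fragments


if __name__ == "__main__":
    S = ["ACGTTTGA", "ACGT", "TTGACCC"]
    print(sorted(split_on_common_substrings(S)))
```

Each split removes a fragment and adds pieces of strictly smaller total length, so the loop terminates. Unlike the approach the abstract describes, this sketch cannot separate two strings that merely share a common substring when neither contains a whole fragment of the other.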

