Lucas B. Rocha, S. S. Adi, M. A. Stefanes, Elói Araújo
Title: Heuristics for the Specific Substring Problem with Hamming Distance
Published in: 2019 IEEE 19th International Conference on Bioinformatics and Bioengineering (BIBE), 2019-10-01
DOI: 10.1109/BIBE.2019.00052 (https://doi.org/10.1109/BIBE.2019.00052)
Abstract
An important problem in Computational Biology is to determine genetic markers: substrings of the sequences in one set that do not occur in the sequences of other sets. Applications include finding small specific regions for primer design and identifying specific organisms or sequences in metagenomes. Genetic markers can be addressed through the Specific Substring Problem (SSP), which consists of finding all minimal substrings in a given set of sequences that have at least k differences from every substring in another set of sequences. Because solving this problem takes quadratic time when the Hamming distance is considered, and the volume of data to be processed is generally large, an exact solution becomes impractical. With this in mind, the main focus of this work is to propose and investigate heuristic and parallel approaches for the SSP, whose effectiveness was verified in experiments with artificial and real data.
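To make the problem statement concrete, the following is a minimal brute-force sketch of the SSP under the Hamming distance. It is not the heuristic or parallel method proposed in the paper, and it is slower than the quadratic exact algorithm the abstract refers to. The function names (`hamming`, `is_specific`, `minimal_specific_substrings`) are hypothetical, and the sketch assumes that a substring is compared only against background substrings of the same length, and that "minimal" means no shorter substring contained in it is also specific.

```python
from typing import List, Tuple


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def is_specific(x: str, background: List[str], k: int) -> bool:
    """True if x differs in at least k positions from every window of
    length len(x) in every background sequence.

    Assumption: if no background window of that length exists, the
    condition holds vacuously.
    """
    m = len(x)
    return all(
        hamming(x, seq[i:i + m]) >= k
        for seq in background
        for i in range(len(seq) - m + 1)
    )


def minimal_specific_substrings(
    targets: List[str], background: List[str], k: int
) -> List[Tuple[int, int, str]]:
    """Return (target index, start position, substring) for every minimal
    specific substring found by exhaustive search."""
    found = []
    for t, seq in enumerate(targets):
        for start in range(len(seq)):
            for end in range(start + 1, len(seq) + 1):
                if is_specific(seq[start:end], background, k):
                    # Extending a specific substring keeps it specific, so
                    # this is the shortest one beginning at `start`. It is
                    # minimal if dropping its first character breaks
                    # specificity (its shorter prefixes already failed).
                    suffix = seq[start + 1:end]
                    if not suffix or not is_specific(suffix, background, k):
                        found.append((t, start, seq[start:end]))
                    break
    return found


if __name__ == "__main__":
    print(minimal_specific_substrings(["AAGG"], ["AAAA"], 2))
```

With the target `AAGG`, the background `AAAA`, and k = 2, the only minimal specific substring is `GG` at position 2: it differs from `AA` in two positions, while each single `G` differs from `A` in only one.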