Parallel processing of approximate sequence matching using disk-based suffix tree on multi-core CPU

2013 IEEE 6th International Workshop on Computational Intelligence and Applications (IWCIA). Pub Date: 2013-07-13. DOI: 10.1109/IWCIA.2013.6624801

Yosuke Watanuki, Keiichi Tamura, H. Kitakami, Yoshifumi Takahashi


Citations: 2

Abstract

Suffix trees, which are trie structures that represent the suffixes of given sequences (e.g., strings), are widely used for sequence search in application domains such as text data mining, web intelligence, bioinformatics, and computational biology. They are particularly useful in bioinformatics applications, because they can search for similar subsequences and extract frequent sequence patterns efficiently. In recent years, efficient construction of suffix trees that allow faster sequence searches has become one of the most important challenges, because the number and size of the data stored in sequence databases have been increasing exponentially. This paper proposes a novel parallelization model, targeting a multi-core CPU, for approximate sequence matching that uses disk-based suffix trees, that is, suffix trees built on hard disk rather than in memory. In the proposed parallelization model, we divide an entire sequence database into two or more sub-databases called partitions. For each partition, we build a suffix tree and define a task as one approximate sequence matching run on a single suffix tree. Moreover, the proposed parallelization model includes a multiple buffering management system to avoid conflicts among CPU cores. We evaluated the proposed parallelization model using an actual amino acid sequence database on a PC. The experimental results show a substantial improvement in computation performance.
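
The partition-and-task idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the tree here is an uncompressed in-memory suffix trie rather than a disk-based suffix tree, the multiple buffering management system is omitted, and "approximate matching" is assumed to mean a bounded number of mismatches (Hamming distance), which the abstract does not specify. All function names and the round-robin partitioning are hypothetical choices made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def build_suffix_trie(partition):
    """Build an uncompressed in-memory suffix trie for one partition.

    `partition` is a list of (sequence id, sequence) pairs. Each node is a
    dict mapping a character to a child node; the key None holds the set of
    (sequence id, start offset) pairs whose suffix passes through the node.
    """
    root = {None: set()}
    for seq_id, seq in partition:
        for start in range(len(seq)):
            node = root
            for ch in seq[start:]:
                node = node.setdefault(ch, {None: set()})
                node[None].add((seq_id, start))
    return root


def search(node, pattern, max_mismatches):
    """Collect occurrences of `pattern` with at most `max_mismatches` substitutions."""
    if not pattern:
        return set(node[None])
    hits = set()
    for ch, child in node.items():
        if ch is None:
            continue
        cost = 0 if ch == pattern[0] else 1
        if cost <= max_mismatches:
            hits |= search(child, pattern[1:], max_mismatches - cost)
    return hits


def match_partition(partition, pattern, max_mismatches):
    """One task: approximate matching on the suffix trie of a single partition."""
    return search(build_suffix_trie(partition), pattern, max_mismatches)


def parallel_match(database, pattern, max_mismatches, num_partitions=2):
    """Split the database into partitions and run one task per partition."""
    indexed = list(enumerate(database))
    # Round-robin partitioning (an assumption; the abstract does not say how
    # the database is divided).
    partitions = [indexed[i::num_partitions] for i in range(num_partitions)]
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        results = pool.map(
            lambda part: match_partition(part, pattern, max_mismatches),
            partitions,
        )
        hits = set()
        for partial in results:
            hits |= partial
    return sorted(hits)


if __name__ == "__main__":
    toy_database = ["MKVLA", "AKVLM", "MRVLA", "GGGGG"]
    print(parallel_match(toy_database, "KVL", max_mismatches=1))
```

Threads keep the sketch short and easy to run. In standard CPython builds they do not execute Python bytecode on several cores at once, so a process pool (for example `concurrent.futures.ProcessPoolExecutor` with a module-level task function) would be closer to the multi-core setting the paper targets.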
