Parallel processing of approximate sequence matching using disk-based suffix tree on multi-core CPU

Yosuke Watanuki, Keiichi Tamura, H. Kitakami, Yoshifumi Takahashi
DOI: 10.1109/IWCIA.2013.6624801 (https://doi.org/10.1109/IWCIA.2013.6624801)
Published in: 2013 IEEE 6th International Workshop on Computational Intelligence and Applications (IWCIA)
Publication date: 2013-07-13
Citations: 2

Abstract

Suffix trees, which are trie structures that represent the suffixes of given sequences (e.g., strings), are widely used for sequence search in application domains such as text data mining, web intelligence, bioinformatics, and computational biology. They are particularly useful in bioinformatics applications because they can search for similar sub-sequences and extract frequent sequence patterns efficiently. In recent years, efficiently constructing suffix trees that allow faster sequence searches has become one of the most important challenges, because the number and size of the data stored in sequence databases have been increasing exponentially. This paper proposes a novel parallelization model for approximate sequence matching on a multi-core CPU that uses disk-based suffix trees, that is, suffix trees built on hard disk rather than in main memory. In the proposed model, we divide the entire sequence database into two or more sub-databases called partitions. For each partition, we build a suffix tree, and we define a task as one approximate sequence matching operation on one suffix tree. Moreover, the proposed model includes a multiple-buffering management system to avoid conflicts among CPU cores. We evaluated the proposed model on a PC using an actual amino acid sequence database. The experimental results show a substantial improvement in computation performance.
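The partition-and-task structure described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a naive in-memory suffix trie in place of a disk-based suffix tree, treats approximate matching as matching within a bounded number of substitutions (Hamming distance), and does not model the multiple-buffering management system. All function names (`partition`, `build_suffix_trie`, `approx_search`, `parallel_search`) are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(items, n_parts):
    """Split the database round-robin into n_parts sub-databases (partitions)."""
    return [items[i::n_parts] for i in range(n_parts)]


def build_suffix_trie(part):
    """Build a naive suffix trie for one partition (assumption: in memory, not on disk).

    Each node records the (sequence id, start offset) of every suffix passing through it.
    """
    root = {"children": {}, "occ": []}
    for seq_id, seq in part:
        for start in range(len(seq)):
            node = root
            for ch in seq[start:]:
                node = node["children"].setdefault(ch, {"children": {}, "occ": []})
                node["occ"].append((seq_id, start))
    return root


def approx_search(node, pattern, max_mismatches):
    """Return (sequence id, offset) pairs matching pattern with at most max_mismatches substitutions."""
    if not pattern:
        return set(node["occ"])
    hits = set()
    for ch, child in node["children"].items():
        cost = 0 if ch == pattern[0] else 1
        if cost <= max_mismatches:
            hits |= approx_search(child, pattern[1:], max_mismatches - cost)
    return hits


def run_task(part, pattern, max_mismatches):
    """One task: an approximate matching query against the trie of one partition."""
    return approx_search(build_suffix_trie(part), pattern, max_mismatches)


def parallel_search(sequences, pattern, max_mismatches, n_parts=2):
    """Run one task per partition in a worker pool and merge the results."""
    parts = partition(list(enumerate(sequences)), n_parts)
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        results = pool.map(lambda p: run_task(p, pattern, max_mismatches), parts)
        hits = set().union(*results)
    return sorted(hits)


if __name__ == "__main__":
    database = ["ACDEFG", "ACDQFG", "WWWW"]  # toy amino-acid-like strings
    print(parallel_search(database, "CDEF", max_mismatches=1))
```

Because CPython threads share the global interpreter lock, this sketch shows only how tasks map onto partitions, not a real multi-core speedup; a process pool or a compiled language would be needed to measure one.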