Suffix Tree Construction based Mapreduce
Sihem Klai Soukehal, Karima Chibane, M. Khadir
2019 International Conference on Theoretical and Applicative Aspects of Computer Science (ICTAACS), 2019-12-01
DOI: https://doi.org/10.1109/ICTAACS48474.2019.8988123
Abstract
Indexing a genome sequence is a primary step that facilitates further processing such as pattern search or assembly against a reference genome. The suffix tree is one of the most widely used data structures for indexing genome sequences. However, the memory required to run suffix tree construction algorithms may exceed the available main memory. Despite researchers' efforts, suffix tree construction remains very expensive: it relies on data centres to parallelize the processing and reduce execution time, and it remains exposed to the risk of failures and the problems they cause. Parallelization with Hadoop and MapReduce addresses the limits on storage and data processing capacity and provides fault tolerance, all at reasonable cost. Hadoop, a big data framework, and the MapReduce paradigm, which models parallel and distributed processing, are being adopted in many scientific domains to parallelize their processing effectively. The PWOTD (Partition and Write Only Top Down) algorithm is chosen here because it has proven itself among text algorithms for genome sequencing. This paper designs an approach that models the parallel construction of the suffix tree using the MapReduce paradigm, intended for implementation in Hadoop with a Java API.
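The abstract does not give the details of the authors' MapReduce model, so the following is only a minimal sketch of the general idea it names: partition the suffixes by a short prefix, then build each partition's subtree top-down and independently. It is plain single-process Python that imitates the map, shuffle, and reduce steps; it does not use the Hadoop Java API and does not reproduce PWOTD's disk layout or partition sizing. All names here (`map_phase`, `shuffle`, `build_subtree`, `build_partitioned_suffix_tree`, `leaves`, the `$` terminator, and `prefix_len`) are assumptions made for illustration.

```python
from collections import defaultdict

TERMINATOR = "$"  # assumed end marker; must not occur in the sequence


def map_phase(text, prefix_len):
    """Map: emit (prefix, suffix_start) for every suffix of text."""
    for start in range(len(text)):
        yield text[start:start + prefix_len], start


def shuffle(pairs):
    """Shuffle: group suffix start positions by their prefix key."""
    groups = defaultdict(list)
    for key, start in pairs:
        groups[key].append(start)
    return groups


def build_subtree(text, starts, depth):
    """Reduce: top-down build for suffixes that share their first `depth` chars.

    Returns a dict mapping edge label -> child, where a child is either
    another dict (internal node) or an int (leaf: suffix start position).
    """
    by_char = defaultdict(list)
    for s in starts:
        by_char[text[s + depth]].append(s)
    node = {}
    for ch in sorted(by_char):
        group = by_char[ch]
        if len(group) == 1:
            s = group[0]
            node[text[s + depth:]] = s  # leaf edge carries the rest of the suffix
            continue
        # Extend the edge while all suffixes in the group agree on the next char.
        # The unique terminator guarantees they diverge before the text ends.
        lcp = 1
        while len({text[s + depth + lcp] for s in group}) == 1:
            lcp += 1
        first = group[0]
        label = text[first + depth:first + depth + lcp]
        node[label] = build_subtree(text, group, depth + lcp)
    return node


def build_partitioned_suffix_tree(sequence, prefix_len=2):
    """Run map, shuffle, and one reduce per prefix; return (text, tree)."""
    if TERMINATOR in sequence:
        raise ValueError("sequence must not contain the terminator")
    text = sequence + TERMINATOR
    groups = shuffle(map_phase(text, prefix_len))
    tree = {}
    for key, starts in sorted(groups.items()):
        if key.endswith(TERMINATOR):
            tree[key] = starts[0]  # the key already spells a whole suffix
        else:
            tree[key] = build_subtree(text, starts, len(key))
    return text, tree


def leaves(node, path=""):
    """Yield (spelled_suffix, start) for every leaf below node."""
    if isinstance(node, int):
        yield path, node
        return
    for label, child in node.items():
        yield from leaves(child, path + label)


if __name__ == "__main__":
    text, tree = build_partitioned_suffix_tree("GATTACA", prefix_len=2)
    for spelled, start in sorted(leaves(tree), key=lambda item: item[1]):
        print(start, spelled)
```

Each prefix group depends only on the shared text and its own list of start positions, which is why the per-prefix builds could in principle run as separate reduce tasks. A larger `prefix_len` gives more, smaller partitions. How the partitions are actually sized and written in the authors' design is not stated in the abstract.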