HTSFinder: Powerful Pipeline of DNA Signature Discovery by Parallel and Distributed Computing

Evolutionary Bioinformatics Online. Pub Date: 2016-01-01. DOI: 10.4137/EBO.S35545

Ramin Karimi, A. Hajdu


Citations: 5

Abstract




Sustained efforts toward low-cost sequencing over the past few years have led to the growth of complete genome databases. In parallel, there is a strong need for fast and cost-effective methods and applications to accelerate sequence analysis. Identification is the very first step of this task. Because alignment-based approaches are difficult, costly, and computationally demanding, an alternative universal identification method is needed. As an alignment-free approach, DNA signatures have provided new opportunities for the rapid identification of species. In this paper, we present an effective pipeline, HTSFinder (high-throughput signature finder), with a corresponding k-mer generator, GkmerG (genome k-mers generator). Using this pipeline, we determine the frequency of k-mers in the available complete genome databases to detect extensive DNA signatures in a reasonably short time. Our application can detect both unique and common signatures in arbitrarily selected target and nontarget databases. The pipeline uses Hadoop and MapReduce as parallel and distributed computing tools on commodity hardware. This approach brings the power of high-performance computing to ordinary desktop personal computers for discovering DNA signatures in large databases such as bacterial genomes. The considerable number of unique and common DNA signatures detected in the target database offers opportunities to improve the identification process, not only for polymerase chain reaction and microarray assays but also for more complex scenarios such as metagenomics and next-generation sequencing analysis.
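The abstract does not give implementation details, so the following is only a minimal, single-machine sketch of the general idea (k-mer counting in a map/reduce style, then signature selection). It is not the HTSFinder or GkmerG code and does not use Hadoop. The function names are hypothetical, and the definitions used here are assumptions: a "unique" signature is taken to be a k-mer found in exactly one target genome and in no nontarget genome, and a "common" signature is a k-mer found in every target genome and in no nontarget genome. Genome identifiers are assumed to be distinct across the two databases.

```python
from collections import defaultdict
from itertools import chain


def kmers(sequence, k):
    """Yield every overlapping k-mer of a DNA sequence."""
    for i in range(len(sequence) - k + 1):
        yield sequence[i:i + k]


def map_phase(genome_id, sequence, k):
    """Map step: emit one (k-mer, genome_id) pair per distinct k-mer."""
    return [(kmer, genome_id) for kmer in set(kmers(sequence, k))]


def reduce_phase(pairs):
    """Shuffle and reduce step: collect the set of genomes containing each k-mer."""
    grouped = defaultdict(set)
    for kmer, genome_id in pairs:
        grouped[kmer].add(genome_id)
    return grouped


def find_signatures(target, nontarget, k):
    """Return (unique, common) signatures under the assumed definitions.

    target, nontarget: dicts mapping genome id -> sequence string.
    unique: dict mapping k-mer -> the single target genome that contains it.
    common: set of k-mers present in every target genome.
    """
    all_genomes = chain(target.items(), nontarget.items())
    pairs = chain.from_iterable(map_phase(g, s, k) for g, s in all_genomes)
    grouped = reduce_phase(pairs)

    target_ids = set(target)
    unique, common = {}, set()
    for kmer, genome_ids in grouped.items():
        if genome_ids - target_ids:
            continue  # the k-mer also occurs in a nontarget genome
        if len(genome_ids) == 1:
            unique[kmer] = next(iter(genome_ids))
        if genome_ids == target_ids:
            common.add(kmer)
    return unique, common


if __name__ == "__main__":
    # Toy sequences, invented purely for illustration.
    target = {"t1": "ACGTAC", "t2": "ACGTTT"}
    nontarget = {"n1": "TTTGGG"}
    unique, common = find_signatures(target, nontarget, k=3)
    print(sorted(unique.items()))
    print(sorted(common))
```

In a distributed setting, the map and reduce functions would run on separate workers and the grouping would be handled by the framework's shuffle stage; the in-memory dictionary here stands in for that step only to keep the sketch self-contained.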

