Scaling up genome annotation using MAKER and work queue.

JCR quartile: Q4 (Health Professions)

International Journal of Bioinformatics Research and Applications. Pub Date: 2014-01-01. DOI: 10.1504/IJBRA.2014.062994

Andrew Thrasher, Zachary Musgrave, Brian Kachmarck, Douglas Thain, Scott Emrich

Volume 10, Issues 4-5, Pages 447-460. Publication type: Journal Article.

Citations: 9

Abstract

Next-generation sequencing technologies have made it possible to sequence many genomes. Because demand is increasing and many of the required analyses are inherently parallel, these bioinformatics applications should ideally run on clusters, clouds and/or grids. We present a modified annotation framework that achieves a 45x speed-up with 50 workers on a Caenorhabditis japonica test case. We also evaluate these modifications within the Amazon EC2 cloud framework. The underlying genome annotation tool (MAKER) is parallelised as an MPI application. Our framework enables it to run without MPI while utilising a wide variety of distributed computing resources. This parallel framework also allows easy, explicit data transfer, which helps overcome a major limitation of bioinformatics tools that often rely on shared file systems. Combined, our proposed framework can be used, even during early stages of development, to easily run sequence analysis tools on clusters, grids and clouds.
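The idea of explicit data transfer can be illustrated with a small, self-contained sketch. This is not the MAKER or Work Queue API. It is a toy master-worker loop in plain Python, in which a thread pool stands in for remote workers and every name (`Task`, `run_on_worker`, `run_all`, `count_bases`) is hypothetical. Each task declares the files it needs and the file it produces, so a worker only ever touches its own private sandbox and never assumes a shared file system.

```python
import shutil
import tempfile
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from pathlib import Path


@dataclass
class Task:
    """One unit of work plus the files it needs and produces (hypothetical type)."""
    name: str
    inputs: list[Path]   # files copied to the worker before the task runs
    output_name: str     # file copied back to the master after the task runs


def count_bases(sandbox: Path, task: Task) -> None:
    """Stand-in for an annotation step: count sequence characters in FASTA inputs."""
    total = 0
    for src in task.inputs:
        for line in (sandbox / src.name).read_text().splitlines():
            if not line.startswith(">"):   # skip FASTA header lines
                total += len(line.strip())
    (sandbox / task.output_name).write_text(str(total))


def run_on_worker(task: Task, results_dir: Path) -> Path:
    """Stage inputs into a private sandbox, run the task, stage the output back."""
    with tempfile.TemporaryDirectory() as tmp:
        sandbox = Path(tmp)
        for src in task.inputs:                          # explicit transfer in
            shutil.copy(src, sandbox / src.name)
        count_bases(sandbox, task)                       # sees only the sandbox
        dest = results_dir / task.output_name
        shutil.copy(sandbox / task.output_name, dest)    # explicit transfer out
        return dest


def run_all(tasks: list[Task], results_dir: Path, workers: int = 4) -> list[Path]:
    """Master loop: hand tasks to a pool of workers and collect outputs in order."""
    results_dir.mkdir(parents=True, exist_ok=True)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda t: run_on_worker(t, results_dir), tasks))
```

In this sketch, one task per contig file would be built and passed to `run_all`. In a real deployment the copy steps would become network transfers handled by the distributed framework, but the contract stays the same: a task can read only what it declared as input.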


Source journal

International Journal of Bioinformatics Research and Applications (Health Professions: Health Information Management)

CiteScore: 0.60

Self-citation rate: 0.00%

About the journal: Bioinformatics is an interdisciplinary research field that combines biology, computer science, mathematics and statistics into a broad-based field that will have profound impacts on all fields of biology. The emphasis of IJBRA is on basic bioinformatics research methods, tool development, performance evaluation and their applications in biology. IJBRA addresses the most innovative developments, research issues and solutions in bioinformatics and computational biology and their applications.

Topics covered include: Databases, bio-grid, system biology; Biomedical image processing, modelling and simulation; Bio-ontology and data mining, DNA assembly, clustering, mapping; Computational genomics/proteomics; Silico technology: computational intelligence, high performance computing; E-health, telemedicine; Gene expression, microarrays, identification, annotation; Genetic algorithms, fuzzy logic, neural networks, data visualisation; Hidden Markov models, machine learning, support vector machines; Molecular evolution, phylogeny, modelling, simulation, sequence analysis; Parallel algorithms/architectures, computational structural biology; Phylogeny reconstruction algorithms, physiome, protein structure prediction; Sequence assembly, search, alignment; Signalling/computational biomedical data engineering; Simulated annealing, statistical analysis, stochastic grammars.