Development of Computational Pipeline Software for Genome/Exome Analysis on the K Computer

Supercomput. Front. Innov. Pub Date: 2020-03-01 DOI: 10.14529/jsfi200102

Kento Aoyama, Masanori Kakuta, Yuri Matsuzaki, T. Ishida, M. Ohue, Y. Akiyama



Abstract

Pipeline software, which chains tools and applications together for specific data-processing tasks, is widely used in bioinformatics research to analyze many types of data, such as genomes. Recent trends in genome analysis require pipeline software that makes optimal use of computational resources so that the large-scale biological data accumulated daily can be handled efficiently. However, using pipeline software in bioinformatics tends to be problematic owing to its large memory and storage requirements, the growing number of job submissions, and its wide range of software dependencies. This paper presents massively parallel genome/exome analysis pipeline software that addresses these difficulties and can be executed on a large number of K computer nodes. The proposed pipeline incorporates workflow management functionality that performs effectively while accounting for the task-dependency graph of internal executions, achieved by extending the dynamic task distribution framework. Performance results for the core pipeline functionality, obtained in evaluation experiments on an actual exome dataset, demonstrate good scalability on over a thousand nodes. Additionally, this study proposes several approaches to resolving pipeline performance bottlenecks, a major challenge in pipeline parallelization, by drawing on domain knowledge of the pipeline's internal executions.
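To make the idea of dependency-aware dynamic task distribution concrete, here is a minimal single-process sketch in Python. It is not the authors' implementation, and it does not reproduce the framework the paper extends. It only illustrates the general technique: a task is handed to a worker as soon as all of its prerequisites have finished. The function name `run_pipeline`, the `n_workers` parameter, and the stage names in the usage example are all hypothetical, and workers are simulated by processing ready tasks in small batches rather than on real compute nodes.

```python
from collections import deque


def run_pipeline(tasks, deps, n_workers=2):
    """Dispatch tasks in dependency order (illustrative sketch only).

    tasks: dict mapping task name -> zero-argument callable.
    deps:  dict mapping task name -> list of task names it must wait for.
           Assumption: every name used in deps is also a key of tasks.
    Returns the task names in the order they were dispatched.
    """
    # Count unmet prerequisites per task and record each task's dependents.
    remaining = {name: len(deps.get(name, [])) for name in tasks}
    dependents = {name: [] for name in tasks}
    for name, prereqs in deps.items():
        for prereq in prereqs:
            dependents[prereq].append(name)

    # Tasks with no prerequisites are ready immediately.
    ready = deque(sorted(n for n, count in remaining.items() if count == 0))
    order = []
    while ready:
        # Hand up to n_workers ready tasks to the (simulated) workers.
        batch = [ready.popleft() for _ in range(min(n_workers, len(ready)))]
        for name in batch:
            tasks[name]()
            order.append(name)
        # Each finished task may unlock tasks that were waiting on it.
        for name in batch:
            for dependent in dependents[name]:
                remaining[dependent] -= 1
                if remaining[dependent] == 0:
                    ready.append(dependent)

    if len(order) != len(tasks):
        raise ValueError("dependency cycle: some tasks never became ready")
    return order


if __name__ == "__main__":
    # Hypothetical stage names for two samples followed by a joint step.
    stages = ["align_a", "align_b", "sort_a", "sort_b", "call"]
    tasks = {name: (lambda name=name: print("running", name)) for name in stages}
    deps = {"sort_a": ["align_a"], "sort_b": ["align_b"], "call": ["sort_a", "sort_b"]}
    run_pipeline(tasks, deps)
```

In this sketch the per-sample stages can proceed independently, while the final stage waits for both samples. A real deployment would replace the batch loop with workers that request tasks over a message-passing layer.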


