Performance evaluation and tuning of BioPig for genomic analysis

International Symposium on Design and Implementation of Symbolic Computation Systems | Pub Date: 2015-11-15 | DOI: 10.1145/2831244.2831252

Lizhen Shi, Zhong Wang, Weikuan Yu, Xiandong Meng

Citations: 2

Abstract

In this study, we optimize Hadoop parameters to improve the performance of BioPig on Amazon Web Services (AWS). BioPig is a toolkit for large-scale sequencing data analysis. It is built on Hadoop and Pig, which enable easy parallel programming and scaling to terabyte-sized datasets. AWS is Amazon's widely used cloud-computing platform. When BioPig jobs run on AWS, the default configuration parameters may lead to high computational costs. We selected k-mer counting as the benchmark workload because it is used in a large number of next-generation sequencing (NGS) data analysis tools. Starting from a baseline configuration, we tuned Hadoop parameters from five different perspectives and found that the performance improvement varied with the parameters tuned. With an optimized set of parameters, the overall job execution time of k-mer counting on BioPig was reduced by 50%. This paper documents our tuning experiments as a reference for future Hadoop-based analytics applications on genomics datasets.
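For readers unfamiliar with the benchmark workload, the following is a minimal sketch of k-mer counting written in a map/reduce style. It is not BioPig's implementation and does not use Hadoop or Pig; the function names (`map_kmers`, `reduce_counts`, `count_kmers`) and the choice to skip windows containing the ambiguous base `N` are assumptions made only for illustration.

```python
from collections import Counter
from typing import Iterable, Iterator, Tuple


def map_kmers(read: str, k: int) -> Iterator[Tuple[str, int]]:
    """Map step: emit (k-mer, 1) for every length-k window of one read."""
    read = read.upper()
    for i in range(len(read) - k + 1):
        kmer = read[i:i + k]
        # Assumption for this sketch: windows with ambiguous bases are skipped.
        if "N" not in kmer:
            yield kmer, 1


def reduce_counts(pairs: Iterable[Tuple[str, int]]) -> Counter:
    """Reduce step: sum the emitted counts for each distinct k-mer."""
    counts: Counter = Counter()
    for kmer, n in pairs:
        counts[kmer] += n
    return counts


def count_kmers(reads: Iterable[str], k: int) -> Counter:
    """Run the map step over all reads and feed the pairs to the reduce step."""
    return reduce_counts(pair for read in reads for pair in map_kmers(read, k))


if __name__ == "__main__":
    # Two toy reads; real NGS inputs contain millions of reads.
    for kmer, n in sorted(count_kmers(["ACGTAC", "CGTA"], 3).items()):
        print(kmer, n)
```

In a Hadoop job the map output is shuffled across the cluster by key before the reduce step runs, which is one reason a workload like this can be sensitive to configuration parameters.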



