Parsing Next Generation Sequencing Data in Parallel Environments for Downstream Genetic Variation Analysis

The Journal of Computational Science Education. Pub Date: 2018-12-01. DOI: 10.22369/issn.2153-4136/9/2/5

Mariana Vasquez, J. Mohl, M. Leung

Citations: 3

Abstract

With recent advances in next-generation sequencing technology, analysis of prevalent DNA sequence variants from patients with a particular disease has become an important tool for understanding the associations between the disease and genetic mutations. A publicly accessible bioinformatics pipeline, called OncoMiner (http://oncominer.utep.edu), was implemented in 2016 to help biomedical researchers analyze large genomic datasets from patients with cancer. However, the current version of OncoMiner can only accept input files with a highly specific format for sequence variant description. To handle data from a broader range of sequencing platforms, a data preprocessing tool is necessary. We have therefore implemented the OncoMiner Preprocessing (OP) program, which parses data files in the popular FastQ and BAM formats to generate an OncoMiner input file. OP uses the open-source Bowtie2 and SAMtools software, followed by a Python script we developed for genetic sequence variant identification. To preprocess very large datasets efficiently, the OP program has been parallelized on two local computers and on the Blue Waters system at the National Center for Supercomputing Applications using a multiprocessing approach. Although reasonable parallelization efficiency has been obtained on the local computers, the OP program’s speedup on Blue Waters has been limited, possibly due to I/O issues and individual node memory constraints. Despite these limitations, Blue Waters has provided the necessary resources to process 35 datasets from patients with acute myeloid leukemia, and the results demonstrate a significant correlation of OP runtimes with the BAM input size and chromosome diversity.
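The abstract does not give the OP source code, so the following is only a minimal sketch of the kind of workflow it describes: Bowtie2 and SAMtools commands to turn FastQ reads into a sorted BAM file, followed by a Python step that identifies sequence variants in parallel. Everything specific here is an assumption made for illustration: the function names, the per-chromosome partitioning of work, the simplified in-memory alignment records `(chromosome, 0-based start, read sequence)`, and the restriction to ungapped single-base mismatches. It is not the authors' implementation.

```python
from collections import defaultdict
from multiprocessing import Pool


def alignment_commands(index_prefix, fastq_path, bam_path):
    """Build (but do not run) commands for FastQ -> sorted, indexed BAM.

    Hypothetical helper; the paths and the unpaired-read mode (-U) are
    assumptions, not details taken from the paper.
    """
    sam_path = bam_path + ".sam"
    return [
        ["bowtie2", "-x", index_prefix, "-U", fastq_path, "-S", sam_path],
        ["samtools", "sort", "-o", bam_path, sam_path],
        ["samtools", "index", bam_path],
    ]


def find_mismatches(task):
    """Compare ungapped reads against the reference for one chromosome."""
    chrom, ref_seq, reads = task
    variants = set()
    for start, read_seq in reads:
        for offset, base in enumerate(read_seq):
            pos = start + offset
            if pos < len(ref_seq) and base != ref_seq[pos]:
                # Record (chromosome, 0-based position, reference base, read base).
                variants.add((chrom, pos, ref_seq[pos], base))
    return sorted(variants)


def group_by_chromosome(reference, alignments):
    """Split alignments into one independent task per chromosome."""
    reads = defaultdict(list)
    for chrom, start, seq in alignments:
        reads[chrom].append((start, seq))
    return [(chrom, reference[chrom], reads[chrom]) for chrom in sorted(reads)]


def call_variants(reference, alignments, workers=1):
    """Run the per-chromosome tasks serially or in a process pool."""
    tasks = group_by_chromosome(reference, alignments)
    if workers <= 1:
        results = [find_mismatches(task) for task in tasks]
    else:
        with Pool(processes=workers) as pool:
            results = pool.map(find_mismatches, tasks)
    return [variant for chunk in results for variant in chunk]
```

In a script, `call_variants(reference, alignments, workers=4)` would be called from under an `if __name__ == "__main__":` guard so that worker processes can start cleanly. With one task per chromosome, the achievable speedup is bounded by the number of chromosomes present and by the largest single task, which is one plausible reason runtime might depend on chromosome diversity; the abstract reports that correlation but does not state its cause.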

