Efficient Architecture-Aware Acceleration of BWA-MEM for Multicore Systems

2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS), Pub Date: 2019-05-01, DOI: 10.1109/IPDPS.2019.00041

Md. Vasimuddin, Sanchit Misra, Heng Li, S. Aluru


Citations: 525

Abstract




Innovations in Next-Generation Sequencing are enabling generation of DNA sequence data at ever faster rates and at very low cost. For example, the Illumina NovaSeq 6000 sequencer can generate 6 terabases of data in less than two days, sequencing nearly 20 billion short DNA fragments called reads at the low cost of $1000 per human genome. Large sequencing centers typically employ hundreds of such systems. Such high-throughput and low-cost generation of data underscores the need for commensurate acceleration in downstream computational analysis of the sequencing data. A fundamental step in downstream analysis is mapping of the reads to a long reference DNA sequence, such as a reference human genome. Sequence mapping is a compute-intensive step that accounts for more than 30% of the overall time of the GATK (Genome Analysis ToolKit) best practices workflow. BWA-MEM is one of the most widely used tools for sequence mapping and has tens of thousands of users. In this work, we focus on accelerating BWA-MEM through an efficient architecture-aware implementation, while maintaining identical output. The volume of data requires distributed computing and is usually processed on clusters or cloud deployments, with multicore processors usually being the platform of choice. Since the application can be easily parallelized across multiple sockets (even across distributed memory systems) by simply distributing the reads equally, we focus on performance improvements on a single-socket multicore processor. BWA-MEM run time is dominated by three kernels, collectively responsible for more than 85% of the overall compute time.
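As an illustration of what "simply distributing the reads equally" can look like, here is a minimal sketch that splits a batch of reads into contiguous index ranges whose sizes differ by at most one. The function name `partition_reads` and the notion of a generic "worker" (a socket, node, or process) are assumptions made for this example and are not taken from the BWA-MEM sources.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper: split num_reads reads into num_workers contiguous
// [begin, end) index ranges. The first (num_reads % num_workers) workers
// receive one extra read, so range sizes differ by at most one.
std::vector<std::pair<std::size_t, std::size_t>>
partition_reads(std::size_t num_reads, std::size_t num_workers) {
    assert(num_workers > 0);  // precondition: at least one worker
    std::vector<std::pair<std::size_t, std::size_t>> ranges;
    ranges.reserve(num_workers);
    const std::size_t base = num_reads / num_workers;
    const std::size_t extra = num_reads % num_workers;
    std::size_t begin = 0;
    for (std::size_t w = 0; w < num_workers; ++w) {
        const std::size_t len = base + (w < extra ? 1 : 0);
        ranges.emplace_back(begin, begin + len);
        begin += len;
    }
    return ranges;
}
```

Because each read is mapped independently of the others, each range can be handed to a separate worker without coordination, which is why the abstract can treat the single-socket case as the interesting optimization target.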
We improved the performance of the three kernels by 1) using techniques to improve cache reuse, 2) simplifying the algorithms, 3) replacing many small memory allocations with a few large contiguous ones to improve hardware prefetching of data, 4) software prefetching of data, and 5) utilization of SIMD wherever applicable and massive reorganization of the source code to enable these improvements. As a result, we achieved nearly 2×, 183×, and 8× speedups on the three kernels, respectively, resulting in up to 3.5× and 2.4× speedups on end-to-end compute time over the original BWA-MEM on a single thread and a single socket, respectively, of an Intel Xeon Skylake processor. To the best of our knowledge, this is the highest reported speedup over BWA-MEM (running on a single CPU) while using a single CPU or a single CPU-single GPGPU/FPGA combination. Source code: https://github.com/bwa-mem2/bwa-mem2
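Techniques 3 and 4 in the list above can be sketched generically as follows. This is not code from BWA-MEM or bwa-mem2: the `Record` type, the function names, and the prefetch distance of 8 are all assumptions chosen for illustration. The table lives in one contiguous allocation rather than one small allocation per record, and the irregular lookup loop issues a software prefetch for a later query while processing the current one. `__builtin_prefetch` is a GCC/Clang builtin, so it is guarded and the code still compiles without it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical record type used only for this sketch.
struct Record {
    std::uint32_t key;
    std::uint32_t value;
};

// One large contiguous allocation instead of one small allocation per record.
// Neighbouring records share cache lines, and sequential scans are friendly
// to the hardware prefetcher.
std::vector<Record> make_table(std::size_t n) {
    std::vector<Record> table(n);
    for (std::size_t i = 0; i < n; ++i) {
        table[i] = Record{static_cast<std::uint32_t>(i),
                          static_cast<std::uint32_t>(2 * i)};
    }
    return table;
}

// Sum the values at irregular indices. While query i is processed, the record
// for query i + distance is prefetched so that it is more likely to be in
// cache when the loop reaches it.
std::uint64_t gather_sum(const std::vector<Record>& table,
                         const std::vector<std::size_t>& queries,
                         std::size_t distance = 8) {
    std::uint64_t sum = 0;
    for (std::size_t i = 0; i < queries.size(); ++i) {
        if (i + distance < queries.size()) {
            assert(queries[i + distance] < table.size());
#if defined(__GNUC__) || defined(__clang__)
            // Arguments: address, 0 = read access, 1 = low temporal locality.
            __builtin_prefetch(&table[queries[i + distance]], 0, 1);
#endif
        }
        assert(queries[i] < table.size());
        sum += table[queries[i]].value;
    }
    return sum;
}
```

The prefetch only changes timing, never the result, which fits the abstract's constraint of maintaining identical output. A suitable prefetch distance depends on the memory latency and the work per iteration, so the value here should be read as a placeholder to tune.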
