ArrowSAM: In-Memory Genomics Data Processing Using Apache Arrow

2020 3rd International Conference on Computer Applications & Information Security (ICCAIS). Pub Date: 2020-03-01. DOI: 10.1109/ICCAIS48893.2020.9096725

Tanveer Ahmad, Nauman Ahmed, J. Peltenburg, Z. Al-Ars


Citations: 7

Abstract



The rapidly growing size of genomics databases, driven by advances in sequencing technologies, demands fast and cost-effective processing. However, processing this data creates many challenges, particularly in selecting appropriate algorithms and computing platforms. Computing systems need data close to the processor for fast processing. Traditionally, the cost, volatility and other physical constraints of DRAM made it infeasible to keep large working data sets in memory. Emerging storage-class memories, however, allow big data to be stored and processed closer to the processor. In this work, we show how the commonly used genomics data format, Sequence Alignment/Map (SAM), can be represented in the Apache Arrow in-memory data format to benefit from in-memory processing and to scale better through shared memory objects, avoiding large (de)serialization overheads in cross-language interoperability. To demonstrate the benefits of such a system, we propose ArrowSAM, an in-memory SAM format that uses the Apache Arrow framework, and integrate it into genome pre-processing pipelines including BWA-MEM, Picard and Sambamba. Results show 15x and 2.4x speedups compared to Picard and Sambamba, respectively. The code and scripts for running all workflows are freely available at https://github.com/abs-tudelft/ArrowSAM.
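To make the idea of a columnar in-memory SAM layout concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the ArrowSAM schema or code: it does not use Apache Arrow, the helper names `sam_to_columns` and `coordinate_order` are hypothetical, and the example records are made up. It stores the 11 mandatory SAM fields as one list per field and shows that an operation such as coordinate sorting only needs to read the RNAME and POS columns.

```python
# Illustrative sketch only: a plain-Python stand-in for a columnar SAM layout.
# It does not use Apache Arrow and is not the ArrowSAM schema.

# The 11 mandatory SAM fields and the Python type used for each column.
SAM_FIELDS = [
    ("QNAME", str), ("FLAG", int), ("RNAME", str), ("POS", int),
    ("MAPQ", int), ("CIGAR", str), ("RNEXT", str), ("PNEXT", int),
    ("TLEN", int), ("SEQ", str), ("QUAL", str),
]


def sam_to_columns(lines):
    """Convert SAM text lines into one list per mandatory field (hypothetical helper)."""
    columns = {name: [] for name, _ in SAM_FIELDS}
    for line in lines:
        # Skip blank lines and '@' header lines.
        if not line.strip() or line.startswith("@"):
            continue
        parts = line.rstrip("\n").split("\t")
        # zip stops after the 11 mandatory fields, so optional tags are ignored.
        for (name, cast), value in zip(SAM_FIELDS, parts):
            columns[name].append(cast(value))
    return columns


def coordinate_order(columns):
    """Return row indices sorted by (RNAME, POS), reading only two columns."""
    n = len(columns["POS"])
    return sorted(range(n), key=lambda i: (columns["RNAME"][i], columns["POS"][i]))


# Hypothetical example records, made up for illustration.
example = [
    "@HD\tVN:1.6\tSO:unsorted",
    "r1\t0\tchr2\t100\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r2\t16\tchr1\t250\t60\t4M\t*\t0\t0\tTTGA\tIIII",
    "r3\t0\tchr1\t50\t30\t4M\t*\t0\t0\tGGCC\tIIII",
]

cols = sam_to_columns(example)
order = coordinate_order(cols)
sorted_names = [cols["QNAME"][i] for i in order]
print(sorted_names)
```

The sort here compares reference names as plain strings, which is a simplification; real tools order references according to the SAM header. In an Arrow-based design the per-field lists would presumably be typed Arrow arrays held in shared memory, so that several tools could read them without copying or re-parsing.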
