SARVAVID: A Domain Specific Language for Developing Scalable Computational Genomics Applications

Proceedings of the 2016 International Conference on Supercomputing. Pub Date: 2016-06-01. DOI: 10.1145/2925426.2926283

K. Mahadik, Christopher Wright, Jinyi Zhang, Milind Kulkarni, S. Bagchi, S. Chaterji


Citations: 29

Abstract



Breakthroughs in gene sequencing technologies have led to an exponential increase in the amount of genomic data. Efficient tools to rapidly process such large quantities of data are critical in the study of gene functions, diseases, evolution, and population variation. These tools are designed in an ad hoc manner and require extensive programmer effort to develop and optimize. Often, such tools are written with the currently available data sizes in mind and soon start to underperform as the data grows exponentially. Furthermore, to obtain high performance, these tools require parallel implementations, which adds to the development complexity. This paper observes that most such tools contain a recurring set of software modules, or kernels. The availability of efficient implementations of such kernels can improve programmer productivity and provide effective scalability with growing data. To achieve this goal, the paper presents a domain-specific language, called Sarvavid, which provides these kernels as language constructs. Sarvavid comes with a compiler that performs domain-specific optimizations, which are beyond the scope of libraries and generic compilers. Furthermore, Sarvavid inherently supports exploitation of parallelism across multiple nodes. To demonstrate the efficacy of Sarvavid, we implement five well-known genomics applications (BLAST, MUMmer, E-MEM, SPAdes, and SGA) using Sarvavid. Our versions of BLAST, MUMmer, and E-MEM show speedups of 2.4X, 2.5X, and 2.1X, respectively, over hand-optimized implementations when run on a single node, while SPAdes and SGA show the same performance as hand-written code. Moreover, Sarvavid applications scale to 1024 cores using a Hadoop backend.
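The idea of recurring kernels that several genomics tools share can be illustrated with a small Python sketch. This is not Sarvavid syntax, and the abstract does not name the kernels; the functions `k_merize`, `build_index`, `lookup_seeds`, and `find_seeds` are hypothetical names chosen here to show how a seed-finding step, of the kind that alignment tools commonly perform, might be composed from reusable building blocks.

```python
from collections import defaultdict


def k_merize(seq, k):
    """Hypothetical kernel: yield (offset, k-mer) for every length-k window."""
    for i in range(len(seq) - k + 1):
        yield i, seq[i:i + k]


def build_index(seq, k):
    """Hypothetical kernel: map each k-mer to the offsets where it occurs."""
    index = defaultdict(list)
    for offset, kmer in k_merize(seq, k):
        index[kmer].append(offset)
    return index


def lookup_seeds(index, query, k):
    """Hypothetical kernel: exact k-mer matches as (query_offset, ref_offset)."""
    hits = []
    for q_off, kmer in k_merize(query, k):
        # .get avoids inserting empty entries into the defaultdict
        for r_off in index.get(kmer, []):
            hits.append((q_off, r_off))
    return hits


def find_seeds(reference, query, k):
    """An 'application' written purely as a composition of the kernels above."""
    return lookup_seeds(build_index(reference, k), query, k)


if __name__ == "__main__":
    print(find_seeds("ACGTACGT", "GTAC", 3))
```

Because the application is expressed only in terms of the kernels, an optimizer that understands them (for example, one that reuses a reference index across many queries) could in principle improve every application built this way, which is the kind of benefit the abstract attributes to a domain-specific compiler.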
