在GPU上实现多精度模幂运算的自动优化代码生成

2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum Pub Date : 2013-05-20 DOI:10.1109/IPDPSW.2013.149

Niall Emmart, C. Weems

{"title":"在GPU上实现多精度模幂运算的自动优化代码生成","authors":"Niall Emmart, C. Weems","doi":"10.1109/IPDPSW.2013.149","DOIUrl":null,"url":null,"abstract":"Multiprocessing modular exponentiation has a variety of uses, including cryptography, prime testing and computational number theory. It is also a very costly operation to compute. GPU parallelism can be used to accelerate these computations, but to use the GPU efficiently, a problem must involve a significant number of simultaneous exponentiation operations. Handling a large number of TLS/SSL encrypted sessions in a data center is a significant problem that fits this profile. We have developed a framework that enables generation of highly efficient NVIDIA PTX implementations of exponentiation operations for different GPU architectures and problem instances. One of the challenges in generating such code is that PTX is not a true assembly language, but is instead a virtual instruction set that is compiled and optimized in different ways for different generations of GPU hardware. Thus, the same PTX code runs with different levels of efficiency on different machines. And as the precision of the exponentiation values changes, each architecture has its own break-even points where a different algorithm or parallelization strategy must be employed. To make the code efficient for a given problem instance and architecture thus requires searching a multidimensional space of algorithms and configurations, by generating thousands of lines of carefully constructed PTX code for each combination, executing it, validating the numerical result, and evaluating its actual performance. Our framework automates much of this process, and produces exponentiation code that is up to six times faster than the best known hand-coded implementations. More importantly, the framework enables users to relatively quickly find the best configuration for each new GPU architecture. Our framework is also the basis for the eventual creation of a multiprocessing matrix arithmetic package for GPU cluster systems that will be portable across multiple generations of GPU.","PeriodicalId":234552,"journal":{"name":"2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum","volume":"30 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2013-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":"{\"title\":\"Toward Automatic Optimized Code Generation for Multiprecision Modular Exponentiation on a GPU\",\"authors\":\"Niall Emmart, C. Weems\",\"doi\":\"10.1109/IPDPSW.2013.149\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Multiprocessing modular exponentiation has a variety of uses, including cryptography, prime testing and computational number theory. It is also a very costly operation to compute. GPU parallelism can be used to accelerate these computations, but to use the GPU efficiently, a problem must involve a significant number of simultaneous exponentiation operations. Handling a large number of TLS/SSL encrypted sessions in a data center is a significant problem that fits this profile. We have developed a framework that enables generation of highly efficient NVIDIA PTX implementations of exponentiation operations for different GPU architectures and problem instances. One of the challenges in generating such code is that PTX is not a true assembly language, but is instead a virtual instruction set that is compiled and optimized in different ways for different generations of GPU hardware. Thus, the same PTX code runs with different levels of efficiency on different machines. And as the precision of the exponentiation values changes, each architecture has its own break-even points where a different algorithm or parallelization strategy must be employed. To make the code efficient for a given problem instance and architecture thus requires searching a multidimensional space of algorithms and configurations, by generating thousands of lines of carefully constructed PTX code for each combination, executing it, validating the numerical result, and evaluating its actual performance. Our framework automates much of this process, and produces exponentiation code that is up to six times faster than the best known hand-coded implementations. More importantly, the framework enables users to relatively quickly find the best configuration for each new GPU architecture. Our framework is also the basis for the eventual creation of a multiprocessing matrix arithmetic package for GPU cluster systems that will be portable across multiple generations of GPU.\",\"PeriodicalId\":234552,\"journal\":{\"name\":\"2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum\",\"volume\":\"30 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2013-05-20\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"1\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/IPDPSW.2013.149\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/IPDPSW.2013.149","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 1

摘要

多处理模块幂有多种用途，包括密码学、素数测试和计算数论。这也是一个非常昂贵的计算操作。GPU的并行性可以用来加速这些计算，但要有效地使用GPU，一个问题必须涉及大量的同时幂运算。处理数据中心中大量的TLS/SSL加密会话是适合此配置文件的一个重要问题。我们已经开发了一个框架，可以为不同的GPU架构和问题实例生成高效的NVIDIA PTX幂运算实现。生成这种代码的挑战之一是PTX不是一种真正的汇编语言，而是一种虚拟指令集，针对不同代的GPU硬件以不同的方式进行编译和优化。因此，相同的PTX代码在不同的机器上以不同的效率级别运行。随着幂运算精度的变化，每个体系结构都有自己的平衡点，在这个平衡点上必须采用不同的算法或并行化策略。因此，为了使代码对给定的问题实例和体系结构有效，需要搜索算法和配置的多维空间，为此需要为每个组合生成数千行精心构造的PTX代码，执行它，验证数值结果，并评估其实际性能。我们的框架自动化了这个过程的大部分，并生成了比最著名的手工编码实现快六倍的求幂代码。更重要的是，该框架使用户能够相对快速地找到每个新GPU架构的最佳配置。我们的框架也是最终为GPU集群系统创建的多处理矩阵算法包的基础，它将在多代GPU之间进行移植。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Toward Automatic Optimized Code Generation for Multiprecision Modular Exponentiation on a GPU

Multiprocessing modular exponentiation has a variety of uses, including cryptography, prime testing and computational number theory. It is also a very costly operation to compute. GPU parallelism can be used to accelerate these computations, but to use the GPU efficiently, a problem must involve a significant number of simultaneous exponentiation operations. Handling a large number of TLS/SSL encrypted sessions in a data center is a significant problem that fits this profile. We have developed a framework that enables generation of highly efficient NVIDIA PTX implementations of exponentiation operations for different GPU architectures and problem instances. One of the challenges in generating such code is that PTX is not a true assembly language, but is instead a virtual instruction set that is compiled and optimized in different ways for different generations of GPU hardware. Thus, the same PTX code runs with different levels of efficiency on different machines. And as the precision of the exponentiation values changes, each architecture has its own break-even points where a different algorithm or parallelization strategy must be employed. To make the code efficient for a given problem instance and architecture thus requires searching a multidimensional space of algorithms and configurations, by generating thousands of lines of carefully constructed PTX code for each combination, executing it, validating the numerical result, and evaluating its actual performance. Our framework automates much of this process, and produces exponentiation code that is up to six times faster than the best known hand-coded implementations. More importantly, the framework enables users to relatively quickly find the best configuration for each new GPU architecture. Our framework is also the basis for the eventual creation of a multiprocessing matrix arithmetic package for GPU cluster systems that will be portable across multiple generations of GPU.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum

自引率

0.00%

发文量