Parallel implementation of GCM on GPUs

IF 4.1 · Zone 3 (Computer Science) · JCR Q1: COMPUTER SCIENCE, INFORMATION SYSTEMS

ICT Express Pub Date : 2025-02-14 DOI:10.1016/j.icte.2025.01.006

JaeSeok Lee, DongCheon Kim, Seog Chung Seo

Volume 11, Issue 2, Pages 310-316 · Journal Article · https://www.sciencedirect.com/science/article/pii/S2405959525000062

Citations: 0

Abstract



This paper presents the first fully parallelized optimization of GCM in a GPU environment. As the IoT era emerges, a large number of clients communicate with servers, making encrypted communication necessary for security. GCM is an AEAD (authenticated encryption with associated data) mode and is currently used in various security protocols, including TLS 1.3 and IPsec. Because of the burden of performing encrypted communication with numerous clients, there has been significant research on using GPUs for high-speed parallel processing in encryption. However, to date, there has been no fully parallelized implementation of GCM on GPUs. This paper proposes a method for parallelizing the challenging GHASH computation in GCM mode, leading to a high-speed parallel implementation of AES-GCM that can exceed 400 Gb/s, meeting the requirements of next-generation communication systems. The proposed approach is algorithm-independent and can be applied to any block cipher. Our implementation on an RTX 4090 demonstrates a performance improvement of ×15.38 compared to the maximum processing throughput of a multi-threaded Intel(R) Core(TM) i7-13700K. It also achieves a ×17.87 improvement compared to a hybrid CPU-GPU system. Compared to the most researched FPGA implementation of GCM, specifically on a Xilinx Ultrascale FPGA, our implementation achieves ×1.11 better performance. Beyond throughput, its power efficiency is also better than that of other implementations: it achieves ×3.33 compared to a CPU implementation on an Intel Xeon E3-1220 and ×21.09 compared to an FPGA implementation of AES on the Xilinx Virtex 7 series, which does not include full GCM.
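The abstract does not spell out how GHASH is parallelized, so the following is only a sketch of one well-known identity that makes it possible, not the authors' method. GHASH is defined as a sequential chain, Y_i = (Y_{i-1} XOR X_i) · H in GF(2^128); unrolling it gives Y_n = X_1·H^n XOR X_2·H^(n-1) XOR ... XOR X_n·H, so each block can be multiplied by a precomputed power of H independently and the products combined with XOR. The names `gf_mul`, `ghash_sequential`, and `ghash_parallel` are hypothetical, and plain Python stands in for what would be per-thread work plus an XOR reduction on a GPU.

```python
from functools import reduce
from operator import xor

# Reduction constant from NIST SP 800-38D: bits 11100001 followed by 120 zeros.
R = 0xE1 << 120


def gf_mul(x: int, y: int) -> int:
    """Multiply two 128-bit blocks in GCM's GF(2^128).

    Blocks are integers read big-endian, so bit 0 of the block (GCM's
    leftmost bit) is the integer's most significant bit.
    """
    z, v = 0, y
    for i in range(128):
        if (x >> (127 - i)) & 1:  # i-th bit of x, counted from the left
            z ^= v
        # Multiply v by the field element "x": shift right, reduce on carry.
        v = (v >> 1) ^ R if v & 1 else v >> 1
    return z


def ghash_sequential(h: int, blocks: list[int]) -> int:
    """Reference GHASH: each step depends on the previous one."""
    y = 0
    for x in blocks:
        y = gf_mul(y ^ x, h)
    return y


def ghash_parallel(h: int, blocks: list[int]) -> int:
    """Same value via independent per-block products.

    Block i (0-based) is multiplied by H^(n-i). The products do not depend
    on each other, so on a GPU each could be computed by its own thread
    (an assumption of this sketch), followed by an XOR reduction.
    """
    n = len(blocks)
    powers, p = [], h
    for _ in range(n):  # powers[k] holds H^(k+1)
        powers.append(p)
        p = gf_mul(p, h)
    terms = [gf_mul(x, powers[n - 1 - i]) for i, x in enumerate(blocks)]
    return reduce(xor, terms, 0)


if __name__ == "__main__":
    h = 0x66E94BD4EF8A2C3B884CFA59CA342B2E  # example 128-bit hash subkey
    blocks = [int.from_bytes(bytes([i]) * 16, "big") for i in range(1, 9)]
    print(ghash_sequential(h, blocks) == ghash_parallel(h, blocks))
```

The two functions agree because multiplication in GF(2^128) distributes over XOR. In this form the cost of the sequential chain moves into the table of powers of H, which depends only on the key and can be reused across messages.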


Source journal

ICT Express Multiple-

CiteScore: 10.20

Self-citation rate: 1.90%

Articles published: 167

Review time: 35 weeks

About the journal: ICT Express, published by the Korean Institute of Communications and Information Sciences (KICS), is an international, peer-reviewed research publication covering all aspects of information and communication technology (ICT). The journal aims to publish research that helps advance the theoretical and practical understanding of ICT convergence, platform technologies, communication networks, and device technologies. Technological advances in the ICT sector allow portable devices to stay connected at high data rates, which has driven the recent popularity of smartphones and their considerable impact on economic and social development.