Parallel implementation of GCM on GPUs
Authors: JaeSeok Lee, DongCheon Kim, Seog Chung Seo
Journal: ICT Express, Volume 11, Issue 2, Pages 310-316 (impact factor 4.1; JCR Q1, Computer Science, Information Systems)
Publication date: 2025-02-14
DOI: 10.1016/j.icte.2025.01.006
URL: https://www.sciencedirect.com/science/article/pii/S2405959525000062
Citation count: 0
Abstract
This paper presents the first fully parallelized optimization of GCM in a GPU environment. With the rise of the IoT era, large numbers of clients communicate with servers, which makes encrypted communication necessary for security. GCM is an AEAD (authenticated encryption with associated data) mode and is currently used in various security protocols, including TLS 1.3 and IPsec. Because of the burden of performing encrypted communication with numerous clients, there has been significant research on using GPUs for high-speed parallel processing in encryption. To date, however, there has been no fully parallelized implementation of GCM on GPUs. This paper proposes a method for parallelizing the challenging GHASH computation in GCM mode, leading to a high-speed parallel implementation of AES-GCM that can exceed 400 Gb/s, meeting the requirements of next-generation communication systems. The proposed approach is algorithm-independent and can be applied to any block cipher. Our implementation on an RTX 4090 demonstrates a 15.38× performance improvement over the maximum processing throughput of a multi-threaded Intel(R) Core(TM) i7-13700K. It also achieves a 17.87× improvement over a hybrid CPU-GPU system. Compared to the most researched FPGA implementation of GCM, specifically on a Xilinx UltraScale FPGA, our implementation achieves 1.11× better performance. The implementation improves power efficiency as well as throughput: it achieves a 3.33× improvement over a CPU implementation on an Intel Xeon E3-1220, and a 21.09× improvement over an FPGA implementation of AES on the Xilinx Virtex 7 series, which does not include full GCM.
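The abstract does not describe how GHASH is parallelized, so the following is only a minimal sketch of the standard algebraic idea that makes it possible, not the authors' scheme. GHASH is a polynomial evaluation in GF(2^128): for blocks X_1..X_n and hash key H, the result is X_1·H^n ⊕ X_2·H^(n-1) ⊕ ... ⊕ X_n·H. The message can therefore be cut into chunks that are hashed independently and then combined using powers of H. The function names (`gf_mul`, `gf_pow`, `ghash_serial`, `ghash_chunked`) and the chunking policy are assumptions made for illustration; the chunks are processed in an ordinary loop here, where a GPU version would assign one chunk per thread.

```python
import random

R = 0xE1 << 120   # reduction constant for GCM's bit-reflected GF(2^128)
ONE = 1 << 127    # multiplicative identity in that representation


def gf_mul(x, y):
    """Multiply two 128-bit field elements (bitwise method from the GCM spec)."""
    z, v = 0, y
    for i in range(128):
        if (x >> (127 - i)) & 1:   # bit i of x, counting from the leftmost bit
            z ^= v
        v = (v >> 1) ^ R if v & 1 else v >> 1
    return z


def gf_pow(h, e):
    """Compute h**e in the field by square-and-multiply."""
    result, base = ONE, h
    while e:
        if e & 1:
            result = gf_mul(result, base)
        base = gf_mul(base, base)
        e >>= 1
    return result


def ghash_serial(h, blocks):
    """Sequential GHASH: each step depends on the previous one."""
    y = 0
    for block in blocks:
        y = gf_mul(y ^ block, h)
    return y


def ghash_chunked(h, blocks, n_chunks):
    """Hash contiguous chunks independently, then combine with powers of h."""
    if not blocks:
        return 0
    size = -(-len(blocks) // n_chunks)          # ceiling division
    acc = 0
    for start in range(0, len(blocks), size):
        chunk = blocks[start:start + size]
        partial = ghash_serial(h, chunk)        # independent of other chunks
        blocks_after = len(blocks) - (start + len(chunk))
        # Shift the chunk's partial hash to its position in the full polynomial.
        acc ^= gf_mul(partial, gf_pow(h, blocks_after))
    return acc


if __name__ == "__main__":
    rng = random.Random(0)
    h = rng.getrandbits(128)
    blocks = [rng.getrandbits(128) for _ in range(37)]
    assert ghash_chunked(h, blocks, 8) == ghash_serial(h, blocks)
    print("chunked GHASH matches serial GHASH")
```

In this sketch the per-chunk work has no data dependencies across chunks, and only the final XOR-combine is a reduction. In practice the powers of H would be precomputed once per key rather than recomputed per chunk.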
About the journal:
ICT Express, published by the Korean Institute of Communications and Information Sciences (KICS), is an international, peer-reviewed journal covering all aspects of information and communication technology (ICT). The journal aims to publish research that advances the theoretical and practical understanding of ICT convergence, platform technologies, communication networks, and device technologies. Advances in ICT enable portable devices to stay connected at all times while supporting high data rates, which has driven the recent popularity of smartphones and their considerable impact on economic and social development.