Parallel implementation of GCM on GPUs

IF 4.1 · Zone 3 (Computer Science) · JCR Q1 (Computer Science, Information Systems)
JaeSeok Lee, DongCheon Kim, Seog Chung Seo
ICT Express, vol. 11, no. 2, pages 310-316. Published 2025-02-14. Journal article.
DOI: 10.1016/j.icte.2025.01.006
https://www.sciencedirect.com/science/article/pii/S2405959525000062
Citations: 0

Abstract

This paper presents the first fully parallelized optimization of GCM in a GPU environment. As the era of IoT emerges, a large number of clients communicate with servers, necessitating encrypted communications for security. GCM is a type of AEAD and is currently used in various security protocols, including TLS 1.3 and IPsec. Due to the burden of performing encrypted communication with numerous clients, there has been significant research on utilizing GPUs for high-speed parallel processing in encryption. However, to date, there has been no fully parallelized implementation of GCM on GPUs. This paper proposes a method for parallelizing the challenging GHASH computation in GCM mode, leading to a high-speed parallel implementation of AES-GCM that can exceed 400 Gb/s, meeting the requirements of next-generation communication systems. The proposed approach is algorithm-independent and can be applied to any block cipher. Our implementation on an RTX 4090 demonstrates a performance improvement of ×15.38 compared to the maximum processing throughput of a multi-threaded Intel(R) Core(TM) i7-13700K. It also achieves a ×17.87 improvement compared to a hybrid CPU–GPU system. Compared to the most researched FPGA implementation for GCM, specifically on a Xilinx Ultrascale FPGA, our implementation achieves ×1.11 better performance. Beyond throughput, its power efficiency is also better than that of other implementations: it achieves ×3.33 compared to a CPU implementation on an Intel Xeon E3-1220, and ×21.09 compared to an FPGA implementation of AES on the Xilinx Virtex 7 series, which does not include full GCM.
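The abstract does not spell out how GHASH is parallelized, so the following is only a minimal sketch of one well-known reason it can be, not the authors' method. GHASH is a polynomial evaluation in GF(2^128): for blocks X_1..X_m the result is the XOR of X_i · H^(m-i+1). Each chunk of blocks can therefore be hashed independently and then scaled by the power of H that corresponds to the number of blocks following it. The field multiplication follows the bitwise algorithm in NIST SP 800-38D; the function names, the chunk size, and the random test data are assumptions made for illustration, and the loop over chunks stands in for what would be independent GPU threads.

```python
import random

R = 0xE1 << 120   # reduction constant R from NIST SP 800-38D
ONE = 1 << 127    # multiplicative identity in GCM's bit ordering

def gf_mult(x, y):
    """Multiply two 128-bit field elements (blocks read as big-endian ints)."""
    z, v = 0, y
    for i in range(127, -1, -1):      # leftmost bit of x first
        if (x >> i) & 1:
            z ^= v
        v = (v >> 1) ^ R if v & 1 else v >> 1
    return z

def gf_pow(h, n):
    """Compute h**n in the field by repeated multiplication."""
    result = ONE
    for _ in range(n):
        result = gf_mult(result, h)
    return result

def ghash_serial(h, blocks):
    """Reference GHASH: Y_i = (Y_{i-1} XOR X_i) * H, starting from Y_0 = 0."""
    y = 0
    for x in blocks:
        y = gf_mult(y ^ x, h)
    return y

def ghash_chunked(h, blocks, chunk_size):
    """Hash each chunk independently, then combine with powers of H."""
    m = len(blocks)
    total = 0
    for start in range(0, m, chunk_size):
        chunk = blocks[start:start + chunk_size]
        partial = ghash_serial(h, chunk)       # no dependency on other chunks
        remaining = m - (start + len(chunk))   # blocks that follow this chunk
        total ^= gf_mult(partial, gf_pow(h, remaining))
    return total

# Hypothetical data: a random hash key and 37 random 128-bit blocks.
rng = random.Random(0)
h = rng.getrandbits(128)
blocks = [rng.getrandbits(128) for _ in range(37)]
assert ghash_chunked(h, blocks, 8) == ghash_serial(h, blocks)
```

In a real GPU implementation the powers of H would typically be precomputed once per key rather than recomputed per chunk, and the final XOR would be a parallel reduction; both are left out here to keep the sketch short.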
Source journal
ICT Express
CiteScore: 10.20
Self-citation rate: 1.90%
Articles published: 167
Review time: 35 weeks
Journal description: The ICT Express journal, published by the Korean Institute of Communications and Information Sciences (KICS), is an international, peer-reviewed research publication covering all aspects of information and communication technology. The journal aims to publish research that helps advance the theoretical and practical understanding of ICT convergence, platform technologies, communication networks, and device technologies. Technological advances in the information and communication technology (ICT) sector enable portable devices to be always connected while supporting high data rates, resulting in the recent popularity of smartphones, which have a considerable impact on economic and social development.