重新审视霍夫曼编码:在现代GPU架构上实现极致性能

2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date : 2020-10-20 DOI:10.1109/IPDPS49936.2021.00097

Jiannan Tian, Cody Rivera, S. Di, Jieyang Chen, Xin Liang, Dingwen Tao, F. Cappello

{"title":"重新审视霍夫曼编码:在现代GPU架构上实现极致性能","authors":"Jiannan Tian, Cody Rivera, S. Di, Jieyang Chen, Xin Liang, Dingwen Tao, F. Cappello","doi":"10.1109/IPDPS49936.2021.00097","DOIUrl":null,"url":null,"abstract":"Today’s high-performance computing (HPC) applications are producing vast volumes of data, which are challenging to store and transfer efficiently during the execution, such that data compression is becoming a critical technique to mitigate the storage burden and data movement cost. Huffman coding is arguably the most efficient Entropy coding algorithm in information theory, such that it could be found as a fundamental step in many modern compression algorithms such as DEFLATE. On the other hand, today’s HPC applications are more and more relying on the accelerators such as GPU on supercomputers, while Huffman encoding suffers from low throughput on GPUs, resulting in a significant bottleneck in the entire data processing. In this paper, we propose and implement an efficient Huffman encoding approach based on modern GPU architectures, which addresses two key challenges: (1) how to parallelize the entire Huffman encoding algorithm, including codebook construction, and (2) how to fully utilize the high memory-bandwidth feature of modern GPU architectures. The detailed contribution is fourfold. (1) We develop an efficient parallel codebook construction on GPUs that scales effectively with the number of input symbols. (2) We propose a novel reduction based encoding scheme that can efficiently merge the codewords on GPUs. (3) We optimize the overall GPU performance by leveraging the state-of-the-art CUDA APIs such as Cooperative Groups. (4) We evaluate our Huffman encoder thoroughly using six real-world application datasets on two advanced GPUs and compare with our implemented multithreaded Huffman encoder. Experiments show that our solution can improve the encoding throughput by up to 5.0× and 6.8× on NVIDIA RTX 5000 and V100, respectively, over the state-of-the-art GPU Huffman encoder, and by up to 3.3× over the multithread encoder on two 28-core Xeon Platinum 8280 CPUs.","PeriodicalId":372234,"journal":{"name":"2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"81 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-10-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"20","resultStr":"{\"title\":\"Revisiting Huffman Coding: Toward Extreme Performance on Modern GPU Architectures\",\"authors\":\"Jiannan Tian, Cody Rivera, S. Di, Jieyang Chen, Xin Liang, Dingwen Tao, F. Cappello\",\"doi\":\"10.1109/IPDPS49936.2021.00097\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Today’s high-performance computing (HPC) applications are producing vast volumes of data, which are challenging to store and transfer efficiently during the execution, such that data compression is becoming a critical technique to mitigate the storage burden and data movement cost. Huffman coding is arguably the most efficient Entropy coding algorithm in information theory, such that it could be found as a fundamental step in many modern compression algorithms such as DEFLATE. On the other hand, today’s HPC applications are more and more relying on the accelerators such as GPU on supercomputers, while Huffman encoding suffers from low throughput on GPUs, resulting in a significant bottleneck in the entire data processing. In this paper, we propose and implement an efficient Huffman encoding approach based on modern GPU architectures, which addresses two key challenges: (1) how to parallelize the entire Huffman encoding algorithm, including codebook construction, and (2) how to fully utilize the high memory-bandwidth feature of modern GPU architectures. The detailed contribution is fourfold. (1) We develop an efficient parallel codebook construction on GPUs that scales effectively with the number of input symbols. (2) We propose a novel reduction based encoding scheme that can efficiently merge the codewords on GPUs. (3) We optimize the overall GPU performance by leveraging the state-of-the-art CUDA APIs such as Cooperative Groups. (4) We evaluate our Huffman encoder thoroughly using six real-world application datasets on two advanced GPUs and compare with our implemented multithreaded Huffman encoder. Experiments show that our solution can improve the encoding throughput by up to 5.0× and 6.8× on NVIDIA RTX 5000 and V100, respectively, over the state-of-the-art GPU Huffman encoder, and by up to 3.3× over the multithread encoder on two 28-core Xeon Platinum 8280 CPUs.\",\"PeriodicalId\":372234,\"journal\":{\"name\":\"2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS)\",\"volume\":\"81 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2020-10-20\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"20\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/IPDPS49936.2021.00097\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/IPDPS49936.2021.00097","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 20

摘要

当今的高性能计算(HPC)应用程序正在产生大量数据，这些数据在执行过程中难以有效地存储和传输，因此数据压缩成为减轻存储负担和数据移动成本的关键技术。霍夫曼编码可以说是信息论中最有效的熵编码算法，因此它可以作为许多现代压缩算法(如DEFLATE)的基本步骤。另一方面，当今的高性能计算应用越来越依赖于超级计算机上的GPU等加速器，而霍夫曼编码在GPU上的吞吐量很低，导致整个数据处理出现了明显的瓶颈。在本文中，我们提出并实现了一种基于现代GPU架构的高效霍夫曼编码方法，该方法解决了两个关键挑战:(1)如何并行化整个霍夫曼编码算法，包括码本构造;(2)如何充分利用现代GPU架构的高内存带宽特性。具体的贡献有四倍。(1)我们在gpu上开发了一种高效的并行码本结构，该结构可以随着输入符号的数量有效地扩展。(2)提出了一种新的基于约简的编码方案，可以有效地合并gpu上的码字。(3)我们通过利用最先进的CUDA api(如Cooperative Groups)来优化GPU的整体性能。(4)我们在两个高级gpu上使用六个实际应用数据集彻底评估了我们的霍夫曼编码器，并与我们实现的多线程霍夫曼编码器进行了比较。实验表明，该方案在NVIDIA RTX 5000和V100上的编码吞吐量分别比最先进的GPU霍夫曼编码器提高了5.0倍和6.8倍，在两个28核Xeon Platinum 8280 cpu上的多线程编码器提高了3.3倍。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Revisiting Huffman Coding: Toward Extreme Performance on Modern GPU Architectures

Today’s high-performance computing (HPC) applications are producing vast volumes of data, which are challenging to store and transfer efficiently during the execution, such that data compression is becoming a critical technique to mitigate the storage burden and data movement cost. Huffman coding is arguably the most efficient Entropy coding algorithm in information theory, such that it could be found as a fundamental step in many modern compression algorithms such as DEFLATE. On the other hand, today’s HPC applications are more and more relying on the accelerators such as GPU on supercomputers, while Huffman encoding suffers from low throughput on GPUs, resulting in a significant bottleneck in the entire data processing. In this paper, we propose and implement an efficient Huffman encoding approach based on modern GPU architectures, which addresses two key challenges: (1) how to parallelize the entire Huffman encoding algorithm, including codebook construction, and (2) how to fully utilize the high memory-bandwidth feature of modern GPU architectures. The detailed contribution is fourfold. (1) We develop an efficient parallel codebook construction on GPUs that scales effectively with the number of input symbols. (2) We propose a novel reduction based encoding scheme that can efficiently merge the codewords on GPUs. (3) We optimize the overall GPU performance by leveraging the state-of-the-art CUDA APIs such as Cooperative Groups. (4) We evaluate our Huffman encoder thoroughly using six real-world application datasets on two advanced GPUs and compare with our implemented multithreaded Huffman encoder. Experiments show that our solution can improve the encoding throughput by up to 5.0× and 6.8× on NVIDIA RTX 5000 and V100, respectively, over the state-of-the-art GPU Huffman encoder, and by up to 3.3× over the multithread encoder on two 28-core Xeon Platinum 8280 CPUs.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS)

自引率

0.00%

发文量