GPULZ: Optimizing LZSS Lossless Compression for Multi-byte Data on Modern GPUs

Proceedings of the 37th International Conference on Supercomputing | Pub Date: 2023-04-14 | DOI: 10.1145/3577193.3593706

Bo Zhang, Jiannan Tian, S. Di, Xiaodong Yu, M. Swany, Dingwen Tao, F. Cappello


Citations: 1

Abstract



Today's graphics processing unit (GPU) applications produce vast volumes of data, which are challenging to store and transfer efficiently. Data compression is therefore becoming a critical technique for reducing storage burden and communication cost. LZSS is the core algorithm in many widely used compressors, such as Deflate. However, existing GPU-based LZSS compressors suffer from low throughput because the LZSS algorithm is inherently sequential. Moreover, many GPU applications produce multi-byte data (e.g., int16/int32 indices, floating-point numbers), while current LZSS compressors take only single-byte data as input. To this end, we propose gpuLZ, a highly efficient LZSS compressor for multi-byte data on modern GPUs. The contribution of our work is fourfold. First, we perform an in-depth analysis of existing LZ compressors for GPUs and investigate their main issues. Second, we propose two main algorithm-level optimizations: we (1) change the prefix sum from one pass to two passes and fuse multiple kernels to reduce data movement between shared memory and global memory, and (2) optimize the existing pattern-matching approach for multi-byte symbols to reduce computational complexity and explore longer repeated patterns. Third, we perform architectural performance optimizations, such as maximizing shared memory utilization by adapting data partitions to different GPU architectures. Finally, we evaluate gpuLZ on six datasets of various types with NVIDIA A100 and A4000 GPUs. Results show that gpuLZ achieves up to 272.1× speedup on A4000 and up to 1.4× higher compression ratio compared to state-of-the-art solutions.
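To make the two algorithm-level ideas in the abstract concrete, the following is a minimal sequential sketch in Python, not the authors' GPU implementation. It assumes fixed-width symbols (4 bytes here), a greedy sliding-window matcher, and an ad hoc token format; the function names `to_symbols`, `lzss_encode`, `lzss_decode`, and `chunk_offsets` are hypothetical and chosen only for illustration. The exclusive prefix sum shows the role a scan plays when independently compressed chunks are packed into one output buffer; the paper's two-pass, kernel-fused variant is not reproduced here.

```python
import struct
from itertools import accumulate


def to_symbols(data: bytes, width: int) -> list:
    """Split bytes into fixed-width symbols (e.g. width=4 for int32/float32)."""
    if len(data) % width:
        raise ValueError("input length must be a multiple of the symbol width")
    return [data[i:i + width] for i in range(0, len(data), width)]


def lzss_encode(symbols, window=32, min_match=2, max_match=255):
    """Greedy LZSS over symbols: ('lit', symbol) or ('ref', offset, length)."""
    tokens, pos = [], 0
    while pos < len(symbols):
        best_len, best_off = 0, 0
        # Search the sliding window for the longest match, comparing whole
        # multi-byte symbols rather than individual bytes.
        for start in range(max(0, pos - window), pos):
            length = 0
            while (length < max_match and pos + length < len(symbols)
                   and symbols[start + length] == symbols[pos + length]):
                length += 1
            if length > best_len:
                best_len, best_off = length, pos - start
        if best_len >= min_match:
            tokens.append(("ref", best_off, best_len))
            pos += best_len
        else:
            tokens.append(("lit", symbols[pos]))
            pos += 1
    return tokens


def lzss_decode(tokens):
    """Invert lzss_encode; copies one symbol at a time so overlaps work."""
    out = []
    for tok in tokens:
        if tok[0] == "lit":
            out.append(tok[1])
        else:
            _, offset, length = tok
            for _ in range(length):
                out.append(out[-offset])
    return out


def chunk_offsets(sizes):
    """Exclusive prefix sum: start offset of each chunk in a packed buffer."""
    return list(accumulate(sizes, initial=0))[:-1]


# Usage: ten int32 values with two repeated runs.
values = [7, 7, 7, 7, 1, 2, 1, 2, 1, 2]
symbols = to_symbols(struct.pack("<10i", *values), 4)
tokens = lzss_encode(symbols)
restored = b"".join(lzss_decode(tokens))
```

Matching on whole symbols means a window of a given length spans more original values than a byte-level window of the same length, which is one plausible reading of how multi-byte matching can find longer repeated patterns with fewer comparisons.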

