The FastLanes Compression Layout: Decoding >100 Billion Integers per Second with Scalar Code

Proc. VLDB Endow. Pub Date: 2023-05-01 DOI: 10.14778/3598581.3598587

Azim Afroozeh, P. Boncz

Pages: 2132-2144

Citations: 2

Abstract

The open-source FastLanes project aims to improve big data formats, such as Parquet, ORC and columnar database formats, in multiple ways. In this paper, we significantly accelerate decoding of all common Light-Weight Compression (LWC) schemes (DICT, FOR, DELTA and RLE) through better data-parallelism. We do so by redesigning the compression layout using two main ideas: (i) generalizing the value interleaving technique in the basic operation of bit-(un)packing by targeting a virtual 1024-bit SIMD register; and (ii) reordering the tuples in all columns of a table in the same Unified Transposed Layout, which puts tuple chunks in a common "04261537" order (explained in the paper) and allows for maximum independent work for all possible basic SIMD lane widths: 8, 16, 32, and 64 bits. We address the software development, maintenance and future-proofing challenges of increasing hardware diversity by defining a virtual 1024-bit instruction set that consists of simple operators supported by all SIMD dialects and also, importantly, by scalar code. The interleaved and tuple-reordered layout actually makes scalar decoding faster, extracting more data-parallelism from today's wide-issue CPUs. Importantly, the scalar version can be fully auto-vectorized by modern compilers, eliminating technical debt in software caused by platform-specific SIMD intrinsics. Micro-benchmarks on Intel, AMD, Apple and AWS CPUs show that FastLanes accelerates decoding by factors (decoding >40 values per CPU cycle). FastLanes can make queries faster, as compressing the data reduces bandwidth needs, while decoding is almost free.
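To make the value-interleaving idea concrete, the following is a minimal scalar sketch of interleaved bit-(un)packing, not the authors' implementation. It assumes a 32-bit lane width, so a virtual 1024-bit register holds 32 lanes, and it processes blocks of 1024 values. The names `kLanes`, `kValues`, `kRows`, `lane_mask`, `pack_interleaved` and `unpack_interleaved` are hypothetical, and the exact memory layout is an assumption chosen for illustration; the "04261537" tuple reordering is not modelled here.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed geometry: a virtual 1024-bit register split into 32-bit lanes.
constexpr std::size_t kLanes = 32;                 // 1024 bits / 32 bits per lane
constexpr std::size_t kValues = 1024;              // values per block
constexpr std::size_t kRows = kValues / kLanes;    // values stored in each lane

// Mask selecting the low `width` bits (width must be in 1..32).
inline std::uint32_t lane_mask(unsigned width) {
    return width == 32 ? 0xFFFFFFFFu : ((1u << width) - 1u);
}

// Pack 1024 values of `width` bits each. Value i goes to lane i % 32, so
// consecutive values land in different lanes; word j of every lane is stored
// contiguously at out[j * kLanes + lane].
std::vector<std::uint32_t> pack_interleaved(const std::vector<std::uint32_t>& in,
                                            unsigned width) {
    std::vector<std::uint32_t> out(width * kLanes, 0u);
    const std::uint32_t mask = lane_mask(width);
    for (std::size_t row = 0; row < kRows; ++row) {
        const std::size_t bit = row * width;       // bit offset inside each lane
        const std::size_t word = bit / 32;
        const unsigned shift = static_cast<unsigned>(bit % 32);
        for (std::size_t lane = 0; lane < kLanes; ++lane) {
            const std::uint32_t v = in[row * kLanes + lane] & mask;
            out[word * kLanes + lane] |= v << shift;
            if (shift + width > 32) {              // value straddles two words
                out[(word + 1) * kLanes + lane] |= v >> (32 - shift);
            }
        }
    }
    return out;
}

// Unpack a block produced by pack_interleaved. The inner loop applies the same
// shift and mask to every lane with no dependency between lanes, which is the
// shape a compiler can auto-vectorize.
std::vector<std::uint32_t> unpack_interleaved(const std::vector<std::uint32_t>& packed,
                                              unsigned width) {
    std::vector<std::uint32_t> out(kValues);
    const std::uint32_t mask = lane_mask(width);
    for (std::size_t row = 0; row < kRows; ++row) {
        const std::size_t bit = row * width;
        const std::size_t word = bit / 32;
        const unsigned shift = static_cast<unsigned>(bit % 32);
        for (std::size_t lane = 0; lane < kLanes; ++lane) {
            std::uint32_t v = packed[word * kLanes + lane] >> shift;
            if (shift + width > 32) {              // pull the high bits from the next word
                v |= packed[(word + 1) * kLanes + lane] << (32 - shift);
            }
            out[row * kLanes + lane] = v & mask;
        }
    }
    return out;
}
```

In this sketch the shift amount and word index depend only on the row, never on the lane, so all 32 lanes do identical work on adjacent memory. That property is what lets plain scalar loops map onto SIMD registers of any width without platform-specific intrinsics.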


